Title: | Competitive Balances for Taxonomic Enrichment Analysis in R |
---|---|
Description: | This package implements CBEA, a method to perform set-based analysis for microbiome relative abundance data. This approach constructs a competitive balance between taxa within the set and remainder taxa per sample. More details can be found in the Nguyen et al. 2021+ manuscript. Additionally, this package adds support functions to help users perform taxa-set enrichment analyses using existing gene set analysis methods. In the future we hope to also provide curated knowledge driven taxa sets. |
Authors: | Quang Nguyen [aut, cre] |
Maintainer: | Quang Nguyen <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.7.0 |
Built: | 2024-11-19 03:29:28 UTC |
Source: | https://github.com/bioc/CBEA |
cbea
is used compute enrichment scores
per sample for pre-defined sets using the CBEA
(Competitive Balances for Enrichment Analysis).
cbea( obj, set, output, distr = NULL, adj = FALSE, n_perm = 100, parametric = TRUE, thresh = 0.05, init = NULL, control = NULL, parallel_backend = NULL, ... ) ## S4 method for signature 'TreeSummarizedExperiment' cbea( obj, set, output, distr = NULL, abund_values, adj = FALSE, n_perm = 100, parametric = TRUE, thresh = 0.05, init = NULL, control = NULL, parallel_backend = NULL, ... ) ## S4 method for signature 'data.frame' cbea( obj, set, taxa_are_rows = FALSE, id_col = NULL, output, distr = NULL, adj = FALSE, n_perm = 100, parametric = TRUE, thresh = 0.05, init = NULL, control = NULL, parallel_backend = NULL, ... ) ## S4 method for signature 'matrix' cbea( obj, set, taxa_are_rows = FALSE, output, distr = NULL, adj = FALSE, n_perm = 100, parametric = TRUE, thresh = 0.05, init = NULL, control = NULL, parallel_backend = NULL, ... )
cbea( obj, set, output, distr = NULL, adj = FALSE, n_perm = 100, parametric = TRUE, thresh = 0.05, init = NULL, control = NULL, parallel_backend = NULL, ... ) ## S4 method for signature 'TreeSummarizedExperiment' cbea( obj, set, output, distr = NULL, abund_values, adj = FALSE, n_perm = 100, parametric = TRUE, thresh = 0.05, init = NULL, control = NULL, parallel_backend = NULL, ... ) ## S4 method for signature 'data.frame' cbea( obj, set, taxa_are_rows = FALSE, id_col = NULL, output, distr = NULL, adj = FALSE, n_perm = 100, parametric = TRUE, thresh = 0.05, init = NULL, control = NULL, parallel_backend = NULL, ... ) ## S4 method for signature 'matrix' cbea( obj, set, taxa_are_rows = FALSE, output, distr = NULL, adj = FALSE, n_perm = 100, parametric = TRUE, thresh = 0.05, init = NULL, control = NULL, parallel_backend = NULL, ... )
obj |
The element of class |
set |
|
output |
(String). The form of the output of the model.
Has to be either |
distr |
(String). The choice of distribution for the null. Can be either |
adj |
(Logical). Whether correlation adjustment procedure is utilized. Defaults to FALSE. |
n_perm |
(Numeric). Add bootstrap resamples to both the permuted and unpermuted data set. This might help with stabilizing the distribution fitting procedure, especially if the sample size is low. Defaults to 1. |
parametric |
(Logical). Indicate whether a parametric distribution will be fitted to estimate
z-scores, CDF values, and p-values. Defaults to |
thresh |
(Numeric). Threshold for significant p-values if |
init |
(Named List). Initialization parameters for estimating the null distribution. Default is NULL. |
control |
(Named List). Additional arguments to be passed to
|
parallel_backend |
See documentation |
... |
Additional arguments not used at the moment. |
abund_values |
(Character). Character value for selecting the |
taxa_are_rows |
(Logical). Indicate whether the data frame or matrix has taxa as rows |
id_col |
(Character Vector). Vector of character to indicate metadata
columns to keep (for example, |
This function support different formats of the OTU table, however
for best results please use TreeSummarizedExperiment
.
phyloseq
is supported, however CBEA
will not explicitly import
phyloseq
package and will require users to install them separately.
If use data.frame
or matrix
, users should specify whether taxa are rows using the
taxa_are_rows
option. Additionally, for data.frame
, users can specify
metadata columns to be kept via the id_col
argument.
The output
argument specifies what type of values will be returned in the final matrix.
The options pval
or sig
returns either unadjusted p-values or dummy variables
indicating whether a set is significantly enriched in that sample (based on unadjusted
p-values thresholded at thresh
).
The option raw
returns raw scores computed for each set without any distribution fitting
or inference procedure. Users can use this option to examine the distribution of CBEA scores
under the null.
R
An n
by m
matrix of enrichment scores at
the sample level
data(hmp_gingival) seq <- hmp_gingival$data set <- hmp_gingival$set # n_perm = 10 to reduce runtime mod <- cbea(obj = seq, set = set, output = "zscore", abund_values = "16SrRNA", distr = "norm", parametric = TRUE, adj = TRUE, thresh = 0.05, n_perm = 10)
data(hmp_gingival) seq <- hmp_gingival$data set <- hmp_gingival$set # n_perm = 10 to reduce runtime mod <- cbea(obj = seq, set = set, output = "zscore", abund_values = "16SrRNA", distr = "norm", parametric = TRUE, adj = TRUE, thresh = 0.05, n_perm = 10)
Get CBEA scores for a given matrix and a vector of column indices
get_raw_score(X, idx)
get_raw_score(X, idx)
X |
(Matrix). OTU table of matrix format where taxa are columns and samples are rows |
idx |
(Integer vector). Vector of integers indicating the column ids of taxa in a set |
A matrix of size n
by 1
where n
is the total number of samples
data(hmp_gingival) seq <- hmp_gingival$data seq_matrix <- SummarizedExperiment::assays(seq)[[1]] seq_matrix <- t(seq_matrix) + 1 rand_set <- sample(seq_len(ncol(seq_matrix)), size = 10) scores <- get_raw_score(X = seq_matrix, idx = rand_set)
data(hmp_gingival) seq <- hmp_gingival$data seq_matrix <- SummarizedExperiment::assays(seq)[[1]] seq_matrix <- t(seq_matrix) + 1 rand_set <- sample(seq_len(ncol(seq_matrix)), size = 10) scores <- get_raw_score(X = seq_matrix, idx = rand_set)
CBEAout
objectThis function cleans up all diagnostics of the cbea
method
(from the CBEAout
object) into a nice tibble::tibble()
## S3 method for class 'CBEAout' glance(x, statistic, ...)
## S3 method for class 'CBEAout' glance(x, statistic, ...)
x |
An object of type |
statistic |
What type of diagnostic to return. Users can choose to return
|
... |
Unused, kept for consistency with generics |
A tibble::tibble()
summarizing diagnostic fits per set (as row)
# load the data data(hmp_gingival) mod <- cbea(hmp_gingival$data, hmp_gingival$set, abund_values = "16SrRNA", output = "sig", distr = "norm", adj = FALSE, n_perm = 5, parametric = TRUE) glance(mod, "fit_diagnostic")
# load the data data(hmp_gingival) mod <- cbea(hmp_gingival$data, hmp_gingival$set, abund_values = "16SrRNA", output = "sig", distr = "norm", adj = FALSE, n_perm = 5, parametric = TRUE) glance(mod, "fit_diagnostic")
Compute geometric mean of a vector
using exp(mean(log(.x)))
format
gmean(vec)
gmean(vec)
vec |
A vector of values with length |
A numeric value of the
geometric mean of the vector vec
ex <- abs(rnorm(10)) gmean(ex)
ex <- abs(rnorm(10)) gmean(ex)
This function computes the geometric mean by row of a numeric matrix
gmeanRow(X)
gmeanRow(X)
X |
A numeric matrix with |
A numeric vector of the geometric
mean of the matrix X
with length n
ex <- matrix(rnorm(100), nrow = 10, ncol = 10) ex <- abs(ex) gmeanRow(ex)
ex <- matrix(rnorm(100), nrow = 10, ncol = 10) ex <- abs(ex) gmeanRow(ex)
Gingival data set from the Human Microbiome Project
data(hmp_gingival)
data(hmp_gingival)
A list with two elements
The microbiome relative abundance data with relevant metadata
obtained from the Human Microbiome Project via the HMP16SData
package (snapshot: 11-15-2021).
The data set is hosted the container of type
phyloseq
. Using the
mia
package users can convert it
to the TreeSummarizedExperiment
type.
Sets of microbes based on their metabolism annotation at the Genera level. Annotations obtained via Calagaro et al.'s repository on Zenodo (https://doi.org/10.5281/zenodo.3942108)
Data can be downloaded directly from https://hmpdacc.org/hmp/
R interface of the data from https://doi.org/doi:10.18129/B9.bioc.HMP16SData
Beghini F, Renson A, Zolnik CP, Geistlinger L, Usyk M, Moody TU, et al. Tobacco Exposure Associated with Oral Microbiota Oxygen Utilization in the New York City Health and Nutrition Examination Study. Annals of Epidemiology. 2019;34:18–25.e3. doi:10.1016/j.annepidem.2019.03.005
Consortium THMP, Huttenhower C, Gevers D, Knight R, Abubucker S, Badger JH, et al. Structure, Function and Diversity of the Healthy Human Microbiome. Nature. 2012;486(7402):207–214. doi:10.1038/nature11234.
Calgaro M, Romualdi C, Waldron L, Risso D, Vitulo N. Assessment of Statistical Methods from Single Cell, Bulk RNA-Seq, and Metagenomics Applied to Microbiome Data. Genome Biology. 2020;21(1):191. doi:10.1186/s13059-020-02104-1
Schiffer L, Azhar R, Shepherd L, Ramos M, Geistlinger L, Huttenhower C, et al. HMP16SData: Efficient Access to the Human Microbiome Project through Bioconductor. American Journal of Epidemiology. 2019;doi:10.1093/aje/kwz006.
This function takes a list of lists from each object and turns it into a CBEAout type object
new_CBEAout(out, call)
new_CBEAout(out, call)
out |
A list containing scores for each set |
call |
A list containing all important arguments for printing |
A new CBEAout object (which is a cleaner list of lists)
The Two Component Mixture Normal Distribution
pmnorm(q, mu, sigma, lambda, log = FALSE, verbose = FALSE) dmnorm(x, mu, sigma, lambda, log = FALSE, verbose = FALSE)
pmnorm(q, mu, sigma, lambda, log = FALSE, verbose = FALSE) dmnorm(x, mu, sigma, lambda, log = FALSE, verbose = FALSE)
q , x
|
(Vector). Values to calculate distributional values of. |
mu |
(Vector). A two value vector of mean values. |
sigma |
(Vector). A two value vector of component-wise variances |
lambda |
(Vector). A two value vector of component mixing coefficients |
log |
(Boolean). Whether returning probabilities are in log format |
verbose |
(Boolean). Whether to return component values. |
A numeric value representing the probability density value of a two-component mixture distribution
pmnorm
: Cumulative Distribution Function
dmnorm
: Probability Density Function
library(mixtools) lambda <- c(0.7,0.3) mu <- c(1,2) sigma <- c(1,1) v <- rnormmix(100, lambda=lambda, mu=mu, sigma=sigma) pmnorm(v, lambda=lambda,mu=mu,sigma=sigma) dmnorm(v, lambda=lambda,mu=mu,sigma=sigma)
library(mixtools) lambda <- c(0.7,0.3) mu <- c(1,2) sigma <- c(1,1) v <- rnormmix(100, lambda=lambda, mu=mu, sigma=sigma) pmnorm(v, lambda=lambda,mu=mu,sigma=sigma) dmnorm(v, lambda=lambda,mu=mu,sigma=sigma)
Print dispatch for CBEAout objects
## S3 method for class 'CBEAout' print(x, ...)
## S3 method for class 'CBEAout' print(x, ...)
x |
The |
... |
Undefined arguments, keeping consistency for generics |
Text for printing
This function takes in a CBEA type object and collects all values across all sets and samples that were evaluated.
## S3 method for class 'CBEAout' tidy(x, ...)
## S3 method for class 'CBEAout' tidy(x, ...)
x |
A |
... |
Unused, included for generic consistency only. |
A tidy tibble::tibble()
summarizing scores per sample per set.
# load the data data(hmp_gingival) mod <- cbea(hmp_gingival$data, hmp_gingival$set, abund_values = "16SrRNA", output = "sig", distr = "norm", adj = FALSE, n_perm = 5, parametric = TRUE) tidy(mod)
# load the data data(hmp_gingival) mod <- cbea(hmp_gingival$data, hmp_gingival$set, abund_values = "16SrRNA", output = "sig", distr = "norm", adj = FALSE, n_perm = 5, parametric = TRUE) tidy(mod)