Title: | Identify Contaminants in Marker-gene and Metagenomics Sequencing Data |
---|---|
Description: | Simple statistical identification of contaminating sequence features in marker-gene or metagenomics data. Works on any kind of feature derived from environmental sequencing data (e.g. ASVs, OTUs, taxonomic groups, MAGs, ...). Requires DNA quantitation data or sequenced negative control samples. |
Authors: | Benjamin Callahan [aut, cre], Nicole Marie Davis [aut], Felix G.M. Ernst [ctb] |
Maintainer: | Benjamin Callahan <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.27.0 |
Built: | 2025-01-16 02:52:46 UTC |
Source: | https://github.com/bioc/decontam |
The frequency of each sequence (or OTU) in the input feature table as a function of the concentration of amplified DNA in each sample is used to identify contaminant sequences.
    isContaminant(seqtab, ...)

    ## S4 method for signature 'ANY'
    isContaminant(
      seqtab,
      conc = NULL,
      neg = NULL,
      method = c("auto", "frequency", "prevalence", "combined", "minimum", "either", "both"),
      batch = NULL,
      batch.combine = c("minimum", "product", "fisher"),
      threshold = 0.1,
      normalize = TRUE,
      detailed = TRUE
    )
seqtab |
(Required). |
... |
Not used currently |
conc |
(Optional). |
neg |
(Optional). |
method |
(Optional).
|
batch |
(Optional). |
batch.combine |
(Optional). Default "minimum". For each input sequence variant (or OTU), the probabilities calculated in each batch are combined into a single probability that is compared to 'threshold' to classify contaminants. Valid values: "minimum", "product", "fisher". |
threshold |
(Optional). Default |
normalize |
(Optional). Default TRUE.
If TRUE, the input |
detailed |
(Optional). Default TRUE.
If TRUE, the return value is a |
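The batch.combine options listed above can be illustrated with a small base-R sketch. The helper `combine_batch_scores` and the score values are hypothetical and not part of decontam; whether the package computes "product" and "fisher" in exactly this way is an assumption. The sketch only shows three common ways of collapsing per-batch probabilities into one value.

```r
# Hypothetical helper (not a decontam function): combine the per-batch
# probabilities of a single feature into one probability.
combine_batch_scores <- function(p, how = c("minimum", "product", "fisher")) {
  how <- match.arg(how)
  p <- p[!is.na(p)]                 # drop batches without a usable score
  if (length(p) == 0) return(NA_real_)
  switch(how,
    minimum = min(p),
    product = prod(p),
    # Fisher's method: -2 * sum(log(p)) is chi-squared distributed with
    # 2k degrees of freedom when the k inputs are independent uniforms.
    fisher = pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
  )
}

# Made-up per-batch probabilities for one feature
scores <- c(batch1 = 0.04, batch2 = 0.30, batch3 = 0.12)
combined <- sapply(c("minimum", "product", "fisher"), combine_batch_scores, p = scores)
print(combined)
```

Under this sketch, "minimum" flags a feature as soon as any single batch looks contaminant-like, while "product" and "fisher" accumulate evidence across batches.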
If detailed=TRUE, a data.frame with classification information is returned. If detailed=FALSE, a logical vector is returned, with TRUE indicating contaminants.
    st <- readRDS(system.file("extdata", "st.rds", package="decontam"))
    # conc should be positive and non-zero
    conc <- c(6413, 3581.0, 5375, 4107, 4291, 4260, 4171, 2765, 33, 48)
    neg <- c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)
    # Use frequency or frequency and prevalence to identify contaminants
    isContaminant(st, conc=conc, method="frequency", threshold=0.2)
    isContaminant(st, conc=conc, neg=neg, method="both", threshold=c(0.1,0.5))
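The idea behind the frequency method can be sketched in base R, assuming a contaminant's frequency falls roughly in inverse proportion to DNA concentration while a true sequence's frequency is roughly independent of it. The helper `compare_frequency_models`, the simulated frequencies, and the use of residual sums of squares as the comparison are illustrative assumptions, not the package's implementation.

```r
# Hypothetical helper (not a decontam function): compare two fixed-slope
# models for one feature on the log scale and return their residual sums
# of squares. A smaller value means a better fit.
compare_frequency_models <- function(freq, conc) {
  keep <- freq > 0 & conc > 0       # logs require strictly positive values
  logf <- log(freq[keep])
  logc <- log(conc[keep])
  # Contaminant-like model: log(freq) = a - log(conc), slope fixed at -1
  rss_contam <- sum(residuals(lm(logf ~ offset(-logc)))^2)
  # Non-contaminant-like model: log(freq) = b, slope fixed at 0
  rss_noncontam <- sum(residuals(lm(logf ~ 1))^2)
  c(contaminant = rss_contam, non_contaminant = rss_noncontam)
}

set.seed(1)
conc <- c(6413, 3581, 5375, 4107, 4291, 4260, 4171, 2765, 33, 48)
# Simulated frequencies (made-up): one feature scaling as 1/conc,
# one feature with constant frequency, both with multiplicative noise.
contam_freq <- 3 / conc * exp(rnorm(length(conc), sd = 0.1))
true_freq <- 0.05 * exp(rnorm(length(conc), sd = 0.1))

fit_contam <- compare_frequency_models(contam_freq, conc)
fit_true <- compare_frequency_models(true_freq, conc)
print(rbind(simulated_contaminant = fit_contam, simulated_true = fit_true))
```

In this simulation the inverse-concentration model fits the simulated contaminant far better, and the constant model fits the simulated true sequence far better.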
The prevalence of each sequence (or OTU) in the input feature table across samples and negative controls is used to identify non-contaminant sequences. Note that the null hypothesis here is that sequences **are** contaminants. This function is intended for use on low-biomass samples in which a large proportion of the sequences are likely to be contaminants.
    isNotContaminant(seqtab, ...)

    ## S4 method for signature 'ANY'
    isNotContaminant(
      seqtab,
      neg = NULL,
      method = "prevalence",
      threshold = 0.5,
      normalize = TRUE,
      detailed = FALSE
    )
seqtab |
(Required). Integer matrix. A feature table recording the observed abundances of each sequence (or OTU) in each sample. Rows should correspond to samples, and columns to sequences (or OTUs). |
... |
Not used currently |
neg |
(Required). |
method |
(Optional). Default "prevalence". The method used to test for contaminants. Currently the only supported method is "prevalence": contaminants are identified by increased prevalence in negative controls. |
threshold |
(Optional). Default |
normalize |
(Optional). Default TRUE.
If TRUE, the input |
detailed |
(Optional). Default FALSE.
If TRUE, the return value is a |
If detailed=FALSE, a logical vector is returned, with TRUE indicating non-contaminants. If detailed=TRUE, a data.frame is returned instead.
    st <- readRDS(system.file("extdata", "st.rds", package="decontam"))
    samdf <- readRDS(system.file("extdata", "samdf.rds", package="decontam"))
    isNotContaminant(st, samdf$quant_reading, threshold=0.05)
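The prevalence comparison described for this function can be sketched in base R as a one-sided test on a 2x2 presence/absence table. The helper `prevalence_pvalue`, the made-up data, and the choice of Fisher's exact test are assumptions for illustration; the exact test used by decontam may differ. A small p-value here is evidence against the null hypothesis that the feature is a contaminant, because the feature is more prevalent in true samples than in negative controls.

```r
# Hypothetical helper (not a decontam function): one-sided Fisher's exact
# test of whether a feature is present more often in true samples than in
# negative controls.
prevalence_pvalue <- function(present, is_neg) {
  tab <- matrix(
    c(sum(present & !is_neg), sum(!present & !is_neg),
      sum(present & is_neg),  sum(!present & is_neg)),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("true_sample", "negative_control"),
                    c("present", "absent"))
  )
  # alternative = "greater": odds of presence are higher in true samples
  fisher.test(tab, alternative = "greater")$p.value
}

# Made-up design: 20 true samples followed by 6 negative controls
is_neg <- c(rep(FALSE, 20), rep(TRUE, 6))

# Feature A: present in 18 of 20 true samples and 1 of 6 negatives
present_a <- c(rep(TRUE, 18), rep(FALSE, 2), TRUE, rep(FALSE, 5))
# Feature B: present in 2 of 20 true samples and 5 of 6 negatives
present_b <- c(rep(TRUE, 2), rep(FALSE, 18), rep(TRUE, 5), FALSE)

p_a <- prevalence_pvalue(present_a, is_neg)
p_b <- prevalence_pvalue(present_b, is_neg)
print(c(feature_a = p_a, feature_b = p_b))
```

Feature A yields a small p-value and would be a candidate non-contaminant under this sketch, while feature B, which is concentrated in the negative controls, does not.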
Plots DNA concentration as a function of experimental conditions. This function is intended as a convenient exploration of potential covariation between DNA concentrations and conditions that could influence the community composition, as this could lead to higher rates of false-positive contaminant identifications.
plot_condition(seqtab, condition, conc, batch = NULL, log = FALSE)
seqtab |
(Required). |
condition |
(Required). |
conc |
(Required). |
batch |
(Optional). |
log |
(Optional). |
    # MUC is a phyloseq object, MUC.conc is the vector of sample concentrations
    MUC <- readRDS(system.file("extdata", "MUClite.rds", package="decontam"))
    MUC.conc <- readRDS(system.file("extdata", "MUCconc.rds", package="decontam"))
    plot_condition(MUC, "Habitat", MUC.conc)
    # Plot against random quantitative variable
    plot_condition(MUC, runif(length(MUC.conc)), MUC.conc, log=TRUE)
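The covariation that plot_condition is meant to reveal can also be checked numerically. The following base-R sketch uses simulated concentrations and a made-up two-level condition; the variable names and the choice of a Kruskal-Wallis test are assumptions for illustration and are not something the package performs.

```r
set.seed(42)
# Made-up data: 10 samples per condition, with condition "A" having
# systematically higher DNA concentrations than condition "B".
condition <- factor(rep(c("A", "B"), each = 10))
conc <- c(rlnorm(10, meanlog = 8, sdlog = 0.3),
          rlnorm(10, meanlog = 6, sdlog = 0.3))

# Median concentration per condition
medians <- tapply(conc, condition, median)
print(medians)

# Rank-based test of whether concentration differs between conditions
kt <- kruskal.test(conc ~ condition)
print(kt$p.value)
```

A strong association such as the one simulated here suggests that frequency-based contaminant calls could be confounded by the condition, which is the situation the description above warns about.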
Plots the frequencies of selected sequence features vs. each sample's DNA concentration.
    plot_frequency(
      seqtab,
      taxa,
      conc,
      neg = NULL,
      normalize = TRUE,
      showModels = TRUE,
      log = TRUE,
      facet = TRUE
    )
seqtab |
(Required). |
taxa |
(Required). |
conc |
(Required). |
neg |
(Optional). |
normalize |
(Optional). |
showModels |
(Optional). |
log |
(Optional). |
facet |
(Optional). |
A ggplot2 object. Will be rendered to default device if printed, or can be stored and further modified. See ggsave for additional options.
    # MUC is a phyloseq object, MUC.conc is the vector of sample concentrations
    MUC <- readRDS(system.file("extdata", "MUClite.rds", package="decontam"))
    MUC.conc <- readRDS(system.file("extdata", "MUCconc.rds", package="decontam"))
    plot_frequency(MUC, "Seq1", conc=MUC.conc)
    # The concentration can also be referenced directly as the quant_reading sample variable in MUC
    plot_frequency(MUC, "Seq1", conc="quant_reading")
    plot_frequency(MUC, c("Seq1", "Seq10", "Seq33"), conc="quant_reading", log=FALSE)