Title: | Accurate, high-resolution sample inference from amplicon sequencing data |
---|---|
Description: | The dada2 package infers exact amplicon sequence variants (ASVs) from high-throughput amplicon sequencing data, replacing the coarser and less accurate OTU clustering approach. The dada2 pipeline takes as input demultiplexed fastq files, and outputs the sequence variants and their sample-wise abundances after removing substitution and chimera errors. Taxonomic classification is available via a native implementation of the RDP naive Bayesian classifier, and species-level assignment to 16S rRNA gene fragments by exact matching. |
Authors: | Benjamin Callahan <[email protected]>, Paul McMurdie, Susan Holmes |
Maintainer: | Benjamin Callahan <[email protected]> |
License: | LGPL-2 |
Version: | 1.35.0 |
Built: | 2024-10-30 05:40:27 UTC |
Source: | https://github.com/bioc/dada2 |
The dada2 package is centered around the DADA2 algorithm for accurate high-resolution of sample composition from amplicon sequencing data. The DADA2 algorithm is both more sensitive and more specific than commonly used OTU methods, and resolves amplicon sequence variants (ASVs) that differ by as little as one nucleotide.
The dada2 package also provides a full set of tools for taking raw amplicon sequencing data all the way through to a feature table representing sample composition. Provided facilities include:
Quality filtering (filterAndTrim
, fastqFilter
, fastqPairedFilter
)
Dereplication (derepFastq
)
Learn error rates (learnErrors
)
Sample Inference (dada
)
Chimera Removal (removeBimeraDenovo
, isBimeraDenovo
, isBimeraDenovoTable
)
Merging of Paired Reads (mergePairs
)
Taxonomic Classification (assignTaxonomy
, assignSpecies
)
Benjamin Callahan [email protected]
Paul J McMurdie II [email protected]
Michael Rosen [email protected]
Susan Holmes [email protected]
addSpecies
wraps the assignSpecies
function to assign genus-species
binomials to the input sequences by exact matching against a reference fasta. Those binomials
are then merged with the input taxonomic table with species annotations appended as an
additional column to the input table.
Only species identifications where the genera in the input table and the binomial
classification are consistent are included in the return table.
addSpecies( taxtab, refFasta, allowMultiple = FALSE, tryRC = FALSE, n = 2000, verbose = FALSE )
addSpecies( taxtab, refFasta, allowMultiple = FALSE, tryRC = FALSE, n = 2000, verbose = FALSE )
taxtab |
(Required). A taxonomic table, the output of |
refFasta |
(Required). The path to the reference fasta file, or an R connection. Can be compressed. This reference fasta file should be formatted so that the id lines correspond to the genus-species binomial of the associated sequence: >SeqID genus species ACGAATGTGAAGTAA...... |
allowMultiple |
(Optional). Default FALSE. Defines the behavior when multiple exact matches against different species are returned. By default only unambiguous identifications are return. If TRUE, a concatenated string of all exactly matched species is returned. If an integer is provided, multiple identifications up to that many are returned as a concatenated string. |
tryRC |
(Optional). Default FALSE. If TRUE, the reverse-complement of each sequences will be used for classification if it is a better match to the reference sequences than the forward sequence. |
n |
(Optional). Default |
verbose |
(Optional). Default FALSE. If TRUE, print status to standard output. |
A character matrix one column larger than input. Rows correspond to sequences, and columns to the taxonomic levels. NA indicates that the sequence was not classified at that level.
seqs <- getSequences(system.file("extdata", "example_seqs.fa", package="dada2")) training_fasta <- system.file("extdata", "example_train_set.fa.gz", package="dada2") taxa <- assignTaxonomy(seqs, training_fasta) species_fasta <- system.file("extdata", "example_species_assignment.fa.gz", package="dada2") taxa.spec <- addSpecies(taxa, species_fasta) taxa.spec.multi <- addSpecies(taxa, species_fasta, allowMultiple=TRUE)
seqs <- getSequences(system.file("extdata", "example_seqs.fa", package="dada2")) training_fasta <- system.file("extdata", "example_train_set.fa.gz", package="dada2") taxa <- assignTaxonomy(seqs, training_fasta) species_fasta <- system.file("extdata", "example_species_assignment.fa.gz", package="dada2") taxa.spec <- addSpecies(taxa, species_fasta) taxa.spec.multi <- addSpecies(taxa, species_fasta, allowMultiple=TRUE)
assignSpecies
uses exact matching against a reference fasta to identify the
genus-species binomial classification of the input sequences.
assignSpecies( seqs, refFasta, allowMultiple = FALSE, tryRC = FALSE, n = 2000, verbose = FALSE )
assignSpecies( seqs, refFasta, allowMultiple = FALSE, tryRC = FALSE, n = 2000, verbose = FALSE )
seqs |
(Required). A character vector of the sequences to be assigned, or an object
coercible by |
refFasta |
(Required). The path to the reference fasta file, or an R connection. Can be compressed. This reference fasta file should be formatted so that the id lines correspond to the genus-species of the associated sequence: >SeqID genus species ACGAATGTGAAGTAA...... |
allowMultiple |
(Optional). Default FALSE. Defines the behavior when multiple exact matches against different species are returned. By default only unambiguous identifications are return. If TRUE, a concatenated string of all exactly matched species is returned. If an integer is provided, multiple identifications up to that many are returned as a concatenated string. |
tryRC |
(Optional). Default FALSE. If TRUE, the reverse-complement of each sequences will also be tested for exact matching to the reference sequences. |
n |
(Optional). Default |
verbose |
(Optional). Default FALSE. If TRUE, print status to standard output. |
A two-column character matrix. Rows correspond to the provided sequences, columns to the genus and species taxonomic levels. NA indicates that the sequence was not classified at that level.
seqs <- getSequences(system.file("extdata", "example_seqs.fa", package="dada2")) species_fasta <- system.file("extdata", "example_species_assignment.fa.gz", package="dada2") spec <- assignSpecies(seqs, species_fasta)
seqs <- getSequences(system.file("extdata", "example_seqs.fa", package="dada2")) species_fasta <- system.file("extdata", "example_species_assignment.fa.gz", package="dada2") spec <- assignSpecies(seqs, species_fasta)
assignTaxonomy implements the RDP Naive Bayesian Classifier algorithm described in Wang et al. Applied and Environmental Microbiology 2007, with kmer size 8 and 100 bootstrap replicates. Properly formatted reference files for several popular taxonomic databases are available http://benjjneb.github.io/dada2/training.html
assignTaxonomy( seqs, refFasta, minBoot = 50, tryRC = FALSE, outputBootstraps = FALSE, taxLevels = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), multithread = FALSE, verbose = FALSE )
assignTaxonomy( seqs, refFasta, minBoot = 50, tryRC = FALSE, outputBootstraps = FALSE, taxLevels = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), multithread = FALSE, verbose = FALSE )
seqs |
(Required). A character vector of the sequences to be assigned, or an object
coercible by |
refFasta |
(Required). The path to the reference fasta file, or an R connection Can be compressed. This reference fasta file should be formatted so that the id lines correspond to the taxonomy (or classification) of the associated sequence, and each taxonomic level is separated by a semicolon. Eg. >Kingom;Phylum;Class;Order;Family;Genus; ACGAATGTGAAGTAA...... |
minBoot |
(Optional). Default 50. The minimum bootstrap confidence for assigning a taxonomic level. |
tryRC |
(Optional). Default FALSE. If TRUE, the reverse-complement of each sequences will be used for classification if it is a better match to the reference sequences than the forward sequence. |
outputBootstraps |
(Optional). Default FALSE. If TRUE, bootstrap values will be retained in an integer matrix. A named list containing the assigned taxonomies (named "taxa") and the bootstrap values (named "boot") will be returned. Minimum bootstrap confidence filtering still takes place, to see full taxonomy set minBoot=0 |
taxLevels |
(Optional). Default is c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"). The taxonomic levels being assigned. Truncates if deeper levels not present in training fasta. |
multithread |
(Optional). Default is FALSE.
If TRUE, multithreading is enabled and the number of available threads is automatically determined.
If an integer is provided, the number of threads to use is set by passing the argument on to
|
verbose |
(Optional). Default FALSE. If TRUE, print status to standard output. |
A character matrix of assigned taxonomies exceeding the minBoot level of bootstrapping confidence. Rows correspond to the provided sequences, columns to the taxonomic levels. NA indicates that the sequence was not consistently classified at that level at the minBoot threshhold.
If outputBootstraps is TRUE, a named list containing the assigned taxonomies (named "taxa") and the bootstrap values (named "boot") will be returned.
seqs <- getSequences(system.file("extdata", "example_seqs.fa", package="dada2")) training_fasta <- system.file("extdata", "example_train_set.fa.gz", package="dada2") taxa <- assignTaxonomy(seqs, training_fasta) taxa80 <- assignTaxonomy(seqs, training_fasta, minBoot=80, multithread=2)
seqs <- getSequences(system.file("extdata", "example_seqs.fa", package="dada2")) training_fasta <- system.file("extdata", "example_train_set.fa.gz", package="dada2") taxa <- assignTaxonomy(seqs, training_fasta) taxa80 <- assignTaxonomy(seqs, training_fasta, minBoot=80, multithread=2)
Change concatenation of dada-class objects to list construction.
## S4 method for signature 'dada' c(x, ..., recursive = FALSE)
## S4 method for signature 'dada' c(x, ..., recursive = FALSE)
x |
A dada-class object |
... |
objects to be concatenated. |
recursive |
logical. If |
list.
Change concatenation of derep-class objects to list construction.
## S4 method for signature 'derep' c(x, ..., recursive = FALSE)
## S4 method for signature 'derep' c(x, ..., recursive = FALSE)
x |
A derep-class object |
... |
objects to be concatenated. |
recursive |
logical. If |
list.
This function takes as input a sequence table and returns a sequence table in which any sequences that are identical up to shifts or length variation, i.e. that have no mismatches or internal indels when aligned, are collapsed together. The most abundant sequence is chosen as the representative of the collapsed sequences. This function can be thought of as implementing greedy 100% OTU clustering with end-gapping ignored.
collapseNoMismatch( seqtab, minOverlap = 20, orderBy = "abundance", identicalOnly = FALSE, vec = TRUE, band = -1, verbose = FALSE )
collapseNoMismatch( seqtab, minOverlap = 20, orderBy = "abundance", identicalOnly = FALSE, vec = TRUE, band = -1, verbose = FALSE )
seqtab |
(Required). A sample by sequence matrix, the return of |
minOverlap |
(Optional). |
orderBy |
(Optional). |
identicalOnly |
(Optional). |
vec |
(Optional). |
band |
(Optional). |
verbose |
(Optional). |
Named integer matrix. A row for each sample, and a column for each collapsed sequence across all the samples. Note that the columns are named by the sequence which can make display a little unwieldy. Columns are in the same order (modulo the removed columns) as in the input matrix.
derep1 <- derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) derep2 <- derepFastq(system.file("extdata", "sam2F.fastq.gz", package="dada2")) dada1 <- dada(derep1, tperr1) dada2 <- dada(derep2, tperr1) seqtab <- makeSequenceTable(list(sample1=dada1, sample2=dada2)) collapseNoMismatch(seqtab)
derep1 <- derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) derep2 <- derepFastq(system.file("extdata", "sam2F.fastq.gz", package="dada2")) dada1 <- dada(derep1, tperr1) dada2 <- dada(derep2, tperr1) seqtab <- makeSequenceTable(list(sample1=dada1, sample2=dada2)) collapseNoMismatch(seqtab)
The dada function takes as input dereplicated amplicon sequencing reads and returns the inferred composition of the sample (or samples). Put another way, dada removes all sequencing errors to reveal the members of the sequenced community.
If dada is run in selfConsist=TRUE mode, the algorithm will infer both the sample composition and the parameters of its error model from the data.
dada( derep, err, errorEstimationFunction = loessErrfun, selfConsist = FALSE, pool = FALSE, priors = character(0), multithread = FALSE, verbose = TRUE, ... )
dada( derep, err, errorEstimationFunction = loessErrfun, selfConsist = FALSE, pool = FALSE, priors = character(0), multithread = FALSE, verbose = TRUE, ... )
derep |
(Required). |
err |
(Required). 16xN numeric matrix, or an object coercible by The matrix of estimated rates for each possible nucleotide transition (from sample nucleotide to read nucleotide). Rows correspond to the 16 possible transitions (t_ij) indexed such that 1:A->A, 2:A->C, ..., 16:T->T Columns correspond to quality scores. Each entry must be between 0 and 1. Typically there are 41 columns for the quality scores 0-40. However, if USE_QUALS=FALSE, the matrix must have only one column. If selfConsist = TRUE, |
errorEstimationFunction |
(Optional). Function. Default If USE_QUALS = TRUE, If USE_QUALS = FALSE, this argument is ignored, and transition rates are estimated by maximum likelihood (t_ij = n_ij/n_i). |
selfConsist |
(Optional). If selfConsist = TRUE, the algorithm will alternate between sample inference and error rate estimation
until convergence. Error rate estimation is performed by If selfConsist=FALSE the algorithm performs one round of sample inference based on the provided |
pool |
(Optional). If pool = TRUE, the algorithm will pool together all samples prior to sample inference. If pool = FALSE, sample inference is performed on each sample individually. If pool = "pseudo", the algorithm will perform pseudo-pooling between individually processed samples. This argument has no effect if only 1 sample is provided, and |
priors |
(Optional). The priors argument provides a set of sequences for which there is prior information suggesting they may truly exist, i.e. are not errors. The abundance p-value of dereplicated sequences that exactly match one of the priors are calculated without conditioning on presence, allowing singletons to be detected, and are compared to a reduced threshold 'OMEGA_P' when forming new partitions. |
multithread |
(Optional). Default is FALSE.
If TRUE, multithreading is enabled and the number of available threads is automatically determined.
If an integer is provided, the number of threads to use is set by passing the argument on to
|
verbose |
(Optional). Default TRUE. Print verbose text output. More fine-grained control is available by providing an integer argument.
|
... |
(Optional). All dada_opts can be passed in as arguments to the dada() function.
See |
Briefly, dada
implements a statistical test for the notion that a specific sequence was seen too many times
to have been caused by amplicon errors from currently inferred sample sequences. Overly-abundant
sequences are used as the seeds of new partitions of sequencing reads, and the final set of partitions
is taken to represent the denoised composition of the sample. A more detailed explanation of the algorithm
is found in two publications:
Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP (2016). DADA2: High resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581-3.
Rosen MJ, Callahan BJ, Fisher DS, Holmes SP (2012). Denoising PCR-amplified metagenome data. BMC bioinformatics, 13(1), 283.
dada
depends on a parametric error model of substitutions. Thus the quality of its sample inference is affected
by the accuracy of the estimated error rates. selfConsist
mode allows these error rates to be inferred
from the data.
All comparisons between sequences performed by dada
depend on pairwise alignments. This step is the most
computationally intensive part of the algorithm, and two alignment heuristics have been implemented for speed:
A kmer-distance screen and banded Needleman-Wunsch alignmemt. See setDadaOpt
.
A dada-class
object or list of such objects if a list of dereps was provided.
fn1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2") fn2 <- system.file("extdata", "sam2F.fastq.gz", package="dada2") derep1 = derepFastq(fn1) derep2 = derepFastq(fn2) dada(fn1, err=tperr1) dada(list(sam1=derep1, sam2=derep2), err=tperr1, selfConsist=TRUE) dada(derep1, err=inflateErr(tperr1,3), BAND_SIZE=32, OMEGA_A=1e-20)
fn1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2") fn2 <- system.file("extdata", "sam2F.fastq.gz", package="dada2") derep1 = derepFastq(fn1) derep2 = derepFastq(fn2) dada(fn1, err=tperr1) dada(list(sam1=derep1, sam2=derep2), err=tperr1, selfConsist=TRUE) dada(derep1, err=inflateErr(tperr1,3), BAND_SIZE=32, OMEGA_A=1e-20)
dada
A multi-item List with the following named values...
$denoised: Integer vector, named by sequence valued by abundance, of the denoised sequences.
$clustering: An informative data.frame containing information on each cluster.
$sequence: A character vector of each denoised sequence. Identical to names($denoised).
$quality: The average quality scores for each cluster (row) by position (col).
$map: Integer vector that maps the unique (index of derep$unique) to the denoised sequence (index of dada$denoised).
$birth_subs: A data.frame containing the substitutions at the birth of each new cluster.
$trans: The matrix of transitions by type (row), eg. A2A, A2C..., and quality score (col) observed in the final output of the dada algorithm.
$err_in: The err matrix used for this invocation of dada.
$err_out: The err matrix estimated from the output of dada. NULL if err_function not provided.
$opts: A list of the dada_opts used for this invocation of dada.
A list
with the following three members.
$uniques: Named integer vector. Named by the unique sequence, valued by abundance.
$quals: Numeric matrix of average quality scores by position for each unique. Uniques are rows, positions are cols.
$map: Integer vector of length the number of reads, and value the index (in $uniques) of the unique to which that read was assigned.
This can be created from a FastQ sequence file using
derepFastq
A custom interface to FastqStreamer
for dereplicating amplicon sequences from fastq or compressed fastq files,
while also controlling peak memory requirement to support large files.
derepFastq(fls, n = 1e+06, verbose = FALSE, qualityType = "Auto")
derepFastq(fls, n = 1e+06, verbose = FALSE, qualityType = "Auto")
fls |
(Required). |
n |
(Optional). |
verbose |
(Optional). Default FALSE.
If TRUE, throw standard R |
qualityType |
(Optional). |
A derep-class
object or list of such objects.
# Test that chunk-size, `n`, does not affect the result. testFastq = system.file("extdata", "sam1F.fastq.gz", package="dada2") derep1 = derepFastq(testFastq, verbose = TRUE) derep1.35 = derepFastq(testFastq, n = 35, verbose = TRUE) all.equal(getUniques(derep1), getUniques(derep1.35)[names(getUniques(derep1))])
# Test that chunk-size, `n`, does not affect the result. testFastq = system.file("extdata", "sam1F.fastq.gz", package="dada2") derep1 = derepFastq(testFastq, verbose = TRUE) derep1.35 = derepFastq(testFastq, n = 35, verbose = TRUE) all.equal(getUniques(derep1), getUniques(derep1.35)[names(getUniques(derep1))])
A dataset containing the error matrix estimated by DADA2 from the forward reads of the Illumina Miseq 2x250 sequenced Balanced mock community (see manuscript).
A numerical matrix with 16 rows and 41 columns. Rows correspond to the 16 transition (eg. A2A, A2C, ...) Columns correspond to consensus quality scores 0 to 40.
A dataset containing the error matrix estimated by DADA2 from the reverse reads of the Illumina Miseq 2x250 sequenced Balanced mock community (see manuscript).
A numerical matrix with 16 rows and 41 columns. Rows correspond to the 16 transition (eg. A2A, A2C, ...) Columns correspond to consensus quality scores 0 to 40.
fastqFilter takes an input fastq file (can be compressed), filters it based on several
user-definable criteria, and outputs those reads which pass the filter
to a new fastq file (also can be compressed). Several functions in the ShortRead
package are leveraged to do this filtering.
fastqFilter( fn, fout, truncQ = 2, truncLen = 0, maxLen = Inf, minLen = 20, trimLeft = 0, trimRight = 0, maxN = 0, minQ = 0, maxEE = Inf, rm.phix = TRUE, rm.lowcomplex = 0, orient.fwd = NULL, n = 1e+06, OMP = TRUE, qualityType = "Auto", compress = TRUE, verbose = FALSE, ... )
fastqFilter( fn, fout, truncQ = 2, truncLen = 0, maxLen = Inf, minLen = 20, trimLeft = 0, trimRight = 0, maxN = 0, minQ = 0, maxEE = Inf, rm.phix = TRUE, rm.lowcomplex = 0, orient.fwd = NULL, n = 1e+06, OMP = TRUE, qualityType = "Auto", compress = TRUE, verbose = FALSE, ... )
fn |
(Required). The path to the input fastq file. |
fout |
(Required). The path to the output file.
Note that by default ( |
truncQ |
(Optional). Default 2.
Truncate reads at the first instance of a quality score less than or equal to |
truncLen |
(Optional). Default 0 (no truncation).
Truncate reads after |
maxLen |
(Optional). Default Inf (no maximum). Remove reads with length greater than maxLen. maxLen is enforced on the raw reads. |
minLen |
(Optional). Default 20. Remove reads with length less than minLen. minLen is enforced after all other trimming and truncation. |
trimLeft |
(Optional). Default 0.
The number of nucleotides to remove from the start of each read. If both |
trimRight |
(Optional). Default 0.
The number of nucleotides to remove from the end of each read. If both |
maxN |
(Optional). Default 0.
After truncation, sequences with more than |
minQ |
(Optional). Default 0. After truncation, reads contain a quality score below minQ will be discarded. |
maxEE |
(Optional). Default |
rm.phix |
(Optional). Default TRUE.
If TRUE, discard reads that match against the phiX genome, as determined by
|
rm.lowcomplex |
(Optional). Default 0.
If greater than 0, reads with an effective number of kmers less than this value will be removed.
The effective number of kmers is determined by |
orient.fwd |
(Optional). Default NULL. A character string present at the start of valid reads. Only allows unambiguous nucleotides. This string is compared to the start of each read, and the reverse complement of each read. If it exactly matches the start of the read, the read is kept. If it exactly matches the start of the reverse-complement read, the read is reverse-complemented and kept. Otherwise the read if filtered out. The primary use of this parameter is to unify the orientation of amplicon sequencing libraries that are a mixture of forward and reverse orientations, and that include the forward primer on the reads. |
n |
(Optional). The number of records (reads) to read in and filter at any one time.
This controls the peak memory requirement so that very large fastq files are supported.
Default is |
OMP |
(Optional). Default TRUE.
Whether or not to use OMP multithreading when calling |
qualityType |
(Optional). |
compress |
(Optional). Default TRUE. Whether the output fastq file should be gzip compressed. |
verbose |
(Optional). Default FALSE. Whether to output status messages. |
... |
(Optional). Arguments passed on to |
integer(2)
.
The number of reads read in, and the number of reads that passed the filter and were output.
fastqPairedFilter
FastqStreamer
trimTails
testFastq = system.file("extdata", "sam1F.fastq.gz", package="dada2") filtFastq <- tempfile(fileext=".fastq.gz") fastqFilter(testFastq, filtFastq, maxN=0, maxEE=2) fastqFilter(testFastq, filtFastq, trimLeft=10, truncLen=200, maxEE=2, verbose=TRUE)
testFastq = system.file("extdata", "sam1F.fastq.gz", package="dada2") filtFastq <- tempfile(fileext=".fastq.gz") fastqFilter(testFastq, filtFastq, maxN=0, maxEE=2) fastqFilter(testFastq, filtFastq, trimLeft=10, truncLen=200, maxEE=2, verbose=TRUE)
fastqPairedFilter filters pairs of input fastq files (can be compressed) based on several
user-definable criteria, and outputs those read pairs which pass the filter in both directions
to two new fastq file (also can be compressed). Several functions
in the ShortRead
package are leveraged to do this filtering. The filtered forward/reverse reads
remain identically ordered.
fastqPairedFilter( fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0, 0), maxLen = c(Inf, Inf), minLen = c(20, 20), trimLeft = c(0, 0), trimRight = c(0, 0), minQ = c(0, 0), maxEE = c(Inf, Inf), rm.phix = c(TRUE, TRUE), rm.lowcomplex = c(0, 0), matchIDs = FALSE, orient.fwd = NULL, id.sep = "\\s", id.field = NULL, n = 1e+06, OMP = TRUE, qualityType = "Auto", compress = TRUE, verbose = FALSE, ... )
fastqPairedFilter( fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0, 0), maxLen = c(Inf, Inf), minLen = c(20, 20), trimLeft = c(0, 0), trimRight = c(0, 0), minQ = c(0, 0), maxEE = c(Inf, Inf), rm.phix = c(TRUE, TRUE), rm.lowcomplex = c(0, 0), matchIDs = FALSE, orient.fwd = NULL, id.sep = "\\s", id.field = NULL, n = 1e+06, OMP = TRUE, qualityType = "Auto", compress = TRUE, verbose = FALSE, ... )
fn |
(Required). A |
fout |
(Required). A FILTERING AND TRIMMING ARGUMENTS If a length 1 vector is provided, the same parameter value is used for the forward and reverse reads. If a length 2 vector is provided, the first value is used for the forward reads, and the second for the reverse reads. |
maxN |
(Optional). Default 0.
After truncation, sequences with more than |
truncQ |
(Optional). Default 2.
Truncate reads at the first instance of a quality score less than or equal to |
truncLen |
(Optional). Default 0 (no truncation).
Truncate reads after |
maxLen |
(Optional). Default Inf (no maximum). Remove reads with length greater than maxLen. maxLen is enforced on the raw reads. |
minLen |
(Optional). Default 20. Remove reads with length less than minLen. minLen is enforced after all other trimming and truncation. |
trimLeft |
(Optional). Default 0.
The number of nucleotides to remove from the start of each read. If both |
trimRight |
(Optional). Default 0.
The number of nucleotides to remove from the end of each read. If both |
minQ |
(Optional). Default 0. After truncation, reads contain a quality score below minQ will be discarded. |
maxEE |
(Optional). Default |
rm.phix |
(Optional). Default TRUE.
If TRUE, discard reads that match against the phiX genome, as determined by
|
rm.lowcomplex |
(Optional). Default 0.
If greater than 0, reads with an effective number of kmers less than this value will be removed.
The effective number of kmers is determined by |
matchIDs |
(Optional). Default FALSE.
Whether to enforce matching between the id-line sequence identifiers of the forward and reverse fastq files.
If TRUE, only paired reads that share id fields (see below) are output.
If FALSE, no read ID checking is done.
Note: |
orient.fwd |
(Optional). Default NULL. A character string present at the start of valid reads. Only allows unambiguous nucleotides. This string is compared to the start of the forward and reverse reads. If it exactly matches the start of the forward read, the read is kept. If it exactly matches the start of the reverse read, the fwd/rev reads are swapped. Otherwise the read if filtered out. The primary use of this parameter is to unify the orientation of amplicon sequencing libraries that are a mixture of forward and reverse orientations, and that include the forward primer on the reads. ID MATCHING ARGUMENTS The following optional arguments enforce matching between the sequence identification strings in the forward and reverse reads, and can automatically detect and match ID fields in Illumina format, e.g: EAS139:136:FC706VJ:2:2104:15343:197393. ID matching is not required when using standard Illumina output fastq files. |
id.sep |
(Optional). Default "\s" (white-space).
The separator between fields in the id-line of the input fastq files. Passed to the |
id.field |
(Optional). Default NULL (automatic detection). The field of the id-line containing the sequence identifier. If NULL (the default) and matchIDs is TRUE, the function attempts to automatically detect the sequence identifier field under the assumption of Illumina formatted output. |
n |
(Optional). The number of records (reads) to read in and filter at any one time.
This controls the peak memory requirement so that very large fastq files are supported.
Default is |
OMP |
(Optional). Default TRUE.
Whether or not to use OMP multithreading when calling |
qualityType |
(Optional). |
compress |
(Optional). Default TRUE. Whether the output fastq files should be gzip compressed. |
verbose |
(Optional). Default FALSE. Whether to output status messages. |
... |
(Optional). Arguments passed on to |
integer(2)
.
The number of reads read in, and the number of reads that passed the filter and were output.
fastqFilter
FastqStreamer
trimTails
testFastqF = system.file("extdata", "sam1F.fastq.gz", package="dada2") testFastqR = system.file("extdata", "sam1R.fastq.gz", package="dada2") filtFastqF <- tempfile(fileext=".fastq.gz") filtFastqR <- tempfile(fileext=".fastq.gz") fastqPairedFilter(c(testFastqF, testFastqR), c(filtFastqF, filtFastqR), maxN=0, maxEE=2) fastqPairedFilter(c(testFastqF, testFastqR), c(filtFastqF, filtFastqR), trimLeft=c(10, 20), truncLen=c(240, 200), maxEE=2, rm.phix=TRUE, rm.lowcomplex=5, kmerSize=2)
testFastqF = system.file("extdata", "sam1F.fastq.gz", package="dada2") testFastqR = system.file("extdata", "sam1R.fastq.gz", package="dada2") filtFastqF <- tempfile(fileext=".fastq.gz") filtFastqR <- tempfile(fileext=".fastq.gz") fastqPairedFilter(c(testFastqF, testFastqR), c(filtFastqF, filtFastqR), maxN=0, maxEE=2) fastqPairedFilter(c(testFastqF, testFastqR), c(filtFastqF, filtFastqR), trimLeft=c(10, 20), truncLen=c(240, 200), maxEE=2, rm.phix=TRUE, rm.lowcomplex=5, kmerSize=2)
Filters and trims an input fastq file(s) (can be compressed) based on several user-definable criteria, and outputs fastq file(s) (compressed by default) containing those trimmed reads which passed the filters. Corresponding forward and reverse fastq file(s) can be provided as input, in which case filtering is performed on the forward and reverse reads independently, and both reads must pass for the read pair to be output.
filterAndTrim( fwd, filt, rev = NULL, filt.rev = NULL, compress = TRUE, truncQ = 2, truncLen = 0, trimLeft = 0, trimRight = 0, maxLen = Inf, minLen = 20, maxN = 0, minQ = 0, maxEE = Inf, rm.phix = TRUE, rm.lowcomplex = 0, orient.fwd = NULL, matchIDs = FALSE, id.sep = "\\s", id.field = NULL, multithread = FALSE, n = 1e+05, OMP = TRUE, qualityType = "Auto", verbose = FALSE )
filterAndTrim( fwd, filt, rev = NULL, filt.rev = NULL, compress = TRUE, truncQ = 2, truncLen = 0, trimLeft = 0, trimRight = 0, maxLen = Inf, minLen = 20, maxN = 0, minQ = 0, maxEE = Inf, rm.phix = TRUE, rm.lowcomplex = 0, orient.fwd = NULL, matchIDs = FALSE, id.sep = "\\s", id.field = NULL, multithread = FALSE, n = 1e+05, OMP = TRUE, qualityType = "Auto", verbose = FALSE )
fwd |
(Required). |
filt |
(Required). |
rev |
(Optional). Default NULL.
The file path(s) to the reverse fastq file(s) from paired-end sequence data corresponding to those
provided to the |
filt.rev |
(Optional). Default NULL, but required if |
compress |
(Optional). Default TRUE. If TRUE, the output fastq file(s) are gzipped. FILTERING AND TRIMMING PARAMETERS ——— Note: When filtering paired reads... If a length 1 vector is provided, the same parameter value is used for the forward and reverse reads. If a length 2 vector is provided, the first value is used for the forward reads, and the second for the reverse reads. |
truncQ |
(Optional). Default 2.
Truncate reads at the first instance of a quality score less than or equal to |
truncLen |
(Optional). Default 0 (no truncation).
Truncate reads after |
trimLeft |
(Optional). Default 0.
The number of nucleotides to remove from the start of each read. If both |
trimRight |
(Optional). Default 0.
The number of nucleotides to remove from the end of each read. If both |
maxLen |
(Optional). Default Inf (no maximum). Remove reads with length greater than maxLen. maxLen is enforced before trimming and truncation. |
minLen |
(Optional). Default 20. Remove reads with length less than minLen. minLen is enforced after trimming and truncation. |
maxN |
(Optional). Default 0.
After truncation, sequences with more than |
minQ |
(Optional). Default 0.
After truncation, reads contain a quality score less than |
maxEE |
(Optional). Default |
rm.phix |
(Optional). Default TRUE.
If TRUE, discard reads that match against the phiX genome, as determined by |
rm.lowcomplex |
(Optional). Default 0.
If greater than 0, reads with an effective number of kmers less than this value will be removed.
The effective number of kmers is determined by |
orient.fwd |
(Optional). Default NULL. A character string present at the start of valid reads. Only allows unambiguous nucleotides. This string is compared to the start of each read, and the reverse complement of each read. If it exactly matches the start of the read, the read is kept. If it exactly matches the start of the reverse-complement read, the read is reverse-complemented and kept. Otherwise the read if filtered out. For paired reads, the string is compared to the start of the forward and reverse reads, and if it matches the start of the reverse read the reaads are swapped and kept. The primary use of this parameter is to unify the orientation of amplicon sequencing libraries that are a mixture of forward and reverse orientations, and that include the forward primer on the reads. |
matchIDs |
(Optional). Default FALSE. Paired-read filtering only.
Whether to enforce matching between the id-line sequence identifiers of the forward and reverse fastq files.
If TRUE, only paired reads that share id fields (see below) are output.
If FALSE, no read ID checking is done.
Note: |
id.sep |
(Optional). Default "\s" (white-space). Paired-read filtering only.
The separator between fields in the id-line of the input fastq files. Passed to the |
id.field |
(Optional). Default NULL (automatic detection). Paired-read filtering only. The field of the id-line containing the sequence identifier. If NULL (the default) and matchIDs is TRUE, the function attempts to automatically detect the sequence identifier field under the assumption of Illumina formatted output. |
multithread |
(Optional). Default is FALSE.
If TRUE, input files are filtered in parallel via |
n |
(Optional). Default |
OMP |
(Optional). Default TRUE.
Whether or not to use OMP multithreading when calling |
qualityType |
(Optional). |
verbose |
(Optional). Default FALSE. Whether to output status messages. |
filterAndTrim
is a multithreaded convenience interface for the fastqFilter
and fastqPairedFilter
filtering functions.
Note that error messages and tracking are not handled gracefully when using the multithreading
functionality. If errors arise, it is recommended to re-run without multithreading to
troubleshoot the issue.
Integer matrix. Returned invisibly (i.e. only if assigned to something). Rows correspond to the input files, columns record the reads.in and reads.out after filtering.
fastqFilter
fastqPairedFilter
FastqStreamer
testFastqs = c(system.file("extdata", "sam1F.fastq.gz", package="dada2"), system.file("extdata", "sam2F.fastq.gz", package="dada2")) filtFastqs <- c(tempfile(fileext=".fastq.gz"), tempfile(fileext=".fastq.gz")) filterAndTrim(testFastqs, filtFastqs, maxN=0, maxEE=2, verbose=TRUE) filterAndTrim(testFastqs, filtFastqs, truncQ=2, truncLen=200, rm.phix=TRUE, rm.lowcomplex=8)
testFastqs = c(system.file("extdata", "sam1F.fastq.gz", package="dada2"), system.file("extdata", "sam2F.fastq.gz", package="dada2")) filtFastqs <- c(tempfile(fileext=".fastq.gz"), tempfile(fileext=".fastq.gz")) filterAndTrim(testFastqs, filtFastqs, maxN=0, maxEE=2, verbose=TRUE) filterAndTrim(testFastqs, filtFastqs, truncQ=2, truncLen=200, rm.phix=TRUE, rm.lowcomplex=8)
Get DADA options
getDadaOpt(option = NULL)
getDadaOpt(option = NULL)
option |
(Optional). Character. The DADA option(s) to get. |
Named list of option/value pairs. Returns NULL if an invalid option is requested.
getDadaOpt("BAND_SIZE") getDadaOpt()
getDadaOpt("BAND_SIZE") getDadaOpt()
Extract already computed error rates.
getErrors(obj, detailed = FALSE, enforce = TRUE)
getErrors(obj, detailed = FALSE, enforce = TRUE)
obj |
(Required). An R object with error rates. Supported objects: dada-class; list of dada-class; numeric matrix; named list with $err_out, $err_in, $trans. |
detailed |
(Optional). Default FALSE. If FALSE, an error rate matrix corresponding to $err_out is returned. If TRUE, a named list with $err_out, $err_in and $trans. $err_in and $trans can be NULL. |
enforce |
(Optional). Default TRUE. If TRUE, will check validity of $err_out and error if invalid or NULL. |
A numeric matrix of error rates. Or, if detailed=TRUE, a named list with $err_out, $err_in and $trans.
fl1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2") drp <- derepFastq(fl1) dd <- dada(drp, err=NULL, selfConsist=TRUE) err <- getErrors(dd)
fl1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2") drp <- derepFastq(fl1) dd <- dada(drp, err=NULL, selfConsist=TRUE) err <- getErrors(dd)
This function extracts the sequences from several different data objects, including
including dada-class
and derep-class
objects, as well as
data.frame
objects that have both $sequence and $abundance columns. This function
wraps the getUniques
function, but return only the names (i.e. the sequences).
Can also be provided the file path to a fasta or fastq file, a taxonomy table, or a
DNAStringSet object. Sequences are coerced to upper-case characters.
getSequences(object, collapse = FALSE, silence = TRUE)
getSequences(object, collapse = FALSE, silence = TRUE)
object |
(Required). The object from which to extract the sequences. |
collapse |
(Optional). Default FALSE.
Should duplicate sequences detected in |
silence |
(Optional). Default TRUE. Suppress reporting of the detection and merger of duplicated input sequences. |
character
. A character vector of the sequences.
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) dada1 <- dada(derep1, err=tperr1) getSequences(derep1)[1:5] getSequences(dada1)[1:5] getSequences(dada1$clustering)[1:5]
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) dada1 <- dada(derep1, err=tperr1) getSequences(derep1)[1:5] getSequences(dada1)[1:5] getSequences(dada1$clustering)[1:5]
This function extracts the uniques-vector
from several different data objects,
including dada-class
and derep-class
objects, as well as
data.frame
objects that have both $sequence and $abundance columns.
The return value is an integer vector named by sequence and valued by abundance. If the input is
already in uniques-vector
format, that same vector will be returned.
getUniques(object, collapse = TRUE, silence = FALSE)
getUniques(object, collapse = TRUE, silence = FALSE)
object |
(Required). The object from which to extract the |
collapse |
(Optional). Default TRUE.
Should duplicate sequences detected in |
silence |
(Optional). Default FALSE. Suppress reporting of the detection and merger of duplicated input sequences. |
integer
.
An integer vector named by unique sequence and valued by abundance.
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) dada1 <- dada(derep1, err=tperr1) getUniques(derep1)[1:3] getUniques(dada1)[1:3] getUniques(dada1$clustering)[1:3]
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) dada1 <- dada(derep1, err=tperr1) getUniques(derep1)[1:3] getUniques(dada1)[1:3] getUniques(dada1$clustering)[1:3]
Error rates are "inflated" by the specified factor, while appropriately saturating so that rates cannot exceed 1. The formula is: new_err_rate <- err_rate * inflate / (1 + (inflate-1) * err_rate)
inflateErr(err, inflation, inflateSelfTransitions = FALSE)
inflateErr(err, inflation, inflateSelfTransitions = FALSE)
err |
(Required). A numeric matrix of transition rates (16 rows, named "A2A", "A2C", ...). |
inflation |
(Required). The fold-factor by which to inflate the transition rates. |
inflateSelfTransitions |
(Optional). Default FALSE. If True, self-transitions (eg. A->A) are also inflated. |
An error rate matrix of the same dimensions as the input error rate matrix.
tperr2 <- inflateErr(tperr1, 2) tperr3.all <- inflateErr(tperr1, 3, inflateSelfTransitions=TRUE)
tperr2 <- inflateErr(tperr1, 2) tperr3.all <- inflateErr(tperr1, 3, inflateSelfTransitions=TRUE)
This function attempts to find an exact bimera of the parent sequences that matches the input sequence. A bimera is a two-parent chimera, in which the left side is made up of one parent sequence, and the right-side made up of a second parent sequence. If an exact bimera is found TRUE is returned, otherwise FALSE. Bimeras that are one-off from exact are also identified if the allowOneOff argument is TRUE.
isBimera( sq, parents, allowOneOff = FALSE, minOneOffParentDistance = 4, maxShift = 16 )
isBimera( sq, parents, allowOneOff = FALSE, minOneOffParentDistance = 4, maxShift = 16 )
sq |
(Required). A |
parents |
(Required). Character vector. A vector of possible "parent" sequence that could form the left and right sides of the bimera. |
allowOneOff |
(Optional). A |
minOneOffParentDistance |
(Optional). A |
maxShift |
(Optional). A |
logical(1)
.
TRUE if sq is a bimera of two of the parents. Otherwise FALSE.
isBimeraDenovo
, removeBimeraDenovo
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) sqs1 <- getSequences(derep1) isBimera(sqs1[[20]], sqs1[1:10])
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) sqs1 <- getSequences(derep1) isBimera(sqs1[[20]], sqs1[1:10])
This function is a wrapper around isBimera
for collections of unique
sequences (i.e. sequences with associated abundances). Each sequence is evaluated
against a set of "parents" drawn from the sequence collection that are sufficiently
more abundant than the sequence being evaluated. A logical vector is returned, with
an entry for each input sequence indicating whether it was (was not) consistent with
being a bimera of those more abundant "parents".
isBimeraDenovo( unqs, minFoldParentOverAbundance = 2, minParentAbundance = 8, allowOneOff = FALSE, minOneOffParentDistance = 4, maxShift = 16, multithread = FALSE, verbose = FALSE )
isBimeraDenovo( unqs, minFoldParentOverAbundance = 2, minParentAbundance = 8, allowOneOff = FALSE, minOneOffParentDistance = 4, maxShift = 16, multithread = FALSE, verbose = FALSE )
unqs |
(Required). A |
minFoldParentOverAbundance |
(Optional). A |
minParentAbundance |
(Optional). A |
allowOneOff |
(Optional). A |
minOneOffParentDistance |
(Optional). A |
maxShift |
(Optional). A |
multithread |
(Optional). Default is FALSE.
If TRUE, multithreading is enabled and the number of available threads is automatically determined.
If an integer is provided, the number of threads to use is set by passing the argument on to
|
verbose |
(Optional). |
logical
of length the number of input unique sequences.
TRUE if sequence is a bimera of more abundant "parent" sequences. Otherwise FALSE.
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) dada1 <- dada(derep1, err=tperr1, errorEstimationFunction=loessErrfun, selfConsist=TRUE) is.bim <- isBimeraDenovo(dada1) is.bim2 <- isBimeraDenovo(dada1$denoised, minFoldParentOverAbundance = 2, allowOneOff=TRUE)
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) dada1 <- dada(derep1, err=tperr1, errorEstimationFunction=loessErrfun, selfConsist=TRUE) is.bim <- isBimeraDenovo(dada1) is.bim2 <- isBimeraDenovo(dada1$denoised, minFoldParentOverAbundance = 2, allowOneOff=TRUE)
This function implements a table-specific version of de novo bimera detection. In short, bimeric sequences are flagged on a sample-by-sample basis. Then, a vote is performed for each sequence across all samples in which it appeared. If the sequence is flagged in a sufficiently high fraction of samples, it is identified as a bimera. A logical vector is returned, with an entry for each sequence in the table indicating whether it was identified as bimeric by this consensus procedure.
isBimeraDenovoTable( seqtab, minSampleFraction = 0.9, ignoreNNegatives = 1, minFoldParentOverAbundance = 1.5, minParentAbundance = 2, allowOneOff = FALSE, minOneOffParentDistance = 4, maxShift = 16, multithread = FALSE, verbose = FALSE )
isBimeraDenovoTable( seqtab, minSampleFraction = 0.9, ignoreNNegatives = 1, minFoldParentOverAbundance = 1.5, minParentAbundance = 2, allowOneOff = FALSE, minOneOffParentDistance = 4, maxShift = 16, multithread = FALSE, verbose = FALSE )
seqtab |
(Required). A sequence table. That is, an integer matrix with colnames corresponding to DNA sequences. |
minSampleFraction |
(Optional). Default is 0.9. The fraction of samples in which a sequence must be flagged as bimeric in order for it to be classified as a bimera. |
ignoreNNegatives |
(Optional). Default is 1.
The number of unflagged samples to ignore when evaluating whether the fraction of samples
in which a sequence was flagged as a bimera exceeds |
minFoldParentOverAbundance |
(Optional). Default is 1.5. Only sequences greater than this-fold more abundant than a sequence can be its "parents". Evaluated on a per-sample basis. |
minParentAbundance |
(Optional). Default is 2. Only sequences at least this abundant can be "parents". Evaluated on a per-sample basis. |
allowOneOff |
(Optional). Default is FALSE. If FALSE, sequences that have one mismatch or indel to an exact bimera are also flagged as bimeric. |
minOneOffParentDistance |
(Optional). Default is 4. Only sequences with at least this many mismatches to the potential bimeric sequence considered as possible "parents" when flagging one-off bimeras. There is no such screen when considering exact bimeras. |
maxShift |
(Optional). Default is 16. Maximum shift allowed when aligning sequences to potential "parents". |
multithread |
(Optional). Default is FALSE. If TRUE, multithreading is enabled. NOT YET IMPLEMENTED. |
verbose |
(Optional). Default FALSE. Print verbose text output. |
logical
of length equal to the number of sequences in the input table.
TRUE if sequence is identified as a bimera. Otherwise FALSE.
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) derep2 = derepFastq(system.file("extdata", "sam2F.fastq.gz", package="dada2")) dd <- dada(list(derep1,derep2), err=NULL, errorEstimationFunction=loessErrfun, selfConsist=TRUE) seqtab <- makeSequenceTable(dd) isBimeraDenovoTable(seqtab) isBimeraDenovoTable(seqtab, allowOneOff=TRUE, minSampleFraction=0.5)
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) derep2 = derepFastq(system.file("extdata", "sam2F.fastq.gz", package="dada2")) dd <- dada(list(derep1,derep2), err=NULL, errorEstimationFunction=loessErrfun, selfConsist=TRUE) seqtab <- makeSequenceTable(dd) isBimeraDenovoTable(seqtab) isBimeraDenovoTable(seqtab, allowOneOff=TRUE, minSampleFraction=0.5)
This function compares the word-profile of the input sequences to the phiX genome, and the reverse complement of the phiX genome. If enough exactly matching words are found, the sequence is flagged.
isPhiX(seqs, wordSize = 16, minMatches = 2, nonOverlapping = TRUE, ...)
isPhiX(seqs, wordSize = 16, minMatches = 2, nonOverlapping = TRUE, ...)
seqs |
(Required). A |
wordSize |
(Optional). Default 16. The size of the words to use for comparison. |
minMatches |
(Optional). Default 2. The minimum number of words in the input sequences that must match the phiX genome (or its reverse complement) for the sequence to be flagged. |
nonOverlapping |
(Optional). Default TRUE. If TRUE, only non-overlapping matching words are counted. |
... |
(Optional). Ignored. |
logical(1)
.
TRUE if sequence matched the phiX genome.
fastqFilter
, fastqPairedFilter
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) sqs1 <- getSequences(derep1) is.phi <- isPhiX(sqs1) is.phi <- isPhiX(sqs1, wordSize=20, minMatches=1)
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) sqs1 <- getSequences(derep1) is.phi <- isPhiX(sqs1) is.phi <- isPhiX(sqs1, wordSize=20, minMatches=1)
This function is a wrapper around isShift for collections of unique sequences. Each unique sequence is evaluated against a set of "parents" drawn from the sequence collection that are more abundant than the sequence being evaluated.
isShiftDenovo(unqs, minOverlap = 20, flagSubseqs = FALSE, verbose = FALSE)
isShiftDenovo(unqs, minOverlap = 20, flagSubseqs = FALSE, verbose = FALSE)
unqs |
(Required). A |
minOverlap |
(Optional). A |
flagSubseqs |
(Optional). A |
verbose |
(Optional). |
logical
of length the number of input unique sequences.
TRUE if sequence is an exact shift of a more abundant sequence. Otherwise FALSE.
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) dada1 <- dada(derep1, err=tperr1, errorEstimationFunction=loessErrfun, selfConsist=TRUE) is.shift <- isShiftDenovo(dada1) is.shift <- isShiftDenovo(dada1$denoised, minOverlap=50, verbose=TRUE)
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) dada1 <- dada(derep1, err=tperr1, errorEstimationFunction=loessErrfun, selfConsist=TRUE) is.shift <- isShiftDenovo(dada1) is.shift <- isShiftDenovo(dada1$denoised, minOverlap=50, verbose=TRUE)
derep-class
objects.Error rates are learned by alternating between sample inference and error rate estimation
until convergence. Sample inferences is performed by the dada
function.
Error rate estimation is performed by errorEstimationFunction
.
The output of this function serves as input to the dada function call as the err
parameter.
learnErrors( fls, nbases = 1e+08, nreads = NULL, errorEstimationFunction = loessErrfun, multithread = FALSE, randomize = FALSE, MAX_CONSIST = 10, OMEGA_C = 0, qualityType = "Auto", verbose = FALSE, ... )
learnErrors( fls, nbases = 1e+08, nreads = NULL, errorEstimationFunction = loessErrfun, multithread = FALSE, randomize = FALSE, MAX_CONSIST = 10, OMEGA_C = 0, qualityType = "Auto", verbose = FALSE, ... )
fls |
(Required). |
nbases |
(Optional). Default 1e8. The minimum number of total bases to use for error rate learning. Samples are read into memory until at least this number of total bases has been reached, or all provided samples have been read in. |
nreads |
(Optional). Default NULL. DEPRECATED. Please update your code to use the nbases parameter. |
errorEstimationFunction |
(Optional). Function. Default
|
multithread |
(Optional). Default is FALSE.
If TRUE, multithreading is enabled and the number of available threads is automatically determined.
If an integer is provided, the number of threads to use is set by passing the argument on to
|
randomize |
(Optional). Default FALSE. If FALSE, samples are read in the provided order until enough reads are obtained. If TRUE, samples are picked at random from those provided. |
MAX_CONSIST |
(Optional). Default 10. The maximum number of times to step through the self-consistency loop. If convergence was not reached in MAX_CONSIST steps, the estimated error rates in the last step are returned. |
OMEGA_C |
(Optional). Default 0.
The threshold at which unique sequences inferred to contain errors are corrected in the final output,
and used to estimate the error rates (see more at |
qualityType |
(Optional). |
verbose |
(Optional). Default TRUE Print verbose text output. More fine-grained control is available by providing an integer argument.
|
... |
(Optional). Additional arguments will be passed on to the |
A named list with three entries: $err_out: A numeric matrix with the learned error rates. $err_in: The initialization error rates (unimportant). $trans: A feature table of observed transitions for each type (eg. A->C) and quality score.
derepFastq
, plotErrors
, loessErrfun
, dada
fl1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2") fl2 <- system.file("extdata", "sam2F.fastq.gz", package="dada2") err <- learnErrors(c(fl1, fl2)) err <- learnErrors(c(fl1, fl2), nbases=5000000, randomize=TRUE) # Using a list of derep-class objects dereps <- derepFastq(c(fl1, fl2)) err <- learnErrors(dereps, multithread=TRUE, randomize=TRUE, MAX_CONSIST=20)
fl1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2") fl2 <- system.file("extdata", "sam2F.fastq.gz", package="dada2") err <- learnErrors(c(fl1, fl2)) err <- learnErrors(c(fl1, fl2), nbases=5000000, randomize=TRUE) # Using a list of derep-class objects dereps <- derepFastq(c(fl1, fl2)) err <- learnErrors(dereps, multithread=TRUE, randomize=TRUE, MAX_CONSIST=20)
This function accepts a matrix of observed transitions, with each transition
corresponding to a row (eg. row 2 = A->C) and each column to a quality score
(eg. col 31 = Q30). It returns a matrix of estimated error
rates of the same shape. Error rates are estimates by a loess
fit
of the observed rates of each transition as a function of the quality score.
Self-transitions (i.e. A->A) are taken to be the left-over probability.
loessErrfun(trans)
loessErrfun(trans)
trans |
(Required). A matrix of the observed transition counts. Must be 16 rows, with the rows named "A2A", "A2C", ... |
A numeric matrix with 16 rows and the same number of columns as trans.
The estimated error rates for each transition (row, eg. "A2C") and quality score
(column, eg. 31), as determined by loess
smoothing over the quality
scores within each transition category.
derep1 <- derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) dada1 <- dada(derep1, err=tperr1) err.new <- loessErrfun(dada1$trans)
derep1 <- derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) dada1 <- dada(derep1, err=tperr1) err.new <- loessErrfun(dada1$trans)
This function constructs a sequence table (analogous to an OTU table) from the provided list of samples.
makeSequenceTable(samples, orderBy = "abundance")
makeSequenceTable(samples, orderBy = "abundance")
samples |
(Required). A |
orderBy |
(Optional). |
Named integer matrix. A row for each sample, and a column for each unique sequence across all the samples. Note that the columns are named by the sequence which can make display a little unwieldy.
derep1 <- derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) derep2 <- derepFastq(system.file("extdata", "sam2F.fastq.gz", package="dada2")) dada1 <- dada(derep1, tperr1) dada2 <- dada(derep2, tperr1) seqtab <- makeSequenceTable(list(sample1=dada1, sample2=dada2))
derep1 <- derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) derep2 <- derepFastq(system.file("extdata", "sam2F.fastq.gz", package="dada2")) dada1 <- dada(derep1, tperr1) dada2 <- dada(derep2, tperr1) seqtab <- makeSequenceTable(list(sample1=dada1, sample2=dada2))
This function attempts to merge each denoised pair of forward and reverse reads, rejecting any pairs which do not sufficiently overlap or which contain too many (>0 by default) mismatches in the overlap region. Note: This function assumes that the fastq files for the forward and reverse reads were in the same order.
mergePairs( dadaF, derepF, dadaR, derepR, minOverlap = 12, maxMismatch = 0, returnRejects = FALSE, propagateCol = character(0), justConcatenate = FALSE, trimOverhang = FALSE, verbose = FALSE, ... )
mergePairs( dadaF, derepF, dadaR, derepR, minOverlap = 12, maxMismatch = 0, returnRejects = FALSE, propagateCol = character(0), justConcatenate = FALSE, trimOverhang = FALSE, verbose = FALSE, ... )
dadaF |
(Required). A |
derepF |
(Required). |
dadaR |
(Required). A |
derepR |
(Required). |
minOverlap |
(Optional). Default 12. The minimum length of the overlap required for merging the forward and reverse reads. |
maxMismatch |
(Optional). Default 0. The maximum mismatches allowed in the overlap region. |
returnRejects |
(Optional). Default FALSE.
If TRUE, the pairs that that were rejected based on mismatches in the overlap
region are retained in the return |
propagateCol |
(Optional). |
justConcatenate |
(Optional). Default FALSE. If TRUE, the forward and reverse-complemented reverse read are concatenated rather than merged, with a NNNNNNNNNN (10 Ns) spacer inserted between them. |
trimOverhang |
(Optional). Default FALSE. If TRUE, "overhangs" in the alignment between the forwards and reverse read are trimmed off. "Overhangs" are when the reverse read extends past the start of the forward read, and vice-versa, as can happen when reads are longer than the amplicon and read into the other-direction primer region. |
verbose |
(Optional). Default FALSE. If TRUE, a summary of the function results are printed to standard output. |
... |
(Optional). Further arguments to pass on to |
A data.frame
, or a list of data.frames
.
The return data.frame
(s) has a row for each unique pairing of forward/reverse denoised sequences,
and the following columns:
$abundance
: Number of reads corresponding to this forward/reverse combination.
$sequence
: The merged sequence.
$forward
: The index of the forward denoised sequence.
$reverse
: The index of the reverse denoised sequence.
$nmatch
: Number of matches nts in the overlap region.
$nmismatch
: Number of mismatches in the overlap region.
$nindel
: Number of indels in the overlap region.
$prefer
: The sequence used for the overlap region. 1=forward; 2=reverse.
$accept
: TRUE if overlap between forward and reverse denoised sequences was at least
minOverlap
and had at most maxMismatch
differences. FALSE otherwise.
$...
: Additional columns specified in propagateCol
.
A list of data.frames are returned if a list of input objects was provided.
derepFastq
, dada
, fastqPairedFilter
fnF <- system.file("extdata", "sam1F.fastq.gz", package="dada2") fnR = system.file("extdata", "sam1R.fastq.gz", package="dada2") dadaF <- dada(fnF, selfConsist=TRUE) dadaR <- dada(fnR, selfConsist=TRUE) merger <- mergePairs(dadaF, fnF, dadaR, fnR) merger <- mergePairs(dadaF, fnF, dadaR, fnR, returnRejects=TRUE, propagateCol=c("n0", "birth_ham")) merger <- mergePairs(dadaF, fnF, dadaR, fnR, justConcatenate=TRUE)
fnF <- system.file("extdata", "sam1F.fastq.gz", package="dada2") fnR = system.file("extdata", "sam1R.fastq.gz", package="dada2") dadaF <- dada(fnF, selfConsist=TRUE) dadaR <- dada(fnR, selfConsist=TRUE) merger <- mergePairs(dadaF, fnF, dadaR, fnR) merger <- mergePairs(dadaF, fnF, dadaR, fnR, returnRejects=TRUE, propagateCol=c("n0", "birth_ham")) merger <- mergePairs(dadaF, fnF, dadaR, fnR, justConcatenate=TRUE)
This function combines sequence tables together into one merged sequences table.
mergeSequenceTables( table1 = NULL, table2 = NULL, ..., tables = NULL, repeats = "error", orderBy = "abundance", tryRC = FALSE )
mergeSequenceTables( table1 = NULL, table2 = NULL, ..., tables = NULL, repeats = "error", orderBy = "abundance", tryRC = FALSE )
table1 |
(Optional, default=NULL). Named integer matrix. Rownames correspond to samples
and column names correspond to sequences. The output of |
table2 |
(Optional, default=NULL). Named integer matrix. Rownames correspond to samples
and column names correspond to sequences. The output of |
... |
(Optional). Additional sequence tables. |
tables |
(Optional, default=NULL). Either a list of sequence tables, or a list/vector of RDS filenames
corresponding to sequence tables. If provided, |
repeats |
(Optional). Default "error". Specifies how merging should proceed in the presence of repeated sample names. Valid values: "error", "sum". If "sum", then samples with the same name are summed together in the merged table. |
orderBy |
(Optional). |
tryRC |
(Optional). |
Named integer matrix. A row for each sample, and a column for each unique sequence across all the samples. Note that the columns are named by the sequence which can make display unwieldy.
## Not run: mergetab <- mergeSequenceTables(seqtab1, seqtab2, seqtab3) # unnamed arguments assumed to be sequence tables input_tables <- list(seqtab1, seqtab2, seqtab3) mergetab <- mergeSequenceTables(tables=input_tables) # list of sequence tables files <- c(file1, file2, file3) mergetab <- mergeSequenceTables(tables=files) # vector of filenames ## End(Not run)
## Not run: mergetab <- mergeSequenceTables(seqtab1, seqtab2, seqtab3) # unnamed arguments assumed to be sequence tables input_tables <- list(seqtab1, seqtab2, seqtab3) mergetab <- mergeSequenceTables(tables=input_tables) # list of sequence tables files <- c(file1, file2, file3) mergetab <- mergeSequenceTables(tables=files) # vector of filenames ## End(Not run)
Deactivate renaming of dada-class objects.
## S4 replacement method for signature 'dada,ANY' names(x) <- value
## S4 replacement method for signature 'dada,ANY' names(x) <- value
x |
an R object. |
value |
a character vector of up to the same length as |
NULL.
Deactivate renaming of derep-class objects.
## S4 replacement method for signature 'derep,ANY' names(x) <- value
## S4 replacement method for signature 'derep,ANY' names(x) <- value
x |
an R object. |
value |
a character vector of up to the same length as |
NULL.
This function accepts a matrix of observed transitions, groups together all observed
transitions regardless of quality scores, and estimates the error rate for that transition
as the observed fraction of those transitions. This can be used in place of the default
loessErrfun
when calling learnErrors
or link{dada}
with the effect that quality scores will be effectively ignored.
noqualErrfun(trans, pseudocount = 1)
noqualErrfun(trans, pseudocount = 1)
trans |
(Required). A matrix of the observed transition counts. Must be 16 rows, with the rows named "A2A", "A2C", ... |
pseudocount |
(Optional). Default 1. Added to each type of transition. |
A numeric matrix with 16 rows and the same number of columns as trans. The estimated error rates for each transition (row, eg. "A2C") are identical across all columns (which correspond to quality scores).
fl1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2") err.noqual <- learnErrors(fl1, errorEstimationFunction=noqualErrfun)
fl1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2") err.noqual <- learnErrors(fl1, errorEstimationFunction=noqualErrfun)
This function performs a Needleman-Wunsch alignment between two sequences.
nwalign( s1, s2, match = getDadaOpt("MATCH"), mismatch = getDadaOpt("MISMATCH"), gap = getDadaOpt("GAP_PENALTY"), homo_gap = NULL, band = -1, endsfree = TRUE, vec = FALSE )
nwalign( s1, s2, match = getDadaOpt("MATCH"), mismatch = getDadaOpt("MISMATCH"), gap = getDadaOpt("GAP_PENALTY"), homo_gap = NULL, band = -1, endsfree = TRUE, vec = FALSE )
s1 |
(Required). |
s2 |
(Required). |
match |
(Optional). |
mismatch |
(Optional). |
gap |
(Optional). |
homo_gap |
(Optional). |
band |
(Optional). |
endsfree |
(Optional). |
vec |
(Optional). |
character(2)
. The aligned sequences.
sq1 <- "CTAATACATGCAAGTCGAGCGAGTCTGCCTTGAAGATCGGAGTGCTTGCACTCTGTGAAACAAGATA" sq2 <- "TTAACACATGCAAGTCGAACGGAAAGGCCAGTGCTTGCACTGGTACTCGAGTGGCGAACGGGTGAGT" nwalign(sq1, sq2) nwalign(sq1, sq2, band=16)
sq1 <- "CTAATACATGCAAGTCGAGCGAGTCTGCCTTGAAGATCGGAGTGCTTGCACTCTGTGAAACAAGATA" sq2 <- "TTAACACATGCAAGTCGAACGGAAAGGCCAGTGCTTGCACTGGTACTCGAGTGGCGAACGGGTGAGT" nwalign(sq1, sq2) nwalign(sq1, sq2, band=16)
This function performs a Needleman-Wunsch alignment between two sequences, and then counts the number of mismatches and indels in that alignment. End gaps are not included in this count.
nwhamming(s1, s2, ...)
nwhamming(s1, s2, ...)
s1 |
(Required). |
s2 |
(Required). |
... |
(Optional). Further arguments to pass on to |
integer(1)
. The total number of mismatches and gaps, excluding gaps at the beginning
and end of the alignment.
sq1 <- "CTAATACATGCAAGTCGAGCGAGTCTGCCTTGAAGATCGGAGTGCTTGCACTCTGTGAAACAAGATA" sq2 <- "TTAACACATGCAAGTCGAACGGAAAGGCCAGTGCTTGCACTGGTACTCGAGTGGCGAACGGGTGAGT" nwhamming(sq1, sq2) nwhamming(sq1, sq2, band=16)
sq1 <- "CTAATACATGCAAGTCGAGCGAGTCTGCCTTGAAGATCGGAGTGCTTGCACTCTGTGAAACAAGATA" sq2 <- "TTAACACATGCAAGTCGAACGGAAAGGCCAGTGCTTGCACTGGTACTCGAGTGGCGAACGGGTGAGT" nwhamming(sq1, sq2) nwhamming(sq1, sq2, band=16)
This function accepts a matrix of observed transitions from PacBio CCS amplicon
sequencing data, with each transition
corresponding to a row (eg. row 2 = A->C) and each column to a quality score
(eg. col 31 = Q30). It returns a matrix of estimated error
rates of the same shape. Error rates are estimates by loessErrfun
for quality scores 0-92, and individually by the maximum likelihood estimate
for the maximum quality score of 93.
PacBioErrfun(trans)
PacBioErrfun(trans)
trans |
(Required). A matrix of the observed transition counts. Must be 16 rows, with the rows named "A2A", "A2C", ... |
A numeric matrix with 16 rows and the same number of columns as trans.
The estimated error rates for each transition (row, eg. "A2C") and quality score
(column, eg. 31), as determined by loess
smoothing over the quality
scores within each transition category.
derep.PB <- derepFastq(system.file("extdata", "samPB.fastq.gz", package="dada2")) dada.PB <- dada(derep.PB, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, selfConsist=TRUE) err.PB <- PacBioErrfun(dada.PB$trans)
derep.PB <- derepFastq(system.file("extdata", "samPB.fastq.gz", package="dada2")) dada.PB <- dada(derep.PB, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, selfConsist=TRUE) err.PB <- PacBioErrfun(dada.PB$trans)
This function plots a histogram of the distribution of sequence complexities
in the form of effective numbers of kmers as determined by seqComplexity
.
By default, kmers of size 2 are used, in which case a perfectly random sequences
will approach an effective kmer number of 16 = 4 (nucleotides) ^ 2 (kmer size).
plotComplexity( fl, kmerSize = 2, window = NULL, by = 5, n = 1e+05, bins = 100, aggregate = FALSE, ... )
plotComplexity( fl, kmerSize = 2, window = NULL, by = 5, n = 1e+05, bins = 100, aggregate = FALSE, ... )
fl |
(Required). |
kmerSize |
(Optional). Default 2. The size of the kmers (or "oligonucleotides" or "words") to use. |
window |
(Optional). Default NULL. The width in nucleotides of the moving window. If NULL the whole sequence is used. |
by |
(Optional). Default 5. The step size in nucleotides between each moving window tested. |
n |
(Optional). Default 100,000. The number of records to sample from the fastq file. |
bins |
(Optional). Default 100. The number of bins to use for the histogram. |
aggregate |
(Optional). Default FALSE. If TRUE, compute an aggregate quality profile for all fastq files provided. |
... |
(Optional). Arguments passed on to |
A ggplot2
object.
Will be rendered to default device if printed
,
or can be stored and further modified.
See ggsave
for additional options.
seqComplexity
oligonucleotideFrequency
plotComplexity(system.file("extdata", "sam1F.fastq.gz", package="dada2"))
plotComplexity(system.file("extdata", "sam1F.fastq.gz", package="dada2"))
This function plots the observed frequency of each transition (eg. A->C) as a function of the associated quality score. It also plots the final estimated error rates (if they exist). The initial input rates and the expected error rates under the nominal definition of quality scores can also be shown.
plotErrors( dq, nti = c("A", "C", "G", "T"), ntj = c("A", "C", "G", "T"), obs = TRUE, err_out = TRUE, err_in = FALSE, nominalQ = FALSE )
plotErrors( dq, nti = c("A", "C", "G", "T"), ntj = c("A", "C", "G", "T"), obs = TRUE, err_out = TRUE, err_in = FALSE, nominalQ = FALSE )
dq |
(Required). An object from which error rates can be extracted. Valid inputs are
coercible by |
nti |
(Optional). Default c("A","C","G","T"). Some combination of the 4 DNA nucleotides. |
ntj |
(Optional). Default c("A","C","G","T"). Some combination of the 4 DNA nucleotides. The error rates from nti->ntj will be plotted. If multiple nti or ntj are chosen, error rates from each-to-each will be plotted in a grid. |
obs |
(Optional). Default TRUE. If TRUE, the observed error rates are plotted as points. |
err_out |
(Optional). Default TRUE. If TRUE, plot the output error rates (solid line). |
err_in |
(Optional). Default FALSE. If TRUE, plot the input error rates (dashed line). |
nominalQ |
(Optional). Default FALSE. If TRUE, plot the expected error rates (red line) if quality scores exactly matched their nominal definition: Q = -10 log10(p_err). |
A ggplot2
object.
Will be rendered to default device if printed
,
or can be stored and further modified.
See ggsave
for additional options.
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2"), verbose = TRUE) dada1 <- dada(derep1, err = inflateErr(tperr1, 2), errorEstimationFunction = loessErrfun) plotErrors(dada1) plotErrors(dada1, "A", "C") plotErrors(dada1, nti="A", ntj=c("A","C","G","T"), err_in=TRUE, nominalQ=TRUE)
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2"), verbose = TRUE) dada1 <- dada(derep1, err = inflateErr(tperr1, 2), errorEstimationFunction = loessErrfun) plotErrors(dada1) plotErrors(dada1, "A", "C") plotErrors(dada1, nti="A", ntj=c("A","C","G","T"), err_in=TRUE, nominalQ=TRUE)
This function plots a visual summary of the distribution of quality scores as a function of sequence position for the input fastq file(s).
plotQualityProfile(fl, n = 5e+05, aggregate = FALSE)
plotQualityProfile(fl, n = 5e+05, aggregate = FALSE)
fl |
(Required). |
n |
(Optional). Default 500,000. The number of records to sample from the fastq file. |
aggregate |
(Optional). Default FALSE. If TRUE, compute an aggregate quality profile for all fastq files provided. |
The distribution of quality scores at each position is shown as a grey-scale heat map, with dark colors corresponding to higher frequency. The plotted lines show positional summary statistics: green is the mean, orange is the median, and the dashed orange lines are the 25th and 75th quantiles.
If the sequences vary in length, a red line will be plotted showing the percentage of reads that extend to at least that position.
A ggplot2
object.
Will be rendered to default device if printed
,
or can be stored and further modified.
See ggsave
for additional options.
plotQualityProfile(system.file("extdata", "sam1F.fastq.gz", package="dada2"))
plotQualityProfile(system.file("extdata", "sam1F.fastq.gz", package="dada2"))
This function reverse complements DNA sequence(s) provided.
This function is nothing more than a concisely-named convenience wrapper for
reverseComplement
that handles the character
vector
DNA sequences generated in the the dada2 package.
rc(sq)
rc(sq)
sq |
(Required). |
character
. The reverse-complemented DNA sequence(s).
R1492 <- "RGYTACCTTGTTACGACTT" rc(R1492) sqs <- getSequences(system.file("extdata", "example_seqs.fa", package="dada2")) rc(sqs)
R1492 <- "RGYTACCTTGTTACGACTT" rc(R1492) sqs <- getSequences(system.file("extdata", "example_seqs.fa", package="dada2")) rc(sqs)
This function is a convenience interface for chimera removal. Two methods to identify chimeras are
supported: Identification from pooled sequences (see isBimeraDenovo
for details)
and identification by consensus across samples (see isBimeraDenovoTable
for details).
Sequence variants identified as bimeric are removed, and a bimera-free collection of unique sequences
is returned.
removeBimeraDenovo(unqs, method = "consensus", ..., verbose = FALSE)
removeBimeraDenovo(unqs, method = "consensus", ..., verbose = FALSE)
unqs |
(Required). A |
method |
(Optional). Default is "consensus". Only has an effect if a sequence table is provided. If "pooled": The samples in the sequence table are all pooled together for bimera
identification ( If "consensus": The samples in a sequence table are independently checked for bimeras,
and a consensus decision on each sequence variant is made ( If "per-sample": The samples in a sequence table are independently checked for bimeras,
and sequence variants are removed (zeroed-out) from samples independently ( |
... |
(Optional). Arguments to be passed to |
verbose |
(Optional). Default FALSE. Print verbose text output. |
A uniques vector, or an object of matching class if a data.frame or sequence table is provided. A list of such objects is returned if a list of input unqs was provided.
isBimeraDenovoTable
, isBimeraDenovo
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) dada1 <- dada(derep1, err=tperr1, errorEstimationFunction=loessErrfun, selfConsist=TRUE) out.nobim <- removeBimeraDenovo(dada1) out.nobim <- removeBimeraDenovo(dada1$clustering, method="pooled", minFoldParentOverAbundance = 2)
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) dada1 <- dada(derep1, err=tperr1, errorEstimationFunction=loessErrfun, selfConsist=TRUE) out.nobim <- removeBimeraDenovo(dada1) out.nobim <- removeBimeraDenovo(dada1$clustering, method="pooled", minFoldParentOverAbundance = 2)
Removes primer(s) and orients the reads in input fastq file(s) (can be compressed). Reads that do not contain the primer(s) are discarded. Intended for use with PacBio CCS data. Faster external solutions such as cutadapt or trimmomatic are recommended for short-read data.
removePrimers( fn, fout, primer.fwd, primer.rev = NULL, max.mismatch = 2, allow.indels = FALSE, trim.fwd = TRUE, trim.rev = TRUE, orient = TRUE, compress = TRUE, verbose = FALSE )
removePrimers( fn, fout, primer.fwd, primer.rev = NULL, max.mismatch = 2, allow.indels = FALSE, trim.fwd = TRUE, trim.rev = TRUE, orient = TRUE, compress = TRUE, verbose = FALSE )
fn |
(Required). |
fout |
(Required). |
primer.fwd |
(Required). |
primer.rev |
(Optional). Default NULL. The reverse primer sequence expected to be at the end of the sequenced amplicon. Can contain IUPAC ambiguous nucleotide codes. NOTE: 'primer.rev' should be provided in the orientation that would appear in a DNA sequence starting at the forward primer and being read towards the reverse primer. Thus, it is often necessary to reverse-complement the reverse primer sequence before providing it to this function. |
max.mismatch |
(Optional). Default 2.
The number of mismatches to tolerate when matching reads to primer sequences.
See |
allow.indels |
(Optional). Default FALSE. If TRUE, indels ared allowed when matching the primer sequences to the read. If FALSE, no indels are allowed. Note that when 'allow.indels=TRUE', primer matching is significantly slower, currently about 4x slower. |
trim.fwd |
(Optional). Default TRUE. If TRUE, reads are trimmed to the end of the forward primer, i.e. the forward primer and any preceding sequence are trimmed off. |
trim.rev |
(Optional). Default TRUE. If TRUE, reads are trimmed to the beginning of the reverse primer, i.e. the reverse primer and any subsequent sequence are trimmed off. |
orient |
(Optional). Default TRUE. If TRUE, reads are re-oriented if the reverse complement of the read is a better match to the provided primer sequence(s). This is recommended for PacBio CCS reads, which come in a random mix of forward and reverse-complement orientations. |
compress |
(Optional). Default TRUE. If TRUE, the output fastq file(s) are gzipped. |
verbose |
(Optional). Default FALSE. Whether to output status messages. |
Integer matrix. Returned invisibly (i.e. only if assigned to something). Rows correspond to the input files, columns record the number of reads.in and reads.out after discarding reads that didn't match the provided primers.
F27 <- "AGRGTTYGATYMTGGCTCAG" R1492 <- "RGYTACCTTGTTACGACTT" fn <- system.file("extdata", "samPBprimers.fastq.gz", package="dada2") fn.noprime <- tempfile(fileext=".fastq.gz") removePrimers(fn, fn.noprime, primer.fwd=F27, primer.rev=rc(R1492), orient=TRUE, verbose=TRUE)
F27 <- "AGRGTTYGATYMTGGCTCAG" R1492 <- "RGYTACCTTGTTACGACTT" fn <- system.file("extdata", "samPBprimers.fastq.gz", package="dada2") fn.noprime <- tempfile(fileext=".fastq.gz") removePrimers(fn, fn.noprime, primer.fwd=F27, primer.rev=rc(R1492), orient=TRUE, verbose=TRUE)
This function calculates the kmer complexity of input sequences. Complexity is quantified as the Shannon richness of kmers, which can be thought of as the effective number of kmers if they were all at equal frequencies. If a window size is provided, the minimum Shannon richness observed over sliding window along the sequence is returned.
seqComplexity(seqs, kmerSize = 2, window = NULL, by = 5, ...)
seqComplexity(seqs, kmerSize = 2, window = NULL, by = 5, ...)
seqs |
(Required). A |
kmerSize |
(Optional). Default 2. The size of the kmers (or "oligonucleotides" or "words") to use. |
window |
(Optional). Default NULL. The width in nucleotides of the moving window. If NULL the whole sequence is used. |
by |
(Optional). Default 5. The step size in nucleotides between each moving window tested. |
... |
(Optional). Ignored. |
This function can be used to identify potentially artefactual or undesirable low-complexity sequences, or sequences with low-complexity regions, as are sometimes observed in Illumina sequencing runs. When such artefactual sequences are present, the Shannon kmer richness values returned by this function will typically show a clear bimodal signal.
Kmers with non-ACGT characters are ignored. Also note that no correction is performed for sequence lengths. This is important when using longer kmer lengths, where 4^wordSize approaches the length of the sequence, as shorter sequences will then have a lower effective richness simply due to their being too little sequence to sample all the possible kmers.
numeric
.
A vector of minimum kmer complexities for each sequence.
plotComplexity
oligonucleotideFrequency
sq.norm <- "TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGA" sq.lowc <- "TCCTTCTTCTCCTCTCTTTCTCCTTCTTTCTTTTTTTTCCCTTTCTCTTCTTCTTTTTCTTCCTTCCTTTTTTC" sq.part <- "TTTTTCTTCTCCCCCTTCCCCTTTCCTTTTCTCCTTTTTTCCTTTAGTGCAGTTGAGGCAGGCGGAATTCGTGG" sqs <- c(sq.norm, sq.lowc, sq.part) seqComplexity(sqs) seqComplexity(sqs, kmerSize=3, window=25)
sq.norm <- "TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGA" sq.lowc <- "TCCTTCTTCTCCTCTCTTTCTCCTTCTTTCTTTTTTTTCCCTTTCTCTTCTTCTTTTTCTTCCTTCCTTTTTTC" sq.part <- "TTTTTCTTCTCCCCCTTCCCCTTTCCTTTTCTCCTTTTTTCCTTTAGTGCAGTTGAGGCAGGCGGAATTCGTGG" sqs <- c(sq.norm, sq.lowc, sq.part) seqComplexity(sqs) seqComplexity(sqs, kmerSize=3, window=25)
setDadaOpt sets the default options used by the dada(...) function for your current session, much
like par
sets the session default plotting parameters. However, all dada options can be set as
part of the dada(...) function call itself by including a DADA_OPTION_NAME=VALUE argument.
setDadaOpt(...)
setDadaOpt(...)
... |
(Required). The DADA options to set, along with their new value. |
**Sensitivity**
OMEGA_A: This parameter sets the threshold for when DADA2 calls unique sequences significantly overabundant, and therefore creates a new partition with that sequence as the center. Default is 1e-40, which is a conservative setting to avoid making false positive inferences, but which comes at the cost of reducing the ability to identify some rare variants.
OMEGA_P: The threshold for unique sequences with prior evidence of existence (see 'priors' argument). Default is 1e-4.
OMEGA_C: The threshold at which unique sequences inferred to contain errors are corrected in the final output. The probability that each unique sequence is generated at its observed abundance from the center of its final partition is evaluated, and compared to OMEGA_C. If that probability is >= OMEGA_C, it is "corrected", i.e. replaced by the partition center sequence. The special value of 0 corresponds to correcting all input sequences, and any value > 1 corresponds to performing no correction on sequences found to contain errors. Default is 1e-40 (same as OMEGA_A).
DETECT_SINGLETONS: If set to TRUE, this removes the requirement for at least two reads with the same sequences to exist in order for a new ASV to be detected. It also somewhat increases sensitivity to other low abundance sequences as well, e.g. those present in just 2/3/4/... reads. Note, this applies to all unique sequences, not just those supported by prior evidence (see 'priors' argument), and so it does make false-positive detections more likely.
**Alignment**
MATCH: The score of a match in the Needleman-Wunsch alignment. Default is 4.
MISMATCH: The score of a mismatch in the Needleman-Wunsch alignment. Default is -5.
GAP_PENALTY: The cost of gaps in the Needleman-Wunsch alignment. Default is -8.
HOMOPOLYMER_GAP_PENALTY: The cost of gaps in homopolymer regions (>=3 repeated bases). Default is NULL, which causes homopolymer gaps to be treated as normal gaps.
BAND_SIZE: When set, banded Needleman-Wunsch alignments are performed. Banding restricts the net cumulative number of insertion of one sequence relative to the other. The default value of BAND_SIZE is 16. If DADA is applied to sequencing technologies with high rates of indels, such as 454 sequencing, the BAND_SIZE parameter should be increased. Setting BAND_SIZE to a negative number turns off banding (i.e. full Needleman-Wunsch).
**Sequence Comparison Heuristics**
USE_KMERS: If TRUE, a 5-mer distance screen is performed prior to performing each pairwise alignment, and if the 5mer-distance is greater than KDIST_CUTOFF, no alignment is performed. Default is TRUE.
KDIST_CUTOFF: The default value of 0.42 was chosen to screen pairs of sequences that differ by >10%, and was calibrated on Illumina sequenced 16S amplicon data. The assumption is that sequences that differ by such a large amount cannot be linked by amplicon errors (i.e. if you sequence one, you won't get a read of other) and so careful (and costly) alignment is unnecessary.
GAPLESS: If TRUE, the ordered kmer identity between pairs of sequences is compared to their unordered overlap. If equal, the optimal alignment is assumed to be gapless. Default is TRUE. Only relevant if USE_KMERS is TRUE.
GREEDY: The DADA2 algorithm is not greedy, but a very restricted form of greediness can be turned on via this option. If TRUE, unique sequences with reads less than those expected to be generated by resequencing just the central unique in their partition are "locked" to that partition. Modest (~30%) speedup, and almost no impact on output. Default is TRUE.
**New Partition Conditions**
MIN_FOLD: The minimum fold-overabundance for sequences to form new partitions. Default value is 1, which means this criteria is ignored.
MIN_HAMMING: The minimum hamming-separation for sequences to form new partitions. Default value is 1, which means this criteria is ignored.
MIN_ABUNDANCE: The minimum abundance for unique sequences form new partitions. Default value is 1, which means this criteria is ignored.
MAX_CLUST: The maximum number of partitions. Once this many partitions have been created, the algorithm terminates regardless of whether the statistical model suggests more real sequence variants exist. If set to 0 this argument is ignored. Default value is 0.
**Self Consistency**
MAX_CONSIST: The maximum number of steps when selfConsist=TRUE. If convergence is not reached in MAX_CONSIST steps, the algorithm will terminate with a warning message. Default value is 10.
**Pseudo-pooling Behavior**
PSEUDO_PREVALENCE: When performing pseudo-pooling, all sequence variants found in at least this many samples are used as priors for a subsequent round of sample inference. Only relevant if 'pool="pseudo"'. Default is 2.
PSEUDO_ABUNDANCE: When performing pseudo-pooling, all denoised sequence variants with total abundance (over all samples) greater than this are used as priors for a subsequent round of sample inference. Only relevant if 'pool="pseudo"'. Default is Inf (i.e. abundance ignored for this purpose).
**Error Model**
USE_QUALS: If TRUE, the dada(...) error model takes into account the consensus quality score of the dereplicated unique sequences. If FALSE, quality scores are ignored. Default is TRUE.
**Technical**
SSE: Controls the level of explicit SSE vectorization for kmer calculations. Default 2. Maintained for development reasons, should have no impact on output.
0: No explicit vectorization (but modern compilers will auto-vectorize the code).
1: Explicit SSE2.
2: Explicit, packed SSE2 using 8-bit integers. Slightly faster than SSE=1.
NULL.
setDadaOpt(OMEGA_A = 1e-20) setDadaOpt(MATCH=1, MISMATCH=-4, GAP_PENALTY=-6) setDadaOpt(GREEDY=TRUE, GAPLESS=TRUE)
setDadaOpt(OMEGA_A = 1e-20) setDadaOpt(MATCH=1, MISMATCH=-4, GAP_PENALTY=-6) setDadaOpt(GREEDY=TRUE, GAPLESS=TRUE)
See the general documentation of show
method for
expected behavior.
## S4 method for signature 'derep' show(object) ## S4 method for signature 'dada' show(object)
## S4 method for signature 'derep' show(object) ## S4 method for signature 'dada' show(object)
object |
Any R object |
NULL.
A dataset containing the error matrix estimated by fitting a piecewise linear model to the errors observed in the mock community featured in Schirmer 2015 (metaID 35).
A numerical matrix with 16 rows and 41 columns. Rows correspond to the 16 transition (eg. A2A, A2C, ...) Columns correspond to consensus quality scores 0 to 40.
The uniques vector is an integer
vector that is named by the unique sequence, and
valued by the abundance of that sequence. This format is commonly used within the
dada2-package
, for function inputs and outputs. The getUniques
function coerces a variety of input objects into the uniques-vector format, including
dada-class
and derep-class
objects.
A wrapper for writeFastq in the ShortRead package. Default output format is compatible with uchime.
uniquesToFasta(unqs, fout, ids = NULL, mode = "w", width = 20000, ...)
uniquesToFasta(unqs, fout, ids = NULL, mode = "w", width = 20000, ...)
unqs |
(Required).
A |
fout |
(Required). The file path of the output file. |
ids |
(Optional). |
mode |
(Optional). Default "w".
Passed on to |
width |
(Optional). Default 20000.
The number of characters per line in the file. Default is effectively one line
per sequence. Passed on to |
... |
Additional parameters passed on to |
NULL.
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) outfile <- tempfile(fileext=".fasta") uniquesToFasta(derep1, outfile) uniquesToFasta(derep1, outfile, ids=paste0("Sequence", seq(length(getSequences(derep1)))))
derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2")) outfile <- tempfile(fileext=".fasta") uniquesToFasta(derep1, outfile) uniquesToFasta(derep1, outfile, ids=paste0("Sequence", seq(length(getSequences(derep1)))))
Writes a named character vector of DNA sequences to a fasta file. Values are the sequences, and names are used for the id lines.
## S4 method for signature 'character' writeFasta(object, file, mode = "w", width = 20000L, ...)
## S4 method for signature 'character' writeFasta(object, file, mode = "w", width = 20000L, ...)
object |
(Required). A named |
file |
(Required). The output file. |
mode |
(Optional). Default "w". Append with "a". |
width |
(Optional). Default 20000L. Maximum line length before newline. |
... |
(Optional). Additonal arguments passed to |
NULL.