More detailed documentation is available at the DADA2 Home Page. In particular, the online tutorial workflow is the most detailed and up-to-date demonstration of applying DADA2 to multi-sample amplicon datasets.
The investigation of environmental microbial communities and microbiomes has been revolutionized by the development of high-throughput amplicon sequencing. In amplicon sequencing a particular genetic locus, for example the 16S rRNA gene in bacteria, is amplified from DNA extracted from the community of interest, and then sequenced on a next-generation sequencing platform. This technique removes the need to culture microbes in order to detect their presence, and cost-effectively provides a deep census of a microbial community.
However, the process of amplicon sequencing introduces errors into the sequencing data, and these errors severely complicate the interpretation of the results. DADA2 implements a novel algorithm that models the errors introduced during amplicon sequencing, and uses that error model to infer the true sample composition. DADA2 replaces the traditional “OTU-picking” step in amplicon sequencing workflows, producing instead higher-resolution tables of amplicon sequence variants (ASVs).
As seen in the paper introducing DADA2 and in further benchmarking, the DADA2 method is more sensitive and specific than traditional OTU methods: DADA2 detects real biological variation missed by OTU methods while outputting fewer spurious sequences. Another recent paper describes how replacing OTUs with ASVs improves the precision, comprehensiveness and reproducibility of marker-gene data analysys.
The starting point for the dada2 pipeline is a set of demultiplexed fastq files corresponding to the samples in your amplicon sequencing study. That is, dada2 expects there to be an individual fastq file for each sample (or two fastq files, one forward and one reverse, for each sample). Demultiplexing is often performed at the sequencing center, but if that has not been done there are a variety of tools do accomplish this, including the popular QIIME python scripts split_libraries_fastq.py followed by split_sequence_file_on_sample_ids.py, and the utility idemp, among others.
In addition, dada2 expects that there are no non-biological bases in the sequencing data. This may require pre-processing with outside software if, as is relatively common, your PCR primers were included in the sequenced amplicon region.
Once demultiplexed fastq files without non-biological nucleotides are in hand, the dada2 pipeline proceeds as follows:
filterAndTrim()
derepFastq()
learnErrors()
dada()
mergePairs()
makeSequenceTable()
removeBimeraDenovo()
The output of the dada2 pipeline is a feature table of amplicon sequence variants (an ASV table): A matrix with rows corresponding to samples and columns to ASVs, in which the value of each entry is the number of times that ASV was observed in that sample. This table is analogous to the traditional OTU table, except at higher resolution, i.e exact amplicon sequence variants rather than (usually 97%) clusters of sequencing reads.
We now go through the pipeline on a highly simplified dataset of just one paired-end sample (we’ll add a second later).
We’ll start by getting the filenames of our example paired-end fastq files. Usually you will define these filenames directly, or read them out of a directory, but for this tutorial we’re using files included with the package, which we can identify via a particular function call:
## [1] '1.35.0'
fnF1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2")
fnR1 <- system.file("extdata", "sam1R.fastq.gz", package="dada2")
filtF1 <- tempfile(fileext=".fastq.gz")
filtR1 <- tempfile(fileext=".fastq.gz")
Note that the dada2 package “speaks” the gzip format natively, therefore all fastq files can remain in the space-saving gzip format throughout.
Now that we have the filenames, we’re going to inspect the quality profile of our data:
After inspecting the quality profiles, it is clear that the reverse read quality drops off more severely than in the forward read. Thus we are going to trim the forward reads at position 240, and the reverse reads at position 200.
Filtering is an important step when dealing with sequence data, as low-quality sequences can contain unexpected and misleading errors. Trimming is also usually advised, as Illumina sequencing quality tends to drop off at the end of reads, and the initial nucleotides can also be problematic due to calibration issues:
filterAndTrim(fwd=fnF1, filt=filtF1, rev=fnR1, filt.rev=filtR1,
trimLeft=10, truncLen=c(240, 200),
maxN=0, maxEE=2,
compress=TRUE, verbose=TRUE)
## Read in 1500 paired-sequences, output 890 (59.3%) filtered paired-sequences.
The filterAndTrim(...)
function filters the forward and
reverse reads jointly, outputting only those pairs of reads that both
pass the filter. In this function call we did four things: We removed
the first trimLeft=10
nucleotides of each read. We
truncated the forward and reverse reads at
truncLen=c(240, 200)
nucleotides respectively. We filtered
out all reads with more than maxN=0
ambiguous nucleotides.
And we filtered out all reads with more than two expected
errors. The filtered output files were stored as gzipped fastq files
(compress=TRUE
).
This represents a fairly standard set of filtering/trimming parameters. However, it is always worth evaluating whether the filtering and trimming parameters you are using are appropriate for your data. One size does not fit all! (And are you sure you have removed your primers?)
An important consideration: If using paired-end sequencing data, you must maintain a suitable overlap (>20nts) between the forward and reverse reads after trimming! This is especially important to keep in mind for mult-V-region amplicions (such as V3-V4) in which there may be relatively little overlap to begin with, and thus little read-truncation is possible if reads are to be merged later on.
The next thing we want to do is “dereplicate” the filtered fastq files. During dereplication, we condense the data by collapsing together all reads that encode the same sequence, which significantly reduces later computation times:
## Dereplicating sequence entries in Fastq file: /tmp/RtmpMKDX3k/file160e119bd76b.fastq.gz
## Encountered 332 unique sequences from 890 total sequences read.
## Dereplicating sequence entries in Fastq file: /tmp/RtmpMKDX3k/file160e71884860.fastq.gz
## Encountered 485 unique sequences from 890 total sequences read.
Dereplication is a common step in almost all modern ASV inference (or
OTU picking) pipelines, but a unique feature of derepFastq
is that it maintains a summary of the quality information for each
dereplicated sequence in $quals
.
The dada algorithm uses a parametric model of the errors introduced by PCR amplification and sequencing. Those error parameters typically vary between sequencing runs and PCR protocols, so our method provides a way to estimate those parameters from the data itself.
## 204700 total bases in 890 reads from 1 samples will be used for learning the error rates.
## 169100 total bases in 890 reads from 1 samples will be used for learning the error rates.
We recommend using at least a subest of your data to learn your error rates for the most accurate sample inference.
The core method of the dada2 package is at the sample inference
stage. The dada(...)
function implements the
algorithm described in our paper, and is simultaneously more
sensitive and more specific than any OTU algorithm we have ever
tested.
Here we call the dada(...)
function, using the models of
the error rates wer learned in the previous step:
## Sample 1 - 890 reads in 332 unique sequences.
## Sample 1 - 890 reads in 485 unique sequences.
## dada-class: object describing DADA2 denoising results
## 8 sequence variants were inferred from 332 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
The dada(...)
algorithm inferred 8 sequence variants
from the forward reads.
We’ve inferred the sample sequences in the forward and reverse reads independently. Now it’s time to merge those inferred sequences together, throwing out those pairs of reads that don’t match:
## 834 paired-reads (in 7 unique pairings) successfully merged out of 882 (in 10 pairings) input.
The mergePairs(...)
function returns a
data.frame
corresponding to each successfully merged unique
sequence. The $forward
and $reverse
columns
record which forward and reverse sequence contributed to that merged
sequence.
The dada(...)
algorithm models and removes substitution
errors, but chimeras are another importance source of spurious sequences
in amplicon sequencing. Chimeras are formed during PCR amplification.
When one sequence is incompletely amplified, the incomplete amplicon
primes the next amplification step, yielding a spurious amplicon. The
result is a sequence read which is half of one sample sequence and half
another.
We identify and remove those sequence using the
removeBimeraDenovo(...)
function in the dada2 pipeline:
## Identified 0 bimeras out of 7 input sequences.
We now have a data.frame of merged, error-free, non-chimeric, amplicon sequence variants!
In order to show an example of making a sequence table, and to reiterate the workflow outlined above, we now process a second sample:
# Assign filenames
fnF2 <- system.file("extdata", "sam2F.fastq.gz", package="dada2")
fnR2 <- system.file("extdata", "sam2R.fastq.gz", package="dada2")
filtF2 <- tempfile(fileext=".fastq.gz")
filtR2 <- tempfile(fileext=".fastq.gz")
# Filter and Trim
filterAndTrim(fwd=fnF2, filt=filtF2, rev=fnR2, filt.rev=filtR2, maxN=0, trimLeft=10, truncLen=c(240, 200), maxEE=2, compress=TRUE, verbose=TRUE)
## Read in 1500 paired-sequences, output 858 (57.2%) filtered paired-sequences.
## Dereplicating sequence entries in Fastq file: /tmp/RtmpMKDX3k/file160e18a4c5e1.fastq.gz
## Encountered 305 unique sequences from 858 total sequences read.
## Dereplicating sequence entries in Fastq file: /tmp/RtmpMKDX3k/file160e65898ca0.fastq.gz
## Encountered 448 unique sequences from 858 total sequences read.
# Infer sample composition (using already learned error rates)
dadaF2 <- dada(derepF2, err=errF, multithread=FALSE)
## Sample 1 - 858 reads in 305 unique sequences.
## Sample 1 - 858 reads in 448 unique sequences.
## 731 paired-reads (in 6 unique pairings) successfully merged out of 852 (in 10 pairings) input.
With that second sample processed, we can go ahead and create a sequence table.
Finally we want to combine the inferred samples into one unified
table. For that purpose we use makeSequenceTable
:
seqtab <- makeSequenceTable(list(merger1, merger2))
seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=TRUE)
## Identified 0 bimeras out of 7 input sequences.
## [1] 2 7
This is the final product of the dada2 pipeline, a matrix in which each row corresponds to a processed sample, and each column corresponds to an non-chimeric inferred sample sequence (a more precise analogue to the common “OTU table”). From here we recommend proceeding forward with our friend the phyloseq package for further analysis…