Title: | Tools for Exploratory Analysis of Variant Calls |
---|---|
Description: | Explore, diagnose, and compare variant calls using filters. |
Authors: | Michael Lawrence, Jeremiah Degenhardt, Robert Gentleman |
Maintainer: | Michael Lawrence <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.49.0 |
Built: | 2024-10-31 06:30:25 UTC |
Source: | https://github.com/bioc/VariantTools |
Matches the case
variants to the (typically unfiltered)
control
variants and returns case
, with the additional
metadata columns control.alt.depth
and
control.total.depth
, corresponding to altDepth(control)
and totalDepth(control)
, respectively.
annotateWithControlDepth(case, control, control.cov)
annotateWithControlDepth(case, control, control.cov)
case |
The variants of interest, as a |
control |
The control variants, typically unfiltered, as a |
control.cov |
The control coverage, as an |
If a case
variant is not found in control
, a count of
0
is assigned to control.alt.depth
, under the assumption
that the control
object is not filtered, i.e., it contains the
raw tallies.
case
, plus two new metadata columns, control.alt.depth
and control.total.depth
Michael Lawrence
callSampleSpecificVariants
, which uses this function.
bams <- LungCancerLines::LungCancerBamFiles() data(vignette) case <- callVariants(tallies_H1993) control <- tallies_H2073 control.cov <- coverage(bams$H2073) annotateWithControlDepth(case, control, control.cov)
bams <- LungCancerLines::LungCancerBamFiles() data(vignette) case <- callVariants(tallies_H1993) control <- tallies_H2073 control.cov <- coverage(bams$H2073) annotateWithControlDepth(case, control, control.cov)
Calls genotypes from a set of tallies (such as a VRanges
or VCF
file) and the coverage (currently as a BigWigFile
). We call the
genotype with the highest likelihood, where the likelihood is based on
a binomial model of the variant frequency.
## S4 method for signature 'VRanges' callGenotypes(variants, cov, param = CallGenotypesParam(variants), BPPARAM = defaultBPPARAM()) ## S4 method for signature 'TabixFile' callGenotypes(variants, cov, param = CallGenotypesParam(variants), BPPARAM = defaultBPPARAM()) CallGenotypesParam(genome, gq.breaks = c(0, 5, 20, 60, Inf), p.error = 0.05, which = tileGenome(seqinfo(genome), ntile=ntile), ntile = 100L)
## S4 method for signature 'VRanges' callGenotypes(variants, cov, param = CallGenotypesParam(variants), BPPARAM = defaultBPPARAM()) ## S4 method for signature 'TabixFile' callGenotypes(variants, cov, param = CallGenotypesParam(variants), BPPARAM = defaultBPPARAM()) CallGenotypesParam(genome, gq.breaks = c(0, 5, 20, 60, Inf), p.error = 0.05, which = tileGenome(seqinfo(genome), ntile=ntile), ntile = 100L)
variants |
Either |
cov |
The coverage, as an RleList or a |
param |
Parameters controlling the genotyping, constructed by
|
genome |
An object with a |
gq.breaks |
A numeric vector representing an increasing sequence of genotype quality breaks to segment the wildtype runs. |
p.error |
The binomial probability for an error. This is used to calculate the expected frequency of hom-ref and hom-alt variants. |
which |
A |
ntile |
When |
BPPARAM |
A |
In general, the behavior is very similar to that of the GATK
UnifiedGenotyper (see references). For every position in the tallies,
we compute a binomial likelihood for each of wildtype (0/0), het (0/1)
and hom-alt (1/1), assuming the alt allele frequency to be
p.error
, 0.5
and 1 - p.error
, respectively. The
genotype with the maximum likelihood is chosen as the genotype, and
the genotype quality is computed by taking the fraction of the maximum
likelihood over the sum of the three likelihoods.
We assume that any position not present in the input tallies is
wildtype (0/0) and compute the quality for every such position, using
the provided coverage data. For scalability reasons, we segment runs
of these positions according to user-specified breaks on the genotype
quality. The segments become special records in the returned
VRanges
, where the range represents the segment, the ref
is the first reference base, alt
is <NON_REF>
and the
totalDepth
is the mean of the coverage.
The genotype information is recorded as metadata columns named according to gVCF conventions:
GT
The genotype call string: 0/0
, 0/1
,
1/1
.
GQ
The numeric genotype quality, phred scaled. For wildtype runs, this is minimum over the run.
PL
A 3 column matrix with the likelihood for each genotype, phred scaled. We take the minimum over wildtype runs.
MIN_DP
The minimum coverage over a wildtype
run; NA
for single positions.
For callGenotypes
, a VRanges
annotated with the genotype
call information, as described in the details.
Michael Lawrence
The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA, 2010 GENOME RESEARCH 20:1297-303.
bams <- LungCancerLines::LungCancerBamFiles() data(vignette) tallies <- tallies_H1993 sampleNames(tallies) <- "H1993" mcols(tallies) <- NULL cov <- coverage(bams$H1993) ## simple usage ## (need gmapR to find the genome in the GMAP database, otherwise, ## provide sequence directly as shown later) if (requireNamespace("gmapR", quietly=TRUE)) { genotypes <- callGenotypes(tallies, cov, BPPARAM=BiocParallel::SerialParam()) } ## customize params <- CallGenotypesParam(genome_p53, p.error = 1/1000) genotypes <- callGenotypes(tallies, cov, params) ## write to gVCF writeVcf(genotypes, tempfile("genotypes", fileext="vcf"), index=TRUE)
bams <- LungCancerLines::LungCancerBamFiles() data(vignette) tallies <- tallies_H1993 sampleNames(tallies) <- "H1993" mcols(tallies) <- NULL cov <- coverage(bams$H1993) ## simple usage ## (need gmapR to find the genome in the GMAP database, otherwise, ## provide sequence directly as shown later) if (requireNamespace("gmapR", quietly=TRUE)) { genotypes <- callGenotypes(tallies, cov, BPPARAM=BiocParallel::SerialParam()) } ## customize params <- CallGenotypesParam(genome_p53, p.error = 1/1000) genotypes <- callGenotypes(tallies, cov, params) ## write to gVCF writeVcf(genotypes, tempfile("genotypes", fileext="vcf"), index=TRUE)
Calls sample-specific variants by comparing case and control variants from paired samples, starting from the BAM files or unfiltered tallies. For example, these variants would be considered somatic mutations in a tumor vs. normal comparison.
SampleSpecificVariantFilters(control, control.cov, calling.filters, power = 0.8, p.value = 0.01) ## S4 method for signature 'BamFile,BamFile' callSampleSpecificVariants(case, control, tally.param, calling.filters = VariantCallingFilters(), post.filters = FilterRules(), ...) ## S4 method for signature 'character,character' callSampleSpecificVariants(case, control, ...) ## S4 method for signature 'VRanges,VRanges' callSampleSpecificVariants(case, control, control.cov, ...) ## DEPRECATED ## S4 method for signature 'GenomicRanges,GenomicRanges' callSampleSpecificVariants(case, control, control.cov, calling.filters = VariantCallingFilters(), post.filters = FilterRules(), ...)
SampleSpecificVariantFilters(control, control.cov, calling.filters, power = 0.8, p.value = 0.01) ## S4 method for signature 'BamFile,BamFile' callSampleSpecificVariants(case, control, tally.param, calling.filters = VariantCallingFilters(), post.filters = FilterRules(), ...) ## S4 method for signature 'character,character' callSampleSpecificVariants(case, control, ...) ## S4 method for signature 'VRanges,VRanges' callSampleSpecificVariants(case, control, control.cov, ...) ## DEPRECATED ## S4 method for signature 'GenomicRanges,GenomicRanges' callSampleSpecificVariants(case, control, control.cov, calling.filters = VariantCallingFilters(), post.filters = FilterRules(), ...)
case |
The BAM file for the case, or the called variants as output by
|
control |
The BAM file for the control, or the raw tallies as output by
|
tally.param |
Parameters controlling the variant tallying step,
as typically constructed by |
calling.filters |
Filters to use for the initial,
single-sample calling against reference, typically constructed by
|
post.filters |
Filters that are applied after the initial calling step. These
consider the set of variant calls as a whole and remove those with
suspicious patterns. They are only applied to the |
... |
For a BAM file, arguments to pass down to the |
control.cov |
The coverage for the control sample. |
power |
The power cutoff, beneath which a variant will not be called case-specific, due to lack of power in control. |
p.value |
The binomial p-value cutoff for determining whether the control frequency is sufficiently extreme (low) compared to the case frequency. A p-value below this cutoff means that the variant will be called case-specific. |
For each sample, the variants are tallied (when the input is BAM), QA
filtered (case only), called and determined to be sample-specific.
The callSampleSpecificVariants
function is fairly high-level,
but it still allows the user to override the parameters and filters
for each stage of the process. See TallyVariantsParam
,
VariantQAFilters
, VariantCallingFilters
and SampleSpecificVariantFilters
.
It is safest to pass a BAM file, so that the computations are
consistent for both samples. The GenomicRanges
method is
provided mostly for optimization purposes, since tallying the variants
over the entire genome is time-consuming. For small gene-size regions,
performance should not be a concern.
This is the algorithm that determines whether a variant is specific to the case sample:
Filter out all case calls that were also called in
control. The callSampleSpecificVariants
function does
not apply the QA filters when calling variants in
control. This prevents a variant from being called specific to
case merely due to questionable data in the control.
For the remaining case calls, calculate whether there was
sufficient power in control under the likelihood ratio test, for a
variant present at the p.lower
frequency. If that is below
the power
cutoff, discard it.
For the remaining case calls, test whether the control
frequency is sufficient extreme (low) compared to the case
frequency, under the binomial model. The null hypothesis is that
the frequencies are the same, so if the test p-value is above
p.value
, discard the variant. Otherwise, the variant is
called case-specific.
A VRanges
with the case-specific variants (such as
somatic mutations).
Michael Lawrence, Jeremiah Degenhardt
bams <- LungCancerLines::LungCancerBamFiles() if (requireNamespace("gmapR", quietly=TRUE)) { tally.param <- TallyVariantsParam(gmapR::TP53Genome(), high_base_quality = 23L, which = gmapR::TP53Which()) callSampleSpecificVariants(bams$H1993, bams$H2073, tally.param) } else { data(vignette) calling.filters <- VariantCallingFilters(read.count = 3L) called.variants <- callVariants(tallies_H1993, calling.filters) callSampleSpecificVariants(called.variants, tallies_H2073, coverage_H2073) }
bams <- LungCancerLines::LungCancerBamFiles() if (requireNamespace("gmapR", quietly=TRUE)) { tally.param <- TallyVariantsParam(gmapR::TP53Genome(), high_base_quality = 23L, which = gmapR::TP53Which()) callSampleSpecificVariants(bams$H1993, bams$H2073, tally.param) } else { data(vignette) calling.filters <- VariantCallingFilters(read.count = 3L) called.variants <- callVariants(tallies_H1993, calling.filters) callSampleSpecificVariants(called.variants, tallies_H2073, coverage_H2073) }
Calls variants from either a BAM file or a VRanges
object. The variants are called using a binomial likelihood
ratio test. Those calls are then subjected to a post-filtering step.
## S4 method for signature 'BamFile' callVariants(x, tally.param, calling.filters = VariantCallingFilters(...), post.filters = FilterRules(), ...) ## S4 method for signature 'character' callVariants(x, ...) ## S4 method for signature 'VRanges' callVariants(x, calling.filters = VariantCallingFilters(...), post.filters = FilterRules(), ...) VariantCallingFilters(read.count = 2L, p.lower = 0.2, p.error = 1/1000)
## S4 method for signature 'BamFile' callVariants(x, tally.param, calling.filters = VariantCallingFilters(...), post.filters = FilterRules(), ...) ## S4 method for signature 'character' callVariants(x, ...) ## S4 method for signature 'VRanges' callVariants(x, calling.filters = VariantCallingFilters(...), post.filters = FilterRules(), ...) VariantCallingFilters(read.count = 2L, p.lower = 0.2, p.error = 1/1000)
x |
Either a path to an indexed bam, a |
tally.param |
Parameters controlling the variant tallying step,
as typically constructed by |
calling.filters |
Filters used in the calling step, typically constructed with
|
post.filters |
Filters that are applied after the initial calling step. These consider the set of variant calls as a whole and remove those with suspicious patterns. |
...
Arguments for VariantCallingFilters
, listed below.
read.count |
Require at least this many high quality reads with the alternate base. The default value is designed to catch sequencing errors where coverage is too low to rely on the LRT. Increasing this value has a significant negative impact on power. |
p.lower |
The lower bound on the binomial probability for a true variant. |
p.error |
The binomial probability for a sequencing error (default is reasonable for Illumina data with the default quality cutoff). |
... |
Arguments to pass to |
There are two steps for calling
variants: the actual statistical test that decides whether a variant
exists in the data, and a post-filtering step. By default, the initial
calling is based on a binomial likelihood ratio test
(P(D|p=p.lower
) / P(D|p=p.error
) > 1). The test amounts
to excluding putative variants with less than ~4% alt frequency. A
variant is also required to be represented by at least 2 alt
reads. The post-filtering stage considers the set of variant calls as
a whole and removes variants with suspicious patterns. Currently,
there is a single post-filter, disabled by default, that removes
variants that are clumped together on the chromosome (see the
max.nbor.count
parameter).
For callVariants
, a VRanges
of the called variants (the
tallies that pass the calling filters). See the documentation
of bam_tally
for complete details.
For VariantCallingFilters
, a FilterRules
object with the filters for calling the variants.
Michael Lawrence, Jeremiah Degenhardt
bams <- LungCancerLines::LungCancerBamFiles() if (requireNamespace("gmapR")) { tally.param <- TallyVariantsParam(gmapR::TP53Genome(), high_base_quality = 23L, which = gmapR::TP53Which()) ## simple usage variants <- callVariants(bams$H1993, tally.param) } ## customize data(vignette) calling.filters <- VariantCallingFilters(p.error = 1/1000) callVariants(tallies_H1993, calling.filters)
bams <- LungCancerLines::LungCancerBamFiles() if (requireNamespace("gmapR")) { tally.param <- TallyVariantsParam(gmapR::TP53Genome(), high_base_quality = 23L, which = gmapR::TP53Which()) ## simple usage variants <- callVariants(bams$H1993, tally.param) } ## customize data(vignette) calling.filters <- VariantCallingFilters(p.error = 1/1000) callVariants(tallies_H1993, calling.filters)
Decides whether a position is variant, wildtype, or uncallable, according to the estimated power of the given calling filters.
callWildtype(reads, variants, calling.filters, pos = NULL, ...) minCallableCoverage(calling.filters, power = 0.80, max.coverage = 1000L)
callWildtype(reads, variants, calling.filters, pos = NULL, ...) minCallableCoverage(calling.filters, power = 0.80, max.coverage = 1000L)
reads |
The read alignments, i.e., a path to a BAM file, or the coverage,
including a |
variants |
The called variants, a tally |
calling.filters |
Filters used to call the variants. |
pos |
A |
power |
The chance of detecting a variant if one is there. |
max.coverage |
The max coverage to be considered for the minimum (should not need to be tweaked). |
... |
Arguments to pass down to |
For each position (in the genome, or as specified by pos
), the
coverage is compared against the return value of
minCallableCoverage
. If the coverage is above the callable
minimum, the position is called, either as a variant (if it is in
variants
) or wildtype. Otherwise, it is considered a no-call.
The minCallableCoverage
function
expects and only considers the filters returned by
VariantCallingFilters
.
A logical vector (or logical RleList
if pos
is
NULL
), that is TRUE
for wildtype, FALSE
for variant,
NA
for no-call.
Michael Lawrence
bams <- LungCancerLines::LungCancerBamFiles() bam <- bams$H1993 data(vignette) called.variants <- callVariants(tallies_H1993) pos <- c(called.variants, shift(called.variants, 3)) wildtype <- callWildtype(bam, called.variants, VariantCallingFilters(), pos = pos, power = 0.85)
bams <- LungCancerLines::LungCancerBamFiles() bam <- bams$H1993 data(vignette) called.variants <- callVariants(tallies_H1993) pos <- c(called.variants, shift(called.variants, 3)) wildtype <- callWildtype(bam, called.variants, VariantCallingFilters(), pos = pos, power = 0.85)
Functions for calculating concordance between variant sets and deciding whether two samples have identical genomes.
calculateVariantConcordance(gr1, gr2, which = NULL) calculateConcordanceMatrix(variantFiles, ...) callVariantConcordance(concordanceMatrix, threshold)
calculateVariantConcordance(gr1, gr2, which = NULL) calculateConcordanceMatrix(variantFiles, ...) callVariantConcordance(concordanceMatrix, threshold)
gr1 , gr2
|
The two tally |
which |
A |
variantFiles |
Character vector of paths to files representing tally
|
concordanceMatrix |
A matrix of concordance fractions between sample pairs, as returend by
|
threshold |
The concordance fraction above which edges are generated between samples when forming the graph. |
... |
Arguments to pass to the loading function, e.g., |
The calculateVariantConcordance
calculates the fraction of
concordant variants between two samples. Concordance is defined as
having the same position and alt allele.
The calculateConcordanceMatrix
function generates a numeric
matrix with the concordance for each pair of samples. It accepts paths
to serialized objects so that all variant calls are not loaded in
memory at once. This probably should support VCF files, eventually.
The callVariantConcordance
function generates a
concordant/non-concordant/undecidable status for each sample (that are
assumed to originate from the same individual), given the output of
calculateConcordanceMatrix
. The status is decided as follows. A
graph is formed from the concordance matrix using threshold
to
generate the edges. If there are multiple cliques in the graph that
each have more than one sample, every sample is declared
undecidable. Otherwise, the samples in the clique with more than one
sample, if any, are marked as concordant, and the others (in singleton
cliques) are marked as discordant.
Fraction of concordant variants for calculateVariantConcordance
, a
numeric matrix of concordances for calculateConcordanceMatrix
,
or a character vector of status codes, named by sample, for
callVariantConcordance
.
Cory Barr (code), Michael Lawrence (inferred documentation)
Gets values from an RleList
corresponding to positions (width 1
ranges) in a GRanges
(or VRanges
). The result is a
simple atomic vector.
extractCoverageForPositions(cov, pos)
extractCoverageForPositions(cov, pos)
cov |
An |
pos |
A |
Atomic vector with one value from cov
per position in pos
.
Michael Lawrence
These functions construct filters (implemented as functions) suitable
for collection into FilterRules
objects, which are then used to
filter variant calls. See examples.
SetdiffVariantsFilter(other) MinTotalDepthFilter(min.depth = 10L) MaxControlFreqFilter(control, control.cov, max.control.freq = 0.03) DepthFETFilter(control, control.cov, p.value.cutoff = 0.05)
SetdiffVariantsFilter(other) MinTotalDepthFilter(min.depth = 10L) MaxControlFreqFilter(control, control.cov, max.control.freq = 0.03) DepthFETFilter(control, control.cov, p.value.cutoff = 0.05)
other |
The set of variants (as a |
min.depth |
The minimum depth for a variant to pass. |
control |
The control set of variants (as a |
control.cov |
The coverage (as an |
max.control.freq |
The maximum alt frequency allowed in the control for a variant to be considered case-specific. |
p.value.cutoff |
Passing variants must have a p-value below this value. |
In all cases, a closure that returns a logical vector indicating which elements of its argument should be retained.
Michael Lawrence
There are some convenience functions that construct FilterRules
objects that contain one or more of these filters. Examples are
VariantQAFilters
and
VariantCallingFilters
.
## Find case-specific variants in a case/control study bams <- LungCancerLines::LungCancerBamFiles() data(vignette) case <- callVariants(tallies_H1993) control <- callVariants(tallies_H2073) control.cov <- coverage(bams$H2073) filters <- FilterRules(list(caseOnly = SetdiffVariantsFilter(control), minTotalDepth = MinTotalDepthFilter(min.depth=10L), maxControlFreq = MaxControlFreqFilter(control, control.cov, max.control.freq=0.03), depthFET = DepthFETFilter(control, control.cov, p.value.cutoff=0.05) )) specific <- subsetByFilter(case, filters)
## Find case-specific variants in a case/control study bams <- LungCancerLines::LungCancerBamFiles() data(vignette) case <- callVariants(tallies_H1993) control <- callVariants(tallies_H2073) control.cov <- coverage(bams$H2073) filters <- FilterRules(list(caseOnly = SetdiffVariantsFilter(control), minTotalDepth = MinTotalDepthFilter(min.depth=10L), maxControlFreq = MaxControlFreqFilter(control, control.cov, max.control.freq=0.03), depthFET = DepthFETFilter(control, control.cov, p.value.cutoff=0.05) )) specific <- subsetByFilter(case, filters)
These are deprecated functions for operating on the old
variant GRanges. New code should use match
and %in%
.
This function behaves like match
, where two elements match when
they share the same position and “alt” allele.
matchVariants(x, table) x %variant_in% table
matchVariants(x, table) x %variant_in% table
x |
The variants ( |
table |
The variants ( |
For matchVariants
, an integer vector with the matching index in
table
for each variant in x
, or NA
if there is no
match. For %variant_in%
, a logical vector indicating whether
there was such a match.
Michael Lawrence
This is an alternative to tallyVariants
for generating a
VRanges
from a set of alignments (BAM file) by counting the
nucleotides at each position. This function uses the samtools-based
applyPileups
function, instead of
bam_tally
. Fewer dependencies, with fewer
statistics (none beyond the fixed columns) available in the output.
pileupVariants(bams, genome, param = ApplyPileupsParam(), minAltDepth = 1L, baseOnly = TRUE, BPPARAM = defaultBPPARAM())
pileupVariants(bams, genome, param = ApplyPileupsParam(), minAltDepth = 1L, baseOnly = TRUE, BPPARAM = defaultBPPARAM())
bams |
A vector/list of BAM files as interpreted
by |
genome |
An object that provides sequence information via
|
param |
A |
minAltDepth |
Minimal alt depth to be included in the output. The default avoids outputting results for positions/alleles that show no differences. |
baseOnly |
Whether to drop records with “N” in either the ref or alt. |
BPPARAM |
Not yet supported. |
A VRanges object with read depth information for each position, allele, and sample.
Michael Lawrence
tallyVariants
for more statistics.
bams <- LungCancerLines::LungCancerBamFiles() if (requireNamespace("gmapR")) { param <- Rsamtools::ApplyPileupsParam(which=gmapR::TP53Which()) pileup <- pileupVariants(bams, gmapR::TP53Genome(), param) }
bams <- LungCancerLines::LungCancerBamFiles() if (requireNamespace("gmapR")) { param <- Rsamtools::ApplyPileupsParam(which=gmapR::TP53Which()) pileup <- pileupVariants(bams, gmapR::TP53Genome(), param) }
Applies filters to a set of called variants. The only current filter
is a cutoff on the weighted neighbor count of each variant. This
filtering is performed automatically by callVariants
, so
these functions are for when more control is desired.
postFilterVariants(x, post.filters = VariantPostFilters(...), ...) VariantPostFilters(max.nbor.count = 0.1, whitelist = NULL)
postFilterVariants(x, post.filters = VariantPostFilters(...), ...) VariantPostFilters(max.nbor.count = 0.1, whitelist = NULL)
x |
A tally |
post.filters |
The filters applied to the called variants. |
... |
Arguments passed to |
max.nbor.count |
Maximum allowed number of neighbors (weighted by distance) |
whitelist |
Positions to ignore; these will always pass the filter, and are excluded from the neighbor counting. |
The neighbor count is calculated within a 100bp window centered on the variant. Each neighbor is weighted by the inverse square root of the distance to the neighbor. This was motivated by fitting logistic regression models including a term the count (usually 0, 1, 2) at each distance. The inverse square root function best matched the trend in the coefficients.
For postFilterVariants
, a tally GRanges
of the variants that
pass the filters.
For VariantPostFilters
, a FilterRules
object with the filters.
Michael Lawrence and Jeremiah Degenhardt
bams <- LungCancerLines::LungCancerBamFiles() ## post-filters are not enabled by default during calling data(vignette) called.variants <- callVariants(tallies_H1993) ## but can be applied at a later time... postFilterVariants(called.variants, max.nbor.count = 0.15) # or enable during calling called.variants <- callVariants(tallies_H1993, post.filters = VariantPostFilters())
bams <- LungCancerLines::LungCancerBamFiles() ## post-filters are not enabled by default during calling data(vignette) called.variants <- callVariants(tallies_H1993) ## but can be applied at a later time... postFilterVariants(called.variants, max.nbor.count = 0.15) # or enable during calling called.variants <- callVariants(tallies_H1993, post.filters = VariantPostFilters())
Filters a tally GRanges
through a series of simple checks for
strand and read position (read position) biases.
qaVariants(x, qa.filters = VariantQAFilters(...), ...) VariantQAFilters(fisher.strand.p.value = 1e-4, min.mdfne = 10L)
qaVariants(x, qa.filters = VariantQAFilters(...), ...) VariantQAFilters(fisher.strand.p.value = 1e-4, min.mdfne = 10L)
x |
A tally |
qa.filters |
The filters used for the QA process, typically constructed with
|
... |
Arguments passed to |
fisher.strand.p.value |
p-value cutoff for the Fisher's Exact Test for strand bias (+/- counts, alt vs. ref). Any variants with p-values below this cutoff are discarded. |
min.mdfne |
Minimum allowed median distance of alt calls from their nearest end of the read. |
There are currently two QA filters:
Median distance of alt calls from nearest end of the read is
required to be >= min.mdfne
, which defaults to 10.
Fisher's Exact Test for strand bias, using the +/- counts, alt vs. ref. If the null is rejected, the variant is discarded.
For qaVariants
, a tally GRanges
of the variants that
pass the QA checks.
For VariantQAFilters
, a FilterRules
object with the QA and sanity filters.
Michael Lawrence and Jeremiah Degenhardt
data(vignette) qaVariants(tallies_H1993, fisher.strand.p.value = 1e-4)
data(vignette) qaVariants(tallies_H1993, fisher.strand.p.value = 1e-4)
Tallies the bases, qualities and read positions for every genomic
position in a BAM file. By default, this only returns the positions
for which an alternate base has been detected. The typical usage is
to pass a BAM file, the genome, the (fixed) readlen
and (if the
variant calling should consider quality) an appropriate
high_base_quality
cutoff.
Passing a which
argument allows computing on only a
subregion of the genome. which
is a ‘RangesList’ or
something coercible to one that limits the tally to that range or
set of ranges. By default, the entire genome is processed.
For parallel evaluation (see BPPARAM
): Specifically,
which
can be a ‘GenomicRanges’ or a ‘GRangesList’. If
which
is a ‘GenomicRanges’ and has length 1 it is tiled
to create chunks for parallel evaluation. If it is longer
than 1, each range becomes a chunk for parallel evaluation.
If which
is a ‘GRangesList’, each element (i.e. each
‘GenomicRanges’) becomes a chunk. The latter can be useful to
ensure balanced worker load, e.g. in the case of regions covering
multiple sequences(see equisplit
).
## S4 method for signature 'BamFile' tallyVariants(x, param = TallyVariantsParam(...), ..., BPPARAM = defaultBPPARAM()) ## S4 method for signature 'BamFileList' tallyVariants(x, ...) ## S4 method for signature 'character' tallyVariants(x, ...) TallyVariantsParam(genome, read_pos_breaks = NULL, high_base_quality = 0L, minimum_mapq = 13L, variant_strand = 1L, ignore_query_Ns = TRUE, ignore_duplicates = TRUE, mask = GRanges(), keep_extra_stats = TRUE, read_length = NA_integer_, read_pos = !is.null(read_pos_breaks), high_nm_score = NA_integer_, ...)
## S4 method for signature 'BamFile' tallyVariants(x, param = TallyVariantsParam(...), ..., BPPARAM = defaultBPPARAM()) ## S4 method for signature 'BamFileList' tallyVariants(x, ...) ## S4 method for signature 'character' tallyVariants(x, ...) TallyVariantsParam(genome, read_pos_breaks = NULL, high_base_quality = 0L, minimum_mapq = 13L, variant_strand = 1L, ignore_query_Ns = TRUE, ignore_duplicates = TRUE, mask = GRanges(), keep_extra_stats = TRUE, read_length = NA_integer_, read_pos = !is.null(read_pos_breaks), high_nm_score = NA_integer_, ...)
x |
An indexed BAM file, either a path, |
param |
The parameters for the tallying process, as a
|
... |
For |
genome |
The genome, either a |
read_pos_breaks |
The breaks used for tabulating the read positions (read
positions) at each position. If this information is included (not
|
high_base_quality |
The minimum cutoff for whether a base is
counted as high quality. By default, |
minimum_mapq |
Minimum MAPQ of a read for it to be included in
the tallies. This depend on the aligner; the default is reasonable
for |
variant_strand |
On how many strands must an alternate base be detected for a position to be returned. Highly recommended to set this to at least 1 (otherwise, the result is huge and includes many uninteresting reference rows). |
ignore_query_Ns |
Whether to ignore N calls in the
reads. Usually, there is no reason to set this to |
ignore_duplicates |
whether to ignore reads flagged as PCR/optical duplicates |
mask |
A |
read_length |
The expected read length, used for calculating the “median distance from nearest” end statistic. If not specified, an attempt is made to guess the read length from a random sample of the BAM file. If read length is found to be variable, statistics depending on the read length are not calculated. |
read_pos |
Whether to tally read positions, which can be computationally intensive. |
high_nm_score |
If not |
keep_extra_stats |
Whether to keep various summary statistics generated from the tallies; setting this to FALSE will save memory. The extra statistics are most useful for algorithm diagnostics and development. |
BPPARAM |
A
|
For tallyVariants
, the tally GRanges
.
For TallyVariantsParam
, an object with parameters suitable for
variant calling.
The VariantTallyParam
constructor is DEPRECATED.
Michael Lawrence, Jeremiah Degenhardt
if (requireNamespace("gmapR")) { tally.param <- TallyVariantsParam(gmapR::TP53Genome(), high_base_quality = 23L, which = gmapR::TP53Which()) bams <- LungCancerLines::LungCancerBamFiles() raw.variants <- tallyVariants(bams$H1993, tally.param) }
if (requireNamespace("gmapR")) { tally.param <- TallyVariantsParam(gmapR::TP53Genome(), high_base_quality = 23L, which = gmapR::TP53Which()) bams <- LungCancerLines::LungCancerBamFiles() raw.variants <- tallyVariants(bams$H1993, tally.param) }
The deprecated way to create a
VCF
object from a variant/tally
GRanges
. This can then be output to a file using
writeVcf
. The flavor of VCF is
specific for calling variants, not genotypes; see below.
variantGR2Vcf(x, sample.id, project = NULL, genome = unique(GenomicRanges::genome(x)))
variantGR2Vcf(x, sample.id, project = NULL, genome = unique(GenomicRanges::genome(x)))
x |
The variant/tally |
sample.id |
Unique ID for the sample in the VCF. |
project |
Description of the project/experiment; will be included in the VCF header. |
genome |
|
A variant GRanges
has an element for every unique combination
of position and alternate base. A VCF
object, like the file
format, has a row for every position, with multiple alternate alleles
collapsed within the row. This is the fundamental difference between
the two data structures. We feel that the GRanges
is easier to
manipulate for filtering tasks, while VCF
is obviously
necessary for communication with external databases and tools.
Normally, despite its name, VCF is used for communicating genotype calls. We are calling variants, not genotypes, so we have extended the format accordingly.
Here is the mapping in detail:
The rowRanges
is formed by dropping the metadata columns
from the GRanges
.
The colData
consists of a single column,
“Samples”, with a single row, set to 1 and named
sample.id
.
The exptData
has an element “header” with element
“reference” set to the seqlevels(x)
and element
“samples” set to sample.id
. This will also include the
necessary metadata for describing our extensions to the format.
The fixed
table has the “REF” and “ALT”
alleles, with “QUAL” and “FILTER” set to NA
.
The geno
list has six matrix elements, all with a
single column. The first is the mandatory “GT” element, the
genotype, which we set to NA
. Then there is “AD”
(list matrix with the read count for each REF and ALT),
“DP” (integer matrix with the total read count), and
“AP” (list matrix of 0/1 flags for whether whether REF
and/or ALT was present in the data).
A VCF
object.
This function is DEPRECATED. The callVariants
function
now returns a VRanges
object that can
be coerced to a VCF
object via as(x, "VCF")
.
Michael Lawrence, Jeremiah Degenhardt
## Not run: vcf <- variantGR2Vcf(variants, "H1993", "example") writeVcf(vcf, "H1993.vcf", index = TRUE) ## End(Not run)
## Not run: vcf <- variantGR2Vcf(variants, "H1993", "example") writeVcf(vcf, "H1993.vcf", index = TRUE) ## End(Not run)
Precomputed data for use in the vignette, mostly for the sake of Windows, where gmapR and its tallying functionality are unsupported.
data(vignette)
data(vignette)
The following objects are included:
Tallies for the two samples.
Coverage for the two samples.
A GRanges of the p53 exons
DNAStringSet with the genome sequence of the p53 region
The following demonstrates how we created these objects:
bams <- LungCancerLines::LungCancerBamFiles() tally.param <- TallyVariantsParam(gmapR::TP53Genome(), high_base_quality = 23L, which = range(p53) + 5e4, indels = TRUE, read_length = 75L) tallies_H1993 <- tallyVariants(bams$H1993, tally.param) tallies_H2073 <- tallyVariants(bams$H2073, tally.param) coverage_H1993 <- coverage(bams$H1993) coverage_H2073 <- coverage(bams$H2073) genome_p53 <- DNAStringSet(getSeq(gmapR::TP53Genome())) p53 <- gmapR:::exonsOnTP53Genome("TP53")
Computed from the data in the LungCancerLines package.
data(vignette)
data(vignette)