Title: | An R interface to the GMAP/GSNAP/GSTRUCT suite |
---|---|
Description: | GSNAP and GMAP are a pair of tools to align short-read data written by Tom Wu. This package provides convenience methods to work with GMAP and GSNAP from within R. In addition, it provides methods to tally alignment results on a per-nucleotide basis using the bam_tally tool. |
Authors: | Cory Barr, Thomas Wu, Michael Lawrence |
Maintainer: | Michael Lawrence <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.49.0 |
Built: | 2024-10-30 08:16:20 UTC |
Source: | https://github.com/bioc/gmapR |
Given a set of alignments, for each position in the genome output counts for the reference allele and all alternate alleles. Often used as a precursor to detecting variants. Indels will be supported soon.
## S4 method for signature 'BamFile' bam_tally(x, param, ...) ## S4 method for signature 'character' bam_tally(x, param, ...) variantSummary(x, read_pos_breaks = NULL, keep_ref_rows = FALSE, read_length = NA_integer_, high_nm_score = NA_integer_)
## S4 method for signature 'BamFile' bam_tally(x, param, ...) ## S4 method for signature 'character' bam_tally(x, param, ...) variantSummary(x, read_pos_breaks = NULL, keep_ref_rows = FALSE, read_length = NA_integer_, high_nm_score = NA_integer_)
x |
a |
param |
The |
read_pos_breaks |
The breaks, like those passed to |
keep_ref_rows |
Whether to keep the rows describing only the reference calls, i.e., where ref and alt are the same. These are useful when one needs the reference counts even when there are no alts at that position. |
read_length |
The expected read length. If the read length is NA, the MDFNE (median distance from nearest end) statistic will NOT be calculated. |
high_nm_score |
The value at which an NM value is considered high. |
... |
Arguments that override settings in |
The bam_tally
function returns an opaque pointer to a C-level
data structure with the class “TallyIIT”. Currently, the only
operation applicable to this object is variantSummary
.
The variantSummary
function returns
a VRanges
, with a range for each position
that passed the filters. The depth columns correspond to the counts
after quality filtering (except for indels, for which there is no
quality filtering). The following elementMetadata
columns are also present:
n.read.pos |
The number of unique read positions for the alt allele. |
n.read.pos.ref |
The number of unique read positions for the ref allele. |
raw.count.total |
The total number of reads at that position, including reference and all alternates. |
count.plus |
The number of positive strand reads for the alternate
allele, |
count.plus.ref |
The number of positive strand reads for the reference allele. |
count.minus |
The number of negative strand reads for the alternate
allele, |
count.minus.ref |
The number of negative strand reads for the reference allele. |
count.del.plus |
The plus strand deletion count over the position. |
count.del.minus |
The minus strand deletion count over the position. |
read.pos.mean |
Mean read position for the alt allele. |
read.pos.mean.ref |
Mean read position for the ref allele. |
read.pos.var |
Variance in the read positions for the alt allele. |
read.pos.var.ref |
Variance in the read positions for the ref allele. |
mdfne |
Median distance from nearest end for the alt allele. |
mdfne.ref |
Median distance from nearest end for the ref allele. |
count.high.nm |
The number of alt reads with an NM value at or above the
|
count.high.nm.ref |
The number of ref reads with an NM value at
or above the |
If codon counting was enabled, there will be a column giving the codon
strand: codon.strand
.
If the xs
parameter was TRUE
, there will be four
additional columns giving the counts by aligner-determined
strand: count.xs.plus
, count.xs.plus.ref
,
count.xs.minus
, and count.xs.minus.ref
.
An additional column is present for each bin formed by
the read_pos_breaks
parameter, with the read count for that bin.
Michael Lawrence
tallyVariants
in the VariantTools package provides a
high-level wrapper for this functionality.
"BamTallyParam"
A BamTallyParam
object stores parameters for
bam_tally
. The function of the same name serves as its
constructor.
BamTallyParam(genome, which = GRanges(), desired_read_group = NULL, minimum_mapq = 0L, concordant_only = FALSE, unique_only = FALSE, primary_only = FALSE, ignore_duplicates = FALSE, min_depth = 0L, variant_strand = 0L, variant_pct = 0, ignore_query_Ns = FALSE, indels = FALSE, min_softclip = 0L, max_softclip = 0L, exon_iit = NULL, IIT_BPPARAM = NULL, xs = FALSE, read_pos = FALSE, min_base_quality = 0L, noncovered = FALSE, nm = FALSE)
BamTallyParam(genome, which = GRanges(), desired_read_group = NULL, minimum_mapq = 0L, concordant_only = FALSE, unique_only = FALSE, primary_only = FALSE, ignore_duplicates = FALSE, min_depth = 0L, variant_strand = 0L, variant_pct = 0, ignore_query_Ns = FALSE, indels = FALSE, min_softclip = 0L, max_softclip = 0L, exon_iit = NULL, IIT_BPPARAM = NULL, xs = FALSE, read_pos = FALSE, min_base_quality = 0L, noncovered = FALSE, nm = FALSE)
genome |
A |
which |
A |
desired_read_group |
The name of the read group to which to limit the tallying; if not NULL, must be a single, non-NA string. |
minimum_mapq |
Minimum mapping quality for a read to be counted at all. |
concordant_only |
Consider only what gnsap calls “concordant” alignments. |
unique_only |
Consider only the uniquly mapped reads. |
primary_only |
Consider only primary pairs. |
ignore_duplicates |
Whether to ignore the reads flagged as PCR/optical duplicates. |
min_depth |
The minimum number of reads overlapping a position for it to be counted. |
variant_strand |
The number of strands on which a variant must be seen for it to be counted. This means that a value of 0 will report reference alleles in addition to variants. A value of 1 will report only positions where a variant was seen on at least one strand, and 2 requires the variant be seen on both strands. Setting this to 1 is a good way to save resources. |
variant_pct |
The minimum alternate allele fraction for a variant to be reported for a strand. |
ignore_query_Ns |
Whether to ignore the N base pairs when counting. Can save a lot of resources when processing low quality data. |
indels |
Whether to return indel counts. The |
min_softclip , max_softclip
|
Minimum and maximum length of soft clips that are considered for counting. Soft-clipping is often useful (for GSNAP at least) during alignment, and it should be preserved in the output. However, soft clipping can preferentially occur in regions of discordance with the reference, and if those clipped regions are ignored during counting, the allele fraction is misestimated. |
exon_iit |
An object which indicates the exons to be used for
tallying codons (a character value indicating an existing .iit file, a
|
IIT_BPPARAM |
A |
xs |
Whether to tabulate reads by XS tag, the aligner's best guess about the strand of transcription. |
read_pos |
Whether to tabulate by read position. |
min_base_quality |
Minimum base quality cutoff. Calls of lower quality are not counted, except in the total raw depth. |
noncovered |
Whether to report zero tallies, where there is no coverage. |
nm |
Whether to tally by NM tag, the number of mismatches for a read. |
Call the GMAP cmetindex
command to build an index suitable
for alignment of bisulfite-treated DNA, by allowing for C->T and G->A
differences.
cmetindex(db, use_snps = NULL)
cmetindex(db, use_snps = NULL)
db |
The GmapGenome object |
use_snps |
A GmapSnps object for generating a SNP-tolerant index |
Michael Lawrence
## Not run: library(BSgenome.Dmelanogaster.UCSC.dm3) flyGG <- GmapGenome(Dmelanogaster, create = TRUE) cmetindex(flyGG) ## End(Not run)
## Not run: library(BSgenome.Dmelanogaster.UCSC.dm3) flyGG <- GmapGenome(Dmelanogaster, create = TRUE) cmetindex(flyGG) ## End(Not run)
Many objects in gmapR represent data stored on disk. The
directory
accessor will return this directory.
directory(x)
directory(x)
x |
A GmapGenome or GmapSnps object |
a character vector
Construct the IIT (interval index tree) needed from the GMAP suite of tools to run from a genome file. IIT files are an oligomer index and what allow GMAP and GSNAP to efficiently lookup interval information for fast genomic mapping. Fast and SNP-tolerant detection of complex variants and splicing in short reads offers an depth explication of IIT files and their use in GMAP and GSNAP.
d |
genome name |
D |
destination directory for installation (defaults to gmapdb directory specified at configure time |
k |
k-mer value for genomic index (allowed: 12..15, default 14) |
S |
do not order chromosomes in numeric/alphabetic order, but use order in FASTA file(s) |
g |
files are gzipped, so need to gunzip each file first |
signature(x = "ANY", genome = "GmapGenome")
signature(x = "character", genome = "GmapGenome")
signature(x = "DNAStringSet", genome = "GmapGenome")
## Not run: flyGG <- GmapGenome(genome = "dm3", directory = ggd) gmap_build(x=Dmelanogaster, genome=flyGG) ## End(Not run)
## Not run: flyGG <- GmapGenome(genome = "dm3", directory = ggd) gmap_build(x=Dmelanogaster, genome=flyGG) ## End(Not run)
"GmapGenome"
The GmapGenome class represents a genome that has been indexed for use
with the GMAP suite of tools. It is typically used as a parameter to
the functions gsnap
and bam_tally
. This class
also provides the means to index new genomes, from either a FASTA file
or a BSgenome
object. Genome indexes are typically stored in a
centralized directory on the file system and are identified by a
string key.
GmapGenome(genome, directory =
GmapGenomeDirectory(create = create), name = genomeName(genome),
create = FALSE, ...)
:
Creates a GmapGenome
corresponding to the genome
argument, which may be either a string identifier of the genome
within directory
, a FastaFile
or
DNAStringSet
of the genome sequence, or
a BSgenome
object.
The genome index is stored in directory
argument, which may
be either a GmapGenomeDirectory
object, or a
string path.
The name
argument is the actual key used for storing the
genome index within directory
. If genome
is a
string, it is taken as the key. If a FastaFile
, it is the
basename of the file without the extension. If a BSgenome
,
it is the providerVersion
. Otherwise, the name
must
be specified. If create
is TRUE
, the genome index is
created if one with that name does not already exist. This
obviously only works if genome
actually contains the genome
sequence.
The first example below gives the typical and recommended usage when implementing a reproducible analysis.
getSeq(x, which = seqinfo(x))
: Extracts the genomic
sequence for each region in which
(something coercible to
GRanges
). The result is a character vector for now. This is
implemented in C and is very efficient. The default for which
will retrieve the entire genome.
as(object, "DNAStringSet")
: Extracts the entire
sequence of the genome as a DNAStringSet
. One consequence is
that this comes possible with rtracklayer:
export(object, "genome.fasta")
.
path(object)
: returns the path to the directory
containing the genome index files.
directory(x)
: returns the GmapGenomeDirectory
that is the parent of the directory containing the index files for
this genome.
genome(x)
: gets the name of this genome.
seqinfo(x)
: gets the Seqinfo
for this genome; only sequence names and lengths are available.
Michael Lawrence
## Not run: library(BSgenome.Dmelanogaster.UCSC.dm3) flyGG <- GmapGenome(Dmelanogaster, create = TRUE) ## access system-wide genome using a key flyGG <- GmapGenome(genome = "dm3") which <- seqinfo(flyGG)["chr4"] firstchr <- getSeq(flyGG, which) genome(which) <- "hg19" ## should throw an error try(getSeq(flyGG, which)) ##create a GmapGenome from a FASTA file fa <- system.file("extdata/hg19.p53.fasta", package="gmapR") fastaFile <- rtracklayer::FastaFile(fa) gmapGenome <- GmapGenome(fastaFile, create=TRUE) ## End(Not run)
## Not run: library(BSgenome.Dmelanogaster.UCSC.dm3) flyGG <- GmapGenome(Dmelanogaster, create = TRUE) ## access system-wide genome using a key flyGG <- GmapGenome(genome = "dm3") which <- seqinfo(flyGG)["chr4"] firstchr <- getSeq(flyGG, which) genome(which) <- "hg19" ## should throw an error try(getSeq(flyGG, which)) ##create a GmapGenome from a FASTA file fa <- system.file("extdata/hg19.p53.fasta", package="gmapR") fastaFile <- rtracklayer::FastaFile(fa) gmapGenome <- GmapGenome(fastaFile, create=TRUE) ## End(Not run)
"GmapGenomeDirectory"
The GmapGenomeDirectory class stores a path to a directory containing
a one or more genome-specific subdirectories, each represented by a
GmapGenome
. Inside those directories are the
files that the GMAP suite of tools uses for alignment, tallying, and
other operations. This class is typically used to create a
GmapGenome
object. The default directory is
~/.local/share/gmap
, following the freedesktop.org XDG
standard.
GmapGenomeDirectory(path = getDefaultGmapGenomePath(),
create = FALSE)
:
Creates an object pointing to the directory at path
,
creating it if it does not yet exist and create
is TRUE
.
path(object)
: gets the path to the genome directory.
genome(x)
: gets the names of the genomes in the
directory.
Michael Lawrence
gmapGenomePath <- file.path(getwd(), "newGmapGenomeDirectory") gmapGenomeDirectory <- GmapGenomeDirectory(gmapGenomePath, create = TRUE)
gmapGenomePath <- file.path(getwd(), "newGmapGenomeDirectory") gmapGenomeDirectory <- GmapGenomeDirectory(gmapGenomePath, create = TRUE)
"GmapSnpDirectory"
This class represents a directory containig one or more sets of SNPs, each corresponding to a genome. These SNP databases enable SNP-tolerant alignment with GMAP and GSNAP. If the underlying files have not been created, this class provides a means to do so.
signature(x = "GmapSnpDirectory", i = "ANY", j = "ANY")
: ...
signature(x = "GmapSnpDirectory")
: ...
signature(x = "GmapSnpDirectory")
: ...
signature(object = "GmapSnpDirectory")
: ...
Michael Lawrence
"GmapSnps"
This class represents a set of SNPs (single nucleotide polymorphisms) for use with GMAP and GSNAP (typically for SNP-tolerant alignment.)
GmapSnps(snps, directory, name = snps, create = FALSE, ...)
GmapSnps(snps, directory, name = snps, create = FALSE, ...)
snps |
A path to a VCF file |
directory |
The directory to create the IIT files used by GMAP and GSNAP |
name |
If provided, the name to give the database of SNPs. If not
provided, defauts to the |
create |
If the directory provided in the |
... |
Additional arguments to be passed to the SNPs replacement method. |
##TODO: doc these args
Objects can be created by calls of the form GmapSnps(snps,
directory, name, create)
.
name(x)
: returns the name of the GmapSnps
object
directory(x)
: returns the GmapGenomeDirectory
that is the parent of the directory containing the index files for
this GmapSnps
object.
signature(x = "GmapSnps")
: ...
Michael Lawrence
Given a set of alignments, align them to a genome using the GSNAP
algorithm. The GSNAP algorithm contains a number of features making it
a very high quality algorithm for dealing with short reads and those
from RNA-seq data in particular. Via the GsnapParam
class
and the gsnap
function, R users are given complete control over GSNAP.
## S4 method for signature 'character,character_OR_NULL,GsnapParam' gsnap(input_a, input_b, params, output = file.path(getwd(), file_path_sans_ext(basename(input_a), TRUE)), consolidate = TRUE, ...)
## S4 method for signature 'character,character_OR_NULL,GsnapParam' gsnap(input_a, input_b, params, output = file.path(getwd(), file_path_sans_ext(basename(input_a), TRUE)), consolidate = TRUE, ...)
input_a |
A path to the FASTA file containing reads to align against a
|
input_b |
If provided, a path to the FASTA file containing the second set of reads from paired-end sequencing data. |
params |
A |
output |
The output path for the GSNAP alignments. The results
will be saved in |
consolidate |
If GSNAP is run with multiple worker threads, each thread will output its own set of files. If consolidate is set to TRUE, these files will be merged. The default is TRUE. |
... |
Additional arguments to pass to GSNAP not specifically
supported by the |
A GsnapOutput
class.
Michael Lawrence
"GsnapOutput"
A GsnapOutput
object stores locations of data output by the
GSNAP alignment algorithm.
GsnapOutput
objects are created from the gsnap function,
though the function GsnapOutput
can also be used as a constructor.
In the code snippets below, x
is a GsnapOutput object.
as(x, BamFile)
,
as(x, BamFileList)
:
Returns either a BamFile or BamFileList object containing paths to the output of GSNAP.
asBam(x)
:
converts all gsnap SAM files to BAM files and creates the .bai index files.
Michael Lawrence
"GsnapParam"
A GsnapParam
object stores parameters for
gsnap
. The function of the same name serves as its
constructor.
GsnapParam(genome, unique_only = FALSE, molecule = c("RNA", "DNA"), max_mismatches = NULL, suboptimal_levels = 0L, mode = "standard", snps = NULL, npaths = if (unique_only) 1L else 100L, quiet_if_excessive = unique_only, nofails = unique_only, split_output = !unique_only, novelsplicing = FALSE, splicing = NULL, nthreads = 1L, part = NULL, batch = "2", terminal_threshold = if (molecule == "DNA") 1000L else 2L, gmap_mode = if (molecule == "DNA") "none" else "pairsearch,terminal,improve", clip_overlap = FALSE, ...)
GsnapParam(genome, unique_only = FALSE, molecule = c("RNA", "DNA"), max_mismatches = NULL, suboptimal_levels = 0L, mode = "standard", snps = NULL, npaths = if (unique_only) 1L else 100L, quiet_if_excessive = unique_only, nofails = unique_only, split_output = !unique_only, novelsplicing = FALSE, splicing = NULL, nthreads = 1L, part = NULL, batch = "2", terminal_threshold = if (molecule == "DNA") 1000L else 2L, gmap_mode = if (molecule == "DNA") "none" else "pairsearch,terminal,improve", clip_overlap = FALSE, ...)
genome |
A GmapGenome object to align against |
unique_only |
Whether only alignments with a unique match should be output. The default is FALSE. |
molecule |
The type of molecule sequenced; used to determine appropriate parameter defaults. |
max_mismatches |
The maximum number of mismatches to allow per alignment. If NULL, then the value defaults to ((readlength + 2) / 12 - 2)) |
suboptimal_levels |
Report suboptimal hits beyond best hit. The default is 0L. |
mode |
The alignment mode. It can be "standard", "cmet-stranded", "cmet-nonstranded", "atoi-stranded", or "atoi-nonstranded". The default is "standard". |
snps |
If not NULL, then a GmapSnps object. Provided SNPs will not count as mismatches. |
npaths |
The maximum number of paths to print. |
quiet_if_excessive |
If more than maximum number of paths are found, then no alignment from the read will be in the output. |
nofails |
Exclude failed alignments from output |
split_output |
Basename for multiple-file output, separately for nomapping, halfmapping_uniq, halfmapping_mult, unpaired_uniq, unpaired_mult, paired_uniq, paired_mult, concordant_uniq, and concordant_mult results (up to 9 files, or 10 if –fails-as-input is selected, or 3 for single-end reads) |
novelsplicing |
Logical indicating whether to look for novel splicing events. FALSE is the default. |
splicing |
If not NULL, a GmapSplices object. NULL is the default. |
nthreads |
The number of worker threads gsnap should use to align. |
part |
If not NULL, then process only the i-th out of every n sequences e.g., 0/100 or 99/100 (useful for distributing jobs to a computer farm). If NULL, then all sequences are processed. NULL is the default. |
batch |
This argument allows control over gsnap's memory mapping and allocation. The default is mode 2. Mode 0: {offsets=allocate, positions=mmap, genome=mmap}, Mode 1: {offsets=allocate, positions=mmap & preload,genome=mmap}, Mode 2: {offsets=allocate, positions=mmap & preload,genome=mmap & preload}, Mode 3: {offsets=allocate, positions=allocate,genome=mmap & preload}, Mode 4: {offsets=allocate, positions=allocate,genome=allocate} |
... |
Additional parameters for gsnap. See gsnap's full documentation for those available. |
terminal_threshold |
If this number of mismatches is exceeded, GSNAP will attempt to align from one of the sequence, eventually giving up and discarding the rest of the sequence. This is called a “terminal alignment”. By setting this to a high value, we have effectively disabled it for DNA, since terminal alignments were motivated by splicing alignment problems and other special cases. |
gmap_mode |
Specifies the GMAP pipeline executed when GSNAP delegates to GMAP (a Smith-Waterman aligner) in difficult cases. We have disabled this for DNA, since such difficult cases are only anticipated in the context of splicing or complex rearrangements. |
clip_overlap |
Whether to equally clip paired ends that
overlap each other (due to the fragment length being shorter than 2X
the read length). This can be important for getting accurate counts
from |
Internal methods, etc, that need an alias but are not intended for public use, at least not yet.
A GmapGenome
object is required to align reads using the GSNAP or GMAP
algorithms. The makeGmapGenomePackage
function allows users to
save a particular GmapGenome
object in an R package.
makeGmapGenomePackage(gmapGenome, version, maintainer, author, destDir = ".", license = "Artistic-2.0", pkgName)
makeGmapGenomePackage(gmapGenome, version, maintainer, author, destDir = ".", license = "Artistic-2.0", pkgName)
gmapGenome |
A GmapGenome object. |
version |
The version number of this package. |
maintainer |
The maintainer of the package. The string must contain a valid email address. |
author |
The author of the package |
destDir |
The path that the new |
license |
The package's license (and its version) |
pkgName |
The name the package should have. Though free form, names of the form GmapGenome.Organism.Source.Build are recommended. E.g., GmapGenome.Hsapiens.UCSC.hg19 |
Cory Barr
## Not run: library(gmapR) if (!require(BSgenome.Dmelanogaster.UCSC.dm3)) { library(BiocManager) BiocManager::install("BSgenome.Dmelanogaster.UCSC.dm3") library(BSgenome.Dmelanogaster.UCSC.dm3) } gmapGenomePath <- file.path(getwd(), "flyGenome") if (file.exists(gmapGenomePath)) unlink(gmapGenomePath, recursive=TRUE) ggd <- GmapGenomeDirectory(gmapGenomePath, create = TRUE) gmapGenome <- GmapGenome(genome=Dmelanogaster, directory = ggd, name = "dm3", create = TRUE) makeGmapGenomePackage(gmapGenome=gmapGenome, version="0.1.0", maintainer="<[email protected]>", author="Your Name", destDir=".", license="Artistic-2.0", pkgName="GmapGenome.Dmelanogaster.UCSC.dm3") ## End(Not run)
## Not run: library(gmapR) if (!require(BSgenome.Dmelanogaster.UCSC.dm3)) { library(BiocManager) BiocManager::install("BSgenome.Dmelanogaster.UCSC.dm3") library(BSgenome.Dmelanogaster.UCSC.dm3) } gmapGenomePath <- file.path(getwd(), "flyGenome") if (file.exists(gmapGenomePath)) unlink(gmapGenomePath, recursive=TRUE) ggd <- GmapGenomeDirectory(gmapGenomePath, create = TRUE) gmapGenome <- GmapGenome(genome=Dmelanogaster, directory = ggd, name = "dm3", create = TRUE) makeGmapGenomePackage(gmapGenome=gmapGenome, version="0.1.0", maintainer="<[email protected]>", author="Your Name", destDir=".", license="Artistic-2.0", pkgName="GmapGenome.Dmelanogaster.UCSC.dm3") ## End(Not run)
Returns a GmapGenome
object consisting of the
UCSC hg19 sequence centered on the region of the TP53 gene, with 1 Mb
flanking sequence on each side. This is intended as a
test/demonstration genome and can be used, e.g., in conjunction with
the LungCancerLines
data package.
TP53Genome() TP53Which()
TP53Genome() TP53Which()
For TP53Genome
, a GmapGenome
object. If this is the
first time the user has run this function, a side-effect will be the
generation of an on-disk genome index, under the name
“TP53_demo_VERSION” in the default genome directory, where
VERSION
is the version of the TxDb package providing the bounds
of the P53 gene.
For TP53Which
, a GRanges
of the extents of the TP53
gene, translated to the space of TP53Genome
.
Michael Lawrence, Cory Barr
TP53Genome()
TP53Genome()