Title: | Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import |
---|---|
Description: | This package provides an interface to the 'samtools', 'bcftools', and 'tabix' utilities for manipulating SAM (Sequence Alignment / Map), FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files. |
Authors: | Martin Morgan [aut], Hervé Pagès [aut], Valerie Obenchain [aut], Nathaniel Hayden [aut], Busayo Samuel [ctb] (Converted Rsamtools vignette from Sweave to RMarkdown / HTML.), Bioconductor Package Maintainer [cre] |
Maintainer: | Bioconductor Package Maintainer <[email protected]> |
License: | Artistic-2.0 | file LICENSE |
Version: | 2.23.1 |
Built: | 2024-11-27 11:18:52 UTC |
Source: | https://github.com/bioc/Rsamtools |
This package provides facilities for parsing samtools BAM (binary) files representing aligned sequences.
See packageDescription('Rsamtools')
for package details. A
useful starting point is the scanBam
manual page.
This package documents the following classes for purely internal
reasons, see help pages in other packages: bzfile
, fifo
,
gzfile
, pipe
, unz
, url
.
Author: Martin Morgan
Maintainer: Biocore Team c/o BioC user list <[email protected]>
The current source code for samtools and bcftools is from https://github.com/samtools/samtools. Additional material is at http://samtools.sourceforge.net/.
packageDescription('Rsamtools')
packageDescription('Rsamtools')
WARNING: Starting with Bioconductor 3.14, applyPileups
is deprecated
in favor of pileup
.
applyPileups
scans one or more BAM files, returning
position-specific sequence and quality summaries.
applyPileups(files, FUN, ..., param)
applyPileups(files, FUN, ..., param)
files |
A |
FUN |
A function of 1 argument,
|
... |
Additional arguments, passed to methods. |
param |
An instance of the object returned by
|
Regardless of param
values, the algorithm follows samtools by
excluding reads flagged as unmapped, secondary, duplicate, or failing
quality control.
applyPileups
returns a list
equal in length to the
number of times FUN
has been called, with each element
containing the result of FUN
.
ApplyPileupsParam
returns an object describing the parameters.
Martin Morgan
http://samtools.sourceforge.net/
## The examples below are currently broken and have been disabled for now ## Not run: fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) fls <- PileupFiles(c(fl, fl)) calcInfo <- function(x) { ## information at each pile-up position info <- apply(x[["seq"]], 2, function(y) { y <- y[c("A", "C", "G", "T"),,drop=FALSE] y <- y + 1L # continuity cvg <- colSums(y) p <- y / cvg[col(y)] h <- -colSums(p * log(p)) ifelse(cvg == 4L, NA, h) }) list(seqnames=x[["seqnames"]], pos=x[["pos"]], info=info) } which <- GRanges(c("seq1", "seq2"), IRanges(c(1000, 1000), 2000)) param <- ApplyPileupsParam(which=which, what="seq") res <- applyPileups(fls, calcInfo, param=param) str(res) head(res[[1]][["pos"]]) # positions matching param head(res[[1]][["info"]]) # inforamtion in each file ## 'param' as part of 'files' fls1 <- PileupFiles(c(fl, fl), param=param) res1 <- applyPileups(fls1, calcInfo) identical(res, res1) ## yield by position, across ranges param <- ApplyPileupsParam(which=which, yieldSize=500L, yieldBy="position", what="seq") res <- applyPileups(fls, calcInfo, param=param) sapply(res, "[[", "seqnames") ## End(Not run)
## The examples below are currently broken and have been disabled for now ## Not run: fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) fls <- PileupFiles(c(fl, fl)) calcInfo <- function(x) { ## information at each pile-up position info <- apply(x[["seq"]], 2, function(y) { y <- y[c("A", "C", "G", "T"),,drop=FALSE] y <- y + 1L # continuity cvg <- colSums(y) p <- y / cvg[col(y)] h <- -colSums(p * log(p)) ifelse(cvg == 4L, NA, h) }) list(seqnames=x[["seqnames"]], pos=x[["pos"]], info=info) } which <- GRanges(c("seq1", "seq2"), IRanges(c(1000, 1000), 2000)) param <- ApplyPileupsParam(which=which, what="seq") res <- applyPileups(fls, calcInfo, param=param) str(res) head(res[[1]][["pos"]]) # positions matching param head(res[[1]][["info"]]) # inforamtion in each file ## 'param' as part of 'files' fls1 <- PileupFiles(c(fl, fl), param=param) res1 <- applyPileups(fls1, calcInfo) identical(res, res1) ## yield by position, across ranges param <- ApplyPileupsParam(which=which, yieldSize=500L, yieldBy="position", what="seq") res <- applyPileups(fls, calcInfo, param=param) sapply(res, "[[", "seqnames") ## End(Not run)
Use ApplyPileupsParam()
to create a parameter object influencing
what fields and which records are used to calculate pile-ups, and to
influence the values returned.
# Constructor ApplyPileupsParam(flag = scanBamFlag(), minBaseQuality = 13L, minMapQuality = 0L, minDepth = 0L, maxDepth = 250L, yieldSize = 1L, yieldBy = c("range", "position"), yieldAll = FALSE, which = GRanges(), what = c("seq", "qual")) # Accessors plpFlag(object) plpFlag(object) <- value plpMaxDepth(object) plpMaxDepth(object) <- value plpMinBaseQuality(object) plpMinBaseQuality(object) <- value plpMinDepth(object) plpMinDepth(object) <- value plpMinMapQuality(object) plpMinMapQuality(object) <- value plpWhat(object) plpWhat(object) <- value plpWhich(object) plpWhich(object) <- value plpYieldAll(object) plpYieldAll(object) <- value plpYieldBy(object) plpYieldBy(object) <- value plpYieldSize(object) plpYieldSize(object) <- value ## S4 method for signature 'ApplyPileupsParam' show(object)
# Constructor ApplyPileupsParam(flag = scanBamFlag(), minBaseQuality = 13L, minMapQuality = 0L, minDepth = 0L, maxDepth = 250L, yieldSize = 1L, yieldBy = c("range", "position"), yieldAll = FALSE, which = GRanges(), what = c("seq", "qual")) # Accessors plpFlag(object) plpFlag(object) <- value plpMaxDepth(object) plpMaxDepth(object) <- value plpMinBaseQuality(object) plpMinBaseQuality(object) <- value plpMinDepth(object) plpMinDepth(object) <- value plpMinMapQuality(object) plpMinMapQuality(object) <- value plpWhat(object) plpWhat(object) <- value plpWhich(object) plpWhich(object) <- value plpYieldAll(object) plpYieldAll(object) <- value plpYieldBy(object) plpYieldBy(object) <- value plpYieldSize(object) plpYieldSize(object) <- value ## S4 method for signature 'ApplyPileupsParam' show(object)
flag |
An instance of the object returned by
|
minBaseQuality |
The minimum read base quality below which the base is ignored when summarizing pileup information. |
minMapQuality |
The minimum mapping quality below which the entire read is ignored. |
minDepth |
The minimum depth of the pile-up below which the position is ignored. |
maxDepth |
The maximum depth of reads considered at any position; this can be used to limit memory consumption. |
yieldSize |
The number of records to include in each call to
|
yieldBy |
How records are to be counted. By range (in which case
|
yieldAll |
Whether to report all positions
( |
which |
A |
what |
A |
object |
An instace of class |
value |
An instance to be assigned to the corresponding slot of
the |
Objects are created by calls of the form ApplyPileupsParam()
.
Slot interpretation is as described in the ‘Arguments’ section.
flag
Object of class integer
encoding flags
to be kept when they have their '0' (keep0
) or '1'
(keep1
) bit set.
minBaseQuality
An integer(1)
.
minMapQuality
An integer(1)
.
minDepth
An integer(1)
.
maxDepth
An integer(1)
.
yieldSize
An integer(1)
.
yieldBy
An character(1)
.
yieldAll
A logical(1)
.
which
A GRanges
or IntegerRangesList
object.
what
A character()
.
See 'Usage' for details on invocation.
Constructor:
Returns a ApplyPileupsParam
object.
Accessors: get or set corresponding slot values; for setters,
value
is coerced to the type of the corresponding slot.
Returns or sets the named integer
vector of flags; see scanBamFlag
.
Returns or sets an
integer(1)
vector of miminum base qualities.
Returns or sets an
integer(1)
vector of miminum map qualities.
Returns or sets an
integer(1)
vector of miminum pileup depth.
Returns or sets an
integer(1)
vector of the maximum depth to which pileups are
calculated.
Returns or sets an
integer(1)
vector of yield size.
Returns or sets an
character(1)
vector determining how pileups will be
returned.
Returns or sets an
logical(1)
vector indicating whether all positions, or just
those satisfying pileup positions, are to be returned.
Returns or sets the object influencing which locations pileups are calculated over.
Returns or sets the character
vector describing what summaries are returned by pileup.
Methods:
Compactly display the object.
Martin Morgan
example(applyPileups)
example(applyPileups)
Use BamFile()
to create a reference to a BAM file (and
optionally its index). The reference remains open across calls to
methods, avoiding costly index re-loading.
BamFileList()
provides a convenient way of managing a list of
BamFile
instances.
## Constructors BamFile(file, index=file, ..., yieldSize=NA_integer_, obeyQname=FALSE, asMates=FALSE, qnamePrefixEnd=NA, qnameSuffixStart=NA) BamFileList(..., yieldSize=NA_integer_, obeyQname=FALSE, asMates=FALSE, qnamePrefixEnd=NA, qnameSuffixStart=NA) ## Opening / closing ## S3 method for class 'BamFile' open(con, ...) ## S3 method for class 'BamFile' close(con, ...) ## accessors; also path(), index(), yieldSize() ## S4 method for signature 'BamFile' isOpen(con, rw="") ## S4 method for signature 'BamFile' isIncomplete(con) ## S4 method for signature 'BamFile' obeyQname(object, ...) obeyQname(object, ...) <- value ## S4 method for signature 'BamFile' asMates(object, ...) asMates(object, ...) <- value ## S4 method for signature 'BamFile' qnamePrefixEnd(object, ...) qnamePrefixEnd(object, ...) <- value ## S4 method for signature 'BamFile' qnameSuffixStart(object, ...) qnameSuffixStart(object, ...) <- value ## actions ## S4 method for signature 'BamFile' scanBamHeader(files, ..., what=c("targets", "text")) ## S4 method for signature 'BamFile' seqinfo(x) ## S4 method for signature 'BamFileList' seqinfo(x) ## S4 method for signature 'BamFile' filterBam(file, destination, index=file, ..., filter=FilterRules(), indexDestination=TRUE, param=ScanBamParam(what=scanBamWhat())) ## S4 method for signature 'BamFile' indexBam(files, ...) ## S4 method for signature 'BamFile' sortBam(file, destination, ..., byQname=FALSE, maxMemory=512, byTag=NULL, nThreads=1L) ## S4 method for signature 'BamFileList' mergeBam(files, destination, ...) ## reading ## S4 method for signature 'BamFile' scanBam(file, index=file, ..., param=ScanBamParam(what=scanBamWhat())) ## counting ## S4 method for signature 'BamFile' idxstatsBam(file, index=file, ...) ## S4 method for signature 'BamFile' countBam(file, index=file, ..., param=ScanBamParam()) ## S4 method for signature 'BamFileList' countBam(file, index=file, ..., param=ScanBamParam()) ## S4 method for signature 'BamFile' quickBamFlagSummary(file, ..., param=ScanBamParam(), main.groups.only=FALSE)
## Constructors BamFile(file, index=file, ..., yieldSize=NA_integer_, obeyQname=FALSE, asMates=FALSE, qnamePrefixEnd=NA, qnameSuffixStart=NA) BamFileList(..., yieldSize=NA_integer_, obeyQname=FALSE, asMates=FALSE, qnamePrefixEnd=NA, qnameSuffixStart=NA) ## Opening / closing ## S3 method for class 'BamFile' open(con, ...) ## S3 method for class 'BamFile' close(con, ...) ## accessors; also path(), index(), yieldSize() ## S4 method for signature 'BamFile' isOpen(con, rw="") ## S4 method for signature 'BamFile' isIncomplete(con) ## S4 method for signature 'BamFile' obeyQname(object, ...) obeyQname(object, ...) <- value ## S4 method for signature 'BamFile' asMates(object, ...) asMates(object, ...) <- value ## S4 method for signature 'BamFile' qnamePrefixEnd(object, ...) qnamePrefixEnd(object, ...) <- value ## S4 method for signature 'BamFile' qnameSuffixStart(object, ...) qnameSuffixStart(object, ...) <- value ## actions ## S4 method for signature 'BamFile' scanBamHeader(files, ..., what=c("targets", "text")) ## S4 method for signature 'BamFile' seqinfo(x) ## S4 method for signature 'BamFileList' seqinfo(x) ## S4 method for signature 'BamFile' filterBam(file, destination, index=file, ..., filter=FilterRules(), indexDestination=TRUE, param=ScanBamParam(what=scanBamWhat())) ## S4 method for signature 'BamFile' indexBam(files, ...) ## S4 method for signature 'BamFile' sortBam(file, destination, ..., byQname=FALSE, maxMemory=512, byTag=NULL, nThreads=1L) ## S4 method for signature 'BamFileList' mergeBam(files, destination, ...) ## reading ## S4 method for signature 'BamFile' scanBam(file, index=file, ..., param=ScanBamParam(what=scanBamWhat())) ## counting ## S4 method for signature 'BamFile' idxstatsBam(file, index=file, ...) ## S4 method for signature 'BamFile' countBam(file, index=file, ..., param=ScanBamParam()) ## S4 method for signature 'BamFileList' countBam(file, index=file, ..., param=ScanBamParam()) ## S4 method for signature 'BamFile' quickBamFlagSummary(file, ..., param=ScanBamParam(), main.groups.only=FALSE)
... |
Additional arguments. For |
con |
An instance of |
x , object , file , files
|
A character vector of BAM file paths
(for |
index |
character(1); the BAM index file path (for
|
yieldSize |
Number of records to yield each time the file
is read from with |
asMates |
Logical indicating if records should be paired as mates. See ‘Fields’ section for details. |
qnamePrefixEnd |
Single character (or NA) marking the
end of the qname prefix. When specified, all characters prior to
and including the |
qnameSuffixStart |
Single character (or NA) marking the
start of the qname suffix. When specified, all characters following
and including the |
obeyQname |
Logical indicating if the BAM file is sorted
by |
value |
Logical value for setting |
what |
For |
filter |
A |
destination |
character(1) file path to write filtered reads to. |
indexDestination |
logical(1) indicating whether the destination file should also be indexed. |
byQname , maxMemory , byTag , nThreads
|
See |
param |
An optional |
rw |
Mode of file; ignored. |
main.groups.only |
See |
Objects are created by calls of the form BamFile()
.
The BamFile
class inherits fields from the
RsamtoolsFile
class and has fields:
Number of records to yield each time the file is
read from using scanBam
or, when length(bamWhich())
!= 0
, a threshold which yields records in complete ranges whose
sum first exceeds yieldSize
. Setting yieldSize
on a
BamFileList
does not alter existing yield sizes set on the
individual BamFile
instances.
A logical indicating if the records should be
returned as mated pairs. When TRUE
scanBam
attempts
to mate (pair) the records and returns two additional fields
groupid
and mate_status
. groupid
is an integer
vector of unique group ids; mate_status
is a factor with level
mated
for records successfully paired by the algorithm,
ambiguous
for records that are possibly mates but cannot be
assigned unambiguously, or unmated
for reads that did not
have valid mates.
Mate criteria:
Bit 0x40 and 0x80: Segments are a pair of first/last OR neither segment is marked first/last
Bit 0x100: Both segments are secondary OR both not secondary
Bit 0x10 and 0x20: Segments are on opposite strands
mpos match: segment1 mpos matches segment2 pos AND segment2 mpos matches segment1 pos
tid match
Flags, tags and ranges may be specified in the ScanBamParam
for fine tuning of results.
A logical(0) indicating if the file was sorted by
qname. In Bioconductor > 2.12 paired-end files do not need to be
sorted by qname
. Instead set asMates=TRUE
in the
BamFile
when using the readGAlignmentsList
function from the GenomicAlignments package.
BamFileList
inherits additional methods from
RsamtoolsFileList
and SimpleList
.
Opening / closing:
Opens the (local or remote) path
and
index
(if bamIndex
is not character(0)
),
files. Returns a BamFile
instance.
Closes the BamFile
con
; returning
(invisibly) the updated BamFile
. The instance may be
re-opened with open.BamFile
.
Tests whether the BamFile
con
has been
opened for reading.
Tests whether the BamFile
con
is
niether closed nor at the end of the file.
Accessors:
Returns a character(1) vector of BAM path names.
Returns a character(0) or character(1) vector of BAM index path names.
Return or set an integer(1) vector indicating yield size.
Return or set a logical(0) indicating if the file was sorted by qname.
Return or set a logical(0) indicating if the records should be returned as mated pairs.
Methods:
Visit the path in path(file)
, returning
the information contained in the file header; see
scanBamHeader
.
Visit the path in
path(file)
, returning a Seqinfo
,
character, or named integer vector containing information on the
anmes and / or lengths of each sequence. Seqnames are ordered
as they appear in the file.
Visit the path in path(file)
, returning the
result of scanBam
applied to the specified path.
Visit the path(s) in path(file)
, returning
the result of countBam
applied to the specified
path.
Visit the index in index(file)
, quickly
returning a data.frame
with columns seqnames
,
seqlength
, mapped
(number of mapped reads on
seqnames
) and unmapped
(number of unmapped reads).
Visit the path in path(file)
, returning the
result of filterBam
applied to the specified
path. A single file can be filtered to one or several
destinations, as described in filterBam
.
Visit the path in path(file)
, returning
the result of indexBam
applied to the specified
path.
Visit the path in path(file)
, returning the
result of sortBam
applied to the specified path.
Merge several BAM files into a single BAM file. See
mergeBam
for details; additional arguments supported
by mergeBam,character-method
are also available for
BamFileList
.
Compactly display the object.
Martin Morgan and Marc Carlson
The readGAlignments
,
readGAlignmentPairs
,
and readGAlignmentsList
functions defined in the GenomicAlignments package.
summarizeOverlaps
and
findSpliceOverlaps-methods in the
GenomicAlignments package for methods that work on a
BamFile and BamFileList objects.
## ## BamFile options. ## fl <- system.file("extdata", "ex1.bam", package="Rsamtools") bf <- BamFile(fl) bf ## When 'asMates=TRUE' scanBam() reads the data in as ## pairs. See 'asMates' above for details of the pairing ## algorithm. asMates(bf) <- TRUE ## When 'yieldSize' is set, scanBam() will iterate ## through the file in chunks. yieldSize(bf) <- 500 ## Some applications append a filename (e.g., NCBI Sequence Read ## Archive (SRA) toolkit) or allele identifier to the sequence qname. ## This may result in a unique qname for each record which presents a ## problem when mating paired-end reads (identical qnames is one ## criteria for paired-end mating). 'qnamePrefixEnd' and ## 'qnameSuffixStart' can be used to trim an unwanted prefix or suffix. qnamePrefixEnd(bf) <- "/" qnameSuffixStart(bf) <- "." ## ## Reading Bam files. ## fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) (bf <- BamFile(fl)) head(seqlengths(bf)) # sequences and lengths in BAM file if (require(RNAseqData.HNRNPC.bam.chr14)) { bfl <- BamFileList(RNAseqData.HNRNPC.bam.chr14_BAMFILES) bfl bfl[1:2] # subset bfl[[1]] # select first element -- BamFile ## merged across BAM files seqinfo(bfl) head(seqlengths(bfl)) } length(scanBam(fl)[[1]][[1]]) # all records bf <- open(BamFile(fl)) # implicit index bf identical(scanBam(bf), scanBam(fl)) close(bf) ## Use 'yieldSize' to iterate through a file in chunks. bf <- open(BamFile(fl, yieldSize=1000)) while (nrec <- length(scanBam(bf)[[1]][[1]])) cat("records:", nrec, "\n") close(bf) ## Repeatedly visit multiple ranges in the BamFile. rng <- GRanges(c("seq1", "seq2"), IRanges(1, c(1575, 1584))) bf <- open(BamFile(fl)) sapply(seq_len(length(rng)), function(i, bamFile, rng) { param <- ScanBamParam(which=rng[i], what="seq") bam <- scanBam(bamFile, param=param)[[1]] alphabetFrequency(bam[["seq"]], baseOnly=TRUE, collapse=TRUE) }, bf, rng) close(bf)
## ## BamFile options. ## fl <- system.file("extdata", "ex1.bam", package="Rsamtools") bf <- BamFile(fl) bf ## When 'asMates=TRUE' scanBam() reads the data in as ## pairs. See 'asMates' above for details of the pairing ## algorithm. asMates(bf) <- TRUE ## When 'yieldSize' is set, scanBam() will iterate ## through the file in chunks. yieldSize(bf) <- 500 ## Some applications append a filename (e.g., NCBI Sequence Read ## Archive (SRA) toolkit) or allele identifier to the sequence qname. ## This may result in a unique qname for each record which presents a ## problem when mating paired-end reads (identical qnames is one ## criteria for paired-end mating). 'qnamePrefixEnd' and ## 'qnameSuffixStart' can be used to trim an unwanted prefix or suffix. qnamePrefixEnd(bf) <- "/" qnameSuffixStart(bf) <- "." ## ## Reading Bam files. ## fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) (bf <- BamFile(fl)) head(seqlengths(bf)) # sequences and lengths in BAM file if (require(RNAseqData.HNRNPC.bam.chr14)) { bfl <- BamFileList(RNAseqData.HNRNPC.bam.chr14_BAMFILES) bfl bfl[1:2] # subset bfl[[1]] # select first element -- BamFile ## merged across BAM files seqinfo(bfl) head(seqlengths(bfl)) } length(scanBam(fl)[[1]][[1]]) # all records bf <- open(BamFile(fl)) # implicit index bf identical(scanBam(bf), scanBam(fl)) close(bf) ## Use 'yieldSize' to iterate through a file in chunks. bf <- open(BamFile(fl, yieldSize=1000)) while (nrec <- length(scanBam(bf)[[1]][[1]])) cat("records:", nrec, "\n") close(bf) ## Repeatedly visit multiple ranges in the BamFile. rng <- GRanges(c("seq1", "seq2"), IRanges(1, c(1575, 1584))) bf <- open(BamFile(fl)) sapply(seq_len(length(rng)), function(i, bamFile, rng) { param <- ScanBamParam(which=rng[i], what="seq") bam <- scanBam(bamFile, param=param)[[1]] alphabetFrequency(bam[["seq"]], baseOnly=TRUE, collapse=TRUE) }, bf, rng) close(bf)
Import binary ‘BAM’ files into a list structure, with facilities for selecting what fields and which records are imported, and other operations to manipulate BAM files.
scanBam(file, index=file, ..., param=ScanBamParam(what=scanBamWhat())) countBam(file, index=file, ..., param=ScanBamParam()) idxstatsBam(file, index=file, ...) scanBamHeader(files, ...) ## S4 method for signature 'character' scanBamHeader(files, ...) asBam(file, destination=sub("\\.sam(\\.gz)?", "", file), ...) ## S4 method for signature 'character' asBam(file, destination=sub("\\.sam(\\.gz)?", "", file), ..., overwrite=FALSE, indexDestination=TRUE) asSam(file, destination=sub("\\.bam", "", file), ...) ## S4 method for signature 'character' asSam(file, destination=sub("\\.bam", "", file), ..., overwrite=FALSE) filterBam(file, destination, index=file, ...) ## S4 method for signature 'character' filterBam(file, destination, index=file, ..., filter=FilterRules(), indexDestination=TRUE, param=ScanBamParam(what=scanBamWhat())) sortBam(file, destination, ...) ## S4 method for signature 'character' sortBam(file, destination, ..., byQname=FALSE, maxMemory=512, byTag=NULL, nThreads=1L) indexBam(files, ...) ## S4 method for signature 'character' indexBam(files, ...) mergeBam(files, destination, ...) ## S4 method for signature 'character' mergeBam(files, destination, ..., region = GRanges(), overwrite = FALSE, header = character(), byQname = FALSE, addRG = FALSE, compressLevel1 = FALSE, indexDestination = FALSE)
scanBam(file, index=file, ..., param=ScanBamParam(what=scanBamWhat())) countBam(file, index=file, ..., param=ScanBamParam()) idxstatsBam(file, index=file, ...) scanBamHeader(files, ...) ## S4 method for signature 'character' scanBamHeader(files, ...) asBam(file, destination=sub("\\.sam(\\.gz)?", "", file), ...) ## S4 method for signature 'character' asBam(file, destination=sub("\\.sam(\\.gz)?", "", file), ..., overwrite=FALSE, indexDestination=TRUE) asSam(file, destination=sub("\\.bam", "", file), ...) ## S4 method for signature 'character' asSam(file, destination=sub("\\.bam", "", file), ..., overwrite=FALSE) filterBam(file, destination, index=file, ...) ## S4 method for signature 'character' filterBam(file, destination, index=file, ..., filter=FilterRules(), indexDestination=TRUE, param=ScanBamParam(what=scanBamWhat())) sortBam(file, destination, ...) ## S4 method for signature 'character' sortBam(file, destination, ..., byQname=FALSE, maxMemory=512, byTag=NULL, nThreads=1L) indexBam(files, ...) ## S4 method for signature 'character' indexBam(files, ...) mergeBam(files, destination, ...) ## S4 method for signature 'character' mergeBam(files, destination, ..., region = GRanges(), overwrite = FALSE, header = character(), byQname = FALSE, addRG = FALSE, compressLevel1 = FALSE, indexDestination = FALSE)
file |
The character(1) file name of the ‘BAM’ ('SAM' for
|
files |
The character() file names of the ‘BAM’ file to be
processed. For |
index |
The character(1) name of the index file of the 'BAM' file being processed; this is given without the '.bai' extension. |
destination |
The character(1) file name of the location where
the sorted, filtered, or merged output file will be created. For
|
region |
A GRanges() instance with |
... |
Additional arguments, passed to methods. |
overwrite |
A logical(1) indicating whether the destination can be over-written if it already exists. |
filter |
A |
indexDestination |
A logical(1) indicating whether the created destination file should also be indexed. |
byQname |
A logical(1) indicating whether the sorted destination file should be sorted by Query-name (TRUE) or by mapping position (FALSE). |
header |
A character(1) file path for the header information to be used in the merged BAM file. |
addRG |
A logical(1) indicating whether the file name should be used as RG (read group) tag in the merged BAM file. |
compressLevel1 |
A logical(1) indicating whether the merged BAM file should be compressed to zip level 1. |
maxMemory |
A numerical(1) indicating the maximal amount of memory (in MB) that the function is allowed to use. |
byTag |
A character(1) indicating whether the BAM file should be sorted by the supplied tag value. |
nThreads |
An integer(1) indicating the number of threads the function should use. |
param |
An instance of |
The scanBam
function parses binary BAM files; text SAM files
can be parsed using R's scan
function, especially with
arguments what
to control the fields that are parsed.
countBam
returns a count of records consistent with
param
. If param
is empty then entire file would be
counted.
idxstatsBam
visit the index in index(file)
, and quickly
returns the number of mapped and unmapped reads on each seqname.
scanBamHeader
visits the header information in a BAM file,
returning for each file a list containing elements targets
and
text
, as described below. The SAM / BAM specification does not
require that the content of the header be consistent with the content
of the file, e.g., more targets may be present that are represented by
reads in the file. An optional character vector argument containing
one or two elements of what=c("targets", "text")
can be used to
specify which elements of the header are returned.
asBam
converts 'SAM' files to 'BAM' files, equivalent to
samtools view -Sb file > destination
. The 'BAM' file is sorted
and an index created on the destination (with extension '.bai') when
indexDestination=TRUE
.
asSam
converts 'BAM' files to 'SAM' files, equivalent to
samtools view file > destination
.
filterBam
parses records in file
. Records satisfying the
bamWhich
bamFlag
and bamSimpleCigar
criteria of
param
are accumulated to a default of yieldSize =
1000000
records (change this by specifying yieldSize
when
creating a BamFile
instance; see
BamFile-class
). These records are then parsed to
a DataFrame
and made available for further filtering by
user-supplied FilterRules
. Functions in the FilterRules
instance should expect a single DataFrame
argument representing
all information specified by param
. Each function must return a
logical
vector equal to the number of rows of the
DataFrame
. Return values are used to include (when TRUE
)
corresponding records in the filtered BAM file. The BAM file is
created at destination
. An index file is created on the
destination when indexDestination=TRUE
. It is more space- and
time-efficient to filter using bamWhich
, bamFlag
, and
bamSimpleCigar
, if appropriate, than to supply
FilterRules
. filter
may be a list of FilterRules
instances, in which case destination
must be a character vector
of equal length. The original file
is then separately filtered
into destination[[i]]
, using filter[[i]]
as the filter
criterion.
sortBam
sorts the BAM file given as its first argument,
analogous to the “samtools sort” function.
indexBam
creates an index for each BAM file specified,
analogous to the ‘samtools index’ function.
mergeBam
merges 2 or more sorted BAM files. As with samtools,
the RG (read group) dictionary in the header of the BAM files is not
reconstructed.
Details of the ScanBamParam
class are provide on its help page;
several salient points are reiterated here. ScanBamParam
can
contain a field what
, specifying the components of the BAM
records to be returned. Valid values of what
are available with
scanBamWhat
. ScanBamParam
can contain an argument
which
that specifies a subset of reads to return. This requires
that the BAM file be indexed, and that the file be named following
samtools convention as <bam_filename>.bai
. ScanBamParam
can contain an argument tag
to specify which tags will be
extracted.
The scanBam,character-method
returns a list of lists. The outer
list groups results from each IntegerRanges
list of
bamWhich(param)
; the outer list is of length one when
bamWhich(param)
has length 0. Each inner list contains elements
named after scanBamWhat()
; elements omitted from
bamWhat(param)
are removed. The content of non-null elements
are as follows, taken from the description in the samtools API
documentation:
qname: This is the QNAME field in SAM Spec v1.4. The query name, i.e., identifier, associated with the read.
flag: This is the FLAG field in SAM Spec v1.4.
A numeric value summarizing details of the read. See
ScanBamParam
and the flag
argument, and
scanBamFlag()
.
rname: This is the RNAME field in SAM Spec v1.4. The name of the reference to which the read is aligned.
strand: The strand to which the read is aligned.
pos: This is the POS field in SAM Spec v1.4.
The genomic coordinate at the start of the alignment.
Coordinates are ‘left-most’, i.e., at the 3' end of a
read on the '-' strand, and 1-based. The position excludes
clipped nucleotides, even though soft-clipped nucleotides are
included in seq
.
qwidth: The width of the query, as calculated from the
cigar
encoding; normally equal to the width of the query
returned in seq
.
mapq: This is the MAPQ field in SAM Spec v1.4. The MAPping Quality.
cigar: This is the CIGAR field in SAM Spec v1.4. The CIGAR string.
mrnm: This is the RNEXT field in SAM Spec v1.4. The reference to which the mate (of a paired end or mate pair read) aligns.
mpos: This is the PNEXT field in SAM Spec v1.4. The position to which the mate aligns.
isize: This is the TLEN field in SAM Spec v1.4. Inferred insert size for paired end alignments.
seq: This is the SEQ field in SAM Spec v1.4. The query sequence, in the 5' to 3' orientation. If aligned to the minus strand, it is the reverse complement of the original sequence.
qual: This is the QUAL field in SAM Spec v1.4.
Phred-encoded, phred-scaled base quality score, oriented as
seq
.
groupid: This is an integer vector of unique group ids
returned when asMates=TRUE
in a BamFile object.
groupid
values are used to create the partitioning
for a GAlignmentsList
object.
mate_status: Returned (always) when asMates=TRUE
in a BamFile
object. This is a factor indicating status (mated
,
ambiguous
, unmated
) of each record.
idxstatsBam
returns a data.frame
with columns
seqnames
, seqlength
, mapped
(number of mapped
reads on seqnames
) and unmapped
(number of unmapped
reads).
scanBamHeader
returns a list, with one element for each file
named in files
. The list contains two element. The
targets
element contains target (reference) sequence
lengths. The text
element is itself a list with each element a
list corresponding to tags (e.g., ‘@SQ’) found in the header,
and the associated tag values.
asBam
, asSam
return the file name of the destination file.
sortBam
returns the file name of the sorted file.
indexBam
returns the file name of the index file created.
filterBam
returns the file name of the destination file
created.
Martin Morgan <[email protected]>. Thomas Unterhiner
<[email protected]> (sortBam
).
http://samtools.sourceforge.net/
ScanBamParam
, scanBamWhat
,
scanBamFlag
fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) ## ## scanBam ## res0 <- scanBam(fl)[[1]] # always list-of-lists names(res0) length(res0[["qname"]]) lapply(res0, head, 3) table(width(res0[["seq"]])) # query widths table(res0[["qwidth"]], useNA="always") # query widths derived from cigar table(res0[["cigar"]], useNA="always") table(res0[["strand"]], useNA="always") table(res0[["flag"]], useNA="always") which <- IRangesList(seq1=IRanges(1000, 2000), seq2=IRanges(c(100, 1000), c(1000, 2000))) p1 <- ScanBamParam(which=which, what=scanBamWhat()) res1 <- scanBam(fl, param=p1) names(res1) names(res1[[2]]) p2 <- ScanBamParam(what=c("rname", "strand", "pos", "qwidth")) res2 <- scanBam(fl, param=p2) p3 <- ScanBamParam( what="flag", # information to query from BAM file flag=scanBamFlag(isMinusStrand=FALSE)) length(scanBam(fl, param=p3)[[1]]$flag) ## ## idxstatsBam ## idxstatsBam(fl) ## ## filterBam ## param <- ScanBamParam( flag=scanBamFlag(isUnmappedQuery=FALSE), what="seq") dest <- filterBam(fl, tempfile(), param=param) countBam(dest) ## 3271 records ## filter to a single file filter <- FilterRules(list(MinWidth = function(x) width(x$seq) > 35)) dest <- filterBam(fl, tempfile(), param=param, filter=filter) countBam(dest) ## 398 records res3 <- scanBam(dest, param=ScanBamParam(what="seq"))[[1]] table(width(res3$seq)) ## filter 1 file to 2 destinations filters <- list( FilterRules(list(long=function(x) width(x$seq) > 35)), FilterRules(list(short=function(x) width(x$seq) <= 35)) ) destinations <- replicate(2, tempfile()) dest <- filterBam(fl, destinations, param=param, filter=filters) lapply(dest, countBam) ## ## sortBam ## sorted <- sortBam(fl, tempfile()) ## ## scanBamParam re-orders 'which'; recover original order ## gwhich <- as(which, "GRanges")[c(2, 1, 3)] # example data cnt <- countBam(fl, param=ScanBamParam(which=gwhich)) reorderIdx <- unlist(split(seq_along(gwhich), seqnames(gwhich))) cnt cnt[reorderIdx,]
fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) ## ## scanBam ## res0 <- scanBam(fl)[[1]] # always list-of-lists names(res0) length(res0[["qname"]]) lapply(res0, head, 3) table(width(res0[["seq"]])) # query widths table(res0[["qwidth"]], useNA="always") # query widths derived from cigar table(res0[["cigar"]], useNA="always") table(res0[["strand"]], useNA="always") table(res0[["flag"]], useNA="always") which <- IRangesList(seq1=IRanges(1000, 2000), seq2=IRanges(c(100, 1000), c(1000, 2000))) p1 <- ScanBamParam(which=which, what=scanBamWhat()) res1 <- scanBam(fl, param=p1) names(res1) names(res1[[2]]) p2 <- ScanBamParam(what=c("rname", "strand", "pos", "qwidth")) res2 <- scanBam(fl, param=p2) p3 <- ScanBamParam( what="flag", # information to query from BAM file flag=scanBamFlag(isMinusStrand=FALSE)) length(scanBam(fl, param=p3)[[1]]$flag) ## ## idxstatsBam ## idxstatsBam(fl) ## ## filterBam ## param <- ScanBamParam( flag=scanBamFlag(isUnmappedQuery=FALSE), what="seq") dest <- filterBam(fl, tempfile(), param=param) countBam(dest) ## 3271 records ## filter to a single file filter <- FilterRules(list(MinWidth = function(x) width(x$seq) > 35)) dest <- filterBam(fl, tempfile(), param=param, filter=filter) countBam(dest) ## 398 records res3 <- scanBam(dest, param=ScanBamParam(what="seq"))[[1]] table(width(res3$seq)) ## filter 1 file to 2 destinations filters <- list( FilterRules(list(long=function(x) width(x$seq) > 35)), FilterRules(list(short=function(x) width(x$seq) <= 35)) ) destinations <- replicate(2, tempfile()) dest <- filterBam(fl, destinations, param=param, filter=filters) lapply(dest, countBam) ## ## sortBam ## sorted <- sortBam(fl, tempfile()) ## ## scanBamParam re-orders 'which'; recover original order ## gwhich <- as(which, "GRanges")[c(2, 1, 3)] # example data cnt <- countBam(fl, param=ScanBamParam(which=gwhich)) reorderIdx <- unlist(split(seq_along(gwhich), seqnames(gwhich))) cnt cnt[reorderIdx,]
Use BamViews()
to reference a set of disk-based BAM files to be
processed (e.g., queried using scanBam
) as a single
‘experiment’.
## Constructor BamViews(bamPaths=character(0), bamIndicies=bamPaths, bamSamples=DataFrame(row.names=make.unique(basename(bamPaths))), bamRanges, bamExperiment = list(), ...) ## S4 method for signature 'missing' BamViews(bamPaths=character(0), bamIndicies=bamPaths, bamSamples=DataFrame(row.names=make.unique(basename(bamPaths))), bamRanges, bamExperiment = list(), ..., auto.range=FALSE) ## Accessors bamPaths(x) bamSamples(x) bamSamples(x) <- value bamRanges(x) bamRanges(x) <- value bamExperiment(x) ## S4 method for signature 'BamViews' names(x) ## S4 replacement method for signature 'BamViews' names(x) <- value ## S4 method for signature 'BamViews' dimnames(x) ## S4 replacement method for signature 'BamViews,ANY' dimnames(x) <- value bamDirname(x, ...) <- value ## Subset ## S4 method for signature 'BamViews,ANY,ANY' x[i, j, ..., drop=TRUE] ## S4 method for signature 'BamViews,ANY,missing' x[i, j, ..., drop=TRUE] ## S4 method for signature 'BamViews,missing,ANY' x[i, j, ..., drop=TRUE] ## Input ## S4 method for signature 'BamViews' scanBam(file, index = file, ..., param = ScanBamParam(what=scanBamWhat())) ## S4 method for signature 'BamViews' countBam(file, index = file, ..., param = ScanBamParam()) ## Show ## S4 method for signature 'BamViews' show(object)
## Constructor BamViews(bamPaths=character(0), bamIndicies=bamPaths, bamSamples=DataFrame(row.names=make.unique(basename(bamPaths))), bamRanges, bamExperiment = list(), ...) ## S4 method for signature 'missing' BamViews(bamPaths=character(0), bamIndicies=bamPaths, bamSamples=DataFrame(row.names=make.unique(basename(bamPaths))), bamRanges, bamExperiment = list(), ..., auto.range=FALSE) ## Accessors bamPaths(x) bamSamples(x) bamSamples(x) <- value bamRanges(x) bamRanges(x) <- value bamExperiment(x) ## S4 method for signature 'BamViews' names(x) ## S4 replacement method for signature 'BamViews' names(x) <- value ## S4 method for signature 'BamViews' dimnames(x) ## S4 replacement method for signature 'BamViews,ANY' dimnames(x) <- value bamDirname(x, ...) <- value ## Subset ## S4 method for signature 'BamViews,ANY,ANY' x[i, j, ..., drop=TRUE] ## S4 method for signature 'BamViews,ANY,missing' x[i, j, ..., drop=TRUE] ## S4 method for signature 'BamViews,missing,ANY' x[i, j, ..., drop=TRUE] ## Input ## S4 method for signature 'BamViews' scanBam(file, index = file, ..., param = ScanBamParam(what=scanBamWhat())) ## S4 method for signature 'BamViews' countBam(file, index = file, ..., param = ScanBamParam()) ## Show ## S4 method for signature 'BamViews' show(object)
bamPaths |
A character() vector of BAM path names. |
bamIndicies |
A character() vector of BAM index file path names, without the ‘.bai’ extension. |
bamSamples |
A |
bamRanges |
Missing or a |
bamExperiment |
A list() containing additional information about the experiment. |
auto.range |
If |
... |
Additional arguments. |
x |
An instance of |
object |
An instance of |
value |
An object of appropriate type to replace content. |
i |
During subsetting, a logical or numeric index into
|
j |
During subsetting, a logical or numeric index into
|
drop |
A logical(1), ignored by all |
file |
An instance of |
index |
A character vector of indices, corresponding to the
|
param |
An optional |
Objects are created by calls of the form BamViews()
.
A character() vector of BAM path names.
A character() vector of BAM index path names.
A DataFrame
instance with as
many rows as length(bamPaths)
, containing sample information
associated with each path.
A GRanges
instance with
ranges defined on the spaces of the BAM files. Ranges are not
validated against the BAM files.
A list() containing additional information about the experiment.
See 'Usage' for details on invocation.
Constructor:
Returns a BamViews
object.
Accessors:
Returns a character() vector of BAM path names.
Returns a character() vector of BAM index path names.
Returns a DataFrame
instance
with as many rows as length(bamPaths)
, containing sample
information associated with each path.
Assign a DataFrame
instance
with as many rows as length(bamPaths)
, containing sample
information associated with each path.
Returns a GRanges
instance
with ranges defined on the spaces of the BAM files. Ranges are
not validated against the BAM files.
Assign a GRanges
instance
with ranges defined on the spaces of the BAM files. Ranges are
not validated against the BAM files.
Returns a list() containing additional information about the experiment.
Return the column names of the BamViews
instance; same as names(bamSamples(x))
.
Assign the column names of the BamViews
instance.
Return the row and column names of the
BamViews
instance.
Assign the row and column names of the
BamViews
instance.
Methods:
Subset the object by bamRanges
or bamSamples
.
Visit each path in bamPaths(file)
, returning
the result of scanBam
applied to the specified
path. bamRanges(file)
takes precedence over
bamWhich(param)
.
Visit each path in bamPaths(file)
, returning
the result of countBam
applied to the specified
path. bamRanges(file)
takes precedence over
bamWhich(param)
.
Compactly display the object.
Martin Morgan
fls <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) rngs <- GRanges(seqnames = Rle(c("chr1", "chr2"), c(9, 9)), ranges = c(IRanges(seq(10000, 90000, 10000), width=500), IRanges(seq(100000, 900000, 100000), width=5000)), Count = seq_len(18L)) v <- BamViews(fls, bamRanges=rngs) v v[1:5,] bamRanges(v[c(1:5, 11:15),]) bamDirname(v) <- getwd() v
fls <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) rngs <- GRanges(seqnames = Rle(c("chr1", "chr2"), c(9, 9)), ranges = c(IRanges(seq(10000, 90000, 10000), width=500), IRanges(seq(100000, 900000, 100000), width=5000)), Count = seq_len(18L)) v <- BamViews(fls, bamRanges=rngs) v v[1:5,] bamRanges(v[c(1:5, 11:15),]) bamDirname(v) <- getwd() v
Use BcfFile()
to create a reference to a BCF (and optionally
its index). The reference remains open across calls to methods,
avoiding costly index re-loading.
BcfFileList()
provides a convenient way of managing a list of
BcfFile
instances.
## Constructors BcfFile(file, index = file, mode=ifelse(grepl("\\.bcf$", file), "rb", "r")) BcfFileList(...) ## Opening / closing ## S3 method for class 'BcfFile' open(con, ...) ## S3 method for class 'BcfFile' close(con, ...) ## accessors; also path(), index() ## S4 method for signature 'BcfFile' isOpen(con, rw="") bcfMode(object) ## actions ## S4 method for signature 'BcfFile' scanBcfHeader(file, ...) ## S4 method for signature 'BcfFile' scanBcf(file, ..., param=ScanBcfParam()) ## S4 method for signature 'BcfFile' indexBcf(file, ...)
## Constructors BcfFile(file, index = file, mode=ifelse(grepl("\\.bcf$", file), "rb", "r")) BcfFileList(...) ## Opening / closing ## S3 method for class 'BcfFile' open(con, ...) ## S3 method for class 'BcfFile' close(con, ...) ## accessors; also path(), index() ## S4 method for signature 'BcfFile' isOpen(con, rw="") bcfMode(object) ## actions ## S4 method for signature 'BcfFile' scanBcfHeader(file, ...) ## S4 method for signature 'BcfFile' scanBcf(file, ..., param=ScanBcfParam()) ## S4 method for signature 'BcfFile' indexBcf(file, ...)
con , object
|
An instance of |
file |
A character(1) vector of the BCF file path or, (for
indexBcf) an instance of |
index |
A character(1) vector of the BCF index. |
mode |
A character(1) vector; |
param |
An optional |
... |
Additional arguments. For |
rw |
Mode of file; ignored. |
Objects are created by calls of the form BcfFile()
.
The BcfFile
class inherits fields from the
RsamtoolsFile
class.
BcfFileList
inherits methods from
RsamtoolsFileList
and SimpleList
.
Opening / closing:
Opens the (local or remote) path
and
index
(if bamIndex
is not character(0)
),
files. Returns a BcfFile
instance.
Closes the BcfFile
con
; returning
(invisibly) the updated BcfFile
. The instance may be
re-opened with open.BcfFile
.
Accessors:
Returns a character(1) vector of the BCF path name.
Returns a character(1) vector of BCF index name.
Returns a character(1) vector BCF mode.
Methods:
Visit the path in path(file)
, returning the
result of scanBcf
applied to the specified path.
Compactly display the object.
Martin Morgan
fl <- system.file("extdata", "ex1.bcf.gz", package="Rsamtools", mustWork=TRUE) bf <- BcfFile(fl) # implicit index bf identical(scanBcf(bf), scanBcf(fl)) rng <- GRanges(c("seq1", "seq2"), IRanges(1, c(1575, 1584))) param <- ScanBcfParam(which=rng) bcf <- scanBcf(bf, param=param) ## all ranges ## ranges one at a time 'bf' open(bf) sapply(seq_len(length(rng)), function(i, bcfFile, rng) { param <- ScanBcfParam(which=rng) bcf <- scanBcf(bcfFile, param=param)[[1]] ## do extensive work with bcf isOpen(bf) ## file remains open }, bf, rng)
fl <- system.file("extdata", "ex1.bcf.gz", package="Rsamtools", mustWork=TRUE) bf <- BcfFile(fl) # implicit index bf identical(scanBcf(bf), scanBcf(fl)) rng <- GRanges(c("seq1", "seq2"), IRanges(1, c(1575, 1584))) param <- ScanBcfParam(which=rng) bcf <- scanBcf(bf, param=param) ## all ranges ## ranges one at a time 'bf' open(bf) sapply(seq_len(length(rng)), function(i, bcfFile, rng) { param <- ScanBcfParam(which=rng) bcf <- scanBcf(bcfFile, param=param)[[1]] ## do extensive work with bcf isOpen(bf) ## file remains open }, bf, rng)
Import, coerce, or index variant call files in text or binary format.
scanBcfHeader(file, ...) ## S4 method for signature 'character' scanBcfHeader(file, ...) scanBcf(file, ...) ## S4 method for signature 'character' scanBcf(file, index = file, ..., param=ScanBcfParam()) asBcf(file, dictionary, destination, ..., overwrite=FALSE, indexDestination=TRUE) ## S4 method for signature 'character' asBcf(file, dictionary, destination, ..., overwrite=FALSE, indexDestination=TRUE) indexBcf(file, ...) ## S4 method for signature 'character' indexBcf(file, ...)
scanBcfHeader(file, ...) ## S4 method for signature 'character' scanBcfHeader(file, ...) scanBcf(file, ...) ## S4 method for signature 'character' scanBcf(file, index = file, ..., param=ScanBcfParam()) asBcf(file, dictionary, destination, ..., overwrite=FALSE, indexDestination=TRUE) ## S4 method for signature 'character' asBcf(file, dictionary, destination, ..., overwrite=FALSE, indexDestination=TRUE) indexBcf(file, ...) ## S4 method for signature 'character' indexBcf(file, ...)
file |
For |
index |
The character() file name(s) of the ‘BCF’ index to be processed. |
dictionary |
a character vector of the unique “CHROM” names in the VCF file. |
destination |
The character(1) file name of the location where
the BCF output file will be created. For |
param |
A instance of |
... |
Additional arguments, e.g., for
|
overwrite |
A logical(1) indicating whether the destination can be over-written if it already exists. |
indexDestination |
A logical(1) indicating whether the created destination file should also be indexed. |
bcf*
functions are restricted to the GENO fields supported by
‘bcftools’ (see documentation at the url below). The argument
param
allows portions of the file to be input, but requires
that the file be BCF or bgzip'd and indexed as a
TabixFile
. For similar functions operating on VCF
files see ?scanVcf
in the VariantAnnotation
package.
scanBcfHeader
returns a list, with one element for each file
named in file
. Each element of the list is itself a list containing
three elements. The Reference
element is a character() vector with
names of reference sequences. The Sample
element is a character()
vector of names of samples. The Header
element is a DataFrameList
with one DataFrame per unique key value in the header
(preceded by “##”).
NOTE: In Rsamtools >=1.33.6, the Header
DataFrameList no longer
contains a DataFrame named "META". The META DataFrame contained all "simple"
key-value headers lines from a bcf / vcf file. These "simple" header
lines are now parsed into individual DataFrames named for the unique
key.
scanBcf
returns a list, with one element per file. Each list has 9
elements, corresponding to the columns of the VCF specification: CHROM
,
POS
, ID
, REF
, ALT
QUAL
, FILTER
,
INFO
, FORMAT
, GENO
.
The GENO
element is itself a list, with elements corresponding
to fields supported by ‘bcftools’ (see documentation at the url below).
asBcf
creates a binary BCF file from a text VCF file.
indexBcf
creates an index into the BCF file.
Martin Morgan <[email protected]>.
http://vcftools.sourceforge.net/specs.html outlines the VCF specification.
http://samtools.sourceforge.net/mpileup.shtml contains
information on the portion of the specification implemented by
bcftools
.
http://samtools.sourceforge.net/ provides information on
samtools
.
fl <- system.file("extdata", "ex1.bcf.gz", package="Rsamtools", mustWork=TRUE) scanBcfHeader(fl) bcf <- scanBcf(fl) ## value: list-of-lists str(bcf[1:8]) names(bcf[["GENO"]]) str(head(bcf[["GENO"]][["PL"]])) example(BcfFile)
fl <- system.file("extdata", "ex1.bcf.gz", package="Rsamtools", mustWork=TRUE) scanBcfHeader(fl) bcf <- scanBcf(fl) ## value: list-of-lists str(bcf[1:8]) names(bcf[["GENO"]]) str(head(bcf[["GENO"]][["PL"]])) example(BcfFile)
razip()
is defunct. Please use bgzip()
instead.
bgzip
compresses tabix (e.g. SAM or VCF) or FASTA files
to a format that allows indexing for later fast random-access.
bgzip(file, dest=sprintf("%s.bgz", sub("\\.gz$", "", file)), overwrite = FALSE) ## Defunct! razip(file, dest=sprintf("%s.rz", sub("\\.gz$", "", file)), overwrite = FALSE)
bgzip(file, dest=sprintf("%s.bgz", sub("\\.gz$", "", file)), overwrite = FALSE) ## Defunct! razip(file, dest=sprintf("%s.rz", sub("\\.gz$", "", file)), overwrite = FALSE)
file |
A character(1) path to an existing uncompressed or gz-compressed file. This file will be compressed. |
dest |
A character(1) path to a file. This will be the compressed
file. If |
overwrite |
A logical(1) indicating whether |
The full path to dest
.
Martin Morgan <[email protected]>
http://samtools.sourceforge.net/
from <- system.file("extdata", "ex1.sam", package="Rsamtools", mustWork=TRUE) to <- tempfile() zipped <- bgzip(from, to)
from <- system.file("extdata", "ex1.sam", package="Rsamtools", mustWork=TRUE) to <- tempfile() zipped <- bgzip(from, to)
Functions listed on this page are no longer supported.
For yieldTabix
, use the yieldSize
argument of
TabixFiles
.
For BamSampler
, use REDUCEsampler
with
reduceByYield
in the GenomicFiles
package.
Martin Morgan <[email protected]>.
Use FaFile()
to create a reference to an indexed fasta
file. The reference remains open across calls to methods, avoiding
costly index re-loading.
FaFileList()
provides a convenient way of managing a list of
FaFile
instances.
## Constructors FaFile(file, index=sprintf("%s.fai", file), gzindex=sprintf("%s.gzi", file)) FaFileList(...) ## Opening / closing ## S3 method for class 'FaFile' open(con, ...) ## S3 method for class 'FaFile' close(con, ...) ## accessors; also path(), index() ## S4 method for signature 'FaFile' gzindex(object, asNA=TRUE) ## S4 method for signature 'FaFileList' gzindex(object, asNA=TRUE) ## S4 method for signature 'FaFile' isOpen(con, rw="") ## actions ## S4 method for signature 'FaFile' indexFa(file, ...) ## S4 method for signature 'FaFile' scanFaIndex(file, ...) ## S4 method for signature 'FaFileList' scanFaIndex(file, ..., as=c("GRangesList", "GRanges")) ## S4 method for signature 'FaFile' seqinfo(x) ## S4 method for signature 'FaFile' countFa(file, ...) ## S4 method for signature 'FaFile,GRanges' scanFa(file, param, ..., as=c("DNAStringSet", "RNAStringSet", "AAStringSet")) ## S4 method for signature 'FaFile,IntegerRangesList' scanFa(file, param, ..., as=c("DNAStringSet", "RNAStringSet", "AAStringSet")) ## S4 method for signature 'FaFile,missing' scanFa(file, param, ..., as=c("DNAStringSet", "RNAStringSet", "AAStringSet")) ## S4 method for signature 'FaFile' getSeq(x, param, ...) ## S4 method for signature 'FaFileList' getSeq(x, param, ...)
## Constructors FaFile(file, index=sprintf("%s.fai", file), gzindex=sprintf("%s.gzi", file)) FaFileList(...) ## Opening / closing ## S3 method for class 'FaFile' open(con, ...) ## S3 method for class 'FaFile' close(con, ...) ## accessors; also path(), index() ## S4 method for signature 'FaFile' gzindex(object, asNA=TRUE) ## S4 method for signature 'FaFileList' gzindex(object, asNA=TRUE) ## S4 method for signature 'FaFile' isOpen(con, rw="") ## actions ## S4 method for signature 'FaFile' indexFa(file, ...) ## S4 method for signature 'FaFile' scanFaIndex(file, ...) ## S4 method for signature 'FaFileList' scanFaIndex(file, ..., as=c("GRangesList", "GRanges")) ## S4 method for signature 'FaFile' seqinfo(x) ## S4 method for signature 'FaFile' countFa(file, ...) ## S4 method for signature 'FaFile,GRanges' scanFa(file, param, ..., as=c("DNAStringSet", "RNAStringSet", "AAStringSet")) ## S4 method for signature 'FaFile,IntegerRangesList' scanFa(file, param, ..., as=c("DNAStringSet", "RNAStringSet", "AAStringSet")) ## S4 method for signature 'FaFile,missing' scanFa(file, param, ..., as=c("DNAStringSet", "RNAStringSet", "AAStringSet")) ## S4 method for signature 'FaFile' getSeq(x, param, ...) ## S4 method for signature 'FaFileList' getSeq(x, param, ...)
con , object , x
|
An instance of |
file , index , gzindex
|
A character(1) vector of the fasta or fasta index
or bgzip index file path (for |
asNA |
logical indicating if missing output should be NA or character() |
param |
An optional |
... |
Additional arguments.
|
rw |
Mode of file; ignored. |
as |
A character(1) vector indicating the type of object to return.
|
Objects are created by calls of the form FaFile()
.
The FaFile
class inherits fields from the
RsamtoolsFile
class.
FaFileList
inherits methods from
RsamtoolsFileList
and SimpleList
.
Opening / closing:
Opens the (local or remote) path
and
index
files. Returns a FaFile
instance.
Closes the FaFile
con
; returning
(invisibly) the updated FaFile
. The instance may be
re-opened with open.FaFile
.
Accessors:
Returns a character(1) vector of the fasta path name.
Returns a character(1) vector of fasta index name (minus the '.fai' extension).
Methods:
Visit the path in path(file)
and create an
index file (with the extension ‘.fai’).
Read the sequence names and and widths of
recorded in an indexed fasta file, returning the information as a
GRanges
object.
Consult the index file for defined sequences
(seqlevels()
) and lengths (seqlengths()
).
Return the number of records in the fasta file.
Return the sequences indicated by param
as a
DNAStringSet
instance. seqnames(param)
selects the sequences to return; start(param)
and
end{param}
define the (1-based) region of the sequence to
return. Values of end(param)
greater than the width of the
sequence cause an error; use seqlengths(FaFile(file))
to
discover sequence widths. When param
is missing, all
records are selected. When length(param)==0
no records are
selected.
Returns the sequences indicated by param
from
the indexed fasta file(s) of file
.
For the FaFile
method, the return type is a
DNAStringSet
. The getSeq,FaFile
and
scanFa,FaFile,GRanges
methods differ in that getSeq
will reverse complement sequences selected from the minus strand.
For the FaFileList
method, the param
argument must
be a GRangesList
of the same length as file
,
creating a one-to-one mapping between the ith element of
file
and the ith element of param
; the return type
is a SimpleList
of DNAStringSet
instances, with
elements of the list in the same order as the input elements.
Compactly display the object.
Martin Morgan
fl <- system.file("extdata", "ce2dict1.fa", package="Rsamtools", mustWork=TRUE) fa <- open(FaFile(fl)) # open countFa(fa) (idx <- scanFaIndex(fa)) (dna <- scanFa(fa, param=idx[1:2])) ranges(idx) <- narrow(ranges(idx), -10) # last 10 nucleotides (dna <- scanFa(fa, param=idx[1:2]))
fl <- system.file("extdata", "ce2dict1.fa", package="Rsamtools", mustWork=TRUE) fa <- open(FaFile(fl)) # open countFa(fa) (idx <- scanFaIndex(fa)) (dna <- scanFa(fa, param=idx[1:2])) ranges(idx) <- narrow(ranges(idx), -10) # last 10 nucleotides (dna <- scanFa(fa, param=idx[1:2]))
Scan indexed fasta (or compressed fasta) files and their indicies.
indexFa(file, ...) ## S4 method for signature 'character' indexFa(file, ...) scanFaIndex(file, ...) ## S4 method for signature 'character' scanFaIndex(file, ...) countFa(file, ...) ## S4 method for signature 'character' countFa(file, ...) scanFa(file, param, ..., as=c("DNAStringSet", "RNAStringSet", "AAStringSet")) ## S4 method for signature 'character,GRanges' scanFa(file, param, ..., as=c("DNAStringSet", "RNAStringSet", "AAStringSet")) ## S4 method for signature 'character,IntegerRangesList' scanFa(file, param, ..., as=c("DNAStringSet", "RNAStringSet", "AAStringSet")) ## S4 method for signature 'character,missing' scanFa(file, param, ..., as=c("DNAStringSet", "RNAStringSet", "AAStringSet"))
indexFa(file, ...) ## S4 method for signature 'character' indexFa(file, ...) scanFaIndex(file, ...) ## S4 method for signature 'character' scanFaIndex(file, ...) countFa(file, ...) ## S4 method for signature 'character' countFa(file, ...) scanFa(file, param, ..., as=c("DNAStringSet", "RNAStringSet", "AAStringSet")) ## S4 method for signature 'character,GRanges' scanFa(file, param, ..., as=c("DNAStringSet", "RNAStringSet", "AAStringSet")) ## S4 method for signature 'character,IntegerRangesList' scanFa(file, param, ..., as=c("DNAStringSet", "RNAStringSet", "AAStringSet")) ## S4 method for signature 'character,missing' scanFa(file, param, ..., as=c("DNAStringSet", "RNAStringSet", "AAStringSet"))
file |
A character(1) vector containing the fasta file path. |
param |
An optional |
as |
A character(1) vector indicating the type of object to
return; default |
... |
Additional arguments, passed to |
indexFa
visits the path in file
and create an index file
at the same location but with extension ‘.fai’).
scanFaIndex
reads the sequence names and and widths of recorded
in an indexed fasta file, returning the information as a
GRanges
object.
countFa
returns the number of records in the fasta file.
scanFa
return the sequences indicated by param
as a
DNAStringSet
, RNAStringSet
,
AAStringSet
instance. seqnames(param)
selects the sequences to return; start(param)
and
end{param}
define the (1-based) region of the sequence to
return. Values of end(param)
greater than the width of the
sequence are set to the width of the sequence. When param
is
missing, all records are selected. When param
is
GRanges()
, no records are selected.
Martin Morgan <[email protected]>.
http://samtools.sourceforge.net/ provides information on
samtools
.
fa <- system.file("extdata", "ce2dict1.fa", package="Rsamtools", mustWork=TRUE) countFa(fa) (idx <- scanFaIndex(fa)) (dna <- scanFa(fa, idx[1:2])) ranges(idx) <- narrow(ranges(idx), -10) # last 10 nucleotides (dna <- scanFa(fa, idx[1:2]))
fa <- system.file("extdata", "ce2dict1.fa", package="Rsamtools", mustWork=TRUE) countFa(fa) (idx <- scanFaIndex(fa)) (dna <- scanFa(fa, idx[1:2])) ranges(idx) <- narrow(ranges(idx), -10) # last 10 nucleotides (dna <- scanFa(fa, idx[1:2]))
This function queries a tabix file, returning the names of the ‘sequences’ used as a key when creating the file.
headerTabix(file, ...) ## S4 method for signature 'character' headerTabix(file, ...)
headerTabix(file, ...) ## S4 method for signature 'character' headerTabix(file, ...)
file |
A |
... |
Additional arguments, currently ignored. |
A list(4)
of the sequence names, column indicies used to sort
the file, the number of lines skipped while indexing, and the comment
character used while indexing.
Martin Morgan <[email protected]>.
fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools", mustWork=TRUE) headerTabix(fl)
fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools", mustWork=TRUE) headerTabix(fl)
Index (with indexTabix
) files that have been sorted into
ascending sequence, start and end position ordering.
indexTabix(file, format=c("gff", "bed", "sam", "vcf", "vcf4", "psltbl"), seq=integer(), start=integer(), end=integer(), skip=0L, comment="#", zeroBased=FALSE, ...)
indexTabix(file, format=c("gff", "bed", "sam", "vcf", "vcf4", "psltbl"), seq=integer(), start=integer(), end=integer(), skip=0L, comment="#", zeroBased=FALSE, ...)
file |
A characater(1) path to a sorted, bgzip-compressed file. |
format |
The format of the data in the compressed file. A characater(1) matching one of the types named in the function signature. |
seq |
If |
start |
If |
end |
If |
skip |
The number of lines to be skipped at the beginning of the file. |
comment |
A single character which, when present as the first character in a line, indicates that the line is to be omitted. from indexing. |
zeroBased |
A logical(1) indicating whether coordinats in the file are zero-based. |
... |
Additional arguments. |
The return value of indexTabix
is an updated instance of
file
reflecting the newly-created index file.
Martin Morgan <[email protected]>.
http://samtools.sourceforge.net/tabix.shtml
from <- system.file("extdata", "ex1.sam", package="Rsamtools", mustWork=TRUE) to <- tempfile() zipped <- bgzip(from, to) idx <- indexTabix(zipped, "sam") tab <- TabixFile(zipped, idx)
from <- system.file("extdata", "ex1.sam", package="Rsamtools", mustWork=TRUE) to <- tempfile() zipped <- bgzip(from, to) idx <- indexTabix(zipped, "sam") tab <- TabixFile(zipped, idx)
pileup
uses PileupParam
and ScanBamParam
objects
to calculate pileup statistics for a BAM file. The result is a
data.frame
with columns summarizing counts of reads overlapping
each genomic position, optionally differentiated on nucleotide,
strand, and position within read.
pileup(file, index=file, ..., scanBamParam=ScanBamParam(), pileupParam=PileupParam()) ## PileupParam constructor PileupParam(max_depth=250, min_base_quality=13, min_mapq=0, min_nucleotide_depth=1, min_minor_allele_depth=0, distinguish_strands=TRUE, distinguish_nucleotides=TRUE, ignore_query_Ns=TRUE, include_deletions=TRUE, include_insertions=FALSE, left_bins=NULL, query_bins=NULL, cycle_bins=NULL) phred2ASCIIOffset(phred=integer(), scheme= c("Illumina 1.8+", "Sanger", "Solexa", "Illumina 1.3+", "Illumina 1.5+"))
pileup(file, index=file, ..., scanBamParam=ScanBamParam(), pileupParam=PileupParam()) ## PileupParam constructor PileupParam(max_depth=250, min_base_quality=13, min_mapq=0, min_nucleotide_depth=1, min_minor_allele_depth=0, distinguish_strands=TRUE, distinguish_nucleotides=TRUE, ignore_query_Ns=TRUE, include_deletions=TRUE, include_insertions=FALSE, left_bins=NULL, query_bins=NULL, cycle_bins=NULL) phred2ASCIIOffset(phred=integer(), scheme= c("Illumina 1.8+", "Sanger", "Solexa", "Illumina 1.3+", "Illumina 1.5+"))
file |
character(1) or |
index |
When |
... |
Additional arguments, perhaps used by methods. |
scanBamParam |
An instance of |
pileupParam |
An instance of |
max_depth |
integer(1); maximum number of overlapping alignments considered for each position in the pileup. |
min_base_quality |
integer(1); minimum ‘QUAL’ value for
each nucleotide in an alignment. Use |
min_mapq |
integer(1); minimum ‘MAPQ’ value for an alignment to be included in pileup. |
min_nucleotide_depth |
integer(1); minimum count of each nucleotide (independent of other nucleotides) at a given position required for said nucleotide to appear in the result. |
min_minor_allele_depth |
integer(1); minimum count of all nucleotides other than the major allele at a given position required for a particular nucleotide to appear in the result. |
distinguish_strands |
logical(1); |
distinguish_nucleotides |
logical(1); |
ignore_query_Ns |
logical(1); |
include_deletions |
logical(1); |
include_insertions |
logical(1); |
left_bins |
numeric; all same sign; unique positions within a
read to delimit bins. Position within read is based on counting from
the 5' end regardless of strand. Minimum of two values are
required so at least one range can be formed. If you want to delimit bins based on sequencing cycles to, e.g.,
discard later cycles, |
query_bins |
numeric; all same sign; unique positions within a
read to delimit bins. Position within a read is based on counting
from the 5' end when the read aligns to plus strand,
counting from the 3' end when read aligns to minus
strand. Minimum of two values are required so at least one range can
be formed. |
phred |
Either a numeric() (coerced to integer()) PHRED score
(e.g., in the range (0, 41) for the ‘Illumina 1.8+’ scheme)
or character() of printable ASCII characters. When |
scheme |
Encoding scheme, used to translate numeric() PHRED
scores to their ASCII code. Ignored when |
cycle_bins |
DEPRECATED. See |
NB: Use of pileup
assumes familiarity with
ScanBamParam
, and use of left_bins
and
query_bins
assumes familiarity with cut
.
pileup
visits each position in the BAM file, subject to
constraints implied by which
and flag
of
scanBamParam
. For a given position, all reads overlapping the
position that are consistent with constraints dictated by flag
of scanBamParam
and pileupParam
are used for
counting. pileup
also automatically excludes reads with flags
that indicate any of:
unmapped alignment (isUnmappedQuery
)
secondary alignment (isSecondaryAlignment
)
not passing quality controls (isNotPassingQualityControls
)
PCR or optical duplicate (isDuplicate
)
If no which
argument is supplied to the ScanBamParam
,
behavior reflects that of scanBam
: the entire file is visited
and counted.
Use a yieldSize
value when creating a BamFile
instance to manage memory consumption when using pileup with large BAM
files. There are some differences in how pileup
behavior is
affected when the yieldSize
value is set on the BAM file. See
more details below.
Many of the parameters of the pileupParam
interact. A simple
illustration is ignore_query_Ns
and
distinguish_nucleotides
, as mentioned in the
ignore_query_Ns
section.
Parameters for pileupParam
belong to two categories: parameters
that affect only the filtering criteria (so-called
‘behavior-only’ policies), and parameters that affect
filtering behavior and the schema of the results
(‘behavior+structure’ policies).
distinguish_nucleotides
and distinguish_strands
when set
to TRUE
each add a column (nucleotide
and strand
,
respectively) to the resulting data.frame
. If both are TRUE,
each combination of nucleotide
and strand
at a given
position is counted separately. Setting only one to TRUE
behaves as expected; for example, if only nucleotide
is set to
TRUE
, each nucleotide at a given position is counted
separately, but the distinction of on which strand the nucleotide
appears is ignored.
ignore_query_Ns
determines how ambiguous nucloetides are
treated. By default, ambiguous nucleotides (typically ‘N’ in
BAM files) in alignments are ignored. If ignore_query_Ns
is
FALSE
, ambiguous nucleotides are included in counts; further,
if ignore_query_Ns
is FALSE
and
distinguish_nucleotides
is TRUE
the ‘N’
nucleotide value appears in the nucleotide column when a base at a
given position is ambiguous.
By default, deletions with respect to the reference genome to which
the reads were aligned are included in the counts in a pileup. If
include_deletions
is TRUE
and
distinguish_nucleotides
is TRUE
, the nucleotide column
uses a ‘-’ character to indicate a deletion at a position.
The ‘=’ nucleotide code in the SEQ
field (to mean
‘identical to reference genome’) is supported by pileup, such
that a match with the reference genome is counted separately in the
results if distinguish_nucleotides
is TRUE
.
CIGAR support: pileup
handles the extended CIGAR format by only
heeding operations that contribute to counts (‘M’, ‘D’,
‘I’). If you are confused about how the different CIGAR
operations interact, see the SAM versions of the BAM files used for
pileup
unit testing for a fairly comprehensive human-readable
treatment.
Deletions and clipping:
The extended CIGAR allows a number of operations conceptually
similar to a ‘deletion’ with respect to the reference
genome, but offer more specialized meanings than a simple
deletion. CIGAR ‘N’ operations (not to be confused with
‘N’ used for non-determinate bases in the SEQ
field)
indicate a large skipped region, ‘S’ a soft clip, and
‘H’ a hard clip. ‘N’, ‘S’, and ‘H’
CIGAR operations are never counted: only counts of true deletions
(‘D’ in the CIGAR) can be included by setting
include_deletions
to TRUE
.
Soft clip operations contribute to the relative position of
nucleotides within a read, whereas hard clip and refskip
operations do not. For example, consider a sequence in a bam file
that starts at 11, with a CIGAR 2S1M
and SEQ
field
ttA
. The cycle position of the A
nucleotide will be
3
, but the reported position will be 11
. If using
left_bins
or query_bins
it might make sense to first
preprocess your files to remove soft clipping.
Insertions and padding:
pileup
can include counts of insertion operations by
setting include_insertions
to TRUE
. Details about
insertions are effectively truncated such that each insertion is
reduced to a single ‘+’ nucleotide. Information about
e.g. the nucleotide code and base quality within the insertion is
not included.
Because ‘+’ is used as a nucleotide code to represent
insertions in pileup
, counts of the ‘+’ nucleotide
participate in voting for min_nucleotide_depth
and
min_minor_allele_depth
functionality.
The position of an insertion is the position that precedes it on
the reference sequence. Note: this means if
include_insertions
is TRUE
a single read will
contribute two nucleotide codes (+
, plus that of the
non-insertion base) at a given position if the non-insertion base
passes filter criteria.
‘P’ CIGAR (padding) operations never affect counts.
The “bin” arguments query_bins
and left_bins
allow users to differentiate pileup counts based on arbitrary
non-overlapping regions within a read. pileup
relies on
cut
to derive bins, but limits input to numeric values
of the same sign (coerced to integer
s), including
+/-Inf
. If a “bin” argument is set pileup
automatically excludes bases outside the implicit outer range. Here
are some important points regarding bin arguments:
query_bins
vs. left_bins
:
BAM files store sequence data from the 5' end to the 3' end (regardless of to which strand the read aligns). That means for reads aligned to the minus strand, cycle position within a read is effectively reversed. Take care to use the appropriate bin argument for your use case.
The most common use of a bin argument is to bin later cycles
separately from earlier cycles; this is because accuracy typically
degrades in later cycles. For this application, query_bins
yields the correct behavior because bin positions are relative to
cycle order (i.e., sensitive to strand).
left_bins
(in contrast with query_bins
) determines
bin positions from the 5' end, regardless of strand.
Positive or negative bin values can be used to delmit bins
based on the effective “start” or “end” of a
read. For left_bin
the effective start is always the 5' end
(left-to-right as appear in the BAM file).
For query_bin
the effective start is the first cycle of the
read as it came off the sequencer; that means the 5' end for reads
aligned to the plus strand and 3' for reads aligned to the minus
strand.
From effective start of reads: use positive values,
0
, and (+)Inf
. Because cut
produces ranges in the form (first,last], ‘0’ should be
used to create a bin that includes the first position. To
account for variable-length reads in BAM files, use
(+)Inf
as the upper bound on a bin that extends to the
end of all reads.
From effective end of reads: use negative values
and -Inf
. -1
denotes the last position in a
read. Because cut
produces ranges in the form
(first,last], specify the lower bound of a bin by using one
less than the desired value, e.g., a bin that captures only
the second nucleotide from the last position:
query_bins=c(-3, -2)
. To account for variable-length
reads in BAM files, use -Inf
as the lower bound on a
bin that extends to the beginning of all reads.
pileup
obeys yieldSize
on BamFile
objects
with some differences from how scanBam
responds to
yieldSize
. Here are some points to keep in mind when using
pileup
in conjunction with yieldSize
:
BamFile
s with a yieldSize
value set, once
opened and used with pileup
, should not be used with
other functions that accept a BamFile
instance as
input. Create a new BamFile
instance instead of trying to
reuse.
pileup
only returns genomic positions for which all
input has been processed. pileup
will hold in reserve
positions for which only partial data has been
processed. Positions held in reserve will be returned upon
subsequent calls to pileup
when all the input for a given
position has been processed.
The correspondence between yieldSize and the number of rows
in the data.frame
returned from pileup
is not
1-to-1. yieldSize
only limits the number of
alignments processed from the BAM file each time the file
is read. Only a handful of alignments can lead to many distinct
records in the result.
Like scanBam
, pileup
uses an empty return
object (a zero-row data.frame
) to indicate end-of-input.
Sometimes reading yieldSize
records from the BAM file
does not result in any completed positions. In order to avoid
returning a zero-row data.frame
pileup
will
repeatedly process yieldSize
additional records until at
least one position can be returned to the user.
For pileup
a data.frame
with 1 row per unique
combination of differentiating column values that satisfied filter
criteria, with frequency (count
) of unique combination. Columns
seqnames
, pos
, and count
always appear; optional
strand
, nulceotide
, and left_bin
/
query_bin
columns appear as dictated by arguments to
PileupParam
.
Column details:
seqnames
: factor. SAM ‘RNAME’ field of
alignment.
pos
: integer(1). Genomic position of base. Derived by
offset from SAM ‘POS’ field of alignment.
strand
: factor. ‘strand’ to which read aligns.
nucleotide
: factor. ‘nucleotide’ of base,
extracted from SAM ‘SEQ’ field of alignment.
left_bin
/ query_bin
: factor. Bin in which base
appears.
count
: integer(1). Frequency of combination of
differentiating fields, as indicated by values of other columns.
See scanBam
for more detailed explanation of SAM fields.
If a which
argument is specified for the scanBamParam
, a
which_label
column (factor
in the form
‘rname:start-end’) is automatically included. The
which_label
column is used to maintain grouping of results,
such that two queries of the same genomic region can be
differentiated.
Order of rows in data.frame
is first by order of
seqnames
column based on the BAM index file, then
non-decreasing order on columns pos
, then nucleotide
,
then strand
, then left_bin
/ query_bin
.
PileupParam
returns an instance of PileupParam class.
phred2ASCIIOffset
returns a named integer vector of the same
length or number of characters in phred
. The names are the
ASCII encoding, and the values are the offsets to be used with
min_base_quality
.
Nathaniel Hayden [email protected]
For the relationship between PHRED scores and ASCII encoding, see https://en.wikipedia.org/wiki/FASTQ_format#Encoding.
## The examples below apply equally to pileup queries for specific ## genomic ranges (specified by the 'which' parameter of 'ScanBamParam') ## and queries across entire files; the entire-file examples are ## included first to make them easy to find. The more detailed examples ## of pileup use queries with a 'which' argument. library("RNAseqData.HNRNPC.bam.chr14") fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1] ## Minimum base quality to be tallied p_param <- PileupParam(min_base_quality = 10L) ## no 'which' argument to ScanBamParam: process entire file at once res <- pileup(fl, pileupParam = p_param) head(res) table(res$strand, res$nucleotide) ## no 'which' argument to ScanBamParam with 'yieldSize' set on BamFile bf <- open(BamFile(fl, yieldSize=5e4)) repeat { res <- pileup(bf, pileupParam = p_param) message(nrow(res), " rows in result data.frame") if(nrow(res) == 0L) break } close(bf) ## pileup for the first half of chr14 with default Pileup parameters ## 'which' argument: process only specified genomic range(s) sbp <- ScanBamParam(which=GRanges("chr14", IRanges(1, 53674770))) res <- pileup(fl, scanBamParam=sbp, pileupParam = p_param) head(res) table(res$strand, res$nucleotide) ## specify pileup parameters: include ambiguious nucleotides ## (the 'N' level in the nucleotide column of the data.frame) p_param <- PileupParam(ignore_query_Ns=FALSE, min_base_quality = 10L) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) head(res) table(res$strand, res$nucleotide) ## Don't distinguish strand, require a minimum frequency of 7 for a ## nucleotide at a genomic position to be included in results. p_param <- PileupParam(distinguish_strands=TRUE, min_base_quality = 10L, min_nucleotide_depth=7) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) head(res) table(res$nucleotide, res$strand) ## Any combination of the filtering criteria is possible: let's say we ## want a "coverage pileup" that only counts reads with mapping ## quality >= 13, and bases with quality >= 10 that appear at least 4 ## times at each genomic position p_param <- PileupParam(distinguish_nucleotides=FALSE, distinguish_strands=FALSE, min_mapq=13, min_base_quality=10, min_nucleotide_depth=4) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) head(res) res <- res[, c("pos", "count")] ## drop seqnames and which_label cols plot(count ~ pos, res, pch=".") ## ASCII offsets to help specify min_base_quality, e.g., quality of at ## least 10 on the Illumina 1.3+ scale phred2ASCIIOffset(10, "Illumina 1.3+") ## Well-supported polymorphic positions (minor allele present in at ## least 5 alignments) with high map quality p_param <- PileupParam(min_minor_allele_depth=5, min_mapq=40, min_base_quality = 10, distinguish_strand=FALSE) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) dim(res) ## reduced to our biologically interesting positions head(xtabs(count ~ pos + nucleotide, res)) ## query_bins ## basic use of positive bins: single pivot; count bases that appear in ## first 10 cycles of a read separately from the rest p_param <- PileupParam(query_bins=c(0, 10, Inf), min_base_quality = 10) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) ## basic use of positive bins: simple range; only include bases in ## cycle positions 6-10 within read p_param <- PileupParam(query_bins=c(5, 10), min_base_quality = 10) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) ## basic use of negative bins: single pivot; count bases that appear in ## last 3 cycle positions of a read separately from the rest. p_param <- PileupParam(query_bins=c(-Inf, -4, -1), min_base_quality = 10) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) ## basic use of negative bins: drop nucleotides in last two cycle ## positions of each read p_param <- PileupParam(query_bins=c(-Inf, -3), min_base_quality = 10) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) ## typical use: beginning, middle, and end segments; because of the ## nature of sequencing technology, it is common for bases in the ## beginning and end segments of each read to be less reliable. pileup ## makes it easy to count each segment separately. ## Assume determined ahead of time that for the data 1-7 makes sense for ## beginning, 8-12 middle, >=13 end (actual cycle positions should be ## tailored to data in actual BAM files). p_param <- PileupParam(query_bins=c(0, 7, 12, Inf), ## note non-symmetric min_base_quality = 10) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) xt <- xtabs(count ~ nucleotide + query_bin, res) print(xt) t(t(xt) / colSums(xt)) ## cheap normalization for illustrative purposes ## 'implicit outer range': in contrast to c(0, 7, 12, Inf), suppose we ## still want to have beginning, middle, and end segements, but know ## that the first three cycles and any bases beyond the 16th cycle are ## irrelevant. Hence, the implicit outer range is (3,16]; all bases ## outside of that are dropped. p_param <- PileupParam(query_bins=c(3, 7, 12, 16), min_base_quality = 10) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) xt <- xtabs(count ~ nucleotide + query_bin, res) print(xt) t(t(xt) / colSums(xt)) ## single-width bins: count each cycle within a read separately. p_param <- PileupParam(query_bins=seq(1,20), min_base_quality = 10) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) xt <- xtabs(count ~ nucleotide + query_bin, res) print(xt[ , c(1:3, 18:19)]) ## fit on one screen print(t(t(xt) / colSums(xt))[ , c(1:3, 18:19)])
## The examples below apply equally to pileup queries for specific ## genomic ranges (specified by the 'which' parameter of 'ScanBamParam') ## and queries across entire files; the entire-file examples are ## included first to make them easy to find. The more detailed examples ## of pileup use queries with a 'which' argument. library("RNAseqData.HNRNPC.bam.chr14") fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1] ## Minimum base quality to be tallied p_param <- PileupParam(min_base_quality = 10L) ## no 'which' argument to ScanBamParam: process entire file at once res <- pileup(fl, pileupParam = p_param) head(res) table(res$strand, res$nucleotide) ## no 'which' argument to ScanBamParam with 'yieldSize' set on BamFile bf <- open(BamFile(fl, yieldSize=5e4)) repeat { res <- pileup(bf, pileupParam = p_param) message(nrow(res), " rows in result data.frame") if(nrow(res) == 0L) break } close(bf) ## pileup for the first half of chr14 with default Pileup parameters ## 'which' argument: process only specified genomic range(s) sbp <- ScanBamParam(which=GRanges("chr14", IRanges(1, 53674770))) res <- pileup(fl, scanBamParam=sbp, pileupParam = p_param) head(res) table(res$strand, res$nucleotide) ## specify pileup parameters: include ambiguious nucleotides ## (the 'N' level in the nucleotide column of the data.frame) p_param <- PileupParam(ignore_query_Ns=FALSE, min_base_quality = 10L) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) head(res) table(res$strand, res$nucleotide) ## Don't distinguish strand, require a minimum frequency of 7 for a ## nucleotide at a genomic position to be included in results. p_param <- PileupParam(distinguish_strands=TRUE, min_base_quality = 10L, min_nucleotide_depth=7) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) head(res) table(res$nucleotide, res$strand) ## Any combination of the filtering criteria is possible: let's say we ## want a "coverage pileup" that only counts reads with mapping ## quality >= 13, and bases with quality >= 10 that appear at least 4 ## times at each genomic position p_param <- PileupParam(distinguish_nucleotides=FALSE, distinguish_strands=FALSE, min_mapq=13, min_base_quality=10, min_nucleotide_depth=4) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) head(res) res <- res[, c("pos", "count")] ## drop seqnames and which_label cols plot(count ~ pos, res, pch=".") ## ASCII offsets to help specify min_base_quality, e.g., quality of at ## least 10 on the Illumina 1.3+ scale phred2ASCIIOffset(10, "Illumina 1.3+") ## Well-supported polymorphic positions (minor allele present in at ## least 5 alignments) with high map quality p_param <- PileupParam(min_minor_allele_depth=5, min_mapq=40, min_base_quality = 10, distinguish_strand=FALSE) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) dim(res) ## reduced to our biologically interesting positions head(xtabs(count ~ pos + nucleotide, res)) ## query_bins ## basic use of positive bins: single pivot; count bases that appear in ## first 10 cycles of a read separately from the rest p_param <- PileupParam(query_bins=c(0, 10, Inf), min_base_quality = 10) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) ## basic use of positive bins: simple range; only include bases in ## cycle positions 6-10 within read p_param <- PileupParam(query_bins=c(5, 10), min_base_quality = 10) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) ## basic use of negative bins: single pivot; count bases that appear in ## last 3 cycle positions of a read separately from the rest. p_param <- PileupParam(query_bins=c(-Inf, -4, -1), min_base_quality = 10) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) ## basic use of negative bins: drop nucleotides in last two cycle ## positions of each read p_param <- PileupParam(query_bins=c(-Inf, -3), min_base_quality = 10) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) ## typical use: beginning, middle, and end segments; because of the ## nature of sequencing technology, it is common for bases in the ## beginning and end segments of each read to be less reliable. pileup ## makes it easy to count each segment separately. ## Assume determined ahead of time that for the data 1-7 makes sense for ## beginning, 8-12 middle, >=13 end (actual cycle positions should be ## tailored to data in actual BAM files). p_param <- PileupParam(query_bins=c(0, 7, 12, Inf), ## note non-symmetric min_base_quality = 10) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) xt <- xtabs(count ~ nucleotide + query_bin, res) print(xt) t(t(xt) / colSums(xt)) ## cheap normalization for illustrative purposes ## 'implicit outer range': in contrast to c(0, 7, 12, Inf), suppose we ## still want to have beginning, middle, and end segements, but know ## that the first three cycles and any bases beyond the 16th cycle are ## irrelevant. Hence, the implicit outer range is (3,16]; all bases ## outside of that are dropped. p_param <- PileupParam(query_bins=c(3, 7, 12, 16), min_base_quality = 10) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) xt <- xtabs(count ~ nucleotide + query_bin, res) print(xt) t(t(xt) / colSums(xt)) ## single-width bins: count each cycle within a read separately. p_param <- PileupParam(query_bins=seq(1,20), min_base_quality = 10) res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param) xt <- xtabs(count ~ nucleotide + query_bin, res) print(xt[ , c(1:3, 18:19)]) ## fit on one screen print(t(t(xt) / colSums(xt))[ , c(1:3, 18:19)])
Use PileupFiles()
to create a reference to BAM files (and
their indicies), to be used for calculating pile-up summaries.
## Constructors PileupFiles(files, ..., param=ApplyPileupsParam()) ## S4 method for signature 'character' PileupFiles(files, ..., param=ApplyPileupsParam()) ## S4 method for signature 'list' PileupFiles(files, ..., param=ApplyPileupsParam()) ## opening / closing ## open(con, ...) ## close(con, ...) ## accessors; also path() ## S4 method for signature 'PileupFiles' isOpen(con, rw="") plpFiles(object) plpParam(object) ## actions ## S4 method for signature 'PileupFiles,missing' applyPileups(files, FUN, ..., param) ## S4 method for signature 'PileupFiles,ApplyPileupsParam' applyPileups(files, FUN, ..., param) ## display ## S4 method for signature 'PileupFiles' show(object)
## Constructors PileupFiles(files, ..., param=ApplyPileupsParam()) ## S4 method for signature 'character' PileupFiles(files, ..., param=ApplyPileupsParam()) ## S4 method for signature 'list' PileupFiles(files, ..., param=ApplyPileupsParam()) ## opening / closing ## open(con, ...) ## close(con, ...) ## accessors; also path() ## S4 method for signature 'PileupFiles' isOpen(con, rw="") plpFiles(object) plpParam(object) ## actions ## S4 method for signature 'PileupFiles,missing' applyPileups(files, FUN, ..., param) ## S4 method for signature 'PileupFiles,ApplyPileupsParam' applyPileups(files, FUN, ..., param) ## display ## S4 method for signature 'PileupFiles' show(object)
files |
For For |
... |
Additional arguments, currently ignored. |
con , object
|
An instance of |
FUN |
A function of one argument; see |
param |
An instance of |
rw |
character() indicating mode of file; not used for
|
Objects are created by calls of the form PileupFiles()
.
The PileupFiles
class is implemented as an S4 reference
class. It has the following fields:
A list of BamFile
instances.
An instance of ApplyPileupsParam
.
Opening / closing:
Opens the (local or remote) path
and
index
of each file in the PileupFiles
instance. Returns a PileupFiles
instance.
Closes each file in the PileupFiles
instance; returning (invisibly) the updated
PileupFiles
. The instance may be re-opened with
open.PileupFiles
.
Accessors:
Returns the list
of the files in the
PileupFiles
instance.
Returns the ApplyPileupsParam
content of the
PileupFiles
instance.
Methods:
Calculate the pileup across all files in
files
according to criteria in param
(or
plpParam(files)
if param
is missing), invoking
FUN
on each range or collection of positions. See
applyPileups
.
Compactly display the object.
Martin Morgan
example(applyPileups)
example(applyPileups)
quickBamFlagSummary
groups the records of a BAM file based on their flag
bits and counts the number of records in each group.
quickBamFlagSummary(file, ..., param=ScanBamParam(), main.groups.only=FALSE) ## S4 method for signature 'character' quickBamFlagSummary(file, index=file, ..., param=ScanBamParam(), main.groups.only=FALSE) ## S4 method for signature 'list' quickBamFlagSummary(file, ..., param=ScanBamParam(), main.groups.only=FALSE)
quickBamFlagSummary(file, ..., param=ScanBamParam(), main.groups.only=FALSE) ## S4 method for signature 'character' quickBamFlagSummary(file, index=file, ..., param=ScanBamParam(), main.groups.only=FALSE) ## S4 method for signature 'list' quickBamFlagSummary(file, ..., param=ScanBamParam(), main.groups.only=FALSE)
file , index
|
For the character method, the path to the BAM file to read, and to the index file of the BAM file to read, respectively. For the list() method, |
... |
Additional arguments, perhaps used by methods. |
param |
An instance of |
main.groups.only |
If |
Nothing is returned. A summary of the counts is printed to the console
unless redirected by sink
.
Hervé Pagès
http://samtools.sourceforge.net/
BamFile
for a method for that class.
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) quickBamFlagSummary(bamfile)
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) quickBamFlagSummary(bamfile)
Import files created by evaluation of samtools' pileup -cv
command.
readPileup(file, ...) ## S4 method for signature 'connection' readPileup(file, ..., variant=c("SNP", "indel", "all"))
readPileup(file, ...) ## S4 method for signature 'connection' readPileup(file, ..., variant=c("SNP", "indel", "all"))
file |
The file name, or
|
... |
Additional arguments, passed to methods. For instance,
specify |
variant |
Type of variant to parse; select one. |
readPileup
returns a GRanges
object.
The value returned by variant="SNP"
or variant="all"
contains:
The chromosome names (fastq ids) of the reference sequence
The nucleotide position (base 1) of the variant.
The nucleotide in the reference sequence.
The consensus nucleotide, as determined by samtools pileup.
The phred-scaled consensus quality.
The phred-scaled SNP quality (probability of the consensus being identical to the reference).
The root mean square mapping quality of reads overlapping the site.
The number of reads covering the site.
The value returned by variant="indel"
contains space, position,
reference, consensus, consensusQuality, snpQuality, maxMappingQuality,
and coverage fields, and:
The first (typically, in the reference sequence) and second allelic variants.
The number of reads supporting each allele.
The number of additional indels present.
Sean Davis
http://samtools.sourceforge.net/
fl <- system.file("extdata", "pileup.txt", package="Rsamtools", mustWork=TRUE) (res <- readPileup(fl)) xtabs(~referenceBase + consensusBase, mcols(res))[DNA_BASES,] ## Not run: ## uses a pipe, and arguments passed to read.table ## three successive piles of 100 records each cmd <- "samtools pileup -cvf human_b36_female.fa.gz na19240_3M.bam" p <- pipe(cmd, "r") snp <- readPileup(p, nrow=100) # variant="SNP" indel <- readPileup(p, nrow=100, variant="indel") all <- readPileup(p, nrow=100, variant="all") ## End(Not run)
fl <- system.file("extdata", "pileup.txt", package="Rsamtools", mustWork=TRUE) (res <- readPileup(fl)) xtabs(~referenceBase + consensusBase, mcols(res))[DNA_BASES,] ## Not run: ## uses a pipe, and arguments passed to read.table ## three successive piles of 100 records each cmd <- "samtools pileup -cvf human_b36_female.fa.gz na19240_3M.bam" p <- pipe(cmd, "r") snp <- readPileup(p, nrow=100) # variant="SNP" indel <- readPileup(p, nrow=100, variant="indel") all <- readPileup(p, nrow=100, variant="all") ## End(Not run)
RsamtoolsFile
is a base class for managing file references in
Rsamtools; it is not intended for direct use by users – see, e.g.,
BamFile
.
## accessors ## S4 method for signature 'RsamtoolsFile' path(object, ...) ## S4 method for signature 'RsamtoolsFile' index(object, ..., asNA = TRUE) ## S4 method for signature 'RsamtoolsFile' isOpen(con, rw="") ## S4 method for signature 'RsamtoolsFile' yieldSize(object, ...) yieldSize(object, ...) <- value ## S4 method for signature 'RsamtoolsFile' show(object)
## accessors ## S4 method for signature 'RsamtoolsFile' path(object, ...) ## S4 method for signature 'RsamtoolsFile' index(object, ..., asNA = TRUE) ## S4 method for signature 'RsamtoolsFile' isOpen(con, rw="") ## S4 method for signature 'RsamtoolsFile' yieldSize(object, ...) yieldSize(object, ...) <- value ## S4 method for signature 'RsamtoolsFile' show(object)
con , object
|
An instance of a class derived from
|
asNA |
logical indicating if missing output should be NA or character() |
rw |
Mode of file; ignored. |
... |
Additional arguments, unused. |
value |
Replacement value. |
Users do not directly create instances of this class; see, e.g.,
BamFile-class
.
The RsamtoolsFile
class is implemented as an S4 reference
class. It has the following fields:
An externalptr
initialized to an internal
structure with opened bam file and bam index pointers.
A character(1) vector of the file name.
A character(1) vector of the index file name.
An integer(1) vector of the number of records to yield.
Accessors:
Returns a character(1) vector of path names.
Returns a character(1) vector of index path names.
Return or set an integer(1) vector indicating yield size.
Methods:
Report whether the file is currently open.
Compactly display the object.
Martin Morgan
RsamtoolsFileList
is a base class for managing lists of file
references in Rsamtools; it is not intended for direct use –
see, e.g., BamFileList
.
## S4 method for signature 'RsamtoolsFileList' path(object, ...) ## S4 method for signature 'RsamtoolsFileList' index(object, ..., asNA = TRUE) ## S4 method for signature 'RsamtoolsFileList' isOpen(con, rw="") ## S3 method for class 'RsamtoolsFileList' open(con, ...) ## S3 method for class 'RsamtoolsFileList' close(con, ...) ## S4 method for signature 'RsamtoolsFileList' names(x) ## S4 method for signature 'RsamtoolsFileList' yieldSize(object, ...)
## S4 method for signature 'RsamtoolsFileList' path(object, ...) ## S4 method for signature 'RsamtoolsFileList' index(object, ..., asNA = TRUE) ## S4 method for signature 'RsamtoolsFileList' isOpen(con, rw="") ## S3 method for class 'RsamtoolsFileList' open(con, ...) ## S3 method for class 'RsamtoolsFileList' close(con, ...) ## S4 method for signature 'RsamtoolsFileList' names(x) ## S4 method for signature 'RsamtoolsFileList' yieldSize(object, ...)
con , object , x
|
An instance of a class derived from
|
asNA |
logical indicating if missing output should be NA or character() |
rw |
Mode of file; ignored. |
... |
Additional arguments. |
Users do not directly create instances of this class; see, e.g.,
BamFileList-class
.
This class inherits functions and methods for subseting, updating, and
display from the SimpleList
class.
Methods:
Report whether each file in the list is currently open.
Attempt to open each file in the list.
Attempt to close each file in the list.
Names of each element of the list or, if names are
NULL
, the basename of the path of each element.
Martin Morgan
Use ScanBamParam()
to create a parameter object influencing
what fields and which records are imported from a (binary) BAM file.
Use of which
requires that a BAM index file
(<filename>.bai
) exists.
# Constructor ScanBamParam(flag = scanBamFlag(), simpleCigar = FALSE, reverseComplement = FALSE, tag = character(0), tagFilter = list(), what = character(0), which, mapqFilter=NA_integer_) # Constructor helpers scanBamFlag(isPaired = NA, isProperPair = NA, isUnmappedQuery = NA, hasUnmappedMate = NA, isMinusStrand = NA, isMateMinusStrand = NA, isFirstMateRead = NA, isSecondMateRead = NA, isNotPrimaryRead = NA, isSecondaryAlignment = NA, isNotPassingQualityControls = NA, isDuplicate = NA, isSupplementaryAlignment = NA) scanBamWhat() # Accessors bamFlag(object, asInteger=FALSE) bamFlag(object) <- value bamReverseComplement(object) bamReverseComplement(object) <- value bamSimpleCigar(object) bamSimpleCigar(object) <- value bamTag(object) bamTag(object) <- value bamTagFilter(object) bamTagFilter(object) <- value bamWhat(object) bamWhat(object) <- value bamWhich(object) bamWhich(object) <- value bamMapqFilter(object) bamMapqFilter(object) <- value ## S4 method for signature 'ScanBamParam' show(object) # Flag utils bamFlagAsBitMatrix(flag, bitnames=FLAG_BITNAMES) bamFlagAND(flag1, flag2) bamFlagTest(flag, value)
# Constructor ScanBamParam(flag = scanBamFlag(), simpleCigar = FALSE, reverseComplement = FALSE, tag = character(0), tagFilter = list(), what = character(0), which, mapqFilter=NA_integer_) # Constructor helpers scanBamFlag(isPaired = NA, isProperPair = NA, isUnmappedQuery = NA, hasUnmappedMate = NA, isMinusStrand = NA, isMateMinusStrand = NA, isFirstMateRead = NA, isSecondMateRead = NA, isNotPrimaryRead = NA, isSecondaryAlignment = NA, isNotPassingQualityControls = NA, isDuplicate = NA, isSupplementaryAlignment = NA) scanBamWhat() # Accessors bamFlag(object, asInteger=FALSE) bamFlag(object) <- value bamReverseComplement(object) bamReverseComplement(object) <- value bamSimpleCigar(object) bamSimpleCigar(object) <- value bamTag(object) bamTag(object) <- value bamTagFilter(object) bamTagFilter(object) <- value bamWhat(object) bamWhat(object) <- value bamWhich(object) bamWhich(object) <- value bamMapqFilter(object) bamMapqFilter(object) <- value ## S4 method for signature 'ScanBamParam' show(object) # Flag utils bamFlagAsBitMatrix(flag, bitnames=FLAG_BITNAMES) bamFlagAND(flag1, flag2) bamFlagTest(flag, value)
flag |
For For |
simpleCigar |
A logical(1) vector which, when TRUE, returns only
those reads for which the cigar (run-length encoded representation
of the alignment) is missing or contains only matches / mismatches
( |
reverseComplement |
A logical(1) vectors. BAM files store reads
mapping to the minus strand as though they are on the plus
strand. Rsamtools obeys this convention by default
( |
tag |
A character vector naming tags to be extracted. A tag is an
optional field, with arbitrary information, stored with each
record. Tags are identified by two-letter codes, so all elements of
|
tagFilter |
A named |
what |
A character vector naming the fields to return
|
mapqFilter |
A non-negative integer(1) specifying the minimum
mapping quality to include. BAM records with mapping qualities less
than |
which |
A |
isPaired |
A logical(1) indicating whether unpaired (FALSE), paired (TRUE), or any (NA) read should be returned. |
isProperPair |
A logical(1) indicating whether improperly paired (FALSE), properly paired (TRUE), or any (NA) read should be returned. A properly paired read is defined by the alignment algorithm and might, e.g., represent reads aligning to identical reference sequences and with a specified distance. |
isUnmappedQuery |
A logical(1) indicating whether unmapped (TRUE), mapped (FALSE), or any (NA) read should be returned. |
hasUnmappedMate |
A logical(1) indicating whether reads with mapped (FALSE), unmapped (TRUE), or any (NA) mate should be returned. |
isMinusStrand |
A logical(1) indicating whether reads aligned to the plus (FALSE), minus (TRUE), or any (NA) strand should be returned. |
isMateMinusStrand |
A logical(1) indicating whether mate reads aligned to the plus (FALSE), minus (TRUE), or any (NA) strand should be returned. |
isFirstMateRead |
A logical(1) indicating whether the first mate read should be returned (TRUE) or not (FALSE), or whether mate read number should be ignored (NA). |
isSecondMateRead |
A logical(1) indicating whether the second mate read should be returned (TRUE) or not (FALSE), or whether mate read number should be ignored (NA). |
isNotPrimaryRead |
Deprecated; use |
isSecondaryAlignment |
A logical(1) indicating whether alignments that are secondary (TRUE), are not (FALSE) or whose secondary status does not matter (NA) should be returned. A non-primary alignment (“secondary alignment” in the SAM specification) might result when a read aligns to multiple locations. One alignment is designated as primary and has this flag set to FALSE; the remainder, for which this flag is TRUE, are designated by the aligner as secondary. |
isNotPassingQualityControls |
A logical(1) indicating whether reads passing quality controls (FALSE), reads not passing quality controls (TRUE), or any (NA) read should be returned. |
isDuplicate |
A logical(1) indicating that un-duplicated (FALSE), duplicated (TRUE), or any (NA) reads should be returned. 'Duplicated' reads may represent PCR or optical duplicates. |
isSupplementaryAlignment |
A logical(1) indicating that non-supplementary (FALSE), supplementary (TRUE), or any (NA) reads should be returned. The SAM specification indicates that 'supplementary' reads are part of a chimeric alignment. |
object |
An instance of class |
value |
An instance of the corresponding slot, to be assigned to
|
asInteger |
logical(1) indicating whether ‘flag’ should be
returned as an encoded integer vector ( |
bitnames |
Names of the flag bits to extract. Will be the colnames of the returned matrix. |
flag1 , flag2
|
Integer vectors containing ‘flag’ entries. |
Objects are created by calls of the form ScanBamParam()
.
flag
Object of class integer
encoding flags
to be kept when they have their '0' (keep0
) or '1'
(keep1
) bit set.
simpleCigar
Object of class logical
indicating, when TRUE, that only 'simple' cigars (empty or 'M') are
returned.
reverseComplement
Object of class logical
indicating, when TRUE, that reads on the minus strand are to be
reverse complemented (sequence) and reversed (quality).
tag
Object of class character
indicating what
tags are to be returned.
tagFilter
Object of class list
(named)
indicating tags to filter by, and the set of acceptable values for
each tag.
what
Object of class character
indicating
what fields are to be returned.
which
Object of class IntegerRangesList
indicating
which reference sequence and coordinate reads must overlap.
mapqFilter
Object of class integer
indicating
the minimum mapping quality required for input, or NA to indicate
no filtering.
See 'Usage' for details on invocation.
Constructor:
Returns a ScanBamParam
object. The
which
argument to the constructor can be one of several
different types, as documented above.
Accessors:
Returns or sets a character
vector of
tags to be extracted.
Returns or sets a named
list
of tags to filter by, and the set of their acceptable
values.
Returns or sets a character
vector
of fields to be extracted.
Returns or sets a IntegerRangesList
of
bounds on reads to be extracted. A length 0 IntegerRangesList
represents all reads.
Returns or sets an integer(2)
representation of reads flagged to be kept or excluded.
Returns or sets a
logical(1)
vector indicating whether reads without indels
or clipping be kept.
Returns or sets
a logical(1)
vector indicating whether reads on the minus
strand will be returned with sequence reverse complemented and
quality reversed.
Methods:
Compactly display the object.
Martin Morgan
## defaults p0 <- ScanBamParam() ## subset of reads based on genomic coordinates which <- IRangesList(seq1=IRanges(1000, 2000), seq2=IRanges(c(100, 1000), c(1000, 2000))) p1 <- ScanBamParam(what=scanBamWhat(), which=which) ## subset of reads based on 'flag' value p2 <- ScanBamParam(what=scanBamWhat(), flag=scanBamFlag(isMinusStrand=FALSE)) ## subset of fields p3 <- ScanBamParam(what=c("rname", "strand", "pos", "qwidth")) ## use fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) res <- scanBam(fl, param=p2)[[1]] lapply(res, head) ## tags; NM: edit distance; H1: 1-difference hits p4 <- ScanBamParam(tag=c("NM", "H1"), what="flag") bam4 <- scanBam(fl, param=p4) str(bam4[[1]][["tag"]]) ## tagFilter p5 <- ScanBamParam(tag=c("NM", "H1"), tagFilter=list(NM=c(2, 3, 4))) bam5 <- scanBam(fl, param=p5) table(bam5[[1]][["tag"]][["NM"]]) ## flag utils flag <- scanBamFlag(isUnmappedQuery=FALSE, isMinusStrand=TRUE) p6 <- ScanBamParam(what="flag") bam6 <- scanBam(fl, param=p6) flag6 <- bam6[[1]][["flag"]] head(bamFlagAsBitMatrix(flag6[1:9])) colSums(bamFlagAsBitMatrix(flag6)) flag bamFlagAsBitMatrix(flag)
## defaults p0 <- ScanBamParam() ## subset of reads based on genomic coordinates which <- IRangesList(seq1=IRanges(1000, 2000), seq2=IRanges(c(100, 1000), c(1000, 2000))) p1 <- ScanBamParam(what=scanBamWhat(), which=which) ## subset of reads based on 'flag' value p2 <- ScanBamParam(what=scanBamWhat(), flag=scanBamFlag(isMinusStrand=FALSE)) ## subset of fields p3 <- ScanBamParam(what=c("rname", "strand", "pos", "qwidth")) ## use fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) res <- scanBam(fl, param=p2)[[1]] lapply(res, head) ## tags; NM: edit distance; H1: 1-difference hits p4 <- ScanBamParam(tag=c("NM", "H1"), what="flag") bam4 <- scanBam(fl, param=p4) str(bam4[[1]][["tag"]]) ## tagFilter p5 <- ScanBamParam(tag=c("NM", "H1"), tagFilter=list(NM=c(2, 3, 4))) bam5 <- scanBam(fl, param=p5) table(bam5[[1]][["tag"]][["NM"]]) ## flag utils flag <- scanBamFlag(isUnmappedQuery=FALSE, isMinusStrand=TRUE) p6 <- ScanBamParam(what="flag") bam6 <- scanBam(fl, param=p6) flag6 <- bam6[[1]][["flag"]] head(bamFlagAsBitMatrix(flag6[1:9])) colSums(bamFlagAsBitMatrix(flag6)) flag bamFlagAsBitMatrix(flag)
Use ScanBcfParam()
to create a parameter object influencing the
‘INFO’ and ‘GENO’ fields parsed, and which sample records are
imported from a BCF file. Use of which
requires that a BCF
index file (<filename>.bci
) exists.
ScanBcfParam(fixed=character(), info=character(), geno=character(), samples=character(), trimEmpty=TRUE, which, ...) ## S4 method for signature 'missing' ScanBcfParam(fixed=character(), info=character(), geno=character(), samples=character(), trimEmpty=TRUE, which, ...) ## S4 method for signature 'IntegerRangesList' ScanBcfParam(fixed=character(), info=character(), geno=character(), samples=character(), trimEmpty=TRUE, which, ...) ## S4 method for signature 'GRanges' ScanBcfParam(fixed=character(), info=character(), geno=character(), samples=character(), trimEmpty=TRUE, which, ...) ## S4 method for signature 'GRangesList' ScanBcfParam(fixed=character(), info=character(), geno=character(), samples=character(), trimEmpty=TRUE, which, ...) ## Accessors bcfFixed(object) bcfInfo(object) bcfGeno(object) bcfSamples(object) bcfTrimEmpty(object) bcfWhich(object)
ScanBcfParam(fixed=character(), info=character(), geno=character(), samples=character(), trimEmpty=TRUE, which, ...) ## S4 method for signature 'missing' ScanBcfParam(fixed=character(), info=character(), geno=character(), samples=character(), trimEmpty=TRUE, which, ...) ## S4 method for signature 'IntegerRangesList' ScanBcfParam(fixed=character(), info=character(), geno=character(), samples=character(), trimEmpty=TRUE, which, ...) ## S4 method for signature 'GRanges' ScanBcfParam(fixed=character(), info=character(), geno=character(), samples=character(), trimEmpty=TRUE, which, ...) ## S4 method for signature 'GRangesList' ScanBcfParam(fixed=character(), info=character(), geno=character(), samples=character(), trimEmpty=TRUE, which, ...) ## Accessors bcfFixed(object) bcfInfo(object) bcfGeno(object) bcfSamples(object) bcfTrimEmpty(object) bcfWhich(object)
fixed |
A logical(1) for use with |
info |
A character() vector of ‘INFO’ fields (see scanVcfHeader) to be returned. |
geno |
A character() vector of ‘GENO’ fields (see
scanVcfHeader) to be returned. |
samples |
A character() vector of sample names (see
scanVcfHeader) to be returned. |
trimEmpty |
A logical(1) indicating whether ‘GENO’ fields with no values should be returned. |
which |
An object, for which a method is defined (see usage,
above), describing the sequences and ranges to be queried. Variants
whose |
object |
An instance of class |
... |
Arguments used internally. |
Objects can be created by calls of the form ScanBcfParam()
.
which
:Object of class "IntegerRangesList"
indicating
which reference sequence and coordinate variants must overlap.
info
:Object of class "character"
indicating
portions of ‘INFO’ to be returned.
geno
:Object of class "character"
indicating
portions of ‘GENO’ to be returned.
samples
:Object of class "character"
indicating
the samples to be returned.
trimEmpty
:Object of class "logical"
indicating
whether empty ‘GENO’ fields are to be returned.
fixed
:Object of class "character"
. For use
with ScanVcfParam
only.
See 'Usage' for details on invocation.
Constructor:
Returns a ScanBcfParam
object. The
which
argument to the constructor can be one of several types,
as documented above.
Accessors:
Return the
corresponding field from object
.
Methods:
Compactly display the object.
Martin Morgan [email protected]
## see ?ScanVcfParam examples
## see ?ScanVcfParam examples
This function queries a tabix file, returning the names of the ‘sequences’ used as a key when creating the file.
seqnamesTabix(file, ...) ## S4 method for signature 'character' seqnamesTabix(file, ...)
seqnamesTabix(file, ...) ## S4 method for signature 'character' seqnamesTabix(file, ...)
file |
A |
... |
Additional arguments, currently ignored. |
A character()
vector of sequence names present in the file.
Martin Morgan <[email protected]>.
fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools", mustWork=TRUE) seqnamesTabix(fl)
fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools", mustWork=TRUE) seqnamesTabix(fl)
Use TabixFile()
to create a reference to a Tabix file (and its
index). Once opened, the reference remains open across calls to
methods, avoiding costly index re-loading.
TabixFileList()
provides a convenient way of managing a list of
TabixFile
instances.
## Constructors TabixFile(file, index = paste(file, "tbi", sep="."), ..., yieldSize=NA_integer_) TabixFileList(...) ## Opening / closing ## S3 method for class 'TabixFile' open(con, ...) ## S3 method for class 'TabixFile' close(con, ...) ## accessors; also path(), index(), yieldSize() ## S4 method for signature 'TabixFile' isOpen(con, rw="") ## actions ## S4 method for signature 'TabixFile' seqnamesTabix(file, ...) ## S4 method for signature 'TabixFile' headerTabix(file, ...) ## S4 method for signature 'TabixFile,GRanges' scanTabix(file, ..., param) ## S4 method for signature 'TabixFile,IntegerRangesList' scanTabix(file, ..., param) ## S4 method for signature 'TabixFile,missing' scanTabix(file, ..., param) ## S4 method for signature 'character,ANY' scanTabix(file, ..., param) ## S4 method for signature 'character,missing' scanTabix(file, ..., param) countTabix(file, ...)
## Constructors TabixFile(file, index = paste(file, "tbi", sep="."), ..., yieldSize=NA_integer_) TabixFileList(...) ## Opening / closing ## S3 method for class 'TabixFile' open(con, ...) ## S3 method for class 'TabixFile' close(con, ...) ## accessors; also path(), index(), yieldSize() ## S4 method for signature 'TabixFile' isOpen(con, rw="") ## actions ## S4 method for signature 'TabixFile' seqnamesTabix(file, ...) ## S4 method for signature 'TabixFile' headerTabix(file, ...) ## S4 method for signature 'TabixFile,GRanges' scanTabix(file, ..., param) ## S4 method for signature 'TabixFile,IntegerRangesList' scanTabix(file, ..., param) ## S4 method for signature 'TabixFile,missing' scanTabix(file, ..., param) ## S4 method for signature 'character,ANY' scanTabix(file, ..., param) ## S4 method for signature 'character,missing' scanTabix(file, ..., param) countTabix(file, ...)
con |
An instance of |
file |
For TabixFile(), A character(1) vector to the tabix file
path; can be remote (http://, ftp://). For |
index |
A character(1) vector of the tabix file index. |
yieldSize |
Number of records to yield each time the file is read
from using |
param |
An instance of GRanges or IntegerRangesList, used to select which records to scan. |
... |
Additional arguments. For |
rw |
character() indicating mode of file; not used for |
Objects are created by calls of the form TabixFile()
.
The TabixFile
class inherits fields from the
RsamtoolsFile
class.
TabixFileList
inherits methods from
RsamtoolsFileList
and SimpleList
.
Opening / closing:
Opens the (local or remote) path
and
index
. Returns a TabixFile
instance.
yieldSize
determines the number of records parsed during
each call to scanTabix
; NA
indicates that all
records are to be parsed.
Closes the TabixFile
con
; returning
(invisibly) the updated TabixFile
. The instance may be
re-opened with open.TabixFile
.
Accessors:
Returns a character(1) vector of the tabix path name.
Returns a character(1) vector of tabix index name.
Return or set an integer(1) vector indicating yield size.
Methods:
Visit the path in path(file)
, returning
the sequence names present in the file.
Visit the path in path(file)
, returning
the sequence names, column indicies used to sort the file, the
number of lines skipped while indexing, the comment character used
while indexing, and the header (preceeded by comment character, at
start of file) lines.
Return the number of records in each range of
param
, or the count of all records in the file (when
param
is missing).
For signature(file="TabixFile")
, Visit the
path in path(file)
, returning the result of
scanTabix
applied to the specified path. For
signature(file="character")
, call the corresponding method
after coercing file
to TabixFile
.
This method operates on file paths, rather than
TabixFile
objects, to index tab-separated files. See
indexTabix
.
Compactly display the object.
Martin Morgan
fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools", mustWork=TRUE) tbx <- TabixFile(fl) param <- GRanges(c("chr1", "chr2"), IRanges(c(1, 1), width=100000)) countTabix(tbx) countTabix(tbx, param=param) res <- scanTabix(tbx, param=param) sapply(res, length) res[["chr1:1-100000"]][1:2] ## parse to list of data.frame's dff <- Map(function(elt) { read.csv(textConnection(elt), sep="\t", header=FALSE) }, res) dff[["chr1:1-100000"]][1:5,1:8] ## parse 100 records at a time length(scanTabix(tbx)[[1]]) # total number of records tbx <- open(TabixFile(fl, yieldSize=100)) while(length(res <- scanTabix(tbx)[[1]])) cat("records read:", length(res), "\n") close(tbx)
fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools", mustWork=TRUE) tbx <- TabixFile(fl) param <- GRanges(c("chr1", "chr2"), IRanges(c(1, 1), width=100000)) countTabix(tbx) countTabix(tbx, param=param) res <- scanTabix(tbx, param=param) sapply(res, length) res[["chr1:1-100000"]][1:2] ## parse to list of data.frame's dff <- Map(function(elt) { read.csv(textConnection(elt), sep="\t", header=FALSE) }, res) dff[["chr1:1-100000"]][1:5,1:8] ## parse 100 records at a time length(scanTabix(tbx)[[1]]) # total number of records tbx <- open(TabixFile(fl, yieldSize=100)) while(length(res <- scanTabix(tbx)[[1]])) cat("records read:", length(res), "\n") close(tbx)
Scan compressed, sorted, tabix-indexed, tab-delimited files.
scanTabix(file, ..., param) ## S4 method for signature 'character,IntegerRangesList' scanTabix(file, ..., param) ## S4 method for signature 'character,GRanges' scanTabix(file, ..., param)
scanTabix(file, ..., param) ## S4 method for signature 'character,IntegerRangesList' scanTabix(file, ..., param) ## S4 method for signature 'character,GRanges' scanTabix(file, ..., param)
file |
The character() file name(s) of the tabix file be
processed, or more flexibly an instance of class
|
param |
A instance of |
... |
Additional arguments, currently ignored. |
scanTabix
returns a list, with one element per region. Each element
of the list is a character vector representing records in the region. If
param
is empty then all records will be returned.
scanTabix
signals errors using signalCondition
. The
following errors are signaled:
scanTabix_param
yieldSize(file)
must be NA when
more than one range is specified.
scanTabix_io
A read error occured while inputing the
tabix file. This might be because the file is corrupt, or of
incorrect format (e.g., when path
points to a plain text
file but index
is present, implying that path
should
be a bgzip
ed file. The error message may include an error
code representing the logical OR of these cryptic signals: 1,
BGZF_ERR_ZLIB; 2, BGZF_ERR_HEADER; 4, BGZF_ERR_IO; 8,
BGZF_ERR_MISUSE.
Martin Morgan <[email protected]>.
http://samtools.sourceforge.net/tabix.shtml
example(TabixFile)
example(TabixFile)
Iterate through a BAM file until a paired-end read is encountered or the end of file is reached; report the occurrence of paired-end reads to the user.
testPairedEndBam(file, index=file, ...)
testPairedEndBam(file, index=file, ...)
file |
character(1) BAM file name, or a |
index |
(optional) character(1) name of the index file of the 'BAM' file being processed; this is given without the '.bai' extension. |
... |
Additional arguments, currently unused. |
A logical vector of length 1 containing TRUE is returned if BAM file contained paired end reads, FALSE otherwise.
Martin Morgan mailto:[email protected], Sonali Arora mailto:[email protected]
fl <- system.file("extdata", "ex1.bam", package="Rsamtools") testPairedEndBam(fl)
fl <- system.file("extdata", "ex1.bam", package="Rsamtools") testPairedEndBam(fl)