Title: | FASTQ input and manipulation |
---|---|
Description: | This package implements sampling, iteration, and input of FASTQ files. The package includes functions for filtering and trimming reads, and for generating a quality assessment report. Data are represented as DNAStringSet-derived objects, and easily manipulated for a diversity of purposes. The package also contains legacy support for early single-end, ungapped alignment formats. |
Authors: | Bioconductor Package Maintainer [cre], Martin Morgan [aut], Michael Lawrence [ctb], Simon Anders [ctb], Rohit Satyam [ctb] (Converted Overview.Rnw vignette from Sweave to RMarkdown / HTML.), J Wokaty [ctb] |
Maintainer: | Bioconductor Package Maintainer <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.65.0 |
Built: | 2024-10-31 05:23:30 UTC |
Source: | https://github.com/bioc/ShortRead |
This package implements sampling, iteration, and input of FASTQ files. The package includes functions for filtering and trimming reads, and for generating a quality assessment report. Data are represented as DNAStringSet-derived objects, and easily manipulated for a diversity of purposes. The package also contains legacy support for early single-end, ungapped alignment formats.
See packageDescription('ShortRead')
Maintainer: Martin Morgan <[email protected]>
Classes derived from .QA-class
represent results of quality
assurance analyses. Details of derived class structure are found on
the help pages of the derived classes.
Objects from the class are created by ShortRead functions, in
particular qa
.
Class ".ShortReadBase"
, directly.
Methods defined on this class include:
signature(...="list")
: rbind data frame objects
in ...
. All objects of ...
must be of the same
class; the return value is an instance of that class.
signature(object = "SolexaExportQA")
: Display an
overview of the object contents.
Martin Morgan <[email protected]>
Specific classes derived from .QA
getClass(".QA", where=getNamespace("ShortRead"))
getClass(".QA", where=getNamespace("ShortRead"))
These functions and generics define ‘accessors’ (to get and set values) for objects in the ShortRead package; methods defined in other packages may have additional meaning.
## SRVector vclass(object, ...) ## AlignedRead chromosome(object, ...) position(object, ...) alignQuality(object, ...) alignData(object, ...) ## Solexa experimentPath(object, ...) dataPath(object, ...) scanPath(object, ...) imageAnalysisPath(object, ...) baseCallPath(object, ...) analysisPath(object, ...) ## SolexaSet solexaPath(object, ...) laneDescription(object, ...) laneNames(object, ...)
## SRVector vclass(object, ...) ## AlignedRead chromosome(object, ...) position(object, ...) alignQuality(object, ...) alignData(object, ...) ## Solexa experimentPath(object, ...) dataPath(object, ...) scanPath(object, ...) imageAnalysisPath(object, ...) baseCallPath(object, ...) analysisPath(object, ...) ## SolexaSet solexaPath(object, ...) laneDescription(object, ...) laneNames(object, ...)
object |
An object derived from class |
... |
Additional arguments passed to the accessor. The default definitions do not make use of additional arguments. |
Usually, the value of the corresponding slot, or other simple content
described on the help page of object
.
Martin Morgan
sp <- SolexaPath(system.file('extdata', package='ShortRead')) experimentPath(sp) basename(analysisPath(sp))
sp <- SolexaPath(system.file('extdata', package='ShortRead')) experimentPath(sp) basename(analysisPath(sp))
Construct an AlignedDataFrame
from a data frame and its
metadata
AlignedDataFrame(data, metadata, nrow = nrow(data))
AlignedDataFrame(data, metadata, nrow = nrow(data))
data |
A data frame containing alignment information. |
metadata |
A data frame describing the columns of |
nrow |
An optional argument, to be used when |
An object of AlignedDataFrame
.
Martin Morgan <[email protected]>
This class extends
AnnotatedDataFrame
. It
is a data frame and associated metadata (describing the columns of the
data frame). The main purpose of this class is to contain alignment
data in addition to the central information of
AlignedRead
.
Objects from the class are created by calls to the
AlignedDataFrame
function.
data
:Object of class "data.frame"
containing
the data. See AnnotatedDataFrame
for details.
varMetadata
:Object of class "data.frame"
describing columns of data
. See
AnnotatedDataFrame
for details.
dimLabels
:Object of class character
describing
the dimensions of the AnnotatedDataFrame. Used internally; see
AnnotatedDataFrame
for details.
.__classVersion__
:Object of class "Versions"
describing the version of this object. Used internally; see
AnnotatedDataFrame
for details.
Class "AnnotatedDataFrame"
, directly.
Class "Versioned"
, by class "AnnotatedDataFrame", distance 2.
This class inherits methods pData
(to retrieve the underlying
data frame) and varMetadata
(to retrieve the metadata) from
AnnotatedDataFrame
.
Additional methods include:
signature(x = "AlignedDataFrame", values = "AlignedDataFrame")
:
append values
after x
. varMetadata
of
x
and y
must be identical; pData
and
varMetadata
are appended using rbind
.
Martin Morgan <[email protected]>
This function constructs objects of
AlignedRead
. It will often be more convenient to
create AlignedRead
objects using parsers such as
readAligned
.
AlignedRead(sread, id, quality, chromosome, position, strand, alignQuality, alignData = AlignedDataFrame(nrow = length(sread)))
AlignedRead(sread, id, quality, chromosome, position, strand, alignQuality, alignData = AlignedDataFrame(nrow = length(sread)))
sread |
An object of class |
id |
An object of class |
quality |
An object of class |
chromosome |
A |
position |
A |
strand |
A |
alignQuality |
A |
alignData |
An |
An object of class AlignedRead
.
Martin Morgan <[email protected]>
This class represents and manipulates reads and their genomic alignments. Alignment information includes genomic position, strand, quality, and other data.
Objects of this class can be created from a call to the
AlignedRead
constructor, or more typically by parsing
appropriate files (e.g., readAligned
).
chromosome
Object of class "factor"
the
particular sequence within a set of target sequences
(e.g. chromosomes in a genome assembly) to which each short read
aligns.
position
Object of class "integer"
the
(base-pair) position in the genome to which the read is
aligned. AlignedRead objects created by readAligned use 1-based
indexing, with alignemnts reported in ‘left-most’
coordinates, as described in the vignette.
strand
Object of class "factor"
the strand of
the alignment.
alignQuality
Object of class "numeric"
representing an alignment quality score.
alignData
Object of class "AlignedDataFrame"
additional alignment information.
quality
Object of class "BStringSet"
representing base-call read quality scores.
sread
Object of class "DNAStringSet"
DNA
sequence of the read.
id
Object of class "BStringSet"
read
identifier.
Class "ShortReadQ"
, directly.
Class "ShortRead"
, by class "ShortReadQ", distance 2.
Class ".ShortReadBase"
, by class "ShortReadQ", distance 3.
See accessors
for additional functions to access slot
content, and ShortReadQ
,
ShortRead
for inherited methods. Additional
methods include:
signature(x = "AlignedRead", i = "ANY", j = "missing")
:
This method creates a new AlignedRead
object containing only
those reads indexed by i
. chromosome
is recoded to
contain only those levels in the new subset.
signature(x = "AlignedRead", values = "AlignedRead")
:
append values
after x
. chromosome
and
strand
must be factors with the same levels. See methods
for ShortReadQ
, AlignedDataFrame
for details of how
these components of x
and y
are appended.
signature(from = "PairwiseAlignments", to = "AlignedRead")
:
signature(from = "AlignedRead", to = "IntegerRangesList")
:
signature(from = "AlignedRead", to = "GRanges")
:
signature(from = "AlignedRead", to = "GAlignments")
:
signature(from = "AlignedRead", to = "GappedReads")
:
Invoke these methods with, e.g., as(from, "AlignedRead")
to coerce objects of class from
to class
"AlignedRead"
.
Coercion from AlignedRead to IntegerRangesList or
GRanges assumes that
position(from)
uses a ‘leftmost’ (see
coverage
on this page) coordinate
system. Since IntegerRangesList objects cannot
store NA
values, reads with NA
in the
position
, width
, chromosome
or (in the
case of GRanges) strand
vectors are dropped.
signature(object = "AlignedRead")
: access the
chromosome slot of object
.
signature(object = "AlignedRead")
: access the
position slot of object
.
signature(object = "AlignedRead")
: access the
strand slot of object
.
signature(x = "AlignedRead", shift = 0L, width = NULL, weight = 1L, ..., coords = c("leftmost", "fiveprime"), extend=0L)
:
Calculate coverage across reads present in x
.
shift
must be either 0L
or a named integer vector
with names including all levels(chromosome(x))
.
It specifies how the reads in x
should be (horizontally)
shifted before the coverage is computed.
width
must be either NULL
or a named vector of
non-negative integers with names including all
levels(chromosome(x))
. In the latter case, it specifies
for each chromosome the end of the chromosome region over which
coverage is to be calculated after the reads have been
shifted. Note that this region always starts at chromosome
position 1. If width
is NULL
, it ends at the
rightmost chromosome position covered by at least one read.
weight
must be 1L
for now (weighting the reads is
not supported yet, sorry).
coords
specifies the coordinate system used to record
position. Both systems number base pairs from left to right on the
5' strand. leftmost
indicates the eland convention, where
position(x)
is the left-most (minimum) base pair,
regardless of strand. fiveprime
is the MAQ convention,
where position(x)
is the coordinate of the 5' end of the
aligned read.
extend
indicates the number of base pairs to extend the
read. Extension is in the 3' direction, measured from the
3' end of the aligned read.
The return value of coverage
is a SimpleRleList
object.
signature(x = "AlignedRead", table = "IntegerRangesList")
:
Return a length(x) logical vector indicating whether the
chromosome, position, and width of x
overlap with ranges in
table
. Reads for which chromosome()
,
position()
, or width()
return NA
never
overlap with table
. This function assumes that positions
are in ‘leftmost’ coordinates, as defined in
coverage
.
signature(x = "AlignedRead", ..., withSread=TRUE)
:
signature(x = "AlignedRead", ..., withSread=TRUE)
:
signature(x = "AlignedRead", ..., withSread=TRUE)
:
signature(x = "AlignedRead", ..., withSread=TRUE)
:
Order, rank, sort, and find duplicates in AlignedRead
objects. Reads are sorted by chromosome
, strand
,
position
, and then (if withSread=TRUE
) sread
;
less fine-grained sorting can be accomplished with, e.g.,
x[srorder(sread(x))]
. srduplicated
behaves like
duplicated
, i.e., the first copy of a duplicate is
FALSE
while the remaining copies are TRUE
.
signature(object = "AlignedRead")
: provide a
compact display of the AlignedRead
content.
signature(x = "AlignedRead")
: display
alignData
in more detail.
Martin Morgan <[email protected]>
showMethods(class="AlignedRead", where=getNamespace("ShortRead")) dirPath <- system.file('extdata', 'maq', package='ShortRead') (aln <- readAligned(dirPath, 'out.aln.1.txt', type="MAQMapview")) coverage(aln)[[1]] cvg <- coverage(aln, shift=c(ChrA=10L)) ## remove 0 coverage on left ends ltrim0 <- function(x) { i <- !cumprod(runValue(x) == 0) Rle(runValue(x)[i], runLength(x)[i]) } endoapply(cvg, ltrim0) ## demonstration of show() and detail() methods show(aln) detail(aln)
showMethods(class="AlignedRead", where=getNamespace("ShortRead")) dirPath <- system.file('extdata', 'maq', package='ShortRead') (aln <- readAligned(dirPath, 'out.aln.1.txt', type="MAQMapview")) coverage(aln)[[1]] cvg <- coverage(aln, shift=c(ChrA=10L)) ## remove 0 coverage on left ends ltrim0 <- function(x) { i <- !cumprod(runValue(x) == 0) Rle(runValue(x)[i], runLength(x)[i]) } endoapply(cvg, ltrim0) ## demonstration of show() and detail() methods show(aln) detail(aln)
alphabetByCycle
summarizes nucleotides, amino acid, or qualities
by cycle, e.g., returning the number of occurrences of each nucleotide
A, T, G, C
across all reads from 36 cycles of a Solexa lane.
alphabetByCycle(stringSet, alphabet, ...)
alphabetByCycle(stringSet, alphabet, ...)
stringSet |
A R object representing the collection of reads, amino acid sequences, or quality scores, to be summarized. |
alphabet |
The alphabet (character vector of length 1 strings)
from which the sequences in |
... |
Additional arguments, perhaps used by methods defined on this generic. |
The default method requires that stringSet
extends the
XStringSet
class of
Biostrings.
The following method is defined, in addition to methods described in class-specific documentation:
signature(stringSet = "BStringSet")
:
this method uses an alphabet spanning all ASCII characters, codes
1:255
.
A matrix with number of rows equal to the length of alphabet
and columns equal to the maximum width of reads or quality scores in
the string set. Entries in the matrix are the number of times, over
all reads of the set, that the corresponding letter of the alphabet
(row) appeared at the specified cycle (column).
Martin Morgan
The IUPAC alphabet in Biostrings.
http://www.bioperl.org/wiki/FASTQ_sequence_format for the BioPerl definition of fastq.
Solexa documentation 'Data analysis - documentation : Pipeline output and visualisation'.
showMethods("alphabetByCycle") sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") alphabetByCycle(sread(rfq)) abcq <- alphabetByCycle(quality(rfq)) dim(abcq) ## 'high' scores, first and last cycles abcq[64:94,c(1:5, 32:36)]
showMethods("alphabetByCycle") sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") alphabetByCycle(sread(rfq)) abcq <- alphabetByCycle(quality(rfq)) dim(abcq) ## 'high' scores, first and last cycles abcq[64:94,c(1:5, 32:36)]
This generic takes a QualityScore
or
PhredQuality
object and calculates, for each read, the sum of
the encoded nucleotide probabilities.
alphabetScore(object, ...)
alphabetScore(object, ...)
object |
An object of class |
... |
Additional arguments, currently unused. |
A vector of numeric values of length equal to the length of
object
.
Martin Morgan <[email protected]>
This class contains a list-like structure with summary descriptions derived from visiting one or more Bowtie files.
Objects of the class are usually produced by a qa
method, with the argument type="Bowtie"
.
.srlist
:Object of class "list"
, containing
data frames or lists of data frames summarizing the results of
qa.
Class "SRList"
, directly.
Class ".QA"
, directly.
Class ".SRUtil"
, by class "SRList", distance 2.
Class ".ShortReadBase"
, by class ".QA", distance 2.
Accessor methods are inherited from the SRList
class.
signature(x="BowtieQA", ...,
dest=tempfile(), type="html")
: produces an
html file summarizing the QA results.
Martin Morgan <[email protected]>
qa
.
showClass("BowtieQA")
showClass("BowtieQA")
Short reads may contain ambiguous base calls (i.e., IUPAC symbols different from A, T, G, C). This generic removes all sequences containing 1 or more ambiguous bases.
clean(object, ...)
clean(object, ...)
object |
An object for which |
... |
Additional arguments, perhaps used by methods. |
The following method is defined, in addition to methods described in class-specific documentation:
signature(x = "DNAStringSet")
:
Remove all sequences containing non-base (A, C, G, T) IUPAC
symbols.
An instance of class(object)
, containing only sequences with
non-redundant nucleotides.
Martin Morgan <[email protected]>
showMethods('clean')
showMethods('clean')
countLines
visits all files in a directory path dirPath
whose base (i.e., file) name matches pattern
. Lines in the file
are counted as the number of new line characters.
countLines(dirPath, pattern=character(0), ..., useFullName=FALSE)
countLines(dirPath, pattern=character(0), ..., useFullName=FALSE)
dirPath |
A character vector (or other object; see methods defined on this generic) giving the directory path (relative or absolute) of files whose lines are to be counted. |
pattern |
The ( |
... |
Additional arguments, passed internally to list.files. See
|
useFullName |
A |
A named integer vector of line counts. Names are paths to the files
whose lines have been counted, excluding dirPath
.
Martin Morgan
sp <- SolexaPath(system.file('extdata', package='ShortRead')) countLines(analysisPath(sp)) countLines(experimentPath(sp), recursive=TRUE) countLines(experimentPath(sp), recursive=TRUE, useFullName=TRUE)
sp <- SolexaPath(system.file('extdata', package='ShortRead')) countLines(analysisPath(sp)) countLines(experimentPath(sp), recursive=TRUE) countLines(experimentPath(sp), recursive=TRUE, useFullName=TRUE)
These functions were introduced but are now deprecated or defunct.
Defunct functions:
srapply
. Use the BiocParallel package instead.
readAligned,BamFile-method
. Use the GenomicAlignments
package instead.
basePath()
dustyScore
identifies low-complexity sequences, in a manner
inspired by the dust
implementation in BLAST
.
dustyScore(x, batchSize=NA, ...)
dustyScore(x, batchSize=NA, ...)
x |
A |
batchSize |
|
... |
Additional arguments, not currently used. |
The following methods are defined:
signature(x = "DNAStringSet")
: operating on
an object derived from class DNAStringSet
.
signature(x = "ShortRead")
: operating on
the sread
of an object derived from class
ShortRead
.
The dust-like calculations used here are as implemented at https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-February/000170.html. Scores range from 0 (all triplets unique) to the square of the width of the longest sequence (poly-A, -C, -G, or -T).
The batchSize
argument can be used to reduce the memory
requirements of the algorithm by processing the x
argument in
batches of the specified size. Smaller batch sizes use less memory,
but are computationally less efficient.
A vector of numeric scores, with length equal to the length of
x
.
Herve Pages (code); Martin Morgan
Morgulis, Getz, Schaffer and Agarwala, 2006. WindowMasker: window-based masker for sequenced genomes, Bioinformatics 22: 134-141.
The WindowMasker supplement defining dust
ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf
sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") range(dustyScore(rfq))
sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") range(dustyScore(rfq))
Short read technologies often produce a hierarchy of output files. The
content of the hierarchy varies. This class represents the root of the
file hierarchy. Specific classes (e.g.,
SolexaPath
) represent different technologies.
Objects from the class are created by calls to the constructor:
ExperimentPath(experimentPath)
character(1)
object pointing to the
top-level directory of the experiment; see specific technology
classes for additional detail.
(optional) logical vector which, when
TRUE
results in warnings if paths do not exist.
All paths must be fully-specified.
ExperimentPath
has one slot, containing a fully
specified path to the corresponding directory (described above).
basePath
See above.
The slot is accessed with experimentPath
.
Class ".ShortReadBase"
, directly.
Methods include:
signature(object = "ExperimentPath")
: briefly
summarize the file paths of object
.
signature(x = "ExperimentPath")
: summarize
file paths of x
.
Michael Lawrence
showClass("ExperimentPath")
showClass("ExperimentPath")
FastqFile
represents a path and connection to a fastq
file. FastqFileList
is a list of such connections.
FastqSampler
draws a subsample from a fastq file. yield
is the method used to extract the sample from the FastqSampler
instance; a short illustration is in the example
below. FastqSamplerList
is a list of FastqSampler
elements.
FastqStreamer
draws successive subsets from a fastq file, a
short illustration is in the example below. FastqStreamerList
is a list of FastqStreamer
elements.
## FastqFile and FastqFileList FastqFile(con, ...) FastqFileList(..., class="FastqFile") ## S3 method for class 'ShortReadFile' open(con, ...) ## S3 method for class 'ShortReadFile' close(con, ...) ## S4 method for signature 'FastqFile' readFastq(dirPath, pattern=character(), ...) ## S4 method for signature 'FastqFile' countFastq(dirPath, pattern=character(), ...) ## FastqSampler and FastqStreamer FastqSampler(con, n=1e6, readerBlockSize=1e8, verbose=FALSE, ordered = FALSE) FastqSamplerList(..., n=1e6, readerBlockSize=1e8, verbose=FALSE, ordered = FALSE) FastqStreamer(con, n, readerBlockSize=1e8, verbose=FALSE) FastqStreamerList(..., n, readerBlockSize=1e8, verbose=FALSE) yield(x, ...)
## FastqFile and FastqFileList FastqFile(con, ...) FastqFileList(..., class="FastqFile") ## S3 method for class 'ShortReadFile' open(con, ...) ## S3 method for class 'ShortReadFile' close(con, ...) ## S4 method for signature 'FastqFile' readFastq(dirPath, pattern=character(), ...) ## S4 method for signature 'FastqFile' countFastq(dirPath, pattern=character(), ...) ## FastqSampler and FastqStreamer FastqSampler(con, n=1e6, readerBlockSize=1e8, verbose=FALSE, ordered = FALSE) FastqSamplerList(..., n=1e6, readerBlockSize=1e8, verbose=FALSE, ordered = FALSE) FastqStreamer(con, n, readerBlockSize=1e8, verbose=FALSE) FastqStreamerList(..., n, readerBlockSize=1e8, verbose=FALSE) yield(x, ...)
con , dirPath
|
A character string naming a connection, or (for
|
n |
For |
readerBlockSize |
The number of bytes or characters to be read at
one time; smaller |
verbose |
Display progress. |
ordered |
logical(1) indicating whether sampled reads should be returned in the same order as they were encountered in the file. |
x |
An instance from the |
... |
Additional arguments. For |
pattern |
Ignored. |
class |
For developer use, to specify the underlying class
contained in the |
Available classes include:
FastqFile
A file path and connection to a fastq file.
FastqFileList
A list of FastqFile
instances.
FastqSampler
Uniformly sample records from a fastq file.
FastqStreamer
Iterate over a fastq file, returning successive parts of the file.
The following methods are available to users:
readFastq,FastqFile-method
:see also
?readFastq
.
writeFastq,ShortReadQ,FastqFile-method
:see also
?writeFastq
,
?"writeFastq,ShortReadQ,FastqFile-method"
.
countFastq,FastqFile-method
:see also
?countFastq
.
yield
:Draw a single sample from the
instance. Operationally this requires that the underlying data
(e.g., file) represented by the Sampler
instance be
visited; this may be time consuming.
FastqSampler
and FastqStreamer
use OpenMP threads,
when available, during creation of the return value. This may cause
the OpenMP implementation 'libgomp' to produce an error, if these
functions are called in a parallel R process, e.g.:
libgomp: Thread creation failed: Resource temporarily unavailable
A solution is to precede problematic code with the following code snippet, to disable OpenMP multi-threading:
nthreads <- .Call(ShortRead:::.set_omp_threads, 1L) on.exit(.Call(ShortRead:::.set_omp_threads, nthreads))
readFastq
, writeFastq
,
countFastq
, yield
.
sp <- SolexaPath(system.file('extdata', package='ShortRead')) fl <- file.path(analysisPath(sp), "s_1_sequence.txt") f <- FastqFile(fl) rfq <- readFastq(f) close(f) f <- FastqSampler(fl, 50) yield(f) # sample of size n=50 yield(f) # independent sample of size 50 close(f) ## Return sample as ordered in original file f <- FastqSampler(fl, 50, ordered=TRUE) yield(f) close(f) f <- FastqStreamer(fl, 50) yield(f) # records 1 to 50 yield(f) # records 51 to 100 close(f) ## iterating over an entire file f <- FastqStreamer(fl, 50) while (length(fq <- yield(f))) { ## do work here print(length(fq)) } close(f) ## iterating over IRanges rng <- IRanges(c(50, 100, 200), width=10:8) f <- FastqStreamer(fl, rng) while (length(fq <- yield(f))) { print(length(fq)) } close(f) ## Internal fields, methods, and help; for developers ShortRead:::.FastqSampler_g$methods() ShortRead:::.FastqSampler_g$fields() ShortRead:::.FastqSampler_g$help("yield")
sp <- SolexaPath(system.file('extdata', package='ShortRead')) fl <- file.path(analysisPath(sp), "s_1_sequence.txt") f <- FastqFile(fl) rfq <- readFastq(f) close(f) f <- FastqSampler(fl, 50) yield(f) # sample of size n=50 yield(f) # independent sample of size 50 close(f) ## Return sample as ordered in original file f <- FastqSampler(fl, 50, ordered=TRUE) yield(f) close(f) f <- FastqStreamer(fl, 50) yield(f) # records 1 to 50 yield(f) # records 51 to 100 close(f) ## iterating over an entire file f <- FastqStreamer(fl, 50) while (length(fq <- yield(f))) { ## do work here print(length(fq)) } close(f) ## iterating over IRanges rng <- IRanges(c(50, 100, 200), width=10:8) f <- FastqStreamer(fl, rng) while (length(fq <- yield(f))) { print(length(fq)) } close(f) ## Internal fields, methods, and help; for developers ShortRead:::.FastqSampler_g$methods() ShortRead:::.FastqSampler_g$fields() ShortRead:::.FastqSampler_g$help("yield")
filterFastq
filters reads from source to destination file(s)
applying a filter to reads in each file. The filter can be a function
or FilterRules instance; operations are done in a memory-efficient
manner.
filterFastq(files, destinations, ..., filter = FilterRules(), compress=TRUE, yieldSize = 1000000L)
filterFastq(files, destinations, ..., filter = FilterRules(), compress=TRUE, yieldSize = 1000000L)
files |
a character vector of valid file paths. |
destinations |
a character vector of destinations, recycled to be
the same length as |
... |
Additional arguments, perhaps used by a |
filter |
A simple function taking as it's first argument a
|
compress |
A logical(1) indicating whether the file should be
gz-compressed. The default is |
yieldSize |
Number of fastq records processed in each call to
|
Martin Morgan [email protected]
## path to a convenient fastq file sp <- SolexaPath(system.file('extdata', package='ShortRead')) fl <- file.path(analysisPath(sp), "s_1_sequence.txt") ## filter reads to keep those with GC < 0.7 fun <- function(x) { gc <- alphabetFrequency(sread(x), baseOnly=TRUE)[,c("G", "C")] x[rowSums(gc) / width(x) < .7] } filterFastq(fl, tempfile(), filter=fun) ## trimEnds,character-method uses filterFastq internally trimEnds(fl, "V", destinations=tempfile())
## path to a convenient fastq file sp <- SolexaPath(system.file('extdata', package='ShortRead')) fl <- file.path(analysisPath(sp), "s_1_sequence.txt") ## filter reads to keep those with GC < 0.7 fun <- function(x) { gc <- alphabetFrequency(sread(x), baseOnly=TRUE)[,c("G", "C")] x[rowSums(gc) / width(x) < .7] } filterFastq(fl, tempfile(), filter=fun) ## trimEnds,character-method uses filterFastq internally trimEnds(fl, "V", destinations=tempfile())
The Intensity
, IntensityMeasure
, and IntensityInfo
classes represent and manipulate image intensity measures. Instances
from the class may also contain information about measurement errors,
and additional information about the reads from which the intensities
are derived.
Intensity
, and IntensityMeasure
, are virtual classes,
and cannot be created directly. Classes derived from
IntensityMeasure
(e.g., ArrayIntensity
) and
Intensity
(e.g., SolexaIntensity
) are used
to represent specific technologies.
ArrayIntensity
objects can be created with calls of the form
ArrayIntensity(array(0, c(1,2,3)))
.
Objects of derived classes can be created from calls such as the
SolexaIntensity
constructor, or more typically by
parsing appropriate files (e.g., readIntensities
).
Class Intensity
has slots:
readInfo
:Object of class "IntensityInfo"
containing columns for the lane, tile, x, and y coordinates of the
read.
intensity
:Object of class
"IntensityMeasure"
containing image intensity data for each
read and cycle.
measurementError
:Object of class
"IntensityMeasure"
containing measures of image intensity
uncertainty for each read and cycle.
.hasMeasurementError
:Length 1 logical variable indicating whether intensity standard errors are included (internal use only).
Classes IntensityInfo
and IntensityMeasure
are virtual
classes, and have no slots.
These classes extend ".ShortReadBase"
, directly.
Methods and accessor functions for Intensity
include:
signature(object = "Intensity")
:
access the readInfo
slot of object
.
signature(object = "Intensity")
: access the
intensity
slot of object
.
signature(object = "Intensity")
:
access the nse
slot of object
, or signal an error if
no standard errors are available.
signature(object = "Intensity")
: return the
dimensions (e.g., number of reads by number of cycles) represented
by object
.
signature(object = "Intensity")
: provide a
compact representation of the object.
Subsetting "["
is available for the IntensityMeasure
class; the drop
argument to "["
is ignored.
Subsetting with "[["
is available for the ArrayIntensity
class. The method accepts three arguments, corresponding to the read,
base, and cycle(s) to be selected. The return value is the array
(i.e., underlying data values) corresponding to the selected indices.
Martin Morgan <[email protected]>
showMethods(class="Intensity", where=getNamespace("ShortRead")) example(readIntensities)
showMethods(class="Intensity", where=getNamespace("ShortRead")) example(readIntensities)
This class contains a list-like structure with summary descriptions derived from visiting one or more MAQMap files.
Objects of the class are usually produced by a qa
method.
.srlist
:Object of class "list"
, containing
data frames or lists of data frames summarizing the results of
qa.
Class "SRList"
, directly.
Class ".QA"
, directly.
Class ".SRUtil"
, by class "SRList", distance 2.
Class ".ShortReadBase"
, by class ".QA", distance 2.
Accessor methods are inherited from the SRList
class.
signature(x="MAQMapQA", ...,
dest=tempfile(), type="html")
: produces an
html file summarizing the QA results.
Martin Morgan <[email protected]>
qa
.
showClass("MAQMapQA")
showClass("MAQMapQA")
This function is a common interface to quality assessment functions
available in ShortRead
. Results from this function may be
displayed in brief, or integrated into reports using, e.g.,
report
.
qa(dirPath, ...) ## S4 method for signature 'character' qa(dirPath, pattern=character(0), type=c("fastq", "SolexaExport", "SolexaRealign", "Bowtie", "MAQMap", "MAQMapShort"), ...) ## S4 method for signature 'list' qa(dirPath, ...)
qa(dirPath, ...) ## S4 method for signature 'character' qa(dirPath, pattern=character(0), type=c("fastq", "SolexaExport", "SolexaRealign", "Bowtie", "MAQMap", "MAQMapShort"), ...) ## S4 method for signature 'list' qa(dirPath, ...)
dirPath |
A character vector or other object (e.g.,
|
pattern |
A character vector limiting the files in |
type |
The type of file being parsed; must be a character vector of length 1, selected from one of the types enumerated in the parameter. |
... |
Additional arguments used by methods.
|
The most common use of this function provides a directory path and pattern identifying FASTQ files for quality assessment. The default is then to create a quality assessment report based on a random sample of n=1000000 reads from each file.
The following methods are defined, in addition to those on S4 formal classes documented elsewhere:
qa,character-method
Quality assessment is performed on all files in directory
dirPath
whose file name matches pattern
. The type of
analysis performed is based on the type
argument. Use
SolexaExport
when all files matching pattern
are
Solexa _export.txt
files. Use SolexaRealign
for
Solexa _realign.txt
files. Use Bowtie
for Bowtie
files. Use MAQMapShort
for MAQ map
files produced by
MAQ versions below 0.70 and MAQMap
for more recent output.
Use fastq
for collections of fastq-format files. Quality
assessment details vary depending on data source.
qa,list-method
dirPath
is a list of objects, all of the same class and
typically derived from ShortReadQ
, on which quality
assessment is performed. All elements of the list must have names,
and these should be unique.
An object derived from class .QA
. Values
contained in this object are meant for use by report
Martin Morgan <[email protected]>
.QA
,
SolexaExportQA
MAQMapQA
FastqQA
dirPath <- system.file(package="ShortRead", "extdata", "E-MTAB-1147") ## sample 1M reads / file qa <- qa(dirPath, "fastq.gz", BPPARAM=SerialParam()) if (interactive()) browseURL(report(qa)) showMethods("qa", where=getNamespace("ShortRead"))
dirPath <- system.file(package="ShortRead", "extdata", "E-MTAB-1147") ## sample 1M reads / file qa <- qa(dirPath, "fastq.gz", BPPARAM=SerialParam()) if (interactive()) browseURL(report(qa)) showMethods("qa", where=getNamespace("ShortRead"))
Classes derived from .QA-class
represent results of quality
assurance analyses.
Users create instances of many of these classes by calling the
corresponding constructors, as documented on the help page for
qa2
. Classes constructed in this way include
QACollate
, QAFastqSource
,
QAAdapterContamination
,
QAFrequentSequence
, QANucleotideByCycle
,
QANucleotideUse
, QAQualityByCycle
,
QAQualityUse
, QAReadQuality
, and
QASequenceUse
.
The classes QASource
, QAFiltered
, QAFlagged
and
QASummary
are generated internally, not by users.
.QA2
extends class ".ShortReadBase"
,
directly.
QASummary
is a virtual class extending .QA2
; all
user-creatable classes extend QASummary
.
QASource
extends QASummary
. All classes used to
represent raw data input (QAFastqSource
) extend
QASource
.
QAData
is a reference class, used to contain a single instance
of the fastq used in all QA Summary steps.
QACollate
extends .QA2
. It contains a SimpleList
instance with zero or more QASummary
elements.
QA
extends .QA2
, and contains a SimpleList
of
zero or more QASummary
elements. This class represents the
results of the qa2
analysis.
Methods defined on this class include:
signature(object="QACollate", state, ...,
verbose=FALSE)
creates a QA report from the elements of
QACollate
. Methods on qa2
for objects extending
class QASummary
summarize QA statistics for that class,
e.g., qa2,QAFrequentSequences-method
implements the
calculations required to summarize frequently used sequences,
using data in state
.
signature(x="QA", ...)
creates an HTML
report. Methods on report
for objects extending class
QASummary
are responsible for creating the html snippet for
that QA component.
signature(object=".QA2", ..., verbose=FALSE)
implements criteria to flag individual lanes as failing quality
assessment. NOTE: flag is not fully implemented.
signature(...="QASummary")
: rbind multiple
summary elements of the same class, as when these have been
created by separately calculating statistics on a number of fastq
files.
signature(object = "SolexaExportQA")
: Display an
overview of the object contents.
Martin Morgan <[email protected]>
Specific classes derived from .QA2
getClass(".QA2", where=getNamespace("ShortRead"))
getClass(".QA2", where=getNamespace("ShortRead"))
This page summarizes an updated approach to quality assessment reports
in ShortRead
.
## Input source for short reads QAFastqSource(con = character(), n = 1e+06, readerBlockSize = 1e+08, flagNSequencesRange = NA_integer_, ..., html = system.file("template", "QASources.html", package="ShortRead")) QAData(seq = ShortReadQ(), filter = logical(length(seq)), ...) ## Possible QA elements QAFrequentSequence(useFilter = TRUE, addFilter = TRUE, n = NA_integer_, a = NA_integer_, flagK=.8, reportSequences = FALSE, ...) QANucleotideByCycle(useFilter = TRUE, addFilter = TRUE, ...) QANucleotideUse(useFilter = TRUE, addFilter = TRUE, ...) QAQualityByCycle(useFilter = TRUE, addFilter = TRUE, ...) QAQualityUse(useFilter = TRUE, addFilter = TRUE, ...) QAReadQuality(useFilter = TRUE, addFilter = TRUE, flagK = 0.2, flagA = 30L, ...) QASequenceUse(useFilter = TRUE, addFilter = TRUE, ...) QAAdapterContamination(useFilter = TRUE, addFilter = TRUE, Lpattern = NA_character_, Rpattern = NA_character_, max.Lmismatch = 0.1, max.Rmismatch = 0.2, min.trim = 9L, ...) ## Order QA report elements QACollate(src, ...) ## perform analysis qa2(object, state, ..., verbose=FALSE) ## Outputs from qa2 QA(src, filtered, flagged, ...) QAFiltered(useFilter = TRUE, addFilter = TRUE, ...) QAFlagged(useFilter = TRUE, addFilter = TRUE, ...) ## Summarize results as html report ## S4 method for signature 'QA' report(x, ..., dest = tempfile(), type = "html") ## additional methods; 'flag' is not fully implemented flag(object, ..., verbose=FALSE) ## S4 method for signature 'QASummary' rbind(..., deparse.level = 1)
## Input source for short reads QAFastqSource(con = character(), n = 1e+06, readerBlockSize = 1e+08, flagNSequencesRange = NA_integer_, ..., html = system.file("template", "QASources.html", package="ShortRead")) QAData(seq = ShortReadQ(), filter = logical(length(seq)), ...) ## Possible QA elements QAFrequentSequence(useFilter = TRUE, addFilter = TRUE, n = NA_integer_, a = NA_integer_, flagK=.8, reportSequences = FALSE, ...) QANucleotideByCycle(useFilter = TRUE, addFilter = TRUE, ...) QANucleotideUse(useFilter = TRUE, addFilter = TRUE, ...) QAQualityByCycle(useFilter = TRUE, addFilter = TRUE, ...) QAQualityUse(useFilter = TRUE, addFilter = TRUE, ...) QAReadQuality(useFilter = TRUE, addFilter = TRUE, flagK = 0.2, flagA = 30L, ...) QASequenceUse(useFilter = TRUE, addFilter = TRUE, ...) QAAdapterContamination(useFilter = TRUE, addFilter = TRUE, Lpattern = NA_character_, Rpattern = NA_character_, max.Lmismatch = 0.1, max.Rmismatch = 0.2, min.trim = 9L, ...) ## Order QA report elements QACollate(src, ...) ## perform analysis qa2(object, state, ..., verbose=FALSE) ## Outputs from qa2 QA(src, filtered, flagged, ...) QAFiltered(useFilter = TRUE, addFilter = TRUE, ...) QAFlagged(useFilter = TRUE, addFilter = TRUE, ...) ## Summarize results as html report ## S4 method for signature 'QA' report(x, ..., dest = tempfile(), type = "html") ## additional methods; 'flag' is not fully implemented flag(object, ..., verbose=FALSE) ## S4 method for signature 'QASummary' rbind(..., deparse.level = 1)
con |
|
n |
|
readerBlockSize |
integer(1) number of bytes to input, as used by
|
flagNSequencesRange |
|
html |
|
seq |
|
filter |
|
useFilter , addFilter
|
|
a |
|
flagK , flagA
|
|
reportSequences |
|
Lpattern , Rpattern , max.Lmismatch , max.Rmismatch , min.trim
|
Parameters influencing adapter identification, see
|
src |
The source, e.g., |
object |
An instance of class derived from |
.
state |
The data on which quality assessment will be performed; this is not usually necessary for end-users. |
verbose |
|
filtered , flagged
|
Primarily for internal use, instances of
|
x |
An instance of |
dest |
|
type |
|
deparse.level |
see |
... |
Additional arguments, e.g., |
Use QACollate
to specify an order in which components of a QA
report are to be assembled. The first argument is the data source
(e.g., QAFastqSource
).
Functions related to data input include:
QAFastqSource
defines the location of fastq files to
be included in the report. con
is used to construct a
FastqSampler
instance, and records are processed
using qa2,QAFastqSource-method
.
QAData
is a class for representing the data during the QA report generation pass; it is primarily for internal use.
Possible elements in a QA report are:
QAFrequentSequence
identifies the most-commonly
occuring sequences. One of n
or a
can be non-NA, and
determine the number of frequent sequences reported. n
specifies the number of most-frequent sequences to filter, e.g.,
n=10
would filter the top 10 most commonly occurring
sequences; a
provides a threshold frequency (count) above
which reads are filtered. The sample is flagged when a fraction
flagK
of the reads are filtered.
reportSequences
determines whether the most commonly
occuring sequences, as determined by n
or a
, are
printed in the html report.
QANucleotideByCycle
reports nucleotide frequency as a function of cycle.
QAQualityByCycle
reports average quality score as a function of cycle.
QAQualityUse
summarizes overall nucleotide qualities.
QAReadQuality
summarizes the distribution of read qualities.
QASequenceUse
summarizes the cumulative distribution of reads occurring 1, 2, ... times.
QAAdapterContamination
reports the occurrence of ‘adapter’ sequences on the left and / or right end of each read.
An object derived from class .QA
. Values
contained in this object are meant for use by report
Martin Morgan <[email protected]>
QA
.
dirPath <- system.file(package="ShortRead", "extdata", "E-MTAB-1147") fls <- dir(dirPath, "fastq.gz", full=TRUE) coll <- QACollate(QAFastqSource(fls), QAReadQuality(), QAAdapterContamination(), QANucleotideUse(), QAQualityUse(), QASequenceUse(), QAFrequentSequence(n=10), QANucleotideByCycle(), QAQualityByCycle()) x <- qa2(coll, BPPARAM=SerialParam(), verbose=TRUE) res <- report(x) if (interactive()) browseURL(res)
dirPath <- system.file(package="ShortRead", "extdata", "E-MTAB-1147") fls <- dir(dirPath, "fastq.gz", full=TRUE) coll <- QACollate(QAFastqSource(fls), QAReadQuality(), QAAdapterContamination(), QANucleotideUse(), QAQualityUse(), QASequenceUse(), QAFrequentSequence(n=10), QANucleotideByCycle(), QAQualityByCycle()) x <- qa2(coll, BPPARAM=SerialParam(), verbose=TRUE) res <- report(x) if (interactive()) browseURL(res)
Use these functions to construct quality indicators for reads or
alignments. See QualityScore
for details of
object content and methods available for manipulating them.
NumericQuality(quality = numeric(0)) IntegerQuality(quality = integer(0)) MatrixQuality(quality = new("matrix")) FastqQuality(quality, ...) SFastqQuality(quality, ...)
NumericQuality(quality = numeric(0)) IntegerQuality(quality = integer(0)) MatrixQuality(quality = new("matrix")) FastqQuality(quality, ...) SFastqQuality(quality, ...)
quality |
An object used to initialize the data
structure. Appropriate objects are indicated in the constructors
above for Numeric, Integer, and Matrix qualities. For
|
... |
Additional arguments, currently unused. |
Constructors return objects of the corresponding class derived from
QualityScore
.
Martin Morgan <[email protected]>
QualityScore
, readFastq
,
readAligned
nq <- NumericQuality(rnorm(20)) nq quality(nq) quality(nq[10:1])
nq <- NumericQuality(rnorm(20)) nq quality(nq) quality(nq[10:1])
This class hierarchy represents quality scores for short
reads. QualityScore
is a virtual base class, with derived
classes offering different ways of representing qualities. Methods
defined on QualityScore
are implemented in all derived
classes.
Objects from the class are created using constructors (e.g.,
NumericQuality
) named after the class name.
Defined classes are as follows:
Virtual base class; instances cannot be instantiated.
A single numeric vector, where values represent quality scores on an arbitrary scale.
A integer numeric vector, where values represent quality scores on an arbitrary scale.
A rectangular matrix of quality scores, with rows representing reads and columns cycles. The content and interpretation of row and column entries is arbitrary; the rectangular nature implies quality scores from equal-length reads.
‘fastq’ encoded quality scores stored in
a BStringSet
instance. Base qualities of a single read are
represented as an ASCII character string. The integer-valued
quality score of a single base is encoded as its ASCII equivalent
plus 33. The precise definition of the integer-valued quality
score is unspecified, but is usually a Phred score; the meaning
can be determined from the source of the quality scores. Multiple
reads are stored as a BStringSet
, and so can be of varying
lengths.
As with FastqQuality
, but with integer
qualities encoded as ASCII equivalent plus 64.
Class ".ShortReadBase"
, directly.
The following methods are defined on all QualityScore
and
derived classes:
signature(x = "QualityScore", i = "ANY", j = "missing")
signature(x = "MatrixQuality", i = "ANY", j = "missing")
:
Subset the object, with index i
indicating the reads for
which quality scores are to be extracted. The class of the result
is the same as the class of x
. It is an error to provide
any argument other than i
.
signature(x = "QualityScore", i = "ANY", j = "ANY")
:
Subset the object, returning the quality score (e.g., numeric
value) of the i
th read.
signature(x = "MatrixQuality", i = "ANY", j = "ANY")
:
Returns the vector of quality scores associated with the
i
th read.
signature(x = "MatrixQuality")
:
The integer(2) dimension (e.g., number of reads, read width) represented by the quality score.
signature(x = "QualityScore")
:
signature(x = "MatrixQuality")
:
The integer(1) length (e.g., number of reads) represented by the
quality score. Note that length
of MatrixQuailty
is
the number of rows of the corresponding matrix, and not the length
of the corresponding numeric vector.
signature(x = "QualityScore", values = "QualityScore")
:
append values
after x
.
signature(x = "QualityScore")
:
signature(x = "NumericQuality")
:
signature(x = "MatrixQuality")
:
signature(x = "FastqQuality")
:
A numeric vector with length equal to the number of quality
scores, and value equal to the number of quality scores for each
read. For instance, a FastqQuality
will have widths
equal to the number of nucleotides in the underlying short read.
signature(object = "QualityScore")
:
signature(object = "NumericQuality")
:
signature(object = "FastqQuality")
:
provide a brief summary of the object content.
signature(x = "QualityScore")
:
provide a more detailed view of object content.
The following methods are defined on specific classes:
signature(x = "FastqQuality", ...)
:
Return a character vector of valid quality characters.
signature(x = "FastqQuality", ...)
,
signature(x = "SFastqQuality", ...)
:
Returns a named character vector of integer encodings.
signature(stringSet = "FastqQuality")
:
Apply alphabetFrequency
to quality scores, returning a matrix as described in
alphabetFrequency
.
signature(stringSet = "FastqQuality")
:
Apply alphabetByCycle
to quality scores, returning a
matrix as described in alphabetByCycle
.
signature(object = "FastqQuality")
:
signature(object = "SFastqQuality")
:
signature(object = "PhredQuality")
:
Apply alphabetScore
(i.e., summed base quality, per
read) to object
.
signature(from = "FastqQuality", to = "numeric")
:
signature(from = "FastqQuality", to = "matrix")
:
signature(from = "FastqQuality", to = "PhredQuality")
:
signature(from = "SFastqQuality", to = "matrix")
:
signature(from = "SFastqQuality", to = "SolexaQuality")
:
Use as(from, "matrix")
) and similar to coerce objects of
class from
to class to
, using the quality encoding
implied by the class. When to
is “matrix”, the
result is a matrix of type integer
with number of columns
equal to the maximum width of from
; elements i, j
with j > width(from)[i]
have value NA_integer_
. The
result always represents the integer encoding of the corresponding
quality string.
signature(x = "FastqQuality", ...
: reverse the
quality sequence.
signature(x = "FastqQuality", start = NA, end = NA, width = NA, use.names = TRUE)
:
‘narrow’ quality
so that scores are between
start
and end
bases, according to
narrow
in the IRanges
package.
signature(object="FastqQuality", k="integer",
a="character", halfwidth="integer", ..., ranges=FALSE)
:
trim trailing nucleotides when a window of width 2 * halfwidth + 1
contains k
or more quality scores falling at or below
a
.
signature(object="FastqQuality", k="integer",
a="character", successive=FALSE, ..., ranges=FALSE)
: trim
trailing scores if k
scores fall below the quality encoded
by a
. If successive=FALSE
, the k'th failing score
and all subseqent scores are trimmed. If successive=TRUE
,
failing scores must occur successively; the sequence is trimmed
from the first of the successive failing score.
signature(x = "FastqQuality")
:
signature(x = "FastqQuality")
:
signature(x = "FastqQuality")
:
Apply srsort
, srorder
, srrank
, and
srduplicated
to quality scores, returning objects as
described on the appropriate help page.
Integer representations of SFastqQuality
and
FastqQuality
can be obtained with as(x, "matrix")
.
Martin Morgan <[email protected]>
NumericQuality
and other constructors.
names(slot(getClass("QualityScore"), "subclasses")) encoding(FastqQuality()) encoding(SFastqQuality())
names(slot(getClass("QualityScore"), "subclasses")) encoding(FastqQuality()) encoding(SFastqQuality())
Import files containing aligned reads into an internal representation of the alignments, sequences, and quality scores. Most methods (see ‘details’ for exceptions) read all files into a single R object.
readAligned(dirPath, pattern=character(0), ...)
readAligned(dirPath, pattern=character(0), ...)
dirPath |
A character vector (or other object; see methods defined on this generic) giving the directory path (relative or absolute; some methods also accept a character vector of file names) of aligned read files to be input. |
pattern |
The ( |
... |
Additional arguments, used by methods. When |
There is no standard aligned read file format; methods parse particular file types.
The readAligned,character-method
interprets file types based
on an additional type
argument. Supported types are:
type="SolexaExport"
This type parses .*_export.txt
files following the
documentation in the Solexa Genome Alignment software manual,
version 0.3.0. These files consist of the following columns;
consult Solexa documentation for precise descriptions. If parsed,
values can be retrieved from AlignedRead
as
follows:
see below
stored in alignData
stored in alignData
stored in alignData
stored in alignData
stored in alignData
see below
see below
sread
quality
chromosome
alignData
position
strand
Ignored
alignQuality
Ignored
Ignored
Ignored
Ignored
Ignored
alignData
The following optional arguments, set to FALSE
by default,
influence data input
When TRUE
, include the
multiplex index as a column multiplexIndex
in
alignData
.
When TRUE
, include the paired
read number as a column pairedReadNumber
in
alignData
.
When TRUE
, construct an identifier string
as
‘Machine_Run:Lane:Tile:X:Y#multiplexIndex/pairedReadNumber’. The
substrings ‘#multiplexIndex’ and
‘/pairedReadNumber’ are not present if
withMultiplexIndex=FALSE
or
withPairedReadNumber=FALSE
.
A convencience which, when TRUE
, sets all
with*
values to TRUE
.
Note that not all paired read columns are interpreted. Different
interfaces to reading alignment files are described in
SolexaPath
and
SolexaSet
.
type="SolexaPrealign"
See SolexaRealign
type="SolexaAlign"
See SolexaRealign
type="SolexaRealign"
These types parse s_L_TTTT_prealign.txt
,
s_L_TTTT_align.txt
or s_L_TTTT_realign.txt
files
produced by default and eland analyses. From the Solexa
documentation, align
corresponds to unfiltered first-pass
alignments, prealign
adjusts alignments for error rates
(when available), realign
filters alignments to exclude
clusters failing to pass quality criteria.
Because base quality scores are not stored with alignments, the
object returned by readAligned
scores all base qualities as
-32
.
If parsed, values can be retrieved from
AlignedRead
as follows:
stored in sread
stored in alignQuality
stored in alignData
stored in position
stored in strand
Ignored; parse using
readXStringColumns
stored in alignData
type="SolexaResult"
This parses s_L_eland_results.txt
files, an intermediate
format that does not contain read or alignment quality
scores.
Because base quality scores are not stored with alignments, the
object returned by readAligned
scores all base qualities as
-32
.
Columns of this file type can be retrieved from
AlignedRead
as follows (description of
columns is from Table 19, Genome Analyzer Pipeline Software User
Guide, Revision A, January 2008):
Not parsed
stored in sread
Stored in alignData
as
matchCode
. Codes are (from the Eland manual): NM (no
match); QC (no match due to quality control failure); RM (no
match due to repeat masking); U0 (best match was unique and
exact); U1 (best match was unique, with 1 mismatch); U2 (best
match was unique, with 2 mismatches); R0 (multiple exact
matches found); R1 (multiple 1 mismatch matches found, no
exact matches); R2 (multiple 2 mismatch matches found, no
exact or 1-mismatch matches).
stored in alignData
as
nExactMatch
stored in alignData
as nOneMismatch
stored in alignData
as nTwoMismatch
stored in chromosome
stored in position
(direction of match) stored in strand
stored in alignData
, as
NCharacterTreatment
. ‘.’ indicates treatment of
‘N’ was not applicable; ‘D’ indicates treatment
as deletion; ‘|’ indicates treatment as insertion
stored in alignData
as
mismatchDetailOne
and mismatchDetailTwo
. Present
only for unique inexact matches at one or two
positions. Position and type of first substitution error,
e.g., 11A represents 11 matches with 12th base an A in
reference but not read. The reference manual cited below lists
only one field (mismatchDetailOne
), but two are present
in files seen in the wild.
type="MAQMap", records=-1L
Parse binary map
files produced by MAQ. See details in the next section. The
records
option determines how many lines are read;
-1L
(the default) means that all records are input. For
type="MAQMap"
, dir
and pattern
must match a
single file.
type="MAQMapShort", records=-1L
The same as
type="MAQMap"
but for map files made with Maq prior to
version 0.7.0. (These files use a different maximum read length
[64 instead of 128], and are hence incompatible with newer Maq map
files.). For type="MAQMapShort"
, dir
and
pattern
must match a single file.
type="MAQMapview"
Parse alignment files created by MAQ's ‘mapiew’ command. Interpretation of columns is based on the description in the MAQ manual, specifically
...each line consists of read name, chromosome, position, strand, insert size from the outer coordinates of a pair, paired flag, mapping quality, single-end mapping quality, alternative mapping quality, number of mismatches of the best hit, sum of qualities of mismatched bases of the best hit, number of 0-mismatch hits of the first 24bp, number of 1-mismatch hits of the first 24bp on the reference, length of the read, read sequence and its quality.
The read name, read sequence, and quality are read as
XStringSet
objects. Chromosome and strand are read as
factor
s. Position is numeric
, while mapping quality is
numeric
. These fields are mapped to their corresponding
representation in AlignedRead
objects.
Number of mismatches of the best hit, sum of qualities of mismatched
bases of the best hit, number of 0-mismatch hits of the first 24bp,
number of 1-mismatch hits of the first 24bp are represented in the
AlignedRead
object as components of alignData
.
Remaining fields are currently ignored.
type="Bowtie"
Parse alignment files created with the Bowtie alignment
algorithm. Parsed columns can be retrieved from
AlignedRead
as follows:
id
strand
chromosome
position
; see comment below
sread
; see comment below
quality
; see comments below
alignData
, ‘similar’
column; Bowtie v. 0.9.9.3 (12 May, 2009) documents this as
the number of other instances where the same read aligns against the
same reference characters as were aligned against in this
alignment. Previous versions marked this as ‘Reserved’
alignData
‘mismatch’, column
NOTE: the default quality encoding changes to FastqQuality
with ShortRead version 1.3.24.
This method includes the argument qualityType
to specify
how quality scores are encoded. Bowtie quality scores are
‘Phred’-like by default, with
qualityType='FastqQuality'
, but can be specified as
‘Solexa’-like, with qualityType='SFastqQuality'
.
Bowtie outputs positions that are 0-offset from the left-most end
of the +
strand. ShortRead
parses position
information to be 1-offset from the left-most end of the +
strand.
Bowtie outputs reads aligned to the -
strand as their
reverse complement, and reverses the quality score string of these
reads. ShortRead
parses these to their original sequence
and orientation.
type="SOAP"
Parse alignment files created with the SOAP alignment
algorithm. Parsed columns can be retrieved from
AlignedRead
as follows:
id
sread
; see comment below
quality
; see comment below
alignData
alignData
(pairedEnd
)
alignData
(alignedLength
)
strand
chromosome
position
; see comment below
alignData
(typeOfHit
: integer
portion; hitDetail
: text portion)
This method includes the argument qualityType
to specify
how quality scores are encoded. It is unclear from SOAP
documentation what the quality score is; the default is
‘Solexa’-like, with qualityType='SFastqQuality'
, but
can be specified as ‘Phred’-like, with
qualityType='FastqQuality'
.
SOAP outputs positions that are 1-offset from the left-most end of
the +
strand. ShortRead
preserves this
representation.
SOAP reads aligned to the -
strand are reported by SOAP as
their reverse complement, with the quality string of these reads
reversed. ShortRead
parses these to their original sequence
and orientation.
A single R object (e.g., AlignedRead
) containing
alignments, sequences and qualities of all files in dirPath
matching pattern
. There is no guarantee of order in which files
are read.
Martin Morgan <[email protected]>, Simon Anders <[email protected]> (MAQ map)
The AlignedRead
class.
Genome Analyzer Pipeline Software User Guide, Revision A, January 2008.
The MAQ reference manual, http://maq.sourceforge.net/maq-manpage.shtml#5, 3 May, 2008.
The Bowtie reference manual, http://bowtie-bio.sourceforge.net, 28 October, 2008.
The SOAP reference manual, http://soap.genomics.org.cn/soap1, 16 December, 2008.
sp <- SolexaPath(system.file("extdata", package="ShortRead")) ap <- analysisPath(sp) ## ELAND_EXTENDED (aln0 <- readAligned(ap, "s_2_export.txt", "SolexaExport")) ## PhageAlign (aln1 <- readAligned(ap, "s_5_.*_realign.txt", "SolexaRealign")) ## MAQ dirPath <- system.file('extdata', 'maq', package='ShortRead') list.files(dirPath) ## First line readLines(list.files(dirPath, full.names=TRUE)[[1]], 1) countLines(dirPath) ## two files collapse into one (aln2 <- readAligned(dirPath, type="MAQMapview")) ## select only chr1-5.fa, '+' strand filt <- compose(chromosomeFilter("chr[1-5].fa"), strandFilter("+")) (aln3 <- readAligned(sp, "s_2_export.txt", filter=filt))
sp <- SolexaPath(system.file("extdata", package="ShortRead")) ap <- analysisPath(sp) ## ELAND_EXTENDED (aln0 <- readAligned(ap, "s_2_export.txt", "SolexaExport")) ## PhageAlign (aln1 <- readAligned(ap, "s_5_.*_realign.txt", "SolexaRealign")) ## MAQ dirPath <- system.file('extdata', 'maq', package='ShortRead') list.files(dirPath) ## First line readLines(list.files(dirPath, full.names=TRUE)[[1]], 1) countLines(dirPath) ## two files collapse into one (aln2 <- readAligned(dirPath, type="MAQMapview")) ## select only chr1-5.fa, '+' strand filt <- compose(chromosomeFilter("chr[1-5].fa"), strandFilter("+")) (aln3 <- readAligned(sp, "s_2_export.txt", filter=filt))
readBaseQuality
reads all base call files in a directory
dirPath
whose file name matches seqPattern
and all quality
score files whose name matches prbPattern
, returning a compact
internal representation of the sequences, and quality scores in the files.
Methods read all files into a single R object.
readBaseQuality(dirPath, ...) ## S4 method for signature 'character' readBaseQuality(dirPath, seqPattern=character(0), prbPattern=character(0), type=c("Solexa"), ...)
readBaseQuality(dirPath, ...) ## S4 method for signature 'character' readBaseQuality(dirPath, seqPattern=character(0), prbPattern=character(0), type=c("Solexa"), ...)
dirPath |
A character vector (or other object; see methods defined on this generic) giving the directory path (relative or absolute) of files to be input. |
seqPattern |
The ( |
prbPattern |
The ( |
type |
The type of file to be parsed. Supported types include:
|
... |
Additional arguments, perhaps used by methods. |
A single R object (e.g., ShortReadQ
) containing
sequences and qualities of all files in dirPath
matching
seqPattern
and prbPattern
respectively. There is no
guarantee of order in which files are read.
Patrick Aboyoun <[email protected]>
A ShortReadQ
object.
sp <- SolexaPath(system.file("extdata", package="ShortRead")) readBaseQuality(sp, seqPattern="s_1.*_seq.txt", prbPattern="s_1.*_prb.txt")
sp <- SolexaPath(system.file("extdata", package="ShortRead")) readBaseQuality(sp, seqPattern="s_1.*_seq.txt", prbPattern="s_1.*_prb.txt")
As coverage
needs to know the lengths of the reference sequences,
this function is provided which extracts this information from
a .bfa file (Maq's "binary FASTA" format).
readBfaToc( bfafile )
readBfaToc( bfafile )
bfafile |
The file name of the .bfa file. |
An integer vector with one element per reference sequence found in the .bfa file, each vector element named with the sequence name and having the sequence length as value.
Simon Anders, EMBL-EBI, [email protected]
(Note: The C code for this function incorporates code from Li Heng's MAQ software, (c) Li Heng and released by him under GPL 2.
readFasta
reads all FASTA-formated files in a directory
dirPath
whose file name matches pattern pattern
,
returning a compact internal representation of the sequences and
quality scores in the files. Methods read all files into a single R
object; a typical use is to restrict input to a single FASTA file.
writeFasta
writes an object to a single file
, using
mode="w"
(the default) to create a new file or mode="a"
append to an existing file. Attempting to write to an existing file
with mode="w"
results in an error.
readFasta(dirPath, pattern = character(0), ..., nrec=-1L, skip=0L) ## S4 method for signature 'character' readFasta(dirPath, pattern = character(0), ..., nrec=-1L, skip=0L) writeFasta(object, file, mode="w", ...) ## S4 method for signature 'DNAStringSet' writeFasta(object, file, mode="w", ...)
readFasta(dirPath, pattern = character(0), ..., nrec=-1L, skip=0L) ## S4 method for signature 'character' readFasta(dirPath, pattern = character(0), ..., nrec=-1L, skip=0L) writeFasta(object, file, mode="w", ...) ## S4 method for signature 'DNAStringSet' writeFasta(object, file, mode="w", ...)
dirPath |
A character vector giving the directory path (relative or absolute) or single file name of FASTA files to be read. |
pattern |
The ( |
object |
An object to be output in |
file |
A length 1 character vector providing a path to a file to the object is to be written to. |
mode |
A length 1 character vector equal to either ‘w’ or ‘a’ to write to a new file or append to an existing file, respectively. |
... |
Additional arguments used by methods or, for
|
nrec |
See |
skip |
See |
readFasta
returns a DNAStringSet
.
containing sequences and qualities contained in all files in
dirPath
matching pattern
. There is no guarantee of order
in which files are read.
writeFasta
is invoked primarily for its side effect, creating
or appending to file file
. The function returns, invisibly, the
length of object
, and hence the number of records
written. There is a writeFasta
method for any class derived
from ShortRead
.
Martin Morgan
showMethods("readFasta") showMethods("writeFasta") f1 <- system.file("extdata", "someORF.fa", package="Biostrings") rfa <- readFasta(f1) sread(rfa) id(rfa) sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") file <- tempfile() writeFasta(rfq, file) readLines(file, 8) writeFasta(sread(rfq), file) # no 'id's
showMethods("readFasta") showMethods("writeFasta") f1 <- system.file("extdata", "someORF.fa", package="Biostrings") rfa <- readFasta(f1) sread(rfa) id(rfa) sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") file <- tempfile() writeFasta(rfq, file) readLines(file, 8) writeFasta(sread(rfq), file) # no 'id's
readFastq
reads all FASTQ-formated files in a directory
dirPath
whose file name matches pattern pattern
,
returning a compact internal representation of the sequences and
quality scores in the files. Methods read all files into a single R
object; a typical use is to restrict input to a single FASTQ file.
writeFastq
writes an object to a single file
, using
mode="w"
(the default) to create a new file or mode="a"
append to an existing file. Attempting to write to an existing file
with mode="w"
results in an error.
countFastq
counts the nubmer of records, nucleotides, and
base-level quality scores in one or several fastq files.
readFastq(dirPath, pattern=character(0), ...) ## S4 method for signature 'character' readFastq(dirPath, pattern=character(0), ..., withIds=TRUE) writeFastq(object, file, mode="w", full=FALSE, compress=TRUE, ...) countFastq(dirPath, pattern=character(0), ...) ## S4 method for signature 'character' countFastq(dirPath, pattern=character(0), ...)
readFastq(dirPath, pattern=character(0), ...) ## S4 method for signature 'character' readFastq(dirPath, pattern=character(0), ..., withIds=TRUE) writeFastq(object, file, mode="w", full=FALSE, compress=TRUE, ...) countFastq(dirPath, pattern=character(0), ...) ## S4 method for signature 'character' countFastq(dirPath, pattern=character(0), ...)
dirPath |
A character vector (or other object; see methods defined on this generic) giving the directory path (relative or absolute) or single file name of FASTQ files to be read. |
pattern |
The ( |
object |
An object to be output in |
file |
A length 1 character vector providing a path to a file to the object is to be written to. |
mode |
A length 1 character vector equal to either ‘w’ or ‘a’ to write to a new file or append to an existing file, respectively. |
full |
A logical(1) indicating whether the identifier line should
be repeated |
compress |
A logical(1) indicating whether the file should be
gz-compressed. The default is |
... |
Additional arguments. In particular,
|
withIds |
|
The fastq format is not quite precisely defined. The basic definition used here parses the following four lines as a single record:
@HWI-EAS88_1_1_1_1001_499 GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT +HWI-EAS88_1_1_1_1001_499 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
The first and third lines are identifiers preceded by a specific
character (the identifiers are identical, in the case of Solexa). The
second line is an upper-case sequence of nucleotides. The parser
recognizes IUPAC-standard alphabet (hence ambiguous nucleotides),
coercing .
to -
to represent missing values. The final
line is an ASCII-encoded representation of quality scores, with one
ASCII character per nucleotide.
The encoding implicit in Solexa-derived fastq files is that each
character code corresponds to a score equal to the ASCII character
value minus 64 (e.g., ASCII @
is decimal 64, and corresponds to
a Solexa quality score of 0). This is different from BioPerl, for
instance, which recovers quality scores by subtracting 33 from the
ASCII character value (so that, for instance, !
, with decimal
value 33, encodes value 0).
The BioPerl description of fastq asserts that the first character of
line 4 is a !
, but the current parser does not support this
convention.
writeFastq
creates files following the specification outlined
above, using the IUPAC-standard alphabet (hence, sequences containing
‘.’ when read will be represented by ‘-’ when written).
readFastq
returns a single R object (e.g.,
ShortReadQ
) containing sequences and qualities
contained in all files in dirPath
matching
pattern
. There is no guarantee of order in which files are
read.
writeFastq
is invoked primarily for its side effect, creating
or appending to file file
. The function returns, invisibly, the
length of object
, and hence the number of records written.
countFastq
returns a data.frame with row names equal to the
base (file) name of the fastq file, and columns records
,
nucleotides
, and scores
, corresponding to tally of each
entity in each file. Parsing mistakes from poorly formmated files
result in an error.
Martin Morgan
The IUPAC alphabet in Biostrings.
http://www.bioperl.org/wiki/FASTQ_sequence_format for the BioPerl definition of fastq.
Solexa documentation 'Data analysis - documentation : Pipeline output and visualisation'.
methods(readFastq) methods(writeFastq) methods(countFastq) sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") sread(rfq) id(rfq) quality(rfq) ## SolexaPath method 'knows' where FASTQ files are placed rfq1 <- readFastq(sp, pattern="s_1_sequence.txt") rfq1 file <- tempfile() writeFastq(rfq, file) readLines(file, 8) countFastq(file)
methods(readFastq) methods(writeFastq) methods(countFastq) sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") sread(rfq) id(rfq) quality(rfq) ## SolexaPath method 'knows' where FASTQ files are placed rfq1 <- readFastq(sp, pattern="s_1_sequence.txt") rfq1 file <- tempfile() writeFastq(rfq, file) readLines(file, 8) countFastq(file)
readIntensities
reads image ‘intensity’ files (such as
Illumina's _int.txt
and (optionally) _nse.txt
) into a
single object.
readIntensities(dirPath, pattern=character(0), ...)
readIntensities(dirPath, pattern=character(0), ...)
dirPath |
Directory path or other object (e.g.,
|
pattern |
A length 1 character vector representing a regular
expression to be combined with |
... |
Additional arguments used by methods. |
Additional methods are defined on specific classes, see, e.g.,
SolexaPath
.
The readIntensities,character-method
contains an argument
type
that determines how intensities are parsed. Use the
type
argument to readIntensities,character-method
, as
described below. All readIntensities,character
methods accepts
the folling arguments:
Include estimates of variability (i.e., from
parsing _nse
files).
Report on progress when starting to read each file.
The supported types and their signatures are:
type="RtaIntensity"
Intensities are read from Illumina _cif.txt
and
_cnf.txt
-style files. The signature for this method is
dirPath, pattern=character(0), ..., type="RtaIntensity",
lane=integer(0), cycles=integer(0), cycleIteration=1L,
tiles=integer(0),
laneName=sprintf("L
cycleNames=sprintf("C
tileNames=sprintf("s_
posNames=sprintf("s_
withVariability=TRUE, verbose=FALSE
integer(1)
identifying the lane in which
cycles and tiles are to be processed.
integer()
enumerating cycles to be
processed.
integer(1)
identifying the
iteration of the base caller to be summarized
integer()
enumerating tile numbers to be
summarized.
character()
vectors identifying the lane and
cycle directories, and the ‘pos’ and tile file names
(excluding the ‘.cif’ or ‘.cnf’ extension) to be
processed.
The dirPath
and pattern
arguments are combined as
list.files(dirPath, pattern)
, and must identify a single
directory. Most uses of this function will focus on a single tile
(specified with, e.g., tiles=1L
); the laneName
,
cycleNames
, tileNames
, and posNames
parameters are designed to work with the default Illumina pipeline
and do not normally need to be specified.
type="IparIntensity"
Intensities are read from Solexa _pos.txt
,
_int.txt.p
, _nse.txt.p
-style file triplets. The
signature for this method is
dirPath, pattern=character(0), ...,
type="IparIntensity",
intExtension="_int.txt.p.gz",
nseExtension="_nse.txt.p.gz",
posExtension="_pos.txt",
withVariability=TRUE, verbose=FALSE
Files to be parsed are determined as, e.g., paste(pattern,
intExtension, sep="")
.
type="SolexaIntensity"
Intensities are read from Solexa _int.txt
and
_nse.txt
-style files. The signature for this method is
dirPath, pattern=character(0), ...,
type="SolexaIntensity",
intExtension="_int.txt",
nseExtension="_nse.txt",
withVariability=TRUE, verbose=FALSE
Files to be parsed are determined as, e.g., paste(pattern,
intExtension, sep="")
.
An object derived from class Intensity
.
Martin Morgan <[email protected]>, Michael Muratet <[email protected]> (RTA).
fl <- system.file("extdata", package="ShortRead") sp <- SolexaPath(fl) int <- readIntensities(sp) int intensity(int)[1,,] # one read intensity(int)[[1:2,,]] # two reads, as 'array' head(rowMeans(intensity(int))) # treated as 'array' head(pData(readIntensityInfo(int))) ## Not run: ## RTA Lane 2, cycles 1:80, cycle iteration 1, tile 3 int <- readIntensities("Data/Intensities", type="RtaIntensity", lane=2, cycles=1:80, tiles=3) ## End(Not run)
fl <- system.file("extdata", package="ShortRead") sp <- SolexaPath(fl) int <- readIntensities(sp) int intensity(int)[1,,] # one read intensity(int)[[1:2,,]] # two reads, as 'array' head(rowMeans(intensity(int))) # treated as 'array' head(pData(readIntensityInfo(int))) ## Not run: ## RTA Lane 2, cycles 1:80, cycle iteration 1, tile 3 int <- readIntensities("Data/Intensities", type="RtaIntensity", lane=2, cycles=1:80, tiles=3) ## End(Not run)
readPrb
reads all _prb.txt
files in a directory into a
single object. Most methods (see details) do this by identifying the
maximum base call quality for each cycle and read, and representing
this as an ASCII-encoded character string.
readPrb(dirPath, pattern = character(0), ...)
readPrb(dirPath, pattern = character(0), ...)
dirPath |
Directory path or other object (e.g.,
|
pattern |
Regular expression matching names of |
... |
Additional arguments, unused. |
The readPrb,character-method
contains an argument as
that determines the value of the returned object, as follows.
as="SolexaEncoding"
The ASCII encoding of the maximum per cycle and read quality score is encoded using Solexa conventions.
as="FastqEncoding"
The ASCII encoding of the maximum per cycle and read quality score
is encoded using Fastq conventions, i.e., !
has value 0.
as="IntegerEncoding"
The maximum per cycle and read quality score is returned as a in integer value. Values are collated into a matrix with number of rows equal to number of reads, and number of columns equal to number of cycles.
as="array"
The quality scores are not summarized; the return value is an integer array with dimensions corresponding to reads, nucleotides, and cycles.
An object of class QualityScore
, or an integer matrix.
Martin Morgan <[email protected]>
fl <- system.file("extdata", package="ShortRead") sp <- SolexaPath(fl) readPrb(sp, "s_1.*_prb.txt") # all tiles to a single file
fl <- system.file("extdata", package="ShortRead") sp <- SolexaPath(fl) readPrb(sp, "s_1.*_prb.txt") # all tiles to a single file
readQseq
reads all files matching pattern
in a directory
into a single ShortReadQ
-class object.
Information on machine, lane, tile, x, and y coordinates, filtering
status, and read number are not returned (although filtering status
can be used to selectively include reads as described below).
readQseq(dirPath, pattern = character(0), ..., as=c("ShortReadQ", "DataFrame", "XDataFrame"), filtered=FALSE, verbose=FALSE)
readQseq(dirPath, pattern = character(0), ..., as=c("ShortReadQ", "DataFrame", "XDataFrame"), filtered=FALSE, verbose=FALSE)
dirPath |
Directory path or other object (e.g.,
|
pattern |
Regular expression matching names of |
... |
Additional argument, passed to I/O functions. |
as |
|
filtered |
|
verbose |
|
An object of class ShortReadQ
.
Martin Morgan <[email protected]>
fl <- system.file("extdata", package="ShortRead") sp <- SolexaPath(fl) readQseq(sp)
fl <- system.file("extdata", package="ShortRead") sp <- SolexaPath(fl) readQseq(sp)
This function allows short read data components such as DNA sequence,
quality scores, and read names to be read in to XStringSet
(e.g., DNAStringSet
, BStringSet
) objects. One or several
files of identical layout can be specified.
readXStringColumns(dirPath, pattern=character(0), colClasses=list(NULL), nrows=-1L, skip=0L, sep = "\t", header = FALSE, comment.char="#")
readXStringColumns(dirPath, pattern=character(0), colClasses=list(NULL), nrows=-1L, skip=0L, sep = "\t", header = FALSE, comment.char="#")
dirPath |
A character vector giving the directory path (relative or absolute) of files to be read. |
pattern |
The ( |
colClasses |
A list of length equal to the number of columns in
a file. Columns with corresponding |
nrows |
A length 1 integer vector describing the maximum number
of |
skip |
A length 1 integer vector describing how many lines to skip at the start of each file. |
sep |
A length 1 character vector describing the column separator. |
header |
A length 1 logical vector indicating whether files include a header line identifying columns. If present, the header of the first file is used to name the returned values. |
comment.char |
A length 1 character vector, with a single character that, when appearing at the start of a line, indicates that the entire line should be ignored. Currently there is no way to use comment characters in other than the first position of a line. |
A list, with each element containing an XStringSet
object of
the type corresponding to the non-NULL elements of colClasses
.
Martin Morgan <[email protected]>
## valid character strings for colClasses names(slot(getClass("XString"), "subclasses")) dirPath <- system.file('extdata', 'maq', package='ShortRead') colClasses <- rep(list(NULL), 16) colClasses[c(1, 15, 16)] <- c("BString", "DNAString", "BString") ## read one file readXStringColumns(dirPath, "out.aln.1.txt", colClasses=colClasses) ## read all files into a single object for each column res <- readXStringColumns(dirPath, colClasses=colClasses)
## valid character strings for colClasses names(slot(getClass("XString"), "subclasses")) dirPath <- system.file('extdata', 'maq', package='ShortRead') colClasses <- rep(list(NULL), 16) colClasses[c(1, 15, 16)] <- c("BString", "DNAString", "BString") ## read one file readXStringColumns(dirPath, "out.aln.1.txt", colClasses=colClasses) ## read all files into a single object for each column res <- readXStringColumns(dirPath, colClasses=colClasses)
Use renew
to update an object defined in ShortRead with
new values. Discover update-able classes and values with
renewable
.
renewable(x, ...) renew(x, ...)
renewable(x, ...) renew(x, ...)
x |
For |
... |
For |
When invoked with no arguments renewable
returns a character
vector naming classes that can be renewed.
When invoked with a character(1)
or an instance of a
ShortRead class, a list of the names and values of the elements
that can be renewed. When x
is a character vector naming a
virtual class, then each element of the returned list is a non-virtual
descendant of that class that can be used in renewal. This is not
fully recursive.
renew
is always invoked with the x
argument being an
instance of a class identified by renewable()
. Remaining
arguments are name-value pairs identifying the components of x
that are to be renewed (updated). The name-value pairs must be
consistent with renewable(x)
. The resulting object is checked
for validity. Multiple components of the object can be updated in a
single call to renew
, allowing comparatively efficient complex
transformations.
renewable()
returns a character vector of renewable classes.
renewable(x)
returns a named list. The names correspond to
renewable classes, and the elements of the list correspond to
renewable components of the class.
renew(x, ...)
returns an object of the same class as
x
, but with components of x
replaced by the named values
of ...
.
Martin Morgan <[email protected]>
## discovery renewable() renewable("AlignedRead") renewable("QualityScore") ## instantiable classes ## example data sp <- SolexaPath(system.file("extdata", package="ShortRead")) ap <- analysisPath(sp) filt <- chromosomeFilter("chr[[:digit:]+].fa") aln <- readAligned(ap, "s_2_export.txt", "SolexaExport", filter=filt) ## renew chromosomes from 'chr1.fa' to 'chr1', etc labels <- sub("\\.fa", "", levels(chromosome(aln))) renew(aln, chromosome=factor(chromosome(aln), labels=labels)) ## multiple changes -- update chromosome, offset position renew(aln, chromosome=factor(chromosome(aln), labels=labels), position=1L+position(aln)) ## oops! invalid instances cannot be constructed try(renew(aln, position=1:10))
## discovery renewable() renewable("AlignedRead") renewable("QualityScore") ## instantiable classes ## example data sp <- SolexaPath(system.file("extdata", package="ShortRead")) ap <- analysisPath(sp) filt <- chromosomeFilter("chr[[:digit:]+].fa") aln <- readAligned(ap, "s_2_export.txt", "SolexaExport", filter=filt) ## renew chromosomes from 'chr1.fa' to 'chr1', etc labels <- sub("\\.fa", "", levels(chromosome(aln))) renew(aln, chromosome=factor(chromosome(aln), labels=labels)) ## multiple changes -- update chromosome, offset position renew(aln, chromosome=factor(chromosome(aln), labels=labels), position=1L+position(aln)) ## oops! invalid instances cannot be constructed try(renew(aln, position=1:10))
This generic function summarizes results from evaluation of
qa
into a report. Available report formats vary
depending on the data analysed.
report(x, ..., dest=tempfile(), type="html") report_html(x, dest, type, ...)
report(x, ..., dest=tempfile(), type="html") report_html(x, dest, type, ...)
x |
|
... |
Additional arguments used by specific methods. All methods with See specific methods for details on additional |
dest |
The output destination for the final report. For
|
type |
A text string defining the type of report; available
report types depend on the type of object |
report_html
is meant for use by package authors wishing to add
methods for creating HTML reports; users should always invoke
report
.
The following methods are defined:
x="BowtieQA", ..., dest=tempfile(), type="html"
Produce an HTML-based report from an object of class
BowtieQA
.
x="FastqQA", ..., dest=tempfile(), type="html"
Produce an HTML-based report from an object of class
FastqQA
.
x="MAQMapQA", ..., dest=tempfile(), type="html"
Produce an HTML-based report from an object of class
MAQMapQA
.
x="SolexaExportQA", ..., dest=tempfile(), type="html"
Produce an HTML-based report from an object of class
SolexaExportQA
.
x="SolexaExportQA", ..., dest=tempfile(), type="pdf"
(Deprecated) Produce an PDF report from an object of class
SolexaExportQA
.
x="SolexaPath", ..., dest=tempfile(), type="html"
Produce an HTML report by first visiting all _export.txt
files in the analysisPath
directory of x
to create a
SolexaExportQA
instance.
x="SolexaPath", ..., dest=tempfile(), type="pdf"
(Deprecated) Produce an PDF report by first visiting all
_export.txt
files in the analysisPath
directory of
x
to create a SolexaExportQA
instance.
x="ANY", ..., dest=tempfile(), type="ANY"
This method is used internally
This function is invoked for its side effect; the return value is the name of the directory or file where the report was created.
Martin Morgan <[email protected]>
showMethods("report") ## default CSS file cssFile <- c(QA.css=system.file("template", "QA.css", package="ShortRead")) noquote(readLines(cssFile))
showMethods("report") ## default CSS file cssFile <- c(QA.css=system.file("template", "QA.css", package="ShortRead")) noquote(readLines(cssFile))
This class represents the directory location where Roche (454) result files (fasta sequences and qualities) can be found.
Objects from the class are created with the RochePath
constructor:
RochePath(experimentPath = NA_character_,
readPath = experimentPath,
qualPath = readPath, ..., verbose = FALSE)
character(1)
or
RochePath
pointing to the top-level directory
of a Roche experiment.
character()
of directories (typically in
experimentPath
) containing sequence (read) information. The
default selects all directories matching
list.files(experimentPath, "run")
.
character()
of directories (typically in
experimentPath
) containing quality information. The default
selects all directories matching list.files(experimentPath,
"run")
.
logical(1)
indicating whether invalid paths
should be reported interactively.
RocheSet
has the following slots:
readPath
:Object of class "character"
, as
described in the constructor, above.
qualPath
:Object of class "character"
, as
described in the constructor, above.
basePath
:Object of class "character"
,
containing the experimentPath
.
Class "ExperimentPath"
, directly.
Class ".Roche"
, directly.
Class ".ShortReadBase"
, by class "ExperimentPath", distance 2.
Class ".ShortReadBase"
, by class ".Roche", distance 2.
RochePath
has the following methods or functions defined:
signature(dirPath = "RochePath",
pattern=".\.fna$", sample = 1, run = 1, ...)
: Read sequences
from files matching list.files(dirPath, pattern)
(when
dirPath="character"
) or
list.files(readPath(dir)[run], pattern)[sample]
. The
result is a DNAStringSet
.
signature(dirPath = "RochePath", reads=NULL,
pattern="\.qual$", sample=1, run=1, ...)
: Read quality scores
from files matching
list.files(qualPath(dirPath)[run])[sample]
. Non-null
reads
is used as an (optional) template for parsing
quality scores.
signature(dirPath = "RochePath",
fastaPattern = "\.fna$", qualPattern = "\.qual$", sample = 1,
run = 1)
: read sequences and quality scores into a
ShortReadQ
instance.
signature(dirPath = "character",
fastaPattern = "\.fna$", qualPattern = "\.qual$", sample = 1,
run = 1)
: wrapper for method above, coercing dirPath
to a RochePath
via RochePath(dirPath)
.
signature(dirPath = "RochePath", ...)
:
Reads in base and quality information. Currently delegates to
readFastaQual
, above, but will do more after
RochePath
supports more file types.
signature(dirPath = "RochePath", ...)
: Pass
arguments on to readFastaQual
, documented above.
signature(object = "RochePath")
: return the
contents of the readPath
slot.
signature(object = "RochePath")
: return the
basename
s of readPath(object)
.
signature(path = "RochePath")
: create a
RocheSet
from path
.
Additional methods include:
signature(object = "RochePath")
: Briefly
summarize the experiment path locations.
signature(x = "RochePath")
: Provide
additional detail on the Roche path. All file paths are presented
in full.
Michael Lawrence <[email protected]>
showClass("RochePath")
showClass("RochePath")
This class is meant to coordinate all data in a Roche (454)
experiment. See SRSet
for additional details.
Create objects from this class using one of the RocheSet
methods documented below
sourcePath
:Object of class "RochePath"
The
file system location of the data used in this experiment.
readIndex
:Object of class "integer"
indexing
reads included in the experiment; see SRSet
for details on data representation in this class.
readCount
:Object of class "integer"
containing the number of reads associated with each sample; see
SRSet
for details on data representation in
this class.
phenoData
:Object of class "AnnotatedDataFrame"
with as many rows as there are samples, containing information on
experimental design.
readData
:Object of class "AnnotatedDataFrame"
containing as many rows as there are reads, containing information
on each read in the experiment.
Class "SRSet"
, directly.
Class ".Roche"
, directly.
Class ".ShortReadBase"
, by class "SRSet", distance 2.
Class ".ShortReadBase"
, by class ".Roche", distance 2.
No methods defined with class "RocheSet" in the signature; see
SRSet
for inherited methods.
Michael Lawrence <[email protected]>
showClass("RocheSet")
showClass("RocheSet")
RtaIntensity
objects contain Illumina image
intensity measures created by the RTA pipeline. It will often be more
convenient to create this object using readIntensities
.
RtaIntensity(intensity=array(0, c(0, 0, 0)), measurementError=array(0, c(0, 0, 0)), readInfo=SolexaIntensityInfo( lane=integer()[seq_len(nrow(intensity))]), ...)
RtaIntensity(intensity=array(0, c(0, 0, 0)), measurementError=array(0, c(0, 0, 0)), readInfo=SolexaIntensityInfo( lane=integer()[seq_len(nrow(intensity))]), ...)
intensity |
A matrix of image intensity values. Successive
columns correspond to nucleotides A, C, G, T; four successive
columns correspond to each cycle. Typically, derived from
|
measurementError |
As |
readInfo |
An object of class |
... |
Additional arguments, not currently used. |
An object of class RtaIntensity
.
Martin Morgan <[email protected]>
RtaIntensity
,
readIntensities
.
rta <- RtaIntensity(array(runif(60), c(5,4,3))) intensity(rta) ## subsetting, access, and coercion as(intensity(rta)[1:2,,], "array")
rta <- RtaIntensity(array(runif(60), c(5,4,3))) intensity(rta) ## subsetting, access, and coercion as(intensity(rta)[1:2,,], "array")
Subclass of Intensity
for representing image
intensity data from the Illumina RTA pipeline.
Objects can be created by calls to RtaIntensity
or more usually
readIntensities
.
Object of RtaIntensity
have slots:
readInfo
:Object of class "RtaIntensityInfo"
representing information about each read.
intensity
:Object of class "ArrayIntensity"
containing an array of intensities with dimensions read, base, and
cycle. Nucleotide are A, C, G, T for each cycle.
measurementError
:Object of class
"ArrayIntensity"
containing measurement errors for each
read, cycle, and base, with dimensions like that for
intensity
.
.hasMeasurementError
:Object of class
"ScalarLogical"
used internally to indicate whether
measurement error information is included.
Class "SolexaIntensity"
, directly.
Class "Intensity"
, by class "SolexaIntensity", distance 2.
Class ".ShortReadBase"
, by class "SolexaIntensity", distance 3.
Class "RtaIntensity" inherits accessor, subsetting, and display
methods from class SolexaIntensity
.
Martin Morgan <[email protected]>
SolexaIntensity
, readIntensities
showClass("RtaIntensity") showMethods(class="RtaIntensity", where=getNamespace("ShortRead"))
showClass("RtaIntensity") showMethods(class="RtaIntensity", where=getNamespace("ShortRead"))
This class provides a way to store and manipulate, in a coordinated fashion, uniform-length short reads and their identifiers.
Objects from this class are created by readFasta
, or by
calls to the constructor ShortRead
, as outlined below.
sread
:Object of class "DNAStringSet"
containing IUPAC-standard, uniform-length DNA strings represent
short sequence reads.
id
:Object of class "BStringSet"
containing
identifiers, one for each short read.
Class ".ShortReadBase"
, directly.
Constructors include:
signature(sread = "DNAStringSet", id = "BStringSet")
:
Create a ShortRead
object from reads and their
identifiers. The length of id
must match that of sread
.
signature(sread = "DNAStringSet", id = "missing")
:
Create a ShortRead
object from reads, creating empty identifiers.
signature(sread = "missing", id = "missing")
:
Create an empty ShortRead
object.
Methods include:
signature(object = "AlignedRead")
: access the
sread slot of object
. as(object, "DNAStringSet")
is
similar, but adds id(object)
as the names()
of the
DNAStringSet.
signature(object = "AlignedRead")
: access the
id slot of object
.
signature(x = "ShortRead", i = "ANY", j = "missing")
:
This method creates a new ShortRead
object containing only
those reads indexed by i
. Additional methods on
‘[,ShortRead’ do not provide additional functionality, but
are present to limit inappropriate use.
signature(x = "ShortRead", values = "ShortRead")
:
append the sread
and id
slots of values
after
the corresponding fields of x
.
signature(x = "ShortRead", start = NA, end = NA, width = NA, use.names = TRUE)
:
‘narrow’ sread
so that sequences are between
start
and end
bases, according to
narrow
in the IRanges
package.
signature(x = "ShortRead")
: returns a
integer(1)
vector describing the number of reads in this
object.
signature(x = "ShortRead")
: returns an
integer()
vector of the widths of each read in this
object.
signature(x = "ShortRead")
:
signature(x = "ShortRead")
:
signature(x = "ShortRead")
:
signature(x = "ShortRead")
:
Order, rank, sort, and find duplicates in ShortRead
objects
based on sread(x)
, analogous to the corresponding functions
order
, rank
, sort
, and duplicated
,
ordering nucleotides in the order ACGT
.
signature(pattern="ShortRead", subject="ANY")
:
Find the edit distance between each read in pattern
and the
(short) sequences in subject
. See srdistance
for allowable values for subject
, and for additional
details.
signature(Lpattern = "", Rpattern = "", subject = "ShortRead", max.Lmismatch = 0, max.Rmismatch = 0, with.Lindels = FALSE, with.Rindels = FALSE, Lfixed = TRUE, Rfixed = TRUE, ranges = FALSE)
:
Remove left and / or right flanking patterns from
sread(subject)
, as described in
trimLRPatterns
. Classes
derived from ShortRead
(e.g., ShortReadQ
,
AlignedRead
) have corresponding base quality scores
trimmed, too. The class of the return object is the same as the
class of subject
, except when ranges=TRUE
when the
return value is the ranges to use to trim 'subject'.
signature(stringSet = "ShortRead")
:
Apply alphabetByCycle
to the sread
component
of stringSet
, returning a matrix as described in
alphabetByCycle
.
signature(x= "ShortRead", n = 50)
:
Apply tables
to the sread
component of
x
, returning a list summarizing frequency of reads in
x
.
signature(object="ShortRead")
: Remove all reads
containing non-nucleotide ("N", "-"
) symbols.
signature(object = "ShortRead")
: provides a brief
summary of the object, including its class, length and width.
signature(x = "ShortRead")
: provides a
more extensive summary of this object, displaying the first and
last entries of sread
and id
.
signature(object, file, ...)
: write
object
to file
in fasta format. See
writeXStringSet
for ...
argument values.
Martin Morgan
showClass("ShortRead") showMethods(class="ShortRead", where=getNamespace("ShortRead"))
showClass("ShortRead") showMethods(class="ShortRead", where=getNamespace("ShortRead"))
These functions are deprecated, and will become defunct.
uniqueFilter(withSread=TRUE, .name="UniqueFilter")
uniqueFilter(withSread=TRUE, .name="UniqueFilter")
withSread |
A |
.name |
An optional |
See srFilter
for details of ShortRead filters.
uniqueFilter
selects elements satisfying
!srduplicated(x)
when withSread=TRUE
, and
!(duplicated(chromosome(x)) & duplicated(position(x)) &
duplicated(strand(x)))
when withSread=FALSE
.
The behavior when withSread=TRUE
can be obtained with
occurrenceFilter(withSread=TRUE)
. The behavior when
withSread=FALSE
can be obtained using a custom filter
This class provides a way to store and manipulate, in a coordinated fashion, the reads, identifiers, and quality scores of uniform-length short reads.
Objects from this class are the result of readFastq
, or
can be constructed from DNAStringSet
, QualityScore
, and
BStringSet
objects, as described below.
Slots sread
and id
are inherited from
ShortRead
. An additional slot defined in this
class is:
quality
:Object of class "BStringSet"
representing a quality score (see readFastq
for some
discussion of quality score).
Class "ShortRead"
, directly.
Class ".ShortReadBase"
, by class "ShortRead", distance 2.
Constructors include:
signature(sread = "DNAStringSet", quality = "QualityScore", id = "BStringSet")
:
signature(sread = "DNAStringSet", quality = "BStringSet", id = "BStringSet")
:
Create a ShortReadQ
object from reads, their quality
scores, and identifiers. When quality
is of class
BStringSet
, the type of encoded quality score is inferred
from the letters used in the scores. The length of id
and
quality
must match that of sread
.
signature(sread = "DNAStringSet", quality = "QualityScore", id = "missing")
:
signature(sread = "DNAStringSet", quality = "BStringSet", id = "missing")
:
Create a ShortReadQ
object from reads and their quality
scores, creating empty identifiers. When quality
is of
class BStringSet
, the type of encoded quality score is
inferred from the letters used in the scores.
signature(sread = "missing", quality = "missing", id = "missing")
:
Create an empty ShortReadQ
object.
See accessors
for additional functions to access slot
content, and ShortRead
for inherited
methods. Additional methods include:
inherited from signature(object = "ANY")
:
access the quality slot of object
.
signature(from = "SFastqQuality", to = "QualityScaledDNAStringSet")
:
(Use as(from, "QualityScaledDNAStringSet")
) coerce objects
of class from
to class to
, using the quality
encoding implied by quality(from)
. See
QualityScore
for supported quality
classes and their coerced counterparts.
signature(object = "ShortReadQ", file =
"character", ...)
:
signature(object = "ShortReadQ", file =
"FastqFile", ...)
:
Write object
to file
in fastq format. See
?writeFastq
for additional arguments mode
and full
.
signature(x = "ShortReadQ", i = "ANY", j = "missing")
:
This method creates a new ShortReadQ
object containing only
those reads indexed by i
. Additional methods on
‘[,ShortRead’ do not provide additional functionality, but
are present to limit inappropriate use.
signature(x = "ShortReadQ", i = "ANY", j =
"missing", ..., y="ShortReadQ")
:
This method updates x
so that records indexed by i
are replaced by corresponding records in value
.
signature(x = "ShortReadQ", values = "ShortRead")
:
append the sread
, quality
and id
slots of
values
after the corresponding fields of x
.
signature(x = "ShortReadQ", ...
:
reverse or reverse complement the DNA sequence, and reverse the
quality sequence.
signature(x = "ShortReadQ", start = NA, end =
NA, width = NA, use.names = TRUE)
: narrow sread
and
quality
so that sequences are between start
and
end
bases, according to
narrow
in the IRanges
package.
signature(object="ShortReadQ", k="integer",
a="character", halfwidth="integer", ..., ranges=FALSE)
:
trim trailing nucleotides when a window of width 2 * halfwidth + 1
contains k
or more quality scores falling at or below
a
.
signature(object="ShortReadQ", k="integer",
a="character", successive=FALSE, ..., ranges=FALSE)
: trim
trailing nucleotides if k
nucleotides fall below the
quality encoded by a
. If successive=FALSE
, the k'th
failing nucleotide and all subseqent nucleotides are trimmed. If
successive=TRUE
, failing nucleotides must occur
successively; the sequence is trimmed from the first of the
successive failing nucleotides.
signature(stringSet = "ShortReadQ")
:
Apply alphabetByCycle
to the sread
component,
the quality
component, and the combination of these two
components of stringSet
, returning a list of matrices with
three elements: "sread"
, "quality"
, and
"both"
.
signature(object = "ShortReadQ")
:
See alphabetScore
for details.
signature(dirPath = "ShortReadQ", lane="character",
..., verbose=FALSE)
:
Perform quality assessment on the ShortReadQ
object using
lane
to identify the object and returning an instance of
ShortReadQQA
. See qa
signature(x = "ShortReadQ")
: display the
first and last entries of each of sread
, id
, and
quality
entries of object
.
Martin Morgan
readFastq
for creation of objects of this class from
fastq-format files.
showClass("ShortReadQ") showMethods(class="ShortReadQ", where=getNamespace("ShortRead"), inherit=FALSE) showMethods(class="ShortRead", where=getNamespace("ShortRead"), inherit=FALSE) sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") quality(rfq) sread(reverseComplement(rfq)) quality(reverseComplement(rfq)) quality(trimTails(rfq, 2, "H", successive=TRUE))
showClass("ShortReadQ") showMethods(class="ShortReadQ", where=getNamespace("ShortRead"), inherit=FALSE) showMethods(class="ShortRead", where=getNamespace("ShortRead"), inherit=FALSE) sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") quality(rfq) sread(reverseComplement(rfq)) quality(reverseComplement(rfq)) quality(trimTails(rfq, 2, "H", successive=TRUE))
These classes contains a list-like structure with summary descriptions
derived from visiting one or more fastq files, or from a
ShortReadQ
object.
Objects of the class are usually produced by a qa
method.
.srlist
:Object of class "list"
, containing
data frames or lists of data frames summarizing the results of
qa.
Class "SRList"
, directly.
Class ".QA"
, directly.
Class ".SRUtil"
, by class "SRList", distance 2.
Class ".ShortReadBase"
, by class ".QA", distance 2.
Accessor methods are inherited from the SRList
class.
Additional methods defined on this class are:
signature(x="FastqQA", ..., dest=tempfile(), type="html")
:
produces HTML files summarizing QA results. dest
should be
a directory.
signature(x="ShortReadQA", ..., dest=tempfile(), type="html")
:
produces HTML files summarizing QA results. dest
should be
a directory.
Martin Morgan <[email protected]>
qa
.
showClass("FastqQA")
showClass("FastqQA")
"Snapshot"
A Snapshot
-class to visualize genomic data from
BAM files with zoom and pan functionality.
Snapshot(files, range, ...)
Snapshot(files, range, ...)
files |
A character() or |
range |
A |
... |
Additional, optional, arguments to be passed to the
Snapshot
|
signature(x = "Snapshot")
: Zoom (in or out) the
current plot.
signature(x = "Snapshot")
: Pan (right or left)
the current plot.
signature(x = "Snapshot")
: Toggle the current
functions which imported records are to be immediately
evaluated. Note that the active range will be changed to the
current active window.
signature(x = "Snapshot")
: Toggle the panning
effects.
signature(x = "Snapshot")
: Toggle the zooming
effects.
signature(object = "Snapshot")
: Display a
Snapshot
object.
signature(x = "Snapshot")
: Get the files
field (object of class BamFileList
) of a Snapshot
object.
signature(x = "Snapshot")
: Get the
functions
field (object of SnapshotFunctionList
) of a
Snapshot
object.
signature(x = "Snapshot")
: Get the view
field (object of SpTrellis
) of a Snapshot
object.
signature(x = "Snapshot")
: Get the
.range
field (object of GRanges
) of a Snapshot
object.
signature(x = "Snapshot")
: Get the
trellis
object, a field of the SpTrellis
object.
.debug
:Object of class function
to display
messages while in debug mode
.auto_display
:Object of class logical
to
automatically display the coverage plot.
.range
:Object of class GRanges
indicating
which ranges of records to be imported from BAM fields.
.zin
:Object of class logical
indicating
whether the current zooming effect is zoom in.
.pright
:Object of class logical
indicating
whether the current panning effect is right.
.data
:Object of class data.frame
containing
coverage a position is represented for each strand and BAM file.
.data_dirty
:Object of class logical
indicating whether to re-evaluate the imported records.
.initial_functions
:Object of class
SnapshotFunctionList
available by the Snapshot
object.
.current_function
:Object of class character
of the function the imported recorded are currently evaluated and
visualized.
annTrack
:Default to NULL
if not intended to
visualize the annotation track. If default visualization
function(s) is intended to be used to plot the annotation,
annTrack
has to be a GRanges
instance.
functions
:Object of class
SnapshotFunctionList
of customized functions to evaluate
and visualize the imported records.
files
:Object of class BamFileList
to be imported.
view
:Object of class SpTrellis
that is
essentially a reference class wrapper of Trellis
objects.
display()
:Display the current Snapshot
object.
pan()
:Pan (right or left) the current plot.
zoom()
:Zoom (in or out) the current plot.
toggle(zoom, pan, currentFunction)
:Toggle zooming, panning effects or the currentFuction in which the imported records are to be evaluated and visualized.
Martin Morgan and Chao-Jen Wong [email protected]
## example 1: Importing specific ranges of records file <- system.file("extdata", "SRR002051.chrI-V.bam", package="yeastNagalakshmi") which <- GRanges("chrI", IRanges(1, 2e5)) s <- Snapshot(file, range=which) ## methods zoom(s) # zoom in ## zoom in to a specific region zoom(s, range=GRanges("chrI", IRanges(7e4, 7e4+8000))) pan(s) # pan right togglez(s) # change effect of zooming zoom(s) # zoom out togglep(s) # change effect of panning pan(s) ## accessors functions(s) vrange(s) show(s) ignore.strand(s) view(s) ## extract the spTrellis object getTrellis(s) ## extract the trellis object ## example 2: ignore strand s <- Snapshot(file, range=which, ignore.strand=TRUE) ## ## example 3: visualizing annotation track ## library(GenomicFeatures) getAnnGR <- function(txdb, which) { ex <- exonsBy(txdb, by="gene") seqlevels(ex, pruning.mode="coarse") <- seqlevels(which) r <- range(ex) gr <- unlist(r) values(gr)[["gene_id"]] <- rep.int(names(r), times=lengths(r)) gr } txdbFile <- system.file("extdata", "sacCer2_sgdGene.sqlite", package="yeastNagalakshmi") # txdb <- makeTxDbFromUCSC(genome="sacCer2", tablename="sgdGene") txdb <- loadDb(txdbFile) which <- GRanges("chrI", IRanges(1, 2e5)) gr <- getAnnGR(txdb, which) ## note that the first column of the elementMetadata annotates of the ## range of the elements. gr s <- Snapshot(file, range=which, annTrack=gr) annTrack(s) ## zoom in to an interesting region zoom(s, range=GRanges("chrI", IRanges(7e4, 7e4+8000))) togglez(s) ## zoom out zoom(s) pan(s) ## example 4, 5, 6: multiple BAM files with 'multicoarse_covarage' ## and 'multifine_coverage' view. ## Resolution does not automatically switch for views of multiple ## files. It is important to note if width(which) < 10,000, use ## multifine_coverage. Otherwise use multicoarse_coverage file <- system.file("extdata", "SRR002051.chrI-V.bam", package="yeastNagalakshmi") which <- GRanges("chrI", IRanges(1, 2e5)) s <- Snapshot(c(file, file), range=which, currentFunction="multicoarse_coverage") ## grouping files and view by 'multicoarse_coverage' bfiles <- BamFileList(c(a=file, b=file)) values(bfiles) <- DataFrame(sampleGroup=factor(c("normal", "tumor"))) values(bfiles) s <- Snapshot(bfiles, range=which, currentFunction="multicoarse_coverage", fac="sampleGroup") ## grouping files and view by 'multifine_coverage' which <- GRanges("chrI", IRanges(7e4, 7e4+8000)) s <- Snapshot(bfiles, range=which, currentFunction="multifine_coverage", fac="sampleGroup")
## example 1: Importing specific ranges of records file <- system.file("extdata", "SRR002051.chrI-V.bam", package="yeastNagalakshmi") which <- GRanges("chrI", IRanges(1, 2e5)) s <- Snapshot(file, range=which) ## methods zoom(s) # zoom in ## zoom in to a specific region zoom(s, range=GRanges("chrI", IRanges(7e4, 7e4+8000))) pan(s) # pan right togglez(s) # change effect of zooming zoom(s) # zoom out togglep(s) # change effect of panning pan(s) ## accessors functions(s) vrange(s) show(s) ignore.strand(s) view(s) ## extract the spTrellis object getTrellis(s) ## extract the trellis object ## example 2: ignore strand s <- Snapshot(file, range=which, ignore.strand=TRUE) ## ## example 3: visualizing annotation track ## library(GenomicFeatures) getAnnGR <- function(txdb, which) { ex <- exonsBy(txdb, by="gene") seqlevels(ex, pruning.mode="coarse") <- seqlevels(which) r <- range(ex) gr <- unlist(r) values(gr)[["gene_id"]] <- rep.int(names(r), times=lengths(r)) gr } txdbFile <- system.file("extdata", "sacCer2_sgdGene.sqlite", package="yeastNagalakshmi") # txdb <- makeTxDbFromUCSC(genome="sacCer2", tablename="sgdGene") txdb <- loadDb(txdbFile) which <- GRanges("chrI", IRanges(1, 2e5)) gr <- getAnnGR(txdb, which) ## note that the first column of the elementMetadata annotates of the ## range of the elements. gr s <- Snapshot(file, range=which, annTrack=gr) annTrack(s) ## zoom in to an interesting region zoom(s, range=GRanges("chrI", IRanges(7e4, 7e4+8000))) togglez(s) ## zoom out zoom(s) pan(s) ## example 4, 5, 6: multiple BAM files with 'multicoarse_covarage' ## and 'multifine_coverage' view. ## Resolution does not automatically switch for views of multiple ## files. It is important to note if width(which) < 10,000, use ## multifine_coverage. Otherwise use multicoarse_coverage file <- system.file("extdata", "SRR002051.chrI-V.bam", package="yeastNagalakshmi") which <- GRanges("chrI", IRanges(1, 2e5)) s <- Snapshot(c(file, file), range=which, currentFunction="multicoarse_coverage") ## grouping files and view by 'multicoarse_coverage' bfiles <- BamFileList(c(a=file, b=file)) values(bfiles) <- DataFrame(sampleGroup=factor(c("normal", "tumor"))) values(bfiles) s <- Snapshot(bfiles, range=which, currentFunction="multicoarse_coverage", fac="sampleGroup") ## grouping files and view by 'multifine_coverage' which <- GRanges("chrI", IRanges(7e4, 7e4+8000)) s <- Snapshot(bfiles, range=which, currentFunction="multifine_coverage", fac="sampleGroup")
A class to store custom reader and viewer functions for the
Snapshot
class.
SnapshotFunction(reader, viewer, limits, ...) reader(x, ...) viewer(x, ...) limits(x, ...)
SnapshotFunction(reader, viewer, limits, ...) reader(x, ...) viewer(x, ...) limits(x, ...)
reader |
A function for reading data. The function must take a
single argument (a |
viewer |
A function for visualizing the data. The function must
accept the |
limits |
An integer(2) indicating the minimum and maximum number
of nucleotides the |
x |
An instance of |
... |
Additional arguments, currently unused. |
reader
:Object of class function
for
reading data from BAM files and returning a
data.frame
.
viewer
:Object of class function
for
visualization that returns an SpTrellis
object.
limits
:Object of class integer
for the limits
of ranges to be visualized.
Martin Morgan and Chao-Jen Wong
## internally defined function reader(ShortRead:::.fine_coverage) viewer(ShortRead:::.fine_coverage) limits(ShortRead:::.fine_coverage)
## internally defined function reader(ShortRead:::.fine_coverage) viewer(ShortRead:::.fine_coverage) limits(ShortRead:::.fine_coverage)
This class contains a list-like structure with summary descriptions derived from visiting one or more Solexa ‘export’ or ‘realign’ files.
Objects of the class are usually produced by a qa
method.
.srlist
:Object of class "list"
, containing
data frames or lists of data frames summarizing the results of
qa.
Class "SRList"
, directly.
Class ".QA"
, directly.
Class ".SRUtil"
, by class "SRList", distance 2.
Class ".ShortReadBase"
, by class ".QA", distance 2.
Accessor methods are inherited from the SRList
class.
Additional methods defined on this class are:
signature(x="SolexaExportQA", ..., dest=tempfile(), type="html")
:
produces HTML files summarizing QA results. dest
should be
a directory.
signature(x="SolexaExportQA", ..., dest=tempfile(), type="pdf")
:
(deprecated; use type="html"
instead) produces a pdf file
summarizing QA results. dest
should be a file.
signature(x="SolexaRealignQA", ..., dest=tempfile(), type="html")
:
produces HTML files summarizing QA results. dest
should be
a directory.
Martin Morgan <[email protected]>
qa
.
showClass("SolexaExportQA")
showClass("SolexaExportQA")
These function constructs objects of
SolexaIntensity
and
SolexaIntensityInfo
. It will often be more
convenient to create these objects using parsers such as
readIntensities
.
SolexaIntensity(intensity=array(0, c(0, 0, 0)), measurementError=array(0, c(0, 0, 0)), readInfo=SolexaIntensityInfo( lane=integer(nrow(intensity))), ...) SolexaIntensityInfo(lane=integer(0), tile=integer(0)[seq_along(lane)], x=integer(0)[seq_along(lane)], y=integer(0)[seq_along(lane)])
SolexaIntensity(intensity=array(0, c(0, 0, 0)), measurementError=array(0, c(0, 0, 0)), readInfo=SolexaIntensityInfo( lane=integer(nrow(intensity))), ...) SolexaIntensityInfo(lane=integer(0), tile=integer(0)[seq_along(lane)], x=integer(0)[seq_along(lane)], y=integer(0)[seq_along(lane)])
intensity |
A matrix of image intensity values. Successive
columns correspond to nucleotides A, C, G, T; four successive
columns correspond to each cycle. Typically, derived from
|
measurementError |
As |
readInfo |
An object of class |
lane |
An integer vector giving the lane from which each read is derived. |
tile |
An integer vector giving the tile from which each read is derived. |
x |
An integer vector giving the tile-local x coordinate of the read from which each read is derived. |
y |
An integer vector giving the tile-local y coordinate of the read from which each read is derived. |
... |
Additional arguments, not currently used. |
An object of class SolexaIntensity
, or
SolexaIntensityInfo
.
Martin Morgan <[email protected]>
Instances of Intensity
and
IntensityInfo
for representing image intensity
data from Solexa experiments.
Objects can be created by calls to SolexaIntensityInfo
or
SolexaIntensity
, or more usually readIntensities
.
Object of SolexaIntensity
have slots:
readInfo
:Object of class "SolexaIntensityInfo"
representing information about each read.
intensity
:Object of class "ArrayIntensity"
containing an array of intensities with dimensions read, base, and
cycle. Nucleotide are A, C, G, T for each cycle.
measurementError
:Object of class
"ArrayIntensity"
containing measurement errors for each
read, cycle, and base, with dimensions like that for
intensity
.
.hasMeasurementError
:Object of class
"ScalarLogical"
used internally to indicate whether
measurement error information is included.
Object of SolexaIntensityInfo
Object of class "data.frame"
, inherited from
AnnotatedDataFrame
.
Object of class "data.frame"
, inherited
from AnnotatedDataFrame
.
Object of class "character"
, inherited
from AnnotatedDataFrame
.
.__classVersion__
Object of class "Versions"
,
inherited from AnnotatedDataFrame
.
Object of class "ScalarLogical"
, used internally
to indicate whether the user initialized this object.
Class SolexaIntensity
:
Class "Intensity"
, directly.
Class ".ShortReadBase"
, by class "Intensity", distance 2.
Class SolexaIntensityInfo
:
Class "AnnotatedDataFrame"
, directly
Class "IntensityInfo"
, directly
Class "Versioned"
, by class "AnnotatedDataFrame", distance 2
Class ".ShortReadBase"
, by class "IntensityInfo", distance 2 Class "IntensityInfo"
, directly.
Class "SolexaIntensity" inherits accessor and display
methods from class Intensity
. Additional methods include:
[
signature(x = "SolexaIntensity", i="ANY", j="ANY", k="ANY")
:
Selects the ith read, jth nucleotide, and kth cycle. Selection is coordinated across intensity, measurement error, and read information.
Class "SolexaIntensityInfo" inherits accessor, subsetting, and display
methods from class IntensityInfo
and
AnnotatedDataFrame
.
Martin Morgan <[email protected]>
showClass("SolexaIntensity") sp <- SolexaPath(system.file('extdata', package='ShortRead')) int <- readIntensities(sp) int # SolexaIntensity readIntensityInfo(int) # SolexaIntensityInfo int[1:5,,] # read 1:5
showClass("SolexaIntensity") sp <- SolexaPath(system.file('extdata', package='ShortRead')) int <- readIntensities(sp) int # SolexaIntensity readIntensityInfo(int) # SolexaIntensityInfo int[1:5,,] # read 1:5
Solexa produces a hierarchy of output files. The content of the hierarchy varies depending on analysis options. This class represents a standard class hierarchy, constructed by searching a file hierarchy for appropriately named directories.
Objects from the class are created by calls to the constructor:
SolexaPath(experimentPath,
dataPath=.solexaPath(experimentPath, "Data"),
scanPath=.solexaPath(dataPath, "GoldCrest"),
imageAnalysisPath=.solexaPath(dataPath, "^(C|IPAR)"),
baseCallPath=.solexaPath(imageAnalysisPath,
"^Bustard"),
analysisPath=.solexaPath(baseCallPath,
"^GERALD"),
..., verbose=FALSE)
character(1)
object pointing to the
top-level directory of a Solexa run, e.g.,
/home/solexa/user/080220_HWI-EAS88_0004
. This is the only
required argument
(optional) Solexa ‘Data’ folder .
(optional) Solexa GoldCrest image scan path.
(optional) Firecrest image analysis path.
(optional) Bustard base call path.
(optional) Gerald analysis pipeline path.
Additional arguments, unused by currently implemented methods.
(optional) logical vector which, when
TRUE
results in warnings if paths do not exist.
All paths must be fully-specified.
SolexaPath
has the following slots, containing either a fully
specified path to the corresponding directory (described above) or
NA
if no appropriate directory was discovered.
basePath
See experimentPath
, above.
dataPath
See above.
scanPath
See above.
imageAnalysisPath
See above.
baseCallPath
See above.
analysisPath
See above.
Class ".Solexa"
, directly.
Class ".ShortReadBase"
, by class ".Solexa", distance 2.
Transforming methods include:
signature(dirPath = "SolexaPath", pattern=character(0), run, ...)
:
Use imageAnalysisPath(sp)[run]
as the directory path(s) and
pattern=character(0)
as the pattern for discovering Solexa
intensity files. See
readIntensities,character-method
for additional
parameters.
signature(dirPath = "SolexaPath", pattern=character(0), run, ...)
:
Use baseCallPath(dirPath)[run]
as the directory path(s) and
pattern=character(0)
as the pattern for discovering Solexa
‘prb’ files, returning a SFastqQuality
object containing the maximum qualities found for each base of
each cycle.
The ...
argument may include the named argument
as
. This influences the return value, as explained on the
readPrb,character-method
page.
signature(dirPath, pattern = character(0), ...,
nrec=-1L, skip=0L)
:
Use analysisPath(dirPath)[run]
as the directory path(s) for
discovering fasta-formatted files, returning a
ShortRead
object. The default method reads
all files into a single object.
signature(dirPath = "SolexaPath", pattern = ".*_sequence.txt",
run, ..., qualityType="SFastqQuality")
:
Use analysisPath(dirPath)[run]
as the directory path(s) and
pattern=".*_sequence.txt"
as the pattern for discovering
fastq-formatted files, returning a ShortReadQ
object. The default method reads all sequence files into a
single object.
signature(dirPath = "SolexaPath", seqPattern = ".*_seq.txt", prbPattern = "s_[1-8]_prb.txt", run, ...)
:
Use baseCallPath(dirPath)[run]
as the directory path(s) and
seqPattern=".*_seq.txt"
as the pattern for discovering
base calls and prbPattern=".*_prb.txt"
as the pattern
for discovering quality scores. Note that the default method reads
all base call and quality score files into a single object;
often one will want to specify a pattern for each lane.
signature(directory="SolexaPath", pattern=".*_qseq.txt.*", run, ...., filtered=FALSE)
:
Use analysisPath(dirPath)[run]
as the directory path and
pattern=".*_qseq.txt.*"
as the pattern for discovering read
and quality scores in Solexa 'qseq' files. Data from all
files are read into a single object; often one will want to
specify a pattern for each lane. Details are as for
readQseq,character-method
.
signature(dirPath = "SolexaPath", pattern = ".*_export.txt.*", run, ..., filter=srFilter())
:
Use analysisPath(dirPath)[run]
as the directory path and
pattern=".*_export.txt"
as the pattern for discovering
Eland-aligned reads in the Solexa 'export' file format. Note that
the default method reads all aligned read files into a
single object; often one will want to specify a pattern for each
lane. Use an object of SRFilter
to select
specific chromosomes, strands, etc.
signature(dirPath="SolexaPath", pattern="character(0)", run, ...)
:
Use analysisPath(dirPath)[run]
as the directory path(s) and
pattern=".*_export.txt"
as the pattern for discovering
Solexa export
-formatted fileds, returning a
SolexaExportQA
object summarizing quality
assessment. If Rmpi
or parallel
has been initiated,
quality assessment calculations are distributed across available
nodes or cores (one node per export file.)
signature(x, ..., dest=tempfile(), type="pdf")
: Use
qa(x, ...)
to generate quality assessment measures, and
use these to generate a quality assessment report at location
dest
of type type
(e.g., ‘pdf’).
signature(path = "SolexaPath")
: create a
SolexaSet
object based on path
.
Additional methods include:
signature(object = "SolexaPath")
: briefly
summarize the file paths of object
. The
experimentPath
is given in full; the remaining paths are
identified by their leading characters.
signature(x = "SolexaPath")
: summarize
file paths of x
. All file paths are presented in
full.
Martin Morgan
showClass("SolexaPath") showMethods(class="SolexaPath", where=getNamespace("ShortRead")) sf <- system.file("extdata", package="ShortRead") sp <- SolexaPath(sf) sp readFastq(sp, pattern="s_1_sequence.txt") ## Not run: nfiles <- length(list.files(analysisPath(sp), "s_[1-8]_export.txt")) library(Rmpi) mpi.spawn.Rslaves(nslaves=nfiles) report(qa(sp)) ## End(Not run) ## Not run: nfiles <- length(list.files(analysisPath(sp), "s_[1-8]_export.txt")) report(qa(sp)) ## End(Not run)
showClass("SolexaPath") showMethods(class="SolexaPath", where=getNamespace("ShortRead")) sf <- system.file("extdata", package="ShortRead") sp <- SolexaPath(sf) sp readFastq(sp, pattern="s_1_sequence.txt") ## Not run: nfiles <- length(list.files(analysisPath(sp), "s_[1-8]_export.txt")) library(Rmpi) mpi.spawn.Rslaves(nslaves=nfiles) report(qa(sp)) ## End(Not run) ## Not run: nfiles <- length(list.files(analysisPath(sp), "s_[1-8]_export.txt")) report(qa(sp)) ## End(Not run)
This class coordinates the file hierarchy produced by the Solexa
‘pipeline’ with annotation data contained in an
AnnotatedDataFrame
(defined in the Biobase
package).
Objects can be created from the constructor:
SolexaSet(path, ...)
.
A character(1)
vector giving the fully-qualified
path to the root of the directory hierarchy associated with each
Solexa flow cell, or an object of class SolexaPath
(see
SolexaPath
for this method).
Additional arguments, especially laneDescription
,
an
AnnotatedDataFrame
describing the content of each of the 8 lanes in the Solexa flow
cell.
SolexaSet
has the following slots:
solexaPath
:Object of class "SolexaPath"
.
laneDescription
:Object of class
"AnnotatedDataFrame"
, containing information about the
samples in each lane of the flow cell.
Class ".Solexa"
, directly.
Class ".ShortReadBase"
, by class ".Solexa", distance 2.
signature(object = "SolexaSet")
: Return the
directory paths present when this object was created as a
SolexaPath
.
signature(object = "SolexaSet")
: Return the
names of each lane in the flow cell, currently names are simply
1:8
.
signature(object = "SolexaSet")
: Briefly
summarize the experiment path and lane description of the Solexa
set.
signature(x = "SolexaSet")
: Provide
additional detail on the Solexa set, including the content of
solexaPath
and the pData
and varMetadata
of
laneDescription
.
Methods transforming SolexaSet
objects include:
signature(dirPath = "SolexaSet", pattern = ".*_export.txt", run, ..., filter=srFilter())
:
Use analysisPath(solexaPath(dirPath))[run]
as the directory
path(s) and pattern=".*_export.txt"
as the pattern for
discovering Eland-aligned reads in the Solexa 'export' file
format. Note that the default method reads all aligned read
files into a single object; often one will want to specify a
pattern for each lane. Use an object of
SRFilter
to select specific chromosomes,
strands, etc.
Martin Morgan
showClass("SolexaSet") showMethods(class="SolexaSet", where=getNamespace("ShortRead")) ## construct a SolexaSet sf <- system.file("extdata", package="ShortRead") df <- data.frame(Sample=c("Sample 1", "Sample 2", "Sample 3", "Sample 4", "Center-wide control", "Sample 6", "Sample 7", "Sample 8"), Genome=c(rep("hg18", 4), "phi_plus_SNPs.txt", rep("hg18", 3))) dfMeta <- data.frame(labelDescription=c("Type of sample", "Alignment genome")) adf <- new("AnnotatedDataFrame", data=df, varMetadata=dfMeta) SolexaSet(sf, adf)
showClass("SolexaSet") showMethods(class="SolexaSet", where=getNamespace("ShortRead")) ## construct a SolexaSet sf <- system.file("extdata", package="ShortRead") df <- data.frame(Sample=c("Sample 1", "Sample 2", "Sample 3", "Sample 4", "Center-wide control", "Sample 6", "Sample 7", "Sample 8"), Genome=c(rep("hg18", 4), "phi_plus_SNPs.txt", rep("hg18", 3))) dfMeta <- data.frame(labelDescription=c("Type of sample", "Alignment genome")) adf <- new("AnnotatedDataFrame", data=df, varMetadata=dfMeta) SolexaSet(sf, adf)
A reference class to manage the trellis graphics related component of the
Snapshot
functionality for visualization of genomic data.
SpTrellis(trellis, debug_enabled=FALSE)
SpTrellis(trellis, debug_enabled=FALSE)
trellis |
A trellis object for storing the plot of the genome area being visualized. |
debug_enabled |
|
trellis
:Object of class trellis
for storing the plot
information.
logical(1)
indicating whether class methods
should report debugging information to the user.
signature(x="SpTrellis")
: zoom in
signature(x="SpTrellis")
: zoom out
signature(x="SpTrellis")
: shift to the right
signature(x="SpTrellis")
: shift to the left
signature(x="SpTrellis")
: restore to the
original plot
signature(x="SpTrellis")
: show the current plot
signature(x="SpTrellis")
: update the trellis
parameters of the SpTrellis
object.
Chao-Jen [email protected]
col <- c("#66C2A5", "#FC8D62") x = numeric(1000) x[sample(1000, 100)] <- abs(rnorm(100)) df <- data.frame(x = c(x, -x), pos = seq(1, 1e5, length.out=1000), group = rep(c("positive", "negative"), each=1000)) cv <- lattice::xyplot(x ~ pos, df, group=group, type="s", col=col, main="yeast chrI:1 - 2e5", ylab="Coverage", xlab="Coordinate", scales=list(y=list(tck=c(1,0)), x=list(rot=45, tck=c(1,0), tick.number=20)), panel=function(...) { lattice::panel.xyplot(...) lattice::panel.grid(h=-1, v=20) lattice::panel.abline(a=0, b=0, col="grey") }) s <- SpTrellis(cv) s zi(s) zi(s) left(s) right(s) zo(s) restore(s)
col <- c("#66C2A5", "#FC8D62") x = numeric(1000) x[sample(1000, 100)] <- abs(rnorm(100)) df <- data.frame(x = c(x, -x), pos = seq(1, 1e5, length.out=1000), group = rep(c("positive", "negative"), each=1000)) cv <- lattice::xyplot(x ~ pos, df, group=group, type="s", col=col, main="yeast chrI:1 - 2e5", ylab="Coverage", xlab="Coordinate", scales=list(y=list(tck=c(1,0)), x=list(rot=45, tck=c(1,0), tick.number=20)), panel=function(...) { lattice::panel.xyplot(...) lattice::panel.grid(h=-1, v=20) lattice::panel.abline(a=0, b=0, col="grey") }) s <- SpTrellis(cv) s zi(s) zi(s) left(s) right(s) zo(s) restore(s)
Use Snapshot
-class to visualize a specific region of genomic data
spViewPerFeature(GRL, name, files, ignore.strand=FALSE, multi.levels = FALSE, fac=character(0L), ...)
spViewPerFeature(GRL, name, files, ignore.strand=FALSE, multi.levels = FALSE, fac=character(0L), ...)
GRL |
Object |
name |
Character(1) specifying which element in |
files |
Charactor() or |
ignore.strand |
Logical(1) indicating whether to ignore the strand of the genomic data. |
multi.levels |
Logical(1) indicating whether to plot the
coverage of multiple files on different panels. If |
fac |
Character(1) indicating which column of local metadata
( |
... |
Arguments used for creating a |
A Snapshot
instance
Chao-Jen Wong [email protected]
## Example 1 library(GenomicFeatures) txdbFile <- system.file("extdata", "sacCer2_sgdGene.sqlite", package="yeastNagalakshmi") ## either use a txdb file quaried from UCSC or use existing TxDb packages. txdb <- loadDb(txdbFile) grl <- exonsBy(txdb, by="gene") file <- system.file("extdata", "SRR002051.chrI-V.bam", package="yeastNagalakshmi") s <- spViewPerFeature(GRL=grl, name="YAL001C", files=file) ## Example 2 ## multi-files: using 'BamFileList' and setting up the 'DataFrame' ## holding the phenotype data bfiles <- BamFileList(c(a=file, b=file)) values(bfiles) <- DataFrame(sampleGroup=factor(c("normal", "tumor"))) values(bfiles) s <- spViewPerFeature(GRL=grl, name="YAL001C", files=bfiles, multi.levels=TRUE, fac="sampleGroup")
## Example 1 library(GenomicFeatures) txdbFile <- system.file("extdata", "sacCer2_sgdGene.sqlite", package="yeastNagalakshmi") ## either use a txdb file quaried from UCSC or use existing TxDb packages. txdb <- loadDb(txdbFile) grl <- exonsBy(txdb, by="gene") file <- system.file("extdata", "SRR002051.chrI-V.bam", package="yeastNagalakshmi") s <- spViewPerFeature(GRL=grl, name="YAL001C", files=file) ## Example 2 ## multi-files: using 'BamFileList' and setting up the 'DataFrame' ## holding the phenotype data bfiles <- BamFileList(c(a=file, b=file)) values(bfiles) <- DataFrame(sampleGroup=factor(c("normal", "tumor"))) values(bfiles) s <- spViewPerFeature(GRL=grl, name="YAL001C", files=bfiles, multi.levels=TRUE, fac="sampleGroup")
srdistance
calculates the edit distance from each read in
pattern
to each read in subject
. The underlying
algorithm pairwiseAlignment
is only efficient when both
reads are short, and when the number of subject
reads is small.
srdistance(pattern, subject, ...)
srdistance(pattern, subject, ...)
pattern |
An object of class |
subject |
A short |
... |
additional arguments, unused. |
The underlying algorithm performs pairwise alignment from each read in
pattern
to each sequence in subject
. The return value is
a list of numeric vectors of distances, one list element for each
sequence in subject
. The vector in each list element contains
for each read in pattern
the edit distance from the read to the
corresponding subject. The weight matrix and gap penalties used to
calculate the distance are structured to weight base substitutions and
single base insert/deletions equally. Edit distance between known and
ambiguous (e.g., N) nucleotides, or between ambiguous nucleotides, are
weighted as though each possible nucleotide in the ambiguity were
equally likely.
A list of length equal to that of subject
. Each element is a
numeric vector equal to the length of pattern
, with values
corresponding to the minimum distance between between the
corresponding pattern and subject sequences.
Martin Morgan <[email protected]>
sp <- SolexaPath(system.file("extdata", package="ShortRead")) aln <- readAligned(sp, "s_2_export.txt") polyA <- polyn("A", 35) polyT <- polyn("T", 35) d1 <- srdistance(clean(sread(aln)), polyA) d2 <- srdistance(sread(aln), polyA) d3 <- srdistance(sread(aln), c(polyA, polyT))
sp <- SolexaPath(system.file("extdata", package="ShortRead")) aln <- readAligned(sp, "s_2_export.txt") polyA <- polyn("A", 35) polyT <- polyn("T", 35) d1 <- srdistance(clean(sread(aln)), polyA) d2 <- srdistance(sread(aln), polyA) d3 <- srdistance(sread(aln), c(polyA, polyT))
These generics order, rank, sort, and find duplicates in short read
objects, including fastq-encoded qualities. srorder
,
srrank
and srsort
differ from the default functions
rank
, order
and sort
in that sorting is based on
an internally-defined order rather than, e.g., the order implied by
LC_COLLATE
.
srorder(x, ...) srrank(x, ...) srsort(x, ...) srduplicated(x, ...)
srorder(x, ...) srrank(x, ...) srsort(x, ...) srduplicated(x, ...)
x |
The object to be sorted, ranked, ordered, or to have duplicates identified; see the examples below for objects for which methods are defined. |
... |
Additional arguments available for use by methods; usually ignored. |
Unlike sort
and friends, the implementation does not preserve
order of duplicated elements. Like duplicated
, one element in
each set of duplicates is marked as FALSE
.
srrank
settles ties using the “min” criterion described
in rank
, i.e., identical elements are ranked equal to the
rank of the first occurrence of the sorted element.
The following methods are defined, in addition to methods described in class-specific documentation:
signature(x = "XStringSet")
:
signature(x = "XStringSet")
:
signature(x = "XStringSet")
:
Apply srorder
, srrank
, srsort
,
srduplicated
to
XStringSet
objects such
as those returned by sread
.
signature(x = "ShortRead")
:
signature(x = "ShortRead")
:
signature(x = "ShortRead")
:
Apply srorder
, srrank
, srsort
,
srduplicated
to
XStringSet
objects to
the sread
component of ShortRead
and
derived objects.
The functions return the following values:
srorder |
An integer vector the same length as |
srrank |
An integer vector the same length as |
srsort |
An instance of |
srduplicated |
A logical vector the same length as |
Martin Morgan <[email protected]>
showMethods("srsort") showMethods("srorder") showMethods("srduplicated") sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") sum(srduplicated(sread(rfq))) srsort(sread(rfq)) srsort(quality(rfq))
showMethods("srsort") showMethods("srorder") showMethods("srduplicated") sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") sum(srduplicated(sread(rfq))) srsort(sread(rfq)) srsort(quality(rfq))
These functions create user-defined (srFitler
) or built-in
instances of SRFilter
objects. Filters can be
applied to objects from ShortRead
, returning a logical vector
to be used to subset the objects to include only those components
satisfying the filter.
srFilter(fun, name = NA_character_, ...) ## S4 method for signature 'missing' srFilter(fun, name=NA_character_, ...) ## S4 method for signature 'function' srFilter(fun, name=NA_character_, ...) compose(filt, ..., .name) idFilter(regex=character(0), fixed=FALSE, exclude=FALSE, .name="idFilter") occurrenceFilter(min=1L, max=1L, withSread=c(NA, TRUE, FALSE), duplicates=c("head", "tail", "sample", "none"), .name=.occurrenceName(min, max, withSread, duplicates)) nFilter(threshold=0L, .name="CleanNFilter") polynFilter(threshold=0L, nuc=c("A", "C", "T", "G", "other"), .name="PolyNFilter") dustyFilter(threshold=Inf, batchSize=NA, .name="DustyFilter") srdistanceFilter(subject=character(0), threshold=0L, .name="SRDistanceFilter") ## ## legacy filters for ungapped alignments ## chromosomeFilter(regex=character(0), fixed=FALSE, exclude=FALSE, .name="ChromosomeFilter") positionFilter(min=-Inf, max=Inf, .name="PositionFilter") strandFilter(strandLevels=character(0), .name="StrandFilter") alignQualityFilter(threshold=0L, .name="AlignQualityFilter") alignDataFilter(expr=expression(), .name="AlignDataFilter")
srFilter(fun, name = NA_character_, ...) ## S4 method for signature 'missing' srFilter(fun, name=NA_character_, ...) ## S4 method for signature 'function' srFilter(fun, name=NA_character_, ...) compose(filt, ..., .name) idFilter(regex=character(0), fixed=FALSE, exclude=FALSE, .name="idFilter") occurrenceFilter(min=1L, max=1L, withSread=c(NA, TRUE, FALSE), duplicates=c("head", "tail", "sample", "none"), .name=.occurrenceName(min, max, withSread, duplicates)) nFilter(threshold=0L, .name="CleanNFilter") polynFilter(threshold=0L, nuc=c("A", "C", "T", "G", "other"), .name="PolyNFilter") dustyFilter(threshold=Inf, batchSize=NA, .name="DustyFilter") srdistanceFilter(subject=character(0), threshold=0L, .name="SRDistanceFilter") ## ## legacy filters for ungapped alignments ## chromosomeFilter(regex=character(0), fixed=FALSE, exclude=FALSE, .name="ChromosomeFilter") positionFilter(min=-Inf, max=Inf, .name="PositionFilter") strandFilter(strandLevels=character(0), .name="StrandFilter") alignQualityFilter(threshold=0L, .name="AlignQualityFilter") alignDataFilter(expr=expression(), .name="AlignDataFilter")
fun |
An object of class |
name |
A |
filt |
A |
.name |
An optional |
regex |
Either |
fixed |
|
exclude |
|
min |
|
max |
|
strandLevels |
Either |
withSread |
A |
duplicates |
Either |
threshold |
A |
nuc |
A |
batchSize |
|
subject |
A |
expr |
A |
... |
Additional arguments for subsequent methods; these arguments are not currently used. |
srFilter
allows users to construct their own filters. The
fun
argument to srFilter
must be a function accepting a
single argument x
and returning a logical vector that can be
used to select elements of x
satisfying the filter with
x[fun(x)]
The signature(fun="missing")
method creates a default filter
that returns a vector of TRUE
values with length equal to
length(x)
.
compose
constructs a new filter from one or more existing
filter. The result is a filter that returns a logical vector with
indices corresponding to components of x
that pass all
filters. If not provided, the name of the filter consists of the names
of all component filters, each separated by " o "
.
The remaining functions documented on this page are built-in filters
that accept an argument x
and return a logical vector of
length(x)
indicating which components of x
satisfy the
filter.
idFilter
selects elements satisfying
grep(regex, id(x), fixed=fixed)
.
chromosomeFilter
selects elements satisfying
grep(regex, chromosome(x), fixed=fixed)
.
positionFilter
selects elements satisfying
min <= position(x) <= max
.
strandFilter
selects elements satisfying
match(strand(x), strand, nomatch=0) > 0
.
occurrenceFilter
selects elements that occur >=min
and
<=max
times. withSread
determines how reads will be
treated: TRUE
to include the sread, chromosome, strand, and
position when determining occurrence, FALSE
to include
chromosome, strand, and position, and NA
to include only
sread. The default is withSread=NA
. duplicates
determines how reads with more than max
reads are
treated. head
selects the first max
reads of each set of
duplicates, tail
the last max
reads, and sample
a
random sample of max
reads. none
removes all reads
represented more than max
times. The user can also provide a
function (as used by tapply
) of a single argument to
select amongst reads.
nFilter
selects elements with fewer than threshold
'N'
symbols in each element of sread(x)
.
polynFilter
selects elements with fewer than threshold
copies of any nucleotide indicated by nuc
.
dustyFilter
selects elements with high sequence complexity, as
characterized by their dustyScore
. This emulates the
dust
command from WindowMaker
software. Calculations can be memory intensive; use
batchSize
to process the argument to dustyFilter
in
batches of the specified size.
srdistanceFilter
selects elements at an edit distance greater
than threshold
from all sequences in subject
.
alignQualityFilter
selects elements with alignQuality(x)
greater than threshold
.
alignDataFilter
selects elements with
pData(alignData(x))
satisfying expr
. expr
should
be formulated as though it were to be evaluated as
eval(expr, pData(alignData(x)))
.
srFilter
returns an object of SRFilter
.
Built-in filters return a logical vector of length(x)
, with
TRUE
indicating components that pass the filter.
Martin Morgan <[email protected]>
sp <- SolexaPath(system.file("extdata", package="ShortRead")) aln <- readAligned(sp, "s_2_export.txt") # Solexa export file, as example # a 'chromosome 5' filter filt <- chromosomeFilter("chr5.fa") aln[filt(aln)] # filter during input readAligned(sp, "s_2_export.txt", filter=filt) # x- and y- coordinates stored in alignData, when source is SolexaExport xy <- alignDataFilter(expression(abs(x-500) > 200 & abs(y-500) > 200)) aln[xy(aln)] # both filters as a single filter chr5xy <- compose(filt, xy) aln[chr5xy(aln)] # both filters as a collection filters <- c(filt, xy) subsetByFilter(aln, filters) summary(filters, aln) # read, chromosome, strand, position tuples occurring exactly once aln[occurrenceFilter(withSread=TRUE, duplicates="none")(aln)] # reads occurring exactly once aln[occurrenceFilter(withSread=NA, duplicates="none")(aln)] # chromosome, strand, position tuples occurring exactly once aln[occurrenceFilter(withSread=FALSE, duplicates="none")(aln)] # custom filter: minimum calibrated base call quality >20 goodq <- srFilter(function(x) { apply(as(quality(x), "matrix"), 1, min, na.rm=TRUE) > 20 }, name="GoodQualityBases") goodq aln[goodq(aln)]
sp <- SolexaPath(system.file("extdata", package="ShortRead")) aln <- readAligned(sp, "s_2_export.txt") # Solexa export file, as example # a 'chromosome 5' filter filt <- chromosomeFilter("chr5.fa") aln[filt(aln)] # filter during input readAligned(sp, "s_2_export.txt", filter=filt) # x- and y- coordinates stored in alignData, when source is SolexaExport xy <- alignDataFilter(expression(abs(x-500) > 200 & abs(y-500) > 200)) aln[xy(aln)] # both filters as a single filter chr5xy <- compose(filt, xy) aln[chr5xy(aln)] # both filters as a collection filters <- c(filt, xy) subsetByFilter(aln, filters) summary(filters, aln) # read, chromosome, strand, position tuples occurring exactly once aln[occurrenceFilter(withSread=TRUE, duplicates="none")(aln)] # reads occurring exactly once aln[occurrenceFilter(withSread=NA, duplicates="none")(aln)] # chromosome, strand, position tuples occurring exactly once aln[occurrenceFilter(withSread=FALSE, duplicates="none")(aln)] # custom filter: minimum calibrated base call quality >20 goodq <- srFilter(function(x) { apply(as(quality(x), "matrix"), 1, min, na.rm=TRUE) > 20 }, name="GoodQualityBases") goodq aln[goodq(aln)]
Objects of this class are functions that, when provided an appropriate object from the ShortRead package, return logical vectors indicating which parts of the object satisfy the filter criterion.
A number of filters are built-in (described below); users are free to
create their own filters, using the srFilter
function.
Objects can be created through srFilter
(to create a
user-defined filter) or through calls to constructors for predefined
filters, as described on the srFilter
page.
.Data
:Object of class "function"
taking a
single named argument x
corresponding to the ShortRead
object that the filter will be applied to. The return value of the
filter function is expected to be a logical vector that can be
used to subset x
to include those elements of x
satisfying the filter.
name
:Object of class "ScalarCharacter"
representing the name of the filter. The name is useful for
suggesting the purpose of the filter, and for debugging failed
filters.
Class "function"
, from data part.
Class ".SRUtil"
, directly.
Class "OptionalFunction"
, by class "function", distance 2.
Class "PossibleMethod"
, by class "function", distance 2.
signature(fun = "SRFilter")
: Return the function
representing the underlying filter; this is primarily for
interactive use to understanding filter function; usually the
filter is invoked as a normal function call, as illustrated below
signature(x = "SRFilter")
: Return, as a
ScalarCharacter
, the name of the function.
signature(object = "SRFilter")
: display a brief
summary of the filter
signature(from = "SRFilter", to =
"FilterRules")
: Coerce a filter to a
FilterRules
object of
length one.
signature(x = "SRFilter", ...)
: Combine filters into a
single FilterRules
object.
Martin Morgan <[email protected]>
srFilter
for predefined and user-defined filters.
## see ?srFilter
## see ?srFilter
Objects of this class are logical vectors indicating records passing the applied filter, with an associated data frame summarizing the name, input number of records, records passing filter, and logical operation used for all filters in which the result participated.
SRFilterResult(x = logical(), name = NA_character_, input = length(x), passing = sum(x), op = NA_character_) ## S4 method for signature 'SRFilterResult,SRFilterResult' Logic(e1, e2) ## S4 method for signature 'SRFilterResult' name(x, ...) stats(x, ...) ## S4 method for signature 'SRFilterResult' show(object)
SRFilterResult(x = logical(), name = NA_character_, input = length(x), passing = sum(x), op = NA_character_) ## S4 method for signature 'SRFilterResult,SRFilterResult' Logic(e1, e2) ## S4 method for signature 'SRFilterResult' name(x, ...) stats(x, ...) ## S4 method for signature 'SRFilterResult' show(object)
x , object , e1 , e2
|
For |
name |
|
input |
|
passing |
|
op |
|
... |
Additional arguments, unused in methods documented on this page. |
Objects can be created through SRFilterResult
, but these
are automatically created by the application of srFilter
instances.
.Data
:Object of class "logical"
indicating
records that passed the filter.
name
:Object of class "ScalarCharacter"
representing the name of the filter whose results are
summarized. The name is either the actual name of the filter, or a
combination of filter names and logical operations when the
outcome results from application of several filters in a single
logical expression.
stats
:Object of class "data.frame"
summarizing
the name, input number of records, records passing filter, and
logical operation used for all filters in which the result
participated. The data.frame
rows correspond either to
single filters, or to logical combinations of filters.
Class "logical"
, from data part.
Class ".SRUtil"
, directly.
Class "vector"
, by class "logical", distance 2.
Class "atomic"
, by class "logical", distance 2.
Class "vectorORfactor"
, by class "logical", distance 3.
signature(e1 = "SRFilterResult", e2 =
"SRFilterResult")
: logic operations on filters.
signature(x = "SRFilterResult")
: Negate the outcome
of the current filter results
signature(x = "SRFilterResult")
: The name of the
filter that the results are based on.
signature(x = "SRFilterResult")
: a
data.frame
as described in the ‘Slots’ section of
this page.
signature(object = "SRFilterResult")
: summary of
filter results.
Martin Morgan mailto:[email protected]
fa <- srFilter(function(x) x %% 2 == 0, "Even") fb <- srFilter(function(x) x %% 2 == 1, "Odd") x <- 1:10 fa(x) | fb(x) fa(x) & fb(x) !(fa(x) & fb(x))
fa <- srFilter(function(x) x %% 2 == 0, "Even") fb <- srFilter(function(x) x %% 2 == 1, "Odd") x <- 1:10 fa(x) | fb(x) fa(x) & fb(x) !(fa(x) & fb(x))
This class coordinates phenotype (sample) and sequence data, primarily as used on the Roche platform.
Conceptually, this class has reads from a single experiment
represented as a long vector, ordered by sample. The readCount
slot indicates the number of reads in each sample, so that the sum of
readCount
is the total number of reads in the experiment. The
readIndex
field is a light-weight indicator of which reads from
all those available that are currently referenced by the SRSet
.
Objects of this class are not usually created directly, but instead
are created by a derived class, e.g., RocheSet
.
sourcePath
:Object of class "ExperimentPath"
,
containing the directory path where sequence files can be found.
readIndex
:Object of class "integer"
indicating
specific sequences included in the experiment.
readCount
:Object of class "integer"
containing
the number of reads in each sample included in the experiment. The
sum of this vector is the total number of reads.
phenoData
:Object of class "AnnotatedDataFrame"
describing each sample in the experiment. The number of rows of
phenoData
equals the number of elements in
readCount
.
readData
:Object of class "AnnotatedDataFrame"
containing annotations on all reads.
Class ".ShortReadBase"
, directly.
signature(object = "SRSet")
: return the
ExperimentPath
associated with this object.
signature(object = "SRSet")
: return the
phenoData
associated with this object.
signature(object="SRSet")
:
signature(object="SRSet")
:
signature(object="SRSet")
:
signature(object="SRSet")
: Retrieve the
corresponding slot from object
.
signature(object = "SRSet")
: display the contents
of this object.
signature(x = "SRSet")
: provide more
extensive information on the object.
Michael Lawrence <[email protected]>
showClass("SRSet")
showClass("SRSet")
These classes provide important utility functions in the ShortRead package, but may occasionally be seen by the user and are documented here for that reason.
Utility classes include:
.SRUtil-class
a virtual base class from which all
utility classes are derived.
SRError-class
created when errors occur in
ShortRead package code.
SRWarn-class
created when warnings occur in
ShortRead package code
SRList-class
representing a list (heterogeneous
collection) of objects. The S4Vectors::SimpleList class is a
better choice for a list-like container.
SRVector-class
representing a vector (homogeneous
collection, i.e., all elements of the same class) of objects.
Objects from these classes are not normally constructed by the user. However, constructors are available, as follows.
SRError(type, fmt, ...)
, SRWarn(type, fmt, ...)
:
character(1)
vector describing the type of the
error. type
must come from a pre-defined list of types.
a sprintf
-style format string for the
message to be reported with the error.
additional arguments to be interpolated into fmt
.
SRList(...)
elements of any type or length to be placed into the
SRList
. If the length of ...
is 1 and the
argument is a list, then the list itself is placed into
SRList
.
SRVector(..., vclass)
elements all satisfying an is
relationship
with vclass
, to be placed in SRVector
.
the class to which all elements in ...
belong. If vclass
is missing and length(list(...))
is greater than zero, then vclass
is taken to be the class
of the first argument of ...
.
SRVector
errors:
this error occurs when not all
arguments ...
satisfy an ‘is’ relationship with
vclass
.
SRError
and SRWarn
have the following slots defined:
.type
:Object of class "character"
containing
the type of error or warning. .type
must come from a
pre-defined list of types, see, e.g.,
ShortRead:::.SRError_types
.
.message
:Object of class "character"
containing a detailed message describing the error or warning.
SRList
has the following slot defined:
.srlist
:Object of class "list"
containing the
elements in the list.
SRVector
extends SRList
, with the following additional
slot:
vclass
:Object of class "character"
naming the
type of object all elements of SRVector
must be.
Accessors are available for all slots, and have the same name as the
slot, e.g., vclass
to access the vclass
slot of
SRVector
. Internal slots (those starting with ‘.’ also
have accessors, but these are not exported e.g.,
ShortRead:::.type
.
SRList
has the following methods:
signature(x = "SRList")
: return the
(integer(1)
) length of the SRList
.
signature(x = "SRList")
: return a character
vector of list element names. The length of the returned vector is
the same as the length of x
.
signature(x = "SRList", value = "character")
:
assign value
as names for members of x
.
signature(x = "SRList", i = "ANY", j = "missing")
:
subset the list using standard R list subset paradigms.
signature(x = "SRList", i = "ANY", j = "missing")
:
select element ‘i’ from the list, using standard R list
selection paradigms.
signature(X = "SRList", FUN="ANY")
: apply a function to
all elements of X
, with additional arguments interpreted
as with lapply
.
signature(X = "SRList")
: apply a function to
all elements of X
, simplifying the result if
possible. Additional arguments interpreted as with
sapply
.
signature(object="SRList")
: coerce the SRList
object to a list.
signature(object = "SRList")
: display an
informative summary of the object content, including the length of
the list represented by object
.
signature(x = "SRList")
: display a more
extensive version of the object, as one might expect from printing
a standard list in R.
SRVector
inherits all methods from SRList
, and has the
following additional methods:
signature(object = "SRVector")
: display an
informative summary of the object content, e.g., the vector class
(vclass
) and length.
signature(x = "SRVector")
: display a more
extensive version of the object, as one might expect from a
printing a standard R list.
Martin Morgan
getClass(".SRUtil", where=getNamespace("ShortRead")) ShortRead:::.SRError_types ShortRead:::.SRWarn_types detail(SRList(1:5, letters[1:5])) tryCatch(SRVector(1:5, letters[1:5]), SRVectorClassDisagreement=function(err) { cat("caught:", conditionMessage(err), "\n") })
getClass(".SRUtil", where=getNamespace("ShortRead")) ShortRead:::.SRError_types ShortRead:::.SRWarn_types detail(SRList(1:5, letters[1:5])) tryCatch(SRVector(1:5, letters[1:5]), SRVectorClassDisagreement=function(err) { cat("caught:", conditionMessage(err), "\n") })
This generic summarizes the number of times each sequence occurs in an
XStringSet
instance.
tables(x, n=50, ...)
tables(x, n=50, ...)
x |
An object for which a |
n |
An |
... |
Additional arguments available to methods |
Methods of this generic summarize the frequency with which each read occurs, There are two components to the summary. The reads are reported from most common to least common; typically a method parameter controls how many reads to report. Methods also return a pair of vectors describing how many reads were represented 1, 2, ... times.
The following methods are defined, in addition to methods described in class-specific documentation:
signature(x= "XStringSet", n = 50)
:
Apply tables
to the XStringSet
x
.
A list of length two.
top |
A named integer vector. Names correspond to sequences.
Values are the number of times the corresponding sequence occurs in
the |
distribution |
a |
Martin Morgan <[email protected]>
showMethods("tables") sp <- SolexaPath(system.file("extdata", package="ShortRead")) aln <- readAligned(sp) tables(sread(aln), n=6) lattice::xyplot(log10(nReads)~log10(nOccurrences), tables(sread(aln))$distribution)
showMethods("tables") sp <- SolexaPath(system.file("extdata", package="ShortRead")) aln <- readAligned(sp) tables(sread(aln), n=6) lattice::xyplot(log10(nReads)~log10(nOccurrences), tables(sread(aln))$distribution)
These generic functions remove leading or trailing nucleotides or
qualities. trimTails
and trimTailw
remove low-quality
reads from the right end using a sliding window (trimTailw
) or
a tally of (successive) nucleotides falling at or below a quality
threshold (trimTails
). trimEnds
takes an alphabet of
characters to remove from either left or right end.
## S4 methods for 'ShortReadQ', 'FastqQuality', or 'SFastqQuality' trimTailw(object, k, a, halfwidth, ..., ranges=FALSE) trimTails(object, k, a, successive=FALSE, ..., ranges=FALSE) trimEnds(object, a, left=TRUE, right=TRUE, relation=c("<=", "=="), ..., ranges=FALSE) ## S4 method for signature 'BStringSet' trimTailw(object, k, a, halfwidth, ..., alphabet, ranges=FALSE) ## S4 method for signature 'BStringSet' trimTails(object, k, a, successive=FALSE, ..., alphabet, ranges=FALSE) ## S4 method for signature 'character' trimTailw(object, k, a, halfwidth, ..., destinations, ranges=FALSE) ## S4 method for signature 'character' trimTails(object, k, a, successive=FALSE, ..., destinations, ranges=FALSE) ## S4 method for signature 'character' trimEnds(object, a, left=TRUE, right=TRUE, relation=c("<=", "=="), ..., destinations, ranges=FALSE)
## S4 methods for 'ShortReadQ', 'FastqQuality', or 'SFastqQuality' trimTailw(object, k, a, halfwidth, ..., ranges=FALSE) trimTails(object, k, a, successive=FALSE, ..., ranges=FALSE) trimEnds(object, a, left=TRUE, right=TRUE, relation=c("<=", "=="), ..., ranges=FALSE) ## S4 method for signature 'BStringSet' trimTailw(object, k, a, halfwidth, ..., alphabet, ranges=FALSE) ## S4 method for signature 'BStringSet' trimTails(object, k, a, successive=FALSE, ..., alphabet, ranges=FALSE) ## S4 method for signature 'character' trimTailw(object, k, a, halfwidth, ..., destinations, ranges=FALSE) ## S4 method for signature 'character' trimTails(object, k, a, successive=FALSE, ..., destinations, ranges=FALSE) ## S4 method for signature 'character' trimEnds(object, a, left=TRUE, right=TRUE, relation=c("<=", "=="), ..., destinations, ranges=FALSE)
object |
An object (e.g., |
k |
|
a |
For For |
halfwidth |
The half width (cycles before or after the current; e.g., a half-width of 5 would span 5 + 1 + 5 cycles) in which qualities are assessed. |
successive |
|
left , right
|
|
relation |
|
... |
Additional arguments, perhaps used by methods. |
destinations |
For |
alphabet |
|
ranges |
|
trimTailw
starts at the left-most nucleotide, tabulating the
number of cycles in a window of 2 * halfwidth + 1
surrounding
the current nucleotide with quality scores that fall at or below
a
. The read is trimmed at the first nucleotide for which this
number >= k
. The quality of the first or last nucleotide is
used to represent portions of the window that extend beyond the
sequence.
trimTails
starts at the left-most nucleotide and accumulates
cycles for which the quality score is at or below a
. The read
is trimmed at the first location where this number >= k
. With
successive=TRUE
, failing qualities must occur in strict
succession.
trimEnds
examines the left
, right
, or both ends
of object
, marking for removal letters that correspond to
a
and relation
. The trimEnds,ShortReadQ-method
trims based on quality.
ShortReadQ
methods operate on quality scores; use
sread()
and the ranges
argument to trim based on
nucleotide (see examples).
character
methods transform one or several fastq files to new
fastq files, applying trim operations based on quality scores; use
filterFastq
with your own filter
argument to filter on
nucleotides.
An instance of class(object)
trimmed to contain only those
nucleotides satisfying the trim criterion or, if ranges=TRUE
an
IRanges
instance defining the ranges that would trim
object
.
The trim*
functions use OpenMP threads (when available) during
creation of the return value. This may sometimes create problems when
a process is already running on multiple threads, e.g., with an error
message like
libgomp: Thread creation failed: Resource temporarily unavailable
A solution is to precede problematic code with the following code snippet, to disable threading
nthreads <- .Call(ShortRead:::.set_omp_threads, 1L) on.exit(.Call(ShortRead:::.set_omp_threads, nthreads))
Martin Morgan <[email protected]>
showMethods(trimTails) sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") ## remove leading / trailing quality scores <= 'I' trimEnds(rfq, "I") ## remove leading / trailing 'N's rng <- trimEnds(sread(rfq), "N", relation="==", ranges=TRUE) narrow(rfq, start(rng), end(rng)) ## remove leading / trailing 'G's or 'C's trimEnds(rfq, c("G", "C"), relation="==")
showMethods(trimTails) sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") ## remove leading / trailing quality scores <= 'I' trimEnds(rfq, "I") ## remove leading / trailing 'N's rng <- trimEnds(sread(rfq), "N", relation="==", ranges=TRUE) narrow(rfq, start(rng), end(rng)) ## remove leading / trailing 'G's or 'C's trimEnds(rfq, c("G", "C"), relation="==")
These functions perform a variety of simple operations.
polyn(nucleotides, n)
polyn(nucleotides, n)
nucleotides |
A character vector with all elements having exactly 1 character, typically from the IUPAC alphabet. |
n |
An |
polyn
returns a character vector with each element having
n
characters. Each element contains a single nucleotide. Thus
polyn("A", 5)
returns AAAAA
.
polyn
returns a character vector of length length(nucleotide)
Martin Morgan <[email protected]>
polyn(c("A", "N"), 35)
polyn(c("A", "N"), 35)