Title: | Analysis of nucleotide, sequence and quality content on fastq files |
---|---|
Description: | Analyze read length, phred scores and alphabet frequency and DNA k-mers on uncompressed and compressed fastq files. |
Authors: | Wolfgang Kaisers |
Maintainer: | Wolfgang Kaisers <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.41.0 |
Built: | 2024-11-14 06:00:02 UTC |
Source: | https://github.com/bioc/seqTools |
Analyze read length, phred scores and alphabeth frequency and DNA k-mers on uncompressed and compressed files.
Package: | seqTools |
Type: | Package |
Version: | 0.99.31 |
Date: | 2013-10-14 |
License: | GPL-2 |
Depends: | methods |
Wolfgang Kaisers Maintainer: Wolfgang Kaisers <[email protected]>
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-1771
# A) Count DNA k-mer countDnaKmers("ATAAATA", 2) # B) Quality check on FASTQ file basedir <- system.file("extdata", package="seqTools") setwd(basedir) fq <- fastqq("test_l6.fq") plotPhredQuant(fq, 1)
# A) Count DNA k-mer countDnaKmers("ATAAATA", 2) # B) Quality check on FASTQ file basedir <- system.file("extdata", package="seqTools") setwd(basedir) fq <- fastqq("test_l6.fq") plotPhredQuant(fq, 1)
ascii2char
calculates character representations for given phred values.
char2ascii
returns phred values for given ASCII encoded
representations (the reverse transformation of ascii2char
).
ascii2char(x, multiple=FALSE) char2ascii(c)
ascii2char(x, multiple=FALSE) char2ascii(c)
x |
|
multiple |
|
c |
|
The functions are only wrappers for convenience. char2ascii
is defined as strtoi(charToRaw(c), base = 16L)
. ascii2char
is defined as rawToChar(as.raw(x), multiple)
.
ascii2char
returns character
. char2ascii
returns integer
.
Wolfgang Kaisers
Ewing B, Green P Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities Genome Research 1998 8(3): 186-194
getPhredTable
ascii2char(97:101, multiple=FALSE) ascii2char(97:101, multiple=TRUE) char2ascii("abcde") char2ascii(paste("a", "b", "c", collapse="")) ascii2char(char2ascii("abcde"))
ascii2char(97:101, multiple=FALSE) ascii2char(97:101, multiple=TRUE) char2ascii("abcde") char2ascii(paste("a", "b", "c", collapse="")) ascii2char(char2ascii("abcde"))
Calculates pairwise distance matrix from DNA k-mer counts based on a modified Canberra distance. Before calculating canberra distances, read counts are normalized (in order to correct systematic effects on the distance) by scaling up read counts in each DNA k-mer count vector so that normalized read counts in each sample are nearly equal.
cbDistMatrix(object,nReadNorm=max(nReads(object)))
cbDistMatrix(object,nReadNorm=max(nReads(object)))
object |
|
nReadNorm |
|
The distance between two DNA k-mer normalized count vectors is calculated by
where cb is given by
Square matrix
. The number of rows equals the number of files
(=nFiles(object)
).
The static size of the retured k-mer array is 4^k.
Wolfgang Kaisers
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-1771
hclust
basedir<-system.file("extdata",package="seqTools") basenames<-c("g4_l101_n100.fq.gz","g5_l101_n100.fq.gz") filenames<-file.path(basedir,basenames) fq<-fastqq(filenames,6,c("g4","g5")) dm<-cbDistMatrix(fq)
basedir<-system.file("extdata",package="seqTools") basenames<-c("g4_l101_n100.fq.gz","g5_l101_n100.fq.gz") filenames<-file.path(basedir,basenames) fq<-fastqq(filenames,6,c("g4","g5")) dm<-cbDistMatrix(fq)
Objects of class Fastqq
are created by reading data from
FASTQ-files using the function fastqq
. The fastqq
function calls
Sys.time()
before and after execution of the core collecting routine.
collectDur
returns the number of seconds between these two times
(as numeric value). collectTime
returns the two timestamps inside a
list
.
collectDur(object) collectTime(object)
collectDur(object) collectTime(object)
object |
|
collectTime
returns numeric
. collectTime
returns list
.
Wolfgang Kaisers
fastqq
basedir <- system.file("extdata", package="seqTools") setwd(basedir) fq<-fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5")) collectTime(fq) collectDur(fq)
basedir <- system.file("extdata", package="seqTools") setwd(basedir) fq<-fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5")) collectTime(fq) collectDur(fq)
Counts occurrence of DNA k-mers in given DNA sequence.
The k-mers are searched in a set of search windows,
which are defined by start
and width
parameter.
From each position of the search window, a DNA k-mer is identified
on the right hand side on the given DNA sequence.
Each value in the start
vector defines the left border
of a search window.
The size of the search window is given by the appropriate value in the
width
vector.
The function is intended to count DNA k-mers in selected regions (e.g. exons)
on DNA sequence.
countDnaKmers(dna,k,start,width)
countDnaKmers(dna,k,start,width)
dna |
|
k |
|
start |
|
width |
|
The start positions for counting of DNA k-mers are all positions in
{start,...,start+width-1}.
As the identification of a DNA k-mer scans a sequence window of size k,
the last allowed start position counting a k-mer is nchar(dna)-k+1
.
The function throws the error 'Search region exceeds string end'
when a value start + width + k > nchar(dna) + 2
occurs.
matrix
. Each colum contains the motif-count values for one frame.
The column names are the values in the start vector.
Each row represents one DNA motif.
The DNA sequence of the DNA motif is given as row.name.
Wolfgang Kaisers
countGenomeKmers
seq <- "ATAAATA" countDnaKmers(seq, 2, 1:3, 3)
seq <- "ATAAATA" countDnaKmers(seq, 2, 1:3, 3)
Reads (compressed) fasta files and counts for DNA k-mers in the sequence.
countFastaKmers(filenames,k=4)
countFastaKmers(filenames,k=4)
filenames |
|
k |
Length of counted DNA k-mers. |
Maximal allowed value for k is 12.
matrix
.
The static size of the retured k-mer array is 4^k.
Wolfgang Kaisers
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-177
basedir <- system.file("extdata", package="seqTools") filename <- file.path(basedir,"small.fa") ## Not run: writeFai(filename, "small.fa.fai") res <- countFastaKmers(filename, k=2)
basedir <- system.file("extdata", package="seqTools") filename <- file.path(basedir,"small.fa") ## Not run: writeFai(filename, "small.fa.fai") res <- countFastaKmers(filename, k=2)
Counts K-mers of DNA sequences inside a vector of DNA sequences.
The k-mers are searched in a set of search windows,
which are defined by start
and width
parameter.
From each position of the search window, a DNA k-mer is
identified on the right hand side on the given DNA sequence.
Each value in the start
vector defindes the left border
of a search window.
The size of the search window is given by the appropriate value in the
width
vector.
The function is intended to count DNA k-mers in selected regions
(e.g. exons) on DNA chromosomes while respecting strand orientation.
countGenomeKmers(dna, seqid, start, width, strand, k)
countGenomeKmers(dna, seqid, start, width, strand, k)
dna |
|
seqid |
|
start |
|
width |
|
strand |
|
k |
|
The function returns a matrix. Each colum contains the motif-count values for one frame. Each row represents one DNA motif. The DNA sequence of the DNA motif is given as row.name.
matrix
.
Wolfgang Kaisers
sq <- "TTTTTCCCCGGGGAAAA" seqid <- as.integer(c(1, 1)) start <- as.integer(c(6, 14)) width <- as.integer(c(4, 4)) strand <- as.integer(c(1, 0)) k <- 2 countGenomeKmers(sq, seqid, start, width, strand, k)
sq <- "TTTTTCCCCGGGGAAAA" seqid <- as.integer(c(1, 1)) start <- as.integer(c(6, 14)) width <- as.integer(c(4, 4)) strand <- as.integer(c(1, 0)) k <- 2 countGenomeKmers(sq, seqid, start, width, strand, k)
The function regards the given string as DNA sequence bearing a collection of
splice sites. The given lEnd
and rStart
positions act as
(1-based) coordinates of the innermost exonic nucleotides. They reside on
exon-intron boundaries and have one exonic and one intronic adjacent
nucleotide. The function counts width
k-mers upstream on exonic
DNA in reading direction (left -> right on (+) strand, right -> left on
(-) strand).
countSpliceKmers(dna, seqid, lEnd, rStart, width, strand, k)
countSpliceKmers(dna, seqid, lEnd, rStart, width, strand, k)
dna |
|
seqid |
|
lEnd |
|
rStart |
|
width |
|
strand |
|
k |
|
The function returns a matrix. Each colum contains the motif-count values for one frame. Each row represents one DNA motif. The DNA sequence of the DNA motif is given as row.name.
matrix
.
Wolfgang Kaisers
seq <- "acgtGTccccAGcccc" countSpliceKmers(seq, seqid=1, lEnd=4, rStart=10, width=2, strand=1, k=3) # sq1 <- "TTTTTCCCCGGGGAAAA" sq2 <- "TTTTTTTCCCCGGGGAAAA" sq <- c(sq1, sq2) seqid <- c( 1, 1, 2, 2) lEnd <- c( 9, 9, 11, 11) rStart <- c(14, 14, 16, 16) width <- c( 4, 4, 4, 4) strand <- c( 1, 0, 1, 0) countSpliceKmers(sq, seqid, lEnd, rStart, width, strand, k=2)
seq <- "acgtGTccccAGcccc" countSpliceKmers(seq, seqid=1, lEnd=4, rStart=10, width=2, strand=1, k=3) # sq1 <- "TTTTTCCCCGGGGAAAA" sq2 <- "TTTTTTTCCCCGGGGAAAA" sq <- c(sq1, sq2) seqid <- c( 1, 1, 2, 2) lEnd <- c( 9, 9, 11, 11) rStart <- c(14, 14, 16, 16) width <- c( 4, 4, 4, 4) strand <- c( 1, 0, 1, 0) countSpliceKmers(sq, seqid, lEnd, rStart, width, strand, k=2)
Reads (compressed) FASTQ files and counts for DNA k-mers for each position in sequence.
fastqKmerLocs(filenames, k=4)
fastqKmerLocs(filenames, k=4)
filenames |
Vector of FASTQ file names. Files can be gz compressed. |
k |
Length of counted DNA k-mers. |
Maximal allowed value for k is 12.
list
. The length of the list equals the number of given
filenames.
Contains for each given file a matrix with 4^k rows and
(maxSeqLen - k + 1) columns (maxSeqLen= maximum read length).
The matrix contains for each k-mer and
k-mer-start position the counted values.
The static size of the retured k-mer array is 4^k.
Wolfgang Kaisers
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-1771
basedir <- system.file("extdata", package="seqTools") setwd(basedir) res <- fastqKmerLocs("test_l10_ATCGN.fq", k=2) res <- fastqKmerLocs("test_l10_atcg.fq", k=2) res <- fastqKmerLocs("test_l10_ATCGN.fq", k=2) res <- fastqKmerLocs("test_l6_multi_line.fq", k=2)
basedir <- system.file("extdata", package="seqTools") setwd(basedir) res <- fastqKmerLocs("test_l10_ATCGN.fq", k=2) res <- fastqKmerLocs("test_l10_atcg.fq", k=2) res <- fastqKmerLocs("test_l10_ATCGN.fq", k=2) res <- fastqKmerLocs("test_l6_multi_line.fq", k=2)
Reads (compressed) FASTQ files and counts for given DNA k-mer subset for each
position in sequence. The k-mer subset is given by a vector of k-mer
indices. k-mer indices can be obtained from DNA k-mers with the function
kMerIndex
.
fastqKmerSubsetLocs(filenames, k=4, kIndex)
fastqKmerSubsetLocs(filenames, k=4, kIndex)
filenames |
|
k |
|
kIndex |
|
Maximal allowed value for k is 12.
list
. The length of the list equals the number of given
filenames. Contains for each given file a matrix. Each matrix has one row
for each given kIndex
and an additional row with counts for all
other DNA k-mers (labeled other
). The number of columns equals the
maximal sequence length in the FASTQ file.
Wolfgang Kaisers
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-1771
basedir <- system.file("extdata", package="seqTools") setwd(basedir) k <- 4 kMers <- c("AAAA", "AACC", "AAGG") kIdx <- kMerIndex(kMers) res <- fastqKmerSubsetLocs("test_l6.fq", k, kIdx)
basedir <- system.file("extdata", package="seqTools") setwd(basedir) k <- 4 kMers <- c("AAAA", "AACC", "AAGG") kIdx <- kMerIndex(kMers) res <- fastqKmerSubsetLocs("test_l6.fq", k, kIdx)
Reads read numbers, read lengths, counts per position alphabet frequencies, phred scores and counts per file DNA k-mers from (possibly compressed) FASTQ files.
fastqq(filenames, k=6, probeLabel)
fastqq(filenames, k=6, probeLabel)
filenames |
Vector of FASTQ file names. Files can be gz compressed. |
k |
Length of counted DNA k-mers. |
probeLabel |
|
Maximal allowed value for k is 12.
S4 Object of class 'Fastqq'.
Wolfgang Kaisers
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-1771
Fastqq-class
basedir <- system.file("extdata", package="seqTools") setwd(basedir) fq <- fastqq("test_l6.fq") fq <- fastqq("test_l6_multi_line.fq") fq <- fastqq("non_exist.fq") fq <- fastqq("test_l10_ATCGN.fq") fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5"))
basedir <- system.file("extdata", package="seqTools") setwd(basedir) fq <- fastqq("test_l6.fq") fq <- fastqq("test_l6_multi_line.fq") fq <- fastqq("non_exist.fq") fq <- fastqq("test_l10_ATCGN.fq") fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5"))
"Fastqq"
Contains quality related summarizing data on FASTQ files.
Objects can be created by calls of the form fastqq("test.fq")
.
filenames
:"character"
: Vector of Fastqq file names.
probeLabel
:"character"
: Vector of probe labels.
nFiles
:"integer"
: Length of fileNamess.
k
:"integer"
: Length of counted DNA k-mers.
maxSeqLen
:"integer"
Maximum sequence length found in
FASTQ files. Determines row-number in 'seqLenCount' matrix and
column-number in 'nac' and 'phred' slot.
kmer
:"matrix"
Matrix containing DNA k-mers counts.
firstKmer
:"matrix"
Matrix containing count of
incipient DNA k-mers.
nReads
:"integer"
Vector containing number of reads
per file.
seqLenCount
:"matrix"
Matrix containing Counts of
read lengths.
gcContent
:"matrix"
Matrix containing GC content
(in percent).
nN
:"integer"
Vector containing Number of N
nucleotide entries per file.
nac
:"list"
Contains counted per position alphabet
frequencies.
phred
:"list"
Contains per position phred count
tables (one per Fastqq file).
seqLen
:"matrix"
Contains minimal and maximal
sequence length (one column per file).
collectTime
:"list"
Contains start and end time of
FASTQ reading as 'POSIXct'.
The following methods are defined for class Fastqq
:
Basic accessors:
signature(object="Fastqq")
: Returns k-value
(length of DNA k-mers) as integer
.
signature(object="Fastqq")
: Returns
matrix
with 4^k rows anc nFiles
columns. For each
k-mer and FASTQ-file, the absolute count value of the k-mer in the
FASTQ file is given.
signature(object="Fastqq")
: Returns number of
Files from which data has been collected as integer
.
signature(object="Fastqq")
: Returns integer
vector of length nFiles
. For each FASTQ file, the absolute
number of containes 'N' nucleotide entries is given.
signature(object="Fastqq")
: Returns number of
reads in each FASTQ file as integer
.
signature(object="Fastqq")
: Returns number
names of FASTQ files from which data has been collected as
character
.
signature(object="Fastqq")
: Returns maximum
sequence length which has been found in all FASTQ files as
integer
.
signature(object="Fastqq")
: Returns matrix
which tables counted read length in all FASTQ files.
signature(object="Fastqq",i="numeric")
:
Returns integer
vector of length 100 which countains
absolute read count numbers for each percentage of GC-content.
i
is the index of the FASTQ file for wich the values
are returned. The GC content values for all files together can
be obtained using gcContentMatrix
.
signature(object="Fastqq",i="integer")
:
Returns matrix
which contains the absolute nucleotide
count values for each nucleotide and read position. i
is
the index of the FASTQ file for wich the values are returned.
signature(object="Fastqq")
: Returns matrix
with two rows and nFiles
columns. For each file the minimum
and maximum read length is given.
signature(object="Fastqq")
: Returns a
matrix
with 4^k rows and nFiles
columns. Each entry
gives the absolute count of the k-mer (given as row name) in each
file (given as column name).
signature(object="Fastqq",i="integer")
: Returns
a matrix
with 93 rows and maxSeqLen
columns. The
matrix gived the absolute counts of each phred value for each
sequence position. i
is the index of the FASTQ file for
wich the values are returned.
signature(object="Fastqq",
quantiles="numeric", i="integer")
: Returns a data.frame
.
The data.frame has one row for each given quantile and
maxSeqLen
columns. Each value gives the quantile (given by
row name) of the phred values at the sequence position (given by
column name). For the quantiles
argument, a numeric vector
with values in [0,1] must be given. For the i
argument, a
single integer value must be given which denotes the index of the
FASTQ file from which values are returned (value must be in
{1,...,nFiles}).
signature(object="Fastqq")
: Returns
character
vector which contains the probeLabel
entries for given Fastqq
object.
Wolfgang Kaisers
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-1771
fastqq
basedir <- system.file("extdata", package="seqTools") setwd(basedir) fq <- fastqq(c("g4_l101_n100.fq.gz","g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4","g5")) # fileNames(fq) getK(fq) nNnucs(fq) nFiles(fq) nReads(fq) maxSeqLen(fq) collectTime(fq) collectDur(fq) slc<-seqLenCount(fq) nf<-nucFreq(fq,1) nf[1:4,1:10] seqLen(fq) probeLabel(fq) probeLabel(fq) <- 1:nFiles(fq) # kc<-kmerCount(fq) kc[1:10, ] plotKmerCount(fq) # ph<-phred(fq, 1) ph[25:35,1:15] pq <- phredQuantiles(fq,c(0.25, 0.5, 0.75), 1) plotNucFreq(fq, 1) # Nucleotide count plotNucCount(fq, 2:3) # GC content gcContent(fq, 1) # fqq<-fq[1]
basedir <- system.file("extdata", package="seqTools") setwd(basedir) fq <- fastqq(c("g4_l101_n100.fq.gz","g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4","g5")) # fileNames(fq) getK(fq) nNnucs(fq) nFiles(fq) nReads(fq) maxSeqLen(fq) collectTime(fq) collectDur(fq) slc<-seqLenCount(fq) nf<-nucFreq(fq,1) nf[1:4,1:10] seqLen(fq) probeLabel(fq) probeLabel(fq) <- 1:nFiles(fq) # kc<-kmerCount(fq) kc[1:10, ] plotKmerCount(fq) # ph<-phred(fq, 1) ph[25:35,1:15] pq <- phredQuantiles(fq,c(0.25, 0.5, 0.75), 1) plotNucFreq(fq, 1) # Nucleotide count plotNucCount(fq, 2:3) # GC content gcContent(fq, 1) # fqq<-fq[1]
Returns a matrix with read counts. getGCcontent returns a numeric vector with the GC contend (in percent) for each fastq file.
gcContentMatrix(object)
gcContentMatrix(object)
object |
|
The matrix contains one column for each FASTQ file. Rows labeled from 0 to 100 which represents percent (%) GC content. The matrix contains numbers of reads with the respective proportion of GC (Row 2 contains number of reads with 2% GC content).
matrix
.
Wolfgang Kaisers
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-1771
gcContent
basedir <- system.file("extdata",package="seqTools") setwd(basedir) # fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4","g5")) fqm<-gcContentMatrix(fq) getGCcontent(fq)
basedir <- system.file("extdata",package="seqTools") setwd(basedir) # fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4","g5")) fqm<-gcContentMatrix(fq) getGCcontent(fq)
For each k, there exist $4^k$ DNA k-mers. Many functions inside this package
return values where DNA k-mers appear as array indices. kMerIndex
can be used for extraction of count values for special k-mers by provision
of index values.
kMerIndex(kMers, k=nchar(kMers)[1], base=1)
kMerIndex(kMers, k=nchar(kMers)[1], base=1)
kMers |
|
k |
|
base |
|
Maximal allowed value for k is 12.
integer
.
Wolfgang Kaisers
kMerIndex(c("AACC", "ATAA")) kMerIndex(c("AA","AC"), base=1) kMerIndex(c("AA","AC"), base=0)
kMerIndex(c("AACC", "ATAA")) kMerIndex(c("AA","AC"), base=1) kMerIndex(c("AA","AC"), base=0)
Returns a copy of given object where DNA k-mer counts and first DNA k-mer count table are reduced in size.
meltDownK(object, newK)
meltDownK(object, newK)
object |
|
newK |
|
The function sums all count values which belong the the new motif up. The new motif is the new-k sized prefix of the given k-mer motif.
S4 Object of class 'Fastqq'.
The meltDownK
mechanism is assotiated with a change of
DNA k-mer count values (by its accumulative character).
Also, count values from down-melted tables are not identical to directly
counted values for lower k.
For example counting 'AAAA' with k=1 yields four 'A'.
Counting 'AAAA' with k=2 yields three 'AA'.
As meltDownK
sums up count values by prefix k-mers, the melted count
table for the second (k=2) count will return three 'A'.
Another source for differences may be N-nucleotides. Counting 'AANA' returns
three 'A' (using k=1) but only one 'AA' for k=2.
Wolfgang Kaisers
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-1771
basedir <- system.file("extdata", package="seqTools") setwd(basedir) fq<-fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4","g5")) fqm <- meltDownK(fq, 2)
basedir <- system.file("extdata", package="seqTools") setwd(basedir) fq<-fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4","g5")) fqm <- meltDownK(fq, 2)
Fastqq
objects.
The Fastqq
objects contain position-wise counted phred values.
The mergedPhred
function adds the counted values for all FASTQ files
together into a single matrix. The matrix then again contains position-wise
counted phred values. The mergedPhredQuantiles
and
plotMergedPhredQuant
are analogues to the phredQuantiles
and plotPhredQuant
functions.
mergedPhred(object) mergedPhredQuantiles(object, quantiles) plotMergedPhredQuant(object, main, ...)
mergedPhred(object) mergedPhredQuantiles(object, quantiles) plotMergedPhredQuant(object, main, ...)
object |
|
quantiles |
|
main |
|
... |
Optional arguments which are passed to the |
The function adds the phred values from all contained FASTQ data.
mergedPhred
returns a matrix
with 94 rows and
(maxSeqLen + 1) columns. mergedPhredQuantiles
returns a
data.frame
with one row for each given quantile and max(seqLen(.))
columns. plotMergedPhredQuant
returns nothing.
Wolfgang Kaisers
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-1771\ Ewing B, Green P Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 1998 Vol. 8 No. 3 186-194
basedir <- system.file("extdata",package="seqTools") setwd(basedir) fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5")) # ph <- mergedPhred(fq) ph[25:35, 1:15] pq <- mergedPhredQuantiles(fq, c(0.25, 0.5, 0.75)) plotMergedPhredQuant(fq) #
basedir <- system.file("extdata",package="seqTools") setwd(basedir) fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5")) # ph <- mergedPhred(fq) ph[25:35, 1:15] pq <- mergedPhredQuantiles(fq, c(0.25, 0.5, 0.75)) plotMergedPhredQuant(fq) #
The contents of two given Fastqq objects are merged together into one resulting Fastqq object.
mergeFastqq(lhs,rhs)
mergeFastqq(lhs,rhs)
lhs |
|
rhs |
|
The data on all FASTQ files in the two incoming objects is merged
together.
The object has the same internal structure as if the data from all FASTQ
files had been collected by a separate call of fastqq
on the merged
FASTQ file names of the arguments.
Duplicated probeLabel's are separated by adding of consecutive numbers as
suffix to all probeLabel's.
When lhs
and rhs
contain kmer-counts for different k
(getK), the function uses the meltDownK
mechanism in order to
equalize the k
values.
Therefore it is possible to compare samples which were counted with
different k
(i.e. k-mer resolution).
S4 Object of class 'Fastqq'.
Note that the meltDownK
mechanism is assotiated with a change of
DNA k-mer count values. See 'meltDownK' help (note) for more information.
Wolfgang Kaisers
basedir<-system.file("extdata",package="seqTools") setwd(basedir) # lhs<-fastqq("g4_l101_n100.fq.gz",k=4,"g4") rhs<-fastqq("g5_l101_n100.fq.gz",k=4,"g5") fq<-mergeFastqq(lhs,rhs)
basedir<-system.file("extdata",package="seqTools") setwd(basedir) # lhs<-fastqq("g4_l101_n100.fq.gz",k=4,"g4") rhs<-fastqq("g5_l101_n100.fq.gz",k=4,"g5") fq<-mergeFastqq(lhs,rhs)
The phredDist
function returns a named vector with
relative Phred content from the whole Fastqq
object or a subset
which is denoted by a index i
.
The plotPhredDist
function produces a plot of the
phredDist
values.
phredDist(object, i) plotPhredDist(object, i, maxp=45, col, ...)
phredDist(object, i) plotPhredDist(object, i, maxp=45, col, ...)
object |
|
i |
|
maxp |
|
col |
Colour encoding for plotted lines. |
... |
Additional values passed to plot function. |
i must be a numerical vector with values in {1,...,nFiles}. The
plotPhredDist
function is also prepared for additional arguments:
The maxp value denotes the maximal Phred value until which the Phred values
are plotted (possibly shrinks the x-Axis). The standard line color is
topo.colors(10)[3]
. Additional arguments (e.g. main="") can be
passed to the plot function.
phredDist
returns numeric
. plotPhredDist
returns nothing.
Wolfgang Kaisers
Ewing B, Green P Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 1998 Vol. 8 No. 3 186-194
basedir <- system.file("extdata", package="seqTools") setwd(basedir) fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4","g5")) # phredDist(fq) plotPhredDist(fq, main="g4 and g5") #
basedir <- system.file("extdata", package="seqTools") setwd(basedir) fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4","g5")) # phredDist(fq) plotPhredDist(fq, main="g4 and g5") #
The function calculates characters and corresponding ascii values for a given range of phred values. As default, a data.frame with all valid phred values {0,...,93} is returned.
phredTable(phred)
phredTable(phred)
phred |
|
data.frame
. The data.frame has three columns: "ascii","phred"
and "char"
Wolfgang Kaisers
Ewing B, Green P Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 1998 Vol. 8 No. 3 186-194
char2ascii
phredTable()
phredTable()
The function creates plots on proportions of relative GC content. For each FASTQ file from wich data is contained, one separate line is plotted. A value of 0.1 at the proportion of 40 says that 0.1 % of the reads have 40 % GC content.
plotGCcontent(object, main, ...)
plotGCcontent(object, main, ...)
object |
|
main |
|
... |
Other arguments which are passed to the internally called plot function. |
The area under each plotted line adds up to 1.
None.
Wolfgang Kaisers
Fastqq-class
basedir <- system.file("extdata", package="seqTools") setwd(basedir) fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5")) # plotGCcontent(fq)
basedir <- system.file("extdata", package="seqTools") setwd(basedir) fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5")) # plotGCcontent(fq)
The function creates plots from counted DNA k-mers from Fastqq objects.
plotKmerCount(object,index,mxey,main="K-mer count",...)
plotKmerCount(object,index,mxey,main="K-mer count",...)
object |
|
index |
|
mxey |
|
main |
|
... |
Additional parameters which are passed down to the plot function. |
Values for i must be in {1,...,nFiles}. The function shrinks the k-mer count table down to size of 4096 (k = 6) when k > 6 in order to limit the complexity of the plot.
None.
The static size of the retured k-mer array is 4^k.
Wolfgang Kaisers
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-1771
Fastqq-class
basedir <- system.file("extdata",package="seqTools") setwd(basedir) # fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5")) # plotKmerCount(fq) plotKmerCount(fq,1) plotKmerCount(fq, 1:2) #
basedir <- system.file("extdata",package="seqTools") setwd(basedir) # fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5")) # plotKmerCount(fq) plotKmerCount(fq,1) plotKmerCount(fq, 1:2) #
The function creates plots from nucleotide counts from Fastqq objects.
plotNucCount(object, nucs=16, maxx,...)
plotNucCount(object, nucs=16, maxx,...)
object |
|
nucs |
|
maxx |
|
... |
(currently unused). |
Values for i must be in {1,...,nFiles}. The nucs index encodes for IUPAC characters as shown in the following table.
1 | A | | | 6 | R | | | 11 | M | | | 16 | N |
2 | C | | | 7 | Y | | | 12 | B | | | 17 | . |
3 | G | | | 8 | S | | | 13 | D | | | 18 | - |
4 | T | | | 9 | W | | | 14 | H | | | 19 | = |
5 | U | | | 10 | K | | | 15 | V | | | 20 | '' |
When count values for 'A' are to be plotted, 'nucs' must be =1. When count values for 'GC' are to be plotted, 'nucs' must be c(2,3).
None.
Wolfgang Kaisers
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-1771
Fastqq-class
basedir <- system.file("extdata", package="seqTools") setwd(basedir) # fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5")) # plotNucCount(fq) plotNucCount(fq, 1) plotNucCount(fq, 1:2) #
basedir <- system.file("extdata", package="seqTools") setwd(basedir) # fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5")) # plotNucCount(fq) plotNucCount(fq, 1) plotNucCount(fq, 1:2) #
The function creates plots on position wise relative nucleotide content single FASTQ files.
plotNucFreq(object, i, main, maxx, ...)
plotNucFreq(object, i, main, maxx, ...)
object |
|
i |
|
main |
|
maxx |
|
... |
Other arguments which are passed to the internally called plot function. |
None.
Wolfgang Kaisers
Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research 2010 Vol.38 No.12 e131, doi: 10.1093/nar/gkq224
Fastqq-class
basedir <- system.file("extdata", package="seqTools") setwd(basedir) # fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4","g5")) # plotNucFreq(fq, 2) # Same plot without x-axis plotNucFreq(fq, 2, xaxt="n") # plotNucFreq(fq, 1, maxx=15)
basedir <- system.file("extdata", package="seqTools") setwd(basedir) # fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4","g5")) # plotNucFreq(fq, 2) # Same plot without x-axis plotNucFreq(fq, 2, xaxt="n") # plotNucFreq(fq, 1, maxx=15)
The function creates plots which describes the position wise distribution of phred quantiles in single FASTQ files.
plotPhredQuant(object, i, main, ...)
plotPhredQuant(object, i, main, ...)
object |
|
i |
|
main |
|
... |
Other arguments which are passed to the internally called plot function. |
None.
Wolfgang Kaisers
Fastqq-class
basedir <- system.file("extdata",package="seqTools") # setwd(basedir) fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5")) # plotPhredQuant(fq, 2) # Same plot without x-axis plotPhredQuant(fq,2, xaxt="n")
basedir <- system.file("extdata",package="seqTools") # setwd(basedir) fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5")) # plotPhredQuant(fq, 2) # Same plot without x-axis plotPhredQuant(fq,2, xaxt="n")
The propPhred
function returns a named vector
with relative Phred content for all contained lanes.
propPhred(object, greater = 30, less = 93)
propPhred(object, greater = 30, less = 93)
object |
|
greater |
|
less |
|
The greater
and less
arguments must be numeric, have
length 1 and be >0 and < 94. greater
must be less than less
.
With the default settings the reported proportions should be >50 % for
all lanes in order to be acceptable (see 't Hoen et. al.).
Numeric.
Wolfgang Kaisers
't Hoen et.al Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories Nature Biotechnology 2013 Vol. 31 1015 - 1022 (doi:10.1038/nbt.2702)
basedir <- system.file("extdata", package="seqTools") setwd(basedir) # fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5")) # Proportion of phred Values >30 propPhred(fq) # Proportion of phred Values >10 and < 30 propPhred(fq, greater=10, less=30)
basedir <- system.file("extdata", package="seqTools") setwd(basedir) # fq <- fastqq(c("g4_l101_n100.fq.gz", "g5_l101_n100.fq.gz"), k=4, probeLabel=c("g4", "g5")) # Proportion of phred Values >30 propPhred(fq) # Proportion of phred Values >10 and < 30 propPhred(fq, greater=10, less=30)
Counts DNA K-mers for reverse complement of given DNA sequence.
The k-mers are counted in a set of search windows,
which are defined by start
and width
parameter.
From each position of the search window, a DNA k-mer is identified on the
left hand side on the reverse complement of the given DNA sequence.
Each value in the start
vector defines the right border
of a search window.
The size of the search window is given by the appropriate value
in the width
vector.
revCountDnaKmers(dna, k ,start, width)
revCountDnaKmers(dna, k ,start, width)
dna |
|
k |
|
start |
|
width |
|
The start positions for identification of DNA k-mers are all positions in {start-width+1,...,start}. In order to prevent counting before the first nucleotide of the DNA sequence, all start values must be >= width + k. The function throws an error when this border is exceeded.
matrix
. Each colum contains the motif-count values for one frame.
Each row represents one DNA motif. The DNA sequence of the DNA motif is
given as row.name.
Wolfgang Kaisers
rseq <- "TATTAT" revCountDnaKmers(rseq, 2,6:4, 2)
rseq <- "TATTAT" revCountDnaKmers(rseq, 2,6:4, 2)
Writes compressed FASTQ files where sequence sections contain concatenated k-mers which are uniformly distributed in the range of k-mers for given k. The function first writes a batch of randomly FASTQ files containing randomly simulated DNA sequence. In a second step the function repeatedly writes FASTQ files with random DNA sequence where a fraction of the reads is 'contaminated' with given DNA k-mers. In a third step, for each set of simulated and contaminated files, a hierarchical cluster (HC) tree based on DNA k-mers is calculated. For each set of files, the size of the smaller fraction in the first half of the tree is counted (perc). The value can be used as measure for separation capability of the HC algorithm.
sim_fq(nRep=2, nContamVec=c(100, 1000), grSize=20, nSeq=1e4, k=6, kIndex=1365, pos=20)
sim_fq(nRep=2, nContamVec=c(100, 1000), grSize=20, nSeq=1e4, k=6, kIndex=1365, pos=20)
nRep |
|
nContamVec |
|
grSize |
|
nSeq |
|
k |
|
kIndex |
|
pos |
|
The function is intended to be used as explorative tool (not for routine quality assessment). There are some files written and there will be a lot of output on the terminal. It is therefore recommended to switch to a separate working directory and to run this function on a separate terminal. The function is not exported.
data.frame
containing results of the counted perc values for
each repetition of the simulation.
Wolfgang Kaisers
kMerIndex("CCCCCC") ## Not run: res <- seqTools:::sim_fq(nRep=2, nContamVec=c(10, 100), grSize=4, nSeq=1e2) ## End(Not run)
kMerIndex("CCCCCC") ## Not run: res <- seqTools:::sim_fq(nRep=2, nContamVec=c(10, 100), grSize=4, nSeq=1e2) ## End(Not run)
For each combination of the parameters k and nSeq, the function writes one FASTQ file and collects the data. The FASTQ files are equally structured: Each read contains 17 randomly selected DNA 6-mers. Therefore the read-length is always 102.
simFastqqRunTimes(k, nSeq, filedir=".")
simFastqqRunTimes(k, nSeq, filedir=".")
k |
|
nSeq |
|
filedir |
|
The FASTQ files contain the parameter settings inside their filename. The files are created with 'writeSimFastq'.
data.frame
.
The data frame has four columns: id, k, nSeq and runtime.
Wolfgang Kaisers
## Not run: res <- simFastqqRunTimes(k=2:9, nSeq=100000) plot(runtime~k,res,type="b") ## End(Not run)
## Not run: res <- simFastqqRunTimes(k=2:9, nSeq=100000) plot(runtime~k,res,type="b") ## End(Not run)
Fastq files sometimes need to be preprocessed before alignment. Three different mechanisms come into use here: Discarding whole reads, trimming sequences and masking nucleotides. This function performs all three mechanisms together in one step. All reads with insufficient phred are discarded. The reads can be trimmed ad each terminal side (on trim of fixed size and a trim based on quality thresholds).
trimFastq(infile, outfile="keep.fq.gz", discard="disc.fq.gz", qualDiscard=0, qualMask=0, fixTrimLeft=0, fixTrimRight=0, qualTrimLeft=0, qualTrimRight=0, qualMaskValue=78, minSeqLen=0)
trimFastq(infile, outfile="keep.fq.gz", discard="disc.fq.gz", qualDiscard=0, qualMask=0, fixTrimLeft=0, fixTrimRight=0, qualTrimLeft=0, qualTrimRight=0, qualMaskValue=78, minSeqLen=0)
infile |
|
outfile |
|
discard |
|
qualDiscard |
|
qualMask |
|
fixTrimLeft |
|
fixTrimRight |
|
qualMaskValue |
|
qualTrimLeft |
|
qualTrimRight |
|
minSeqLen |
|
The function divides the input file into two outputs: The output file (contains the accepted reads) and the discard file (contains the excluded reads). After trim operations, the function checks for remaining read length. When the read length is smaller than minSeqLen, the read will be discarded.
Numeric. A vector of length 2 which contains the number of reads which are written to output and to discard
Wolfgang Kaisers
Ewing B, Green P Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 1998 Vol. 8 No. 3 186-194
basedir <- system.file("extdata", package="seqTools") setwd(basedir) trimFastq("sim.fq.gz", qualDiscard=10, qualMask=15, fixTrimLeft=2, fixTrimRight=2, qualTrimLeft=28, qualTrimRight=30, minSeqLen=5)
basedir <- system.file("extdata", package="seqTools") setwd(basedir) trimFastq("sim.fq.gz", qualDiscard=10, qualMask=15, fixTrimLeft=2, fixTrimRight=2, qualTrimLeft=28, qualTrimRight=30, minSeqLen=5)
The function reads a FASTA file and produces a FASTA index file as output.
writeFai(infiles, outfiles)
writeFai(infiles, outfiles)
infiles |
|
outfiles |
|
None.
Wolfgang Kaisers
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-1771
## Not run: infile <- system.file("extdata", "small.fa", package="seqTools") writeFai(infile, "small.fa.fai") ## End(Not run)
## Not run: infile <- system.file("extdata", "small.fa", package="seqTools") writeFai(infile, "small.fa.fai") ## End(Not run)
Writes compressed FASTQ files where sequence sections contain concatenated k-mers which are uniformly distributed in the range of k-mers for given k. A fraction of the reads can be contaminated with one or more deterministic k-mers.
writeSimContFastq(k=6, nk=5, nSeq=10, pos=1, kIndex=1, nContam=nSeq, filename="simc.fq.gz")
writeSimContFastq(k=6, nk=5, nSeq=10, pos=1, kIndex=1, nContam=nSeq, filename="simc.fq.gz")
k |
|
nk |
|
nSeq |
|
pos |
|
kIndex |
|
nContam |
|
filename |
|
The read headers are consequtive numbered. The phred quality values are equally set to 46 (='.') which represents a phred value of 13. This function is not designed for routine use. The random content FASTQ files can be used in order to measure the separation capabilities of hierarchical clustering mechanisms.
None.
Wolfgang Kaisers
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-1771
## Not run: writeSimContFastq()
## Not run: writeSimContFastq()
Writes compressed FASTQ files where sequence sections contain concatenated k-mers which are uniformly distributed in the range of k-mers for given k.
writeSimFastq(k=6, nk=5, nSeq=10, filename="sim.fq.gz")
writeSimFastq(k=6, nk=5, nSeq=10, filename="sim.fq.gz")
k |
|
nk |
|
nSeq |
|
filename |
|
The read headers are consequtive numbered. The phred quality values are equally set to 46 (='.') which represents a phred value of 13. This function is not designed for routine use. The random content FASTQ files can be used in order to measure the separation capabilities of hierarchical clustering mechanisms.
None.
Wolfgang Kaisers
Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM The sanger FASTQ file format for sequences with quality scores and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 2010 Vol.38 No.6 1767-1771
writeSimFastq()
writeSimFastq()