Title: | Simulation of ChIP-seq experiments |
---|---|
Description: | A general framework for the simulation of ChIP-seq data. Although currently focused on nucleosome positioning the package is designed to support different types of experiments. |
Authors: | Peter Humburg |
Maintainer: | Peter Humburg <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.61.0 |
Built: | 2024-12-19 04:24:01 UTC |
Source: | https://github.com/bioc/ChIPsim |
This package provides a framework for the simulation of ChIP-seq experiments. An implementation of a simulation for nucleosome positioning experiments is provided as part of the package. Simulations for other experiments can be implemented using the provided framework.
Package: | ChIPsim |
Type: | Package |
Version: | 1.3.1 |
Date: | 2010-07-30 |
License: | GPL (>= 2) |
LazyLoad: | yes |
Depends: | Biostrings |
Imports: | IRanges, ShortRead |
Suggests: | actuar, zoo |
Function simChIP
is the main driver of the simulation. To simulate different
types of experiments the functions passed to the functions
argument of simChIP
have to be replaced. See the vignettes for more detail.
Peter Humburg
Maintainer: Peter Humburg <[email protected]>
~~ Literature or other references for background information ~~
ShortRead
and its dependencies
are used to handle short read and genomic sequences.
## See the accompanying vignette 'Introduction to ChIPsim' for a detailed ## example of how to use this package for nucleosome positioning simulations. ## A guide for the writing of extensions that cover other types of ## experiments is provided in 'Extending ChIPsim'.
## See the accompanying vignette 'Introduction to ChIPsim' for a detailed ## example of how to use this package for nucleosome positioning simulations. ## A guide for the writing of extensions that cover other types of ## experiments is provided in 'Extending ChIPsim'.
Given a feature density this function produces two read densities, one for each strand.
bindDens2readDens(bindDens, fragment, nfrag = 1e+05, bind = 147, minLength = 150, maxLength = 180, ...)
bindDens2readDens(bindDens, fragment, nfrag = 1e+05, bind = 147, minLength = 150, maxLength = 180, ...)
bindDens |
Numeric vector with the feature density for one chromosome. |
fragment |
Function giving the fragment length distribution. |
nfrag |
Number of fragments that should be simulated to generate the read distribution. |
bind |
Length of binding site. |
minLength |
Minimum fragment length. |
maxLength |
Maximum fragment length. |
... |
Further arguments to |
A two column matrix. The first column contains the read density for the forward strand, the second column the read density for the reverse strand.
Peter Humburg
set.seed(1) ## generate a (relatively short) sequence of nucleosome features features <- placeFeatures(start=200, length=1e5) ## calculate feature density featureDens <- feat2dens(features, length=1e5) ## convert to read density readDens <- bindDens2readDens(featureDens, fragDens, meanLength=160)
set.seed(1) ## generate a (relatively short) sequence of nucleosome features features <- placeFeatures(start=200, length=1e5) ## calculate feature density featureDens <- feat2dens(features, length=1e5) ## convert to read density readDens <- bindDens2readDens(featureDens, fragDens, meanLength=160)
These functions convert an ASCII encoded sequence of read qualities into a numeric vector of error probabilities and vice versa.
decodeQuality(quality, type = c("Illumina", "Sanger", "Solexa")) encodeQuality(quality, type = c("Illumina", "Sanger", "Solexa"))
decodeQuality(quality, type = c("Illumina", "Sanger", "Solexa")) encodeQuality(quality, type = c("Illumina", "Sanger", "Solexa"))
quality |
For |
type |
Type of encoding to use. |
See extractQuality
for a description of the currently supported encodings.
Either a numeric vector of error probabilities or a character string of encoded read quality scores. Each entry in the vector corresponds to one character in the input.
Peter Humburg
## decodeQuality and encodeQualty are the inverse operations ## of each other as one might expect quality <- "IIIIIIIIIIIIICIIGIIIIGII95III6II-II0" errorProb <- decodeQuality(quality, type="Sanger") qualitySanger <- encodeQuality(errorProb, type="Sanger") all.equal(quality, qualitySanger) ## They can also be used to convert between encodings qualityIllumina <- encodeQuality(errorProb, type="Illumina")
## decodeQuality and encodeQualty are the inverse operations ## of each other as one might expect quality <- "IIIIIIIIIIIIICIIGIIIIGII95III6II-II0" errorProb <- decodeQuality(quality, type="Sanger") qualitySanger <- encodeQuality(errorProb, type="Sanger") all.equal(quality, qualitySanger) ## They can also be used to convert between encodings qualityIllumina <- encodeQuality(errorProb, type="Illumina")
Produces a list of parameters for each of the functions used to carry out the various stages of the simulation.
defaultControl(features = list(), bindDensity = list(), readDensity = list(), readNames = list(), readSequence = list())
defaultControl(features = list(), bindDensity = list(), readDensity = list(), readNames = list(), readSequence = list())
features |
Parameters for feature generation. |
bindDensity |
Parameters for the conversion of feature sequence into binding site densities. |
readDensity |
Parameters for the conversion of binding site densities into read densities. Always provides parameters
|
readNames |
Parameters for the generation of read names. |
readSequence |
Parameters for the conversion of read positions into read sequences. Always provides parameters
|
Any parameters passed as part of list to one of the arguments of defaultControl
will be passed on to
the corresponding function in simChIP
. The build-in defaults can be overwritten by providing a
list entry with the same name.
List of parameters for use as the control
argument to simChIP
.
Peter Humburg
defaultControl() defaultControl(features=list(maxTail=0), readSequence=list(readLen=50))
defaultControl() defaultControl(features=list(maxTail=0), readSequence=list(readLen=50))
For each nucleotide this function provides probabilities indicating how likely it is to be replaced by any of the other nucleotides should a sequencing error occur.
defaultErrorProb()
defaultErrorProb()
The probabilities used here are the ones determined by Dohm et al. for Beta vulgaris. They should be more appropriate than a uniform distribution but the use of species specific rates is recommended where available.
A list of four vectors with replacement probabilities for each nucleotide.
Peter Humburg
Juliane C. Dohm, Claudio Lottaz, Tatiana Borodina, and Heinz Himmelbauer. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucl. Acids Res., pages gkn425+, July 2008.
defaultErrorProb()
defaultErrorProb()
Provides default functions to carry out the different stages of the ChIP-seq simulation.
defaultFunctions()
defaultFunctions()
A list with components
features |
|
bindDensity |
|
readDensity |
|
sampleReads |
|
readSequence |
|
readNames |
Peter Humburg
defaultFunctions()
defaultFunctions()
Functions to generate defaults for makeFeatures
.
defaultGenerator() defaultTransition() defaultInit(prob=c(0.2, 0.05, 0, 0.25, 0.5), states=c("ReversePhasedFeature", "StableFeature", "PhasedFeature", "NFRFeature", "FuzzyFeature")) defaultLastFeat(isEnd = c(FALSE, rep(TRUE, 4)), states = c("ReversePhasedFeature", "StableFeature", "PhasedFeature", "NFRFeature", "FuzzyFeature"))
defaultGenerator() defaultTransition() defaultInit(prob=c(0.2, 0.05, 0, 0.25, 0.5), states=c("ReversePhasedFeature", "StableFeature", "PhasedFeature", "NFRFeature", "FuzzyFeature")) defaultLastFeat(isEnd = c(FALSE, rep(TRUE, 4)), states = c("ReversePhasedFeature", "StableFeature", "PhasedFeature", "NFRFeature", "FuzzyFeature"))
prob |
Numeric vector giving the initial state distribution. This will be normalised if the probabilities do not add up to 1. |
isEnd |
Logical vector indicating which states, i.e. features, are allowed to be last in the sequence. |
states |
Character vector of state names. |
These functions generate data structures that can be passed as arguments to makeFeatures
.
Using this set of functions will create a nucleosome positioning simulation. Some of the defaults
can be modified by passing different values to defaultInit
and defaultLastFeat
.
Return values are suitable as arguments generator
, transition
, init
and lastFeat
of
makeFeatures
. See the documentation of makeFeatures
for more detail.
Peter Humburg
set.seed(1) ## generate defaults generator <- defaultGenerator() transition <- defaultTransition() lastFeat <- defaultLastFeat() ## change the initial state distribution such that it ## always starts with a fuzzy feature init <- defaultInit(c(0, 0, 0, 0, 1)) ## now generate some features for a stretch of 1 million base pairs features <- makeFeatures(generator=generator, transition=transition, lastFeat=lastFeat, init=init, length=1e6)
set.seed(1) ## generate defaults generator <- defaultGenerator() transition <- defaultTransition() lastFeat <- defaultLastFeat() ## change the initial state distribution such that it ## always starts with a fuzzy feature init <- defaultInit(c(0, 0, 0, 0, 1)) ## now generate some features for a stretch of 1 million base pairs features <- makeFeatures(generator=generator, transition=transition, lastFeat=lastFeat, init=init, length=1e6)
Converts the read qualities encoded in fastq formatted files into error probabilities.
extractQuality(reads, minLength = 25, dir, type = c("Illumina", "Sanger", "Solexa"))
extractQuality(reads, minLength = 25, dir, type = c("Illumina", "Sanger", "Solexa"))
reads |
Either the name of a fastq file or a |
minLength |
Minimum read length required. |
dir |
Directory of fastq file. |
type |
Character string indicating the format the qualities are encoded in (see Details). |
If reads
and dir
are character strings it is assumed that ‘dir/reads’ is the name
of a fastq file. Otherwise reads
should be a ShortReadQ
object in which
case dir
is ignored.
Currently three different encodings of read qualities are supported. The encoding has to be selected via the
type
argument. The supported formats are
The format currently used by Illumina (version 1.3). This is a phred score between 0 and 40 encoded as ASCII characters 64 to 104. [default]
The Sanger format uses a phred quality score between 0 and 93 encoded as ASCII characters 33 to 126.
The old Solexa format previously used by Solexa/Illumina uses a quality score between -5 and 40 encoded as ASCII characters 59 to 104.
A list with a vector of error probabilities for each read in reads
that is at least minLength
nucleotides long.
Peter Humburg
decodeQuality
, readQualitySample
## Not run: ## load reads from a fastq file with Sanger encoding qualities <- extractQuality("test.fastq", dir=".", type="Sanger") ## extract error probabilities for first 25bp of each read qualities25 <- sapply(qualities, "[", 1:25) ## plot average quality for each position plot(rowMeans(qualities25), type='b', xlab="Read position", ylab="Error probability") ## End(Not run)
## Not run: ## load reads from a fastq file with Sanger encoding qualities <- extractQuality("test.fastq", dir=".", type="Sanger") ## extract error probabilities for first 25bp of each read qualities25 <- sapply(qualities, "[", 1:25) ## plot average quality for each position plot(rowMeans(qualities25), type='b', xlab="Read position", ylab="Error probability") ## End(Not run)
Given a list of features (as produced by makeFeatures
) computes the feature density for each
and combines them into a chromosome wide density.
feat2dens(features, length, featureBgr = TRUE, ...)
feat2dens(features, length, featureBgr = TRUE, ...)
features |
A list of features. |
length |
Total length of feature density vector (i.e. chromosome length). If this is missing the length is inferred from the feature parameters. |
featureBgr |
Logical indicating whether feature specific background should be added to the density. If this is |
... |
Further arguments to |
A vector with the feature density for each position along the chromosome.
Peter Humburg
The majority of the work is done by calls to featureDensity
and joinRegion
.
set.seed(1) ## generate a (relatively short) sequence of nucleosome features features <- placeFeatures(start=200, length=1e5) ## calculate density featureDens <- feat2dens(features, length=1e5)
set.seed(1) ## generate a (relatively short) sequence of nucleosome features features <- placeFeatures(start=200, length=1e5) ## calculate density featureDens <- feat2dens(features, length=1e5)
This set of functions is used to generate the density of individual features of different types.
featureDensity
is an S3 generic, functions may be defined for different feature classes.
featureDensity(x, ...) ## S3 method for class 'StableFeature' featureDensity(x, stable=stableDens, background=FALSE, ...) ## S3 method for class 'StablePhasedFeature' featureDensity(x, stable=stableDens, dist=distDens, background=FALSE, ...) ## S3 method for class 'ReversePhasedFeature' featureDensity(x, stable=stableDens, dist=distDens, background=FALSE, ...) ## S3 method for class 'NFRFeature' featureDensity(x, background=FALSE, ...) ## S3 method for class 'FuzzyFeature' featureDensity(x, ...)
featureDensity(x, ...) ## S3 method for class 'StableFeature' featureDensity(x, stable=stableDens, background=FALSE, ...) ## S3 method for class 'StablePhasedFeature' featureDensity(x, stable=stableDens, dist=distDens, background=FALSE, ...) ## S3 method for class 'ReversePhasedFeature' featureDensity(x, stable=stableDens, dist=distDens, background=FALSE, ...) ## S3 method for class 'NFRFeature' featureDensity(x, background=FALSE, ...) ## S3 method for class 'FuzzyFeature' featureDensity(x, ...)
x |
The feature for which the density should be computed. |
stable |
Function that should be used to compute the density of a stable feature. |
dist |
Function that should be used to compute the distribution of distances between adjacent features. |
background |
Logical indicating whether uniform background should be added to the feature. |
... |
Arguments to future functions. |
These functions are used internally by feat2dens
. There should be no need to call them
directly but it is important to supply suitable featureDensity
functions for new feature types.
A two column matrix. The first column contains the density, the second the weight at each position.
Peter Humburg
## Create a single StableFeature feature <- stableFeature(start = 200, weight = 0.8, shift = 10, stability = 1, ratio = 1) ## Convert the feature into a density (without background) featDens <- featureDensity(feature) ## If we want featureDensity to automatically add uniform background ## we have to ensure that the feature has a 'meanDist' parameter ## (this is usually added by 'reconcileFeatures'). feature$meanDist <- 200 featDens2 <- featureDensity(feature, background = TRUE)
## Create a single StableFeature feature <- stableFeature(start = 200, weight = 0.8, shift = 10, stability = 1, ratio = 1) ## Convert the feature into a density (without background) featDens <- featureDensity(feature) ## If we want featureDensity to automatically add uniform background ## we have to ensure that the feature has a 'meanDist' parameter ## (this is usually added by 'reconcileFeatures'). feature$meanDist <- 200 featDens2 <- featureDensity(feature, background = TRUE)
These functions are used to generate the parameters for different nucleosome positioning related features.
stableFeature(start, minDist = 175, weight = seq(0.1, 1, by = 0.1), shift = c(0, 5, 10), ratio = seq(0, 4, by = 0.25), stability = seq(0.1, 5, by = 0.1), weightProb, shiftProb, ratioProb, stabilityProb, ...) phasedFeature(minDist = 175, length = seq(1000, 10000, by = minDist), meanDist = minDist:300, lengthProb, meanDistProb, start, ...) fuzzyFeature(start, length = seq(1000, 10000, by = 1000), meanDist = 175:400, lengthProb, meanDistProb, ...) nfrFeature(start, length = seq(50, 500, by = 10), weight = seq(0.1, 1, by = 0.1), lengthProb, weightProb, ...)
stableFeature(start, minDist = 175, weight = seq(0.1, 1, by = 0.1), shift = c(0, 5, 10), ratio = seq(0, 4, by = 0.25), stability = seq(0.1, 5, by = 0.1), weightProb, shiftProb, ratioProb, stabilityProb, ...) phasedFeature(minDist = 175, length = seq(1000, 10000, by = minDist), meanDist = minDist:300, lengthProb, meanDistProb, start, ...) fuzzyFeature(start, length = seq(1000, 10000, by = 1000), meanDist = 175:400, lengthProb, meanDistProb, ...) nfrFeature(start, length = seq(50, 500, by = 10), weight = seq(0.1, 1, by = 0.1), lengthProb, weightProb, ...)
start |
Start location of feature on chromosome. |
minDist |
Minimum distance between nucleosomes. |
length |
A numeric vector giving possible values for the length of the feature. |
meanDist |
A numeric vector giving possible values for the mean distance between nucleosomes. |
weight |
A numeric vector giving possible values for the weight of the feature. |
shift |
A numeric vector giving possible values for the distance between favoured positions of stable nucleosomes. |
ratio |
A numeric vector giving possible values for the ratio between probabilities for alternative and central position of stable nucleosomes. |
stability |
A numeric vector giving possible values for the stability of stable nucleosomes. |
lengthProb |
Length distribution of feature. This corresponds to the state duration distribution of the underlying generating model. |
meanDistProb |
Distribution of mean distances between nucleosomes. |
weightProb |
Distribution of feature weights. |
shiftProb |
Distribution of distances between favoured positions of stable nucleosome. |
ratioProb |
Ratio distribution. |
stabilityProb |
Stability distribution. |
... |
Further arguments (currently ignored). |
A list of parameters for the corresponding feature type. These parameters are later used to compute nucleosome densities.
Peter Humburg
feature <- stableFeature(200)
feature <- stableFeature(200)
This function generates a list of genomic features for a single chromosome based on a Markov model.
makeFeatures(generator = defaultGenerator(), transition = defaultTransition(), init = defaultInit(), start = 1000, length, control = list(), globals = list(minDist = 175), lastFeat = defaultLastFeat(), experimentType = "NucleosomePosition", truncate = FALSE, maxTries = 10, force=FALSE)
makeFeatures(generator = defaultGenerator(), transition = defaultTransition(), init = defaultInit(), start = 1000, length, control = list(), globals = list(minDist = 175), lastFeat = defaultLastFeat(), experimentType = "NucleosomePosition", truncate = FALSE, maxTries = 10, force=FALSE)
generator |
A named list providing functions to generate the parameters associated with each type of feature. The name of each element in the list is the name of the state the function should be associated with. |
transition |
Named list of transition probabilities. Each element is a (named) numeric vector giving the transition probabilities for the state indicated by the element's name, i.e., each element of the list is a row of the transition probability matrix but zero probabilities can be omitted. |
init |
Named numeric vector of initial state probabilities. The names have to correspond to state names of the model. Zero probabilities may be omitted. |
start |
Numeric value indicating the position at which the first feature should be placed. |
length |
Maximum length of DNA that should be covered with features. |
control |
Named list with additional arguments to generator functions (one list per generator). Again the names should be the same as the state names. |
globals |
List of global arguments to be passed to all generator functions. |
lastFeat |
Named logical vector indicating for each feature type whether it can be the last feature. |
experimentType |
Type of experiment the simulated features belong to. This is used as the class of the return value. |
truncate |
Logical value indicating whether the final feature should be truncated to ensure that total length does
not exceed |
maxTries |
Maximum number of attempts made to generate a valid sequence of features. If no valid sequence is
generated during the first |
force |
Logical indicating whether the function should be forced to return a feature sequence, even if no
valid sequence was found. If this is |
This function will generate features from any first order Markov model specified by init
, transition
and generator
. If force
is FALSE
the returned feature sequence is guaranteed to contain at least
one feature and end in a state that is indicated as possible end state in lastFeat
. Note that the states for
which lastFeat
is TRUE
are not end states in the sense that the chain is terminated once the state is
entered or that the chain remains in the state once it is first entered. Instead this is a mechanism to ensure that
some states cannot be the last in the sequence.
Due to the constrains on the total length of DNA covered by features as well as the possible constraint on the final
feature of the sequence it is possible to specify models that cannot produce a legal sequence. In other cases it may
be possible but unlikely to produce a feature sequence that satisfies both constraints. A serious attempt is made to
satisfy both requirement, generating a new feature sequence or truncating an existing one if necessary. To ensure that
this terminates eventually the number of attempts to generate a valid sequence are limited to maxTries
.
In some cases it may be desirable to carry out some post-processing of the feature sequence to ensure that parameters
of neighbouring features are compatible in some sense. For the default nucleosome positioning simulation the
function reconcileFeatures
provides this functionality and placeFeatures
is an interface
to makeFeatures
that utilises reconcileFeatures
.
A list of features (with class determined by experimentType
).
Each feature is represented by a list of parameters and has a class with the same name as the
state that generated the feature. In addition all features are of class SimulatedFeature
.
Peter Humburg
Functions to generate default values for some of the arguments:
defaultGenerator
, defaultInit
, defaultTransition
, defaultLastFeat
.
Use feat2dens
to convert a feature sequence into feature densities.
placeFeatures
is an interface to makeFeature
for nucleosome positioning.
set.seed(1) ## generate a (relatively short) sequence of nucleosome features features <- makeFeatures(length=1e6) ## check the total length of the features sum(sapply(features, "[[", "length")) ## 995020
set.seed(1) ## generate a (relatively short) sequence of nucleosome features features <- makeFeatures(length=1e6) ## check the total length of the features sum(sapply(features, "[[", "length")) ## 995020
This function provides an interface to makeFeatures
and reconcileFeatures
that combines
both steps of the feature generation process.
placeFeatures(..., maxTail = 0.01, compoundFeatures=list("StablePhasedFeature"))
placeFeatures(..., maxTail = 0.01, compoundFeatures=list("StablePhasedFeature"))
... |
Arguments to |
maxTail |
Maximum portion of total length of chromosome that may be left free of features (see Details). |
compoundFeatures |
List of feature classes that are produced by combining two features. This may happen during the call
to |
This function (as well as makeFeatures
which it calls) tries to fill as much of the genomic region
with features as possible, i.e. an attempt is made to produce a feature sequence that covers length
base pairs. In most cases the sequence will be slightly shorter. The maxTail
argument determines how long
a region without any features at the end of the genomic region is acceptable (as fraction of the total length).
Note however that even maxTail = 0
does not guarantee a feature sequence of exactly the requested length.
A list of simulated features. The class of the return value as well as the features generated depend on
the arguments passed to makeFeatures
.
Using the reconcileFeatures
mechanism it is possible to introduce dependence between neighbouring features
that is not easily expressed in terms of a simple Markov model. In some cases the same effect can be achieved by
introducing additional states into the model but it may be more convenient to simply post-process the feature sequence.
Peter Humburg
makeFeatures
, reconcileFeatures
set.seed(1) ## generate a (relatively short) sequence of nucleosome features features <- placeFeatures(length=1e6, maxTail = 0) ## check the total length of the features sum(sapply(features, "[[", "length")) ## 990509
set.seed(1) ## generate a (relatively short) sequence of nucleosome features features <- placeFeatures(length=1e6, maxTail = 0) ## check the total length of the features sum(sapply(features, "[[", "length")) ## 990509
Convert read positions for a single chromosome (both strands) into read sequences + qualities and write them to file
pos2fastq(readPos, names, quality, sequence, qualityFun, errorFun, readLen = 36, file, qualityType = c("Illumina", "Sanger", "Solexa"), ...)
pos2fastq(readPos, names, quality, sequence, qualityFun, errorFun, readLen = 36, file, qualityType = c("Illumina", "Sanger", "Solexa"), ...)
readPos |
A list of two numeric vectors (one per strand) |
names |
List of names to use for reads in fastq file. Has to be of same shape as |
quality |
Passed on as argument to |
sequence |
Reference sequence (a |
qualityFun |
Function to generate quality scores. |
errorFun |
Function to introduce sequencing errors. |
readLen |
Read length to generate. |
file |
Output file (either file name or connection). |
qualityType |
Encoding to use for read quality scores. |
... |
Further arguments (see Details). |
Arguments passed as part of ... will be passed on to qualityFun
, except an argument called prob
which is
passed on to errorFun instead if present.
Invisibly returns the number of records that were written.
Peter Humburg
See readError
for a possible choice of errorFun
and readQualitySample
for a simple
qualityFun
.
set.seed(1) ## a function to generate random read qualities (in Sanger format) randomQuality <- function(read, ...){ paste(sample(unlist(strsplit(rawToChar(as.raw(33:126)),"")), length(read), replace = TRUE), collapse="") } ## generate a reference sequence chromosome <- DNAString(paste(sample(c("A", "C", "G", "T"), 1e5, replace = TRUE), collapse = "")) ## and a few read positions reads <- list(sample(100:9900, 5), sample(100:9900, 5)) names <- list(paste("read", 1:5, sep="_"), paste("read", 6:10, sep="_")) ## convert to fastq format pos2fastq(reads, names, sequence=chromosome, qualityFun=randomQuality, errorFun=readError, file="")
set.seed(1) ## a function to generate random read qualities (in Sanger format) randomQuality <- function(read, ...){ paste(sample(unlist(strsplit(rawToChar(as.raw(33:126)),"")), length(read), replace = TRUE), collapse="") } ## generate a reference sequence chromosome <- DNAString(paste(sample(c("A", "C", "G", "T"), 1e5, replace = TRUE), collapse = "")) ## and a few read positions reads <- list(sample(100:9900, 5), sample(100:9900, 5)) names <- list(paste("read", 1:5, sep="_"), paste("read", 6:10, sep="_")) ## convert to fastq format pos2fastq(reads, names, sequence=chromosome, qualityFun=randomQuality, errorFun=readError, file="")
Given a read sequence and quality this function introduces errors by first choosing positions
that should be modified based on the quality score and then exchanges nucleotides based on the
probabilities given in prob
.
readError(read, qual, alphabet = c("A", "C", "G", "T"), prob = defaultErrorProb(), ...)
readError(read, qual, alphabet = c("A", "C", "G", "T"), prob = defaultErrorProb(), ...)
read |
A character string representing a read sequence. |
qual |
Numeric vector of read error probabilities. |
alphabet |
Alphabet used for read sequence. |
prob |
Nucleotide exchange probabilities. |
... |
Further arguments (currently ignored). |
If the read sequence contains letters that are not part of alphabet
they are replaced
by the first entry of alphabet
before positions of sequencing errors are determined.
The alphabet used has to match the names used in prob
.
The modified read sequence.
Peter Humburg
defaultErrorProb
, readSequence
set.seed(42) ## generate sequence read and quality quality <- paste(sample(unlist(strsplit(rawToChar(as.raw(33:126)),"")), 36, replace = TRUE), collapse="") errorProb <- decodeQuality(quality, type = "Sanger") read <- paste(sample(c("A", "C", "G", "T"), 36, replace = TRUE), collapse = "") ## use readError to introduce sequencing errors read2 <- readError(read, errorProb) all.equal(read, read2) ## "1 string mismatch"
set.seed(42) ## generate sequence read and quality quality <- paste(sample(unlist(strsplit(rawToChar(as.raw(33:126)),"")), 36, replace = TRUE), collapse="") errorProb <- decodeQuality(quality, type = "Sanger") read <- paste(sample(c("A", "C", "G", "T"), 36, replace = TRUE), collapse = "") ## use readError to introduce sequencing errors read2 <- readError(read, errorProb) all.equal(read, read2) ## "1 string mismatch"
Given a read sequence and a list of read quality scores this function returns a (possibly truncated) quality score of the same length as the read.
readQualitySample(read, qualities, checkLength = TRUE, ...)
readQualitySample(read, qualities, checkLength = TRUE, ...)
read |
A sequence read. |
qualities |
List of sequence read quality scores. |
checkLength |
Flag indicating whether the length of quality scores should be checked to ensure that they are at least as long
as the read. If |
... |
Further arguments, currently not used. |
Using checkLength = TRUE
leads to a substantial decrease in performance and is impractical for a large simulation.
To avoid this slow down it is recommended to remove short sequences from qualities
beforehand so that
checkLength = FALSE
can be used.
An read quality score string of the same length as read
.
Peter Humburg
Given a read position, a reference sequence, a strand and a read length this function returns the read sequence.
readSequence(readPos, sequence, strand, readLen = 36)
readSequence(readPos, sequence, strand, readLen = 36)
readPos |
Numeric value indicating the start position on the chromosome. |
sequence |
Chromosome sequence (a |
strand |
Strand indicator (+1 / -1) |
readLen |
Length of read. |
Read sequence.
Peter Humburg
The reconcileFeatures
functions provide a facility to post-process a list of features representing
a simulated experiment. reconcileFeatures
is an S3 generic, new functions can be added for additional
types of experiment. The current default is to call reconcileFeatures.SimulatedExperiment
which, if
called without further arguments, will simply return the feature list unchanged.
reconcileFeatures(features, ...) ## Default S3 method: reconcileFeatures(features, ...) ## S3 method for class 'SimulatedExperiment' reconcileFeatures(features, defaultValues=list(), ...) ## S3 method for class 'NucleosomePosition' reconcileFeatures(features, defaultMeanDist = 200, ...)
reconcileFeatures(features, ...) ## Default S3 method: reconcileFeatures(features, ...) ## S3 method for class 'SimulatedExperiment' reconcileFeatures(features, defaultValues=list(), ...) ## S3 method for class 'NucleosomePosition' reconcileFeatures(features, defaultMeanDist = 200, ...)
features |
List of simulated features. |
defaultValues |
Named list of default parameter values. The method for class |
defaultMeanDist |
Default value for the average distance between nucleosomes for nucleosome positioning experiments. |
... |
Further arguments to future functions. |
A list of features of the same class as features
.
Peter Humburg
set.seed(1) ## generate a (relatively short) sequence of nucleosome features features <- makeFeatures(length=1e6, ) ## check the total length of the features sum(sapply(features, "[[", "length")) ## 995020 ## reconcile features to ensure smooth transitions ## For experiments of class NucleosomePosition this ## also combines some features and introduces ## some overlap between them. features <- reconcileFeatures(features) ## check the total length of the features again sum(sapply(features, "[[", "length")) ## 984170
set.seed(1) ## generate a (relatively short) sequence of nucleosome features features <- makeFeatures(length=1e6, ) ## check the total length of the features sum(sapply(features, "[[", "length")) ## 995020 ## reconcile features to ensure smooth transitions ## For experiments of class NucleosomePosition this ## also combines some features and introduces ## some overlap between them. features <- reconcileFeatures(features) ## check the total length of the features again sum(sapply(features, "[[", "length")) ## 984170
Given a read density this function returns the starting positions of sequence reads.
sampleReads(readDens, nreads = 6e+06, strandProb = c(0.5, 0.5))
sampleReads(readDens, nreads = 6e+06, strandProb = c(0.5, 0.5))
readDens |
A two column matrix of read densities (as produced by |
nreads |
Number of read positions to generate. |
strandProb |
A numeric vector with two elements giving weights for forward and reverse strand. |
The expected number of reads for each strand is strandProb * nreads
.
A list with components fwd
and rev
giving the read positions on the
forward and reverse strand respectively.
Peter Humburg
set.seed(1) ## generate a (relatively short) sequence of nucleosome features features <- placeFeatures(start=200, length=1e5) ## calculate feature density featureDens <- feat2dens(features, length=1e5) ## convert to read density readDens <- bindDens2readDens(featureDens, fragDens, meanLength=160) ## sample reads ## of course you would usually want a much larger number readPos <- sampleReads(readDens, nreads=1000)
set.seed(1) ## generate a (relatively short) sequence of nucleosome features features <- placeFeatures(start=200, length=1e5) ## calculate feature density featureDens <- feat2dens(features, length=1e5) ## convert to read density readDens <- bindDens2readDens(featureDens, fragDens, meanLength=160) ## sample reads ## of course you would usually want a much larger number readPos <- sampleReads(readDens, nreads=1000)
This function acts as driver for the simulation. It takes all required arguments and passes them on to the functions for the different stages of the simulation. The current defaults will simulate a nucleosome positioning experiment.
simChIP(nreads, genome, file, functions = defaultFunctions(), control = defaultControl(), verbose = TRUE, load = FALSE)
simChIP(nreads, genome, file, functions = defaultFunctions(), control = defaultControl(), verbose = TRUE, load = FALSE)
nreads |
Number of reads to generate. |
genome |
An object of class 'DNAStringSet' or the name of a fasta file containing the genome sequence. |
file |
Base of output file names (see Details). |
functions |
Named list of functions to use for various stages of the simulation, expected names are: ‘features’, ‘bindDens’, ‘readDens’, ‘sampleReads’, ‘readNames’, ‘readSequence’ |
control |
Named list of arguments to be passed to simulation functions (one list per function). |
verbose |
Logical indicating whether progress messages should be printed. |
load |
Logical indicating whether an attempt should be made to load intermediate results from a previous run. |
The simulation consists of six of stages:
generate feature sequence (for each chromosome): chromosome length -> feature sequence (list)
compute binding site density: feature sequence -> binding site density (vector)
compute read density: binding site density -> read density (two column matrix, one column for each strand)
sample read start sites: read density -> read positions (list)
create read names: number of reads -> unique names
obtain read sequence and quality: read positions, genome sequence, [qualities] -> output file
After each of the first three stages the results of the stage are written to a file and can be reused later.
File names are created by appending ‘_features.rdata
’, ‘_bindDensity.rdata
’ and
‘_readDensity.rdata
’ to file
respectively. Previous results will be loaded for reuse if
load
is TRUE
and files with matching names are found. This is useful to sample repeatedly
from the same read density or to recover partial results from an interrupted run.
The creation of files can be prevented by setting file =
“”. In this case all results will be
returned in a list at the end. Note that this will require more memory since all intermediate results have
to be held until the end.
The behaviour of the simulation is mainly controlled through the functions
and control
arguments.
They are expected to be lists of the same length with matching names. The names indicate the stage of the simulation
for which the function should be used; elements of control
will be used as arguments for the corresponding
functions.
A list. The components are typically either lists (with one component per chromosome) or file names
but note that this may depend on the return value of functions listed in functions
.
The components of the returned list are:
features |
Either a list of generated features or the name of a file containing that list; |
bindDensity |
Either a list with binding site densities or the name of a file containing that list; |
readDensity |
Either a list of read densities or the name of a file containing that list; |
readPosition |
Either a list of read start sites or the name of a file containing that list; |
readSequence |
The return value of the function listed as ‘ |
readNames |
Either a list of read names or the name of a file containing that list. |
Peter Humburg
defaultFunctions
, defaultControl
## Not run: ## To run the default nucleosome positioning simulation ## we can simply run something like the line below. ## This will result in 10 million reads sampled from the genome. ## Of course the file names have to be changed as appropriate. simChIP(1e7, genome = "reference.fasta", file = "output/sim_10M") ## End(Not run)
## Not run: ## To run the default nucleosome positioning simulation ## we can simply run something like the line below. ## This will result in 10 million reads sampled from the genome. ## Of course the file names have to be changed as appropriate. simChIP(1e7, genome = "reference.fasta", file = "output/sim_10M") ## End(Not run)
Generates a set of unique (and very simple) read names.
simpleNames(n, nameBase = "read")
simpleNames(n, nameBase = "read")
n |
Number of names to generate. |
nameBase |
Base name to use. |
A character vector with entries of the form ‘nameBase_i’ where i runs from 1 to n
.
Peter Humburg
simpleNames(5)
simpleNames(5)
This is intended to produce the final output of the simulation by providing a fastq file that may then be used in an analysis pipeline.
writeFASTQ(read, quality, name, file, append = FALSE)
writeFASTQ(read, quality, name, file, append = FALSE)
read |
List of read sequences. |
quality |
List of read quality scores. |
name |
Read names. |
file |
File name. If this is “” results will be printed to the standard output connection. |
append |
Logical indicating the reads should be appended to an existing file. |
The first three arguments should have the same length but will be recycled if necessary.
Called for its side effect.
Peter Humburg
readSequence
, readQualitySample
, writeReads
set.seed(1) ## generate sequence read and quality quality <- paste(sample(unlist(strsplit(rawToChar(as.raw(33:126)),"")), 36, replace = TRUE), collapse="") read <- paste(sample(c("A", "C", "G", "T"), 36, replace = TRUE), collapse = "") ## write a fastq record writeFASTQ(read, quality, "read_1", file="")
set.seed(1) ## generate sequence read and quality quality <- paste(sample(unlist(strsplit(rawToChar(as.raw(33:126)),"")), 36, replace = TRUE), collapse="") read <- paste(sample(c("A", "C", "G", "T"), 36, replace = TRUE), collapse = "") ## write a fastq record writeFASTQ(read, quality, "read_1", file="")
This is an interface to pos2fastq
that writes all reads for a given genome to a single file.
writeReads(readPos, readNames, sequence, quality, file, ...)
writeReads(readPos, readNames, sequence, quality, file, ...)
readPos |
List of read positions with each component holding the read positions for one chromosome, which are themselves two component lists that provide forward and reverse strand positions. |
readNames |
List of the same shape as |
sequence |
Genome reference sequence (a |
quality |
Read quality scores (see Details). |
file |
Output file. |
... |
Further arguments to |
If quality
looks like it might refer to a fastq file an attempt is made to create a
ShortReadQ
object. The read qualities of any
ShortReadQ
object passed as quality
(directly or indirectly as file name) are extracted and passed on to pos2fastq
as quality
argument. Otherwise it is passed on unchanged.
The name of the output file.
Peter Humburg