Title: | Identifying Protein Binding Sites in High-Throughput Sequencing Data |
---|---|
Description: | ChIPseqR identifies protein binding sites from ChIP-seq and nucleosome positioning experiments. The model used to describe binding events was developed to locate nucleosomes but should flexible enough to handle other types of experiments as well. |
Authors: | Peter Humburg |
Maintainer: | Peter Humburg <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.61.0 |
Built: | 2024-12-27 04:45:02 UTC |
Source: | https://github.com/bioc/ChIPseqR |
ChIPseqR provides a set of functions for the analysis of ChIP-seq data. Protein binding sites are located by identifying a characteristic pattern of peaks in read counts on both DNA strands.
Package: | ChIPseqR |
Type: | Package |
Version: | 1.13.1 |
Date: | 2012-12-11 |
License: | GPL (>=2) |
LazyLoad: | yes |
The easiest way to obtain binding site predictions for nucleosomes is to use simpleNucCall
. This provides
a simple interface to callBindingSites
. This function operates on AlignedRead
objects and provides useful defaults for nucleosome analysis. Parameters can be adjusted to detect the presence of other
DNA binding proteins, e.g. transcription factors. If more fine control is desired the following steps will produce binding site
predictions:
strandPileup
:Turn mapped reads into read counts along the genome.
startScore
:Score potential binding sites.
getCutoff
:Determine cutoff required to achieve desired false discovery rate.
pickPeak
:Find all peaks in the binding site score that exceed the significance
threshold determined by getCutoff
. These are the predicted binding sites.
Peter Humburg
Maintainer: Peter Humburg <[email protected]>
Humburg, P., Helliwell, C., Bulger, D. & Stone, G. ChIPseqR: Analysis of ChIP-seq Experiments. BMC Bioinformatics 12, 39+ (2011).
## See 'simpleNucCall' for examples of how to obtain nucleosome predictions.
## See 'simpleNucCall' for examples of how to obtain nucleosome predictions.
Accessor functions for S4 classes in package "ChIPseqR"
.
## S4 method for signature 'ANY' binding(x, ...) ## S4 method for signature 'ANY' cutoff(x, ...) ## S4 replacement method for signature 'ANY' cutoff(x, ...) <- value ## S4 method for signature 'ANY' nullDist(x, ...) ## S4 replacement method for signature 'ANY' nullDist(x, ...) <- value ## S4 method for signature 'ANY' peaks(x, ...) ## S4 method for signature 'ANY' pvalue(x, ...) ## S4 method for signature 'ANY' support(x, ...)
## S4 method for signature 'ANY' binding(x, ...) ## S4 method for signature 'ANY' cutoff(x, ...) ## S4 replacement method for signature 'ANY' cutoff(x, ...) <- value ## S4 method for signature 'ANY' nullDist(x, ...) ## S4 replacement method for signature 'ANY' nullDist(x, ...) <- value ## S4 method for signature 'ANY' peaks(x, ...) ## S4 method for signature 'ANY' pvalue(x, ...) ## S4 method for signature 'ANY' support(x, ...)
x |
An S4 object. |
... |
Further arguments, ignored by default method. |
value |
New value for slot. |
These methods allow safe read (and in some cases write) access to slots of S4 classes and should be used for this purpose rather than modifying slots manually.
The current value of the interrogated slot.
Default method for accessor function.
This page documents the generics and their default behaviour. See the help page of each class for class specific implementations.
Peter Humburg
Creates a set of (strand specific) read counts centred at the genomic features provided.
alignFeature(data, anno, offset = 1000)
alignFeature(data, anno, offset = 1000)
data |
List with read counts as returned by |
anno |
Data frame with annotation data in GFF format. |
offset |
Half width of window around start point of annotated features. |
List with one component for each feature in anno
.
Peter Humburg
The GFF file format specification: http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
set.seed(1) ## determine binding site locations b <- sample(1:8.5e5, 500) ## sample read locations fwd <- unlist(lapply(b, function(x) sample((x-83):(x-73), 20, replace=TRUE))) rev <- unlist(lapply(b, function(x) sample((x+73):(x+83), 20, replace=TRUE))) ## add some background noise fwd <- c(fwd, sample(1:(1e6-25), 5000)) rev <- c(rev, sample(25:1e6, 5000)) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(15000, 15000)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=1e6, extend=1, plot=FALSE) ## convert binding site locations into GFF format gff <- data.frame(chromosome="chr1", source="test", feature="binding", start=b-73, end=b+73, score=".", strand=".", frame=".") ## align read counts relative to binding site location aligned <- alignFeature(readPile, gff, offset=500)
set.seed(1) ## determine binding site locations b <- sample(1:8.5e5, 500) ## sample read locations fwd <- unlist(lapply(b, function(x) sample((x-83):(x-73), 20, replace=TRUE))) rev <- unlist(lapply(b, function(x) sample((x+73):(x+83), 20, replace=TRUE))) ## add some background noise fwd <- c(fwd, sample(1:(1e6-25), 5000)) rev <- c(rev, sample(25:1e6, 5000)) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(15000, 15000)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=1e6, extend=1, plot=FALSE) ## convert binding site locations into GFF format gff <- data.frame(chromosome="chr1", source="test", feature="binding", start=b-73, end=b+73, score=".", strand=".", frame=".") ## align read counts relative to binding site location aligned <- alignFeature(readPile, gff, offset=500)
This class provides the infrastructure to store results of ChIP-seq analysis.
## S4 method for signature 'BindScore' binding(x) ## S4 method for signature 'BindScore' chrLength(x, subset) ## S4 method for signature 'BindScore' cutoff(x, type=c("score", "pvalue")) ## S4 replacement method for signature 'BindScore' cutoff(x, type=c("score", "pvalue")) <- value ## S4 method for signature 'BindScore' head(x, n=6, by=c("score", "position"), ...) ## S4 method for signature 'BindScore' lapply(X, FUN, ...) ## S4 method for signature 'BindScore' length(x) ## S4 replacement method for signature 'BindScore' length(x) <- value ## S4 method for signature 'BindScore' max(x, ..., na.rm=TRUE) ## S4 method for signature 'BindScore' min(x, ..., na.rm=TRUE) ## S4 method for signature 'BindScore' names(x) ## S4 replacement method for signature 'BindScore,ANY' names(x) <- value ## S4 method for signature 'BindScore' nullDist(x) ## S4 replacement method for signature 'BindScore' nullDist(x) <- value ## S4 method for signature 'BindScore' peaks(x, ...) ## S4 method for signature 'BindScore' range(x, ..., na.rm=TRUE) ## S4 method for signature 'BindScore' score(x) ## S4 method for signature 'BindScore' support(x) ## S4 method for signature 'BindScore' tail(x, n=6, by=c("score", "position"), ...) BindScore(call, score=list(), pvalue=list(), peaks=list(), cutoff=c(-Inf, 1), nullDist=c(0, 1), names=NULL, start=1L, compress=TRUE, digits=16)
## S4 method for signature 'BindScore' binding(x) ## S4 method for signature 'BindScore' chrLength(x, subset) ## S4 method for signature 'BindScore' cutoff(x, type=c("score", "pvalue")) ## S4 replacement method for signature 'BindScore' cutoff(x, type=c("score", "pvalue")) <- value ## S4 method for signature 'BindScore' head(x, n=6, by=c("score", "position"), ...) ## S4 method for signature 'BindScore' lapply(X, FUN, ...) ## S4 method for signature 'BindScore' length(x) ## S4 replacement method for signature 'BindScore' length(x) <- value ## S4 method for signature 'BindScore' max(x, ..., na.rm=TRUE) ## S4 method for signature 'BindScore' min(x, ..., na.rm=TRUE) ## S4 method for signature 'BindScore' names(x) ## S4 replacement method for signature 'BindScore,ANY' names(x) <- value ## S4 method for signature 'BindScore' nullDist(x) ## S4 replacement method for signature 'BindScore' nullDist(x) <- value ## S4 method for signature 'BindScore' peaks(x, ...) ## S4 method for signature 'BindScore' range(x, ..., na.rm=TRUE) ## S4 method for signature 'BindScore' score(x) ## S4 method for signature 'BindScore' support(x) ## S4 method for signature 'BindScore' tail(x, n=6, by=c("score", "position"), ...) BindScore(call, score=list(), pvalue=list(), peaks=list(), cutoff=c(-Inf, 1), nullDist=c(0, 1), names=NULL, start=1L, compress=TRUE, digits=16)
x |
Object of class |
X |
Object of class |
subset |
Index vector indicating a subset of |
type |
A string indicating which type of cut-off should be returned or changed. Either |
n |
Number of entries to show. |
by |
A string indicating whether the output should be sorted by score or by position in the genome. |
na.rm |
Logical indication whether |
FUN |
Function to apply to results for each chromosome. |
value |
Replacement value. |
call |
Function call used to generate the values of the other slots. |
score |
List of binding site scores. One component per chromosome. |
pvalue |
List of binding site p-values. One component per chromosome. |
peaks |
List of significant peaks in binding site score. One component per chromosome. |
cutoff |
Numeric vector of length two indicating the significance cut-off in terms of score and p-value. |
nullDist |
Parameters of the null distribution. |
names |
Character vector providing names for chromosomes. |
start |
Integer indicating position of first binding site score. |
compress |
Logical indicating whether scores and p-values should be compressed. |
digits |
The number of decimal places to retain for compression. |
... |
Further arguments. |
Objects can be created by calls of the form new("BindScore", functionCall, score, pvalue, peaks, cutoff, nullDist, names, ...)
.
Objects of this class are typically created (and returned) by functions that perform
peak calling on ChIP-seq data. Usually there should be no need to create them directly.
functionCall
:Object of class "call"
storing the function call used to initiate the analysis.
score
:Object of class "list"
. The binding site score. One numeric vector per chromosome.
pvalue
:Object of class "list"
. The (adjusted) p-values corresponding to the scores in slot score
.
peaks
:Object of class "list"
giving the location of significant peaks in the binding site score. These correspond to the location of predicted binding sites.
cutoff
:Object of class "numeric"
with entries ‘pvalue’ and ‘score’ giving the significance threshold used for peak calling in terms of p-value and score.
nullDist
:Object of class "numeric"
providing the parameters of the null distribution used to determine p-values.
start
:Object of class "integer"
indicating the index corresponding to the first entry in score
(assumed to be the same for all chromosomes).
signature(x = "BindScore")
: Convert results into a data.frame
giving the location, score and p-value of significant peaks.
signature(x = "BindScore", i = "ANY", j = "missing", drop = "missing")
: Restrict results to a subset of chromosomes. Chromosomes can either be identified by name or by numerical index.
signature(x = "BindScore", i = "ANY", j = "missing")
: Restrict results to a single chromosome. Note that x[["chr1"]]
is identical to x["chr1"]
.
signature(x = "BindScore", i = "ANY", j = "numeric")
: subset results to restrict them to a region on a single chromosome.
signature(x = "BindScore")
: Returns length of binding site used during analysis.
signature(x = "BindScore", subset = "ANY")
: Returns length of all chromosomes represented in x
.
signature(x = "BindScore")
: Sets the significance cut-off. Argument type=c("score", "pvalue")
determines which cut-off is to be set, the other is adjusted accordingly. This recalculates the significance of peaks in the binding site score and may be slow.
signature(x = "BindScore")
: Returns significance threshold used for analysis.
signature(x = "BindScore")
: Returns the first n
peaks. Argument by = c("score", "position")
determines whether results are sorted by score or by genomic location.
signature(X = "BindScore")
: Applies a function to results for each chromosome.
signature(x = "BindScore")
: Reduces the number of chromosomes for which results are stored, i.e., length(x) <- 3
only retains the first three chromosomes.
signature(x = "BindScore")
: Returns the number of binding sites identified by the analysis.
signature(x = "BindScore")
: Returns maximum score.
signature(x = "BindScore")
: Returns minimum score.
signature(x = "BindScore", value = "ANY")
: Sets the chromosome names.
signature(x = "BindScore")
: Returns the chromosome names.
signature(x = "BindScore")
: Sets the parameters of the null distribution adjusting the significance cut-off in the process and predicts binding sites using the new null distribution.
signature(x = "BindScore")
: Returns list of predicted binding sites.
signature(x = "BindScore")
: Range of binding site scores.
signature(x = "BindScore")
: Returns list of binding site scores.
signature(x = "BindScore")
: Returns length of support region used during analysis.
signature(x = "BindScore")
: Returns the last n
peaks. Argument by = c("score", "position")
determines whether results are sorted by score or by genomic location.
Peter Humburg
~put references to the literature/web site here ~
ReadCounts
for the data structure used as input for the analysis and callBindingSites
showClass("BindScore") set.seed(1) ## determine binding site locations b <- sample(1:1e6, 5000) ## sample read locations fwd <- unlist(lapply(b, function(x) sample((x-83):(x-73), 20, replace=TRUE))) rev <- unlist(lapply(b, function(x) sample((x+73):(x+83), 20, replace=TRUE))) ## add some background noise fwd <- c(fwd, sample(1:(1e6-25), 50000)) rev <- c(rev, sample(25:1e6, 50000)) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(150000, 150000)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=1e6, extend=1, plot=FALSE) ## predict binding site locations ## the artificial dataset is very small so predictions may not be very reliable bindScore <- simpleNucCall(readPile, bind=147, support=20, plot=FALSE, compress=FALSE) ## number of binding sites found length(bindScore) ## the first few predictions, by score head(bindScore) ## score and p-value cut-off used cutoff(bindScore)
showClass("BindScore") set.seed(1) ## determine binding site locations b <- sample(1:1e6, 5000) ## sample read locations fwd <- unlist(lapply(b, function(x) sample((x-83):(x-73), 20, replace=TRUE))) rev <- unlist(lapply(b, function(x) sample((x+73):(x+83), 20, replace=TRUE))) ## add some background noise fwd <- c(fwd, sample(1:(1e6-25), 50000)) rev <- c(rev, sample(25:1e6, 50000)) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(150000, 150000)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=1e6, extend=1, plot=FALSE) ## predict binding site locations ## the artificial dataset is very small so predictions may not be very reliable bindScore <- simpleNucCall(readPile, bind=147, support=20, plot=FALSE, compress=FALSE) ## number of binding sites found length(bindScore) ## the first few predictions, by score head(bindScore) ## score and p-value cut-off used cutoff(bindScore)
Methods for function callBindingSites
in Package ‘ChIPseqR’. These methods are used to
identify protein binding sites from ChIP-seq data.
## S4 method for signature 'ANY' callBindingSites(data, chrLen, plot=TRUE, verbose=TRUE, ..., plotTo) ## S4 method for signature 'character' callBindingSites(data, type, minQual=70, ...) ## S4 method for signature 'matrix' callBindingSites(data, chrName="chr", ...) ## S4 method for signature 'ReadCounts' callBindingSites(data, bind, support, background, bgCutoff=0.9, supCutoff=0.9, fdr = 0.05, extend=1, tailCut=0.95, piLambda=0.5, adapt=FALSE, corSummary=median, compress = TRUE, digits = 16, plot=TRUE, verbose=TRUE, ask=FALSE, plotTo, ...)
## S4 method for signature 'ANY' callBindingSites(data, chrLen, plot=TRUE, verbose=TRUE, ..., plotTo) ## S4 method for signature 'character' callBindingSites(data, type, minQual=70, ...) ## S4 method for signature 'matrix' callBindingSites(data, chrName="chr", ...) ## S4 method for signature 'ReadCounts' callBindingSites(data, bind, support, background, bgCutoff=0.9, supCutoff=0.9, fdr = 0.05, extend=1, tailCut=0.95, piLambda=0.5, adapt=FALSE, corSummary=median, compress = TRUE, digits = 16, plot=TRUE, verbose=TRUE, ask=FALSE, plotTo, ...)
data |
Either an object containing information about mapped reads or a list. See below for details. |
bind |
Length of binding region to use (see Details). |
support |
Length of support region to use (see Details). |
background |
Length of background window. If this is missing it will be set to 10*( |
chrLen |
Numeric vector indicating the length of all chromosomes. Only needed when
|
bgCutoff |
Numeric value between 0.5 and 1. This determines how much estimates of the background read density are allowed to vary for adjacent windows. Set to 1 to disable cutoff. |
supCutoff |
Numeric value between 0.5 and 1. This determines how much estimates of the support region read density are allowed to vary for forward and reverse strand. Set to 1 to disable cutoff. |
fdr |
Target false discovery rate. |
extend |
Numeric value indicating how far mapped reads should be extended when calculating read counts. |
type |
Format of alignment file (see |
minQual |
Minimum alignment quality to use. All reads with lower alignment quality are discarded. |
tailCut |
Truncation point used to exclude outliers when estimating null distribution. |
chrName |
Name to use for the single chromosome. |
piLambda |
If |
adapt |
Logical indicating whether an adaptive false discovery rate should be used. If this is |
corSummary |
Function used to summarise cross-correlation across chromosomes. See the Details section on binding and support region. |
compress |
Logical indicating whether the return value should be compressed. |
digits |
Number of decimal places to retain for binding site score for compression. |
plot |
Logical. If |
verbose |
Logical. If |
ask |
Logical. Setting this to |
plotTo |
Character string giving the name of a file that should be used to store plots generated during the analysis. If this is not missing a pdf file with the given name will be created. |
... |
Additional arguments. Most methods pass them on to the |
The length of binding and support regions can either be given as a single value or as a range of possible values (by providing the minimum and maximum). In the latter case the cross-correlation between read counts on forward and reverse strand will be used to determine a value within that range. Note that this may lead sub-optimal choices of binding and support region length.
An object of class BindScore
if compress = FALSE
, otherwise an object
of class RLEBindScore
Default method to handle all forms of input not explicitly handled by their own method. In particular this will be used for objects of class AlignedRead
and data.frame
but it will handle class for which a strandPileup
method is available.
Allows to use a file name referring to a file of mapped sequence reads as input.
Uses a matrix of read counts (for a single chromosome) as input.
This methods implements the peak calling algorithm. Other methods will typically reformat their input and pass it on to this method.
simpleNucCall
for an interface with nucleosome specific defaults. This function uses
strandPileup
, startScore
, getCutoff
and pickPeak
. See the
help pages of these functions for additional detail on the individual steps involved. See getBindLen
for details on the estimation of binding and support region length.
set.seed(1) ## determine binding site locations b <- sample(1:1e6, 5000) ## sample read locations fwd <- unlist(lapply(b, function(x) sample((x-83):(x-73), 20, replace=TRUE))) rev <- unlist(lapply(b, function(x) sample((x+73):(x+83), 20, replace=TRUE))) ## add some background noise fwd <- c(fwd, sample(1:(1e6-25), 50000)) rev <- c(rev, sample(25:1e6, 50000)) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(150000, 150000)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=1e6, extend=1, plot=FALSE) ## predict binding site locations ## the artificial dataset is very small so predictions may not be very reliable bindScore <- callBindingSites(readPile, bind=147, support=20, background=2000, plot=FALSE)
set.seed(1) ## determine binding site locations b <- sample(1:1e6, 5000) ## sample read locations fwd <- unlist(lapply(b, function(x) sample((x-83):(x-73), 20, replace=TRUE))) rev <- unlist(lapply(b, function(x) sample((x+73):(x+83), 20, replace=TRUE))) ## add some background noise fwd <- c(fwd, sample(1:(1e6-25), 50000)) rev <- c(rev, sample(25:1e6, 50000)) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(150000, 150000)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=1e6, extend=1, plot=FALSE) ## predict binding site locations ## the artificial dataset is very small so predictions may not be very reliable bindScore <- callBindingSites(readPile, bind=147, support=20, background=2000, plot=FALSE)
Generates a compressed representation of binding site scores.
## S4 method for signature 'BindScore' compress(x, digits=16)
## S4 method for signature 'BindScore' compress(x, digits=16)
x |
An object of class |
digits |
Integer indicating the number of decimal places to retain. |
Binding site scores are compressed by first rounding them to digits
decimal
places and then applying run-length encoding.
An object of class RLEBindScore
.
Compression reduces the precision of binding site scores and may affect results, especially for small values
of digits
.
Peter Humburg
Rle
, RleList
, compress-BindScore
Methods to obtain compressed versions of data structures.
signature(x = "BindScore")
Converts x
into an object of class RLEBindScore
.
signature(x = "ReadCounts")
Converts x
into an object of class RLEReadCounts
.
signature(x = "RLEBindScore")
Returns (the already compressed) x
.
signature(x = "RLEReadCounts")
Returns (the already compressed) x
.
Generates a compressed representation of read counts.
## S4 method for signature 'ReadCounts' compress(x)
## S4 method for signature 'ReadCounts' compress(x)
x |
An object of class |
Run-length encoding is used to obtain a compressed representation of read counts.
An object of class RLEReadCounts
Peter Humburg
Rle
, RleList
, compress-BindScore
These methods extract read count and binding site sores from compressed representations.
## S4 method for signature 'RLEReadCounts' decompress(x) ## S4 method for signature 'RLEBindScore' decompress(x)
## S4 method for signature 'RLEReadCounts' decompress(x) ## S4 method for signature 'RLEBindScore' decompress(x)
x |
An object of class |
An object of class BindScore
or ReadCounts
respectively.
Peter Humburg
Methods to extract compressed data structures.
signature(x = "ANY")
The default method simply returns x
.
signature(x = "Rle")
Restores the atomic vector encoded in x
.
signature(x = "RLEBindScore")
Returns an object of BindScore
.
signature(x = "RleList")
Extracts the components of x
and restores them to atomic vectors.
signature(x = "RLEReadCounts")
Returns an object of ReadCounts
.
Extracts sequence of predicted binding sites from reference genome and exports them in FASTA format.
exportBindSequence(prediction, reference, bind, overlap = FALSE, file = "")
exportBindSequence(prediction, reference, bind, overlap = FALSE, file = "")
prediction |
Object of class |
reference |
Reference genome sequence (as |
bind |
Length of binding site to assume for sequence extraction. This may be missing in which case the value is
derived from |
overlap |
Logical indicating whether overlapping predictions should be allowed. |
file |
Name of output file. |
An XStringViews
object containing the sequences. If a file name is provided this is returned
invisibly.
Peter Humburg
Package Biostrings
XStringViews
, XStringSet
, BindScore
This function calculates the cross-correlation between read counts on forward and reverse strand.
getBindCor(data, max.lag, summary, plot = TRUE, ...)
getBindCor(data, max.lag, summary, plot = TRUE, ...)
data |
An object of class |
max.lag |
Maximum lag to use in cross-correlation calculation. |
summary |
Function to use to summarise cross-correlation across chromosomes. |
plot |
Logical indicating whether to plot cross-correlation. |
... |
Further arguments, currently ignored. |
Function fttcor
in package “timsac” is used to carry out the computation.
The (summarised) cross-correlation. If summary
is missing a list of cross-correlations for
each chromosome is returned.
Peter Humburg
fftcor
, ReadCounts
, getBindLen
The cross-correlation between forward and reverse strand read counts is used to estimate the distance between peaks on both strands. This is then used to derive suitable values for the length of binding and support regions.
getBindLen(data, bind, support, summary = median, verbose = FALSE, plot = TRUE, ...)
getBindLen(data, bind, support, summary = median, verbose = FALSE, plot = TRUE, ...)
data |
An object of class |
bind |
Either known length of binding region or minimum and maximum of binding region length to consider. |
support |
Either known length of support region or minimum and maximum of support region length to consider. |
summary |
Function to use to summarise cross-correlation across chromosomes. |
verbose |
Logical indicating whether progress messages should be printed. |
plot |
Logical indicating whether cross-correlation should be plotted. |
... |
Further arguments to |
This assumes that the first peak in cross-correlation corresponds to the length of the binding site. Note that this is not accurate. The peak is closer to bind + 2*m where m is the median of the read distribution in the support region ('read distribution in the support region' means the read density as a function of distance to binding site start/end). Consequently this method will overestimate the length of the binding site. If either bind or support are of length 1 this is assumed to be the known value and a more accurate estimate for the remaining parameter is used.
A numeric vector giving the estimated lengths of binding and support regions.
Peter Humburg
Given a vector of observed binding site scores and a desired false discovery rate, this function returns the lowest score that should be considered significant to achieve the given false discovery rate.
getCutoff(score, alpha = 0.05, tailCut = 0.95, adapt = FALSE, lambda, plot = TRUE, returnPval = TRUE)
getCutoff(score, alpha = 0.05, tailCut = 0.95, adapt = FALSE, lambda, plot = TRUE, returnPval = TRUE)
score |
Numeric vector with binding site scores. |
alpha |
Desired false discovery rate. |
tailCut |
Truncation point used to exclude outliers when fitting the null distribution. |
adapt |
Logical indicating whether an adaptive false discovery rate should be used. |
lambda |
If |
plot |
If this is |
returnPval |
Indicates whether or not the corrected p-values for all scores should be returned. |
A list with components
cutoff |
A numeric vector with the score and nominal false discovery rate corresponding to the determined cutoff. |
h0 |
A numeric vector with the mean and standard deviation of the estimated null distribution. |
pvalue |
If |
pi0 |
If |
This function is used by callBindingSites
to determine the significance threshold for
peak-calling.
Peter Humburg
For the adaptive false discovery rate procedure used if adapt=TRUE
see
JD Storey, JE Taylor and D Siegmund. Strong control, conservative point
estimation and simultaneous conservative consistency of false discovery rates: a unified approach.
Journal of the Royal Statistical Society B, 66(1):187-205, 2004.
Given a vector of scores and a threshold, this function finds all peaks that exceed the threshold.
pickPeak(score, threshold, offset = 0, sub = FALSE)
pickPeak(score, threshold, offset = 0, sub = FALSE)
score |
Numeric vector. |
threshold |
All values in |
offset |
Offset to add to the determined peak locations. |
sub |
Logical. If this is |
If sub = FALSE
a numeric vector giving the location of all peaks.
Otherwise a list with components
peaks |
The same peak locations that are returned for |
subPeaks |
A list with one component for each entry in ‘peaks’ giving the location of local maxima. |
This function is used by callBindingSites
for peak-calling.
Peter Humburg
callBindingSites
, startScore
, getCutoff
Generates plots to assess the fit of the estimated null distribution.
## S4 method for signature 'BindScore,missing' plot(x, npoints = 10000, type=c("density", "qqplot"), ...)
## S4 method for signature 'BindScore,missing' plot(x, npoints = 10000, type=c("density", "qqplot"), ...)
x |
An object of class |
npoints |
Maximum number of points to plot in a QQ-plot. |
type |
Character string indicating the plot type. |
... |
Further arguments (currently ignored). |
Type ‘density’ produces a histogram of binding site scores with overlaid null distribution. Type ‘qqplot’ produces a normal QQ-plot comparing the observed binding site scores to the null distribution.
Called for its side effect.
Peter Humburg
Produces plots to assess the distribution of reads, either for an entire chromosome or within a (small) window.
## S4 method for signature 'ReadCounts,missing' plot(x, chr, center, score, width=2000, type=c("hilbert", "window"), ...) ## S4 method for signature 'RLEReadCounts,missing' plot(x, chr, center, score, width=2000, type=c("hilbert", "window"), ...)
## S4 method for signature 'ReadCounts,missing' plot(x, chr, center, score, width=2000, type=c("hilbert", "window"), ...) ## S4 method for signature 'RLEReadCounts,missing' plot(x, chr, center, score, width=2000, type=c("hilbert", "window"), ...)
x |
Object of class |
chr |
Index or name of chromosome for which read counts should be plotted. |
center |
For type ‘window’, the center coordinate of the window to plot. |
score |
For type ‘window’, an object of type |
width |
For type ‘window’, the width of the window. |
type |
Character string indicating the type of plot that should be produced (see details). |
... |
Further arguments (see details). |
Type ‘window’ produces a plot that shows read counts as vertical bars. If score
is not missing, it is used to plot the score and predicted binding sites (if any) as well.
Any additional arguments are passed on to .plotWindow
.
Type ‘hilbert’ produces a Hilbert curve plot of read counts for an entire chromosome.
Additional arguments are passed on to .plotReads
.
Called for its side effect.
Peter Humburg
Provides facility to export the location of genomic features to a GFF formatted file.
pos2gff(pos, method, feature, len, strand, score, name)
pos2gff(pos, method, feature, len, strand, score, name)
pos |
Named list with one component per chromosome giving the start position of features on that chromosome. |
method |
Entry for method field in GFF file. Recycled as necessary |
feature |
Entry for feature field in GFF file. Recycled as necessary |
len |
Length of fetures. This is used to calculate matching end positions for each start position given in |
strand |
Entry for feature field in GFF file. Recycled as necessary |
score |
Entry for feature field in GFF file. Recycled as necessary |
name |
Entry for feature field in GFF file. Recycled as necessary |
A data.frame
with columns 'chromosome'
, 'method'
, 'feature'
, 'start'
,
'end'
, 'score'
, 'strand'
. Writing this data frame to a text
file produces a GFF formatted file.
Peter Humburg
The GFF specification: http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
pos <- list(chr1=c(10, 50, 60), chr2=c(22, 200, 500)) pos2gff(pos, "test", "foo", 25, c("+", "+", "-", "+", "-", "-"), 0, "test")
pos <- list(chr1=c(10, 50, 60), chr2=c(22, 200, 500)) pos2gff(pos, "test", "foo", 25, c("+", "+", "-", "+", "-", "-"), 0, "test")
Represents counts of (possibly extended) reads for each strand of the genome.
ReadCounts(counts=list(), names=NULL, compress=TRUE)
ReadCounts(counts=list(), names=NULL, compress=TRUE)
counts |
A list of read counts. Each component is a two column matrix of strand specific read counts for a chromosome. |
names |
Character vector of chromosome names. If this is |
compress |
Logical indicating whether read counts should be compressed. |
Objects can be created by calls of the form ReadCounts(counts, names, compress=FALSE)
or by calls
to strandPileup
.
counts
:Object of class "list"
with one component per chromosome, containing a matrix of read counts (one column per strand).
signature(x = "ReadCounts", i = "ANY", j = "missing")
: Replace read counts for chromosomes indicated by i
.
signature(x = "ReadCounts", i = "ANY", j = "missing", drop = "missing")
: Returns list of read counts for chromosomes indicated by i
.
signature(x = "ReadCounts", i = "ANY", j = "missing")
: Replace read counts for chromosome i
.
signature(x = "ReadCounts", i = "ANY", j = "missing")
: Returns read counts for chromosome i
.
signature(x = "ReadCounts")
: Replace read counts for chromosome i
(by name).
signature(x = "ReadCounts")
: Returns read counts for chromosome i
(by name).
signature(data = "ReadCounts")
: Predict bindingsites from read counts.
signature(x = "ReadCounts", subset = "ANY")
: Returns length of all chromosomes represented in x
.
signature(X = "ReadCounts")
: Apply function to read counts for each chromosome.
signature(x = "ReadCounts")
: Change the number of chromosomes represented by x
to value
.
signature(x = "ReadCounts")
: Number of chromosomes represented by x
.
signature(x = "ReadCounts", value = "ANY")
: Change names of chromosomes.
signature(x = "ReadCounts")
: Chromosome names.
signature(x = "ReadCounts", byStrand = "Logical", subset = "ANY")
: Returns the number of reads on each chromosome, split by strand (if byStrand
is TRUE
).
signature(X = "ReadCounts")
: Apply function to read counts for each chromosome.
Peter Humburg
BindScore
, strandPileup
, compress
, decompress
showClass("ReadCounts") ## generate some very simple artificial read data set.seed(1) fwd <- sample(c(50:70, 250:270), 30, replace=TRUE) rev <- sample(c(197:217, 347:417), 30, replace=TRUE) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(30, 30)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=500, extend=1, plot=FALSE, compress=FALSE) names(readPile) length(readPile) sapply(readPile, sum)
showClass("ReadCounts") ## generate some very simple artificial read data set.seed(1) fwd <- sample(c(50:70, 250:270), 30, replace=TRUE) rev <- sample(c(197:217, 347:417), 30, replace=TRUE) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(30, 30)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=500, extend=1, plot=FALSE, compress=FALSE) names(readPile) length(readPile) sapply(readPile, sum)
This class provides a memory efficient representation of binding site scores.
Objects can be created by calls of the form BindScore(functionCall, score, pvalue, peaks, cutoff, nullDist, names, start, digits, compress=TRUE)
or through calls to callBindingSites
.
functionCall
:Object of class "call"
storing the function call used to initiate the analysis.
score
:Object of class "list"
. The binding site score. One run-length encoded numeric vector per chromosome.
pvalue
:Object of class "list"
. The (adjusted and run-length encoded) p-values corresponding to the scores in slot score
.
peaks
:Object of class "list"
giving the location of significant peaks in the binding site score. These correspond to the location of predicted binding sites.
cutoff
:Object of class "numeric"
with entries ‘pvalue’ and ‘score’ giving the significance threshold used for peak calling in terms of p-value and score.
nullDist
:Object of class "numeric"
providing the parameters of the null distribution used to determine p-values.
start
:Object of class "integer"
indicating the index corresponding to the first entry in score
(assumed to be the same for all chromosomes).
Class "BindScore"
, directly.
signature(x = "RLEBindScore")
: conversion to BindScore
object.
Peter Humburg
showClass("RLEBindScore") set.seed(1) ## determine binding site locations b <- sample(1:1e6, 5000) ## sample read locations fwd <- unlist(lapply(b, function(x) sample((x-83):(x-73), 20, replace=TRUE))) rev <- unlist(lapply(b, function(x) sample((x+73):(x+83), 20, replace=TRUE))) ## add some background noise fwd <- c(fwd, sample(1:(1e6-25), 50000)) rev <- c(rev, sample(25:1e6, 50000)) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(150000, 150000)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=1e6, extend=1, plot=FALSE) ## predict binding site locations ## the artificial dataset is very small so predictions may not be very reliable bindScore <- simpleNucCall(readPile, bind=147, support=20, plot=FALSE, compress=TRUE) ## number of binding sites found length(bindScore) ## the first few predictions, by score head(bindScore) ## score and p-value cut-off used cutoff(bindScore)
showClass("RLEBindScore") set.seed(1) ## determine binding site locations b <- sample(1:1e6, 5000) ## sample read locations fwd <- unlist(lapply(b, function(x) sample((x-83):(x-73), 20, replace=TRUE))) rev <- unlist(lapply(b, function(x) sample((x+73):(x+83), 20, replace=TRUE))) ## add some background noise fwd <- c(fwd, sample(1:(1e6-25), 50000)) rev <- c(rev, sample(25:1e6, 50000)) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(150000, 150000)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=1e6, extend=1, plot=FALSE) ## predict binding site locations ## the artificial dataset is very small so predictions may not be very reliable bindScore <- simpleNucCall(readPile, bind=147, support=20, plot=FALSE, compress=TRUE) ## number of binding sites found length(bindScore) ## the first few predictions, by score head(bindScore) ## score and p-value cut-off used cutoff(bindScore)
This class provides a memory efficient representation of strand specific read counts.
Objects can be created by calls of the form ReadCounts(counts, names, compress = TRUE)
or by calls
to strandPileup
.
counts
:Object of class "list"
with one component per chromosome,
containing a read counts encoded in an object of class RleList
.
Class "ReadCounts"
, directly.
signature(x = "RLEReadCounts")
: Returns length of all chromosomes represented in x
.
signature(x = "RLEReadCounts")
: Expands read counts and returns object of class ReadCounts
.
signature(x = "RLEReadCounts")
: Returns the number of reads on each chromosome, split by strand (if byStrand
is TRUE
)
signature(x = "RLEReadCounts", y = "missing")
: Generates plots of read counts.
Peter Humburg
showClass("RLEReadCounts") ## generate some very simple artificial read data set.seed(1) fwd <- sample(c(50:70, 250:270), 30, replace=TRUE) rev <- sample(c(197:217, 347:417), 30, replace=TRUE) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(30, 30)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=500, extend=1, plot=FALSE, compress=TRUE) names(readPile) length(readPile) sapply(readPile, sum)
showClass("RLEReadCounts") ## generate some very simple artificial read data set.seed(1) fwd <- sample(c(50:70, 250:270), 30, replace=TRUE) rev <- sample(c(197:217, 347:417), 30, replace=TRUE) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(30, 30)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=500, extend=1, plot=FALSE, compress=TRUE) names(readPile) length(readPile) sapply(readPile, sum)
This function provides a simplified interface to callBindingSites
with defaults suitable for
the detection of nucleosomes.
simpleNucCall(data, bind=128, support=17, background=2000, chrLen, ...)
simpleNucCall(data, bind=128, support=17, background=2000, chrLen, ...)
data |
Either an object of class |
bind |
Length of binding region to use (see Details). |
support |
Length of support region to use (see Details). |
background |
Length of background window. If this is missing it will be set to 10*( |
chrLen |
Numeric vector indicating the length of all chromosomes. Only needed when
|
... |
Further arguments to |
A list
with components
binding |
A |
score |
A |
pval |
A |
Peter Humburg
~put references to the literature/web site here ~
callBindingSites
for additional parameters.
## generate some simple artificial read data set.seed(1) ## determine binding site locations b <- sample(1:1e6, 5000) ## sample read locations fwd <- unlist(lapply(b, function(x) sample((x-83):(x-73), 20, replace=TRUE))) rev <- unlist(lapply(b, function(x) sample((x+73):(x+83), 20, replace=TRUE))) ## add some background noise fwd <- c(fwd, sample(1:(1e6-25), 50000)) rev <- c(rev, sample(25:1e6, 50000)) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(150000, 150000)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=1e6, extend=1, plot=FALSE) ## predict binding site locations bindScore <- simpleNucCall(readPile, bind=147, support=20, plot=FALSE)
## generate some simple artificial read data set.seed(1) ## determine binding site locations b <- sample(1:1e6, 5000) ## sample read locations fwd <- unlist(lapply(b, function(x) sample((x-83):(x-73), 20, replace=TRUE))) rev <- unlist(lapply(b, function(x) sample((x+73):(x+83), 20, replace=TRUE))) ## add some background noise fwd <- c(fwd, sample(1:(1e6-25), 50000)) rev <- c(rev, sample(25:1e6, 50000)) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(150000, 150000)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=1e6, extend=1, plot=FALSE) ## predict binding site locations bindScore <- simpleNucCall(readPile, bind=147, support=20, plot=FALSE)
For each position in the genome this function computes a score indicating the likelihood that a protein binding site starts at that position.
startScore(data, b, support, background, bgCutoff, supCutoff)
startScore(data, b, support, background, bgCutoff, supCutoff)
data |
A two column matrix with read counts. The two columns correspond to reads on the forward and reverse strand respectively. |
b |
Length of binding region. |
support |
Length of support region. |
background |
Length of background window. |
bgCutoff |
Cutoff for the change in read rates between adjacent windows (see Details). |
supCutoff |
Cutoff for the change in read rates between support regions on forward and reverse strand (see Details). |
Robust estimates of read rates in background windows and support regions are obtained by limiting the
difference between related estimates. Consider a forward support region of length 10 containing 20 reads.
The maximum likelihood estimate for the rate parameter of the (assumed) underlying Poisson distribution is
. If there are 50 reads in the reverse
support region a robust estimate of the rate parameter is obtained as
max(50/10, qpois(supCutoff, lambda=lambda_hat))
Numeric vector with binding site scores.
Instead of calling this function directly use callBindingSites
.
Peter Humburg
Given a set of aligned reads this function computes the number of reads starting at each position in the genome.
## S4 method for signature 'AlignedRead' strandPileup(aligned, chrLen, extend, coords=c("leftmost", "fiveprime"), compress = TRUE, plot = TRUE, ask = FALSE, ...) ## S4 method for signature 'data.frame' strandPileup(aligned, chrLen, extend, coords=c("leftmost", "fiveprime"), compress = TRUE, plot = TRUE, ask = FALSE, ...)
## S4 method for signature 'AlignedRead' strandPileup(aligned, chrLen, extend, coords=c("leftmost", "fiveprime"), compress = TRUE, plot = TRUE, ask = FALSE, ...) ## S4 method for signature 'data.frame' strandPileup(aligned, chrLen, extend, coords=c("leftmost", "fiveprime"), compress = TRUE, plot = TRUE, ask = FALSE, ...)
aligned |
An object containing information about aligned reads (see Details). |
chrLen |
A numeric vector giving the length of each chromosome. |
extend |
A numeric value indicating how far reads should be extended. |
coords |
A character value indicating the coordinate system to use.
See |
compress |
Logical indicating whether read counts should be compressed. |
plot |
If this is |
ask |
Logical. Setting this to |
... |
Further arguments to |
The method for data.frame
requires the column names to follow a strict naming scheme. Required columns are
A factor with chromosome names.
A factor with levels “-” and “+” indicating which strand the read mapped to.
Start position of read on chromosome.
End position of read or length of read respectively.
An object of class ReadCounts
.
Peter Humburg
coverage
, AlignedRead
,
callBindingSites
## generate some very simple artificial read data set.seed(1) fwd <- sample(c(50:70, 250:270), 30, replace=TRUE) rev <- sample(c(197:217, 347:417), 30, replace=TRUE) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(30, 30)))) readPile <- strandPileup(reads, chrLen=500, extend=1, plot=FALSE)
## generate some very simple artificial read data set.seed(1) fwd <- sample(c(50:70, 250:270), 30, replace=TRUE) rev <- sample(c(197:217, 347:417), 30, replace=TRUE) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(30, 30)))) readPile <- strandPileup(reads, chrLen=500, extend=1, plot=FALSE)
Read counts are summarized in a sliding window of variable size with variable overlap between windows.
windowCounts(reads, window = 1000, shift = 500, method = sum)
windowCounts(reads, window = 1000, shift = 500, method = sum)
reads |
Numeric vector of read counts. |
window |
Width of window. |
shift |
Distance between consecutive window start positions. |
method |
Function used to produce a summary for each window. It should accept a single numeric vector as argument. |
If method
returns a single value a vector of all window summaries is returned, otherwise
the return value is a list with one component for each window.
Peter Humburg
## generate some very simple artificial read data set.seed(1) fwd <- sample(c(50:70, 250:270), 30, replace=TRUE) rev <- sample(c(197:217, 347:417), 30, replace=TRUE) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(30, 30)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=501, extend=1, plot=FALSE, compress=FALSE) ## get number of reads in sliding window wdwCount <- windowCounts(apply(readPile[[1]], 1, sum), window=10, shift=5)
## generate some very simple artificial read data set.seed(1) fwd <- sample(c(50:70, 250:270), 30, replace=TRUE) rev <- sample(c(197:217, 347:417), 30, replace=TRUE) ## create data.frame with read positions as input to strandPileup reads <- data.frame(chromosome="chr1", position=c(fwd, rev), length=25, strand=factor(rep(c("+", "-"), times=c(30, 30)))) ## create object of class ReadCounts readPile <- strandPileup(reads, chrLen=501, extend=1, plot=FALSE, compress=FALSE) ## get number of reads in sliding window wdwCount <- windowCounts(apply(readPile[[1]], 1, sum), window=10, shift=5)