Title: | chipseq: A package for analyzing chipseq data |
---|---|
Description: | Tools for helping process short read data for chipseq experiments. |
Authors: | Deepayan Sarkar [aut], Robert Gentleman [aut], Michael Lawrence [aut], Zizhen Yao [aut], Oluwabukola Bamigbade [ctb] (Converted vignette from Sweave to R Markdown / HTML.), Bioconductor Package Maintainer [cre] |
Maintainer: | Bioconductor Package Maintainer <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.57.0 |
Built: | 2024-10-30 04:40:54 UTC |
Source: | https://github.com/bioc/chipseq |
Convenience for creating an
SRFilter
object appropriate
for ChIP-seq data. Typically, the result is passed
to readAligned
when loading reads.
chipseqFilter(exclude = "[_MXY]", uniqueness = c("location", "sequence", "location*sequence", "none"), hasStrand = TRUE)
chipseqFilter(exclude = "[_MXY]", uniqueness = c("location", "sequence", "location*sequence", "none"), hasStrand = TRUE)
exclude |
A regular expression for excluding chromosomes by name. Just
like the parameter to |
uniqueness |
The criteria used to determine whether a read is unique. A read may
be unique if it maps to a unique |
hasStrand |
Whether to require that the read is mapped to a strand, which usually translates to whether the read was mapped at all. |
An SRFilter
object
M. Lawrence
sp <- SolexaPath(system.file("extdata", package="ShortRead")) filter <- chipseqFilter() aln <- readAligned(sp, "s_2_export.txt", filter=filter) ## allow mapping to the same location (but only if sequence is different) filter <- chipseqFilter(uniqueness = "sequence") aln <- readAligned(sp, "s_2_export.txt", filter=filter) ## allow sex chromosomes filter <- chipseqFilter(exclude = "[M_]") aln <- readAligned(sp, "s_2_export.txt", filter=filter)
sp <- SolexaPath(system.file("extdata", package="ShortRead")) filter <- chipseqFilter() aln <- readAligned(sp, "s_2_export.txt", filter=filter) ## allow mapping to the same location (but only if sequence is different) filter <- chipseqFilter(uniqueness = "sequence") aln <- readAligned(sp, "s_2_export.txt", filter=filter) ## allow sex chromosomes filter <- chipseqFilter(exclude = "[M_]") aln <- readAligned(sp, "s_2_export.txt", filter=filter)
A function that plots one or two coverage vectors over a relatively small interval in the genome.
coverageplot(peaks1, peaks2 = NULL, i = 1, xlab = "Position", ylab = "Coverage", opposite = TRUE, ...)
coverageplot(peaks1, peaks2 = NULL, i = 1, xlab = "Position", ylab = "Coverage", opposite = TRUE, ...)
peaks1 , peaks2
|
A set of peaks as described by ranges over a coverage vector. |
i |
Which peak to use. |
xlab , ylab
|
Axis labels. |
opposite |
Logical specifying whether the two peaks should be plotted on opposite sides (appropriate for positive and negative strand peaks). |
... |
extra arguments. |
Deepayan Sarkar
cov <- Rle(c(1:10, seq(10, 1, -2), seq(1,5,2), 4:1), rep(1:2, 11)) peaks <- slice(cov, 3) peaks.cov <- Views(cov, ranges(peaks)) peaks.cov.rev <- rev(peaks.cov) coverageplot(peaks.cov, peaks.cov.rev, ylab = "Example")
cov <- Rle(c(1:10, seq(10, 1, -2), seq(1,5,2), 4:1), rep(1:2, 11)) peaks <- slice(cov, 3) peaks.cov <- Views(cov, ranges(peaks)) peaks.cov.rev <- rev(peaks.cov) coverageplot(peaks.cov, peaks.cov.rev, ylab = "Example")
A small subset of a ChIP-Seq dataset downloaded from the Short-Read Archive.
data(cstest)
data(cstest)
The dataset is on object of class GRangesList
with read
alignments from three chromosomes in two lanes representing CTCF and
GFP pull-down in mouse.
Short Read Archive, GEO accession number GSM288351 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM288351
Chen X., Xu H., Yuan P., Fang F., Huss M., Vega V.B., Wong E., Orlov Y.L., Zhang W., Jiang J., Loh Y.H., Yeo H.C., Yeo Z.X., Narang V., Govindarajan K.R., Leong B., Shahab A.S., Ruan Y., Bourque G., Sung W.K., Clarke N.D., Wei C.L., Ng H.H. (2008), “Integration of External Signaling Pathways with the Core Transcriptional Network in Embryonic Stem Cells”. Cell, 133:1106-1117.
data(cstest) names(cstest) cstest$gfp
data(cstest) names(cstest) cstest$gfp
Given two sets of peaks, this function combines them and summarizes the individual coverage vectors under the combined peak set.
diffPeakSummary(ranges1, ranges2, viewSummary = list(sums = viewSums, maxs = viewMaxs))
diffPeakSummary(ranges1, ranges2, viewSummary = list(sums = viewSums, maxs = viewMaxs))
ranges1 |
First set of peaks (typically
an |
ranges2 |
Second set of peaks (typically
an |
viewSummary |
A list of the per peak summary functions. |
A data.frame
with one row for each peak in the combined data.
The chromosome, start and stop nucleotide positions (+ strand) are
given as are the summary statistics requested.
D. Sarkar
data(cstest) library(BSgenome.Mmusculus.UCSC.mm9) seqlevels(cstest) <- seqlevels(Mmusculus) seqlengths(cstest) <- seqlengths(Mmusculus) ## find peaks findPeaks <- function(reads) { reads.ext <- resize(reads, width = 200) slice(coverage(reads.ext), lower = 8) } peakSummary <- diffPeakSummary(findPeaks(cstest$gfp), findPeaks(cstest$ctcf))
data(cstest) library(BSgenome.Mmusculus.UCSC.mm9) seqlevels(cstest) <- seqlevels(Mmusculus) seqlengths(cstest) <- seqlengths(Mmusculus) ## find peaks findPeaks <- function(reads) { reads.ext <- resize(reads, width = 200) slice(coverage(reads.ext), lower = 8) } peakSummary <- diffPeakSummary(findPeaks(cstest$gfp), findPeaks(cstest$ctcf))
estimate.mean.fraglen
implements three methods for estimating
mean fragment length. The other functions are related helper
functions implementing various methods, but may be useful by
themselves for diagnostic purposes. Many of these operations are
potentially slow.
sparse.density
is intended to be similar to
density
, but returns the results in a run-length encoded
form. This is useful when long stretches of the range of the data
have zero density.
estimate.mean.fraglen(x, method = c("SISSR", "coverage", "correlation"), ...) basesCovered(x, shift = seq(5, 300, 5), seqLen = 100, verbose = FALSE) densityCorr(x, shift = seq(0, 500, 5), center = FALSE, width = seqLen *2L, seqLen=100L, maxDist = 500L, ...) sparse.density(x, width = 50, kernel = "epanechnikov", from = start(rix)[1] - 10L, to = end(rix)[length(rix)] + 10L)
estimate.mean.fraglen(x, method = c("SISSR", "coverage", "correlation"), ...) basesCovered(x, shift = seq(5, 300, 5), seqLen = 100, verbose = FALSE) densityCorr(x, shift = seq(0, 500, 5), center = FALSE, width = seqLen *2L, seqLen=100L, maxDist = 500L, ...) sparse.density(x, width = 50, kernel = "epanechnikov", from = start(rix)[1] - 10L, to = end(rix)[length(rix)] + 10L)
x |
For For For |
method |
Character string giving method to be used.
|
shift |
Integer vector giving amount of shifts to be tried when optimizing. The current algorithm simply evaluates all supplied values and reports the one giving minimum coverage or maximum correlation. |
seqLen |
For the |
verbose |
Logical specifying whether progress information should be printed during execution. |
center |
For the |
width |
half-bandwidth used in the computation. This needs to be specified as an integer, data-driven rules are not supported. |
kernel |
A character string giving the density kernel. |
from , to
|
specifies range over which the density is to be computed. |
maxDist |
If distance to nearest neighbor is more than this, the position is discarded. This removes isolated points, which are not very informative. |
... |
Extra arguments, passed on as appropriate to other functions. |
For the correlation method, the range over which densities are computed only cover the range of reads; that is, the beginning and end of chromosomes are excluded.
estimate.mean.fraglen
gives an estimate of the mean fragment
length.
basesCovered
and densityCorr
give a vector of the
corresponding objective function evaluated at the supplied values of
shift
.
sparse.density
returns an object of class "Rle"
.
Deepayan Sarkar, Michael Lawrence
R. Jothi, S. Cuddapah, A. Barski, K. Cui, and K. Zhao. Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Research, 36:5221–31, 2008.
P. V. Kharchenko, M. Y. Tolstorukov, and P. J. Park. Design and analysis of ChIP experiments for DNA-binding proteins. Nature Biotechnology, 26:1351–1359, 2008.
data(cstest) estimate.mean.fraglen(cstest[["ctcf"]], method = "coverage")
data(cstest) estimate.mean.fraglen(cstest[["ctcf"]], method = "coverage")
Plots the distribution of island depths using points for the observed islands and a line for the Poisson estimate of the noise. Useful for choosing a depth corresponding to a desired FDR.
islandDepthPlot(x, maxDepth = 20L)
islandDepthPlot(x, maxDepth = 20L)
x |
A coverage object, e.g., |
maxDepth |
The maximum depth to plot (there are usually some outliers). |
D. Sarkar, M. Lawrence
peakCutoff
for calculating a cutoff value for an FDR.
data(cstest) cov <- coverage(resize(cstest$ctcf, width=200)) islandDepthPlot(cov)
data(cstest) cov <- coverage(resize(cstest$ctcf, width=200)) islandDepthPlot(cov)
Subsamples data from multiple lanes on a per-chromosome basis.
laneSubsample(lane1, lane2, fudge = 0.05)
laneSubsample(lane1, lane2, fudge = 0.05)
lane1 , lane2
|
Two lanes of data, each of class
|
fudge |
A numeric fudge factor. For each chromosome, if the
difference in the sizes relative to the size of the first dataset is
less than |
laneSubsample
returns a list similar to its input, but with the
larger dataset subsampled to be similar to the smaller one.
D. Sarkar
data(cstest) ## subsample to compare lanes cstest.sub <- laneSubsample(cstest[[1]], cstest[[2]]) unlist(cstest.sub)
data(cstest) ## subsample to compare lanes cstest.sub <- laneSubsample(cstest[[1]], cstest[[2]]) unlist(cstest.sub)
Calculates a peak cutoff value given an FDR, assuming a Poisson noise distribution estimated from the frequency of singleton and doubleton islands.
peakCutoff(cov, fdr.cutoff = 0.001, k = 2:20)
peakCutoff(cov, fdr.cutoff = 0.001, k = 2:20)
cov |
The coverage object, e.g.,
an |
fdr.cutoff |
The maximum-allowed FDR for calculating the cutoff. |
k |
The coverage levels at which to estimate an FDR value. The maximal
value that is less than |
A numeric value to use for calling peaks
D. Sarkar and M. Lawrence
islandDepthPlot
for the graphical equivalent; the
vignette for a bit more explanation.
data(cstest) cov <- coverage(resize(cstest$ctcf, width=200)) peakCutoff(cov)
data(cstest) cov <- coverage(resize(cstest$ctcf, width=200)) peakCutoff(cov)
Summarizes a set of peaks into a
GRanges
object with columns
of statistics like the peak maxima and integrals (sums).
peakSummary(x, ...)
peakSummary(x, ...)
x |
An object containing peaks, usually
a |
... |
Arguments to pass to methods |
A GRanges
object of the peaks, with columns named
max
, maxpos
(position of the maximum, centered),
and sum
.
view-summarization-methods in the IRanges package
for view summarization methods like viewMaxs
and viewSums
.
THIS FUNCTION IS DEFUNCT!
Divides a short-read dataset into several subsets, and computes various summaries cumulatively. The goal is to study the characteristics of the data as a function of sample size.
subsetSummary(x, chr, nstep, props = seq(0.1, 1, 0.1), chromlens = seqlengths(x), fg.cutoff = 6, seqLen = 200, fdr.cutoff = 0.001, use.fdr = FALSE, resample = TRUE, islands = TRUE, verbose = getOption("verbose"))
subsetSummary(x, chr, nstep, props = seq(0.1, 1, 0.1), chromlens = seqlengths(x), fg.cutoff = 6, seqLen = 200, fdr.cutoff = 0.001, use.fdr = FALSE, resample = TRUE, islands = TRUE, verbose = getOption("verbose"))
x |
A |
chr |
The chromosome for which the summaries are to be obtained.
Must specify a valid element of |
nstep |
The number of maps in each increment for the full dataset (not per-chromosome). This will be translated to a per-chromosome number proportionally. |
props |
Alternatively, an increasing sequence of proportions
determining the size of each subset. Overrides |
chromlens |
A named vector of per-chromosome lengths, typically
the result of |
fg.cutoff |
The coverage depth above which a region would be considered foreground. |
seqLen |
The number of bases to which to extend each read before computing coverage. |
resample |
Logical; whether to randomly reorder the reads before
dividing them up into subsets. Useful to remove potential order
effects (for example, if data from two lanes were combined to
produce |
fdr.cutoff |
The maximum false discovery rate for a region that is considered to be foreground. |
use.fdr |
Whether to use the FDR detected peaks when calling foreground and background. |
islands |
Logical. If |
verbose |
logical controlling whether progress information will be shown during computation (which is potentially long-running). |
A data frame with various per-subset summaries.
This function should be considered preliminary, in that it might change significantly or simply be removed in a subsequent version. If you like it the way it is, please notify the maintainer.
Deepayan Sarkar, Michael Lawrence