Title: | GC Aware Peak Caller |
---|---|
Description: | Peak calling for ChIP-seq data with consideration of potential GC bias in sequencing reads. GC bias is first estimated with generalized linear mixture models using effective GC strategy, then applied into peak significance estimation. |
Authors: | Mingxiang Teng and Rafael A. Irizarry |
Maintainer: | Mingxiang Teng <[email protected]> |
License: | GPL-3 |
Version: | 1.31.0 |
Built: | 2024-11-23 06:25:05 UTC |
Source: | https://github.com/bioc/gcapc |
ChIP-seq experiments usually use crosslinking strategy to capture sequencing fragments. The fragment location is affected by at least but not limited to two factors, protein real binding and crosslinking operation. This function estimate size of binding part in crosslinked DNA-protein complexes, and denoted that as ChIP-seq binding width.Also, the peak detection window half size is estimated based on binding width.
bindWidth(coverage, range = c(50L, 500L), step = 50L, odd = TRUE)
bindWidth(coverage, range = c(50L, 500L), step = 50L, odd = TRUE)
coverage |
A list object returned by function |
range |
A non-nagative integer vector with length 2. This vector set the range within which binding width and peak window size are estimated. Default c(50,500) represents most ChIP-seq experiments. |
step |
A non-negative integer to set the resolution of binding
width estimation within |
odd |
A logical vector which, when TRUE, only allows return odd number of binding width, which is preferred by the effective GC content estimation. Default: TRUE. |
A numeric vector with 2 elements: Estimated binding width and half size of peak detection window.
bam <- system.file("extdata", "chipseq.bam", package="gcapc") cov <- read5endCoverage(bam) bindWidth(cov)
bam <- system.file("extdata", "chipseq.bam", package="gcapc") cov <- read5endCoverage(bam) bindWidth(cov)
This function calls ChIP-seq peaks using potential GC effects information. Enrichment scores are calculated on sliding windows of prefiltered large regions, with GC effects considered. Permutation analysis is used to determine significant binding peaks.
gcapcPeaks(coverage, gcbias, bdwidth, flank = NULL, prefilter = 4L, permute = 5L, pv = 0.05, plot = FALSE, genome = "hg19", gctype = c("ladder", "tricube"))
gcapcPeaks(coverage, gcbias, bdwidth, flank = NULL, prefilter = 4L, permute = 5L, pv = 0.05, plot = FALSE, genome = "hg19", gctype = c("ladder", "tricube"))
coverage |
A list object returned by function |
gcbias |
A list object returned by function |
bdwidth |
A non-negative integer vector with two elements
specifying ChIP-seq binding width and peak detection half window size.
Usually generated by function |
flank |
A non-negative integer specifying the flanking width of
ChIP-seq binding. This parameter provides the flexibility that reads
appear in flankings by decreased probabilities as increased distance
from binding region. This paramter helps to define effective GC
content calculation. Default is NULL, which means this paramater will
be calculated from |
prefilter |
A non-negative integer specifying the minimum of reads
to qualify a potential binding region. Regions with total of reads from
forward and reverse strands larger or equivalent to |
permute |
A non-negative integer specifying times of permutation to be performed. Default is 5. When whole large genome is used, such as human genome, 5 times of permutation could be enough. |
pv |
A numeric specifying p-value cutoff for significant binding peaks. Default is 0.05. |
plot |
A logical vector which, when TRUE (default), returns density plots of real and permutation enrichment scores. |
genome |
A BSgenome object containing the sequences
of the reference genome that was used to align the reads, or the name of
this reference genome specified in a way that is accepted by the
|
gctype |
A character vector specifying choice of method to calculate
effective GC content. Default |
A GRanges of peaks with meta columns:
es |
Estimated enrichment score. |
pv |
p-value. |
bam <- system.file("extdata", "chipseq.bam", package="gcapc") cov <- read5endCoverage(bam) bdw <- bindWidth(cov) gcb <- gcEffects(cov, bdw, sampling = c(0.15,1)) gcapcPeaks(cov, gcb, bdw)
bam <- system.file("extdata", "chipseq.bam", package="gcapc") cov <- read5endCoverage(bam) bdw <- bindWidth(cov) gcb <- gcEffects(cov, bdw, sampling = c(0.15,1)) gcapcPeaks(cov, gcb, bdw)
GC effects are estimated based on effective GC content and reads count on genome-wide windows, using generalized linear mixture models. Genome wide windows are randomly or supervised sampled with given proportions. GC effects of background and foreground are estimated separately.
gcEffects(coverage, bdwidth, flank = NULL, plot = TRUE, sampling = c(0.05, 1), supervise = GRanges(), gcrange = c(0.3, 0.8), emtrace = TRUE, model = c("nbinom", "poisson"), mu0 = 1, mu1 = 50, theta0 = mu0, theta1 = mu1, p = 0.02, converge = 0.001, genome = "hg19", gctype = c("ladder", "tricube"))
gcEffects(coverage, bdwidth, flank = NULL, plot = TRUE, sampling = c(0.05, 1), supervise = GRanges(), gcrange = c(0.3, 0.8), emtrace = TRUE, model = c("nbinom", "poisson"), mu0 = 1, mu1 = 50, theta0 = mu0, theta1 = mu1, p = 0.02, converge = 0.001, genome = "hg19", gctype = c("ladder", "tricube"))
coverage |
A list object returned by function |
bdwidth |
A non-negative integer vector with two elements
specifying ChIP-seq binding width and peak detection half window size.
Usually generated by function |
flank |
A non-negative integer specifying the flanking width of
ChIP-seq binding. This parameter provides the flexibility that reads
appear in flankings by decreased probabilities as increased distance
from binding region. This paramter helps to define effective GC
content calculation. Default is NULL, which means this paramater will
be calculated from |
plot |
A logical vector which, when TRUE (default), returns plots of intermediate results. |
sampling |
A numeric vector with length 2. The first number specifies the proportion of regions to be sampled for GC effects estimation. The second number specifies the repeat times for sampling. Default c(0.05,1) gives pretty robust estimation for human genome. However, smaller genomes might need both higher proportion and more repeat times for robust estimation. |
supervise |
A GRanges object specifying peak regions in the studied data, such as peaks called by peak callers, e.g. MACS & SPP. These peak regions provide supervised window sampling for both mixtures in the generalized linear model. Default no supervising. Or, if provided peak regions have too few covered windows, supervised sampling will be replaced by random sampling automatically. |
gcrange |
A non-negative numeric vector with length 2. This vector sets the range of GC content to filter regions for GC effect estimation. For human, most regions have GC content between 0.3 and 0.8, which is set as the default. Other regions with GC content beyond this range will be ignored. This range is critical when very few foreground regions are selected for mixture model fitting, since outliers could drive the regression lines. Thus, if possible, first make a scatter plot between counts and GC content to decide this parameter. Alternatively, select a narrower range, e.g. c(0.35,0.7), to aviod outlier effects from both high and low GC-content regions. |
emtrace |
A logical vector which, when TRUE (default), allows to print the trace of log likelihood changes in EM iterations. |
model |
A character specifying the distribution model to be used in
generalized linear model fitting. The default is negative
binomial( |
mu0 |
A non-negative numeric initiating read count signals for background regions. This is treated as the starting value of background mean for poisson/nbinom fitting. Default is 1. |
mu1 |
A non-negative numeric initiating read count signals for foreground regions. This is treated as the starting value of foreground mean for poisson/nbinom fitting, Default is 50. |
theta0 |
A non-negative numeric initiating the shape parameter of
negative binomial model for background regions. For more detail, see
theta in |
theta1 |
A non-negative numeric initiating the shape parameter of
negative binomial model for foreground regions. For more detail, see
theta in |
p |
A non-negative numeric specifying the proportion of foreground regions in all estimated regions. This is treated as a starting value for EM algorithm. Default is 0.02. |
converge |
A non-negative numeric specifying the condition of EM
algorithm termination. EM algorithm stops when the ratio of log likelihood
increment to whole log likelihood is less or equivalent to
|
genome |
A BSgenome object containing the sequences
of the reference genome that was used to align the reads, or the name of
this reference genome specified in a way that is accepted by the
|
gctype |
A character vector specifying choice of method to calculate
effective GC content. Default |
A list of objects
gc |
The GC contents at which GC effects are estimated. |
mu0 |
Predicted background signals at GC content |
mu1 |
Predicted foreground signals at GC content |
mu0med0 |
Median of predicted background signals. |
mu1med1 |
Median of predicted foreground signals. |
mu0med1 |
Median of predicted background signals at GC content of foreground windows. |
mu1med0 |
Median of predicted foreground signals at GC content of background windows. |
bam <- system.file("extdata", "chipseq.bam", package="gcapc") cov <- read5endCoverage(bam) bdw <- bindWidth(cov) gcb <- gcEffects(cov, bdw, sampling = c(0.15,1))
bam <- system.file("extdata", "chipseq.bam", package="gcapc") cov <- read5endCoverage(bam) bdw <- bindWidth(cov) gcb <- gcEffects(cov, bdw, sampling = c(0.15,1))
Plot the consistancy between two peak lists by their significance.
peaksCAT(x, y, ranks = seq(200, min(length(x), length(y), 20000), 50), exclude = GRanges(), seqinfo = NULL, esx = 1, esy = 1, add = FALSE, ...)
peaksCAT(x, y, ranks = seq(200, min(length(x), length(y), 20000), 50), exclude = GRanges(), seqinfo = NULL, esx = 1, esy = 1, add = FALSE, ...)
x |
A GRanges of identified peaks from one method or one replicate. At least one meta column should be included to allow for significance ranking of peaks. |
y |
A GRanges of identified peaks from compared method or anoter replicate. At least one meta column should be included to allow for significance ranking of peaks. |
ranks |
A non-negative integer vector specifying the ranks to be used for CAT plot. |
exclude |
A GRanges object specifying regions to be excluded for CAT plot, such as the blacklist regions proposed by ENCODE Consortium. |
seqinfo |
A vector of chromosome names to limit the CAT plot to
selected chromosomes. Chromosome names here must be in the same format
as |
esx |
A non-negative integer specifying which meta column of
|
esy |
A non-negative integer specifying which meta column of
|
add |
A logical vector which, when TRUE, adds the current plotting line to existing plots. FALSE will generate a new plot. |
... |
Other parameters passed to |
A CAT plot.
bam <- system.file("extdata", "chipseq.bam", package="gcapc") cov <- read5endCoverage(bam) bdw <- bindWidth(cov) gcb1 <- gcEffects(cov, bdw, sampling=c(0.15,1), plot=FALSE) peaks1 <- gcapcPeaks(cov, gcb1, bdw) gcb2 <- gcEffects(cov, bdw, sampling=c(0.2,1), plot=FALSE) peaks2 <- gcapcPeaks(cov, gcb2, bdw) peaksCAT(peaks1, peaks2, ranks=seq(100,200,5), ylim=c(0,1))
bam <- system.file("extdata", "chipseq.bam", package="gcapc") cov <- read5endCoverage(bam) bdw <- bindWidth(cov) gcb1 <- gcEffects(cov, bdw, sampling=c(0.15,1), plot=FALSE) peaks1 <- gcapcPeaks(cov, gcb1, bdw) gcb2 <- gcEffects(cov, bdw, sampling=c(0.2,1), plot=FALSE) peaks2 <- gcapcPeaks(cov, gcb2, bdw) peaksCAT(peaks1, peaks2, ranks=seq(100,200,5), ylim=c(0,1))
Reads coverage in single base pair resolution using only 5-prime end of BAM file records. Coverages are reported for forward and reverse strands separately. Options for customized filtering of BAM records are provided.
read5endCoverage(bam, chroms = NULL, mapq = 30L, duplicate = FALSE, flag = scanBamFlag(isUnmappedQuery = FALSE, isSecondaryAlignment = FALSE, isNotPassingQualityControls = FALSE))
read5endCoverage(bam, chroms = NULL, mapq = 30L, duplicate = FALSE, flag = scanBamFlag(isUnmappedQuery = FALSE, isSecondaryAlignment = FALSE, isNotPassingQualityControls = FALSE))
bam |
The path to a BAM file, which is sorted and indexed. |
chroms |
NULL or a vector of chromosome names that compatible with the provided BAM file. Reads coverage will be generated for these chromosomes. Default (NULL) will use all chromosomes in BAM file. |
mapq |
A non-negative integer specifying the minimum mapping
quality to include. BAM records with mapping qualities less
than |
duplicate |
A logical vector which, when FALSE (Default), returns maximum coverage of 1 for every base pair. Reads that start at the same position but on different strands are not treated as duplicates. |
flag |
A returned object by |
A list of two objects by GenomicRanges::coverage
fwd |
Coverage object for forward strand. |
rev |
Coverage object for reverse strand. |
bam <- system.file("extdata", "chipseq.bam", package="gcapc") read5endCoverage(bam)
bam <- system.file("extdata", "chipseq.bam", package="gcapc") read5endCoverage(bam)
This function refines the ranks (i.e. significance/pvalue) of pre-determined peaks by potential GC effects. These peaks can be obtained from other peak callers, e.g. MACS or SPP.
refinePeaks(coverage, gcbias, bdwidth, peaks, flank = NULL, permute = 5L, genome = "hg19", gctype = c("ladder", "tricube"))
refinePeaks(coverage, gcbias, bdwidth, peaks, flank = NULL, permute = 5L, genome = "hg19", gctype = c("ladder", "tricube"))
coverage |
A list object returned by function |
gcbias |
A list object returned by function |
bdwidth |
A non-negative integer vector with two elements
specifying ChIP-seq binding width and peak detection half window size.
Usually generated by function |
peaks |
A GRanges object specifying the peaks to be refined. A flexible set of peaks are preferred to reduce potential false negative, meaning both significant (e.g. p<=0.05) and non-significant (e.g. p>0.05) peaks are preferred to be included. If the total number of peaks is not too big, a reasonable set of peaks include all those with p-value/FDR less than 0.99 by other peak callers. |
flank |
A non-negative integer specifying the flanking width of
ChIP-seq binding. This parameter provides the flexibility that reads
appear in flankings by decreased probabilities as increased distance
from binding region. This paramter helps to define effective GC
content calculation. Default is NULL, which means this paramater will
be calculated from |
permute |
A non-negative integer specifying times of permutation to be performed. Default is 5. When whole large genome is used, such as human genome, 5 times of permutation could be enough. |
genome |
A BSgenome object containing the sequences
of the reference genome that was used to align the reads, or the name of
this reference genome specified in a way that is accepted by the
|
gctype |
A character vector specifying choice of method to calculate
effective GC content. Default |
A GRanges object the same as peaks
with two additional
meta columns:
newes |
Refined enrichment scores. |
newpv |
Refined pvalues. |
bam <- system.file("extdata", "chipseq.bam", package="gcapc") cov <- read5endCoverage(bam) bdw <- bindWidth(cov) gcb <- gcEffects(cov, bdw, sampling = c(0.15,1)) peaks <- gcapcPeaks(cov, gcb, bdw) refinePeaks(cov, gcb, bdw, peaks)
bam <- system.file("extdata", "chipseq.bam", package="gcapc") cov <- read5endCoverage(bam) bdw <- bindWidth(cov) gcb <- gcEffects(cov, bdw, sampling = c(0.15,1)) peaks <- gcapcPeaks(cov, gcb, bdw) refinePeaks(cov, gcb, bdw, peaks)
For a given set of sites with the same/comparable width, their
read count table from multiple samples are adjusted based on
potential GC effects. For each sample separately, GC effects are
estimated based on their effective GC content and
reads count using generalized linear mixture models. Then, count
table is adjusted based on estimated GC effects.
It it important that the given sites includes both foreground and
background regions, see sites
below.
refineSites(counts, sites, flank = 250L, outputidx = rep(TRUE, nrow(counts)), gcrange = c(0.3, 0.8), emtrace = TRUE, plot = TRUE, model = c("nbinom", "poisson"), mu0 = 1, mu1 = 50, theta0 = mu0, theta1 = mu1, p = 0.2, converge = 1e-04, genome = "hg19", gctype = c("ladder", "tricube"))
refineSites(counts, sites, flank = 250L, outputidx = rep(TRUE, nrow(counts)), gcrange = c(0.3, 0.8), emtrace = TRUE, plot = TRUE, model = c("nbinom", "poisson"), mu0 = 1, mu1 = 50, theta0 = mu0, theta1 = mu1, p = 0.2, converge = 1e-04, genome = "hg19", gctype = c("ladder", "tricube"))
counts |
A count matrix with each row corresponding to each element
in |
sites |
A GRanges object with length equivalent to number of rows
in |
flank |
A non-negative integer specifying the flanking width of ChIP-seq binding. This parameter provides the flexibility that reads appear in flankings by decreased probabilities as increased distance from binding region. This paramter helps to define effective GC content calculation. |
outputidx |
A logical vector with the length equivalent to number
of rows in |
gcrange |
A non-negative numeric vector with length 2. This vector sets the range of GC content to filter regions for GC effect estimation. For human, most regions have GC content between 0.3 and 0.8, which is set as the default. Other regions with GC content beyond this range will be ignored. This range is critical when very few foreground regions are selected for mixture model fitting, since outliers could drive the regression lines. Thus, if possible, first make a scatter plot between counts and GC content to decide this parameter. Alternatively, select a narrower range, e.g. c(0.35,0.7), to aviod outlier effects from both high and low GC-content regions. |
emtrace |
A logical vector which, when TRUE (default), allows to print the trace of log likelihood changes in EM iterations. |
plot |
A logical vector which, when TRUE (default), returns miture fitting plot. |
model |
A character specifying the distribution model to be used in
generalized linear model fitting. The default is negative
binomial( |
mu0 |
A non-negative numeric initiating read count signals for background sites. This is treated as the starting value of background mean for poisson/nbinom fitting. |
mu1 |
A non-negative numeric initiating read count signals for foreground sites. This is treated as the starting value of foreground mean for poisson/nbinom fitting. |
theta0 |
A non-negative numeric initiating the shape parameter of
negative binomial model for background sites. For more detail, see
theta in |
theta1 |
A non-negative numeric initiating the shape parameter of
negative binomial model for foreground sites. For more detail, see
theta in |
p |
A non-negative numeric specifying the proportion of foreground sites in all estimated sites. This is treated as a starting value for EM algorithm. |
converge |
A non-negative numeric specifying the condition of EM
algorithm termination. EM algorithm stops when the ratio of log likelihood
increment to whole log likelihood is less or equivalent to
|
genome |
A BSgenome object containing the sequences
of the reference genome that was used to align the reads, or the name of
this reference genome specified in a way that is accepted by the
|
gctype |
A character vector specifying choice of method to calculate
effective GC content. Default |
The count matrix after GC adjustment. The matrix values are not integer any more.