A step-by-step guide in the analysis of ChIP-Seq data using the PICS package in R
Under the Artistic license 2.0, you are free to use and redistribute this software. However, if you use this package for your own publication, we would ask you to cite the following paper:
ChIP-Seq, which combines chromatin immunoprecipitation with massively parallel short-read sequencing, can profile in vivo genome-wide transcription factor-DNA association with higher sensitivity, specificity and spatial resolution than ChIP-chip. While it presents new opportunities for research, ChIP-Seq poses new challenges for statistical analysis that derive from the complexity of the biological systems characterized and the variability and biases in its digital sequence data. We propose a method called PICS (Probabilistic Inference for ChIP-Seq) for extracting information from ChIP-Seq aligned-read data in order to identify regions bound by transcription factors. PICS identifies enriched regions by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model. Its per-event fragment length estimates also allow it to remove from analysis regions that have atypical fragment lengths. PICS uses pre-calculated, whole-genome read mappability profile and a truncated t-distribution to adjust binding event models to compensate for reads that are missing due to local genome repetitiveness. PICS estimates uncertainties in model parameters, and these can be used to define confidence regions on binding event locations and to filter estimates.
A typical PICS analysis consists of the following steps:As with any package, you first need to load it with the following command
The first step of the PICS pipeline consists of converting the
aligned reads (from both IP and control samples) into a format that
allows efficient segmentation of the genome into a set of candidate
regions that have enough Forward and Reverse reads. The data formatting
could be a bed type dataframes as well as
AlignedReads' objects as returned by the function
readAligned’
which can read various file formats including Eland, MAQ, Bowtie, etc.
Please refer to the `ShortRead’ vignette for more details on how to read
data into R, other than from a BED file. In the example listed below, we
use bed type files, which consists of tab delimited files with the
following columns: space, start, end, strand, where each line represent
a single read.
In addition to the IP and Control datasets, we use another argument consisting of a mappability profile for the genome interrogated. For each chromosome, a mappability profile for a specific read length consists of a vector of zeros and ones that gives an estimated read mappability `score’ for each base pair in the chromosome. A score of one at a position means that we should be able to align a read of that length uniquely at that position, while a score of zero indicates that no read of that length should be uniquely alignable at that position. As noted, reads that cannot be mapped to unique genomic locations are typically discarded. For convenience, and because transitions between mappable and non-mappable regions are typically much shorter than these regions, we compactly summarize each chromosome’s mappability profile as a disjoint union of non-mappable intervals that specify only zero-valued profile. We store this information as a bed file where each line correspond a non-mappable intervals. PICS makes use of such mappability profiles to estimate the reads that are missing because they could not be aligned. Mappability profiles for a range of reads lengths and species are available from:
http:/wiki.rglab.org/index.php?title=Public:Mappability_Profile
To make your own mappability profiles, use the script available at:
https://github.com/SRenan/proMap/
Once the data have been read, we can create the `GRanges’ object, which stores the sequences of genome locations and associatedd annotations. Each element of the list contains the IP and control reads as well as the mappability profiles for the corresponding chromosome. When reading the data, we can also sort and remove duplicated reads, which is recommended for a PICS analysis (sort is actually required).
In this documentation, the path for the data is: …/PICS/inst/extdata folder.
path <- system.file("extdata", package = "PICS")
dataIP <- read.table(file.path(path, "Treatment_tags_chr21_sort.bed"), header=TRUE,
colClasses=c("factor", "integer", "integer", "factor"))
dataIP <- as(dataIP, "GRanges")
dataCont <- read.table(file.path(path, "Input_tags_chr21_sort.bed"), header = TRUE,
colClasses = c("factor", "integer", "integer", "factor"))
dataCont <- as(dataCont, "GRanges")
We segment the genome into candidate regions by pre-processing
bidirectional aligned reads data from a single ChIP-Seq experiment to
detect candidate regions with a minimum number of forward and reverse
reads. These candidate regions will then be processed by PICS. The
minReads
parameter will heavily influence the number of
candidate regions returned. In principle, it is better to use a small
value for minReads
in order not to miss any true binding
sites. However, a value too small will likely return many false positive
regions, which can result a longer computing time for the function. If a
control sample is available you may also chose a value of
minReads
that would give you a ratio of the number of
candidate regions in the control over the number of candidate regions in
the IP of about 25%. A ratio larger than 25% would most likely result in
a large FDR when using , and therefore it is not necessary to use too
low of a threshold. See below for an illustration.
After having segmented the genome into candidate regions, we use the
function to detect binding regions in a probabilistic way, and returns
binding site estimates, confidence intervals, etc. In the code below, we
assume that you are processing transcription factor data, as specified
with the dataType
option set to TF
for
transcription factor. We also plan to support histone modification data
in a future release of PICS, which will be turned on by specifying
dataType="H"
.
In order to improve the computational efficiency of the PICS package,
we recommend the utilisation of the package, which allows for easy
parallel computations. In what follows, we assume that is installed on
your machine and you have at least two cores. If not, you could omit the
first line and the argument nCores
, and calculations will
occur on a single CPU. By default the function will use only one
core.
In our case, assuming that we have already segmented the genome using , we can proceed with the following command:
In order to estimate the FDR, we need to rerun our analysis after swapping the IP and control samples. Note that this requires the presence of a control sample. We proceed with the following commands to compute the FDR after removing binding site estimates with noisy parameters as specify by the filter.
segC <- segmentPICS(dataCont, dataC = dataIP, map = map, minReads = 1)
picsC <- PICS(segC)
fdr <- picsFDR(pics, picsC, filter = list(delta = c(50, Inf), se = c(0, 50),
sigmaSqF = c(0, 22500), sigmaSqR = c(0, 22500)))
Then once the FDR has been estimated, one can visualize the FDR as a function of the score to pick a threshold that would lead to the desired FDR.
plot(pics, picsC, xlim = c(2, 8), ylim = c(0, .2), filter = list(delta = c(50,300),
se = c(0, 50), sigmaSqF = c(0, 22500), sigmaSqR = c(0, 22500)), type = "l")
You can also visualize the FDR as a function of the number of regions.
To facilitate data processing and data output, we make use of the
package to summarize our results as GRanges
objects. The
resulting GRanges
objects can then be analyzed with the
package and/or exported to bed/wig files with the package. In each case,
we can make use of filters noisy regions and or regions with too low of
score. In particular the function can be used to set the score threshold
leading to the desired FDR
.
myFilter = list(delta = c(50, 300), se = c(0, 50), sigmaSqF = c(0, 22500),
sigmaSqR = c(0, 22500))
rdBed <- makeGRangesOutput(pics, type="bed", filter = c(myFilter, list(score = c(1, Inf))))
The following command will create the appropriate bed files, and because it requires the package, we do not run it here but simply displayed for your purpose.
The following command will create the appropriate
GRanges
and corresponding wig file, and because it requires
the package, we do not run it here but simply displayed for your
purpose.