enrichR()
calls ChIP-seq enrichment over control for
coordinate-sorted and indexed bamfilesdiffR()
calls differential enrichment between two
conditions, i.e. two ChIP-seq tracksregimeR()
calls k
enrichment regimes in a
ChIP-seq experiment over controlexportR()
writes above results to bed, bedGraph or
bigWig#export enriched regions with FDR<=10% for downstream analysis
exportR(obj = e,
filename = "enriched.bed",
type = "bed",
fdr = 0.1)
#or
#write normalized differential enrichment to bigWig for genome browser display
exportR(obj = de,
filename = "diffEnrichment.bw",
type = "bigWig")
## To cite the normR package in publications, use:
##
## Johannes Helmuth et al. normR: Regime enrichment calling for ChIP-seq
## data bioRxiv 082263; doi: https://doi.org/10.1101/082263
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## author = {Johannes Helmuth and Na Li and Laura Arrigoni and Kathrin Gianmoena and Cristina Cadenas and Gilles Gasparoni and Anupam Sinha and Philip Rosenstiel and Joern Walter and Jan G. Hengstler and Thomas Manke and Ho-Ryun Chung},
## title = {normR: Regime enrichment calling for ChIP-seq data},
## year = {2016},
## doi = {10.1101/082263},
## journal = {bioRxiv},
## publisher = {Cold Spring Harbor Laboratory},
## url = {https://doi.org/10.1101/082263},
## }
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) provides genome-wide localization data for DNA-associated proteins. To infer the regions bound by such proteins the read densities obtained by such a ChIP-seq experiment are compared to the corresponding read profile obtained by a control experiment. A meaningful comparison requires normalization to mitigate the effects of technical biases, e.g. different sequencing depth. But more importantly the effect of the enrichment of certain regions on the overall read statistics. Normalization requires knowledge of the regions that remained unchanged, such that normalization and difference calling are inseparable.
This package, normR (normR
obeys regime mixture
Rules), follows this logic and performs normalization
and difference calling simultaneously to identify genomic regions
enriched by the ChIP-procedure (enrichR()
). In addition,
normR enables the comparison between ChIP-seq data obtained from
different conditions allowing for unraveling genomic regions that change
their association with the ChIP-target (diffR()
). Lastly,
normR is capable to differentiate multiple regimes of enrichment,
i.e. broad domains and sharp peaks (regimeR()
). In
brief, all these routines encompass three steps:
This vignette explains a common workflow of normR analysis on NGS data for calling enrichment, identification of differential ChIP-seq enrichment and stratification of enrichment regimes. Herein, we provide examples for the export of results to common data formats like bigWig and bed. We show how analysis statistics and diagnostic plots are helpful for studying results. In a latter section, we cover more advanced topics including the alteration of read counting strategies, the application of normR on user-defined regions (e.g. promoters) and the integration of Copy Number Variation information in differential ChIP-seq enrichment calls.
enrichR()
: Calling Enrichment with an Input
ControlIn this first section, we would like to call regions significantly enriched for reads in the ChIP-seq experiment given a Control alignment. Here, we analyze ChIP-seq data for both H3K4me3 (pointy enrichment) and H3K36me3 (broad enrichment) given an Input-seq control alignment originating from a human immortalized myelogenous leukemia line (K562). Using normR, we show that our representative region on human chromsome 1 (chr1:22500000-25000000) is enriched for H3K4me3 mostly at promoters and precludes H3K36me3 enrichment which is overrepresented in gene bodies.
IGV browser shot of Input (grey), H3K4me3 (green) and H3K36me3 (purple) alignment data on chr1 22.5Mb to 25Mb with genes (black) drawn.
As part of the normR package, we provide 3 alignment files in bam
format (Input, H3K4me3 and H3K36me3 ChIP-seq) containing reads for human
chr1 22.5Mb to 25Mb. Note, bam files used in normR need to be sorted by
read coordinates (samtools sort x.bam x_sorted
) and indexed
(samtools index x_sorted.bam
). Our example files already
fullfil this criteria.
Firstly, we retrieve filepaths for toy data:
## Warning: multiple methods tables found for 'union'
## Warning: multiple methods tables found for 'intersect'
## Warning: multiple methods tables found for 'setdiff'
## Warning: multiple methods tables found for 'setequal'
## Warning: multiple methods tables found for 'union'
## Warning: multiple methods tables found for 'intersect'
## Warning: multiple methods tables found for 'setdiff'
## Warning: multiple methods tables found for 'intersect'
## Warning: multiple methods tables found for 'union'
## Warning: multiple methods tables found for 'intersect'
## Warning: multiple methods tables found for 'setdiff'
## Warning: multiple methods tables found for 'union'
## Warning: multiple methods tables found for 'intersect'
## Warning: multiple methods tables found for 'setdiff'
## Warning: multiple methods tables found for 'setequal'
## Warning: replacing previous import 'S4Arrays::read_block' by
## 'DelayedArray::read_block' when loading 'SummarizedExperiment'
inputBamfile <- system.file("extdata", "K562_Input.bam", package="normr")
k4me3Bamfile <- system.file("extdata", "K562_H3K4me3.bam", package="normr")
k36me3Bamfile <- system.file("extdata", "K562_H3K36me3.bam", package="normr")
Secondly and lastly, we need to specify the genome of the
alignment. The genome
argument can be a character
specifying a UCSC genome identifier, e.g. “hg19”, or we can
provide a 2-dimensional data.frame
with columns seqnames
and length. We will follow the later option: You can generate a genome
data.frame
yourself or can use GenomeInfoDb
for retrieving the data.frame from UCSC for given genome identifier:
#Fetch chromosome information
genome <- GenomeInfoDb::getChromInfoFromUCSC("hg19", assembled.molecules.only = TRUE)
#Delete unnecessary columns
genome <- genome[,1:2]
genome
## chrom size
## 1 chr1 249250621
## 2 chr2 243199373
## 3 chr3 198022430
## 4 chr4 191154276
## 5 chr5 180915260
## 6 chr6 171115067
## 7 chr7 159138663
## 8 chr8 146364022
## 9 chr9 141213431
## 10 chr10 135534747
## 11 chr11 135006516
## 12 chr12 133851895
## 13 chr13 115169878
## 14 chr14 107349540
## 15 chr15 102531392
## 16 chr16 90354753
## 17 chr17 81195210
## 18 chr18 78077248
## 19 chr19 59128983
## 20 chr20 63025520
## 21 chr21 48129895
## 22 chr22 51304566
## 23 chrX 155270560
## 24 chrY 59373566
## 25 chrM 16571
## 26 chrMT 16569
## chrom size
## 1 chr1 249250621
Now, we are all set to do a enrichment call with default parameters:
#Enrichment Calling for H3K4me3 and H3K36me3
k4me3Fit <- enrichR(treatment = k4me3Bamfile, control = inputBamfile,
genome = genome, verbose = FALSE)
k36me3Fit <- enrichR(treatment = k36me3Bamfile, control = inputBamfile,
genome = genome, verbose = FALSE)
That was easy and fast! You must know that all results are stored as
NormRFit-class
objects. They provide convenient access to
count data and fitting results. For example, let’s have a look at some
simple fitting statistics for H3K4me3:
## NormRFit-class object
##
## Type: enrichR
## Number of Regions: 997003
## Theta* (naive bg): 0.3928
## Background component B: 1
##
## +++ Results of fit +++
## Mixture Proportions:
## Background Class 1
## 94.997% 5.003%
## Theta:
## Background Class 1
## 0.09148 0.92761
The “Type” of the NormRFit
object is defined by the
function generating it, i.e. enrichR()
,
diffR()
or regimeR()
. The “Number of Regions”
is the number of 250bp bins along the specified genome (default
binsize). The “Number of Components” is 2 (background and enriched) in
the case of enrichR()
. The parameter θ* (“Theta* (naive bg)”)
describes a naive background parametrization if the effect of enrichment
is not taken into account. The actual θB is with ~0.09
much smaller than θ* which allows for more
sensitive enrichment calling. Furthermore, by looking at the “Mixture
Proportions” we find H3K4me3 is enriched in ~5% of the windows. For
H3K36me3, we find ~23% of the regions enriched.
## NormRFit-class object
##
## Type: enrichR
## Number of Regions: 997003
## Theta* (naive bg): 0.5143
## Background component B: 1
##
## +++ Results of fit +++
## Mixture Proportions:
## Background Class 1
## 76.75% 23.25%
## Theta:
## Background Class 1
## 0.1131 0.8383
We can use some methods provided by the NormRFit-class
to get a grasp on the quality of our normR call, e.g. print a
more concise summary that gives the number of significant regions under
different False Discovery Rates (FDR).
## NormRFit-class object
##
## Type: 'enrichR'
## Number of Regions: 997003
## Number of Components: 2
## Theta* (naive bg): 0.393
## Background component B: 1
##
## +++ Results of fit +++
## Mixture Proportions:
## Background Class 1
## 95% 5%
## Theta:
## Background Class 1
## 0.0915 0.9276
##
## Bayesian Information Criterion: 73401
##
## +++ Results of binomial test +++
## T-Filter threshold: 4
## Number of Regions filtered out: 988560
## Significantly different from background B based on q-values:
## TOTAL:
## *** ** * . n.s.
## Bins 24 433 67 63 71 7785
## % 0.239 4.554 5.222 5.850 6.557 77.578
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 'n.s.'
Note, summary()
prints an additional section containing
information on the statistical testing. The “T-Filter threshold” filters
out regions that are not considered for multiple testing correction due
to low power. The sum of counts in treatment and control is less than
this quantity (Dialsingh et al., Bioinformatics, 2015, 1–7).
The “Number of Regions filtered out” is very large in our example
because the toy alignment files are filtered for reads within chr1
22.5Mb to 25Mb. However, the specified genome covers
chr1:0-249250621
which results in alot of zero counts. This
does not influence the fit but for testing the regions are filtered.
Based on computed q-vlaues, H3K4me3 shows 587 (24+433+67+63) regions
significantly enriched for FDR ≤ 0.05. The
summary for H3K36me3 enrichment calls confirms 2,378 (0+1951+212+215)
regions significantly enriched for FDR ≤ 0.05:
## NormRFit-class object
##
## Type: 'enrichR'
## Number of Regions: 997003
## Number of Components: 2
## Theta* (naive bg): 0.514
## Background component B: 1
##
## +++ Results of fit +++
## Mixture Proportions:
## Background Class 1
## 76.7% 23.3%
## Theta:
## Background Class 1
## 0.113 0.838
##
## Bayesian Information Criterion: 131166
##
## +++ Results of binomial test +++
## T-Filter threshold: 4
## Number of Regions filtered out: 988119
## Significantly different from background B based on q-values:
## TOTAL:
## *** ** * . n.s.
## Bins 0 1951 212 215 121 6385
## % 0.0 12.7 14.1 15.5 16.3 41.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 'n.s.'
Based on the fitted background, normR can calculate a standardized, regularized enrichment for further processing:
#background normalized enrichment
k4me3Enr <- getEnrichment(k4me3Fit)
#restrict to regions with non-zero counts
idx <- which(rowSums(do.call(cbind, getCounts(k4me3Fit))) != 0)
summary(k4me3Enr[idx])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2590 -0.9680 -0.7688 -0.4050 0.1409 2.0151
If we compare H3K4me3 and H3K36me3 enrichment in non-zero regions, we can see some mutual exclusivity of both marks represented by off-diagonal elements):
x <- k4me3Enr[idx]
y <- getEnrichment(k36me3Fit)[idx]
d.x <- density(x); d.y <- density(y)
limits <- range(x,y)
layout( matrix( c(0,2,2,1,3,3,1,3,3),ncol=3) )
plot(d.x$x, d.x$y, xlim=limits, type='l',
main="H3K36me3 normalized Enrichment", xlab="", ylab="Density")
abline(v=0, lty=3, lwd=2, col=4)
plot(d.y$y, d.y$x, ylim=limits, xlim=rev(range(d.y$y)), type='l',
main="H3K4me3 normalized Enrichment", xlab="Density", ylab="")
abline(h=0, lty=3, lwd=2, col=4)
color <- rep("grey10", length(idx))
plot(x, y, xlim=limits, ylim=limits, pch=20, xlab="", ylab="",
col=adjustcolor(color, alpha.f=.2))
abline(0, 1, lty=2, lwd=3, col=2)
abline(v=0, lty=3, lwd=2, col=4)
abline(h=0, lty=3, lwd=2, col=4)
To analyze mutually exclusivity of H3K4me3 and H3K36me3, we would
like to recover the regions signficantly enriched in
k4me3Fit
and k36me3Fit
and color these regions
in the scatter plot above.
#integer vector with <NA> set to non-significant regions
k4me3Classes <- getClasses(k4me3Fit, fdr = 0.05)
k36me3Classes <- getClasses(k36me3Fit, fdr = 0.05)
#Color scatter plot based on enrichment
color[!is.na(k4me3Classes[idx])] <- "#2C9500"
color[!is.na(k36me3Classes[idx])] <- "#990099"
color[!is.na(k4me3Classes+k36me3Classes)[idx]] <- "#971621"
plot(x, y, xlim=limits, ylim=limits, pch=20,
col=adjustcolor(color, alpha.f=.5), xlab="H3K4me3 normalized Enrichment",
ylab="H3K36me3 normalized Enrichment")
legend("topright", pch=20, col=unique(color), cex=.6, bg="white",
legend=c("Background", "H3K36me3 enriched", "H3K4me3 enriched",
"H3K4me3/K36me3 enriched")
)
Processing of regions within R can be facilitated by getting
significantly enriched (FDR ≤ 0.05)
regions as a GRanges
object:
k4me3Ranges <- getRanges(k4me3Fit)[!is.na(k4me3Classes)]
#Alternatively you can extract ranges without storing the class vector
k4me3Ranges <- getRanges(k4me3Fit, fdr = 0.05)
#as expected we get 587 regions
length(k4me3Ranges)
## [1] 587
As a representative analysis, we would like check for overrepresentation of enriched regions with genes and promoters by using Fisher’s exact test. Let’s start with H3K4me3:
#example gene annotation for representative region (chr1:22500000-25000000)
genes <- read.delim(file = system.file("extdata", "genes.bed",package="normr"),
header = FALSE, stringsAsFactors = FALSE)
library("GenomicRanges")
genes <- GRanges(seqnames = genes[, 1],
ranges = IRanges(start = genes[, 2], end = genes[, 3]),
strand = genes[, 6],
ENSTID = genes[, 4])
genes <- unique(genes)
#Fisher-test provides significance of overlap
#(total specifies number of bins in representative region)
overlapOdds <- function(query, subject, total = 10000) {
subject <- reduce(subject, ignore.strand = TRUE)
ov1 <- countOverlaps(query, subject)
m <- matrix(c(sum(ov1 != 0), sum(ov1 == 0),
ceiling(sum(width(subject))/width(query)[1]-sum(ov1 != 0)), 0),
ncol = 2)
m[2,2] <- total - sum(m)
fisher.test(m, alternative="greater")
}
#Overlap of H3K4me3 with genes
overlapOdds(k4me3Ranges, genes)
##
## Fisher's Exact Test for Count Data
##
## data: m
## p-value = 1.1e-09
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 1.471389 Inf
## sample estimates:
## odds ratio
## 1.718336
#Overlap of H3K4me3 with promoters
promoters <- promoters(genes, upstream = 2000, downstream = 2000)
overlapOdds(k4me3Ranges, promoters)
##
## Fisher's Exact Test for Count Data
##
## data: m
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 6.665838 Inf
## sample estimates:
## odds ratio
## 7.738937
It is known that promoters are marked by H3K4me3 if their gene’s expression is initiated. Our analysis above shows that H3K4me3-enriched regions are indeed significantly overrepresented within genes (Fisher’s signed-exact test; P-value<0.001; odds ratio = 1.72) and, more pronounced, in promoter regions (odds ratio = 7.74).
By comparing H3K36me3 and H3K4me3 ranges, we can identify significant overlap of H3K36me3 and H3K4me3 (odds ratio = 1.53) that is most pronounced in promoter regions (odds ratio = 9.68) than compared to gene bodies (odds ratio = 2.85).
#Overlap of H3K36me3 with H3K4me3
k36me3Ranges <- getRanges(k36me3Fit, fdr = 0.05)
overlapOdds(k36me3Ranges, k4me3Ranges)
##
## Fisher's Exact Test for Count Data
##
## data: m
## p-value = 4.173e-06
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 1.306727 Inf
## sample estimates:
## odds ratio
## 1.527935
#Overlap of H3K36me3 with H3K4me3 at promoter regions
overlapOdds(k36me3Ranges[countOverlaps(k36me3Ranges, promoters) != 0],
k4me3Ranges[countOverlaps(k4me3Ranges, promoters) != 0])
##
## Fisher's Exact Test for Count Data
##
## data: m
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 7.920713 Inf
## sample estimates:
## odds ratio
## 9.676087
#Overlap of H3K36me3 with H3K4me3 in genes
overlapOdds(k36me3Ranges[countOverlaps(k36me3Ranges, genes) != 0],
k4me3Ranges[countOverlaps(k4me3Ranges, genes) != 0])
##
## Fisher's Exact Test for Count Data
##
## data: m
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 2.390976 Inf
## sample estimates:
## odds ratio
## 2.85357
H3K36me3 is associated to transcriptional elongation in the gene body. The presence of H3K36me3 within the gene body marks transcribed genes. Indeed, H3K36me3 enrichment is significantly overrepresented mostly at genes (odds ratio = 5.52) and, to a lower extend, at promoters (odds ratio = 2.68).
##
## Fisher's Exact Test for Count Data
##
## data: m
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 4.986738 Inf
## sample estimates:
## odds ratio
## 5.522417
##
## Fisher's Exact Test for Count Data
##
## data: m
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 2.426846 Inf
## sample estimates:
## odds ratio
## 2.680547
While there exist a plethora of analysis options of normR results
within R, exportR()
provides functionality to write results
to a file. To export coordinates of enriched regions, the widely used BED format
is applicable. It is human-readable and can be imported in common genome
browsers, e.g. UCSC genome browser
or IGV. To export the
background-normalized enrichment, the binary bigWig
format is used. Check ?exportR
for more options.
#export coordinates of significantly (FDR <= 0.05) enriched regions
exportR(k4me3Fit, filename = "k4me3Fit.bed", type = "bed", fdr = 0.05)
exportR(k36me3Fit, filename = "k36me3Fit.bed", type = "bed", fdr = 0.05)
#export background-normalized enrichment
exportR(k4me3Fit, filename = "k4me3Fit.bw", type = "bigWig")
exportR(k36me3Fit, filename = "k36me3Fit.bw", type = "bigWig")
IGV browser shot of Input (grey), H3K4me3 (green) and H3K36me3 (purple) alignment data (bars), normalized enrichment, i.e. “bigWig” files, (lines) and enriched regions, i.e. “bed” files (boxes below respective tracks).
diffR()
: Calling Differential Enrichment without a
Control ExperimentNormalization and difference calling are inseparable in calling
ChIP-seq enrichment. Following this notion, a direct comparison of two
ChIP-seq tracks can be performed with diffR()
. In many
studies, researchers are interested in conditional changes in ChIP-seq
enrichment. Below, we exemplify this analysis by joint analysis of
H3K4me3 and H3K36me3 ChIP-seq data. Because we already counted
k4me3Bamfile
and k36me3Bamfile
already in
k4me3Fit
and k36me3Fit
, respectively, we can
use these counts directly. Note that, in this case, the
genome
has be set to a GenomicRanges
object
specifying the genomic regions. We can extract this from either one of
the NormRFit
objects.
#We could use read counts from above NormRFit objects
k4k36Dif <- diffR(treatment = getCounts(k4me3Fit)$treatment,
control = getCounts(k36me3Fit)$treatment,
genome = getRanges(k4me3Fit),
verbose = FALSE)
#<or> (unnecessarily) count again
#k4k36Dif <- diffR(treatment = k4me3Bamfile, control = k36me3Bamfile,
# genome = genome, verbose = FALSE)
#summary statistics
summary(k4k36Dif)
## NormRFit-class object
##
## Type: 'diffR'
## Number of Regions: 997003
## Number of Components: 3
## Theta* (naive bg): 0.379
## Background component B: 2
##
## +++ Results of fit +++
## Mixture Proportions:
## Class 1 Background Class 2
## 49.8% 29.5% 20.7%
## Theta:
## Class 1 Background Class 2
## 0.0183 0.4798 0.9726
##
## Bayesian Information Criterion: 48069
##
## +++ Results of binomial test +++
## T-Filter threshold: 6
## Number of Regions filtered out: 994238
## Significantly different from background B based on q-values:
## TOTAL:
## *** ** * . n.s.
## Bins 0 1896 433 189 111 136
## % 0.00 19.94 24.50 26.48 27.65 1.43
## Class 1:
## *** ** * . n.s.
## Bins 0 1567 342 130 65 661
## % 0.00 56.67 12.37 4.70 2.35 23.91
## Class 2:
## *** ** * . n.s.
## Bins 0 329 91 59 46 2240
## % 0.00 11.90 3.29 2.13 1.66 81.01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 'n.s.'
The “Type” of the object has changed to ‘diffR’ because it was
generated by this function. The “Number of Regions” did not change
because we use the same binning strategy. However, the “Number of
Components” is now 3 representing (i) H3K36me3 enrichment without
H3K4me3-enrichment, (ii) H3K36me3 and H3K4me3 (non)-enriched and (iii)
H3K4me3 enrichment without H3K36me3-enrichment. The “Backgroundcomponent
B” is 2 in this case: diffR()
identifies significant
enrichment and depletion by a two-sided test on the background. Looking
at the “Mixture Proportions”, regions are classified to (i) in ~49.8%,
(ii) in ~29.5% and (iii) in ~20.7% of the regions. 2,629 regions are
significantly different from background with FDR ≤ 0.05. These
regions are either H3K4me3-positive or H3K36me3 positive. We can export
these regions and the normalized enrichment with
exportR
:
exportR(k4k36Dif, filename = "k4k36Dif.bed", type = "bed", fdr = 0.05)
exportR(k4k36Dif, filename = "k4k36Dif.bw", type = "bigWig")
IGV browser shot of Input (grey), H3K4me3 (green) and H3K36me3 (purple) alignment data. Background normalized difference is plotted as a heatmap, i.e. “bigWig” file, and differential regions are plotted as boxes, i.e. “bed” file (blue: treatment (H3K4me3) enriched, red: control (H3K36me3) enriched).
regimeR()
: Identify Enrichment Regimes in ChIP-seq
ExperimentsThe two sections above aimed at discerning enrichment from
background. The extendable normR approach also allows for identification
of different enrichment regimes with regimeR()
by
increasing the number of model components.
Let’s start with 3 components (Background + 2 Enrichment Regimes) for H3K4me3. By using two enrichment regimes, we may uncover effects of sample heterogeneity affecting transcriptional initiation of certian genes.
k4me3Regimes <- regimeR(treatment = getCounts(k4me3Fit)$treatment,
control = getCounts(k4me3Fit)$control,
genome = getRanges(k4me3Fit),
models = 3,
verbose = FALSE)
summary(k4me3Regimes)
## NormRFit-class object
##
## Type: 'regimeR'
## Number of Regions: 997003
## Number of Components: 3
## Theta* (naive bg): 0.393
## Background component B: 1
##
## +++ Results of fit +++
## Mixture Proportions:
## Background Class 1 Class 2
## 89.58% 7.64% 2.78%
## Theta:
## Background Class 1 Class 2
## 0.0691 0.5496 0.9547
##
## Bayesian Information Criterion: 69872
##
## +++ Results of binomial test +++
## T-Filter threshold: 4
## Number of Regions filtered out: 988560
## Significantly different from background B based on q-values:
## TOTAL:
## *** ** * . n.s.
## Bins 41 448 80 123 113 7638
## % 0.401 4.778 5.560 6.762 7.866 74.634
## Class 1:
## *** ** * . n.s.
## Bins 0 245 80 123 113 7882
## % 0.000 2.902 0.948 1.457 1.338 93.355
## Class 2:
## *** ** * . n.s.
## Bins 41 203 0 0 0 8199
## % 0.486 2.404 0.000 0.000 0.000 97.110
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 'n.s.'
~10.5% of the regions show enrichment which gets segmented into ~7.7% low and ~2.8% high enrichment. 692 regions are significant (FDR ≤ 0.05).
Now, we would like to use two enrichment regimes for H3K36me3. In this way, we might be able to classify genes of low and high transcriptional rates:
k36me3Regimes <- regimeR(treatment = getCounts(k36me3Fit)$treatment,
control = getCounts(k36me3Fit)$control,
genome = getRanges(k36me3Fit),
models = 3,
verbose = FALSE)
summary(k36me3Regimes)
## NormRFit-class object
##
## Type: 'regimeR'
## Number of Regions: 997003
## Number of Components: 3
## Theta* (naive bg): 0.514
## Background component B: 1
##
## +++ Results of fit +++
## Mixture Proportions:
## Background Class 1 Class 2
## 69.2% 15.9% 14.9%
## Theta:
## Background Class 1 Class 2
## 0.0776 0.5443 0.8858
##
## Bayesian Information Criterion: 126538
##
## +++ Results of binomial test +++
## T-Filter threshold: 4
## Number of Regions filtered out: 988119
## Significantly different from background B based on q-values:
## TOTAL:
## *** ** * . n.s.
## Bins 0 2130 211 237 157 6149
## % 0.0 13.4 14.7 16.2 17.2 38.6
## Class 1:
## *** ** * . n.s.
## Bins 0 780 202 237 157 7508
## % 0.00 8.78 2.27 2.67 1.77 84.51
## Class 2:
## *** ** * . n.s.
## Bins 0 1350 9 0 0 7525
## % 0.000 15.196 0.101 0.000 0.000 84.703
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 'n.s.'
We can now export the called regimes as bed files for browser display. Each track has two enrichment regimes which are shaded for their degree of significance. The export is done analogously to the cases described above:
exportR(k4me3Regimes, filename = "k4me3Regimes.bed", type = "bed", fdr=0.05)
exportR(k36me3Regimes, filename = "k36me3Regimes.bed", type = "bed", fdr=0.05)
IGV browser shot of Input (grey), H3K4me3 (green) and H3K36me3 (purple) alignment data. Regime calls are plotted as boxes below respective tracks (Yellow = low enrichment, Pink = high enrichment)
In addition to the information covered above, it is recommend to have
a look at help pages (?
) of normR functions. Here, we would
like to discuss three important points:
NormRCountConfig-class
It is very important how we count reads contained in the bamfile. The
NormRCountConfig-class
provides methods to define the
counting strategy on single-end and paired-end alignment data:
#Single End:
# Count in 500bp bins.
# Consider only reads with Mapping Quality >= 20.
# Filter reads for marked duplicates (e.g. with picard mark-duplicates)
# Shift the counting position for a read 100 bp downstream.
countConfigSE <- countConfigSingleEnd(binsize = 500, mapq = 20,
filteredFlag = 1024, shift = 100)
#Paired End:
# Count in 500bp bins.
# Consider only reads with Mapping Quality >= 30.
# Count the midpoint of the aligned fragment instead of 5' ends.
# Consider only reads corresponding to fragments with size from 100 to 300bp
countConfigPE <- countConfigPairedEnd(binsize = 500, mapq = 30, midpoint=TRUE,
tlenFilter = c(100, 300))
#Plug in the counting configuration into normR, e.g. in enrichR()
fit <- enrichR(treatment = k4me3Bamfile,
control = inputBamfile,
genome = genome,
countConfig = countConfigPE)
You could do a fit on a set of pre-defined regions like promoters or
known transcription factor binding sites. You need to count beforehand
with bamsignals
. Note, for the fit to work correctly these
regions should be of same size.
promoters <- promoters(genes, 1500, 1500)
#regions have identical size?
all(width(promoters) == 3000)
## [1] TRUE
#Fit only on promoters
promotersFit <- enrichR(treatment = k4me3Bamfile, control = inputBamfile,
genome = promoters, verbose = FALSE)
summary(promotersFit)
## NormRFit-class object
##
## Type: 'enrichR'
## Number of Regions: 265
## Number of Components: 2
## Theta* (naive bg): 0.891
## Background component B: 1
##
## +++ Results of fit +++
## Mixture Proportions:
## Background Class 1
## 51.7% 48.3%
## Theta:
## Background Class 1
## 0.125 0.943
##
## Bayesian Information Criterion: 98584
##
## +++ Results of binomial test +++
## T-Filter threshold: 4
## Number of Regions filtered out: 3
## Significantly different from background B based on q-values:
## TOTAL:
## *** ** * . n.s.
## Bins 101 36 1 4 3 117
## % 12.9 17.6 17.7 18.2 18.6 15.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 'n.s.'
Copy Number Variations (CNVs) are an important feature of cancerous
cells like tumour samples. The difference calling on two ChIP-seq
experiments with diffR()
is sensitive to CNVs if the
underlying sequence is amplified in the genome. However, you can harness
diffR()
’s functionality to call differences in two Input
tracks to detect CNVs in a treatment respective to a control. To allow
for coarse-grained detection of difference in Input, a sufficiently
large binsize has to be used, e.g. 22kb.
cnvs <- diffR(treatment = treatmentInputBamfile,
control = controlInputBamfile,
genome = genome,
countConfig = countConfigSingleEnd(binsize = 2.5e4))
#export the CNV calls
exportR(cnvs, "CNVs.bed")
#Filter previous ChIP-seq difference calls for CNVs
ov <- countOverlaps(getRanges(diffFit, fdr = .05), getRanges(cnvs, fdr = .05))
idx <- which(ov == 0)
cnvCleanedGR <- getRanges(diffFit, fdr = .05)[idx]