ChIPseeker is an R package for annotating ChIP-seq data analysis. It supports annotating ChIP peaks and provides functions to visualize ChIP peaks coverage over chromosomes and profiles of peaks binding to TSS regions. Comparison of ChIP peak profiles and annotation are also supported. Moreover, it supports evaluating significant overlap among ChIP-seq datasets. Currently, ChIPseeker contains 17,000 bed file information from GEO database. These datasets can be downloaded and compare with user’s own data to explore significant overlap datasets for inferring co-regulation or transcription factor complex for further investigation.
If you use ChIPseeker(Yu, Wang, and He 2015) in published research, please cite:
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) has become standard technologies for genome wide identification of DNA-binding protein target sites. After read mappings and peak callings, the peak should be annotated to answer the biological questions. Annotation also create the possibility of integrating expression profile data to predict gene expression regulation. ChIPseeker(Yu, Wang, and He 2015) was developed for annotating nearest genes and genomic features to peaks.
ChIP peak data set comparison is also very important. We can use it as an index to estimate how well biological replications are. Even more important is applying to infer cooperative regulation. If two ChIP seq data, obtained by two different binding proteins, overlap significantly, these two proteins may form a complex or have interaction in regulation chromosome remodelling or gene expression. ChIPseeker(Yu, Wang, and He 2015) support statistical testing of significant overlap among ChIP seq data sets, and incorporate open access database GEO for users to compare their own dataset to those deposited in database. Protein interaction hypothesis can be generated by mining data deposited in database. Converting genome coordinations from one genome version to another is also supported, making this comparison available for different genome version and different species.
Several visualization functions are implemented to visualize the coverage of the ChIP seq data, peak annotation, average profile and heatmap of peaks binding to TSS region.
Functional enrichment analysis of the peaks can be performed by my Bioconductor packages DOSE(Yu et al. 2015), ReactomePA(Yu and He 2016), clusterProfiler(Yu et al. 2012).
The datasets CBX6 and CBX7 in this vignettes were
downloaded from GEO (GSE40740)(Pemberton
et al. 2014) while ARmo_0M, ARmo_1nM and
ARmo_100nM were downloaded from GEO (GSE48308)(Urbanucci et al. 2012) . ChIPseeker
provides readPeakFile
to load the peak and store in
GRanges
object.
## $ARmo_0M
## [1] "/tmp/Rtmp6hfjvL/Rinst23ad23fd810d/ChIPseeker/extdata/GEO_sample_data/GSM1174480_ARmo_0M_peaks.bed.gz"
##
## $ARmo_1nM
## [1] "/tmp/Rtmp6hfjvL/Rinst23ad23fd810d/ChIPseeker/extdata/GEO_sample_data/GSM1174481_ARmo_1nM_peaks.bed.gz"
##
## $ARmo_100nM
## [1] "/tmp/Rtmp6hfjvL/Rinst23ad23fd810d/ChIPseeker/extdata/GEO_sample_data/GSM1174482_ARmo_100nM_peaks.bed.gz"
##
## $CBX6_BF
## [1] "/tmp/Rtmp6hfjvL/Rinst23ad23fd810d/ChIPseeker/extdata/GEO_sample_data/GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz"
##
## $CBX7_BF
## [1] "/tmp/Rtmp6hfjvL/Rinst23ad23fd810d/ChIPseeker/extdata/GEO_sample_data/GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz"
## GRanges object with 1331 ranges and 2 metadata columns:
## seqnames ranges strand | V4 V5
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 815093-817883 * | MACS_peak_1 295.76
## [2] chr1 1243288-1244338 * | MACS_peak_2 63.19
## [3] chr1 2979977-2981228 * | MACS_peak_3 100.16
## [4] chr1 3566182-3567876 * | MACS_peak_4 558.89
## [5] chr1 3816546-3818111 * | MACS_peak_5 57.57
## ... ... ... ... . ... ...
## [1327] chrX 135244783-135245821 * | MACS_peak_1327 55.54
## [1328] chrX 139171964-139173506 * | MACS_peak_1328 270.19
## [1329] chrX 139583954-139586126 * | MACS_peak_1329 918.73
## [1330] chrX 139592002-139593238 * | MACS_peak_1330 210.88
## [1331] chrY 13845134-13845777 * | MACS_peak_1331 58.39
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
After peak calling, we would like to know the peak locations over the
whole genome, covplot
function calculates the coverage of
peak regions over chromosomes and generate a figure to visualize. GRangesList
is also supported and can be used to compare coverage of multiple bed
files.
When peak
is a GRangsList
object, user can
set the colors directly or by passing a palette to
fill_color
.
First of all, for calculating the profile of ChIP peaks binding to TSS regions, we should prepare the TSS regions, which are defined as the flanking sequence of the TSS sites. Then align the peaks that are mapping to these regions, and generate the tagMatrix.
## promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
## tagMatrix <- getTagMatrix(peak, windows=promoter)
##
## to speed up the compilation of this vignettes, we use a precalculated tagMatrix
data("tagMatrixList")
tagMatrix <- tagMatrixList[[4]]
In the above code, you should notice that tagMatrix is not restricted
to TSS regions. The regions can be other types that defined by the user.
ChIPseeker
expanded the scope of region. Users can input the type
and
by
parameters to get the regions they want.
ChIPseeker provide a one step function to generate this figure from bed file. The following function will generate the same figure as above.
Users can use nbin
parameter to speed up.
Users can also use ggplot method to change the details of the figures.
peakHeatmap(files[[4]],TxDb = txdb,nbin = 800,upstream=3000, downstream=3000) +
scale_fill_distiller(palette = "RdYlGn")
Users can also profile genebody regions with
peakHeatmap()
.
peakHeatmap(peak = files[[4]],
TxDb = txdb,
upstream = rel(0.2),
downstream = rel(0.2),
by = "gene",
type = "body",
nbin = 800)
Sometimes there will be a need to explore the comparison of the peak
heatmap over two regions, for example, the following picture is the peak
over two gene sets. One possible scenery of using this method is to
compare the peak heatmap over up-regulating genes and down-regulating
genes. Here txdb1
and txdb2
is the simulated
gene sets obtain from TxDb.Hsapiens.UCSC.hg19.knownGene
.
Using peakHeatmap_multiple_Sets()
, accepting
list
object containing different regions information. The
length of each part is correlated to the amount of regions.
txdb1 <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb2 <- unlist(fiveUTRsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene))[1:10000,]
region_list <- list(geneX = txdb1, geneY = txdb2)
peakHeatmap_multiple_Sets(peak = files[[4]],
upstream = 1000,downstream = 1000,
by = c("geneX","geneY"),
type = "start_site",
TxDb = region_list,nbin = 800)
We also meet the need of ploting heatmap and peak profiling together.
peak_Profile_Heatmap(peak = files[[4]],
upstream = 1000,
downstream = 1000,
by = "gene",
type = "start_site",
TxDb = txdb,
nbin = 800)
Exploring several regions with heatmap and peak profiling is also supported.
txdb1 <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb2 <- unlist(fiveUTRsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene))[1:10000,]
region_list <- list(geneX = txdb1, geneY = txdb2)
peak_Profile_Heatmap(peak = files[[4]],
upstream = 1000,
downstream = 1000,
by = c("geneX","geneY"),
type = "start_site",
TxDb = region_list,nbin = 800)
plotAvgProf(tagMatrix, xlim=c(-3000, 3000),
xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")
## >> plotting figure... 2024-10-30 04:43:44 AM
The function plotAvgProf2
provide a one step from bed
file to average profile plot. The following command will generate the
same figure as shown above.
plotAvgProf2(files[[4]], TxDb=txdb, upstream=3000, downstream=3000,
xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")
Confidence interval estimated by bootstrap method is also supported for characterizing ChIP binding profiles.
Referring to the issue
#16 , we developed and improved several functions support start site
region, end site region and body region of
Gene/Transcript/Exon/Intron/3UTR/5UTR. getBioRegion
can
prepare the different regions for ChIP peaks to bind.
getTagMatrix
can accept type
, by
,
upstream
and downstream
parameters to get
tagmatrix according to different needs. plotPeakProf
and
plotPeakProf2
supports the plotting of profiles of peaks
binding to different regions.Users can also create heatmap or average
profile of ChIP peaks binding to these regions.
In order to plot body regions, a new methond binning
,was
introduced to getTagMatrix
. The idea of
binning
was derived from deeptools(Ramı́rez et al. 2016). binning
scaled the regions having different lengths to the equal length by
deviding the regions into the same amounts of boxes. Because the amount
of boxes is equal, the regions can be thought of scaling to equal
length.binning
method can speed up the
getTagMatrix
by changing the precision from bp to
box(several bps).
There are three ways to plot these regions. First, users can use
getBioRegion
to prepare the regions. Then align the peaks
that are mapping to these regions, and generate the tagMatrix by
getTagMatrix
. At Last, plot the figures by
plotPeakProf
. Second, users can input type
and
by
parameters to getTagMatrix
to get the
tagMatrix and plot the figures. Third, users can use
plotPeakProf2
to do everything in one step.
Here uses the method of inputting type
and
by
parameters. type = "start_site"
means the
start site region. by = "gene"
means that it is the start
site region of gene(TSS regions). If users want to use binning method,
the nbin
method must be set.
We improved and developed several functions to plot body region of
Gene/Transcript/Exon/Intron/3UTR/5UTR. If users want to get more
information from the body region, we added upstream
and
downstream
parameters to functions in order to get flank
extension of body regions. upstream
and
downstream
can be NULL(default), rel object and actual
numbers. NULL(default) reflects body regions with no flank extension.
Rel object reflects the percentage of total length of body regions.
Actual numbers reflects the actual length of flank extension.
## Here uses `plotPeakProf2` to do all things in one step.
## Gene body regions having lengths smaller than nbin will be filtered
## A message will be given to warning users about that.
## >> 9 peaks(0.872093%), having lengths smaller than 800bp, are filtered...
## the ignore_strand is FALSE in default. We put here to emphasize that.
## We will not show it again in the below example
plotPeakProf2(peak = peak, upstream = rel(0.2), downstream = rel(0.2),
conf = 0.95, by = "gene", type = "body", nbin = 800,
TxDb = txdb, weightCol = "V5",ignore_strand = F)
Users can also get the profile ChIP peaks binding to gene body regions with no flank extension or flank extension decided by actual length.
## The first method using getBioRegion(), getTagMatrix() and plotPeakProf() to plot in three steps.
genebody <- getBioRegion(TxDb = txdb,
by = "gene",
type = "body")
matrix_no_flankextension <- getTagMatrix(peak,windows = genebody, nbin = 800)
plotPeakProf(matrix_no_flankextension,conf = 0.95)
## The second method of using getTagMatrix() and plotPeakProf() to plot in two steps
matrix_actual_extension <- getTagMatrix(peak,windows = genebody, nbin = 800,
upstream = 1000,downstream = 1000)
plotPeakProf(matrix_actual_extension,conf = 0.95)
Users can also get the body region of 5UTR/3UTR.
## >> loading peak file... 2024-10-30 04:43:45 AM
## >> preparing features information... 2024-10-30 04:43:45 AM
## >> identifying nearest features... 2024-10-30 04:43:45 AM
## >> calculating distance from peak to TSS... 2024-10-30 04:43:45 AM
## >> assigning genomic annotation... 2024-10-30 04:43:45 AM
## >> adding gene annotation... 2024-10-30 04:43:53 AM
## >> assigning chromosome lengths 2024-10-30 04:43:53 AM
## >> done... 2024-10-30 04:43:53 AM
Note that it would also be possible to use Ensembl-based
EnsDb
annotation databases created by the ensembldb package
for the peak annotations by providing it with the TxDb
parameter. Since UCSC-style chromosome names are used we have to change
the style of the chromosome names from Ensembl to UCSC
in the example below.
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75
seqlevelsStyle(edb) <- "UCSC"
peakAnno.edb <- annotatePeak(files[[4]], tssRegion=c(-3000, 3000),
TxDb=edb, annoDb="org.Hs.eg.db")
Peak Annotation is performed by annotatePeak
. User can
define TSS (transcription start site) region, by default TSS is defined
from -3kb to +3kb. The output of annotatePeak
is
csAnno
instance. ChIPseeker
provides as.GRanges
to convert csAnno
to
GRanges
instance, and as.data.frame
to convert
csAnno
to data.frame
which can be exported to
file by write.table
.
TxDb
object contained transcript-related features of a
particular genome. Bioconductor provides several package that containing
TxDb
object of model organisms with multiple commonly used
genome version, for instance TxDb.Hsapiens.UCSC.hg38.knownGene,
TxDb.Hsapiens.UCSC.hg19.knownGene
for human genome hg38 and hg19, TxDb.Mmusculus.UCSC.mm10.knownGene
and TxDb.Mmusculus.UCSC.mm9.knownGene
for mouse genome mm10 and mm9, etc. User can also prepare their own
TxDb
object by retrieving information from UCSC Genome
Bioinformatics and BioMart data resources by R function
makeTxDbFromBiomart
and makeTxDbFromUCSC
.
TxDb
object should be passed for peak annotation.
All the peak information contained in peakfile will be retained in
the output of annotatePeak
. The position and strand
information of nearest genes are reported. The distance from peak to the
TSS of its nearest gene is also reported. The genomic region of the peak
is reported in annotation column. Since some annotation may overlap, ChIPseeker
adopted the following priority in genomic annotation.
Downstream is defined as the downstream of gene end.
ChIPseeker also provides parameter genomicAnnotationPriority for user to prioritize this hierachy.
annotatePeak
report detail information when the
annotation is Exon or Intron, for instance “Exon (uc002sbe.3/9736, exon
69 of 80)”, means that the peak is overlap with an Exon of transcript
uc002sbe.3, and the corresponding Entrez gene ID is 9736 (Transcripts
that belong to the same gene ID may differ in splice events), and this
overlaped exon is the 69th exon of the 80 exons that this transcript
uc002sbe.3 prossess.
Parameter annoDb is optional, if provided, extra columns including SYMBOL, GENENAME, ENSEMBL/ENTREZID will be added. The geneId column in annotation output will be consistent with the geneID in TxDb. If it is ENTREZID, ENSEMBL will be added if annoDb is provided, while if it is ENSEMBL ID, ENTREZID will be added.
To annotate the location of a given peak in terms of genomic
features, annotatePeak
assigns peaks to genomic annotation
in “annotation” column of the output, which includes whether a peak is
in the TSS, Exon, 5’ UTR, 3’ UTR, Intronic or Intergenic. Many
researchers are very interesting in these annotations. TSS region can be
defined by user and annotatePeak
output in details of which
exon/intron of which genes as illustrated in previous section.
Pie and Bar plot are supported to visualize the genomic annotation.
Since some annotation overlap, user may interested to view the full
annotation with their overlap, which can be partially resolved by
vennpie
function.
We extend UpSetR to view full
annotation overlap. User can user upsetplot
function.
We can combine vennpie
with upsetplot
by
setting vennpie = TRUE.
The distance from the peak (binding site) to the TSS of the nearest
gene is calculated by annotatePeak
and reported in the
output. We provide plotDistToTSS
to calculate the
percentage of binding sites upstream and downstream from the TSS of the
nearest genes, and visualize the distribution.
Once we have obtained the annotated nearest genes, we can perform functional enrichment analysis to identify predominant biological themes among these genes by incorporating biological knowledge provided by biological ontologies. For instance, Gene Ontology (GO)(Ashburner et al. 2000) annotates genes to biological processes, molecular functions, and cellular components in a directed acyclic graph structure, Kyoto Encyclopedia of Genes and Genomes (KEGG)(Kanehisa et al. 2004) annotates genes to pathways, Disease Ontology (DO)(Schriml et al. 2011) annotates genes with human disease association, and Reactome(Croft et al. 2013) annotates gene to pathways and reactions.
ChIPseeker also provides a function, seq2gene, for linking genomc regions to genes in a many-to-many mapping. It consider host gene (exon/intron), promoter region and flanking gene from intergenic region that may under control via cis-regulation. This function is designed to link both coding and non-coding genomic regions to coding genes and facilitate functional analysis.
Enrichment analysis is a widely used approach to identify biological themes. I have developed several Bioconductor packages for investigating whether the number of selected genes associated with a particular biological term is larger than expected, including DOSE(Yu et al. 2015) for Disease Ontology, ReactomePA for reactome pathway, clusterProfiler(Yu et al. 2012) for Gene Ontology and KEGG enrichment analysis.
## ID Description GeneRatio BgRatio RichFactor
## R-HSA-9830369 R-HSA-9830369 Kidney development 19/499 46/11146 0.4130435
## R-HSA-9758941 R-HSA-9758941 Gastrulation 27/499 125/11146 0.2160000
## FoldEnrichment zScore pvalue p.adjust qvalue
## R-HSA-9830369 9.226017 12.102741 2.252177e-14 2.360281e-11 2.351747e-11
## R-HSA-9758941 4.824721 9.309385 5.773075e-12 3.025091e-09 3.014153e-09
## geneID
## R-HSA-9830369 2625/5076/7490/652/6495/2303/3975/6928/10736/5455/7849/3237/6092/2122/255743/2296/3400/28514/2138
## R-HSA-9758941 5453/5692/5076/5080/7546/3169/652/5015/2303/5717/3975/2627/5714/344022/7849/5077/2637/7022/8320/7545/6657/4487/51176/2296/28514/2626/64321
## Count
## R-HSA-9830369 19
## R-HSA-9758941 27
gene <- seq2gene(peak, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb=txdb)
pathway2 <- enrichPathway(gene)
head(pathway2, 2)
## ID Description GeneRatio BgRatio RichFactor
## R-HSA-9830369 R-HSA-9830369 Kidney development 17/408 46/11146 0.3695652
## R-HSA-9758941 R-HSA-9758941 Gastrulation 24/408 125/11146 0.1920000
## FoldEnrichment zScore pvalue p.adjust qvalue
## R-HSA-9830369 10.096014 12.049721 1.806589e-13 1.765037e-10 1.709604e-10
## R-HSA-9758941 5.245176 9.303549 1.834131e-11 8.959732e-09 8.678338e-09
## geneID
## R-HSA-9830369 2625/5076/3227/652/6495/2303/3975/3237/6092/2122/255743/7490/6928/7849/5455/2296/28514
## R-HSA-9758941 5453/5692/5076/7546/3169/652/2303/5717/3975/2627/5714/344022/2637/8320/7545/7020/2626/5080/5015/7849/51176/2296/64321/28514
## Count
## R-HSA-9830369 17
## R-HSA-9758941 24
More information can be found in the vignettes of Bioconductor packages DOSE(Yu et al. 2015), ReactomePA, clusterProfiler(Yu et al. 2012), which also provide several methods to visualize enrichment results. The clusterProfiler(Yu et al. 2012) is designed for comparing and visualizing functional profiles among gene clusters, and can directly applied to compare biological themes at GO, DO, KEGG, Reactome perspective.
Function plotAvgProf
, tagHeatmap
and
plotPeakProf
can accept a list of tagMatrix
and visualize profile or heatmap among several ChIP experiments, while
plotAvgProf2
, peakHeatmap
and
plotPeakProf2
can accept a list of bed files and perform
the same task in one step.
## promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
## tagMatrixList <- lapply(files, getTagMatrix, windows=promoter)
##
## to speed up the compilation of this vigenette, we load a precaculated tagMatrixList
data("tagMatrixList")
plotAvgProf(tagMatrixList, xlim=c(-3000, 3000))
## >> plotting figure... 2024-10-30 04:44:19 AM
## normal method
plotPeakProf2(files, upstream = 3000, downstream = 3000, conf = 0.95,
by = "gene", type = "start_site", TxDb = txdb,
facet = "row")
## binning method
plotPeakProf2(files, upstream = 3000, downstream = 3000, conf = 0.95,
by = "gene", type = "start_site", TxDb = txdb,
facet = "row", nbin = 800)
Functions plotPeakProf
and plotPeakProf2
also support to plot profile of several ChIP peak data binding to body
region.
The plotAnnoBar
and plotDistToTSS
can also
accept input of a named list of annotated peaks (output of
annotatePeak
).
We can use plotAnnoBar
to comparing their genomic
annotation.
R function plotDistToTSS
can use to comparing distance
to TSS profiles among ChIPseq data.
As shown in section 4, the annotated genes can analyzed by
r Biocpkg("clusterProfiler")
(Yu et
al. 2012), r Biocpkg("DOSE")
(Yu et al. 2015), meshes and
r Biocpkg("ReactomePA")
for Gene Ontology, KEGG, Disease
Ontology, MeSH and Reactome Pathway enrichment analysis.
The clusterProfiler(Yu et al. 2012) package provides
compareCluster
function for comparing biological themes
among gene clusters, and can be easily adopted to compare different ChIP
peak experiments.
genes = lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
names(genes) = sub("_", "\n", names(genes))
compKEGG <- compareCluster(geneCluster = genes,
fun = "enrichKEGG",
pvalueCutoff = 0.05,
pAdjustMethod = "BH")
dotplot(compKEGG, showCategory = 15, title = "KEGG Pathway Enrichment Analysis")
User may want to compare the overlap peaks of replicate experiments
or from different experiments. ChIPseeker
provides peak2GRanges
that can read peak file and stored in
GRanges object. Several files can be read simultaneously using lapply,
and then passed to vennplot
to calculate their overlap and
draw venn plot.
vennplot
accept a list of object, can be a list of
GRanges or a list of vector. Here, I will demonstrate using
vennplot
to visualize the overlap of the nearest genes
stored in peakAnnoList.
Overlap is very important, if two ChIP experiment by two different proteins overlap in a large fraction of their peaks, they may cooperative in regulation. Calculating the overlap is only touch the surface. ChIPseeker implemented statistical methods to measure the significance of the overlap.
p <- GRanges(seqnames=c("chr1", "chr3"),
ranges=IRanges(start=c(1, 100), end=c(50, 130)))
shuffle(p, TxDb=txdb)
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 208740950-208740999 *
## [2] chr3 67754438-67754468 *
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
We implement the shuffle
function to randomly permute
the genomic locations of ChIP peaks defined in a genome which stored in
TxDb
object.
With the ease of this shuffle
method, we can generate
thousands of random ChIP data and calculate the background null
distribution of the overlap among ChIP data sets.
enrichPeakOverlap(queryPeak = files[[5]],
targetPeak = unlist(files[1:4]),
TxDb = txdb,
pAdjustMethod = "BH",
nShuffle = 50,
chainFile = NULL,
verbose = FALSE)
## qSample
## ARmo_0M GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
## ARmo_1nM GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
## ARmo_100nM GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
## CBX6_BF GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
## tSample qLen tLen N_OL
## ARmo_0M GSM1174480_ARmo_0M_peaks.bed.gz 1663 812 0
## ARmo_1nM GSM1174481_ARmo_1nM_peaks.bed.gz 1663 2296 8
## ARmo_100nM GSM1174482_ARmo_100nM_peaks.bed.gz 1663 1359 3
## CBX6_BF GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz 1663 1331 968
## pvalue p.adjust
## ARmo_0M 0.90196078 0.90196078
## ARmo_1nM 0.09803922 0.19607843
## ARmo_100nM 0.31372549 0.41830065
## CBX6_BF 0.01960784 0.07843137
Parameter queryPeak is the query ChIP data, while targetPeak is bed file name or a vector of bed file names from comparison; nShuffle is the number to shuffle the peaks in targetPeak. To speed up the compilation of this vignettes, we only set nShuffle to 50 as an example for only demonstration. User should set the number to 1000 or above for more robust result. Parameter chainFile are chain file name for mapping the targetPeak to the genome version consistent with queryPeak when their genome version are different. This creat the possibility of comparison among different genome version and cross species.
In the output, qSample is the name of queryPeak and qLen is the the number of peaks in queryPeak. N_OL is the number of overlap between queryPeak and targetPeak.
There are many ChIP seq data sets that have been published and deposited in GEO database. We can compare our own dataset to those deposited in GEO to search for significant overlap data. Significant overlap of ChIP seq data by different binding proteins may be used to infer cooperative regulation and thus can be used to generate hypotheses.
We collect about 17,000 bed files deposited in GEO,
user can use getGEOspecies
to get a summary based on
speices.
## species Freq
## 1 Actinobacillus pleuropneumoniae 1
## 2 Actinobacillus pleuropneumoniae serovar 3 str. JL03 1
## 3 Aedes aegypti 11
## 4 Anabaena 6
## 5 Anolis carolinensis 7
## 6 Anopheles gambiae 2
## 7 Apis mellifera 17
## 8 Apis mellifera scutellata 1
## 9 Arabidopsis 17
## 10 Arabidopsis lyrata 4
## 11 Arabidopsis thaliana 2473
## 12 Atelerix albiventris 2
## 13 Bombus terrestris 8
## 14 Bos taurus 257
## 15 Brachypodium distachyon 10
## 16 Branchiostoma lanceolatum 93
## 17 Brassica rapa 12
## 18 Caenorhabditis elegans 532
## 19 Callithrix jacchus 48
## 20 Candida albicans 34
## 21 Candida dubliniensis 20
## 22 Canis lupus 1
## 23 Canis lupus familiaris 35
## 24 Capsaspora owczarzaki 21
## 25 Carica papaya 1
## 26 Caulobacter crescentus NA1000 8
## 27 Caulobacter vibrioides 2
## 28 Chlamydomonas reinhardtii 53
## 29 Chlorocebus aethiops 11
## 30 Chlorocebus sabaeus 6
## 31 Ciona savignyi 4
## 32 Citrullus lanatus 3
## 33 Cleome hassleriana 1
## 34 Columba livia 6
## 35 Corynebacterium glutamicum 9
## 36 Crassostrea gigas 4
## 37 Cryptococcus neoformans 67
## 38 Cucumis melo 9
## 39 Cucumis sativus 3
## 40 Cyprinus carpio 40
## 41 Danio rerio 532
## 42 Daphnia magna 36
## 43 Daphnia pulex 14
## 44 Dictyostelium discoideum 77
## 45 Dimocarpus longan 9
## 46 Dromaius novaehollandiae 27
## 47 Drosophila busckii 2
## 48 Drosophila melanogaster 1483
## 49 Drosophila melanogaster;\tSindbis virus 3
## 50 Drosophila miranda 2
## 51 Drosophila pseudoobscura 7
## 52 Drosophila simulans 12
## 53 Drosophila virilis 28
## 54 Drosophila willistoni 1
## 55 Drosophila yakuba 8
## 56 Equus caballus 1
## 57 Escherichia coli 126
## 58 Escherichia coli BW25113 4
## 59 Escherichia coli K-12 2
## 60 Escherichia coli str. K-12 substr. MG1655 16
## 61 Eutrema salsugineum 12
## 62 Fragaria vesca 3
## 63 Fusarium oxysporum 4
## 64 Gallus gallus 292
## 65 Gasterosteus aculeatus 1
## 66 Geobacter sulfurreducens PCA 3
## 67 Glycine max 31
## 68 Gorilla gorilla 8
## 69 Haloferax volcanii 3
## 70 Helianthus annuus 2
## 71 Heterocephalus glaber 4
## 72 Histophilus somni 1
## 73 Homo sapiens 51828
## 74 Homo sapiens;\tHuman betaherpesvirus 6 2
## 75 Homo sapiens;\tHuman betaherpesvirus 6A 38
## 76 Homo sapiens;\tHuman gammaherpesvirus 8 12
## 77 Homo sapiens;\tHuman herpesvirus 8 6
## 78 Homo sapiens;\tHuman poliovirus 1 strain Sabin 5
## 79 Homo sapiens;\tInfluenza A virus (A/hvPR8/34(H1N1)) 12
## 80 Homo sapiens;\tMus musculus 13
## 81 Homo sapiens;\tSindbis virus 8
## 82 Human gammaherpesvirus 4 3
## 83 Human herpesvirus 6B 2
## 84 Human herpesvirus 8 6
## 85 Human immunodeficiency virus 11
## 86 Hydra vulgaris 6
## 87 Komagataella phaffii 12
## 88 Larimichthys crocea 7
## 89 Leersia perrieri 1
## 90 Legionella pneumophila 5
## 91 Leishmania amazonensis 4
## 92 Leishmania major 14
## 93 Leishmania major strain Friedlin 4
## 94 Leishmania tarentolae 15
## 95 Macaca mulatta 136
## 96 Macaca mulatta polyomavirus 1 7
## 97 Macaca mulatta;\tMacacine gammaherpesvirus 4 3
## 98 Malus domestica 21
## 99 Marchantia polymorpha 1
## 100 Medicago truncatula 8
## 101 Mesoplasma florum 1
## 102 Microcebus murinus 6
## 103 Moloney murine leukemia virus 9
## 104 Monodelphis domestica 31
## 105 Moraxella catarrhalis O35E 6
## 106 Mus 2
## 107 Mus musculus 38038
## 108 Mus musculus musculus x M. m. castaneus 1
## 109 Mus musculus x Mus spretus 11
## 110 Mus musculus;\tHuman herpesvirus 1 strain KOS 2
## 111 Mus musculus;\tMurid herpesvirus 68 2
## 112 Musa acuminata AAA Group 3
## 113 Mycobacterium abscessus ATCC 19977 4
## 114 Mycobacterium marinum 1
## 115 Mycobacterium tuberculosis 2
## 116 Mycobacterium tuberculosis H37Rv 5
## 117 Mycolicibacterium smegmatis 2
## 118 Mycolicibacterium smegmatis MC2 155 2
## 119 Myotis brandtii 15
## 120 Naumovozyma castellii 1
## 121 Nematostella vectensis 31
## 122 Neurospora crassa 3
## 123 Nicotiana tabacum 6
## 124 Nicrophorus vespilloides 4
## 125 Nippostrongylus brasiliensis 9
## 126 Nomascus leucogenys 1
## 127 Oreochromis niloticus 1
## 128 Ornithorhynchus anatinus 5
## 129 Oryza brachyantha 11
## 130 Oryza glaberrima 1
## 131 Oryza punctata 1
## 132 Oryza sativa 110
## 133 Oryza sativa Japonica Group 17
## 134 Oryzias latipes 26
## 135 Ovis aries 2
## 136 Pan troglodytes 106
## 137 Pantherophis guttatus 7
## 138 Papio anubis 3
## 139 Patiria miniata 1
## 140 Penicillium chrysogenum 1
## 141 Petunia x hybrida 10
## 142 Phascolarctos cinereus 2
## 143 Phaseolus coccineus 2
## 144 Plasmodium falciparum 220
## 145 Plasmodium falciparum 3D7 43
## 146 Plectus sambesii 1
## 147 Pongo pygmaeus 6
## 148 Procambarus fallax 6
## 149 Procambarus virginalis 16
## 150 Pseudomonas aeruginosa PAO1 7
## 151 Pseudomonas putida KT2440 2
## 152 Pseudomonas savastanoi pv. phaseolicola 1448A 3
## 153 Pseudozyma aphidis 11
## 154 Pyricularia oryzae 18
## 155 Pyrococcus furiosus 4
## 156 Pyrus x bretschneideri 4
## 157 Rattus norvegicus 237
## 158 Rhodobacter sphaeroides 12
## 159 Rhodopseudomonas palustris 6
## 160 Rhodopseudomonas palustris CGA009 3
## 161 Romanomermis culicivorax 1
## 162 Saccharomyces cerevisiae 3740
## 163 Saccharomyces cerevisiae BY4741 8
## 164 Saccharomyces cerevisiae S288C 18
## 165 Saccharomyces cerevisiae x Saccharomyces paradoxus 16
## 166 Saccharomyces cerevisiae;\tHomo sapiens 2
## 167 Saccharomyces cerevisiae;\tMus musculus 12
## 168 Saccharomyces kudriavzevii 1
## 169 Saccharomyces paradoxus 8
## 170 Saccharomyces uvarum 1
## 171 Salmo salar 32
## 172 Salmonella enterica subsp. enterica serovar Typhimurium str. SL1344 4
## 173 Sarcophilus harrisii 3
## 174 Schizosaccharomyces japonicus 2
## 175 Schizosaccharomyces pombe 460
## 176 Schmidtea mediterranea 7
## 177 Solanum lycopersicum 76
## 178 Solanum pennellii 18
## 179 Sorghum bicolor 3
## 180 Spodoptera frugiperda 16
## 181 Streptomyces coelicolor A3(2) 6
## 182 Strongylocentrotus purpuratus 36
## 183 Sus scrofa 334
## 184 Taeniopygia guttata 21
## 185 Tetrahymena thermophila 19
## 186 Tetrahymena thermophila CU428 6
## 187 Thermus thermophilus HB27 4
## 188 Torulaspora microellipsoides 6
## 189 Toxoplasma gondii 21
## 190 Toxoplasma gondii RH 12
## 191 Tribolium castaneum 14
## 192 Trichinella spiralis 1
## 193 Trichuris muris 4
## 194 Triticum aestivum 127
## 195 Trypanosoma brucei 12
## 196 Tupaia chinensis 7
## 197 Vibrio cholerae 8
## 198 Vitis vinifera 2
## 199 Xenopus (Silurana) tropicalis 62
## 200 Xenopus laevis 29
## 201 Xenopus tropicalis 133
## 202 Xenopus tropicalis x Xenopus laevis 4
## 203 Zea mays 173
## 204 synthetic construct 42
The summary can also based on genome version as illustrated below:
## organism genomeVersion Freq
## 1 Anolis carolinensis anoCar2 7
## 2 Bos taurus bosTau4 2
## 3 Bos taurus bosTau6 33
## 4 Bos taurus bosTau7 2
## 5 Canis lupus familiaris canFam3 10
## 6 Caenorhabditis elegans ce10 328
## 7 Caenorhabditis elegans ce6 64
## 8 Danio rerio danRer6 89
## 9 Danio rerio danRer7 89
## 10 Drosophila melanogaster dm3 767
## 11 Gallus gallus galGal3 33
## 12 Gallus gallus galGal4 73
## 13 Gasterosteus aculeatus gasAcu1 1
## 14 Heterocephalus glaber hetGla2 4
## 15 Homo sapiens hg18 3368
## 16 Homo sapiens hg19 30326
## 17 Homo sapiens hg38 4091
## 18 Mus musculus mm10 19089
## 19 Mus musculus mm8 556
## 20 Mus musculus mm9 17247
## 21 Monodelphis domestica monDom5 31
## 22 Ovis aries oviAri3 2
## 23 Pan troglodytes panTro3 48
## 24 Pan troglodytes panTro4 48
## 25 Macaca mulatta rheMac2 84
## 26 Macaca mulatta rheMac3 40
## 27 Rattus norvegicus rn5 66
## 28 Saccharomyces cerevisiae sacCer2 145
## 29 Saccharomyces cerevisiae sacCer3 2646
## 30 Sus scrofa susScr2 17
## 31 Sus scrofa susScr3 18
## 32 Taeniopygia guttata taeGut2 20
## 33 Xenopus (Silurana) tropicalis xenTro3 3
User can access the detail information by getGEOInfo
,
for each genome version.
## series_id gsm organism
## 111 GSE16256 GSM521889 Homo sapiens
## 112 GSE16256 GSM521887 Homo sapiens
## 113 GSE16256 GSM521883 Homo sapiens
## 114 GSE16256 GSM1010966 Homo sapiens
## 115 GSE16256 GSM896166 Homo sapiens
## 116 GSE16256 GSM910577 Homo sapiens
## title
## 111 Reference Epigenome: ChIP-Seq Analysis of H3K27me3 in IMR90 Cells; renlab.H3K27me3.IMR90-02.01
## 112 Reference Epigenome: ChIP-Seq Analysis of H3K27ac in IMR90 Cells; renlab.H3K27ac.IMR90-03.01
## 113 Reference Epigenome: ChIP-Seq Analysis of H3K14ac in IMR90 Cells; renlab.H3K14ac.IMR90-02.01
## 114 polyA RNA sequencing of STL003 Pancreas Cultured Cells; polyA-RNA-seq_STL003PA_r1a
## 115 Reference Epigenome: ChIP-Seq Analysis of H4K8ac in hESC H1 Cells; renlab.H4K8ac.hESC.H1.01.01
## 116 Reference Epigenome: ChIP-Seq Analysis of H3K4me1 in Human Spleen Tissue; renlab.H3K4me1.STL003SX.01.01
## supplementary_file
## 111 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM521nnn/GSM521889/suppl/GSM521889_UCSD.IMR90.H3K27me3.SK05.bed.gz
## 112 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM521nnn/GSM521887/suppl/GSM521887_UCSD.IMR90.H3K27ac.YL58.bed.gz
## 113 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM521nnn/GSM521883/suppl/GSM521883_UCSD.IMR90.H3K14ac.SK17.bed.gz
## 114 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1010nnn/GSM1010966/suppl/GSM1010966_UCSD.Pancreas.mRNA-Seq.STL003.bed.gz
## 115 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM896nnn/GSM896166/suppl/GSM896166_UCSD.H1.H4K8ac.AY17.bed.gz
## 116 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM910nnn/GSM910577/suppl/GSM910577_UCSD.Spleen.H3K4me1.STL003.bed.gz
## genomeVersion pubmed_id
## 111 hg19 19829295
## 112 hg19 19829295
## 113 hg19 19829295
## 114 hg19 19829295
## 115 hg19 19829295
## 116 hg19 19829295
If simplify is set to FALSE, extra information including source_name, extract_protocol, description, data_processing and submission_date will be incorporated.
ChIPseeker
provide function downloadGEObedFiles
to download all the
bed files of a particular genome.
Or a vector of GSM accession number by
downloadGSMbedFiles
.
After download the bed files from GEO, we can pass them to
enrichPeakOverlap
for testing the significant of overlap.
Parameter targetPeak can be the folder, e.g. hg19,
that containing bed files. enrichPeakOverlap
will parse the
folder and compare all the bed files. It is possible to test the overlap
with bed files that are mapping to different genome or different genome
versions, enrichPeakOverlap
provide a parameter
chainFile that can pass a chain file and liftOver the
targetPeak to the genome version consistent with
queryPeak. Signifcant overlap can be use to generate hypothesis
of cooperative regulation.By mining the data deposited in GEO, we can
identify some putative complex or interacted regulators in gene
expression regulation or chromsome remodelling for further
validation.
If you have questions/issues, please visit ChIPseeker homepage first. Your problems are mostly documented. If you think you found a bug, please follow the guide and provide a reproducible example to be posted on github issue tracker. For questions, please post to Bioconductor support site and tag your post with ChIPseeker.
For Chinese user, you can follow me on WeChat (微信).
Here is the output of sessionInfo()
on the system on
which this document was compiled:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ChIPseeker_1.43.0
## [2] ReactomePA_1.49.1
## [3] clusterProfiler_4.13.4
## [4] ggplot2_3.5.1
## [5] org.Hs.eg.db_3.20.0
## [6] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [7] GenomicFeatures_1.57.1
## [8] AnnotationDbi_1.69.0
## [9] Biobase_2.65.1
## [10] GenomicRanges_1.57.2
## [11] GenomeInfoDb_1.41.2
## [12] IRanges_2.39.2
## [13] S4Vectors_0.43.2
## [14] BiocGenerics_0.51.3
## [15] yulab.utils_0.1.7
## [16] prettydoc_0.4.1
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] jsonlite_1.8.9 magrittr_2.0.3
## [5] ggtangle_0.0.3 farver_2.1.2
## [7] rmarkdown_2.28 fs_1.6.4
## [9] BiocIO_1.17.0 zlibbioc_1.51.2
## [11] vctrs_0.6.5 memoise_2.0.1
## [13] Rsamtools_2.21.2 RCurl_1.98-1.16
## [15] ggtree_3.13.2 htmltools_0.5.8.1
## [17] S4Arrays_1.5.11 plotrix_3.8-4
## [19] curl_5.2.3 SparseArray_1.5.45
## [21] gridGraphics_0.5-1 sass_0.4.9
## [23] KernSmooth_2.23-24 bslib_0.8.0
## [25] plyr_1.8.9 cachem_1.1.0
## [27] buildtools_1.0.0 GenomicAlignments_1.41.0
## [29] igraph_2.1.1 lifecycle_1.0.4
## [31] pkgconfig_2.0.3 Matrix_1.7-1
## [33] R6_2.5.1 fastmap_1.2.0
## [35] gson_0.1.0 GenomeInfoDbData_1.2.13
## [37] MatrixGenerics_1.17.1 digest_0.6.37
## [39] aplot_0.2.3 enrichplot_1.25.6
## [41] colorspace_2.1-1 patchwork_1.3.0
## [43] RSQLite_2.3.7 labeling_0.4.3
## [45] fansi_1.0.6 polyclip_1.10-7
## [47] httr_1.4.7 abind_1.4-8
## [49] compiler_4.4.1 bit64_4.5.2
## [51] withr_3.0.2 graphite_1.51.1
## [53] BiocParallel_1.39.0 viridis_0.6.5
## [55] DBI_1.2.3 highr_0.11
## [57] gplots_3.2.0 ggforce_0.4.2
## [59] R.utils_2.12.3 MASS_7.3-61
## [61] rappdirs_0.3.3 DelayedArray_0.31.14
## [63] rjson_0.2.23 caTools_1.18.3
## [65] gtools_3.9.5 tools_4.4.1
## [67] ape_5.8 R.oo_1.26.0
## [69] glue_1.8.0 restfulr_0.0.15
## [71] nlme_3.1-166 GOSemSim_2.31.2
## [73] grid_4.4.1 reshape2_1.4.4
## [75] fgsea_1.31.6 generics_0.1.3
## [77] gtable_0.3.6 R.methodsS3_1.8.2
## [79] tidyr_1.3.1 data.table_1.16.2
## [81] tidygraph_1.3.1 utf8_1.2.4
## [83] XVector_0.45.0 ggrepel_0.9.6
## [85] pillar_1.9.0 stringr_1.5.1
## [87] splines_4.4.1 tweenr_2.0.3
## [89] dplyr_1.1.4 treeio_1.29.2
## [91] lattice_0.22-6 rtracklayer_1.65.0
## [93] bit_4.5.0 tidyselect_1.2.1
## [95] GO.db_3.20.0 maketools_1.3.1
## [97] Biostrings_2.73.2 reactome.db_1.89.0
## [99] knitr_1.48 gridExtra_2.3
## [101] SummarizedExperiment_1.35.5 xfun_0.48
## [103] graphlayouts_1.2.0 matrixStats_1.4.1
## [105] stringi_1.8.4 UCSC.utils_1.1.0
## [107] boot_1.3-31 lazyeval_0.2.2
## [109] ggfun_0.1.7 yaml_2.3.10
## [111] evaluate_1.0.1 codetools_0.2-20
## [113] ggraph_2.2.1 tibble_3.2.1
## [115] qvalue_2.37.0 graph_1.83.0
## [117] ggplotify_0.1.2 cli_3.6.3
## [119] munsell_0.5.1 jquerylib_0.1.4
## [121] Rcpp_1.0.13 png_0.1-8
## [123] XML_3.99-0.17 parallel_4.4.1
## [125] blob_1.2.4 DOSE_3.99.1
## [127] bitops_1.0-9 viridisLite_0.4.2
## [129] tidytree_0.4.6 scales_1.3.0
## [131] purrr_1.0.2 crayon_1.5.3
## [133] rlang_1.1.4 cowplot_1.1.3
## [135] fastmatch_1.1-4 KEGGREST_1.45.1