---
title: "An end-to-end ChIP-seq workflow: from a peak BED to enriched motifs"
shorttitle: "ChIP-seq workflow"
date: 23 May 2026
author:
- Benjamin Jean-Marie Tremblay^[benjamin.tremblay@uwaterloo.ca]
bibliography: universalmotif.bib
abstract: >
  This vignette walks through a complete ChIP-seq motif analysis on
  real data. Starting from a peak BED file bundled in
  `inst/extdata`, we extract peak sequences from a BSgenome,
  discover motifs de novo with `motif_finder()`, deduplicate them
  with `merge_similar2()`, compare them against the JASPAR2022
  plant collection via `MotifDb`, validate enrichment against a
  composition-matched background with `match_bkg()` and
  `enrich_motifs2()`, scan all peaks with `scan_sequences2()`,
  test for central positional bias with `motif_peaks()`, and look
  for motif clustering with `motif_coocc()`. The
  closing section assembles a summary table that combines all of
  the per-motif evidence into one ranked list.
vignette: >
  %\VignetteIndexEntry{An end-to-end ChIP-seq workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  bookdown::pdf_document2
header-includes:
  - \usepackage{float}
  - \floatplacement{figure}{H}
---

```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.pos = "H")
suppressPackageStartupMessages({
  library(universalmotif)
  library(Biostrings)
  library(GenomicRanges)
  library(GenomeInfoDb)
})
options(Biostrings.coloring = FALSE)

has_pkgs <- suppressWarnings(suppressMessages(vapply(
  c("rtracklayer", "BSgenome.Athaliana.TAIR.TAIR9", "MotifDb"),
  requireNamespace, logical(1), quietly = TRUE)))
if (!all(has_pkgs)) {
  knitr::opts_chunk$set(eval = FALSE)
  message("Skipping ChIP-seq vignette: required Suggests packages ",
          "are not installed (",
          paste(names(has_pkgs)[!has_pkgs], collapse = ", "),
          ").")
}
```

# Introduction

This vignette ties together the analysis functions of the
`universalmotif` package into a single end-to-end ChIP-seq workflow.
The previous vignettes cover individual pieces in isolation: for a
introduction to sequence motifs, see the
[introductory](IntroductionToSequenceMotifs.pdf) vignette; for the
plumbing of motif objects (I/O, conversion, logos), see the
[motif manipulation](MotifManipulation.pdf) vignette; and for the
mechanics of scanning, shuffling, and enrichment, see the
[sequence searches](SequenceSearches.pdf) vignette. This document
assumes you have skimmed the first and last of those.

The dataset is a real ChIP-seq peak call set for the *Arabidopsis
thaliana* B3-domain repressor VAL1, taken from @yuan2021val. VAL1
and its paralogue VAL2 silence seed-maturation genes by binding the
RY element (`CATGCA` / `TGCATG`) and recruiting Polycomb Repressive
Complex 2. We expect a properly working motif-discovery pipeline,
fed only the peaks, to recover the RY element and to match it
against B3-domain transcription factors in JASPAR.

The workflow proceeds in eight steps:

1. Import the peak BED.
2. Extract peak sequences from a BSgenome.
3. Discover motifs *de novo* with `motif_finder()`.
4. Deduplicate the discovered set with `merge_similar2()`.
5. Compare the survivors to JASPAR via `compare_motifs2()`.
6. Validate enrichment against a composition-matched background
   with `match_bkg()` and `enrich_motifs2()`.
7. Map hit positions with `scan_sequences2()` and test for central
   bias with `motif_peaks()`.
8. Look for motif clustering with `motif_coocc()`.

We close with a summary table that ranks the discovered motifs by
the combined evidence.

# Loading peaks from a BED file

`universalmotif` does not ship its own BED importer because
`rtracklayer::import()` already does the job well. The peak file
lives in `inst/extdata`; `system.file()` resolves it to an absolute
path inside the installed package.

```{r}
library(rtracklayer)

bed.file <- system.file("extdata",
                        "VAL1-GFP_val1_calledPeaks.bed",
                        package = "universalmotif")
peaks <- import(bed.file, format = "BED")
peaks
```

The BED file uses bare numeric chromosome names (`"1"` through
`"5"`), whereas the matching BSgenome we will use in the next
section uses the `"Chr1"`..`"Chr5"` convention. A single
`seqlevels()` assignment fixes that, and we drop any mitochondrial
or plastid contigs at the same time (there are none in this peak
set, but it is a useful defensive habit).

```{r}
seqlevels(peaks) <- paste0("Chr", seqlevels(peaks))
peaks <- keepSeqlevels(peaks,
                       paste0("Chr", 1:5),
                       pruning.mode = "coarse")
summary(width(peaks))
length(peaks)
```

# Extracting peak sequences and building a matched background

We use the TAIR9 BSgenome and resize every peak to a fixed 500 bp
window centred on the called peak. The fixed window matters
because `motif_peaks()` later treats positions as comparable across
sequences; if the peaks varied in length, the "centre" would mean
different things in different rows.

```{r}
library(BSgenome.Athaliana.TAIR.TAIR9)

peaks.500 <- resize(peaks, width = 500, fix = "center")
## Drop any windows that ran off a chromosome end.
peaks.500 <- trim(peaks.500)
peaks.500 <- peaks.500[width(peaks.500) == 500]

peak.seqs <- getSeq(Athaliana, peaks.500)
names(peak.seqs) <- paste0("peak_", seq_along(peak.seqs))
peak.seqs
```

Plant promoter sequences are strongly AT-skewed, so any background
analysis on these peaks needs to control for composition. We build
that control once, here, and then reuse it for both the *de novo*
discovery in §4 and the enrichment validation in §7. `match_bkg()`
accepts a `BSgenome` directly, so we can pass `genome = Athaliana`
together with `exclude = peaks.500`, and it then samples random
non-peak genomic windows internally, matching each peak on GC
content and length.

```{r}
set.seed(2026)
bkg.seqs <- match_bkg(peak.seqs, genome = Athaliana, exclude = peaks.500)
length(bkg.seqs)
```

It is worth a quick sanity check that the peak set and the matched
background really do have more or less indistinguishable GC
distributions:

```{r}
summary(letterFrequency(peak.seqs, "GC", as.prob = TRUE))
summary(letterFrequency(bkg.seqs,  "GC", as.prob = TRUE))
```

# De novo motif discovery

`motif_finder()` is a self-contained k-mer enumeration discoverer,
and (handily) it does not require an external MEME installation. By
default it builds its own negative pool by shuffling the input at
`shuffle.k = 2`, but passing `bkg.sequences = bkg.seqs` (our
composition-matched genomic pool from §3) tends to work meaningfully
better on plant-promoter ChIP-seq. The reason is that shuffling each
peak preserves that peak's own AT-skew, so AT-rich k-mers cannot
enrich against it, and everything past the most extreme signal ends
up effectively hidden. Against a real genomic background instead, the
lower-abundance motifs (G-box, E-box) come through alongside the
dominant RY element. On this dataset (5,318 peaks of 500 bp each, two
threads) the call returns in roughly 15 seconds, so there is no need
to sub-sample.

```{r motif-finder, cache=FALSE}
discovered.df <- motif_finder(peak.seqs,
                              bkg.sequences = bkg.seqs,
                              nmotifs       = 5,
                              min.width     = 6,
                              max.width     = 12,
                              rng.seed      = 2026,
                              nthreads      = 2)
print(as.data.frame(discovered.df[, c("name", "consensus", "pval")]),
      row.names = FALSE)
```

`motif_finder()` returns a `universalmotif_df` whose `motif` column
wraps the actual motif objects in an `AsIs` envelope. `to_list()`
unwraps it back into a plain list, which is what every downstream
function expects.

```{r}
discovered <- to_list(discovered.df, extrainfo = FALSE)
```

For viewing, we will create a copy which has the discovery
p-value added to the name:

```{r, fig.height=6, fig.width=7}
discovered.labelled <- discovered
for (i in seq_along(discovered.labelled)) {
  discovered.labelled[[i]]["name"] <- sprintf(
    "%s  (p = %.1e)",
    discovered.labelled[[i]]["name"],
    discovered.df$pval[i]
  )
}
view_motifs2(discovered.labelled, sort.by = "similarity",
             show.positions = FALSE, names.pos = "right")
```

With `sort.by = "similarity"`, `view_motifs2()` reorders the panels
by hierarchical clustering on Pearson correlation between aligned
columns, so visually related logos are adjacent. The RY-element
variants (consensuses containing `CATGCA`) cluster together; so do
the G-box (`CACGTG`) variants,
GAGA repeats, and low-complexity AT-rich runs. The next two sections
(enrichment against a composition-matched background, and positional
bias) are what tell us which of these clusters carry a real binding
signal and which only reflect background composition. Pass
`sort.by = "ic"` (the default) for descending information content,
or `sort.by = "none"` to keep `motif_finder()`'s p-value ranking.

# Deduplicating the discovered set

A real motif-discovery run almost always returns several rotated or
extended copies of the same underlying signal. These variations can
be biologically meaningful, but for simplicity we will not dwell on
them here. Instead we merge them with `merge_similar2()`, which
collapses each cluster down to a single representative by
hierarchical clustering on the q-values from a Pearson-based
comparison. It is often wise to trim the merged motifs afterwards as
well, since merging tends to fold mismatching flanks into low-IC
positions.

```{r}
discovered.merged <- merge_similar2(discovered, qvalue = 0.01)
discovered.merged <- trim_motifs(discovered.merged)

length(discovered)
length(discovered.merged)
```

`merge_similar2()` names each merged motif by joining the names of
its contributing motifs with `+`, which becomes unwieldy quite
quickly when several distinct binding signals each produce their own
cluster (the RY element, the bZIP G-box, and so on). So we relabel
each merged cluster by inspecting its consensus, so that the rest of
the vignette can refer to motifs by what they *are*:

```{r}
label_merged <- function(consensus) {
  if (grepl("CATGCA", consensus))                    "RY"
  else if (grepl("CACGTG", consensus))               "G_box"
  else if (grepl("ACGTAC|GTACGT", consensus))        "bZIP"
  else if (grepl("AAAAAA|ATATAT|AAATAA", consensus)) "AT_rich"
  else if (grepl("GAGAGA", consensus))               "GA_repeat"
  else                                               consensus
}
new.names <- vapply(discovered.merged,
                    function(m) m["name"], character(1))
for (i in seq_along(discovered.merged)) {
  new.names[i] <- label_merged(
    discovered.merged[[i]]["consensus"]
  )
  discovered.merged[[i]]["name"] <- new.names[i]
}
new.names
```

```{r, fig.height=4, fig.width=7}
view_motifs2(discovered.merged, sort.by = "similarity",
             show.positions = FALSE, names.pos = "right")
```

# Comparing against JASPAR via MotifDb

For the cross-database comparison we use `MotifDb`, which already
ships JASPAR2022 plant motifs and is therefore the lightest path to
a reference set. We narrow it to *Arabidopsis* JASPAR2022 entries
and convert the resulting list to `universalmotif` objects.

```{r motifdb, warning=FALSE, message=FALSE}
suppressPackageStartupMessages(library(MotifDb))

ath.jaspar <- query(query(MotifDb, "Athaliana"), "jaspar2022")
length(ath.jaspar)
ath.motifs <- convert_motifs(ath.jaspar)
```

`compare_motifs2()` in long-format mode treats indexed entries of
`motifs` as queries and the whole input as the target database, so
we concatenate the discovered set with the reference and pass the
query indices via `compare.to`.

```{r}
combined <- c(discovered.merged, ath.motifs)
n.disc   <- length(discovered.merged)

cmp <- compare_motifs2(combined,
                       compare.to = seq_len(n.disc),
                       qvalue     = 0.2,
                       nthreads   = 2)
cmp <- cmp[cmp$subject != cmp$target, ]
cmp <- cmp[order(cmp$qvalue), ]
print(head(cmp[, c("subject", "target", "score", "qvalue",
                   "subject.consensus", "target.consensus")], 10),
      row.names = FALSE)
```

Most of the strong matches will be for the AT-rich or GA-rich
low-complexity discovered motifs (DOF, BPC, and so on) rather than
for the RY element itself; the *Arabidopsis* genome is AT-skewed, and
`motif_finder()` reports every strong k-mer signal, not only the one
that drove the immunoprecipitation. It is the enrichment and
positional-bias tests in the next two sections that separate the
genuine binding signal from these incidental low-complexity hits.

For the figure we pick the RY motif and pair it with its best JASPAR
match in the B3-domain family. ABI3, FUS3, and LEC2 are the three
B3-domain TFs that JASPAR2022 has for *Arabidopsis*, and any of them
is a biologically sensible neighbour for VAL1.

```{r, fig.height=2.5, fig.width=7}
ref.names <- vapply(ath.motifs, function(m) m["name"], character(1))
ry.cmp <- cmp[cmp$subject == "RY", ]
ry.best <- ry.cmp[order(ry.cmp$qvalue), ][1, ]

top.disc <- discovered.merged[[match("RY", new.names)]]
top.ref  <- ath.motifs[[match(ry.best$target, ref.names)]]
view_motifs2(c(top.disc, top.ref), show.positions.once = FALSE,
             names.pos = "right")
```

# Validating enrichment against a composition-matched background

We can now use `enrich_motifs2()` to validate our new set of merged
motifs. It accepts an explicit `bkg.sequences` argument, so we can
hand it a background that controls for the AT-skew of plant
promoters. We already built exactly that pool back in §3 (`bkg.seqs`,
from `match_bkg()` against a universe of random non-peak genomic
windows), and so the enrichment step simply reuses it.

So we feed the matched background straight into `enrich_motifs2()`:

```{r}
enr <- enrich_motifs2(motifs        = discovered.merged,
                      sequences     = peak.seqs,
                      bkg.sequences = bkg.seqs,
                      pvalue        = 1e-4,
                      qvalue        = 0.05,
                      nthreads      = 2)
print(enr[, c("motif", "consensus", "enrichment", "qvalue")],
      row.names = FALSE)
```

With a GC-matched background, the AT-rich low-complexity motifs drop
to barely above 1-fold enrichment (since they are common in
essentially every AT-skewed plant-promoter window), while the genuine
binding motifs stand out at several fold over background. That
contrast all but disappears if you compare against a uniform shuffled
null instead, which is really why a matched background of this kind
matters so much for plant ChIP-seq.

For the alternative dinucleotide-shuffle null (`shuffle.k = 2`), or
for a single-shuffle background, see the `match_bkg()` and
shuffling sections of the
[sequence searches](SequenceSearches.pdf) vignette.

# Scanning all peaks and testing positional bias

To map hits at single-base resolution across every peak we run
`scan_sequences2()` with the discovered motifs. Because every window
is a fixed 500 bp, the motif positions are directly comparable from
one row to the next.

```{r}
hits <- scan_sequences2(discovered.merged,
                        peak.seqs,
                        pvalue         = 1e-4,
                        return.granges = FALSE,
                        nthreads       = 2)
nrow(hits)
table(motif = hits$motif)
```

`motif_peaks()` then asks, for each motif in turn, whether its hits
cluster near the centre of the peak. It is a non-parametric,
CentriMo-style test: at each window size it compares the observed
number of hits inside the window against the number we would expect
under a uniform-position null.

```{r}
pk <- motif_peaks(hits, seq.length = 500, mode = "central")
print(pk[, c("motif", "best.center", "best.window", "qvalue")],
      row.names = FALSE)
```

```{r, fig.height=3.5, fig.width=7}
plot_motif_peaks(pk)
```

A motif that drives binding should peak sharply at position 250,
the centre of the 500 bp window. The RY element does (very
strongly, since VAL1 is the immunoprecipitated factor); the G-box
/ bZIP motifs are also centrally enriched but with a
broader best window, consistent with bZIP TFs co-binding nearby
without being the primary signal. The low-complexity AT-rich
motifs are filtered out by the `motif_peaks()` default
`qvalue = 0.1` threshold.

# Motif clustering with `motif_coocc()`

`motif_coocc()` asks whether motif pairs appear together in the
same peak (and, optionally, within `max.distance` bp of each other)
more often than expected by chance. For VAL1 the interesting
question is not so much "which pair?" as "does the RY element
cluster with itself inside peaks?". B3-domain TFs often bind
cooperatively to tandem RY elements, and ChIP-seq peaks are wide
enough to accommodate several copies. We turn on `self.pairs` so
the (motif, motif) self-pairs are included:

```{r}
co <- motif_coocc(discovered.merged,
                  hits         = hits,
                  n.sequences  = length(peak.seqs),
                  max.distance = 50L,
                  self.pairs   = TRUE)
print(co[, c("motif_a", "motif_b", "both", "both.clustered",
             "median.distance", "qvalue")],
      row.names = FALSE)
```

The (`RY`, `RY`) self-pair shows how often the RY element appears
in tight clusters within a peak (`both.clustered` counts sequences
where two RY hits land within 50 bp of each other, with
`median.distance` giving the typical spacing). A high
`both.clustered` for the RY self-pair, with a small
`median.distance`, is the cooperative-binding signature: multiple
RY copies packed together at the binding locus. The cross-motif
rows give the rest of the picture: pairwise co-occurrence of the
RY motif with each low-complexity neighbour is generally weak (no
obligate cofactor), and the strongest cross-pairs are between the
low-complexity motifs themselves, reflecting plant promoter
composition rather than specific cobinding.

`motif_coocc()` reports the summary statistic, but it is well worth
plotting the actual distribution too. For every peak carrying two or
more RY hits, we compute all of the pairwise distances between the RY
hit starts, and histogram them.

```{r, fig.height=3, fig.width=7}
ry.hits  <- hits[hits$motif == "RY", ]
ry.dists <- unlist(lapply(split(ry.hits$start, ry.hits$sequence.i),
                          function(s) if (length(s) < 2L)
                            integer(0) else as.integer(dist(s))))
length(ry.dists)
ggplot2::ggplot(data.frame(distance = ry.dists),
                ggplot2::aes(x = .data$distance)) +
  ggplot2::geom_histogram(binwidth = 5, fill = "black", colour = NA) +
  ggplot2::geom_vline(xintercept = 50, linetype = "dashed",
                      colour = "grey50") +
  ggplot2::labs(x = "RY-RY distance (bp)", y = "Count") +
  ggplot2::theme_bw() +
  ggplot2::theme(panel.grid         = ggplot2::element_blank(),
                 panel.border       = ggplot2::element_blank(),
                 axis.line.x.bottom = ggplot2::element_line(colour = "black"),
                 axis.line.y.left   = ggplot2::element_line(colour = "black"))
```

The dashed line marks the 50 bp threshold used in `max.distance`
above. A heavy tail at very short spacings (under 20 bp) is the
direct signature of cooperative B3-domain binding to tandem RY
elements.

The cross-motif case repeats the same calculation, this time
measuring the distance from each RY hit to every hit of every other
motif in the same peak. A pile-up at short distances would point to a
TF whose binding co-localises with RY, whereas a flat distribution
would indicate only incidental co-occurrence, driven by nothing more
than both motifs being common across the peak set.

```{r, fig.height=5, fig.width=7}
ry.by.seq <- split(ry.hits$start, ry.hits$sequence.i)
other.motifs <- setdiff(unique(hits$motif), "RY")
cross.dists <- do.call(rbind, lapply(other.motifs, function(om) {
  om.hits   <- hits[hits$motif == om, ]
  om.by.seq <- split(om.hits$start, om.hits$sequence.i)
  shared    <- intersect(names(ry.by.seq), names(om.by.seq))
  d <- unlist(lapply(shared, function(s)
    abs(as.integer(outer(ry.by.seq[[s]], om.by.seq[[s]], "-")))))
  if (!length(d)) return(NULL)
  data.frame(other = om, distance = d, stringsAsFactors = FALSE)
}))
ggplot2::ggplot(cross.dists, ggplot2::aes(x = .data$distance)) +
  ggplot2::geom_histogram(binwidth = 10, fill = "black", colour = NA) +
  ggplot2::geom_vline(xintercept = 50, linetype = "dashed",
                      colour = "grey50") +
  ggplot2::facet_wrap(~ other, scales = "free_y") +
  ggplot2::labs(x = "RY-to-other distance (bp)", y = "Count") +
  ggplot2::theme_bw() +
  ggplot2::theme(panel.grid         = ggplot2::element_blank(),
                 panel.border       = ggplot2::element_blank(),
                 strip.background   = ggplot2::element_rect(fill = NA,
                                                            colour = NA),
                 axis.line.x.bottom = ggplot2::element_line(colour = "black"),
                 axis.line.y.left   = ggplot2::element_line(colour = "black"))
```

None of the cross-motif panels shows the dramatic lone 0-5 bp spike
that the (RY, RY) self-pair did. The AT-rich and GA-repeat panels do
show a slightly raised abundance within the 50 bp window (which might
hint at some relationship between the RY motifs and structural
elements), though they also carry plenty of hits well outside it. The
G_box and bZIP panels, by comparison, do show a concentration of hits
at short distances, which hints at genuine colocalisation between RY
and the bZIP-family motifs; this is biologically quite plausible for
seed-maturation regulatory regions, where RY and G-box elements are
known to co-occur. The overall impression, then, is that RY's
strongest signal is its own self-clustering, with only a secondary
and much milder bZIP / G-box neighbourship alongside it.

# Summary table

The final step is to pull everything together into one compact ranked
summary, with a single row per discovered motif that gathers up the
per-motif results (its best JASPAR match, its enrichment q, its
positional q) and adds a "best partner" (excluding self) column drawn
from the co-occurrence table.

```{r}
disc.names <- new.names

best.match <- vapply(disc.names, function(nm) {
  i <- which(cmp$subject == nm)
  if (!length(i)) NA_character_ else cmp$target[i[1]]
}, character(1))

enrich.q <- vapply(disc.names, function(nm) {
  i <- which(enr$motif == nm)
  if (!length(i)) NA_real_ else enr$qvalue[i[1]]
}, numeric(1))

pos.q <- vapply(disc.names, function(nm) {
  i <- which(pk$motif == nm)
  if (!length(i)) NA_real_ else pk$qvalue[i[1]]
}, numeric(1))

best.partner <- vapply(disc.names, function(nm) {
  ix <- which((co$motif_a == nm | co$motif_b == nm) &
              co$motif_a != co$motif_b)
  if (!length(ix)) return(NA_character_)
  ix <- ix[which.min(co$qvalue[ix])]
  if (co$motif_a[ix] == nm) co$motif_b[ix] else co$motif_a[ix]
}, character(1))

summary.tbl <- data.frame(
  motif         = disc.names,
  jaspar.match  = best.match,
  enrich.q      = signif(enrich.q, 3),
  positional.q  = signif(pos.q, 3),
  best.partner  = best.partner,
  stringsAsFactors = FALSE
)
rownames(summary.tbl) <- NULL
print(summary.tbl, row.names = FALSE)
```

Taking the combined evidence together, the analysis has clearly
recovered the dominant VAL1 RY motif, along with a couple of
secondary motifs that may well be binding sites for co-regulating
TFs. The remaining repetitive motifs, although they are significantly
enriched, fail the positional test, and are most likely incidental
structural features of the kinds of sequences that VAL1 tends to
bind.

# Session info

```{r}
sessionInfo()
```

# References