--- title: "Building a curated motif database from multiple sources" shorttitle: "Curated motif databases" date: 25 May 2026 author: - Benjamin Jean-Marie Tremblay^[benjamin.tremblay@uwaterloo.ca] bibliography: universalmotif.bib abstract: > In practice, a motif scan rarely relies on a single public database. This vignette walks through curating one consolidated motif set from several public sources, using *Homo sapiens* motifs pulled out of three `MotifDb` sub-collections (JASPAR2022, HOCOMOCOv11-core, CIS-BP) as the working example. We import, inspect cross-source redundancy with `compare_motifs2()`, deduplicate via graph-based clustering in `merge_similar2()`, trim low-information flanks with `trim_motifs()`, annotate the survivors with the existing metadata via `to_df()` / `update_motifs()`, and export the whole curated library as a single MEME file ready for downstream use. vignette: > %\VignetteIndexEntry{Building a curated motif database} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: bookdown::pdf_document2 header-includes: - \usepackage{float} - \floatplacement{figure}{H} --- ```{r setup, echo=FALSE, message=FALSE, warning=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.pos = "H") suppressPackageStartupMessages({ library(universalmotif) library(ggplot2) }) has_pkgs <- suppressWarnings(suppressMessages(vapply( c("MotifDb", "ape", "ggtree"), requireNamespace, logical(1), quietly = TRUE))) if (!all(has_pkgs)) { knitr::opts_chunk$set(eval = FALSE) message("Skipping curated-DB vignette: required Suggests packages ", "are not installed (", paste(names(has_pkgs)[!has_pkgs], collapse = ", "), ").") } ``` # Introduction No single public motif database covers every transcription factor in every organism, and no single source is perfectly consistent: each ships its own naming conventions, its own matrix-type conventions (PCM, PPM, or PWM), and its own intra-source redundancies (several matrices for the same factor, often with slight quality differences). Anyone doing applied motif scanning sooner or later ends up pulling from several sources and reconciling them more or less by hand. This vignette gathers the building blocks that `universalmotif` provides for that reconciliation into a single end-to-end recipe, demonstrated here on a few hundred *Homo sapiens* motifs drawn from three `MotifDb` sub-collections. The previous vignettes cover the underlying processes in isolation: for single-motif import and export, see the [motif manipulation](MotifManipulation.pdf) vignette; for the mechanics of motif-vs-motif scoring and p-value calculation, see the [comparisons and p-values](MotifComparisonAndPvalues.pdf) vignette; for an end-to-end ChIP-seq analysis that would use a curated motif library, see the [ChIP-seq workflow](ChIPseqWorkflow.pdf) vignette. This vignette sits one step upstream of the ChIP-seq one: it produces the kind of consolidated MEME file that a real analyst would feed into `scan_sequences2()` or `enrich_motifs2()`. The recipe proceeds in eight steps: 1. Pull motifs from three sources via `MotifDb`. 2. Mark the originating source onto each motif. 3. Inspect intra- and inter-source redundancy with `compare_motifs2()`. 4. Deduplicate via graph-based clustering with `merge_similar2()`. 5. Visualise the cluster structure with `motif_tree2()`. 6. Trim uninformative flanks with `trim_motifs()`. 7. Annotate the survivors with metadata via `to_df()` / `update_motifs()`. 8. Export the curated set as a single MEME file with `write_meme()`. # Loading motifs from several sources `MotifDb` ships a large indexed collection that has already been parsed from a wide variety of source formats. For a single demonstration we narrow it to *Homo sapiens* and pull three sub-collections that, between them, cover a fair slice of human TFs: `jaspar2022`, `HOCOMOCOv11-core-A`, and `cisbp_1.02`. ```{r, message=FALSE, warning=FALSE} suppressPackageStartupMessages(library(MotifDb)) hs <- suppressWarnings(query(MotifDb, "hsapiens")) sources <- c("jaspar2022", "HOCOMOCOv11-core-A", "cisbp_1.02") per.source <- suppressWarnings(lapply(sources, function(s) convert_motifs(query(hs, s)))) names(per.source) <- sources sapply(per.source, length) ``` We annotate each motif's `family` slot with the originating source. Doing this here, before the lists are concatenated, means the source survives every subsequent operation (clustering, trimming, MEME export) and gives `motif_tree2()` a per-motif colouring column to use later. `family` is the natural slot for this because `MotifDb` rarely populates it, and because `motif_tree2(linecol = "family")` is the function's default. ```{r} for (s in sources) { for (i in seq_along(per.source[[s]])) per.source[[s]][[i]]["family"] <- s } all.motifs <- do.call(c, per.source) length(all.motifs) ``` As a quick visual sanity check, we can plot the per-source counts: ```{r, fig.height=2.5, fig.width=6} src.df <- data.frame(source = sources, count = sapply(per.source, length), stringsAsFactors = FALSE) ggplot(src.df, aes(x = .data$source, y = .data$count)) + geom_col(fill = "black") + geom_text(aes(label = .data$count), vjust = -0.4) + labs(x = NULL, y = "Number of motifs") + theme_bw() + theme(panel.grid = element_blank(), panel.border = element_blank(), axis.line.x.bottom = element_line(colour = "black"), axis.line.y.left = element_line(colour = "black")) ``` ## Sidebar: importing raw files directly The rest of this vignette assumes the motifs were sourced through `MotifDb`. If instead you have raw downloaded files sitting on disk, the import step is just a one-liner per format, and the rest of the workflow is identical. The chunk below is purely for illustration. ```{r, eval=FALSE} jaspar.motifs <- read_jaspar("JASPAR2024_CORE_vertebrates.txt") hocomoco.motifs <- read_meme("HOCOMOCOv11_core_HUMAN_mono_meme.txt") cisbp.motifs <- read_cisbp("CIS-BP_2.00_Homo_sapiens.txt") transfac.motifs <- read_transfac("transfac_matrix.dat") uniprobe.motifs <- read_uniprobe("uniprobe_results.txt") all.motifs <- c(jaspar.motifs, hocomoco.motifs, cisbp.motifs, transfac.motifs, uniprobe.motifs) ``` # Inspecting cross-source redundancy Before deduplicating, it is worth quantifying how much redundancy is actually in the combined set. `compare_motifs2()` in matrix mode returns a square q-value matrix (one q-value per ordered pair, after multiple-testing correction). With several hundred motifs the result is a `r format(length(all.motifs), big.mark = ",")`-row matrix that holds every pairwise q-value. Two threads finish the comparison in about four seconds on my laptop. ```{r} qmat <- compare_motifs2(all.motifs, matrix.out = "qvalue", nthreads = 2) dim(qmat) qvals <- qmat[upper.tri(qmat)] length(qvals) summary(qvals) ``` The pairwise q-value distribution is the visual cue for picking a sensible deduplication threshold. On a log10 axis it spans many orders of magnitude, with a long left tail of genuinely redundant pairs (within-TF or cross-source duplicates) and a dense pile of unrelated pairs heaped at q = 1. ```{r, fig.height=3, fig.width=6} log10q <- log10(pmax(qvals, .Machine$double.xmin)) ggplot(data.frame(log10q = log10q), aes(x = .data$log10q)) + geom_histogram(binwidth = 0.5, fill = "black", colour = NA) + geom_vline(xintercept = log10(1e-6), linetype = "dashed", colour = "red") + labs(x = expression(log[10] * "(q-value) of pairwise comparison"), y = "Number of pairs") + theme_bw() + theme(panel.grid = element_blank(), panel.border = element_blank(), axis.line.x.bottom = element_line(colour = "black"), axis.line.y.left = element_line(colour = "black")) ``` The dashed line marks q = 1e-6, the threshold we will use for the deduplication in the next section. Picking that threshold is something of a judgement call. Set it too loose and graph transitivity starts collapsing unrelated motifs into giant components (with this dataset, a threshold of q = 1e-3 produces a single large cluster of hundreds of motifs, swallowing much of the zinc-finger and homeodomain families, because their pairwise scores share just enough small q-values to form one connected component). Set it too tight and very few motifs are merged at all. For our purposes here, q = 1e-6 seems to strike a reasonable balance, collapsing a good deal of redundancy without bridging unrelated families. ```{r} mean(qvals < 1e-6) sum(qvals < 1e-6) ``` # Deduplicating via graph clustering `merge_similar2()` builds a graph whose edges link any two motifs that pass the q-value threshold, finds the connected components, and returns one representative motif per cluster. We use it twice: once with `return.clusters = TRUE` to get the per-motif cluster assignments (so we can confirm them visually), and once without, to get the final merged set of motifs. ```{r} clusters <- merge_similar2(all.motifs, qvalue = 1e-6, return.clusters = TRUE, nthreads = 2) ``` To judge whether the q-value ought to be tightened further, we can take a look at the largest cluster: ```{r, fig.height=6, fig.width=7} big.cluster <- names(sort(table(clusters$cluster), decreasing = TRUE)[1]) big.cluster.motifs <- all.motifs[ clusters$motif.i[clusters$cluster == big.cluster]] if (length(big.cluster.motifs) > 10) big.cluster.motifs <- big.cluster.motifs[1:10] view_motifs2(big.cluster.motifs, names.pos = "right", show.positions = FALSE) ``` It is well worth taking the time to go carefully back and forth between examining the clusters and adjusting the q-value threshold, since the final merge is really the most consequential decision in curating a motif database. In our case we will carry on with this threshold. ```{r} merged <- merge_similar2(all.motifs, qvalue = 1e-6, nthreads = 2) length(all.motifs) length(merged) ``` `merge_similar2()` autogenerates compound names for the merged motifs by joining every contributing name with `+`, which again becomes unwieldy once several motifs collapse into a single cluster. A more compact alternative is to keep the first one or two contributing names and roll the rest up into a count suffix, so that the lineage stays visible without overflowing either the table or the MEME header. Singletons simply keep their original name: ```{r} contrib.names <- tapply(clusters$name, clusters$cluster, function(n) as.character(n)) for (i in seq_along(merged)) { nms <- contrib.names[[i]] if (length(nms) == 2L) merged[[i]]["name"] <- paste(nms, collapse = "+") else if (length(nms) >= 3L) merged[[i]]["name"] <- sprintf("%s+%s+%dmore", nms[1], nms[2], length(nms) - 2L) } ## `clusters` is a universalmotif_df: one row per input motif, with ## the motif objects in the `motif` column plus the `motif.i` and ## `cluster` bookkeeping columns. Show just the assignment columns: print(head(as.data.frame(clusters)[, c("name", "motif.i", "cluster")], 8), row.names = FALSE) ``` The reduction is substantial, with roughly a third of the input motifs collapsing into shared clusters. The distribution of cluster sizes shows the long tail of singletons (motifs unique enough to stay on their own), together with a moderate body of size-2 and size-3 clusters (the common within-TF and cross-source redundancies): ```{r, fig.height=3, fig.width=6} cluster.sizes <- table(clusters$cluster) size.dist <- as.data.frame(table(cluster.sizes), stringsAsFactors = FALSE) names(size.dist) <- c("cluster.size", "n.clusters") size.dist$cluster.size <- as.integer(size.dist$cluster.size) ggplot(size.dist, aes(x = factor(.data$cluster.size), y = .data$n.clusters)) + geom_col(fill = "black") + geom_text(aes(label = .data$n.clusters), vjust = -0.4, size = 3) + labs(x = "Cluster size (number of input motifs)", y = "Number of clusters") + theme_bw() + theme(panel.grid = element_blank(), panel.border = element_blank(), axis.line.x.bottom = element_line(colour = "black"), axis.line.y.left = element_line(colour = "black")) ``` A cluster drawing on two or three different sources is direct evidence of cross-database redundancy, which is precisely the kind of duplication that a single-source library would quietly miss. We can tally the sources per cluster: ```{r} clusters$source <- vapply(all.motifs[clusters$motif.i], function(m) m["family"], character(1)) sources.per.cluster <- tapply(clusters$source, clusters$cluster, function(s) length(unique(s))) table(sources.per.cluster) ``` Many clusters contain motifs from more than one source; those multi-source clusters are exactly the cross-database redundancies that this whole workflow exists to collapse. # Visualising the cluster structure `motif_tree2()` builds a hierarchical tree from the pairwise comparison and renders it via `ggtree`. On all `r format(length(all.motifs), big.mark = ",")` motifs the labels would be too crowded to read, so we plot a random 200-motif subsample instead, coloured by source. The clades typically mix all three sources, which mirrors the multi-source cluster finding from above. ```{r, fig.height=6, fig.width=6, warning=FALSE, message=FALSE} set.seed(2026) sub.idx <- sort(sample(length(all.motifs), 200)) sub <- all.motifs[sub.idx] clean.names <- make.unique(gsub("[^A-Za-z0-9_]", "_", vapply(sub, function(m) m["name"], character(1)))) for (i in seq_along(sub)) sub[[i]]["name"] <- clean.names[i] motif_tree2(sub, linecol = "family", labels = "none", layout = "circular", progress = FALSE) + ggplot2::scale_color_discrete(breaks = sources) + ggplot2::theme(legend.position = "bottom", legend.title = ggplot2::element_blank()) ``` # Trimming uninformative flanks PWM-style downstream scanning is sensitive to long low-information flanks, which dilute scores and inflate match counts. `trim_motifs()` removes columns whose information content falls below `min.ic` (default 0.25 bits) from the edges. The default "both" mode trims from both ends. For a clean demonstration we pick the merged motif that loses the most columns to trimming. Computing the trim delta directly (the width before, minus the width after) is rather more reliable than trying to guess from the flank IC, because `trim_motifs()` walks inward from the edges and stops at the first column above `min.ic`, which is not always the column the eye would have picked out as "low IC". ```{r, fig.height=2.5, fig.width=7} trim.delta <- vapply(merged, function(m) ncol(m["motif"]) - ncol(trim_motifs(m)["motif"]), integer(1)) trim.candidate.idx <- which.max(trim.delta) trim.before <- merged[[trim.candidate.idx]] trim.before["name"] <- paste0(trim.before["name"], " (untrimmed)") trim.after <- trim_motifs(merged[[trim.candidate.idx]]) trim.after["name"] <- paste0(trim.after["name"], " (trimmed)") view_motifs2(list(trim.before, trim.after), names.pos = "right") ``` We then apply the trim to every merged motif before exporting: ```{r} curated <- trim_motifs(merged) range(sapply(merged, function(m) ncol(m["motif"]))) range(sapply(curated, function(m) ncol(m["motif"]))) ``` The width range shrinks, as expected: the merge step often pads clusters out with low-confidence flanking columns wherever the input motifs disagreed, and `trim_motifs()` is the natural way to clean those back up. # Adding the metadata back For a curated library to be genuinely useful, users will want to know, for each final motif, which sources contributed to it. The `universalmotif_df` interface (`to_df()` returns a data-frame view, and `update_motifs()` writes columns back into the motif list) is the natural way to add per-motif annotations in a tidy fashion. We compute the contributing-source list per cluster, then attach it to the curated motifs together with a curation date: ```{r} src.by.cluster <- tapply(clusters$source, clusters$cluster, function(s) paste(sort(unique(s)), collapse = ";")) n.by.cluster <- tapply(clusters$source, clusters$cluster, length) curated.df <- to_df(curated, extrainfo = TRUE) curated.df$dataSource <- NULL # MotifDb's single-source field; # superseded by the multi-source # db.sources column we are about # to add curated.df$db.sources <- as.character(src.by.cluster) curated.df$n.contributors <- as.integer(n.by.cluster) curated.df$curation.date <- "2026-05-25" print(as.data.frame(head(curated.df[, c("name", "db.sources", "n.contributors", "curation.date")], 10)), row.names = FALSE) ``` We round-trip back into motif objects with `update_motifs()`. The new columns land in the `extrainfo` slot of each motif, and are then preserved everywhere that `universalmotif` reads or writes its native representation. ```{r} curated <- to_list(update_motifs(curated.df), extrainfo = TRUE) curated[[1]]["extrainfo"] ``` The MEME format has no slot for arbitrary key-value metadata, so how much of the `extrainfo` survives is really limited by what the output format itself can carry. For analyses where that metadata is critical, my suggestion would be to write the curated library out as a `universalmotif_df` RDS file (which preserves everything) and treat the MEME export only as the format that consumers like FIMO and the MEME suite happen to need. # Export as a single MEME file `write_meme()` writes the whole motif list out to a single file. The MEME header carries the `name`, `altname`, `nsites`, and `eval` of each motif, plus a global background. ```{r} out.file <- tempfile(fileext = ".meme") write_meme(curated, file = out.file, overwrite = TRUE) file.info(out.file)$size ``` The first few lines of the file show the standard MEME preamble and the start of the first motif block: ```{r} cat(head(readLines(out.file), 25), sep = "\n") ``` We load it back with `read_meme()` to confirm that the file parses cleanly and that the length matches: ```{r} re.in <- read_meme(out.file) length(re.in) length(curated) identical(sapply(re.in, function(m) m["name"]), sapply(curated, function(m) m["name"])) ``` The names match, the motif matrices match, and the file is now ready for any downstream tool that reads MEME-format motif collections (FIMO, AME, CentriMo, or `read_meme()` itself for the next analysis). # Session info ```{r} sessionInfo() ``` # References