--- title: "Sequence manipulation and scanning" shorttitle: "Sequence utilities" date: 17 October 2021 author: - Benjamin Jean-Marie Tremblay^[benjamin.tremblay@uwaterloo.ca] bibliography: universalmotif.bib abstract: > Sequences stored as XStringSet objects (from the Biostrings package) can be used by several functions in the universalmotif package. These functions are demonstrated here and fall into two categories: sequence manipulation and motif scanning. Sequences can be generated, shuffled, and background frequencies of any order calculated. Scanning can be done simply to find locations of motif hits above a certain threshold, or to find instances of enriched motifs. vignette: > %\VignetteIndexEntry{Sequence manipulation and scanning} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: bookdown::pdf_document2 --- ```{r setup, echo=FALSE} knitr::opts_chunk$set(collapse=TRUE, comment = "#>") suppressPackageStartupMessages(library(universalmotif)) suppressPackageStartupMessages(library(Biostrings)) suppressPackageStartupMessages(library(GenomicRanges)) suppressPackageStartupMessages(library(ggbio)) data(ArabidopsisPromoters) data(ArabidopsisMotif) options(Biostrings.coloring = FALSE) ``` # Introduction This vignette goes through generating your own sequences from a specified background model, shuffling sequences whilst maintaining a certain `k`-let size, and the scanning of sequences and scoring of motifs. For an introduction to sequence motifs, see the [introductory](IntroductionToSequenceMotifs.pdf) vignette. For a basic overview of available motif-related functions, see the [motif manipulation](MotifManipulation.pdf) vignette. For a discussion on motif comparisons and P-values, see the [motif comparisons and P-values](MotifComparisonAndPvalues.pdf) vignette. # Basic sequence handling ## Creating random sequences The `Biostrings` package offers an excellent suite of functions for dealing with biological sequences. The `universalmotif` package hopes to help extend these by providing the `create_sequences()` and `shuffle_sequences()` functions. The first of these, `create_sequences()`, generates a set of letters in random order, then passes these strings to the `Biostrings` package to generate the final `XStringSet` object. The number and length of sequences can be specified. The probabilities of individual letters can also be set. The `freqs` option of `create_sequences()` also takes higher order backgrounds. In these cases the sequences are constructed in a Markov-style manner, where the probability of each letter is based on which letters precede it. ```{r} library(universalmotif) library(Biostrings) ## Create some DNA sequences for use with an external program (default ## is DNA): sequences.dna <- create_sequences(seqnum = 500, freqs = c(A=0.3, C=0.2, G=0.2, T=0.3)) ## writeXStringSet(sequences.dna, "dna.fasta") sequences.dna ## Amino acid: create_sequences(alphabet = "AA") ## Any set of characters can be used create_sequences(alphabet = paste0(letters, collapse = "")) ``` ## Calculating sequence background Sequence backgrounds can be retrieved for DNA and RNA sequences with `oligonucleotideFrequency()` from `"Biostrings`. Unfortunately, no such `Biostrings` function exists for other sequence alphabets. The `universalmotif` package proves `get_bkg()` to remedy this. Similarly, the `get_bkg()` function can calculate higher order backgrounds for any alphabet as well. It is recommended to use the original `Biostrings` for very long (e.g. billions of characters) DNA and RNA sequences whenever possible though, as it is much faster than `get_bkg()`. ```{r} library(universalmotif) ## Background of DNA sequences: dna <- create_sequences() get_bkg(dna, k = 1:2) ## Background of non DNA/RNA sequences: qwerty <- create_sequences("QWERTY") get_bkg(qwerty, k = 1:2) ``` ## Clustering sequences by k-let composition One way to compare sequences is by k-let composition. The following example illustrates how one could go about doing this using only the `universalmotif` package and base graphics. ```{r,fig.height=4.5,fig.width=4.5} library(universalmotif) ## Generate three random sets of sequences: s1 <- create_sequences(seqnum = 20, freqs = c(A = 0.3, C = 0.2, G = 0.2, T = 0.3)) s2 <- create_sequences(seqnum = 20, freqs = c(A = 0.4, C = 0.4, G = 0.1, T = 0.1)) s3 <- create_sequences(seqnum = 20, freqs = c(A = 0.2, C = 0.3, G = 0.3, T = 0.2)) ## Create a function to get properly formatted k-let counts: get_klet_matrix <- function(seqs, k, groupName) { bkg <- get_bkg(seqs, k = k, merge.res = FALSE) bkg <- bkg[, c("sequence", "klet", "count")] bkg <- reshape(bkg, idvar = "sequence", timevar = "klet", direction = "wide") suppressWarnings(as.data.frame(cbind(Group = groupName, bkg))) } ## Calculate k-let content (up to you what size k you want!): s1 <- get_klet_matrix(s1, 4, 1) s2 <- get_klet_matrix(s2, 4, 2) s3 <- get_klet_matrix(s3, 4, 3) # Combine everything into a single object: sAll <- rbind(s1, s2, s3) ## Do the PCA: sPCA <- prcomp(sAll[, -(1:2)]) ## Plot the PCA: plot(sPCA$x, col = c("red", "forestgreen", "blue")[sAll$Group], pch = 19) ``` This example could be improved by using `tidyr::spread()` instead of `reshape()` (the former is much faster), and plotting the PCA using the `ggfortify` package to create a nicer `ggplot2` plot. Feel free to play around with different ways of plotting the data! Additionally, you could even try using t-SNE instead of PCA (such as via the `Rtsne` package). # Shuffling ## Shuffling sequences When performing _de novo_ motif searches or motif enrichment analyses, it is common to do so against a set of background sequences. In order to properly identify consistent patterns or motifs in the target sequences, it is important that there be maintained a certain level of sequence composition between the target and background sequences. This reduces results which are derived purely from base differential letter frequency biases. In order to avoid these results, typically it desirable to use a set of background sequences which preserve a certain `k`-let size (such as dinucleotide or trinucleotide frequencies in the case of DNA sequences). Though for some cases a set of similar sequences may already be available for use as background sequences, usually background sequences are obtained by shuffling the target sequences, while preserving a desired `k`-let size. For this purpose, a commonly used tool is `uShuffle` [@ushuffle]. The `universalmotif` package aims to provide its own `k`-let shuffling capabilities for use within `R` via `shuffle_sequences()`. The `universalmotif` package offers three different methods for sequence shuffling: `euler`, `markov` and `linear`. The first method, `euler`, can shuffle sequences while preserving any desired `k`-let size. Furthermore 1-letter counts will always be maintained. However due to the nature of the method, the first and last letters will remain unshuffled. This method is based on the initial random Eulerian walk algorithm proposed by @markovmodel2 and the subsequent cycle-popping algorithm detailed by @eulerAlgo for quickly and efficiently finding Eulerian walks. The second method, `markov` can only guarantee that the approximate `k`-let frequency will be maintained, but not that the original letter counts will be preserved. The `markov` method involves determining the original `k`-let frequencies, then creating a new set of sequences which will have approximately similar `k`-let frequency. As a result the counts for the individual letters will likely be different. Essentially, it involves a combination of determining `k`-let frequencies followed by `create_sequences()`. This type of pseudo-shuffling is discussed by @markovmodel. The third method `linear` preserves the original 1-letter counts exactly, but uses a more crude shuffling technique. In this case the sequence is split into sub-sequences every `k`-let (of any size), which are then re-assembled randomly. This means that while shuffling the same sequence multiple times with `method = "linear"` will result in different sequences, they will all have started from the same set of `k`-length sub-sequences (just re-assembled differently). ```{r} library(universalmotif) library(Biostrings) data(ArabidopsisPromoters) ## Potentially starting off with some external sequences: # ArabidopsisPromoters <- readDNAStringSet("ArabidopsisPromoters.fasta") euler <- shuffle_sequences(ArabidopsisPromoters, k = 2, method = "euler") markov <- shuffle_sequences(ArabidopsisPromoters, k = 2, method = "markov") linear <- shuffle_sequences(ArabidopsisPromoters, k = 2, method = "linear") k1 <- shuffle_sequences(ArabidopsisPromoters, k = 1) ``` Let us compare how the methods perform: ```{r} o.letter <- get_bkg(ArabidopsisPromoters, 1) e.letter <- get_bkg(euler, 1) m.letter <- get_bkg(markov, 1) l.letter <- get_bkg(linear, 1) data.frame(original=o.letter$count, euler=e.letter$count, markov=m.letter$count, linear=l.letter$count, row.names = DNA_BASES) o.counts <- get_bkg(ArabidopsisPromoters, 2) e.counts <- get_bkg(euler, 2) m.counts <- get_bkg(markov, 2) l.counts <- get_bkg(linear, 2) data.frame(original=o.counts$count, euler=e.counts$count, markov=m.counts$count, linear=l.counts$count, row.names = get_klets(DNA_BASES, 2)) ``` ## Local shuffling If you have a fairly heterogeneous sequence and wish to preserve the presence of local "patches" of differential sequence composition, you can set `window = TRUE` in the `shuffle_sequences()` function. In the following example, the sequence of interest has an AT rich first half followed by a second half with an even background. The impact on this specific sequence composition is observed after regular and local shuffling, using the per-window functionality of `get_bkg()` (via `window = TRUE`). Fine-tune the window size and overlap between windows with `window.size` and `window.overlap`. ```{r} library(Biostrings) library(universalmotif) library(ggplot2) myseq <- DNAStringSet(paste0( create_sequences(seqlen = 500, freqs = c(A=0.4, T=0.4, C=0.1, G=0.1)), create_sequences(seqlen = 500) )) myseq_shuf <- shuffle_sequences(myseq) myseq_shuf_local <- shuffle_sequences(myseq, window = TRUE) myseq_bkg <- get_bkg(myseq, k = 1, window = TRUE) myseq_shuf_bkg <- get_bkg(myseq_shuf, k = 1, window = TRUE) myseq_shuf_local_bkg <- get_bkg(myseq_shuf_local, k = 1, window = TRUE) myseq_bkg$group <- "original" myseq_shuf_bkg$group <- "shuffled" myseq_shuf_local_bkg$group <- "shuffled-local" myseq_all <- suppressWarnings(as.data.frame( rbind(myseq_bkg, myseq_shuf_bkg, myseq_shuf_local_bkg) )) ggplot(myseq_all, aes(x = start, y = probability, colour = klet)) + geom_line() + theme_minimal() + scale_colour_manual(values = universalmotif:::DNA_COLOURS) + xlab(element_blank()) + facet_wrap(~group, ncol = 1) ``` # Sequence scanning and enrichment There are many motif-programs available with sequence scanning capabilities, such as [HOMER](http://homer.ucsd.edu/homer/index.html) and tools from the [MEME suite](http://meme-suite.org/). The `universalmotif` package does not aim to supplant these, but rather provide convenience functions for quickly scanning a few sequences without needing to leave the `R` environment. Furthermore, these functions allow for taking advantage of the higher-order (`multifreq`) motif format described here. Two scanning-related functions are provided: `scan_sequences()` and `enrich_motifs()`. The latter simply runs `scan_sequences()` twice on a set of target and background sequences. Given a motif of length `n`, `scan_sequences()` considers every possible `n`-length subset in a sequence and scores it using the PWM format. If the match surpasses the minimum threshold, it is reported. This is case regardless of whether one is scanning with a regular motif, or using the higher-order (`multifreq`) motif format (the `multifreq` matrix is converted to a PWM). ## Choosing a logodds threshold Before scanning a set of sequences, one must first decide the minimum logodds threshold for retrieving matches. This decision is not always the same between scanning programs out in the wild, nor is it usually told to the user what the cutoff is or how it is decided. As a result, `universalmotif` aims to be as transparent as possible in this regard by allowing for complete control of the threshold. For more details on PWMs, see the [introductory](IntroductionToSequenceMotifs.pdf) vignette. ### Logodds thresholds {.unnumbered} One way is to set a cutoff between 0 and 1, then multiplying the highest possible PWM score to get a threshold. The `matchPWM()` function from the `Biostrings` package for example uses a default of 0.8 (shown as `"80%"`). This is quite arbitrary of course, and every motif will end up with a different threshold. For high information content motifs, there is really no right or wrong threshold, as they tend to have fewer non-specific positions. This means that incorrect letters in a match will be more punishing. To illustrate this, contrast the following PWMs: ```{r} library(universalmotif) m1 <- create_motif("TATATATATA", nsites = 50, type = "PWM", pseudocount = 1) m2 <- matrix(c(0.10,0.27,0.23,0.19,0.29,0.28,0.51,0.12,0.34,0.26, 0.36,0.29,0.51,0.38,0.23,0.16,0.17,0.21,0.23,0.36, 0.45,0.05,0.02,0.13,0.27,0.38,0.26,0.38,0.12,0.31, 0.09,0.40,0.24,0.30,0.21,0.19,0.05,0.30,0.31,0.08), byrow = TRUE, nrow = 4) m2 <- create_motif(m2, alphabet = "DNA", type = "PWM") m1["motif"] m2["motif"] ``` In the first example, sequences which do not have a matching base in every position are punished heavily. The maximum logodds score in this case is approximately 20, and for each incorrect position the score is reduced approximately by 5.7. This means that a threshold of zero would allow for at most three mismatches. At this point, it is up to you how many mismatches you would deem appropriate. ### P-values {.unnumbered} This thinking becomes impossible for the second example. In this case, mismatches are much less punishing, to the point that one could ask: what even constitutes a mismatch? The answer to this question is usually much more difficult in such cases. An alternative to manually deciding upon a threshold is to instead start with maximum P-value one would consider appropriate for a match. If, say, we want matches with a P-value of at most 0.001, then we can use `motif_pvalue()` to calculate the appropriate threshold (see the [comparisons and P-values](MotifComparisonAndPvalues.pdf) vignette for details on motif P-values). ```{r} motif_pvalue(m2, pvalue = 0.001) ``` ### Multiple testing-corrected P-values {.unnumbered} This P-value can be further refined to correct for multiple testing (and becomes a Q-value). There are three available corrections that can be set in `scan_sequences()`: Bonferroni ("bonferroni"), Benjamini & Hochberg ("BH"), and the false discovery rate ("fdr") based on the empirical null distribution of motif hits in a set of sequences. They are excellently explained in @noble09, and these explanations will be briefly regurgitated here. To begin to understand how these different corrections are implemented, consider the following motif, sequences, example P-value for an example motif hit, and the theoretical maximum number of motif hits: ```{r} library(universalmotif) data(ArabidopsisMotif) data(ArabidopsisPromoters) (Example.Score <- score_match(ArabidopsisMotif, "TTCTCTTTTTTTTTT")) (Example.Pvalue <- motif_pvalue(ArabidopsisMotif, Example.Score)) (Max.Possible.Hits <- sum(width(ArabidopsisPromoters) - ncol(ArabidopsisMotif) + 1)) ``` The first correction method, Bonferroni, is by far the simplest. To calculate it, take the P-value of a motif hit and multiply it by the theoretical maximum number of hits: ```{r} (Example.bonferroni <- Example.Pvalue * Max.Possible.Hits) ``` As you can imagine, the level of punishment the P-value receives corresponds to the size of the sequences you are scanning. If you are scanning an entire genome, then you can expect this to be very punishing and only return near-perfect matches (or no matches). However for smaller sets of sequences this correction can be more appropriate. Next, Benjamini & Hochberg. To perform this correction, the P-value is divided by the percentile rank of the P-value in the list of P-values for all theoretically possible hits sorted in ascending order (it also assumes that P-values are normally distributed under the null hypothesis). It is important to note that this means the correction cannot be calculated before the sequences have been scanned for the motif, and P-values have been calculated for all returned hits. When requesting this type of Q-value for the minimum threshold of score, `scan_sequences()` instead calculates the threshold from the input Q-value as a P-value, then filters the final results after Q-values have been calculated. Returning to our example: ```{r} (Scan.Results <- scan_sequences(ArabidopsisMotif, ArabidopsisPromoters, threshold = 0.8, threshold.type = "logodds", calc.qvals = FALSE)) ``` First we sort and calculate the percentile ranks of our P-values, and then divide the P-values: ```{r} Pvalues <- Scan.Results$pvalue Pvalues.Ranks <- (rank(Pvalues) / Max.Possible.Hits) * 100 Qvalues.BH <- Pvalues / Pvalues.Ranks (Example.BH <- Qvalues.BH[Scan.Results$match == "TTCTCTTTTTTTTTT"][1]) ``` Finally, calculating the false discovery rate from the empirical distribution of scores. This method requires some additional steps, as we must obtain the observed and null distributions of hits in our sequences. Then for each hit, divide the number of hits with a score equal to or greater in the null distribution with the number of hits with a score equal to or greater in the observed distribution. Along the way we must be wary of the nonmonotonicity of the final Q-values (meaning that as scores get smaller the Q-value does not always increase), and thus always select the minimum available Q-value as the score increases. To get the null distribution of hits, we can simply use the P-values associated with each score as these are analytically calculated from the null based on the background probabilities (see `?motif_pvalue`). ```{r} Scan.Results <- Scan.Results[order(Scan.Results$score, decreasing = TRUE), ] Observed.Hits <- 1:nrow(Scan.Results) Null.Hits <- Max.Possible.Hits * Scan.Results$pvalue Qvalues.fdr <- Null.Hits / Observed.Hits Qvalues.fdr <- rev(cummin(rev(Qvalues.fdr))) (Example.fdr <- Qvalues.fdr[Scan.Results$match == "TTCTCTTTTTTTTTT"][1]) ``` Similarly to Benjamini & Hochberg, these can only be known after scanning has occurred. To summarize, we can compare the initial P-value with the different corrections: ```{r} knitr::kable( data.frame( What = c("Score", "P-value", "bonferroni", "BH", "fdr"), Value = format( c(Example.Score, Example.Pvalue, Example.bonferroni, Example.BH, Example.fdr), scientific = FALSE ) ), format = "markdown", caption = "Comparing P-value correction methods" ) ``` Use your best judgement as to which method is most appropriate for your specific use case. ## Regular and higher order scanning Furthermore, the `scan_sequences()` function offers the ability to scan using the `multifreq` slot, if available. This allows to take into account inter-positional dependencies, and get matches which more faithfully represent the original sequences from which the motif originated. ```{r} library(universalmotif) library(Biostrings) data(ArabidopsisPromoters) ## A 2-letter example: motif.k2 <- create_motif("CWWWWCC", nsites = 6) sequences.k2 <- DNAStringSet(rep(c("CAAAACC", "CTTTTCC"), 3)) motif.k2 <- add_multifreq(motif.k2, sequences.k2) ``` Regular scanning: ```{r} scan_sequences(motif.k2, ArabidopsisPromoters, RC = TRUE, threshold = 0.9, threshold.type = "logodds") ``` Using 2-letter information to scan: ```{r} scan_sequences(motif.k2, ArabidopsisPromoters, use.freq = 2, RC = TRUE, threshold = 0.9, threshold.type = "logodds") ``` Furthermore, sequence scanning can be further refined to avoid overlapping hits. Consider: ```{r} motif <- create_motif("AAAAAA") ## Leave in overlapping hits: scan_sequences(motif, ArabidopsisPromoters, RC = TRUE, threshold = 0.9, threshold.type = "logodds") ## Only keep the highest scoring hit amongst overlapping hits: scan_sequences(motif, ArabidopsisPromoters, RC = TRUE, threshold = 0.9, threshold.type = "logodds", no.overlaps = TRUE) ``` Finally, the results can be returned as a `GRanges` object for further manipulation: ```{r} scan_sequences(motif.k2, ArabidopsisPromoters, RC = TRUE, threshold = 0.9, threshold.type = "logodds", return.granges = TRUE) ``` ## Visualizing motif hits across sequences A few suggestions for different ways of plotting hits across sequences are presented here. Using the `ggbio` package, it is rather trivial to generate nice visualizations of the output of `scan_sequences()`. This requires having the `GenomicRanges` and `ggbio` packages installed, and outputting the `scan_sequences()` result as a `GRanges` object (via `return.granges = TRUE`). ```{r, fig.height = 2.5, fig.width = 6} library(universalmotif) library(GenomicRanges) library(ggbio) data(ArabidopsisPromoters) motif1 <- create_motif("AAAAAA", name = "Motif A") motif2 <- create_motif("CWWWWCC", name = "Motif B") res <- scan_sequences(c(motif1, motif2), ArabidopsisPromoters[1:10], return.granges = TRUE, calc.pvals = TRUE, no.overlaps = TRUE, threshold = 0.2, threshold.type = "logodds") ## Just plot the motif hits: autoplot(res, layout = "karyogram", aes(fill = motif, color = motif)) + theme( strip.background = element_rect(fill = NA, colour = NA), panel.background = element_rect(fill = NA, colour = NA) ) ## Plot Motif A hits by P-value: autoplot(res[res$motif.i == 1, ], layout = "karyogram", aes(fill = log10(pvalue), colour = log10(pvalue))) + scale_fill_gradient(low = "black", high = "grey75") + scale_colour_gradient(low = "black", high = "grey75") + theme( strip.background = element_rect(fill = NA, colour = NA), panel.background = element_rect(fill = NA, colour = NA) ) ``` Alternatively, just a simple heatmap with only `ggplot2`. ```{r, fig.height = 4.5, fig.width=6.5} library(universalmotif) library(ggplot2) data(ArabidopsisMotif) data(ArabidopsisPromoters) res <- scan_sequences(ArabidopsisMotif, ArabidopsisPromoters, threshold = 0, threshold.type = "logodds.abs") res <- suppressWarnings(as.data.frame(res)) res$x <- mapply(function(x, y) mean(c(x, y)), res$start, res$stop) ggplot(res, aes(x, sequence, fill = score)) + scale_fill_viridis_c() + scale_x_continuous(expand = c(0, 0)) + xlim(0, 1000) + xlab(element_blank()) + ylab(element_blank()) + geom_tile(width = ncol(ArabidopsisMotif)) + theme_bw() + theme(panel.grid = element_blank(), axis.text.y = element_text(size = 6)) ``` Using packages such as `ggExtra` or `ggpubr`, one could even plot marginal histogram or density plots above or below to illustrate any motif positional preference within the sequences. (Though keep in mind that the hit coordinates and sequence lengths would need to be normalized if not all sequences were of the same length, as they are here.) Finally, the distribution of all possible motif scores could be shown as a line plot across the sequences. ```{r,fig.height=5,fig.width=6.5} library(universalmotif) library(ggplot2) data(ArabidopsisMotif) data(ArabidopsisPromoters) res <- scan_sequences(ArabidopsisMotif, ArabidopsisPromoters[1:5], threshold = -Inf, threshold.type = "logodds.abs") res <- suppressWarnings(as.data.frame(res)) res$position <- mapply(function(x, y) mean(c(x, y)), res$start, res$stop) ggplot(res, aes(position, score, colour = score)) + geom_line() + geom_hline(yintercept = 0, colour = "red", alpha = 0.3) + theme_bw() + scale_colour_viridis_c() + facet_wrap(~sequence, ncol = 1) + xlab(element_blank()) + ylab(element_blank()) + theme(strip.background = element_blank()) ``` ## Enrichment analyses The `universalmotif` package offers the ability to search for enriched motif sites in a set of sequences via `enrich_motifs()`. There is little complexity to this, as it simply runs `scan_sequences()` twice: once on a set of target sequences, and once on a set of background sequences. After which the results between the two sequences are collated and run through enrichment tests. The background sequences can be given explicitly, or else `enrich_motifs()` will create background sequences on its own by using `shuffle_sequences()` on the target sequences. Let us consider the following basic example: ```{r} library(universalmotif) data(ArabidopsisMotif) data(ArabidopsisPromoters) enrich_motifs(ArabidopsisMotif, ArabidopsisPromoters, shuffle.k = 3, threshold = 0.001, RC = TRUE) ``` Here we can see that the motif is significantly enriched in the target sequences. The `Pval` was calculated by calling `stats::fisher.test()`. One final point: always keep in mind the `threshold` parameter, as this will ultimately decide the number of hits found. (A bad threshold can lead to a false negative.) ## Fixed and variable-length gapped motifs `universalmotif` class motifs can be gapped, which can be used by `scan_sequences()` and `enrich_motifs()`. Note that gapped motif support is currently limited to these two functions. All other functions will ignore the gap information, and even discard them in functions such as `merge_motifs()`. First, obtain the component motifs: ```{r} library(universalmotif) data(ArabidopsisPromoters) m1 <- create_motif("TTTATAT", name = "PartA") m2 <- create_motif("GGTTCGA", name = "PartB") ``` Then, combine them and add the desired gap. In this case, a gap will be added between the two motifs which can range in size from 4-6 bases. ```{r} m <- cbind(m1, m2) m <- add_gap(m, gaploc = ncol(m1), mingap = 4, maxgap = 6) m ``` Now, it can be used directly in `scan_sequences()` or `enrich_motifs()`: ```{r} scan_sequences(m, ArabidopsisPromoters, threshold = 0.4, threshold.type = "logodds") ``` ## Detecting low complexity regions and sequence masking Highly-repetitive low complexity regions can oftentimes cause problems during _de novo_ motif discovery, leading to obviously false motifs being returned. One way to get around this issue is to preemptively remove or mask these regions. The `universalmotif` package includes a few functions which can help carry out this task. Using `mask_seqs()`, one can mask a specific pattern of letters in `XStringSet` objects. Consider the following sequences: ```{r} library(universalmotif) library(Biostrings) Ex.seq <- DNAStringSet(c( A = "GTTGAAAAAAAAAAAAAAAACAGACGT", B = "TTAGATGGCCCATAGCTTATACGGCAA", C = "AATAAAATGCTTAGGAAATCGATTGCC" )) ``` We can easily mask portions that contain, say, stretches of at least 8 `A`s: ```{r} mask_seqs(Ex.seq, "AAAAAAAA") ``` Alternatively, instead of masking a know stretch of letters one can find low complexity regions using `sequence_complexity()`, and then mask specific regions in the sequences using `mask_ranges()`. The `sequence_complexity()` function has several complexity metrics available: the Wootton-Federhen [@wootton93] and Trifonov [@trifonov90] algorithms (and their approximations) are well described in @orlov04, and DUST in @morgulis06. See `?sequence_complexity` for more details. ```{r} (Ex.DUST <- sequence_complexity(Ex.seq, window.size = 10, method = "DUST", return.granges = TRUE)) ``` Using the DUST algorithm, we can see there are a couple of regions which spike in the complexity score (for this particular algorithm, more complex sequences converge towards zero). Now it is only a matter of filtering for those regions and using `mask_ranges()`. ```{r} (Ex.DUST <- Ex.DUST[Ex.DUST$complexity >= 3]) mask_ranges(Ex.seq, Ex.DUST) ``` Now these sequences could be used directly with `scan_sequences()` or written to a fasta file using `Biostrings::writeXStringSet()` for use with an external _de novo_ motif discovery program such as MEME. # Motif discovery with MEME _Note: In the time since the inception of the `run_meme()` function, Spencer Nystrom (a contributor to `universalmotif`) has created the `memes` package as a interface to much of the `MEME` suite. It is fully interoperable with the `universalmotif` package and provides a much more convenient way to run `MEME` programs from within `R`. Install it from Bioconductor with `BiocManager::install("memes")`._ The `universalmotif` package provides a simple wrapper to the powerful motif discovery tool `MEME` [@meme3]. To run an analysis with `MEME`, all that is required is a set of `XStringSet` class sequences (defined in the `Biostrings` package), and `run_meme()` will take care of running the program and reading the output for use within `R`. The first step is to check that R can find the `MEME` binary in your `$PATH` by running `run_meme()` without any parameters. If successful, you should see the default `MEME` help message in your console. If not, then you'll need to provide the complete path to the `MEME` binary. There are two options: ```{r,eval=FALSE} library(universalmotif) ## 1. Once per session: via `options()` options(meme.bin = "/path/to/meme/bin/meme") run_meme(...) ## 2. Once per run: via `run_meme()` run_meme(..., bin = "/path/to/meme/bin/meme") ``` Now we need to get some sequences to use with `run_meme()`. At this point we can read sequences from disk or extract them from one of the `Bioconductor` `BSgenome` packages. ```{r,eval=FALSE} library(universalmotif) data(ArabidopsisPromoters) ## 1. Read sequences from disk (in fasta format): library(Biostrings) # The following `read*()` functions are available in Biostrings: # DNA: readDNAStringSet # DNA with quality scores: readQualityScaledDNAStringSet # RNA: readRNAStringSet # Amino acid: readAAStringSet # Any: readBStringSet sequences <- readDNAStringSet("/path/to/sequences.fasta") run_meme(sequences, ...) ## 2. Extract from a `BSgenome` object: library(GenomicFeatures) library(TxDb.Athaliana.BioMart.plantsmart28) library(BSgenome.Athaliana.TAIR.TAIR9) # Let us retrieve the same promoter sequences from ArabidopsisPromoters: gene.names <- names(ArabidopsisPromoters) # First get the transcript coordinates from the relevant `TxDb` object: transcripts <- transcriptsBy(TxDb.Athaliana.BioMart.plantsmart28, by = "gene")[gene.names] # There are multiple transcripts per gene, we only care for the first one # in each: transcripts <- lapply(transcripts, function(x) x[1]) transcripts <- unlist(GRangesList(transcripts)) # Then the actual sequences: # Unfortunately this is a case where the chromosome names do not match # between the two databases seqlevels(TxDb.Athaliana.BioMart.plantsmart28) #> [1] "1" "2" "3" "4" "5" "Mt" "Pt" seqlevels(BSgenome.Athaliana.TAIR.TAIR9) #> [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC" # So we must first rename the chromosomes in `transcripts`: seqlevels(transcripts) <- seqlevels(BSgenome.Athaliana.TAIR.TAIR9) # Finally we can extract the sequences promoters <- getPromoterSeq(transcripts, BSgenome.Athaliana.TAIR.TAIR9, upstream = 1000, downstream = 0) run_meme(promoters, ...) ``` Once the sequences are ready, there are few important options to keep in mind. One is whether to conserve the output from `MEME`. The default is not to, but this can be changed by setting the relevant option: ```{r,eval=FALSE} run_meme(sequences, output = "/path/to/desired/output/folder") ``` The second important option is the search function (`objfun`). Some search functions such as the default `classic` do not require a set of background sequences, whilst some do (such as `de`). If you choose one of the latter, then you can either let `MEME` create them for you (it will shuffle the target sequences) or you can provide them via the `control.sequences` parameter. Finally, choose how you'd like the data imported into `R`. Once the `MEME` program exits, `run_meme()` will import the results into `R` with `read_meme()`. At this point you can decide if you want just the motifs themselves (`readsites = FALSE`) or if you'd like the original sequence sites as well (`readsites = TRUE`, the default). Doing the latter gives you the option of generating higher order representations for the imported `MEME` motifs as shown here: ```{r, eval=FALSE} motifs <- run_meme(sequences) motifs.k23 <- mapply(add_multifreq, motifs$motifs, motifs$sites) ``` There are a wealth of other `MEME` options available, such as the number of desired motifs (`nmotifs`), the width of desired motifs (`minw`, `maxw`), the search mode (`mod`), assigning sequence weights (`weights`), using a custom alphabet (`alph`), and many others. See the output from `run_meme()` for a brief description of the options, or visit the [online manual](http://meme-suite.org/doc/meme.html) for more details. # Miscellaneous string utilities Since biological sequences are usually contained in `XStringSet` class objects, `sequence_complexity()`, `get_bkg()` and `shuffle_sequences()` are designed to work with such objects. For cases when strings are not `XStringSet` objects, the following functions are available: * `calc_complexity()`: alternative to `sequence_complexity()` * `count_klets()`: alternative to `get_bkg()` * `shuffle_string()`: alternative to `shuffle_sequences()` ```{r} library(universalmotif) string <- "DASDSDDSASDSSA" calc_complexity(string) count_klets(string, 2) shuffle_string(string, 2) ``` A few other utilities have also been made available (based on the internal code of other `universalmotif` functions) that work on simple character vectors: * `calc_windows()`: calculate the coordinates for sliding windows from 1 to any number `n` * `get_klets()`: get a list of all possible `k`-lets for any sequence alphabet * `slide_fun()`: apply a function over sliding windows across a single string * `window_string()`: retrieve characters from sliding windows of a single string ```{r} library(universalmotif) calc_windows(n = 12, window = 4, overlap = 2) get_klets(c("A", "S", "D"), 2) slide_fun("ABCDEFGH", charToRaw, raw(2), window = 2, overlap = 1) window_string("ABCDEFGH", window = 2, overlap = 1) ``` # Session info {.unnumbered} ```{r sessionInfo, echo=FALSE} sessionInfo() ``` # References {.unnumbered}