This vignette goes through generating your own sequences from a
specified background model, shuffling sequences whilst maintaining a
certain k
-let size, and the scanning of sequences and
scoring of motifs. For an introduction to sequence motifs, see the introductory vignette. For a
basic overview of available motif-related functions, see the motif manipulation vignette. For a
discussion on motif comparisons and P-values, see the motif comparisons and P-values
vignette.
The Biostrings
package offers an excellent suite of
functions for dealing with biological sequences. The
universalmotif
package hopes to help extend these by
providing the create_sequences()
and
shuffle_sequences()
functions. The first of these,
create_sequences()
, generates a set of letters in random
order, then passes these strings to the Biostrings
package
to generate the final XStringSet
object. The number and
length of sequences can be specified. The probabilities of individual
letters can also be set.
The freqs
option of create_sequences()
also
takes higher order backgrounds. In these cases the sequences are
constructed in a Markov-style manner, where the probability of each
letter is based on which letters precede it.
library(universalmotif)
library(Biostrings)
## Create some DNA sequences for use with an external program (default
## is DNA):
sequences.dna <- create_sequences(seqnum = 500,
freqs = c(A=0.3, C=0.2, G=0.2, T=0.3))
## writeXStringSet(sequences.dna, "dna.fasta")
sequences.dna
#> DNAStringSet object of length 500:
#> width seq
#> [1] 100 TTCTTCATGGAGTTAATTGGACCTCCTCTATGT...ATTGTCTTCTTTAGCGCAGTAGTTAAGTATTC
#> [2] 100 TTACAATAAGTCTTGTCGATCACTGGAACAATT...TTCGGTTTACATCGAGCGGACAGTATATACTG
#> [3] 100 AGCACTTCTCATTCTATCAGCAGTTTTATTCAT...ATCACGATCCTTCCATAATCACTGATCGACGA
#> [4] 100 GCATGCATGAGCTCCAGGCTCCATCATTTCTAT...TAAAGCCTGTGCTGTTACTATTGGACCTGATG
#> [5] 100 TCATCTTAAACATGATCTAACATTCGCGTGTCG...CCGGATAGTTACGTGAGAATCTTTTAACCTTT
#> ... ... ...
#> [496] 100 ATCCGACATTGGTAAAGGTACCGTCTCGAATCT...CCAAGTTCTATTGCATCACCGCAATTTTGTGC
#> [497] 100 AAGCATCCTAATGAATTACGTCATGGCTCACAG...TCATTCAGGTGCCGCACACCCGAGTTACTTGA
#> [498] 100 TATTAATCTATGGCCTGTCCCTTATCCTGGGTA...ATCTATACGACTCTTCCGGCTCCAATGAACAA
#> [499] 100 GCGACATAAGCCTAGATCGTATCGCAATCTCAG...ATTAATCGAGGGAAATGACGCAGAATTTTTTT
#> [500] 100 CTCGGGGTTGGGATCTACAAGTGGGGTGTGCAG...CATCAGGAAGCCTCCCATCAATCTGACTAGAA
## Amino acid:
create_sequences(alphabet = "AA")
#> AAStringSet object of length 100:
#> width seq
#> [1] 100 WWVMGEHEHVAVCKSDFKCQYDNDYDEYRRIHA...VADTFQGGWVTPWHDISMQDGKILNMWGHTKE
#> [2] 100 FWYEPNWMPIINQDVVDWFTQYRDSGGQIAFER...QIVRYMMSMMAHRTMRNMIPMQVGPVYAKEMH
#> [3] 100 QGRQAKTPYPIVRKEEGMGQAGFFHYGGMTSSS...KFIIAQEHWLPIYSMEKLLVDPFHLQGDGICW
#> [4] 100 PNMGPGTRIKQATDTEAFPWIKSLARYESRMAA...RICNAANNACPTQVGRTHMTDWPTTKVKWVLA
#> [5] 100 DMYHQMHFPLAKHYHICGPSPQYMNMHMESEWY...RGYDRCRVFVLDDPPISTTWALLAMIWVNVHL
#> ... ... ...
#> [96] 100 HGRQHETPQGPTTVASNLLVYHWKRMYFSCVYF...LEPDMSVAVGRVNNVPHMYDHNEDEPMEGGCF
#> [97] 100 WPHVAHAAPGGGRHIWILVSTIMSTIIGYKTPA...NHTRCNMWGCMTNTYQTDIWPRYAQNALVIAP
#> [98] 100 AQWYALIENQFPGPRFGNTWKAYKQWPFGPWGC...SFGGFMVRKGYAKHSQPKFMGTLNPFGNYYTG
#> [99] 100 EFGRAMCMNGIFLWGFQFCGDTVLCTNGYASWK...FNAFRKRCHDPYSHKQPMFHIQVNDVDPKQNK
#> [100] 100 QFTCSFFPDCKQRMSMPGWAWQACNTEHYVYYP...KTMLKKRGQNTICMWWTMLPMPLWEHWIGKQL
## Any set of characters can be used
create_sequences(alphabet = paste0(letters, collapse = ""))
#> BStringSet object of length 100:
#> width seq
#> [1] 100 ujnyvtvguqytvootuptzvxvaapptvsqga...rrkrdqjtwygnoshmyldkexbkmdycprfq
#> [2] 100 cvmdslpazuexfotbkkvhavmmunawcaxct...udrkwggsdzdeyfualjcbdzvgsyswgjah
#> [3] 100 gxqtewytcmpynbouqztickdttvzjhtbvm...jqxdmqxwexrawdsbawxempfjjjoytcth
#> [4] 100 vrydvxbcnoscpivdqjaoyiiobxhojmmab...lcnipgcacixjuxbozdujagdjryaaqxog
#> [5] 100 roebjjmbbsaalidnoyenuemkbcbgxpdee...aqakfemydczlsrwvpiiegyeeotludowu
#> ... ... ...
#> [96] 100 ufwtosisxevdjxgecrwawbnfwyanahedw...lcsorvhyvtpuecjazkixhgfezafmiara
#> [97] 100 wxswcpgiatidixkufjtrqbpftjdnknoem...wkquiagipvmjwqlevtxvasclukjultjh
#> [98] 100 zkvweukjjqxcuhjpptxbpckhpxqvwbkix...dnismuxpcvjjuvygqnjxpzswfackzrmb
#> [99] 100 vmmjysnmavxtnowwfowqwjfetfjuapllp...vwbnqduvvkbdfrmibxdlmfjgajezhaya
#> [100] 100 zzoonuspoodjylfdckqjbvgaxlzdfirvy...maxxustskrlnhepotjsryospvqsevqha
Sequence backgrounds can be retrieved for DNA and RNA sequences with
oligonucleotideFrequency()
from "Biostrings
.
Unfortunately, no such Biostrings
function exists for other
sequence alphabets. The universalmotif
package proves
get_bkg()
to remedy this. Similarly, the
get_bkg()
function can calculate higher order backgrounds
for any alphabet as well. It is recommended to use the original
Biostrings
for very long (e.g. billions of characters) DNA
and RNA sequences whenever possible though, as it is much faster than
get_bkg()
.
library(universalmotif)
## Background of DNA sequences:
dna <- create_sequences()
get_bkg(dna, k = 1:2)
#> DataFrame with 20 rows and 3 columns
#> klet count probability
#> <character> <numeric> <numeric>
#> 1 A 2437 0.2437000
#> 2 C 2471 0.2471000
#> 3 G 2490 0.2490000
#> 4 T 2602 0.2602000
#> 5 AA 603 0.0609091
#> ... ... ... ...
#> 16 GT 644 0.0650505
#> 17 TA 594 0.0600000
#> 18 TC 641 0.0647475
#> 19 TG 670 0.0676768
#> 20 TT 671 0.0677778
## Background of non DNA/RNA sequences:
qwerty <- create_sequences("QWERTY")
get_bkg(qwerty, k = 1:2)
#> DataFrame with 42 rows and 3 columns
#> klet count probability
#> <character> <numeric> <numeric>
#> 1 E 1629 0.1629
#> 2 Q 1704 0.1704
#> 3 R 1689 0.1689
#> 4 T 1677 0.1677
#> 5 W 1677 0.1677
#> ... ... ... ...
#> 38 YQ 257 0.0259596
#> 39 YR 299 0.0302020
#> 40 YT 230 0.0232323
#> 41 YW 274 0.0276768
#> 42 YY 276 0.0278788
One way to compare sequences is by k-let composition. The following
example illustrates how one could go about doing this using only the
universalmotif
package and base graphics.
library(universalmotif)
## Generate three random sets of sequences:
s1 <- create_sequences(seqnum = 20,
freqs = c(A = 0.3, C = 0.2, G = 0.2, T = 0.3))
s2 <- create_sequences(seqnum = 20,
freqs = c(A = 0.4, C = 0.4, G = 0.1, T = 0.1))
s3 <- create_sequences(seqnum = 20,
freqs = c(A = 0.2, C = 0.3, G = 0.3, T = 0.2))
## Create a function to get properly formatted k-let counts:
get_klet_matrix <- function(seqs, k, groupName) {
bkg <- get_bkg(seqs, k = k, merge.res = FALSE)
bkg <- bkg[, c("sequence", "klet", "count")]
bkg <- reshape(bkg, idvar = "sequence", timevar = "klet",
direction = "wide")
suppressWarnings(as.data.frame(cbind(Group = groupName, bkg)))
}
## Calculate k-let content (up to you what size k you want!):
s1 <- get_klet_matrix(s1, 4, 1)
s2 <- get_klet_matrix(s2, 4, 2)
s3 <- get_klet_matrix(s3, 4, 3)
# Combine everything into a single object:
sAll <- rbind(s1, s2, s3)
## Do the PCA:
sPCA <- prcomp(sAll[, -(1:2)])
## Plot the PCA:
plot(sPCA$x, col = c("red", "forestgreen", "blue")[sAll$Group], pch = 19)
This example could be improved by using tidyr::spread()
instead of reshape()
(the former is much faster), and
plotting the PCA using the ggfortify
package to create a
nicer ggplot2
plot. Feel free to play around with different
ways of plotting the data! Additionally, you could even try using t-SNE
instead of PCA (such as via the Rtsne
package).
When performing de novo motif searches or motif enrichment analyses, it is common to do so against a set of background sequences. In order to properly identify consistent patterns or motifs in the target sequences, it is important that there be maintained a certain level of sequence composition between the target and background sequences. This reduces results which are derived purely from base differential letter frequency biases.
In order to avoid these results, typically it desirable to use a set
of background sequences which preserve a certain k
-let size
(such as dinucleotide or trinucleotide frequencies in the case of DNA
sequences). Though for some cases a set of similar sequences may already
be available for use as background sequences, usually background
sequences are obtained by shuffling the target sequences, while
preserving a desired k
-let size. For this purpose, a
commonly used tool is uShuffle
(Jiang et al. 2008). The
universalmotif
package aims to provide its own
k
-let shuffling capabilities for use within R
via shuffle_sequences()
.
The universalmotif
package offers three different
methods for sequence shuffling: euler
, markov
and linear
. The first method, euler
, can
shuffle sequences while preserving any desired k
-let size.
Furthermore 1-letter counts will always be maintained. However due to
the nature of the method, the first and last letters will remain
unshuffled. This method is based on the initial random Eulerian walk
algorithm proposed by Altschul and Erickson
(1985) and the subsequent cycle-popping algorithm detailed by
Propp and Wilson (1998) for quickly and
efficiently finding Eulerian walks.
The second method, markov
can only guarantee that the
approximate k
-let frequency will be maintained, but not
that the original letter counts will be preserved. The
markov
method involves determining the original
k
-let frequencies, then creating a new set of sequences
which will have approximately similar k
-let frequency. As a
result the counts for the individual letters will likely be different.
Essentially, it involves a combination of determining k
-let
frequencies followed by create_sequences()
. This type of
pseudo-shuffling is discussed by Fitch
(1983).
The third method linear
preserves the original 1-letter
counts exactly, but uses a more crude shuffling technique. In this case
the sequence is split into sub-sequences every k
-let (of
any size), which are then re-assembled randomly. This means that while
shuffling the same sequence multiple times with
method = "linear"
will result in different sequences, they
will all have started from the same set of k
-length
sub-sequences (just re-assembled differently).
library(universalmotif)
library(Biostrings)
data(ArabidopsisPromoters)
## Potentially starting off with some external sequences:
# ArabidopsisPromoters <- readDNAStringSet("ArabidopsisPromoters.fasta")
euler <- shuffle_sequences(ArabidopsisPromoters, k = 2, method = "euler")
markov <- shuffle_sequences(ArabidopsisPromoters, k = 2, method = "markov")
linear <- shuffle_sequences(ArabidopsisPromoters, k = 2, method = "linear")
k1 <- shuffle_sequences(ArabidopsisPromoters, k = 1)
Let us compare how the methods perform:
o.letter <- get_bkg(ArabidopsisPromoters, 1)
e.letter <- get_bkg(euler, 1)
m.letter <- get_bkg(markov, 1)
l.letter <- get_bkg(linear, 1)
data.frame(original=o.letter$count, euler=e.letter$count,
markov=m.letter$count, linear=l.letter$count, row.names = DNA_BASES)
#> original euler markov linear
#> A 17384 17384 17286 17384
#> C 8081 8081 8078 8081
#> G 7583 7583 7676 7583
#> T 16952 16952 16960 16952
o.counts <- get_bkg(ArabidopsisPromoters, 2)
e.counts <- get_bkg(euler, 2)
m.counts <- get_bkg(markov, 2)
l.counts <- get_bkg(linear, 2)
data.frame(original=o.counts$count, euler=e.counts$count,
markov=m.counts$count, linear=l.counts$count,
row.names = get_klets(DNA_BASES, 2))
#> original euler markov linear
#> AA 6893 6893 6053 6468
#> AC 2614 2614 2693 2734
#> AG 2592 2592 2596 2609
#> AT 5276 5276 5924 5560
#> CA 3014 3014 2759 2958
#> CC 1376 1376 1372 1315
#> CG 1051 1051 1249 1136
#> CT 2621 2621 2695 2662
#> GA 2734 2734 2620 2623
#> GC 1104 1104 1293 1205
#> GG 1176 1176 1231 1197
#> GT 2561 2561 2525 2550
#> TA 4725 4725 5829 5314
#> TC 2977 2977 2714 2820
#> TG 2759 2759 2596 2636
#> TT 6477 6477 5801 6163
If you have a fairly heterogeneous sequence and wish to preserve the
presence of local “patches” of differential sequence composition, you
can set window = TRUE
in the
shuffle_sequences()
function. In the following example, the
sequence of interest has an AT rich first half followed by a second half
with an even background. The impact on this specific sequence
composition is observed after regular and local shuffling, using the
per-window functionality of get_bkg()
(via
window = TRUE
). Fine-tune the window size and overlap
between windows with window.size
and
window.overlap
.
library(Biostrings)
library(universalmotif)
library(ggplot2)
myseq <- DNAStringSet(paste0(
create_sequences(seqlen = 500, freqs = c(A=0.4, T=0.4, C=0.1, G=0.1)),
create_sequences(seqlen = 500)
))
myseq_shuf <- shuffle_sequences(myseq)
myseq_shuf_local <- shuffle_sequences(myseq, window = TRUE)
myseq_bkg <- get_bkg(myseq, k = 1, window = TRUE)
myseq_shuf_bkg <- get_bkg(myseq_shuf, k = 1, window = TRUE)
myseq_shuf_local_bkg <- get_bkg(myseq_shuf_local, k = 1, window = TRUE)
myseq_bkg$group <- "original"
myseq_shuf_bkg$group <- "shuffled"
myseq_shuf_local_bkg$group <- "shuffled-local"
myseq_all <- suppressWarnings(as.data.frame(
rbind(myseq_bkg, myseq_shuf_bkg, myseq_shuf_local_bkg)
))
ggplot(myseq_all, aes(x = start, y = probability, colour = klet)) +
geom_line() +
theme_minimal() +
scale_colour_manual(values = universalmotif:::DNA_COLOURS) +
xlab(element_blank()) +
facet_wrap(~group, ncol = 1)
There are many motif-programs available with sequence scanning
capabilities, such as HOMER and tools from
the MEME suite. The
universalmotif
package does not aim to supplant these, but
rather provide convenience functions for quickly scanning a few
sequences without needing to leave the R
environment.
Furthermore, these functions allow for taking advantage of the
higher-order (multifreq
) motif format described here.
Two scanning-related functions are provided:
scan_sequences()
and enrich_motifs()
. The
latter simply runs scan_sequences()
twice on a set of
target and background sequences. Given a motif of length n
,
scan_sequences()
considers every possible
n
-length subset in a sequence and scores it using the PWM
format. If the match surpasses the minimum threshold, it is reported.
This is case regardless of whether one is scanning with a regular motif,
or using the higher-order (multifreq
) motif format (the
multifreq
matrix is converted to a PWM).
Before scanning a set of sequences, one must first decide the minimum
logodds threshold for retrieving matches. This decision is not always
the same between scanning programs out in the wild, nor is it usually
told to the user what the cutoff is or how it is decided. As a result,
universalmotif
aims to be as transparent as possible in
this regard by allowing for complete control of the threshold. For more
details on PWMs, see the introductory vignette.
One way is to set a cutoff between 0 and 1, then multiplying the
highest possible PWM score to get a threshold. The
matchPWM()
function from the Biostrings
package for example uses a default of 0.8 (shown as "80%"
).
This is quite arbitrary of course, and every motif will end up with a
different threshold. For high information content motifs, there is
really no right or wrong threshold, as they tend to have fewer
non-specific positions. This means that incorrect letters in a match
will be more punishing. To illustrate this, contrast the following
PWMs:
library(universalmotif)
m1 <- create_motif("TATATATATA", nsites = 50, type = "PWM", pseudocount = 1)
m2 <- matrix(c(0.10,0.27,0.23,0.19,0.29,0.28,0.51,0.12,0.34,0.26,
0.36,0.29,0.51,0.38,0.23,0.16,0.17,0.21,0.23,0.36,
0.45,0.05,0.02,0.13,0.27,0.38,0.26,0.38,0.12,0.31,
0.09,0.40,0.24,0.30,0.21,0.19,0.05,0.30,0.31,0.08),
byrow = TRUE, nrow = 4)
m2 <- create_motif(m2, alphabet = "DNA", type = "PWM")
m1["motif"]
#> T A T A T A T
#> A -5.672425 1.978626 -5.672425 1.978626 -5.672425 1.978626 -5.672425
#> C -5.672425 -5.672425 -5.672425 -5.672425 -5.672425 -5.672425 -5.672425
#> G -5.672425 -5.672425 -5.672425 -5.672425 -5.672425 -5.672425 -5.672425
#> T 1.978626 -5.672425 1.978626 -5.672425 1.978626 -5.672425 1.978626
#> A T A
#> A 1.978626 -5.672425 1.978626
#> C -5.672425 -5.672425 -5.672425
#> G -5.672425 -5.672425 -5.672425
#> T -5.672425 1.978626 -5.672425
m2["motif"]
#> S H C N N N
#> A -1.3219281 0.09667602 -0.12029423 -0.3959287 0.2141248 0.1491434
#> C 0.5260688 0.19976951 1.02856915 0.6040713 -0.1202942 -0.6582115
#> G 0.8479969 -2.33628339 -3.64385619 -0.9434165 0.1110313 0.5897160
#> T -1.4739312 0.66371661 -0.05889369 0.2630344 -0.2515388 -0.4102840
#> R N N V
#> A 1.0430687 -1.0732490 0.4436067 0.04222824
#> C -0.5418938 -0.2658941 -0.1202942 0.51171352
#> G 0.0710831 0.5897160 -1.0588937 0.29598483
#> T -2.3074285 0.2486791 0.3103401 -1.65821148
In the first example, sequences which do not have a matching base in every position are punished heavily. The maximum logodds score in this case is approximately 20, and for each incorrect position the score is reduced approximately by 5.7. This means that a threshold of zero would allow for at most three mismatches. At this point, it is up to you how many mismatches you would deem appropriate.
This thinking becomes impossible for the second example. In this
case, mismatches are much less punishing, to the point that one could
ask: what even constitutes a mismatch? The answer to this question is
usually much more difficult in such cases. An alternative to manually
deciding upon a threshold is to instead start with maximum P-value one
would consider appropriate for a match. If, say, we want matches with a
P-value of at most 0.001, then we can use motif_pvalue()
to
calculate the appropriate threshold (see the comparisons and P-values
vignette for details on motif P-values).
This P-value can be further refined to correct for multiple testing
(and becomes a Q-value). There are three available corrections that can
be set in scan_sequences()
: Bonferroni (“bonferroni”),
Benjamini & Hochberg (“BH”), and the false discovery rate (“fdr”)
based on the empirical null distribution of motif hits in a set of
sequences. They are excellently explained in Noble (2009), and these explanations will be
briefly regurgitated here.
To begin to understand how these different corrections are implemented, consider the following motif, sequences, example P-value for an example motif hit, and the theoretical maximum number of motif hits:
library(universalmotif)
data(ArabidopsisMotif)
data(ArabidopsisPromoters)
(Example.Score <- score_match(ArabidopsisMotif, "TTCTCTTTTTTTTTT"))
#> [1] 16.81
(Example.Pvalue <- motif_pvalue(ArabidopsisMotif, Example.Score))
#> [1] 6.612819e-07
(Max.Possible.Hits <- sum(width(ArabidopsisPromoters) - ncol(ArabidopsisMotif) + 1))
#> [1] 49300
The first correction method, Bonferroni, is by far the simplest. To calculate it, take the P-value of a motif hit and multiply it by the theoretical maximum number of hits:
As you can imagine, the level of punishment the P-value receives corresponds to the size of the sequences you are scanning. If you are scanning an entire genome, then you can expect this to be very punishing and only return near-perfect matches (or no matches). However for smaller sets of sequences this correction can be more appropriate.
Next, Benjamini & Hochberg. To perform this correction, the
P-value is divided by the percentile rank of the P-value in the list of
P-values for all theoretically possible hits sorted in ascending order
(it also assumes that P-values are normally distributed under the null
hypothesis). It is important to note that this means the correction
cannot be calculated before the sequences have been scanned for the
motif, and P-values have been calculated for all returned hits. When
requesting this type of Q-value for the minimum threshold of score,
scan_sequences()
instead calculates the threshold from the
input Q-value as a P-value, then filters the final results after
Q-values have been calculated. Returning to our example:
(Scan.Results <- scan_sequences(ArabidopsisMotif, ArabidopsisPromoters,
threshold = 0.8, threshold.type = "logodds", calc.qvals = FALSE))
#> DataFrame with 20 rows and 14 columns
#> motif motif.i sequence sequence.i start stop
#> <character> <integer> <character> <integer> <integer> <integer>
#> 1 YTTTYTTTTTYTTTY 1 AT1G05670 47 68 82
#> 2 YTTTYTTTTTYTTTY 1 AT1G19510 45 402 416
#> 3 YTTTYTTTTTYTTTY 1 AT1G49840 27 899 913
#> 4 YTTTYTTTTTYTTTY 1 AT2G22500 14 946 960
#> 5 YTTTYTTTTTYTTTY 1 AT2G22500 14 948 962
#> ... ... ... ... ... ... ...
#> 16 YTTTYTTTTTYTTTY 1 AT3G23170 34 603 617
#> 17 YTTTYTTTTTYTTTY 1 AT4G19520 3 792 806
#> 18 YTTTYTTTTTYTTTY 1 AT4G19520 3 793 807
#> 19 YTTTYTTTTTYTTTY 1 AT4G27652 20 879 893
#> 20 YTTTYTTTTTYTTTY 1 AT4G27652 20 881 895
#> score match thresh.score min.score max.score score.pct
#> <numeric> <character> <numeric> <numeric> <numeric> <numeric>
#> 1 15.407 GTTTCTTTTTTCTTT 15.0272 -125.07 18.784 82.0219
#> 2 17.405 TTTTCTTTTTCTTTT 15.0272 -125.07 18.784 92.6586
#> 3 15.177 CTTTTTGTTTTTTTC 15.0272 -125.07 18.784 80.7975
#> 4 15.827 TCCTCTCTTTCTCTC 15.0272 -125.07 18.784 84.2579
#> 5 15.908 CTCTCTTTCTCTCTT 15.0272 -125.07 18.784 84.6891
#> ... ... ... ... ... ... ...
#> 16 15.734 GTTTCTTCTTTTTTT 15.0272 -125.07 18.784 83.7628
#> 17 15.352 TTTTTTTTTTTTTTT 15.0272 -125.07 18.784 81.7291
#> 18 15.352 TTTTTTTTTTTTTTT 15.0272 -125.07 18.784 81.7291
#> 19 16.410 TTTTCTCTTTTTTTT 15.0272 -125.07 18.784 87.3616
#> 20 16.810 TTCTCTTTTTTTTTT 15.0272 -125.07 18.784 89.4911
#> strand pvalue
#> <character> <numeric>
#> 1 + 3.95595e-06
#> 2 + 2.44369e-07
#> 3 + 5.01977e-06
#> 4 + 2.53853e-06
#> 5 + 2.39165e-06
#> ... ... ...
#> 16 + 2.83419e-06
#> 17 + 4.33848e-06
#> 18 + 4.33848e-06
#> 19 + 1.23950e-06
#> 20 + 6.61282e-07
First we sort and calculate the percentile ranks of our P-values, and then divide the P-values:
Pvalues <- Scan.Results$pvalue
Pvalues.Ranks <- (rank(Pvalues) / Max.Possible.Hits) * 100
Qvalues.BH <- Pvalues / Pvalues.Ranks
(Example.BH <- Qvalues.BH[Scan.Results$match == "TTCTCTTTTTTTTTT"][1])
#> [1] 6.52024e-05
Finally, calculating the false discovery rate from the empirical
distribution of scores. This method requires some additional steps, as
we must obtain the observed and null distributions of hits in our
sequences. Then for each hit, divide the number of hits with a score
equal to or greater in the null distribution with the number of hits
with a score equal to or greater in the observed distribution. Along the
way we must be wary of the nonmonotonicity of the final Q-values
(meaning that as scores get smaller the Q-value does not always
increase), and thus always select the minimum available Q-value as the
score increases. To get the null distribution of hits, we can simply use
the P-values associated with each score as these are analytically
calculated from the null based on the background probabilities (see
?motif_pvalue
).
Scan.Results <- Scan.Results[order(Scan.Results$score, decreasing = TRUE), ]
Observed.Hits <- 1:nrow(Scan.Results)
Null.Hits <- Max.Possible.Hits * Scan.Results$pvalue
Qvalues.fdr <- Null.Hits / Observed.Hits
Qvalues.fdr <- rev(cummin(rev(Qvalues.fdr)))
(Example.fdr <- Qvalues.fdr[Scan.Results$match == "TTCTCTTTTTTTTTT"][1])
#> [1] 0.00652024
Similarly to Benjamini & Hochberg, these can only be known after scanning has occurred.
To summarize, we can compare the initial P-value with the different corrections:
knitr::kable(
data.frame(
What = c("Score", "P-value", "bonferroni", "BH", "fdr"),
Value = format(
c(Example.Score, Example.Pvalue, Example.bonferroni, Example.BH, Example.fdr),
scientific = FALSE
)
),
format = "markdown", caption = "Comparing P-value correction methods"
)
What | Value |
---|---|
Score | 16.8100000000000 |
P-value | 0.0000006612819 |
bonferroni | 0.0326011986749 |
BH | 0.0000652023973 |
fdr | 0.0065202397350 |
Use your best judgement as to which method is most appropriate for your specific use case.
Furthermore, the scan_sequences()
function offers the
ability to scan using the multifreq
slot, if available.
This allows to take into account inter-positional dependencies, and get
matches which more faithfully represent the original sequences from
which the motif originated.
library(universalmotif)
library(Biostrings)
data(ArabidopsisPromoters)
## A 2-letter example:
motif.k2 <- create_motif("CWWWWCC", nsites = 6)
sequences.k2 <- DNAStringSet(rep(c("CAAAACC", "CTTTTCC"), 3))
motif.k2 <- add_multifreq(motif.k2, sequences.k2)
Regular scanning:
scan_sequences(motif.k2, ArabidopsisPromoters, RC = TRUE,
threshold = 0.9, threshold.type = "logodds")
#> DataFrame with 94 rows and 15 columns
#> motif motif.i sequence sequence.i start stop score
#> <character> <integer> <character> <integer> <integer> <integer> <numeric>
#> 1 motif 1 AT1G03850 4 203 209 9.08
#> 2 motif 1 AT1G03850 4 334 328 9.08
#> 3 motif 1 AT1G03850 4 713 707 9.08
#> 4 motif 1 AT1G05670 47 706 700 9.08
#> 5 motif 1 AT1G06160 48 498 492 9.08
#> ... ... ... ... ... ... ... ...
#> 90 motif 1 AT5G22690 46 81 87 9.08
#> 91 motif 1 AT5G22690 46 362 368 9.08
#> 92 motif 1 AT5G24660 49 146 140 9.08
#> 93 motif 1 AT5G58430 16 332 338 9.08
#> 94 motif 1 AT5G58430 16 343 349 9.08
#> match thresh.score min.score max.score score.pct strand
#> <character> <numeric> <numeric> <numeric> <numeric> <character>
#> 1 CTAATCC 8.172 -19.649 9.08 100 +
#> 2 CTTTTCC 8.172 -19.649 9.08 100 -
#> 3 CTTAACC 8.172 -19.649 9.08 100 -
#> 4 CTTTACC 8.172 -19.649 9.08 100 -
#> 5 CTAAACC 8.172 -19.649 9.08 100 -
#> ... ... ... ... ... ... ...
#> 90 CAATACC 8.172 -19.649 9.08 100 +
#> 91 CAAATCC 8.172 -19.649 9.08 100 +
#> 92 CATTACC 8.172 -19.649 9.08 100 -
#> 93 CATAACC 8.172 -19.649 9.08 100 +
#> 94 CAAATCC 8.172 -19.649 9.08 100 +
#> pvalue qvalue
#> <numeric> <numeric>
#> 1 0.000976562 1
#> 2 0.000976562 1
#> 3 0.000976562 1
#> 4 0.000976562 1
#> 5 0.000976562 1
#> ... ... ...
#> 90 0.000976562 1
#> 91 0.000976562 1
#> 92 0.000976562 1
#> 93 0.000976562 1
#> 94 0.000976562 1
Using 2-letter information to scan:
scan_sequences(motif.k2, ArabidopsisPromoters, use.freq = 2, RC = TRUE,
threshold = 0.9, threshold.type = "logodds")
#> DataFrame with 8 rows and 15 columns
#> motif motif.i sequence sequence.i start stop score
#> <character> <integer> <character> <integer> <integer> <integer> <numeric>
#> 1 motif 1 AT1G19510 45 960 965 17.827
#> 2 motif 1 AT1G49840 27 959 964 17.827
#> 3 motif 1 AT1G77210 32 184 189 17.827
#> 4 motif 1 AT1G77210 32 954 959 17.827
#> 5 motif 1 AT2G37950 15 751 756 17.827
#> 6 motif 1 AT3G57640 33 917 922 17.827
#> 7 motif 1 AT4G12690 12 938 943 17.827
#> 8 motif 1 AT4G14365 35 977 982 17.827
#> match thresh.score min.score max.score score.pct strand
#> <character> <numeric> <numeric> <numeric> <numeric> <character>
#> 1 CTTTTC 16.0443 -16.842 17.827 100 +
#> 2 CTTTTC 16.0443 -16.842 17.827 100 +
#> 3 CAAAAC 16.0443 -16.842 17.827 100 +
#> 4 CAAAAC 16.0443 -16.842 17.827 100 +
#> 5 CAAAAC 16.0443 -16.842 17.827 100 +
#> 6 CTTTTC 16.0443 -16.842 17.827 100 +
#> 7 CAAAAC 16.0443 -16.842 17.827 100 +
#> 8 CTTTTC 16.0443 -16.842 17.827 100 +
#> pvalue qvalue
#> <numeric> <numeric>
#> 1 1.90735e-06 0.0236988
#> 2 1.90735e-06 0.0236988
#> 3 1.90735e-06 0.0236988
#> 4 1.90735e-06 0.0236988
#> 5 1.90735e-06 0.0236988
#> 6 1.90735e-06 0.0236988
#> 7 1.90735e-06 0.0236988
#> 8 1.90735e-06 0.0236988
Furthermore, sequence scanning can be further refined to avoid overlapping hits. Consider:
motif <- create_motif("AAAAAA")
## Leave in overlapping hits:
scan_sequences(motif, ArabidopsisPromoters, RC = TRUE, threshold = 0.9,
threshold.type = "logodds")
#> DataFrame with 491 rows and 15 columns
#> motif motif.i sequence sequence.i start stop score
#> <character> <integer> <character> <integer> <integer> <integer> <numeric>
#> 1 motif 1 AT1G03850 4 56 51 11.934
#> 2 motif 1 AT1G03850 4 57 52 11.934
#> 3 motif 1 AT1G03850 4 58 53 11.934
#> 4 motif 1 AT1G03850 4 59 54 11.934
#> 5 motif 1 AT1G03850 4 243 248 11.934
#> ... ... ... ... ... ... ... ...
#> 487 motif 1 AT5G64310 22 589 594 11.934
#> 488 motif 1 AT5G64310 22 590 595 11.934
#> 489 motif 1 AT5G64310 22 591 596 11.934
#> 490 motif 1 AT5G64310 22 592 597 11.934
#> 491 motif 1 AT5G64310 22 696 701 11.934
#> match thresh.score min.score max.score score.pct strand
#> <character> <numeric> <numeric> <numeric> <numeric> <character>
#> 1 AAAAAA 10.7406 -39.948 11.934 100 -
#> 2 AAAAAA 10.7406 -39.948 11.934 100 -
#> 3 AAAAAA 10.7406 -39.948 11.934 100 -
#> 4 AAAAAA 10.7406 -39.948 11.934 100 -
#> 5 AAAAAA 10.7406 -39.948 11.934 100 +
#> ... ... ... ... ... ... ...
#> 487 AAAAAA 10.7406 -39.948 11.934 100 +
#> 488 AAAAAA 10.7406 -39.948 11.934 100 +
#> 489 AAAAAA 10.7406 -39.948 11.934 100 +
#> 490 AAAAAA 10.7406 -39.948 11.934 100 +
#> 491 AAAAAA 10.7406 -39.948 11.934 100 +
#> pvalue qvalue
#> <numeric> <numeric>
#> 1 0.000244141 0.0494745
#> 2 0.000244141 0.0494745
#> 3 0.000244141 0.0494745
#> 4 0.000244141 0.0494745
#> 5 0.000244141 0.0494745
#> ... ... ...
#> 487 0.000244141 0.0494745
#> 488 0.000244141 0.0494745
#> 489 0.000244141 0.0494745
#> 490 0.000244141 0.0494745
#> 491 0.000244141 0.0494745
## Only keep the highest scoring hit amongst overlapping hits:
scan_sequences(motif, ArabidopsisPromoters, RC = TRUE, threshold = 0.9,
threshold.type = "logodds", no.overlaps = TRUE)
#> DataFrame with 220 rows and 15 columns
#> motif motif.i sequence sequence.i start stop score
#> <character> <integer> <character> <integer> <integer> <integer> <numeric>
#> 1 motif 1 AT1G03850 4 56 51 11.934
#> 2 motif 1 AT1G03850 4 243 248 11.934
#> 3 motif 1 AT1G03850 4 735 740 11.934
#> 4 motif 1 AT1G05670 47 32 27 11.934
#> 5 motif 1 AT1G05670 47 78 73 11.934
#> ... ... ... ... ... ... ... ...
#> 216 motif 1 AT5G64310 22 251 246 11.934
#> 217 motif 1 AT5G64310 22 342 347 11.934
#> 218 motif 1 AT5G64310 22 586 591 11.934
#> 219 motif 1 AT5G64310 22 592 597 11.934
#> 220 motif 1 AT5G64310 22 696 701 11.934
#> match thresh.score min.score max.score score.pct strand
#> <character> <numeric> <numeric> <numeric> <numeric> <character>
#> 1 AAAAAA 10.7406 -39.948 11.934 100 -
#> 2 AAAAAA 10.7406 -39.948 11.934 100 +
#> 3 AAAAAA 10.7406 -39.948 11.934 100 +
#> 4 AAAAAA 10.7406 -39.948 11.934 100 -
#> 5 AAAAAA 10.7406 -39.948 11.934 100 -
#> ... ... ... ... ... ... ...
#> 216 AAAAAA 10.7406 -39.948 11.934 100 -
#> 217 AAAAAA 10.7406 -39.948 11.934 100 +
#> 218 AAAAAA 10.7406 -39.948 11.934 100 +
#> 219 AAAAAA 10.7406 -39.948 11.934 100 +
#> 220 AAAAAA 10.7406 -39.948 11.934 100 +
#> pvalue qvalue
#> <numeric> <numeric>
#> 1 0.000244141 0.0494745
#> 2 0.000244141 0.0494745
#> 3 0.000244141 0.0494745
#> 4 0.000244141 0.0494745
#> 5 0.000244141 0.0494745
#> ... ... ...
#> 216 0.000244141 0.0494745
#> 217 0.000244141 0.0494745
#> 218 0.000244141 0.0494745
#> 219 0.000244141 0.0494745
#> 220 0.000244141 0.0494745
Finally, the results can be returned as a GRanges
object
for further manipulation:
scan_sequences(motif.k2, ArabidopsisPromoters, RC = TRUE,
threshold = 0.9, threshold.type = "logodds",
return.granges = TRUE)
#> GRanges object with 94 ranges and 11 metadata columns:
#> seqnames ranges strand | motif motif.i sequence.i score
#> <Rle> <IRanges> <Rle> | <character> <integer> <integer> <numeric>
#> [1] AT1G03850 203-209 + | motif 1 4 9.08
#> [2] AT1G03850 328-334 - | motif 1 4 9.08
#> [3] AT1G03850 707-713 - | motif 1 4 9.08
#> [4] AT1G05670 700-706 - | motif 1 47 9.08
#> [5] AT1G06160 956-962 + | motif 1 48 9.08
#> ... ... ... ... . ... ... ... ...
#> [90] AT5G22690 362-368 + | motif 1 46 9.08
#> [91] AT5G22690 52-58 - | motif 1 46 9.08
#> [92] AT5G24660 140-146 - | motif 1 49 9.08
#> [93] AT5G58430 332-338 + | motif 1 16 9.08
#> [94] AT5G58430 343-349 + | motif 1 16 9.08
#> match thresh.score min.score max.score score.pct pvalue
#> <character> <numeric> <numeric> <numeric> <numeric> <numeric>
#> [1] CTAATCC 8.172 -19.649 9.08 100 0.000976562
#> [2] CTTTTCC 8.172 -19.649 9.08 100 0.000976562
#> [3] CTTAACC 8.172 -19.649 9.08 100 0.000976562
#> [4] CTTTACC 8.172 -19.649 9.08 100 0.000976562
#> [5] CTAATCC 8.172 -19.649 9.08 100 0.000976562
#> ... ... ... ... ... ... ...
#> [90] CAAATCC 8.172 -19.649 9.08 100 0.000976562
#> [91] CATTACC 8.172 -19.649 9.08 100 0.000976562
#> [92] CATTACC 8.172 -19.649 9.08 100 0.000976562
#> [93] CATAACC 8.172 -19.649 9.08 100 0.000976562
#> [94] CAAATCC 8.172 -19.649 9.08 100 0.000976562
#> qvalue
#> <numeric>
#> [1] 1
#> [2] 1
#> [3] 1
#> [4] 1
#> [5] 1
#> ... ...
#> [90] 1
#> [91] 1
#> [92] 1
#> [93] 1
#> [94] 1
#> -------
#> seqinfo: 50 sequences from an unspecified genome
A few suggestions for different ways of plotting hits across sequences are presented here.
Using the ggbio
package, it is rather trivial to
generate nice visualizations of the output of
scan_sequences()
. This requires having the
GenomicRanges
and ggbio
packages installed,
and outputting the scan_sequences()
result as a
GRanges
object (via
return.granges = TRUE
).
library(universalmotif)
library(GenomicRanges)
library(ggbio)
data(ArabidopsisPromoters)
motif1 <- create_motif("AAAAAA", name = "Motif A")
motif2 <- create_motif("CWWWWCC", name = "Motif B")
res <- scan_sequences(c(motif1, motif2), ArabidopsisPromoters[1:10],
return.granges = TRUE, calc.pvals = TRUE, no.overlaps = TRUE,
threshold = 0.2, threshold.type = "logodds")
## Just plot the motif hits:
autoplot(res, layout = "karyogram", aes(fill = motif, color = motif)) +
theme(
strip.background = element_rect(fill = NA, colour = NA),
panel.background = element_rect(fill = NA, colour = NA)
)
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
## Plot Motif A hits by P-value:
autoplot(res[res$motif.i == 1, ], layout = "karyogram",
aes(fill = log10(pvalue), colour = log10(pvalue))) +
scale_fill_gradient(low = "black", high = "grey75") +
scale_colour_gradient(low = "black", high = "grey75") +
theme(
strip.background = element_rect(fill = NA, colour = NA),
panel.background = element_rect(fill = NA, colour = NA)
)
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
Alternatively, just a simple heatmap with only
ggplot2
.
library(universalmotif)
library(ggplot2)
data(ArabidopsisMotif)
data(ArabidopsisPromoters)
res <- scan_sequences(ArabidopsisMotif, ArabidopsisPromoters,
threshold = 0, threshold.type = "logodds.abs")
res <- suppressWarnings(as.data.frame(res))
res$x <- mapply(function(x, y) mean(c(x, y)), res$start, res$stop)
ggplot(res, aes(x, sequence, fill = score)) +
scale_fill_viridis_c() +
scale_x_continuous(expand = c(0, 0)) +
xlim(0, 1000) +
xlab(element_blank()) +
ylab(element_blank()) +
geom_tile(width = ncol(ArabidopsisMotif)) +
theme_bw() +
theme(panel.grid = element_blank(), axis.text.y = element_text(size = 6))
Using packages such as ggExtra
or ggpubr
,
one could even plot marginal histogram or density plots above or below
to illustrate any motif positional preference within the sequences.
(Though keep in mind that the hit coordinates and sequence lengths would
need to be normalized if not all sequences were of the same length, as
they are here.)
Finally, the distribution of all possible motif scores could be shown as a line plot across the sequences.
library(universalmotif)
library(ggplot2)
data(ArabidopsisMotif)
data(ArabidopsisPromoters)
res <- scan_sequences(ArabidopsisMotif, ArabidopsisPromoters[1:5],
threshold = -Inf, threshold.type = "logodds.abs")
res <- suppressWarnings(as.data.frame(res))
res$position <- mapply(function(x, y) mean(c(x, y)), res$start, res$stop)
ggplot(res, aes(position, score, colour = score)) +
geom_line() +
geom_hline(yintercept = 0, colour = "red", alpha = 0.3) +
theme_bw() +
scale_colour_viridis_c() +
facet_wrap(~sequence, ncol = 1) +
xlab(element_blank()) +
ylab(element_blank()) +
theme(strip.background = element_blank())
The universalmotif
package offers the ability to search
for enriched motif sites in a set of sequences via
enrich_motifs()
. There is little complexity to this, as it
simply runs scan_sequences()
twice: once on a set of target
sequences, and once on a set of background sequences. After which the
results between the two sequences are collated and run through
enrichment tests. The background sequences can be given explicitly, or
else enrich_motifs()
will create background sequences on
its own by using shuffle_sequences()
on the target
sequences.
Let us consider the following basic example:
library(universalmotif)
data(ArabidopsisMotif)
data(ArabidopsisPromoters)
enrich_motifs(ArabidopsisMotif, ArabidopsisPromoters, shuffle.k = 3,
threshold = 0.001, RC = TRUE)
#> DataFrame with 1 row and 15 columns
#> motif motif.i motif.consensus target.hits target.seq.hits
#> <character> <integer> <character> <integer> <integer>
#> 1 YTTTYTTTTTYTTTY 1 YTYTYTTYTTYTTTY 244 50
#> target.seq.count bkg.hits bkg.seq.hits bkg.seq.count Pval Qval
#> <integer> <integer> <integer> <integer> <numeric> <numeric>
#> 1 50 144 46 50 2.27389e-07 2.27389e-07
#> Eval pct.target.seq.hits pct.bkg.seq.hits target.enrichment
#> <numeric> <numeric> <numeric> <numeric>
#> 1 4.54779e-07 100 92 1.08696
Here we can see that the motif is significantly enriched in the
target sequences. The Pval
was calculated by calling
stats::fisher.test()
.
One final point: always keep in mind the threshold
parameter, as this will ultimately decide the number of hits found. (A
bad threshold can lead to a false negative.)
universalmotif
class motifs can be gapped, which can be
used by scan_sequences()
and enrich_motifs()
.
Note that gapped motif support is currently limited to these two
functions. All other functions will ignore the gap information, and even
discard them in functions such as merge_motifs()
.
First, obtain the component motifs:
library(universalmotif)
data(ArabidopsisPromoters)
m1 <- create_motif("TTTATAT", name = "PartA")
m2 <- create_motif("GGTTCGA", name = "PartB")
Then, combine them and add the desired gap. In this case, a gap will be added between the two motifs which can range in size from 4-6 bases.
m <- cbind(m1, m2)
m <- add_gap(m, gaploc = ncol(m1), mingap = 4, maxgap = 6)
m
#>
#> Motif name: PartA/PartB
#> Alphabet: DNA
#> Type: PCM
#> Strands: +-
#> Total IC: 28
#> Pseudocount: 0
#> Consensus: TTTATAT..GGTTCGA
#> Target sites: 1
#> Gap locations: 7-8
#> Gap sizes: 4-6
#>
#> T T T A T A T G G T T C G A
#> A 0 0 0 1 0 1 0 .. 0 0 0 0 0 0 1
#> C 0 0 0 0 0 0 0 .. 0 0 0 0 1 0 0
#> G 0 0 0 0 0 0 0 .. 1 1 0 0 0 1 0
#> T 1 1 1 0 1 0 1 .. 0 0 1 1 0 0 0
Now, it can be used directly in scan_sequences()
or
enrich_motifs()
:
scan_sequences(m, ArabidopsisPromoters, threshold = 0.4, threshold.type = "logodds")
#> DataFrame with 75 rows and 15 columns
#> motif motif.i sequence sequence.i start stop score
#> <character> <integer> <character> <integer> <integer> <integer> <numeric>
#> 1 PartA/PartB 1 AT1G03850 4 376 394 11.178
#> 2 PartA/PartB 1 AT1G03850 4 414 432 12.168
#> 3 PartA/PartB 1 AT1G06160 48 144 161 11.918
#> 4 PartA/PartB 1 AT1G12990 28 71 90 11.428
#> 5 PartA/PartB 1 AT1G19380 2 226 245 11.428
#> ... ... ... ... ... ... ... ...
#> 71 PartA/PartB 1 AT5G22690 46 638 656 11.178
#> 72 PartA/PartB 1 AT5G47230 24 91 110 12.418
#> 73 PartA/PartB 1 AT5G47230 24 449 468 11.428
#> 74 PartA/PartB 1 AT5G64310 22 869 888 11.428
#> 75 PartA/PartB 1 AT5G64310 22 909 927 11.178
#> match thresh.score min.score max.score score.pct strand
#> <character> <numeric> <numeric> <numeric> <numeric> <character>
#> 1 TATATGT.....GGTGCAA 11.1384 -93.212 27.846 40.1422 +
#> 2 TTGATAT.....TGTTAGA 11.1384 -93.212 27.846 43.6975 +
#> 3 TTTATGT....GGTTTGT 11.1384 -93.212 27.846 42.7997 +
#> 4 GTTATGT......TGTTAGA 11.1384 -93.212 27.846 41.0400 +
#> 5 TTTACAG......CGTTCGT 11.1384 -93.212 27.846 41.0400 +
#> ... ... ... ... ... ... ...
#> 71 TTCATTT.....GGCTTGA 11.1384 -93.212 27.846 40.1422 +
#> 72 TTTATAC......TGTTCCA 11.1384 -93.212 27.846 44.5953 +
#> 73 TATATGT......GGGTCAA 11.1384 -93.212 27.846 41.0400 +
#> 74 ATAATAT......CGTTAGA 11.1384 -93.212 27.846 41.0400 +
#> 75 TTCATAT.....GTCACGA 11.1384 -93.212 27.846 40.1422 +
#> pvalue qvalue
#> <numeric> <numeric>
#> 1 1.60187e-07 0.000105403
#> 2 1.60187e-07 0.000105403
#> 3 1.60187e-07 0.000105403
#> 4 1.60187e-07 0.000105403
#> 5 1.60187e-07 0.000105403
#> ... ... ...
#> 71 1.60187e-07 0.000105403
#> 72 1.60187e-07 0.000105403
#> 73 1.60187e-07 0.000105403
#> 74 1.60187e-07 0.000105403
#> 75 1.60187e-07 0.000105403
Highly-repetitive low complexity regions can oftentimes cause
problems during de novo motif discovery, leading to obviously
false motifs being returned. One way to get around this issue is to
preemptively remove or mask these regions. The
universalmotif
package includes a few functions which can
help carry out this task.
Using mask_seqs()
, one can mask a specific pattern of
letters in XStringSet
objects. Consider the following
sequences:
library(universalmotif)
library(Biostrings)
Ex.seq <- DNAStringSet(c(
A = "GTTGAAAAAAAAAAAAAAAACAGACGT",
B = "TTAGATGGCCCATAGCTTATACGGCAA",
C = "AATAAAATGCTTAGGAAATCGATTGCC"
))
We can easily mask portions that contain, say, stretches of at least
8 A
s:
mask_seqs(Ex.seq, "AAAAAAAA")
#> DNAStringSet object of length 3:
#> width seq names
#> [1] 27 GTTG----------------CAGACGT A
#> [2] 27 TTAGATGGCCCATAGCTTATACGGCAA B
#> [3] 27 AATAAAATGCTTAGGAAATCGATTGCC C
Alternatively, instead of masking a know stretch of letters one can
find low complexity regions using sequence_complexity()
,
and then mask specific regions in the sequences using
mask_ranges()
. The sequence_complexity()
function has several complexity metrics available: the Wootton-Federhen
(Wootton and Federhen 1993) and Trifonov
(Trifonov 1990) algorithms (and their
approximations) are well described in Orlov and
Potapov (2004), and DUST in Morgulis et
al. (2006). See ?sequence_complexity
for more
details.
(Ex.DUST <- sequence_complexity(Ex.seq, window.size = 10, method = "DUST",
return.granges = TRUE))
#> GRanges object with 15 ranges and 1 metadata column:
#> seqnames ranges strand | complexity
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] A 1-10 * | 0.857143
#> [2] A 6-15 * | 4.000000
#> [3] A 11-20 * | 4.000000
#> [4] A 16-25 * | 0.428571
#> [5] A 21-27 * | 0.000000
#> ... ... ... ... . ...
#> [11] C 1-10 * | 0.285714
#> [12] C 6-15 * | 0.000000
#> [13] C 11-20 * | 0.000000
#> [14] C 16-25 * | 0.000000
#> [15] C 21-27 * | 0.000000
#> -------
#> seqinfo: 3 sequences from an unspecified genome
Using the DUST algorithm, we can see there are a couple of regions
which spike in the complexity score (for this particular algorithm, more
complex sequences converge towards zero). Now it is only a matter of
filtering for those regions and using mask_ranges()
.
(Ex.DUST <- Ex.DUST[Ex.DUST$complexity >= 3])
#> GRanges object with 2 ranges and 1 metadata column:
#> seqnames ranges strand | complexity
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] A 6-15 * | 4
#> [2] A 11-20 * | 4
#> -------
#> seqinfo: 3 sequences from an unspecified genome
mask_ranges(Ex.seq, Ex.DUST)
#> DNAStringSet object of length 3:
#> width seq names
#> [1] 27 GTTGA---------------CAGACGT A
#> [2] 27 TTAGATGGCCCATAGCTTATACGGCAA B
#> [3] 27 AATAAAATGCTTAGGAAATCGATTGCC C
Now these sequences could be used directly with
scan_sequences()
or written to a fasta file using
Biostrings::writeXStringSet()
for use with an external
de novo motif discovery program such as MEME.
Note: In the time since the inception of the
run_meme()
function, Spencer Nystrom (a contributor to
universalmotif
) has created the memes
package
as a interface to much of the MEME
suite. It is fully
interoperable with the universalmotif
package and provides
a much more convenient way to run MEME
programs from within
R
. Install it from Bioconductor with
BiocManager::install("memes")
.
The universalmotif
package provides a simple wrapper to
the powerful motif discovery tool MEME
(Bailey and Elkan 1994). To run an analysis with
MEME
, all that is required is a set of
XStringSet
class sequences (defined in the
Biostrings
package), and run_meme()
will take
care of running the program and reading the output for use within
R
.
The first step is to check that R can find the MEME
binary in your $PATH
by running run_meme()
without any parameters. If successful, you should see the default
MEME
help message in your console. If not, then you’ll need
to provide the complete path to the MEME
binary. There are
two options:
library(universalmotif)
## 1. Once per session: via `options()`
options(meme.bin = "/path/to/meme/bin/meme")
run_meme(...)
## 2. Once per run: via `run_meme()`
run_meme(..., bin = "/path/to/meme/bin/meme")
Now we need to get some sequences to use with
run_meme()
. At this point we can read sequences from disk
or extract them from one of the Bioconductor
BSgenome
packages.
library(universalmotif)
data(ArabidopsisPromoters)
## 1. Read sequences from disk (in fasta format):
library(Biostrings)
# The following `read*()` functions are available in Biostrings:
# DNA: readDNAStringSet
# DNA with quality scores: readQualityScaledDNAStringSet
# RNA: readRNAStringSet
# Amino acid: readAAStringSet
# Any: readBStringSet
sequences <- readDNAStringSet("/path/to/sequences.fasta")
run_meme(sequences, ...)
## 2. Extract from a `BSgenome` object:
library(GenomicFeatures)
library(TxDb.Athaliana.BioMart.plantsmart28)
library(BSgenome.Athaliana.TAIR.TAIR9)
# Let us retrieve the same promoter sequences from ArabidopsisPromoters:
gene.names <- names(ArabidopsisPromoters)
# First get the transcript coordinates from the relevant `TxDb` object:
transcripts <- transcriptsBy(TxDb.Athaliana.BioMart.plantsmart28,
by = "gene")[gene.names]
# There are multiple transcripts per gene, we only care for the first one
# in each:
transcripts <- lapply(transcripts, function(x) x[1])
transcripts <- unlist(GRangesList(transcripts))
# Then the actual sequences:
# Unfortunately this is a case where the chromosome names do not match
# between the two databases
seqlevels(TxDb.Athaliana.BioMart.plantsmart28)
#> [1] "1" "2" "3" "4" "5" "Mt" "Pt"
seqlevels(BSgenome.Athaliana.TAIR.TAIR9)
#> [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"
# So we must first rename the chromosomes in `transcripts`:
seqlevels(transcripts) <- seqlevels(BSgenome.Athaliana.TAIR.TAIR9)
# Finally we can extract the sequences
promoters <- getPromoterSeq(transcripts,
BSgenome.Athaliana.TAIR.TAIR9,
upstream = 1000, downstream = 0)
run_meme(promoters, ...)
Once the sequences are ready, there are few important options to keep
in mind. One is whether to conserve the output from MEME
.
The default is not to, but this can be changed by setting the relevant
option:
The second important option is the search function
(objfun
). Some search functions such as the default
classic
do not require a set of background sequences,
whilst some do (such as de
). If you choose one of the
latter, then you can either let MEME
create them for you
(it will shuffle the target sequences) or you can provide them via the
control.sequences
parameter.
Finally, choose how you’d like the data imported into R
.
Once the MEME
program exits, run_meme()
will
import the results into R
with read_meme()
. At
this point you can decide if you want just the motifs themselves
(readsites = FALSE
) or if you’d like the original sequence
sites as well (readsites = TRUE
, the default). Doing the
latter gives you the option of generating higher order representations
for the imported MEME
motifs as shown here:
There are a wealth of other MEME
options available, such
as the number of desired motifs (nmotifs
), the width of
desired motifs (minw
, maxw
), the search mode
(mod
), assigning sequence weights (weights
),
using a custom alphabet (alph
), and many others. See the
output from run_meme()
for a brief description of the
options, or visit the online manual for more
details.
Since biological sequences are usually contained in
XStringSet
class objects,
sequence_complexity()
, get_bkg()
and
shuffle_sequences()
are designed to work with such objects.
For cases when strings are not XStringSet
objects, the
following functions are available:
calc_complexity()
: alternative to
sequence_complexity()
count_klets()
: alternative to
get_bkg()
shuffle_string()
: alternative to
shuffle_sequences()
library(universalmotif)
string <- "DASDSDDSASDSSA"
calc_complexity(string)
#> [1] 0.7823323
count_klets(string, 2)
#> klets counts
#> 1 AA 0
#> 2 AD 0
#> 3 AS 2
#> 4 DA 1
#> 5 DD 1
#> 6 DS 3
#> 7 SA 2
#> 8 SD 3
#> 9 SS 1
shuffle_string(string, 2)
#> [1] "DSDSDDASSASDSA"
A few other utilities have also been made available (based on the
internal code of other universalmotif
functions) that work
on simple character vectors:
calc_windows()
: calculate the coordinates for sliding
windows from 1 to any number n
get_klets()
: get a list of all possible
k
-lets for any sequence alphabetslide_fun()
: apply a function over sliding windows
across a single stringwindow_string()
: retrieve characters from sliding
windows of a single stringlibrary(universalmotif)
calc_windows(n = 12, window = 4, overlap = 2)
#> start stop
#> 1 1 4
#> 2 3 6
#> 3 5 8
#> 4 7 10
#> 5 9 12
get_klets(c("A", "S", "D"), 2)
#> [1] "AA" "AS" "AD" "SA" "SS" "SD" "DA" "DS" "DD"
slide_fun("ABCDEFGH", charToRaw, raw(2), window = 2, overlap = 1)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] 41 42 43 44 45 46 47
#> [2,] 42 43 44 45 46 47 48
window_string("ABCDEFGH", window = 2, overlap = 1)
#> [1] "AB" "BC" "CD" "DE" "EF" "FG" "GH"
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] ggbio_1.55.0 TFBSTools_1.45.0 cowplot_1.1.3
#> [4] dplyr_1.1.4 ggtree_3.15.0 ggplot2_3.5.1
#> [7] MotifDb_1.49.0 GenomicRanges_1.59.0 Biostrings_2.75.1
#> [10] GenomeInfoDb_1.43.0 XVector_0.47.0 IRanges_2.41.0
#> [13] S4Vectors_0.45.1 BiocGenerics_0.53.1 generics_0.1.3
#> [16] universalmotif_1.25.1 bookdown_0.41
#>
#> loaded via a namespace (and not attached):
#> [1] BiocIO_1.17.0 filelock_1.0.3
#> [3] bitops_1.0-9 ggplotify_0.1.2
#> [5] tibble_3.2.1 R.oo_1.27.0
#> [7] graph_1.85.0 XML_3.99-0.17
#> [9] rpart_4.1.23 DirichletMultinomial_1.49.0
#> [11] httr2_1.0.6 lifecycle_1.0.4
#> [13] pwalign_1.3.0 OrganismDbi_1.49.0
#> [15] ensembldb_2.31.0 lattice_0.22-6
#> [17] MASS_7.3-61 backports_1.5.0
#> [19] magrittr_2.0.3 Hmisc_5.2-0
#> [21] sass_0.4.9 rmarkdown_2.29
#> [23] jquerylib_0.1.4 yaml_2.3.10
#> [25] grImport2_0.3-3 DBI_1.2.3
#> [27] buildtools_1.0.0 CNEr_1.43.0
#> [29] RColorBrewer_1.1-3 ade4_1.7-22
#> [31] abind_1.4-8 zlibbioc_1.52.0
#> [33] purrr_1.0.2 R.utils_2.12.3
#> [35] AnnotationFilter_1.31.0 biovizBase_1.55.0
#> [37] RCurl_1.98-1.16 yulab.utils_0.1.8
#> [39] nnet_7.3-19 pracma_2.4.4
#> [41] rappdirs_0.3.3 VariantAnnotation_1.53.0
#> [43] GenomeInfoDbData_1.2.13 tidytree_0.4.6
#> [45] maketools_1.3.1 seqLogo_1.73.0
#> [47] annotate_1.85.0 codetools_0.2-20
#> [49] DelayedArray_0.33.1 xml2_1.3.6
#> [51] tidyselect_1.2.1 ggseqlogo_0.2
#> [53] aplot_0.2.3 UCSC.utils_1.3.0
#> [55] farver_2.1.2 BiocFileCache_2.15.0
#> [57] matrixStats_1.4.1 base64enc_0.1-3
#> [59] GenomicAlignments_1.43.0 jsonlite_1.8.9
#> [61] motifStack_1.51.0 Formula_1.2-5
#> [63] progress_1.2.3 tools_4.4.2
#> [65] treeio_1.31.0 TFMPvalue_0.0.9
#> [67] Rcpp_1.0.13-1 glue_1.8.0
#> [69] gridExtra_2.3 SparseArray_1.7.1
#> [71] xfun_0.49 MatrixGenerics_1.19.0
#> [73] withr_3.0.2 BiocManager_1.30.25
#> [75] fastmap_1.2.0 GGally_2.2.1
#> [77] fansi_1.0.6 caTools_1.18.3
#> [79] digest_0.6.37 R6_2.5.1
#> [81] gridGraphics_0.5-1 colorspace_2.1-1
#> [83] GO.db_3.20.0 gtools_3.9.5
#> [85] poweRlaw_0.80.0 biomaRt_2.63.0
#> [87] jpeg_0.1-10 dichromat_2.0-0.1
#> [89] RSQLite_2.3.7 R.methodsS3_1.8.2
#> [91] utf8_1.2.4 tidyr_1.3.1
#> [93] data.table_1.16.2 rtracklayer_1.67.0
#> [95] prettyunits_1.2.0 httr_1.4.7
#> [97] htmlwidgets_1.6.4 S4Arrays_1.7.1
#> [99] ggstats_0.7.0 pkgconfig_2.0.3
#> [101] gtable_0.3.6 blob_1.2.4
#> [103] sys_3.4.3 htmltools_0.5.8.1
#> [105] RBGL_1.83.0 ProtGenerics_1.39.0
#> [107] scales_1.3.0 Biobase_2.67.0
#> [109] png_0.1-8 ggfun_0.1.7
#> [111] knitr_1.49 rstudioapi_0.17.1
#> [113] tzdb_0.4.0 reshape2_1.4.4
#> [115] rjson_0.2.23 checkmate_2.3.2
#> [117] nlme_3.1-166 curl_6.0.0
#> [119] cachem_1.1.0 stringr_1.5.1
#> [121] parallel_4.4.2 foreign_0.8-87
#> [123] AnnotationDbi_1.69.0 restfulr_0.0.15
#> [125] pillar_1.9.0 grid_4.4.2
#> [127] vctrs_0.6.5 dbplyr_2.5.0
#> [129] xtable_1.8-4 cluster_2.1.6
#> [131] htmlTable_2.4.3 evaluate_1.0.1
#> [133] readr_2.1.5 GenomicFeatures_1.59.1
#> [135] cli_3.6.3 compiler_4.4.2
#> [137] Rsamtools_2.23.0 rlang_1.1.4
#> [139] crayon_1.5.3 labeling_0.4.3
#> [141] plyr_1.8.9 fs_1.6.5
#> [143] stringi_1.8.4 viridisLite_0.4.2
#> [145] BiocParallel_1.41.0 txdbmaker_1.3.0
#> [147] munsell_0.5.1 lazyeval_0.2.2
#> [149] Matrix_1.7-1 BSgenome_1.75.0
#> [151] hms_1.1.3 patchwork_1.3.0
#> [153] bit64_4.5.2 KEGGREST_1.47.0
#> [155] SummarizedExperiment_1.37.0 memoise_2.0.1
#> [157] bslib_0.8.0 bit_4.5.0
#> [159] splitstackshape_1.4.8 ape_5.8