Motif import, export, and manipulation

Introduction

This vignette will introduce the universalmotif class and its structure, the import and export of motifs in R, basic motif manipulation, creation, and visualization. For an introduction to sequence motifs, see the introductory vignette. For sequence-related utilities, see the sequences vignette. For motif comparisons and P-values, see the motif comparisons and P-values vignette.

The universalmotif class and conversion utilities

The universalmotif class

The universalmotif package stores motifs using the universalmotif class. The most basic universalmotif object exposes the name, alphabet, type, type, strand, icscore, consensus, and motif slots; furthermore, the pseudocount and bkg slots are also stored but not shown. universalmotif class motifs can be PCM, PPM, PWM, or ICM type.

library(universalmotif)
data(examplemotif)
examplemotif
#> 
#>        Motif name:   motif
#>          Alphabet:   DNA
#>              Type:   PPM
#>           Strands:   +-
#>          Total IC:   11.54
#>       Pseudocount:   1
#>         Consensus:   TATAWAW
#> 
#>   T A T A   W A   W
#> A 0 1 0 1 0.5 1 0.5
#> C 0 0 0 0 0.0 0 0.0
#> G 0 0 0 0 0.0 0 0.0
#> T 1 0 1 0 0.5 0 0.5

A brief description of all the available slots:

  • name: motif name
  • altname: (optional) alternative motif name
  • family: (optional) a word representing the transcription factor or matrix family
  • organism: (optional) organism of origin
  • motif: the actual motif matrix
  • alphabet: motif alphabet
  • type: motif ‘type’, one of PCM, PPM, PWM, ICM; see the introductory vignette
  • icscore: (generated automatically) Sum of information content for the motif
  • nsites: (optional) number of sites the motif was created from
  • pseudocount: this value to added to the motif matrix during certain type conversions; this is necessary to avoid -Inf values from appearing in PWM type motifs
  • bkg: a named vector of probabilities which represent the background letter frequencies
  • bkgsites: (optional) total number of background sequences from motif creation
  • consensus: (generated automatically) for DNA/RNA/AA motifs, the motif consensus
  • strand: strand motif can be found on
  • pval: (optional) P-value from de novo motif search
  • qval: (optional) Q-value from de novo motif search
  • eval: (optional) E-value from de novo motif search
  • multifreq: (optional) higher-order motif representations.
  • extrainfo: (optional) any extra motif information that cannot fit in the existing slots

The other slots will be shown as they are filled.

library(universalmotif)
data(examplemotif)

## The various slots can be accessed individually using `[`

examplemotif["consensus"]
#> [1] "TATAWAW"

## To change a slot, use `[<-`

examplemotif["family"] <- "My motif family"
examplemotif
#> 
#>        Motif name:   motif
#>            Family:   My motif family
#>          Alphabet:   DNA
#>              Type:   PPM
#>           Strands:   +-
#>          Total IC:   11.54
#>       Pseudocount:   1
#>         Consensus:   TATAWAW
#> 
#>   T A T A   W A   W
#> A 0 1 0 1 0.5 1 0.5
#> C 0 0 0 0 0.0 0 0.0
#> G 0 0 0 0 0.0 0 0.0
#> T 1 0 1 0 0.5 0 0.5

Though the slots can easily be changed manually with [<-, a number of safeguards have been put in place for some of the slots which will prevent incorrect values from being introduced.

library(universalmotif)
data(examplemotif)

## The consensus slot is dependent on the motif matrix

examplemotif["consensus"]
#> [1] "TATAWAW"

## Changing this would mean it no longer matches the motif

examplemotif["consensus"] <- "GGGAGAG"
#> Error in .local(x, i, ..., value): this slot is unmodifiable with [<-

## Another example of trying to change a protected slot:

examplemotif["strand"] <- "x"
#> Error: * strand must be one of +, -, +-

Below the exposed metadata slots, the actual ‘motif’ matrix is shown. Each position is its own column: row names showing the alphabet letters, and the column names showing the consensus letter at each position.

Converting to and from another package’s class

The universalmotif package aims to unify most of the motif-related Bioconductor packages by providing the convert_motifs() function. This allows for easy transition between supported packages (see ?convert_motifs for a complete list of supported packages). Should you ever come across a motif class from another Bioconductor package which is not supported by the universalmotif package, but believe it should be, then feel free to bring it up with me.

The convert_motifs function is embedded in most of the universalmotif functions, meaning that compatible motif classes from other packages can be used without needed to manually convert them first. However keep in mind some conversions are final. Furthermore, internally, all motifs regardless of class are handled as universalmotif objects, even if the returning class is not. This will result in at times slightly different objects (though usually no information should be lost).

library(universalmotif)
library(MotifDb)
data(examplemotif)
data(MA0003.2)

## convert from a `universalmotif` motif to another class

convert_motifs(examplemotif, "TFBSTools-PWMatrix")
#> Note: motif [motif] has an empty nsites slot, using 100.
#> An object of class PWMatrix
#> ID: 
#> Name: motif
#> Matrix Class: Unknown
#> strand: *
#> Pseudocounts: 1
#> Tags: 
#> list()
#> Background: 
#>    A    C    G    T 
#> 0.25 0.25 0.25 0.25 
#> Matrix: 
#>           T         A         T         A          W         A          W
#> A -6.658211  1.989247 -6.658211  1.989247  0.9928402  1.989247  0.9928402
#> C -6.658211 -6.658211 -6.658211 -6.658211 -6.6582115 -6.658211 -6.6582115
#> G -6.658211 -6.658211 -6.658211 -6.658211 -6.6582115 -6.658211 -6.6582115
#> T  1.989247 -6.658211  1.989247 -6.658211  0.9928402 -6.658211  0.9928402

## convert to universalmotif

convert_motifs(MA0003.2)
#> 
#>        Motif name:   TFAP2A
#>    Alternate name:   MA0003.2
#>            Family:   Helix-Loop-Helix
#>          Organism:   9606
#>          Alphabet:   DNA
#>              Type:   PCM
#>           Strands:   +
#>          Total IC:   12.9
#>       Pseudocount:   1
#>         Consensus:   NNNNGCCYSAGGSCA
#>      Target sites:   5098
#>        Extra info:   [centrality_logp] -4343 
#>                      [family] Helix-Loop-Helix
#>                      [medline] 10497269
#>                      ...
#> 
#>      N    N    N    N    G    C    C    Y    S    A    G    G    S    C    A
#> A 1387 2141  727 1517   56    0    0   62  346 3738  460    0  116  451 3146
#> C 1630 1060 1506  519 1199 5098 4762 1736 2729  236    0    0 1443 3672  690
#> G  851  792  884  985 3712    0    0   85 1715  920 4638 5098 3455  465  168
#> T 1230 1105 1981 2077  131    0  336 3215  308  204    0    0   84  510 1094

## convert between two packages

convert_motifs(MotifDb[1], "TFBSTools-ICMatrix")
#> Note: motif [ABF2] has an empty nsites slot, using 100.
#> [[1]]
#> An object of class ICMatrix
#> ID: badis.ABF2
#> Name: ABF2
#> Matrix Class: Unknown
#> strand: *
#> Pseudocounts: 1
#> Schneider correction: FALSE
#> Tags: 
#> $dataSource
#> [1] "ScerTF"
#> 
#> Background: 
#>    A    C    G    T 
#> 0.25 0.25 0.25 0.25 
#> Matrix: 
#>            T          C          T          A          G          A
#> A 0.08997357 0.02119039 0.02119039 1.64861232 0.02119039 1.43716039
#> C 0.08997357 1.64861232 0.02119039 0.02119039 0.02119039 0.03430887
#> G 0.02188546 0.02119039 0.02119039 0.02119039 1.64861232 0.03430887
#> T 0.78058151 0.02119039 1.64861232 0.02119039 0.02119039 0.03430887

Importing and exporting motifs

Importing

The universalmotif package offers a number of read_*() functions to allow for easy import of various motif formats. These include:

  • read_cisbp(): CIS-BP (Weirauch et al. 2014)
  • read_homer(): HOMER (Heinz et al. 2010)
  • read_jaspar(): JASPAR (Khan et al. 2018)
  • read_matrix(): generic reader for simply formatted motifs
  • read_meme(): MEME (Bailey et al. 2009)
  • read_motifs(): native universalmotif format (not recommended; use saveRDS() instead)
  • read_transfac(): TRANSFAC (Wingender et al. 1996)
  • read_uniprobe(): UniPROBE (Hume et al. 2015)

These functions should work natively with these formats, but if you are generating your own motifs in one of these formats than it must adhere quite strictly to the format. An example of each of these is included in this package (see system.file("extdata", package="universalmotif")). If you know of additional motif formats which are not supported in the universalmotif package that you believe should be, or of any mistakes in the way the universalmotif package parses supported formats, then please let me know.

Exporting

Compatible motif classes can be written to disk using:

  • write_homer()
  • write_jaspar()
  • write_matrix()
  • write_meme()
  • write_motifs()
  • write_transfac()

The write_matrix() function, similar to its read_matrix() counterpart, can write motifs as simple matrices with an optional header. Additionally, please keep in mind format limitations. For example, multiple MEME motifs written to a single file will all share the same alphabet, with identical background letter frequencies.

Motif creation

Though universalmotif class motifs can be created using the new constructor, the universalmotif package provides the create_motif() function which aims to provide a simpler interface to motif creation. The universalmotif class was initially designed to work natively with DNA, RNA, and amino acid motifs. Currently though, it can handle any custom alphabet just as easily. The only downsides to custom alphabets is the lack of support for certain slots such as the consensus and strand slots.

The create_motif() function will be introduced here only briefly; see ?create_motif for details.

From a PCM/PPM/PWM/ICM matrix

Should you wish to make use of the universalmotif functions starting from a motif class unsupported by convert_motifs(), you can instead manually create universalmotif class motifs using the create_motif() function and the motif matrix.

motif.matrix <- matrix(c(0.7, 0.1, 0.1, 0.1,
                         0.7, 0.1, 0.1, 0.1,
                         0.1, 0.7, 0.1, 0.1,
                         0.1, 0.7, 0.1, 0.1,
                         0.1, 0.1, 0.7, 0.1,
                         0.1, 0.1, 0.7, 0.1,
                         0.1, 0.1, 0.1, 0.7,
                         0.1, 0.1, 0.1, 0.7), nrow = 4)

motif <- create_motif(motif.matrix, alphabet = "RNA", name = "My motif",
                      pseudocount = 1, nsites = 20, strand = "+")

## The 'type', 'icscore' and 'consensus' slots will be filled for you

motif
#> 
#>        Motif name:   My motif
#>          Alphabet:   RNA
#>              Type:   PPM
#>           Strands:   +
#>          Total IC:   4.68
#>       Pseudocount:   1
#>         Consensus:   AACCGGUU
#>      Target sites:   20
#> 
#>     A   A   C   C   G   G   U   U
#> A 0.7 0.7 0.1 0.1 0.1 0.1 0.1 0.1
#> C 0.1 0.1 0.7 0.7 0.1 0.1 0.1 0.1
#> G 0.1 0.1 0.1 0.1 0.7 0.7 0.1 0.1
#> U 0.1 0.1 0.1 0.1 0.1 0.1 0.7 0.7

As a brief aside: if you have a motif formatted simply as a matrix, you can still use it with the universalmotif package functions natively without creating a motif with create_motif(), as convert_motifs() also has the ability to handle motifs formatted simply as matrices. However it is much safer to first format the motif beforehand with create_motif().

From sequences or character strings

If all you have is a particular consensus sequence in mind, you can easily create a full motif using create_motif(). This can be convenient if you’d like to create a quick motif to use with an external program such as from the MEME suite or HOMER. Note that ambiguity letters can be used with single strings.

motif <- create_motif("CCNSNGG", nsites = 50, pseudocount = 1)

## Now to disk:
## write_meme(motif, "meme_motif.txt")

motif
#> 
#>        Motif name:   motif
#>          Alphabet:   DNA
#>              Type:   PPM
#>           Strands:   +-
#>          Total IC:   8.39
#>       Pseudocount:   1
#>         Consensus:   CCNSNGG
#>      Target sites:   50
#> 
#>      C    C    N   S    N    G    G
#> A 0.00 0.00 0.22 0.0 0.22 0.00 0.00
#> C 0.99 0.99 0.26 0.5 0.26 0.00 0.00
#> G 0.00 0.00 0.26 0.5 0.26 0.99 0.99
#> T 0.00 0.00 0.26 0.0 0.26 0.00 0.00

Generating random motifs

If you wish to, it’s easy to create random motifs. The values within the motif are generated using rgamma() to avoid creating low information content motifs. If background probabilities are not provided, then they are generated with rpois().

create_motif()
#> 
#>        Motif name:   motif
#>          Alphabet:   DNA
#>              Type:   PPM
#>           Strands:   +-
#>          Total IC:   10.94
#>       Pseudocount:   0
#>         Consensus:   GGCWCCSMRA
#> 
#>      G G C    W    C    C    S    M    R    A
#> A 0.05 0 0 0.36 0.13 0.09 0.00 0.39 0.38 0.96
#> C 0.02 0 1 0.03 0.62 0.67 0.33 0.50 0.11 0.02
#> G 0.93 1 0 0.18 0.24 0.02 0.65 0.08 0.46 0.01
#> T 0.00 0 0 0.42 0.01 0.22 0.02 0.03 0.04 0.00

You can change the probabilities used to generate the values within the motif matrix:

create_motif(bkg = c(A = 0.2, C = 0.4, G = 0.2, T = 0.2))
#> 
#>        Motif name:   motif
#>          Alphabet:   DNA
#>              Type:   PPM
#>           Strands:   +-
#>          Total IC:   13.17
#>       Pseudocount:   0
#>         Consensus:   CGACWATGRA
#> 
#>      C    G    A    C    W    A    T    G   R    A
#> A 0.00 0.04 0.88 0.00 0.27 0.83 0.00 0.01 0.5 0.83
#> C 0.97 0.01 0.04 0.72 0.00 0.10 0.00 0.22 0.1 0.00
#> G 0.03 0.95 0.08 0.08 0.00 0.00 0.00 0.77 0.4 0.01
#> T 0.00 0.00 0.00 0.20 0.72 0.07 0.99 0.00 0.0 0.16

With a custom alphabet:

create_motif(alphabet = "QWERTY")
#> 
#>        Motif name:   motif
#>          Alphabet:   EQRTWY
#>              Type:   PPM
#>          Total IC:   11.51
#>       Pseudocount:   0
#> 
#>   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> E 0.40 0.00 0.00 0.22 0.00 0.48 0.73 0.06 0.00  0.01
#> Q 0.04 0.16 0.10 0.17 0.00 0.44 0.00 0.01 0.00  0.00
#> R 0.03 0.19 0.01 0.00 0.13 0.00 0.00 0.03 0.19  0.00
#> T 0.28 0.13 0.79 0.00 0.68 0.00 0.08 0.00 0.31  0.04
#> W 0.09 0.19 0.08 0.03 0.19 0.03 0.00 0.61 0.07  0.93
#> Y 0.17 0.34 0.03 0.58 0.00 0.06 0.19 0.30 0.43  0.01

Motif visualization

Motif logos

There are several packages which offer motif visualization capabilities, such as seqLogo, motifStack, and ggseqlogo. The universalmotif package has its own implementation via the function view_motifs(), which renders motifs using the ggplot2 package (similar to ggseqlogo). Here I will briefly show how to use these to visualize universalmotif class motifs.

library(universalmotif)
data(examplemotif)

## With the native `view_motifs` function:
view_motifs(examplemotif)

The view_motifs() function generates ggplot objects; feel free to manipulate them as such. For example, flipping the position numbers for larger motifs (where the text spacing can become tight):

view_motifs(create_motif(15)) +
  ggplot2::theme(
    axis.text.x = ggplot2::element_text(angle = 90, hjust = 1)
  )

A large number of options are available for tuning the way motifs are plotted in view_motifs(). Visit the documentation for more information.

Using the other Bioconductor packages to view universalmotif motifs is fairly easy as well:

## For all the following examples, simply passing the functions a PPM is
## sufficient
motif <- convert_type(examplemotif, "PPM")
## Only need the matrix itself
motif <- motif["motif"]

## seqLogo:
seqLogo::seqLogo(motif)


## motifStack:
motifStack::plotMotifLogo(motif)
#> also installing the dependency 'jpeg'
#> Loading required namespace: Cairo
#> Warning in checkValidSVG(doc, warn = warn): This picture may not have been
#> generated by Cairo graphics; errors may result
#> Loading required namespace: Cairo
#> Warning in checkValidSVG(doc, warn = warn): This picture may not have been
#> generated by Cairo graphics; errors may result
#> Loading required namespace: Cairo
#> Warning in checkValidSVG(doc, warn = warn): This picture may not have been
#> generated by Cairo graphics; errors may result
#> Loading required namespace: Cairo
#> Warning in checkValidSVG(doc, warn = warn): This picture may not have been
#> generated by Cairo graphics; errors may result


## ggseqlogo:
ggseqlogo::ggseqlogo(motif)
#> Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
#> of ggplot2 3.3.4.
#> ℹ The deprecated feature was likely used in the ggseqlogo package.
#>   Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

Stacked motif logos

The motifStack package allows for a number of different motif stacking visualizations. The universalmotif package, while not capable of emulating most of these, still offers basic stacking via view_motifs(). The motifs are aligned using compare_motifs().

library(universalmotif)
library(MotifDb)

motifs <- convert_motifs(MotifDb[50:54])
view_motifs(motifs, show.positions.once = FALSE, names.pos = "right")

Plot arbitrary text logos

The logo plotting capabilities of view_motifs() can be used for any kind of arbitrary text logo. All you need is a numeric matrix (the heights of the characters), with the desired characters as row names. The following example is taken from the view_logo() documentation.

library(universalmotif)
data(examplemotif)

## Start from a numeric matrix:
toplot <- examplemotif["motif"]

# Adjust the character heights as you wish (negative values are possible):
toplot[4] <- 2
toplot[20] <- -0.5

# Mix and match the number of characters per letter/position:
rownames(toplot)[1] <- "AA"

toplot <- toplot[c(1, 4), ]

toplot
#>    T A T A    W A   W
#> AA 0 1 0 1  0.5 1 0.5
#> T  2 0 1 0 -0.5 0 0.5

view_logo(toplot)

Higher-order motifs

Though PCM, PPM, PWM, and ICM type motifs are still widely used today, a few ‘next generation’ motif formats have been proposed. These wish to add another layer of information to motifs: positional interdependence. To illustrate this, consider the following sequences:

(#tab:seqs2) Example sequences.
# Sequence
1 CAAAACC
2 CAAAACC
3 CAAAACC
4 CTTTTCC
5 CTTTTCC
6 CTTTTCC

This becomes the following PPM:

(#tab:ppm2) Position Probability Matrix.
Position 1 2 3 4 5 6 7
A 0.0 0.5 0.5 0.5 0.5 0.0 0.0
C 1.0 0.0 0.0 0.0 0.0 1.0 1.0
G 0.0 0.0 0.0 0.0 0.0 0.0 0.0
T 0.0 0.5 0.5 0.5 0.5 0.0 0.0

Based on the PPM representation, all three of CAAAACC, CTTTTCC, and CTATACC are equally likely. Though looking at the starting sequences, should CTATACC really be considered so? For transcription factor binding sites, this sometimes is not the case. By incorporating this type of information into the motif, it can allow for increased accuracy in motif searching. A few example implementations of this include: TFFM by Mathelier and Wasserman (2013), BaMM by Siebert and Soding (2016), and KSM by Guo et al. (2018).

The universalmotif package implements its own, rather simplified, version of this concept. Plainly, the standard PPM has been extended to include k-letter frequencies, with k being any number higher than 1. For example, the 2-letter version of the table @ref(tab:ppm2) motif would be:

(#tab:multi) 2-letter probability matrix.
Position 1 2 3 4 5 6
AA 0.0 0.5 0.5 0.5 0.0 0.0
AC 0.0 0.0 0.0 0.0 0.5 0.0
AG 0.0 0.0 0.0 0.0 0.0 0.0
AT 0.0 0.0 0.0 0.0 0.0 0.0
CA 0.5 0.0 0.0 0.0 0.0 0.0
CC 0.0 0.0 0.0 0.0 0.0 1.0
CG 0.0 0.0 0.0 0.0 0.0 0.0
CT 0.5 0.0 0.0 0.0 0.0 0.0
GA 0.0 0.0 0.0 0.0 0.0 0.0
GC 0.0 0.0 0.0 0.0 0.0 0.0
GG 0.0 0.0 0.0 0.0 0.0 0.0
GT 0.0 0.0 0.0 0.0 0.0 0.0
TA 0.0 0.0 0.0 0.0 0.0 0.0
TC 0.0 0.0 0.0 0.0 0.5 0.0
TG 0.0 0.0 0.0 0.0 0.0 0.0
TT 0.0 0.5 0.5 0.5 0.0 0.0

This format shows the probability of each letter combined with the probability of the letter in the next position. The seventh column has been dropped, since it is not needed: the information in the sixth column is sufficient, and there is no eighth position to draw 2-letter probabilities from. Now, the probability of getting CTATACC is no longer equal to CTTTTCC and CAAAACC. This information is kept in the multifreq slot of universalmotif class motifs. To add this information, use the add_multifreq() function.

library(universalmotif)

motif <- create_motif("CWWWWCC", nsites = 6)
sequences <- DNAStringSet(rep(c("CAAAACC", "CTTTTCC"), 3))
motif.k2 <- add_multifreq(motif, sequences, add.k = 2)

## Alternatively:
# motif.k2 <- create_motif(sequences, add.multifreq = 2)

motif.k2
#> 
#>        Motif name:   motif
#>          Alphabet:   DNA
#>              Type:   PPM
#>           Strands:   +-
#>          Total IC:   10
#>       Pseudocount:   0
#>         Consensus:   CWWWWCC
#>      Target sites:   6
#>    k-letter freqs:   2
#> 
#>   C   W   W   W   W C C
#> A 0 0.5 0.5 0.5 0.5 0 0
#> C 1 0.0 0.0 0.0 0.0 1 1
#> G 0 0.0 0.0 0.0 0.0 0 0
#> T 0 0.5 0.5 0.5 0.5 0 0

To plot these motifs, use view_motifs():

view_motifs(motif.k2, use.freq = 2)

This information is most useful with functions such as scan_sequences() and enrich_motifs(). Though other tools in the universalmotif can work with multifreq motifs (such as motif_pvalue(), compare_motifs()), keep in mind they are not as well supported as regular motifs (getting P-values from multifreq motifs is exponentially slower, and P-values from using compare_motifs() for multifreq motifs are not available by default). See the sequences vignette for using scan_sequences() with the multifreq slot.

Tidy motif manipulation with the universalmotif_df data structure

For those who enjoy using the tidyverse functions for data handling, motifs can additionally represented as the modified data.frame format: universalmotif_df. This format allows one to modify motif slots for multiples motifs simultaneously using the universalmotif_df columns, and then return to a list of motifs afterwards to resume use with universalmotif package functions. A few key functions have been provided in relation to this format:

  • to_df(): Generate a universalmotif_df object from a list of motifs.
  • update_motifs(): After modifying the universalmotif_df object, apply these modifications to the actual universalmotif objects (contained within the motif column).
  • to_list(): Return to a list of universalmotif objects for use with universalmotif package functions. Note that it is not required to use update_motifs() before using to_list(), as modifications will be checked for and applied if found.
  • requires_update(): Boolean check as to whether the universalmotif objects and the universalmotif_df columns differ and require either a update_motifs() or to_list() call to re-sync them.
library(universalmotif)
library(MotifDb)

## Obtain a `universalmotif_df` object
motifs <- to_df(MotifDb)
head(motifs)
#>         motif  name     altname    organism consensus alphabet strand   icscore
#> 1  <mot:ABF2>  ABF2  badis.ABF2 Scerevisiae    TCTAGA      DNA     +-  9.371235
#> 2  <mot:CAT8>  CAT8  badis.CAT8 Scerevisiae    CCGGAN      DNA     +-  7.538740
#> 3  <mot:CST6>  CST6  badis.CST6 Scerevisiae    TGACGT      DNA     +-  9.801864
#> 4 <mot:ECM23> ECM23 badis.ECM23 Scerevisiae     AGATC      DNA     +-  6.567494
#> 5  <mot:EDS1>  EDS1  badis.EDS1 Scerevisiae   GGAANAA      DNA     +-  9.314287
#> 6  <mot:FKH2>  FKH2  badis.FKH2 Scerevisiae   GTAAACA      DNA     +- 11.525400
#>   type pseudocount          bkg dataSource
#> 1  PPM           1 0.25, 0.....     ScerTF
#> 2  PPM           1 0.25, 0.....     ScerTF
#> 3  PPM           1 0.25, 0.....     ScerTF
#> 4  PPM           1 0.25, 0.....     ScerTF
#> 5  PPM           1 0.25, 0.....     ScerTF
#> 6  PPM           1 0.25, 0.....     ScerTF
#> 
#> [Hidden empty columns: family, nsites, bkgsites, pval, qval, eval.]

Some tidy manipulation:

library(dplyr)

motifs <- motifs %>%
  mutate(bkg = case_when(
    organism == "Athaliana" ~ list(c(A = 0.32, C = 0.18, G = 0.18, T = 0.32)),
    TRUE ~ list(c(A = 0.25, C = 0.25, G = 0.25, T = 0.25))
  ))
head(filter(motifs, organism == "Athaliana"))
#>            motif      name    altname family  organism  consensus alphabet
#> 1 *  <mot:ORA59>     ORA59 M0005_1.02    AP2 Athaliana   MGCCGCCN      DNA
#> 2 *   <mot:WIN1>      WIN1 M0006_1.02    AP2 Athaliana NCRCCGCNNN      DNA
#> 3 * <mot:AT1G..> AT1G22985 M0007_1.02    AP2 Athaliana  NNGCCGNNN      DNA
#> 4 *   <mot:TEM1>      TEM1 M0008_1.02    AP2 Athaliana   WATGTTGC      DNA
#> 5 *  <mot:ERF11>     ERF11 M0009_1.02    AP2 Athaliana NNGCCGNNNN      DNA
#> 6 * <mot:RAP2.6>    RAP2.6 M0010_1.02    AP2 Athaliana   NNGCCGNN      DNA
#>   strand   icscore type pseudocount          bkg dataSource
#> 1     +- 11.351632  PPM           1 0.32, 0..... cisbp_1.02
#> 2     +-  6.509679  PPM           1 0.32, 0..... cisbp_1.02
#> 3     +-  5.155725  PPM           1 0.32, 0..... cisbp_1.02
#> 4     +- 11.182383  PPM           1 0.32, 0..... cisbp_1.02
#> 5     +-  5.148803  PPM           1 0.32, 0..... cisbp_1.02
#> 6     +-  4.227144  PPM           1 0.32, 0..... cisbp_1.02
#> 
#> [Hidden empty columns: nsites, bkgsites, pval, qval, eval.]
#> [Rows marked with * are changed. Run update_motifs() or to_list() to apply
#>   changes.]

Feel free to add columns as well. You can add 1d vectors which will be added to the extrainfo slots of motifs. (Note that they will be coerced to character vectors!)

motifs <- motifs %>%
  mutate(MotifIndex = 1:n())
head(motifs)
#>           motif  name     altname    organism consensus alphabet strand
#> 1 *  <mot:ABF2>  ABF2  badis.ABF2 Scerevisiae    TCTAGA      DNA     +-
#> 2 *  <mot:CAT8>  CAT8  badis.CAT8 Scerevisiae    CCGGAN      DNA     +-
#> 3 *  <mot:CST6>  CST6  badis.CST6 Scerevisiae    TGACGT      DNA     +-
#> 4 * <mot:ECM23> ECM23 badis.ECM23 Scerevisiae     AGATC      DNA     +-
#> 5 *  <mot:EDS1>  EDS1  badis.EDS1 Scerevisiae   GGAANAA      DNA     +-
#> 6 *  <mot:FKH2>  FKH2  badis.FKH2 Scerevisiae   GTAAACA      DNA     +-
#>     icscore type pseudocount          bkg dataSource MotifIndex
#> 1  9.371235  PPM           1 0.25, 0.....     ScerTF          1
#> 2  7.538740  PPM           1 0.25, 0.....     ScerTF          2
#> 3  9.801864  PPM           1 0.25, 0.....     ScerTF          3
#> 4  6.567494  PPM           1 0.25, 0.....     ScerTF          4
#> 5  9.314287  PPM           1 0.25, 0.....     ScerTF          5
#> 6 11.525400  PPM           1 0.25, 0.....     ScerTF          6
#> 
#> [Hidden empty columns: family, nsites, bkgsites, pval, qval, eval.]
#> [Rows marked with * are changed. Run update_motifs() or to_list() to apply
#>   changes.]

to_list(motifs)[[1]]
#> 
#>        Motif name:   ABF2
#>    Alternate name:   badis.ABF2
#>          Organism:   Scerevisiae
#>          Alphabet:   DNA
#>              Type:   PPM
#>           Strands:   +-
#>          Total IC:   9.37
#>       Pseudocount:   1
#>         Consensus:   TCTAGA
#>        Extra info:   [dataSource] ScerTF 
#>                      [MotifIndex] 1
#> 
#>      T    C    T    A    G    A
#> A 0.09 0.01 0.01 0.97 0.01 0.94
#> C 0.09 0.97 0.01 0.01 0.01 0.02
#> G 0.02 0.01 0.01 0.01 0.97 0.02
#> T 0.80 0.01 0.97 0.01 0.01 0.02

If during the course of your manipulation you’ve generated temporary columns which you wish to drop, you can set extrainfo = FALSE to discard all extra columns. Be careful though, this will discard any previously existing extrainfo data as well.

to_list(motifs, extrainfo = FALSE)[[1]]
#> Discarding unknown slot(s) 'dataSource', 'MotifIndex' (set `extrainfo=TRUE`
#>   to preserve these).
#> 
#>        Motif name:   ABF2
#>    Alternate name:   badis.ABF2
#>          Organism:   Scerevisiae
#>          Alphabet:   DNA
#>              Type:   PPM
#>           Strands:   +-
#>          Total IC:   9.37
#>       Pseudocount:   1
#>         Consensus:   TCTAGA
#> 
#>      T    C    T    A    G    A
#> A 0.09 0.01 0.01 0.97 0.01 0.94
#> C 0.09 0.97 0.01 0.01 0.01 0.02
#> G 0.02 0.01 0.01 0.01 0.97 0.02
#> T 0.80 0.01 0.97 0.01 0.01 0.02

Miscellaneous motif utilities

A number of convenience functions are included for manipulating motifs.

DNA/RNA/AA consensus functions

For DNA, RNA and AA motifs, the universalmotif will automatically generate a consensus string slot. Furthermore, create_motif() can generate motifs from consensus strings. The internal functions for these have been made available:

  • consensus_to_ppm()
  • consensus_to_ppmAA()
  • get_consensus()
  • get_consensusAA()
library(universalmotif)

get_consensus(c(A = 0.7, C = 0.1, G = 0.1, T = 0.1))
#> [1] "A"

consensus_to_ppm("G")
#> [1] 0.001 0.001 0.997 0.001

Filter through lists of motifs

Filter a list of motifs, using the universalmotif slots with filter_motifs().

library(universalmotif)
library(MotifDb)

## Let us extract all of the Arabidopsis and C. elegans motifs 

motifs <- filter_motifs(MotifDb, organism = c("Athaliana", "Celegans"))
#> motifs converted to class 'universalmotif'

## Only keeping motifs with sufficient information content and length:

motifs <- filter_motifs(motifs, icscore = 10, width = 10)

head(summarise_motifs(motifs))
#>        name    altname family  organism     consensus alphabet strand  icscore
#> 1      ERF1 M0025_1.02    AP2 Athaliana    NMGCCGCCRN      DNA     +- 12.40700
#> 2    ATERF6 M0027_1.02    AP2 Athaliana    NTGCCGGCGB      DNA     +- 11.77649
#> 3    ATCBF3 M0032_1.02    AP2 Athaliana    ATGTCGGYNN      DNA     +- 10.66970
#> 4 AT2G18300 M0155_1.02   bHLH Athaliana  NNNGCACGTGNN      DNA     +- 11.50133
#> 5   bHLH104 M0159_1.02   bHLH Athaliana    GGCACGTGCC      DNA     +- 16.05350
#> 6    hlh-16 M0173_1.02   bHLH  Celegans NNNCAATATKGNN      DNA     +- 10.32432
#>   nsites
#> 1     NA
#> 2     NA
#> 3     NA
#> 4     NA
#> 5     NA
#> 6     NA

Generate random motif matches

Get a random set of sequences which are created using the probabilities of the motif matrix, in effect generating motif sites, with sample_sites().

library(universalmotif)
data(examplemotif)

sample_sites(examplemotif)
#> DNAStringSet object of length 100:
#>       width seq
#>   [1]     7 TATATAA
#>   [2]     7 TATAAAT
#>   [3]     7 TATATAT
#>   [4]     7 TATATAT
#>   [5]     7 TATATAT
#>   ...   ... ...
#>  [96]     7 TATAAAT
#>  [97]     7 TATAAAT
#>  [98]     7 TATATAT
#>  [99]     7 TATATAT
#> [100]     7 TATAAAA

Motif shuffling

Shuffle a set of motifs with shuffle_motifs(). The original shuffling implementation is taken from the linear shuffling method of shuffle_sequences(), described in the sequences vignette.

library(universalmotif)
library(MotifDb)

motifs <- convert_motifs(MotifDb[1:50])
head(summarise_motifs(motifs))
#>    name     altname    organism consensus alphabet strand   icscore
#> 1  ABF2  badis.ABF2 Scerevisiae    TCTAGA      DNA     +-  9.371235
#> 2  CAT8  badis.CAT8 Scerevisiae    CCGGAN      DNA     +-  7.538740
#> 3  CST6  badis.CST6 Scerevisiae    TGACGT      DNA     +-  9.801864
#> 4 ECM23 badis.ECM23 Scerevisiae     AGATC      DNA     +-  6.567494
#> 5  EDS1  badis.EDS1 Scerevisiae   GGAANAA      DNA     +-  9.314287
#> 6  FKH2  badis.FKH2 Scerevisiae   GTAAACA      DNA     +- 11.525400

motifs.shuffled <- shuffle_motifs(motifs, k = 3)
head(summarise_motifs(motifs.shuffled))
#>               name consensus alphabet strand   icscore
#> 1  ABF2 [shuffled]    TGYHTA      DNA     +-  7.349461
#> 2  CAT8 [shuffled]    ACAAGG      DNA     +- 10.215587
#> 3  CST6 [shuffled]    ATGAAC      DNA     +-  7.191681
#> 4 ECM23 [shuffled]     RACAC      DNA     +-  6.945743
#> 5  EDS1 [shuffled]   CTAMKYC      DNA     +-  7.121778
#> 6  FKH2 [shuffled]   AGGTRTG      DNA     +-  7.666473

Scoring and match functions

Motif matches in a set of sequences are typically obtained using logodds scores. Several functions are exposed to reveal some of the internal work that goes on.

  • get_matches(): show all possible sequence matches above a certain score
  • get_scores(): obtain all possible scores from all possible sequence matches
  • motif_score(): translate score thresholds to logodds scores
  • prob_match(): return probabilities for sequence matches
  • score_match(): return logodds scores for sequence matches
library(universalmotif)
data(examplemotif)
examplemotif
#> 
#>        Motif name:   motif
#>          Alphabet:   DNA
#>              Type:   PPM
#>           Strands:   +-
#>          Total IC:   11.54
#>       Pseudocount:   1
#>         Consensus:   TATAWAW
#> 
#>   T A T A   W A   W
#> A 0 1 0 1 0.5 1 0.5
#> C 0 0 0 0 0.0 0 0.0
#> G 0 0 0 0 0.0 0 0.0
#> T 1 0 1 0 0.5 0 0.5

## Get the min and max possible scores:
motif_score(examplemotif)
#>      0%    100% 
#> -46.606  11.929

## Show matches above a score of 10:
get_matches(examplemotif, 10)
#> [1] "TATAAAA" "TATATAA" "TATAAAT" "TATATAT"

## Get the probability of a match:
prob_match(examplemotif, "TTTTTTT", allow.zero = FALSE)
#> [1] 6.103516e-05

## Score a specific sequence:
score_match(examplemotif, "TTTTTTT")
#> [1] -14.012

## Take a look at the distribution of scores:
plot(density(get_scores(examplemotif), bw = 5))

Type conversion functions

While convert_type() will take care of switching the current type for universalmotif objects, the individual type conversion functions are also available for personal use. These are:

  • icm_to_ppm()
  • pcm_to_ppm()
  • ppm_to_icm()
  • ppm_to_pcm()
  • ppm_to_pwm()
  • pwm_to_ppm()

These functions take a one dimensional vector. To use these for matrices:

library(universalmotif)

m <- create_motif(type = "PCM")["motif"]
m
#>    Y  A  T  A  K  K  G   G  Y  T
#> A  0 94  4 76 19  0  1   0  8  0
#> C 52  6  5  4  1  3  0   0 60 21
#> G 20  0 16 19 44 38 99 100  1  2
#> T 28  0 75  1 36 59  0   0 31 77

apply(m, 2, pcm_to_ppm)
#>         Y    A    T    A    K    K    G G    Y    T
#> [1,] 0.00 0.94 0.04 0.76 0.19 0.00 0.01 0 0.08 0.00
#> [2,] 0.52 0.06 0.05 0.04 0.01 0.03 0.00 0 0.60 0.21
#> [3,] 0.20 0.00 0.16 0.19 0.44 0.38 0.99 1 0.01 0.02
#> [4,] 0.28 0.00 0.75 0.01 0.36 0.59 0.00 0 0.31 0.77

Additionally, the position_icscore() can be used to get the total information content per position:

library(universalmotif)

position_icscore(c(0.7, 0.1, 0.1, 0.1))
#> [1] 0.6307803

Session info

#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] TFBSTools_1.45.0      cowplot_1.1.3         dplyr_1.1.4          
#>  [4] ggtree_3.15.0         ggplot2_3.5.1         MotifDb_1.49.0       
#>  [7] GenomicRanges_1.59.1  Biostrings_2.75.2     GenomeInfoDb_1.43.2  
#> [10] XVector_0.47.0        IRanges_2.41.2        S4Vectors_0.45.2     
#> [13] BiocGenerics_0.53.3   generics_0.1.3        universalmotif_1.25.1
#> [16] bookdown_0.41        
#> 
#> loaded via a namespace (and not attached):
#>   [1] sys_3.4.3                   jsonlite_1.8.9             
#>   [3] magrittr_2.0.3              farver_2.1.2               
#>   [5] rmarkdown_2.29              fs_1.6.5                   
#>   [7] BiocIO_1.17.1               zlibbioc_1.52.0            
#>   [9] vctrs_0.6.5                 memoise_2.0.1              
#>  [11] Rsamtools_2.23.1            RCurl_1.98-1.16            
#>  [13] base64enc_0.1-3             htmltools_0.5.8.1          
#>  [15] S4Arrays_1.7.1              curl_6.0.1                 
#>  [17] CNEr_1.43.0                 SparseArray_1.7.2          
#>  [19] gridGraphics_0.5-1          sass_0.4.9                 
#>  [21] pracma_2.4.4                bslib_0.8.0                
#>  [23] htmlwidgets_1.6.4           plyr_1.8.9                 
#>  [25] cachem_1.1.0                buildtools_1.0.0           
#>  [27] GenomicAlignments_1.43.0    lifecycle_1.0.4            
#>  [29] pkgconfig_2.0.3             Matrix_1.7-1               
#>  [31] R6_2.5.1                    fastmap_1.2.0              
#>  [33] GenomeInfoDbData_1.2.13     MatrixGenerics_1.19.0      
#>  [35] digest_0.6.37               aplot_0.2.3                
#>  [37] colorspace_2.1-1            TFMPvalue_0.0.9            
#>  [39] patchwork_1.3.0             AnnotationDbi_1.69.0       
#>  [41] RSQLite_2.3.9               seqLogo_1.73.0             
#>  [43] labeling_0.4.3              fansi_1.0.6                
#>  [45] httr_1.4.7                  abind_1.4-8                
#>  [47] compiler_4.4.2              bit64_4.5.2                
#>  [49] withr_3.0.2                 BiocParallel_1.41.0        
#>  [51] DBI_1.2.3                   R.utils_2.12.3             
#>  [53] MASS_7.3-61                 poweRlaw_0.80.0            
#>  [55] DelayedArray_0.33.3         rjson_0.2.23               
#>  [57] gtools_3.9.5                caTools_1.18.3             
#>  [59] tools_4.4.2                 splitstackshape_1.4.8      
#>  [61] ape_5.8                     ggseqlogo_0.2              
#>  [63] R.oo_1.27.0                 glue_1.8.0                 
#>  [65] restfulr_0.0.15             nlme_3.1-166               
#>  [67] grid_4.4.2                  ade4_1.7-22                
#>  [69] reshape2_1.4.4              BSgenome_1.75.0            
#>  [71] gtable_0.3.6                tzdb_0.4.0                 
#>  [73] R.methodsS3_1.8.2           tidyr_1.3.1                
#>  [75] data.table_1.16.4           hms_1.1.3                  
#>  [77] utf8_1.2.4                  pillar_1.9.0               
#>  [79] stringr_1.5.1               yulab.utils_0.1.8          
#>  [81] treeio_1.31.0               lattice_0.22-6             
#>  [83] rtracklayer_1.67.0          bit_4.5.0.1                
#>  [85] annotate_1.85.0             tidyselect_1.2.1           
#>  [87] DirichletMultinomial_1.49.0 GO.db_3.20.0               
#>  [89] maketools_1.3.1             knitr_1.49                 
#>  [91] grImport2_0.3-3             SummarizedExperiment_1.37.0
#>  [93] xfun_0.49                   Biobase_2.67.0             
#>  [95] matrixStats_1.4.1           stringi_1.8.4              
#>  [97] UCSC.utils_1.3.0            lazyeval_0.2.2             
#>  [99] ggfun_0.1.8                 yaml_2.3.10                
#> [101] evaluate_1.0.1              codetools_0.2-20           
#> [103] tibble_3.2.1                ggplotify_0.1.2            
#> [105] cli_3.6.3                   xtable_1.8-4               
#> [107] munsell_0.5.1               jquerylib_0.1.4            
#> [109] Rcpp_1.0.13-1               png_0.1-8                  
#> [111] XML_3.99-0.17               parallel_4.4.2             
#> [113] readr_2.1.5                 blob_1.2.4                 
#> [115] jpeg_0.1-10                 bitops_1.0-9               
#> [117] pwalign_1.3.1               tidytree_0.4.6             
#> [119] motifStack_1.51.0           scales_1.3.0               
#> [121] purrr_1.0.2                 crayon_1.5.3               
#> [123] rlang_1.1.4                 KEGGREST_1.47.0

References

Bailey, T. L., M. Boden, F. A. Buske, M. Frith, C. E. Grant, L. Clementi, J. Ren, W. W. Li, and W. S. Noble. 2009. “MEME SUITE: Tools for Motif Discovery and Searching.” Nucleic Acids Research 37: W202–8.
Guo, Y., K. Tian, H. Zeng, X. Guo, and D. K. Gifford. 2018. “A Novel k-Mer Set Memory (KSM) Motif Representation Improves Regulatory Variant Prediction.” Genome Research 28: 891–900.
Heinz, S., C. Benner, N. Spann, E. Bertolino, Y. C. Lin, P. Laslo, J. X. Cheng, C. Murre, H. Singh, and C. K. Glass. 2010. “Simple Combinations of Lineage-Determining Transcription Factors Prime Cis-Regulatory Elements Required for Macrophage and b Cell Identities.” Molecular Cell 38: 576–89.
Hume, M. A., L. A. Barrera, S. S. Gisselbrecht, and M. L. Bulyk. 2015. “UniPROBE, Update 2015: New Tools and Content for the Online Database of Protein-Binding Microarray Data on Protein-DNA Interactions.” Nucleic Acids Research 43: D117–22.
Khan, A., O. Fornes, A. Stigliani, M. Gheorghe, J. A. Castro-Mondragon, R. van der Lee, A. Bessy, et al. 2018. “JASPAR 2018: Update of the Open-Access Database of Transcription Factor Binding Profiles and Its Web Framework.” Nucleic Acids Research 46: D260–66.
Mathelier, A., and W. W. Wasserman. 2013. “The Next Generation of Transcription Factor Binding Site Prediction.” PLoS Computational Biology 9: e1003214.
Siebert, M., and J. Soding. 2016. “Bayesian Markov Models Consistently Outperform PWMs at Predicting Motifs in Nucleotide Sequences.” Nucleic Acids Research 44: 6055–69.
Weirauch, M. T., A. Yang, M. Albu, A. G. Cote, A. Montenegro-Montero, P. Drewe, H. S. Najafabadi, et al. 2014. “Determination and Inference of Eukaryotic Transcription Factor Sequence Specificity.” Cell 158: 1431–43.
Wingender, E., P. Dietze, H. Karas, and R. Knuppel. 1996. “TRANSFAC: A Database on Transcription Factors and Their DNA Binding Sites.” Nucleic Acids Research 24: 238–41.