Motifs are a more practical representation of consensus elements in
biological sequences, allowing for a more detailed description of the
variability at each site (see G. D. Stormo
2000). There are a number of ways of representing motifs; four
types are available here. All of these can be stored using the same
universalmotif
class. Unfortunately, the naming conventions
for the various motif types are not unanimous. In order to provide a
simple and consistent interface to the various types, in this package
they will always be referred to as Position Count Matrix (PCM), Position
Probability Matrix (PPM), Position Weight Matrix (PWM), and Information
Content Matrix (ICM).
These four types are described within this vignette. Furthermore, the
actual code used to go between these types can be seen in
?'utils-motif'
. They are also reviewed briefly in the motif manipulation vignette.
Also known as position frequency matrices, these are typically the most basic representation of a motif. Simply, for each position the total counts of letters is shown. For example, the following sequences:
# | Sequence |
---|---|
1 | AAGAAT |
2 | ATCATA |
3 | AAGTAA |
4 | AACAAA |
5 | ATTAAA |
6 | AAGAAT |
… would be represented as:
Position | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
A | 6 | 4 | 0 | 5 | 5 | 4 |
C | 0 | 0 | 2 | 0 | 0 | 0 |
G | 0 | 0 | 3 | 0 | 0 | 0 |
T | 0 | 2 | 1 | 1 | 1 | 2 |
(Note that all positions add up to 6, the initial number of sequences.)
Also sometimes known as position frequency matrices or position weight matrices, these represent the probabilities for each letter at each position. Using the previous motif (table @ref(tab:pcm)) as an example, the following formula would be used to calculate the probability (P) of each letter N at an individual position from counts (C):
The equivalent vectorized R
code:
This leads to the following motif representation:
Position | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
A | 1.00 | 0.67 | 0.00 | 0.83 | 0.83 | 0.66 |
C | 0.00 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 |
G | 0.00 | 0.00 | 0.50 | 0.00 | 0.00 | 0.00 |
T | 0.00 | 0.33 | 0.17 | 0.17 | 0.17 | 0.33 |
(Note that all positions sum to 1.)
From this type of representation, the probability of any combination of letters can be calculated. For example, the probability for AAGAAA is about 15%. However, when starting from a small pool of sequences, many zeroes can appear in the PPM. This means that, for example, the probability of AAAAAA is currently zero. When scanning through large numbers of biological sequences, throwing away combinations of letters such as these can potentially be undesirable, as sometimes mismatches can occur in some transcription factor binding sites. This can be fixed by adding a pseudocount (p). Usually a small number such as 1, it is introduced into the PCM to PPM calculation as such:
The equivalent vectorized R
code:
In this equation, the pseudocount is added to the top and bottom of the fraction. However for the top fraction, which is specific to each letter, the pseudocount is divided by the total number of letters n (in the case of DNA, 4). This then generates the following motif:
Position | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
A | 0.892 | 0.610 | 0.036 | 0.750 | 0.750 | 0.610 |
C | 0.036 | 0.035 | 0.320 | 0.035 | 0.035 | 0.035 |
G | 0.036 | 0.035 | 0.464 | 0.035 | 0.035 | 0.035 |
T | 0.036 | 0.320 | 0.180 | 0.180 | 0.180 | 0.320 |
Now, though unlikely, it is no longer considered impossible for the sequence AAAAAA to exist as part of this motif. Since the total number of sequences in this case is quite low, the pseudocount can have a large impact on the values within the matrix. This impact decreases as the number of sequences increases. Of course, this can also be changed by using different pseudocounts.
The position weight matrix, also known as position-specific weight matrix, position-specific scoring matrix, and logodds scoring matrix, was first proposed by Gary D. Stormo et al. (1982). In this case for each position, every letter has a ‘score’ which can be used to evaluate how well a sequence matches a motif. Though there can be multiple ways of calculating these scores (S), the most common method is to calculate the log of each probability, correcting for background frequencies (B). This results in the following calculation:
The equivalent vectorized R
code:
Using this equation, the log of fractions where the probability of a certain letter in a sequence is higher than that of the background probability of that letter result in positive scores, and vice versa for negative scores. Using the table @ref(tab:ppm) motif and assuming a uniform background frequency (i.e. the probability of each of the four letters is 0.25), this results in the following PWM:
Position | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
A | 2 | 1.425 | -Inf | 1.737 | 1.737 | 1.415 |
C | -Inf | -Inf | 0.415 | -Inf | -Inf | -Inf |
G | -Inf | -Inf | 1.000 | -Inf | -Inf | -Inf |
T | -Inf | 0.415 | -0.585 | -0.585 | -0.595 | 0.415 |
(Note that the position totals no longer have equal sums.)
In order to score a sequence, add up the score for the letters at the specific positions. For example AAGAAA has a score of 9.31. However, similar to with PPMs, if starting from a small pool of sequences the sequence AAAAAA could never be recovered using this motif model, with a score of -Inf. This can be avoided simply by starting from a pseudocount-adjusted PPM. Using the table @ref(tab:ppmpseudo) motif, this becomes:
Position | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
A | 1.840 | 1.280 | -2.807 | 1.585 | 1.585 | 1.280 |
C | -2.807 | -2.807 | 0.363 | -2.807 | -2.807 | -2.807 |
G | -2.807 | -2.807 | 0.893 | -2.807 | -2.807 | -2.807 |
T | -2.807 | 0.363 | -0.485 | -0.485 | -0.485 | 0.363 |
Now, the score for AAGAAA is 8.46, and the score for AAAAAA is 4.76. Though the score for the latter is low, it is no longer -Inf as a result of one mismatch.
When searching for instances of this motif using this scoring system, one more consideration is needed: a minimum score. For example, both sequences AAGAAA and ATTTTT have positive scores; but one is much higher than the other. Should ATTTTT then be discarded? In order to answer this a threshold is set, typically this is a certain percent of the highest possible score. For example, in the table @ref(tab:pwmpseudo) motif, the highest possible score is 8.46. Using a threshold of 25%, the minimum score then becomes 2.115. This means that ATTTTT, with a score of 1.112, would indeed be discarded.
Another method for determining the minimum score include starting
from P-values. The package TFMPvalue
for example can
calculate minimum scores from P-values, using the algorithm described by
Touzet and Varre (2007). The
universalmotif
package also offers this capability with the
motif_pvalue()
function. See the
MotifComparisonAndPvalues
vignette for a detailed
overview.
Whichever method you use however, it is important to remember that there is likely never one single threshold that works in every situation. Lower threshold sequences can be bound in the case of transcription factors which bind DNA with less specificity or are present in higher concentrations, and the opposite can be true of higher-specificity or less concentrated transcription factors. See Djordjevic, Sengupta, and Shraiman (2003) and Ruan and Stormo (2017) for some background on the topic.
Finally, the information content matrix (Thomas D. Schneider et al. 1986; T. D. Schneider and Stephens 1990). This type aims to include another consideration: are some positions more important than others? To explore this, let us consider the table @ref(tab:ppm) motif. This matrix can be represented as a probability sequence logo as seen in figure @ref(fig:logo1).
In this case, the height of the letters represent their probabilities at each position. However, when represented as an ICM, the sequence logo then resembles as seen in figure @ref(fig:logo2).
Now, the total height of each position is scaled using the total information at that position. Simply put, the total information of each position is an indication of the level of conservation. In the example motif, the first position is highly conserved, always being the letter A, whereas the third position is less so conserved, as the probabilities for any one letter are quite lower.
For every position, the letter heights are calculated as the total possible information content (IC) or maximum uncertainty, then subtracting the actual uncertainty at each position (Thomas D. Schneider et al. 1986; T. D. Schneider and Stephens 1990). This calculation is based on Shannon’s entropy (Shannon 1948), with the final values representing ‘bits’ (T. D. Schneider 1991). The total IC is based on alphabet length (n), using the following equation:
The equivalent R
code:
For DNA motifs which have an alphabet length of 4, the total IC is 2. To calculate the actual uncertainty (U) per position, the following equation is used:
The equivalent vectorized R
code:
Where the position uncertainty is the sum of the uncertainty of all alphabet letters (A, C, G, and T). To calculate the final information content:
The equivalent vectorized R
code:
In the original implementation described by Thomas D. Schneider et al. (1986), an additional error correction factor (E) is included to account for sample size. This correction is rarely used however, but can be incorporated as such:
The details for calculating this factor will not be covered here.
(Refer to Thomas D. Schneider et al.
(1986).) The TFBSTools
package offers the ability to
incorporate this error correction (and can be used in the
universalmotif
package via
convert_type(..., nsize_correction=TRUE)
).
Finally, to get the height of each letter at each position, the final IC is multiplied by the letter and position probabilities:
The equivalent vectorized R
code:
Using the above equations, the table @ref(tab:ppm) motif then becomes:
Position | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
A | 2.000 | 0.721 | 0.000 | 1.125 | 1.125 | 0.721 |
C | 0.000 | 0.000 | 0.180 | 0.000 | 0.000 | 0.000 |
G | 0.000 | 0.000 | 0.270 | 0.000 | 0.000 | 0.000 |
T | 0.000 | 0.361 | 0.090 | 0.225 | 0.225 | 0.361 |
(Note that none of the positions have a sum larger than 2.)
An alternative to representing information content as Shannon’s entropy is relative entropy, or Kullback-Leibler divergence (Kullback and Leibler 1951). While IC as Shannon’s entropy has the advantage of having a consistent maximum IC for every position, it does not take into account non-uniform background frequencies. Relative entropy on the other hand will take this into account, but all positions no longer share the same maximum IC. To calculate relative entropy:
The equivalent vectorized R
code:
Using this equation can lead to IC less than zero. These values are not allowed, so they are simply replaced with zero. With this equation and assuming uniform background frequencies, the table @ref(tab:ppm) motif becomes:
Position | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
A | 1.640 | 0.777 | 0.000 | 1.190 | 1.190 | 0.777 |
C | 0.000 | 0.000 | 0.177 | 0.000 | 0.000 | 0.000 |
G | 0.000 | 0.000 | 0.415 | 0.000 | 0.000 | 0.000 |
T | 0.000 | 0.177 | 0.090 | 0.000 | 0.000 | 0.117 |
This motif would look significantly different with non-uniform
background frequencies. For example, starting from the following
background frequencies:
c(A = 0.4, C = 0.1, G = 0.1, T = 0.4)
, the motif
resembles:
Position | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
A | 1.030 | 0.366 | 0.000 | 0.680 | 0.680 | 0.366 |
C | 0.000 | 0.000 | 0.541 | 0.000 | 0.000 | 0.000 |
G | 0.000 | 0.000 | 1.028 | 0.000 | 0.000 | 0.000 |
T | 0.000 | 0.000 | 0.090 | 0.000 | 0.000 | 0.000 |
Notice how the height of A in position 1 has decreased, and how the height of G in position 3 has increased. This makes sense when you consider that you are more likely to get As and less likely to get Gs to begin with.