| Title: | Import, Modify, and Export Motifs with R |
|---|---|
| Description: | A comprehensive toolkit for working with sequence motifs in R. Imports and exports most common motif formats (JASPAR, MEME, HOMER, TRANSFAC, CIS-BP, UNIPROBE) and interoperates with the other Bioconductor motif packages. Analysis functions cover de novo motif discovery, motif-vs-motif comparison and clustering, P-value calculation, sequence scanning, enrichment against shuffled or composition-matched backgrounds, positional bias testing, and pairwise motif co-occurrence. Also includes utilities for sequence shuffling, motif trimming, higher-order representations, ground-truth simulation by motif implantation, and logo-plotting functionality. |
| Authors: | Benjamin Jean-Marie Tremblay [aut, cre] (ORCID: <https://orcid.org/0000-0002-7441-2951>), Spencer Nystrom [ctb] (ORCID: <https://orcid.org/0000-0003-1000-1579>) |
| Maintainer: | Benjamin Jean-Marie Tremblay <[email protected]> |
| License: | GPL-3 |
| Version: | 1.31.32 |
| Built: | 2026-05-30 21:01:39 UTC |
| Source: | https://github.com/bioc/universalmotif |
If the original sequences are available for a particular motif, then they can be used to generate higher-order PPM matrices. See the "Motif import, export, and manipulation" vignette for more information.
add_multifreq(motif, sequences, add.k = 2:3, RC = FALSE, threshold = 0.001, threshold.type = "pvalue", motifs.perseq = 1, add.bkg = FALSE)add_multifreq(motif, sequences, add.k = 2:3, RC = FALSE, threshold = 0.001, threshold.type = "pvalue", motifs.perseq = 1, add.bkg = FALSE)
motif |
See |
sequences |
|
add.k |
|
RC |
|
threshold |
|
threshold.type |
|
motifs.perseq |
|
add.bkg |
|
See scan_sequences() for more info on scanning parameters.
At each position in the motif, then the probability of each k-let
covering from the initial position to ncol - 1 is calculated. Only
positions within the motif are considered: this means that the
final k-let probability matrix will have ncol - 1 fewer columns.
Calculating k-let probabilities for the missing columns would be
trivial however, as you would only need the background frequencies.
Since these would not be useful for scan_sequences()
though, they are not calculated.
Currently add_multifreq() does not try to stay faithful to the default
motif matrix when generating multifreq matrices. This means that if the
sequences used for training are completely different from the actual
motif, the multifreq matrices will be as well. However this is only really
a problem if you supply add_multifreq() with a set of sequences of the
same length as the motif. In this case add_multifreq() is forced to
create the multifreq matrices from these sequences. Otherwise
add_multifreq() will scan the input sequences for the motif and use the
best matches to construct the multifreq matrices.
This 'multifreq' representation is only really useful within the
universalmotif environment. Despite this, if you wish it can be
preserved in text using write_motifs().
The number of rows for each k-let matrix is n^k, with n being the
number of letters in the alphabet being used. This means that the size
of the k-let matrix can become quite large as k increases. For example,
if one were to wish to represent a DNA motif of length 10 as a 10-let,
this would require a matrix with 1,048,576 rows (though at this point
if what you want is to search for exact sequence matches,
the motif format itself is not very useful).
A universalmotif object with filled multifreq slot. The
bkg slot is also expanded with corresponding higher order probabilities
if add.bkg = TRUE.
Benjamin Jean-Marie Tremblay, [email protected]
scan_sequences(), convert_motifs(), write_motifs()
sequences <- create_sequences(seqlen = 10) motif <- create_motif() motif.trained <- add_multifreq(motif, sequences, add.k = 2:4) ## peek at the 2-let matrix: motif.trained["multifreq"]$`2`sequences <- create_sequences(seqlen = 10) motif <- create_motif() motif.trained <- add_multifreq(motif, sequences, add.k = 2:4) ## peek at the 2-let matrix: motif.trained["multifreq"]$`2`
universalmotif format.Arabidopsis motif trained from ArabidopsisPromoters using
MEME version 4. This motif was generated at the command line using
the following command: meme promoters.fa -revcomp -nmotifs 3 -mod anr -dna.
ArabidopsisMotifArabidopsisMotif
DNAStringSet.50 Arabidopsis promoters, each 1000 bases long. See the "Sequence manipulation and scanning" vignette for an example workflow describing extracting promoter sequences.
ArabidopsisPromotersArabidopsisPromoters
Compare motifs using one of the several available metrics. See the
"Motif comparisons and P-values" vignette for detailed information.
Note: for regular DNA/RNA motif comparison using PCC, see
compare_motifs2() for a faster and simpler alternative.
compare_motifs(motifs, compare.to, db.scores, use.freq = 1, use.type = "PPM", method = "PCC", tryRC = TRUE, min.overlap = 6, min.mean.ic = 0.25, min.position.ic = 0, relative_entropy = FALSE, normalise.scores = FALSE, max.p = 0.01, max.e = 10, nthreads = 1, score.strat = "a.mean", output.report, output.report.max.print = 10)compare_motifs(motifs, compare.to, db.scores, use.freq = 1, use.type = "PPM", method = "PCC", tryRC = TRUE, min.overlap = 6, min.mean.ic = 0.25, min.position.ic = 0, relative_entropy = FALSE, normalise.scores = FALSE, max.p = 0.01, max.e = 10, nthreads = 1, score.strat = "a.mean", output.report, output.report.max.print = 10)
motifs |
See |
compare.to |
|
db.scores |
|
use.freq |
|
use.type |
|
method |
|
tryRC |
|
min.overlap |
|
min.mean.ic |
|
min.position.ic |
|
relative_entropy |
|
normalise.scores |
|
max.p |
|
max.e |
|
nthreads |
|
score.strat |
|
output.report |
|
output.report.max.print |
|
The following metrics are available:
Euclidean distance (EUCL) (Choi et al. 2004)
Weighted Euclidean distance (WEUCL)
Kullback-Leibler divergence (KL) (Kullback and Leibler 1951; Roepcke et al. 2005)
Hellinger distance (HELL) (Hellinger 1909)
Squared Euclidean distance (SEUCL)
Manhattan distance (MAN)
Pearson correlation coefficient (PCC)
Weighted Pearson correlation coefficient (WPCC)
Sandelin-Wasserman similarity (SW), or sum of squared distances (Sandelin and Wasserman 2004)
Average log-likelihood ratio (ALLR) (Wang and Stormo 2003)
Lower limit ALLR (ALLR_LL) (Mahony et al. 2007)
Bhattacharyya coefficient (BHAT) (Bhattacharyya 1943)
Comparisons are calculated between two motifs at a time. All possible alignments
are scored, and the best score is reported. In an alignment scores are calculated
individually between columns. How those scores are combined to generate the final
alignment scores depends on score.strat.
See the "Motif comparisons and P-values" vignette for a description of the
various metrics. Note that PCC, WPCC, SW, ALLR, ALLR_LL and BHAT
are similarities;
higher values mean more similar motifs. For the remaining metrics, values closer
to zero represent more similar motifs.
Small pseudocounts are automatically added when one of the following methods
is used: KL, ALLR, ALLR_LL, IS. This is avoid
zeros in the calculations.
To note regarding p-values: P-values are pre-computed using the
make_DBscores() function. If not given, then uses a set of internal
precomputed P-values from the JASPAR2018 CORE motifs. These precalculated
scores are dependent on the length of the motifs being compared. This takes
into account that comparing small motifs with larger motifs leads to higher
scores, since the probability of finding a higher scoring alignment is
higher.
The default P-values have been precalculated for regular DNA motifs. They
are of little use for motifs with a different number of alphabet letters
(or even the multifreq slot).
matrix if compare.to is missing; DataFrame otherwise. For the
latter, function args are stored in the metadata slot.
Benjamin Jean-Marie Tremblay, [email protected]
Bhattacharyya A (1943). “On a measure of divergence between two statistical populations defined by their probability distributions.” Bulletin of the Calcutta Mathematical Society, 35, 99-109.
Choi I, Kwon J, Kim S (2004). “Local feature frequency profile: a method to measure structural similarity in proteins.” PNAS, 101, 3797-3802.
Hellinger E (1909). “Neue Begrundung der Theorie quadratischer Formen von unendlichvielen Veranderlichen.” Journal fur die reine und angewandte Mathematik, 136, 210-271.
Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, Bessy A, Cheneby J, Kulkarni SR, Tan G, Baranasic D, Arenillas DJ, Sandelin A, Vandepoele K, Lenhard B, Ballester B, Wasserman WW, Parcy F, Mathelier A (2018). “JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework.” Nucleic Acids Research, 46, D260-D266.
Kullback S, Leibler RA (1951). “On information and sufficiency.” The Annals of Mathematical Statistics, 22, 79-86.
Itakura F, Saito S (1968). “Analysis synthesis telephony based on the maximum likelihood method.” In 6th International Congress on Acoustics, C-17.
Mahony S, Auron PE, Benos PV (2007). “DNA Familial Binding Profiles Made Easy: Comparison of Various Motif Alignment and Clustering Strategies.” PLoS Computational Biology, 3.
Pietrokovski S (1996). “Searching databases of conserved sequence regions by aligning protein multiple-alignments.” Nucleic Acids Research, 24, 3836-3845.
Roepcke S, Grossmann S, Rahmann S, Vingron M (2005). “T-Reg Comparator: an analysis tool for the comparison of position weight matrices.” Nucleic Acids Research, 33, W438-W441.
Sandelin A, Wasserman WW (2004). “Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics.” Journal of Molecular Biology, 338, 207-215.
Wang T, Stormo GD (2003). “Combining phylogenetic data with co-regulated genes to identify motifs.” Bioinformatics, 19, 2369-2380.
convert_motifs(), motif_tree(), view_motifs(),
make_DBscores(), compare_motifs2(), view_motifs2()
motif1 <- create_motif(name = "1") motif2 <- create_motif(name = "2") motif1vs2 <- compare_motifs(c(motif1, motif2), method = "PCC") ## To get a dist object: as.dist(1 - motif1vs2) motif3 <- create_motif(name = "3") motif4 <- create_motif(name = "4") motifs <- c(motif1, motif2, motif3, motif4) ## Compare motif "2" to all the other motifs: if (R.Version()$arch != "i386") { compare_motifs(motifs, compare.to = 2, max.p = 1, max.e = Inf) } ## If you are working with a large list of motifs and the mean.min.ic ## option is not set to zero, you may get a number of failed comparisons ## due to low IC. To filter the list of motifs to avoid these, use ## the average_ic() function to remove motifs with low average IC: ## Not run: library(MotifDb) motifs <- convert_motifs(MotifDb)[1:100] compare_motifs(motifs) #> Warning in compare_motifs(motifs) : #> Some comparisons failed due to low IC motifs <- motifs[average_ic(motifs) > 0.5] compare_motifs(motifs) ## End(Not run)motif1 <- create_motif(name = "1") motif2 <- create_motif(name = "2") motif1vs2 <- compare_motifs(c(motif1, motif2), method = "PCC") ## To get a dist object: as.dist(1 - motif1vs2) motif3 <- create_motif(name = "3") motif4 <- create_motif(name = "4") motifs <- c(motif1, motif2, motif3, motif4) ## Compare motif "2" to all the other motifs: if (R.Version()$arch != "i386") { compare_motifs(motifs, compare.to = 2, max.p = 1, max.e = Inf) } ## If you are working with a large list of motifs and the mean.min.ic ## option is not set to zero, you may get a number of failed comparisons ## due to low IC. To filter the list of motifs to avoid these, use ## the average_ic() function to remove motifs with low average IC: ## Not run: library(MotifDb) motifs <- convert_motifs(MotifDb)[1:100] compare_motifs(motifs) #> Warning in compare_motifs(motifs) : #> Some comparisons failed due to low IC motifs <- motifs[average_ic(motifs) > 0.5] compare_motifs(motifs) ## End(Not run)
compare_motifs2() is a deliberately pared-down counterpart to
compare_motifs(), with defaults and an algorithm that mirror those of
the command-line tool yamtk. It uses the
per-column Pearson correlation as its similarity metric, computes
significance from a TOMTOM-style null PMF (either empirical, from the
target database, or parametric, via a Dirichlet-Multinomial over a K=5
simplex grid), applies both Bonferroni and Benjamini-Hochberg adjustment,
and returns either a long-format data.frame of the significant pairs or
a square score / p-value / q-value matrix.
compare_motifs2(motifs, compare.to = NULL, qvalue = 0.1, min.overlap = 5L, RC = TRUE, null = c("empirical", "parametric"), bkg = NULL, matrix.out = c("score", "pvalue", "qvalue"), nthreads = 1)compare_motifs2(motifs, compare.to = NULL, qvalue = 0.1, min.overlap = 5L, RC = TRUE, null = c("empirical", "parametric"), bkg = NULL, matrix.out = c("score", "pvalue", "qvalue"), nthreads = 1)
motifs |
See |
compare.to |
|
qvalue |
|
min.overlap |
|
RC |
|
null |
|
bkg |
|
matrix.out |
|
nthreads |
|
Use compare_motifs() when you need any of the broader features:
alternative similarity metrics (EUCL, ALLR, KL, …), score-aggregation
strategies, IC-based filters, multifreq scoring, ICM/PPM dispatch,
relative-entropy mode, or the JASPAR-derived db.scores lookup-table
p-value path.
P-values are computed in two modes mirroring yamtk cmp:
null = "empirical" (default)Per-query column, score against every target-database column, histogram into 51 PCC bins, convolve to overlap length L. Cost per query is O(N x T) where N is the number of target motifs and T their mean width. Best for databases with more than a few hundred total columns.
null = "parametric"Enumerate the 56 compositions of A/C/G/T on a K=5 simplex grid, weight each by a Dirichlet-Multinomial PMF under alpha = 4 * bkg, score against the query column. Per-query cost is constant in database size. Best for small databases or when histogram sparsity would distort the empirical estimate.
The reported score is the mean PCC over the overlap (always in
), not the raw sum.
Matrix mode (compare.to = NULL): a square numeric matrix
indexed by motif name, holding the value selected by matrix.out.
Long-format mode (compare.to provided): a data.frame with
columns
subject, subject.i, target, target.i, offset, strand,
overlap, score, subject.consensus, target.consensus,
pvalue, qvalue, filtered to qvalue <= qvalue and sorted by
q-value ascending. Coordinates are 1-based; offset is the query
column where target column 1 aligns.
Benjamin Jean-Marie Tremblay, [email protected]
Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS (2007). "Quantifying similarity between motifs." Genome Biology, 8(2), R24.
Tremblay BJM (2026). yamtk: Yet Another Motif ToolKit. https://github.com/bjmt/yamtk.
compare_motifs(), scan_sequences2(), view_motifs2()
library(universalmotif) set.seed(1) motifs <- lapply(1:6, function(i) create_motif(paste(sample(c("A","C","G","T"), 8, replace = TRUE), collapse = ""), name = paste0("M", i))) ## Matrix of mean PCC scores m <- compare_motifs2(motifs) round(m, 2) ## Long-format hits at q <= 0.5 with motif 1 as the query compare_motifs2(motifs, compare.to = 1, qvalue = 0.5)library(universalmotif) set.seed(1) motifs <- lapply(1:6, function(i) create_motif(paste(sample(c("A","C","G","T"), 8, replace = TRUE), collapse = ""), name = paste0("M", i))) ## Matrix of mean PCC scores m <- compare_motifs2(motifs) round(m, 2) ## Long-format hits at q <= 0.5 with motif 1 as the query compare_motifs2(motifs, compare.to = 1, qvalue = 0.5)
Allows for easy transfer of motif information between different classes as defined by other Bioconductor packages. This function is also used by nearly all other functions in this package, so any motifs of a compatible class can be used without needing to be converted beforehand.
convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'AsIs' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'list' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'universalmotif' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'MotifList' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'TFFMFirst' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'PFMatrix' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'PWMatrix' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'ICMatrix' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'XMatrixList' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'pwm' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'pcm' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'pfm' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'PWM' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'Motif' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'matrix' convert_motifs(motifs, class = "universalmotif-universalmotif")convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'AsIs' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'list' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'universalmotif' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'MotifList' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'TFFMFirst' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'PFMatrix' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'PWMatrix' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'ICMatrix' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'XMatrixList' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'pwm' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'pcm' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'pfm' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'PWM' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'Motif' convert_motifs(motifs, class = "universalmotif-universalmotif") ## S4 method for signature 'matrix' convert_motifs(motifs, class = "universalmotif-universalmotif")
motifs |
Single motif object or list. See details. |
class |
|
The following packge-class combinations can be used as input:
MotifDb-MotifList
TFBSTools-PFMatrix
TFBSTools-PWMatrix
TFBSTools-ICMatrix
TFBSTools-PFMatrixList
TFBSTools-PWMatrixList
TFBSTools-ICMatrixList
TFBSTools-TFFMFirst
seqLogo-pwm
motifStack-pcm
motifStack-pfm
PWMEnrich-PWM
motifRG-Motif
universalmotif-universalmotif
matrix
The following package-class combinations can be output:
MotifDb-MotifList
TFBSTools-PFMatrix
TFBSTools-PWMatrix
TFBSTools-ICMatrix
TFBSTools-TFFMFirst
seqLogo-pwm
motifStack-pcm
motifStack-pfm
PWMEnrich-PWM
Biostrings-PWM (type = 'log2prob')
rGADEM-motif
universalmotif-universalmotif (the default, no need to specify this)
Note: MotifDb-MotifList output was a later addition to convert_motifs().
As a result, to stay consistent with previous behaviour most functions
will always convert MotifDb-MotifList objects to a list of universalmotif
motifs, even if other formats would be simply returned as is (e.g. for
other formats, filter_motifs() will return the input format; for
MotifDb-MotifList, a list of universalmotif objects will be returned).
Single motif object or list.
convert_motifs(AsIs): Generate an error to remind users to run
to_list() instead of using the column from to_df() directly.
convert_motifs(list): Convert a list of motifs.
convert_motifs(universalmotif): Convert a universalmotif object.
convert_motifs(MotifList): Convert MotifList motifs. (MotifDb)
convert_motifs(TFFMFirst): Convert TFFMFirst motifs. (TFBSTools)
convert_motifs(PFMatrix): Convert PFMatrix motifs. (TFBSTools)
convert_motifs(PWMatrix): Convert PWMatrix motifs. (TFBSTools)
convert_motifs(ICMatrix): Convert ICMatrix motifs. (TFBSTools)
convert_motifs(XMatrixList): Convert XMatrixList motifs. (TFBSTools)
convert_motifs(pwm): Convert pwm motifs. (seqLogo)
convert_motifs(pcm): Convert pcm motifs. (motifStack)
convert_motifs(pfm): Convert pfm motifs. (motifStack)
convert_motifs(PWM): Convert PWM motifs. (PWMEnrich)
convert_motifs(Motif): Convert Motif motifs. (motifRG)
convert_motifs(matrix): Create motif from matrices.
Benjamin Jean-Marie Tremblay, [email protected]
Bembom O (2018). seqLogo: Sequence logos for DNA sequence alignments. R package version 1.46.0.
Droit A, Gottardo R, Robertson G, Li L (2014). rGADEM: de novo motif discovery. R package version 2.28.0.
Mercier E, Gottardo R (2014). MotIV: Motif Identification and Validation. R package version 1.36.0.
Ou J, Wolfe SA, Brodsky MH, Zhu LJ (2018). “motifStack for the analysis of transcription factor binding site evolution.” Nature Methods, 15, 8-9. doi: 10.1038/nmeth.4555.
Shannon P, Richards M (2018). MotifDb: An Annotated Collection of Protein-DNA Binding Sequence Motifs. R package version 1.22.0.
Stojnic R, Diez D (2015). PWMEnrich: PWM enrichment analysis. R package version 4.16.0.
Tan G, Lenhard B (2016). “TFBSTools: an R/Bioconductor package for transcription factor binding site analysis.” Bioinformatics, 32, 1555-1556. doi: 10.1093/bioinformatics/btw024.
Yao Z (2012). motifRG: A package for discriminative motif discovery, designed for high throughput sequencing dataset. R package version 1.24.0.
# Convert from universalmotif: jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) if (requireNamespace("motifStack", quietly = TRUE)) { jaspar.motifstack.pfm <- convert_motifs(jaspar, "motifStack-pfm") } # Convert from another class to universalmotif: if (requireNamespace("TFBSTools", quietly = TRUE)) { library(TFBSTools) data(MA0003.2) motif <- convert_motifs(MA0003.2) # Convert from another class to another class if (requireNamespace("PWMEnrich", quietly = TRUE)) { motif <- convert_motifs(MA0003.2, "PWMEnrich-PWM") } # The 'convert_motifs' function is embedded in the rest of the universalmotif # functions: non-universalmotif class motifs can be used MA0003.2.trimmed <- trim_motifs(MA0003.2) # Note: if the motif object going in has information that the # 'universalmotif' class can't hold, it will be lost }# Convert from universalmotif: jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) if (requireNamespace("motifStack", quietly = TRUE)) { jaspar.motifstack.pfm <- convert_motifs(jaspar, "motifStack-pfm") } # Convert from another class to universalmotif: if (requireNamespace("TFBSTools", quietly = TRUE)) { library(TFBSTools) data(MA0003.2) motif <- convert_motifs(MA0003.2) # Convert from another class to another class if (requireNamespace("PWMEnrich", quietly = TRUE)) { motif <- convert_motifs(MA0003.2, "PWMEnrich-PWM") } # The 'convert_motifs' function is embedded in the rest of the universalmotif # functions: non-universalmotif class motifs can be used MA0003.2.trimmed <- trim_motifs(MA0003.2) # Note: if the motif object going in has information that the # 'universalmotif' class can't hold, it will be lost }
Switch between position count matrix (PCM), position probability matrix
(PPM), position weight matrix (PWM), information count matrix (ICM), and
contribution weight matrix (CWM) types.
See the "Introduction to sequence motifs" vignette for details. Please
also note that type conversion occurs implicitly throughout the
universalmotif package, so there is generally no need to perform this
manual conversion. Also please be aware that the message concerning
pseudocount-adjusting motifs can be disabled via
options(pseudocount.warning=FALSE).
convert_type(motifs, type, pseudocount, nsize_correction = FALSE, relative_entropy = FALSE)convert_type(motifs, type, pseudocount, nsize_correction = FALSE, relative_entropy = FALSE)
motifs |
See |
type |
|
pseudocount |
|
nsize_correction |
|
relative_entropy |
|
Position count matrix (PCM), also known as position frequency matrix (PFM). For n sequences from which the motif was built, each position is represented by the numbers of each letter at that position. In theory all positions should have sums equal to n, but not all databases are this consistent. If converting from another type to PCM, column sums will be equal to the 'nsites' slot. If empty, 100 is used.
Position probability matrix (PPM), also known as position frequency
matrix (PFM). At each position, the probability of individual letters
is calculated by dividing the count for that letter by the total sum of
counts at that position (letter_count / position_total).
As a result, each position will sum to 1. Letters with counts of 0 will
thus have a probability of 0, which can be undesirable when searching for
motifs in a set of sequences. To avoid this a pseudocount can be added
((letter_count + pseudocount) / (position_total + pseudocount)).
Position weight matrix (PWM; Stormo et al. (1982)),
also known as position-specific weight
matrix (PSWM), position-specific scoring matrix (PSSM), or
log-odds matrix. At each position, each letter is represented by it's
log-likelihood (log2(letter_probability / background_probility)),
which is normalized using the background letter frequencies. A PWM matrix
is constructed from a PPM. If any position has 0-probability letters to
which pseudocounts were not added, then the final log-likelihood of these
letters will be -Inf.
Information content matrix (ICM; Schneider and Stephens 1990).
An ICM is a PPM where each letter probability is multiplied by the total
information content at that position. The information content of each
position is determined as: totalIC - Hi, where the total information
totalIC
totalIC <- log2(alphabet_length), and the Shannon entropy
(Shannon 1948) for a specific
position (Hi)
Hi <- -sum(sapply(alphabet_frequencies, function(x) x * log(2)).
As a result, the total sum or height of each position is representative of
it's sequence conservation, measured in the unit 'bits', which is a unit
of energy (Schneider 1991; see
https://fr-s-schneider.ncifcrf.gov/logorecommendations.html
for more information). However not all programs will calculate
information content the same. Some will 'correct' the total information
content at each position using a correction factor as described by
Schneider et al. (1986). This correction can
applied by setting nsize_correction = TRUE, however it will only
be applied if the 'nsites' slot is not empty. This is done using
TFBSTools:::schneider_correction
(Tan and Lenhard 2016). As such, converting from an ICM to
which some form of correction has been applied will result in a
PCM/PPM/PWM with slight inaccuracies.
Another method of calculating information content is calculating the
relative entropy, also known as Kullback-Leibler divergence
(Kullback and Leibler 1951). This accounts for background
frequencies, which
can be useful for genomes with a heavy imbalance in letter frequencies.
For each position, the individual letter frequencies are calculated as
letter_freq * log2(letter_freq / bkg_freq). When calculating
information content using Shannon entropy, the maximum content for
each position will always be log2(alphabet_length). This does
not hold for information content calculated as relative entropy.
Please note that conversion from ICM assumes the information content
was not calculated as relative entropy.
Contribution weight matrix (CWM), as produced by attribution
methods such as TF-MoDISco (Shrikumar et al. 2018) and
DeepLIFT (Shrikumar, Greenside, and Kundaje 2017). Each cell
holds a signed real-valued contribution score: how much the
given letter at the given position drives a model's prediction.
Unlike the four probabilistic types above, CWMs are not
normalised (columns do not sum to 1 or to nsites), values
may be negative (the model penalises that letter at that
position), and there is no canonical maximum height. CWM is
therefore only ever a source type for convert_type():
convert_type(cwm, "PPM") column-normalises absolute values
via the TF-MoDISco-lite convention
ppm[i,j] = |cwm[i,j]| / sum_i |cwm[i,j]| (columns whose
absolute column sum is zero collapse to a uniform distribution).
convert_type(cwm, "PWM" | "ICM" | "PCM") routes through
CWM -> PPM first and then applies the standard
PPM -> * step, so CWM -> PWM produces a real log-odds
matrix (log2(ppm / bkg)), not a re-labelled CWM. The reverse
direction (PPM, PWM, ICM, or PCM -> CWM) errors with
a clear message: contribution scores cannot be inferred from
a probability distribution, so a CWM must originate from an
attribution method (read via read_meme() with CWM = TRUE,
or built with create_motif(matrix, type = "CWM")).
See convert_motifs() for possible output motif objects.
Benjamin Jean-Marie Tremblay, [email protected]
Kullback S, Leibler RA (1951). “On information and sufficiency.” The Annals of Mathematical Statistics, 22, 79-86.
Nishida K, Frith MC, Nakai K (2009). “Pseudocounts for transcription factor binding sites.” Nucleic Acids Research, 37, 939-944.
Schneider TD, Stormo GD, Gold L, Ehrenfeucht A (1986). “Information content of binding sites on nucleotide sequences.” Journal of Molecular Biology, 188, 415-431.
Schneider TD, Stephens RM (1990). “Sequence Logos: A New Way to Display Consensus Sequences.” Nucleic Acids Research, 18, 6097-6100.
Schneider TD (1991). “Theory of Molecular Machines. II. Energy Dissipation from Molecular Machines.” Journal of Theoretical Biology, 148, 125-137.
Shannon CE (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal, 27, 379-423.
Shrikumar A, Greenside P, Kundaje A (2017). “Learning important features through propagating activation differences.” Proceedings of the 34th International Conference on Machine Learning, 70, 3145-3153.
Shrikumar A, Tian K, Avsec Ž, Shcherbina A, Banerjee A, Sharmin M, Nair S, Kundaje A (2018). “TF-MoDISco v0.4.4.2-alpha: Technical Note.” arXiv:1811.00416.
Stormo GD, Schneider TD, Gold L, Ehrenfeucht A (1982). “Use of the Perceptron algorithm to distinguish translational initiation sites in E. coli.” Nucleic Acids Research, 10, 2997-3011.
Tan G, Lenhard B (2016). “TFBSTools: an R/Bioconductor package for transcription factor binding site analysis.” Bioinformatics, 32, 1555-1556. doi: 10.1093/bioinformatics/btw024.
jaspar.pcm <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) ## The motifs pseudocounts are 1: these will be used in the PCM->PPM ## calculation jaspar.pwm <- convert_type(jaspar.pcm, type = "PPM") ## Setting pseudocount to 0 will prevent any correction from being ## applied to PPM/PWM matrices, overriding the motifs own pseudocounts jaspar.pwm <- convert_type(jaspar.pcm, type = "PWM", pseudocount = 0) ## CWM -> PPM via |cwm| / sum (the TF-MoDISco convention): cwm.mat <- matrix(c( 0.10, -0.20, 0.80, -0.10, -0.30, 0.90, -0.20, 0.10, 0.70, -0.10, -0.10, -0.40, -0.20, -0.10, -0.10, 0.90), nrow = 4, byrow = FALSE, dimnames = list(c("A","C","G","T"), NULL)) cwm <- create_motif(cwm.mat, type = "CWM", name = "modisco_example") convert_type(cwm, "PPM")jaspar.pcm <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) ## The motifs pseudocounts are 1: these will be used in the PCM->PPM ## calculation jaspar.pwm <- convert_type(jaspar.pcm, type = "PPM") ## Setting pseudocount to 0 will prevent any correction from being ## applied to PPM/PWM matrices, overriding the motifs own pseudocounts jaspar.pwm <- convert_type(jaspar.pcm, type = "PWM", pseudocount = 0) ## CWM -> PPM via |cwm| / sum (the TF-MoDISco convention): cwm.mat <- matrix(c( 0.10, -0.20, 0.80, -0.10, -0.30, 0.90, -0.20, 0.10, 0.70, -0.10, -0.10, -0.40, -0.20, -0.10, -0.10, 0.90), nrow = 4, byrow = FALSE, dimnames = list(c("A","C","G","T"), NULL)) cwm <- create_motif(cwm.mat, type = "CWM", name = "modisco_example") convert_type(cwm, "PPM")
Create a motif from a set of sequences, a matrix, or generate a random motif. See the "Motif import, export and manipulation" vignette for details.
create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'missing' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'numeric' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'character' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'matrix' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'DNAStringSet' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'RNAStringSet' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'AAStringSet' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'BStringSet' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq)create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'missing' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'numeric' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'character' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'matrix' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'DNAStringSet' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'RNAStringSet' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'AAStringSet' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq) ## S4 method for signature 'BStringSet' create_motif(input, alphabet, type = "PPM", name = "motif", pseudocount = 0, bkg, nsites, altname, family, organism, bkgsites, strand, pval, qval, eval, extrainfo, add.multifreq)
input |
|
alphabet |
|
type |
|
name |
|
pseudocount |
|
bkg |
|
nsites |
|
altname |
|
family |
|
organism |
|
bkgsites |
|
strand |
|
pval |
|
qval |
|
eval |
|
extrainfo |
|
add.multifreq |
|
The aim of this function is provide an easy interface to creating
universalmotif motifs, as an alternative to the
default class constructor (i.e. new('universalmotif', name=...)).
See examples for potential use cases.
Note: when generating random motifs, the nsites slot is also given a
random value.
Note: type = "CWM" is only supported when input is a matrix.
CWMs originate from contribution-attribution methods (TF-MoDISco,
DeepLIFT) and cannot be derived from sequences, consensus strings,
or random generation. The input matrix is stored as-is (signed
real values, no column-sum constraint); no auto-normalisation
is applied. See convert_type() for the full CWM semantics.
See the examples section for more info on motif creation.
universalmotif object.
create_motif(missing): Create a random motif of length 10.
create_motif(numeric): Create a random motif with a specified length.
create_motif(character): Create motif from a consensus string.
create_motif(matrix): Create motif from a matrix.
create_motif(DNAStringSet): Create motif from a DNAStringSet.
create_motif(RNAStringSet): Create motif from a RNAStringSet.
create_motif(AAStringSet): Create motif from a AAStringSet.
create_motif(BStringSet): Create motif from a BStringSet.
Benjamin Jean-Marie Tremblay, [email protected]
convert_type(), add_multifreq(), create_sequences(),
shuffle_motifs().
##### create motifs from a single string # Motif is by default generated as a PPM: change final type as desired DNA.motif <- create_motif("TATAWAW") DNA.motif <- create_motif("TATAWAW", type = "PCM") # Nsites will be set to the number of input sequences unless specified or # a single string is used as input DNA.motif <- create_motif("TTTTTTT", nsites = 10) # Ambiguity letters can be used: DNA.motif <- create_motif("TATAWAW") DNA.motif <- create_motif("NNVVWWAAWWDDN") # Be careful about setting nsites when using ambiguity letters! DNA.motif <- create_motif("NNVVWWAAWWDDN", nsites = 1) RNA.motif <- create_motif("UUUCCG") # 'create_motif' will try to detect the alphabet type; this can be # unreliable for AA and custom alphabets as DNA and RNA alphabets are # detected first AA.motif <- create_motif("AVLK", alphabet = "AA") custom.motif <- create_motif("QWER", alphabet = "QWER") # Specify custom alphabet custom.motif <- create_motif("QWER", alphabet = "QWERASDF") ###### Create motifs from multiple strings of equal length DNA.motif <- create_motif(c("TTTT", "AAAA", "AACC", "TTGG"), type = "PPM") DNA.motif <- create_motif(c("TTTT", "AAAA", "AACC", "TTGG"), nsites = 20) RNA.motif <- create_motif(c("UUUU", "AAAA", "AACC", "UUGG"), type = "PWM") AA.motif <- create_motif(c("ARNDCQ", "EGHILK", "ARNDCQ"), alphabet = "AA") custom.motif <- create_motif(c("POIU", "LKJH", "POIU", "CVBN"), alphabet = "POIULKJHCVBN") # Ambiguity letters are only allowed for single consensus strings: the # following fails ## Not run: create_motif(c("WWTT", "CCGG")) create_motif(c("XXXX", "XXXX"), alphabet = "AA") ## End(Not run) ##### Create motifs from XStringSet objects library(Biostrings) DNA.set <- DNAStringSet(c("TTTT", "AAAA", "AACC", "TTGG")) DNA.motif <- create_motif(DNA.set) RNA.set <- RNAStringSet(c("UUUU", "AACC", "UUCC")) RNA.motif <- create_motif(RNA.set) AA.set <- AAStringSet(c("VVVLLL", "AAAIII")) AA.motif <- create_motif(AA.set) # Custom motifs can be created from BStringSet objects B.set <- BStringSet(c("QWER", "ASDF", "ZXCV", "TYUI")) custom.motif <- create_motif(B.set) ##### Create motifs with filled 'multifreq' slot DNA.motif.k2 <- create_motif(DNA.set, add.multifreq = 2) ##### Create motifs from matrices mat <- matrix(c(1, 1, 1, 1, 2, 0, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0), nrow = 4, byrow = TRUE) DNA.motif <- create_motif(mat, alphabet = "DNA") RNA.motif <- create_motif(mat, alphabet = "RNA", nsites = 20) custom.motif <- create_motif(mat, alphabet = "QWER") # Specify custom alphabet custom.motif <- create_motif(mat, alphabet = "QWER") # Alphabet can be detected from rownames rownames(mat) <- DNA_BASES DNA.motif <- create_motif(mat) rownames(mat) <- c("Q", "W", "E", "R") custom.motif <- create_motif(mat) # Matrices can also be used as input mat.ppm <- matrix(c(0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 0.3, 0.3), nrow = 4, byrow = TRUE) DNA.motif <- create_motif(mat.ppm, alphabet = "DNA", type = "PPM") # Contribution weight matrices (signed, no column-sum constraint). # Typically these come from TF-MoDISco / DeepLIFT pipelines and are # imported via `read_meme(..., CWM = TRUE)`, but can be built from # a matrix directly: cwm.mat <- matrix(c( 0.10, -0.20, 0.80, -0.10, -0.30, 0.90, -0.20, 0.10, 0.70, -0.10, -0.10, -0.40, -0.20, -0.10, -0.10, 0.90), nrow = 4, byrow = FALSE, dimnames = list(c("A","C","G","T"), NULL)) CWM.motif <- create_motif(cwm.mat, type = "CWM", name = "modisco_example") ##### Create random motifs # These are generated as PPMs with 10 positions DNA.motif <- create_motif() RNA.motif <- create_motif(alphabet = "RNA") AA.motif <- create_motif(alphabet = "AA") custom.motif <- create_motif(alphabet = "QWER") # The number of positions can be specified DNA.motif <- create_motif(5) # If the background frequencies are not provided, they are generated # using `rpois`; positions are created using `rdirichlet(1, bkg)`. # (calling `create_motif()` creates motifs with an average # positional IC of 1) DNA.motif <- create_motif(bkg = c(0.3, 0.2, 0.2, 0.3)) DNA.motif <- create_motif(10, bkg = c(0.1, 0.4, 0.4, 0.1))##### create motifs from a single string # Motif is by default generated as a PPM: change final type as desired DNA.motif <- create_motif("TATAWAW") DNA.motif <- create_motif("TATAWAW", type = "PCM") # Nsites will be set to the number of input sequences unless specified or # a single string is used as input DNA.motif <- create_motif("TTTTTTT", nsites = 10) # Ambiguity letters can be used: DNA.motif <- create_motif("TATAWAW") DNA.motif <- create_motif("NNVVWWAAWWDDN") # Be careful about setting nsites when using ambiguity letters! DNA.motif <- create_motif("NNVVWWAAWWDDN", nsites = 1) RNA.motif <- create_motif("UUUCCG") # 'create_motif' will try to detect the alphabet type; this can be # unreliable for AA and custom alphabets as DNA and RNA alphabets are # detected first AA.motif <- create_motif("AVLK", alphabet = "AA") custom.motif <- create_motif("QWER", alphabet = "QWER") # Specify custom alphabet custom.motif <- create_motif("QWER", alphabet = "QWERASDF") ###### Create motifs from multiple strings of equal length DNA.motif <- create_motif(c("TTTT", "AAAA", "AACC", "TTGG"), type = "PPM") DNA.motif <- create_motif(c("TTTT", "AAAA", "AACC", "TTGG"), nsites = 20) RNA.motif <- create_motif(c("UUUU", "AAAA", "AACC", "UUGG"), type = "PWM") AA.motif <- create_motif(c("ARNDCQ", "EGHILK", "ARNDCQ"), alphabet = "AA") custom.motif <- create_motif(c("POIU", "LKJH", "POIU", "CVBN"), alphabet = "POIULKJHCVBN") # Ambiguity letters are only allowed for single consensus strings: the # following fails ## Not run: create_motif(c("WWTT", "CCGG")) create_motif(c("XXXX", "XXXX"), alphabet = "AA") ## End(Not run) ##### Create motifs from XStringSet objects library(Biostrings) DNA.set <- DNAStringSet(c("TTTT", "AAAA", "AACC", "TTGG")) DNA.motif <- create_motif(DNA.set) RNA.set <- RNAStringSet(c("UUUU", "AACC", "UUCC")) RNA.motif <- create_motif(RNA.set) AA.set <- AAStringSet(c("VVVLLL", "AAAIII")) AA.motif <- create_motif(AA.set) # Custom motifs can be created from BStringSet objects B.set <- BStringSet(c("QWER", "ASDF", "ZXCV", "TYUI")) custom.motif <- create_motif(B.set) ##### Create motifs with filled 'multifreq' slot DNA.motif.k2 <- create_motif(DNA.set, add.multifreq = 2) ##### Create motifs from matrices mat <- matrix(c(1, 1, 1, 1, 2, 0, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0), nrow = 4, byrow = TRUE) DNA.motif <- create_motif(mat, alphabet = "DNA") RNA.motif <- create_motif(mat, alphabet = "RNA", nsites = 20) custom.motif <- create_motif(mat, alphabet = "QWER") # Specify custom alphabet custom.motif <- create_motif(mat, alphabet = "QWER") # Alphabet can be detected from rownames rownames(mat) <- DNA_BASES DNA.motif <- create_motif(mat) rownames(mat) <- c("Q", "W", "E", "R") custom.motif <- create_motif(mat) # Matrices can also be used as input mat.ppm <- matrix(c(0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 0.3, 0.3), nrow = 4, byrow = TRUE) DNA.motif <- create_motif(mat.ppm, alphabet = "DNA", type = "PPM") # Contribution weight matrices (signed, no column-sum constraint). # Typically these come from TF-MoDISco / DeepLIFT pipelines and are # imported via `read_meme(..., CWM = TRUE)`, but can be built from # a matrix directly: cwm.mat <- matrix(c( 0.10, -0.20, 0.80, -0.10, -0.30, 0.90, -0.20, 0.10, 0.70, -0.10, -0.10, -0.40, -0.20, -0.10, -0.10, 0.90), nrow = 4, byrow = FALSE, dimnames = list(c("A","C","G","T"), NULL)) CWM.motif <- create_motif(cwm.mat, type = "CWM", name = "modisco_example") ##### Create random motifs # These are generated as PPMs with 10 positions DNA.motif <- create_motif() RNA.motif <- create_motif(alphabet = "RNA") AA.motif <- create_motif(alphabet = "AA") custom.motif <- create_motif(alphabet = "QWER") # The number of positions can be specified DNA.motif <- create_motif(5) # If the background frequencies are not provided, they are generated # using `rpois`; positions are created using `rdirichlet(1, bkg)`. # (calling `create_motif()` creates motifs with an average # positional IC of 1) DNA.motif <- create_motif(bkg = c(0.3, 0.2, 0.2, 0.3)) DNA.motif <- create_motif(10, bkg = c(0.1, 0.4, 0.4, 0.1))
Generate random sequences from any set of characters, represented as
XStringSet objects.
create_sequences(alphabet = "DNA", seqnum = 100, seqlen = 100, freqs, nthreads = 1, rng.seed = sample.int(10000, 1))create_sequences(alphabet = "DNA", seqnum = 100, seqlen = 100, freqs, nthreads = 1, rng.seed = sample.int(10000, 1))
alphabet |
|
seqnum |
|
seqlen |
|
freqs |
|
nthreads |
|
rng.seed |
|
XStringSet The returned sequences are unnamed.
Benjamin Jean-Marie Tremblay, [email protected]
create_motif(), shuffle_sequences()
## Create DNA sequences with slightly increased AT content: sequences <- create_sequences(freqs = c(A=0.3, C=0.2, G=0.2, T=0.3)) ## Create custom sequences: sequences.QWER <- create_sequences("QWER") ## You can include non-alphabet characters are well, even spaces: sequences.custom <- create_sequences("!@#$ ")## Create DNA sequences with slightly increased AT content: sequences <- create_sequences(freqs = c(A=0.3, C=0.2, G=0.2, T=0.3)) ## Create custom sequences: sequences.QWER <- create_sequences("QWER") ## You can include non-alphabet characters are well, even spaces: sequences.custom <- create_sequences("!@#$ ")
Given a set of motif hits, typically the output of
scan_sequences() or scan_sequences2(), or a GRanges /
data.frame from any other source, drop overlapping hits within
each (sequence, motif, strand) group using a greedy
algorithm: cluster overlapping hits, then within each cluster keep
the hit with the best priority and, after that, any other hit that
does not overlap an already-kept hit. This is the same algorithm
implemented by yamtk.
dedup_hits(hits, by = "pvalue", reverse = FALSE, ignore.strand = FALSE, ignore.motif = FALSE, seq = "sequence", motif = "motif", strand = "strand")dedup_hits(hits, by = "pvalue", reverse = FALSE, ignore.strand = FALSE, ignore.motif = FALSE, seq = "sequence", motif = "motif", strand = "strand")
hits |
|
by |
|
reverse |
|
ignore.strand |
|
ignore.motif |
|
seq |
|
motif |
|
strand |
|
Note that this differs from scan_sequences()'s built-in
no.overlaps argument, which collapses every connected-overlap
cluster to a single survivor (hierarchical clustering with
cutree). The greedy approach used here keeps legitimately distinct
hits even when an intermediate noisy hit bridges them, which is
usually what users want.
Coordinate convention: rows must have start <= end. Overlaps are
inclusive: two hits overlap if a$start <= b$end & b$start <= a$end.
The input hits, subsetted to the kept rows in the original
order.
Benjamin Jean-Marie Tremblay, [email protected]
Tremblay BJM (2026). yamtk: Yet Another Motif ToolKit. https://github.com/bjmt/yamtk.
scan_sequences2(), scan_sequences()
library(universalmotif) motifs <- list(create_motif("TATAAA", name = "M1"), create_motif("CACGTG", name = "M2")) seqs <- create_sequences(seqnum = 5, seqlen = 400, rng.seed = 1) hits <- scan_sequences2(motifs, seqs, pvalue = 5e-2, return.granges = FALSE) deduped <- dedup_hits(hits) nrow(hits) nrow(deduped)library(universalmotif) motifs <- list(create_motif("TATAAA", name = "M1"), create_motif("CACGTG", name = "M2")) seqs <- create_sequences(seqnum = 5, seqlen = 400, rng.seed = 1) hits <- scan_sequences2(motifs, seqs, pvalue = 5e-2, return.granges = FALSE) deduped <- dedup_hits(hits) nrow(hits) nrow(deduped)
Given a set of target and background sequences, test if the input motifs
are significantly enriched in the targets sequences relative to the
background sequences. See the "Sequence manipulation and scanning" vignette.
Note: for regular DNA/RNA motif enrichment, see enrich_motifs2() for a
faster and simpler alternative.
enrich_motifs(motifs, sequences, bkg.sequences, max.p = 0.001, max.q = 0.001, max.e = 0.001, qval.method = "fdr", threshold = 1e-04, threshold.type = "pvalue", verbose = 0, RC = TRUE, use.freq = 1, shuffle.k = 2, shuffle.method = "euler", return.scan.results = FALSE, nthreads = 1, rng.seed = sample.int(10000, 1), motif_pvalue.k = 8, use.gaps = TRUE, allow.nonfinite = FALSE, warn.NA = TRUE, no.overlaps = TRUE, no.overlaps.by.strand = FALSE, no.overlaps.strat = "score", respect.strand = FALSE, motif_pvalue.method = c("dynamic", "exhaustive"), scan_sequences.qvals.method = c("BH", "fdr", "bonferroni"), mode = c("total.hits", "seq.hits"), pseudocount = 1)enrich_motifs(motifs, sequences, bkg.sequences, max.p = 0.001, max.q = 0.001, max.e = 0.001, qval.method = "fdr", threshold = 1e-04, threshold.type = "pvalue", verbose = 0, RC = TRUE, use.freq = 1, shuffle.k = 2, shuffle.method = "euler", return.scan.results = FALSE, nthreads = 1, rng.seed = sample.int(10000, 1), motif_pvalue.k = 8, use.gaps = TRUE, allow.nonfinite = FALSE, warn.NA = TRUE, no.overlaps = TRUE, no.overlaps.by.strand = FALSE, no.overlaps.strat = "score", respect.strand = FALSE, motif_pvalue.method = c("dynamic", "exhaustive"), scan_sequences.qvals.method = c("BH", "fdr", "bonferroni"), mode = c("total.hits", "seq.hits"), pseudocount = 1)
motifs |
See |
sequences |
|
bkg.sequences |
|
max.p |
|
max.q |
|
max.e |
|
qval.method |
|
threshold |
|
threshold.type |
|
verbose |
|
RC |
|
use.freq |
|
shuffle.k |
|
shuffle.method |
|
return.scan.results |
|
nthreads |
|
rng.seed |
|
motif_pvalue.k |
|
use.gaps |
|
allow.nonfinite |
|
warn.NA |
|
no.overlaps |
|
no.overlaps.by.strand |
|
no.overlaps.strat |
|
respect.strand |
|
motif_pvalue.method |
|
scan_sequences.qvals.method |
|
mode |
|
pseudocount |
|
To find enriched motifs, scan_sequences() is run on both
target and background sequences.
stats::fisher.test() is run to test for enrichment.
See scan_sequences() for more info on scanning parameters.
DataFrame Enrichment results in a DataFrame. Function args and
(optionally) scan results are stored in the metadata slot.
Benjamin Jean-Marie Tremblay [email protected]
McLeay R, Bailey TL (2010). “Motif Enrichment Analysis: A unified framework and method evaluation.” BMC Bioinformatics, 11.
scan_sequences(), shuffle_sequences(),
add_multifreq(), motif_pvalue(), enrich_motifs2(),
scan_sequences2(), match_bkg()
data(ArabidopsisPromoters) data(ArabidopsisMotif) if (R.Version()$arch != "i386") { enrich_motifs(ArabidopsisMotif, ArabidopsisPromoters, threshold = 0.01) }data(ArabidopsisPromoters) data(ArabidopsisMotif) if (R.Version()$arch != "i386") { enrich_motifs(ArabidopsisMotif, ArabidopsisPromoters, threshold = 0.01) }
enrich_motifs2() is a deliberately pared-down counterpart to
enrich_motifs(), with a default surface that mirrors the command-line
tool yamtk. It exposes a single p-value
cutoff for the hits, a single q-value cutoff for the results, and two
test modes ("seqs" and "sites", both Fisher's exact), and it leans on
scan_sequences2() under the hood to scan the target and background
sequences. The p-value adjustment is hard-coded to Benjamini-Hochberg.
enrich_motifs2(motifs, sequences, bkg.sequences = NULL, pvalue = 1e-04, qvalue = 0.1, test = c("seqs", "sites"), RC = TRUE, shuffle.k = 2L, rng.seed = sample.int(1e+06, 1), pseudocount = 1L, nthreads = 1)enrich_motifs2(motifs, sequences, bkg.sequences = NULL, pvalue = 1e-04, qvalue = 0.1, test = c("seqs", "sites"), RC = TRUE, shuffle.k = 2L, rng.seed = sample.int(1e+06, 1), pseudocount = 1L, nthreads = 1)
motifs |
See |
sequences |
|
bkg.sequences |
|
pvalue |
|
qvalue |
|
test |
|
RC |
|
shuffle.k |
|
rng.seed |
|
pseudocount |
|
nthreads |
|
Use enrich_motifs() when you need any of: q-value adjustment methods
other than BH; multifreq / gapped motifs; multiple-testing E-values;
the respect.strand, allow.nonfinite, or non-pvalue threshold.type
machinery; or amino-acid motifs.
Internally, scan_sequences2() is run once on the target set and once on
the background set; the resulting hit tables are aggregated per motif into
a 2x2 contingency table whose entries depend on test:
"seqs"a = target sequences with >=1 hit;
b = background sequences with >=1 hit;
c = target sequences with no hits;
d = background sequences with no hits.
"sites"a = total hits in target;
b = total hits in background;
c = scannable target positions minus a;
d = scannable background positions minus b.
Scannable positions are computed per motif as
sum(pmax(width(seqs) - motif_width + 1, 0)), doubled if RC = TRUE.
Each cell has pseudocount added before stats::fisher.test() is run
with alternative = "greater". Q-values are computed via
p.adjust(method = "BH") across all input motifs (not just the ones
that pass the filter), then rows with qvalue > qvalue are dropped.
A data.frame with one row per significant motif, sorted by
pvalue ascending. Columns:
motif: motif name
motif.i: 1-based index into the input motifs
consensus: IUPAC consensus of the motif
target.seq.n: number of target sequences scanned
target.seq.hits: number of target sequences containing >=1 hit
target.site.hits: total hit count across all target sequences
bkg.seq.n: number of background sequences scanned
bkg.seq.hits: number of background sequences containing >=1 hit
bkg.site.hits: total hit count across all background sequences
enrichment: fold-enrichment (rate ratio; floor of 0.5/n in the
denominator to avoid division by zero), per yamtk's formula
log2.enrichment: log2(enrichment)
pvalue: Fisher's exact one-sided ("greater") p-value
qvalue: Benjamini-Hochberg adjusted p-value across motifs
Benjamin Jean-Marie Tremblay, [email protected]
Tremblay BJM (2026). yamtk: Yet Another Motif ToolKit. https://github.com/bjmt/yamtk.
McLeay R, Bailey TL (2010). "Motif Enrichment Analysis: A unified framework and method evaluation." BMC Bioinformatics, 11.
enrich_motifs(), scan_sequences2(), shuffle_sequences(),
match_bkg(), motif_pvalue()
library(universalmotif) data(ArabidopsisPromoters) data(ArabidopsisMotif) if (R.Version()$arch != "i386") { enrich_motifs2(ArabidopsisMotif, ArabidopsisPromoters, pvalue = 1e-3, qvalue = 0.5, rng.seed = 1) }library(universalmotif) data(ArabidopsisPromoters) data(ArabidopsisMotif) if (R.Version()$arch != "i386") { enrich_motifs2(ArabidopsisMotif, ArabidopsisPromoters, pvalue = 1e-3, qvalue = 0.5, rng.seed = 1) }
universalmotif format.A simple DNA motif. To recreate this motif:
create_motif("TATAWAW", nsites = numeric())
examplemotifexamplemotif
universalmotif format.A simple DNA motif with a non-empty multifreq slot.
To recreate to this motif:
add_multifreq(examplemotif, DNAStringSet(rep(c("CAAAACC", "CTTTTCC"), 3)))
examplemotif2examplemotif2
Filter motifs based on the contents of available universalmotif
slots. If the input motifs are not of universalmotif, then
they will be converted for the duration of the filter_motifs() operation.
filter_motifs(motifs, name, altname, family, organism, width, alphabet, type, icscore, nsites, strand, pval, qval, eval, extrainfo)filter_motifs(motifs, name, altname, family, organism, width, alphabet, type, icscore, nsites, strand, pval, qval, eval, extrainfo)
motifs |
|
name |
|
altname |
|
family |
|
organism |
|
width |
|
alphabet |
|
type |
|
icscore |
|
nsites |
|
strand |
|
pval |
|
qval |
|
eval |
|
extrainfo |
|
list Motifs. An attempt will be made to preserve the original
class, see convert_motifs() for limitations.
Benjamin Jean-Marie Tremblay, [email protected]
## By minimum IC: jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) jaspar.ic3 <- filter_motifs(jaspar, icscore = 3) ## Starting from version 1.10.0 of the universalmotif package, one ## could instead make use of the universalmotif_df structure: jaspar.ic3 <- jaspar |> to_df() |> subset(icscore > 3) |> to_list() ## By organism: ## Not run: library(MotifDb) motifs <- convert_motifs(MotifDb) motifs <- filter_motifs(motifs, organism = c("Athaliana", "Mmusculus"), extrainfo = c("dataSource" = "cisbp_1.02")) ## Or: motifs <- convert_motifs(MotifDb) |> to_df() |> subset(organism %in% c("Athaliana", "Mmusculus") & dataSource == "cisbp_1.02") |> to_list() ## End(Not run)## By minimum IC: jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) jaspar.ic3 <- filter_motifs(jaspar, icscore = 3) ## Starting from version 1.10.0 of the universalmotif package, one ## could instead make use of the universalmotif_df structure: jaspar.ic3 <- jaspar |> to_df() |> subset(icscore > 3) |> to_list() ## By organism: ## Not run: library(MotifDb) motifs <- convert_motifs(MotifDb) motifs <- filter_motifs(motifs, organism = c("Athaliana", "Mmusculus"), extrainfo = c("dataSource" = "cisbp_1.02")) ## Or: motifs <- convert_motifs(MotifDb) |> to_df() |> subset(organism %in% c("Athaliana", "Mmusculus") & dataSource == "cisbp_1.02") |> to_list() ## End(Not run)
DataFrame of polygon coordinates used by view_motifs() for plotting
letters. It was generated using the createPolygons function from the
gglogo package for the font Roboto Medium.
fontDFrobotofontDFroboto
For a set of input sequences, calculate the overall sequence background for
any k-let size. For very large sequences DNA and RNA sequences (in the billions of bases),
please be aware of the much faster and more efficient
Biostrings::oligonucleotideFrequency().
get_bkg() can still be used in these cases, though it may take several seconds or
minutes to calculate the results (depending on requested k-let sizes).
get_bkg(sequences, k = 1:3, as.prob = NULL, pseudocount = 0, alphabet = NULL, to.meme = NULL, RC = FALSE, list.out = NULL, nthreads = 1, merge.res = TRUE, window = FALSE, window.size = 0.1, window.overlap = 0)get_bkg(sequences, k = 1:3, as.prob = NULL, pseudocount = 0, alphabet = NULL, to.meme = NULL, RC = FALSE, list.out = NULL, nthreads = 1, merge.res = TRUE, window = FALSE, window.size = 0.1, window.overlap = 0)
sequences |
|
k |
|
as.prob |
Deprecated. |
pseudocount |
|
alphabet |
|
to.meme |
If not |
RC |
|
list.out |
Deprecated. |
nthreads |
|
merge.res |
|
window |
|
window.size |
|
window.overlap |
|
If to.meme = NULL, a DataFrame with columns klet, count,
and probability. If merge.res = FALSE, there will be an additional
sequence column. If window = TRUE, there will be an additional start
and stop columns.
If to.meme is not NULL, then NULL is returned, invisibly.
Benjamin Jean-Marie Tremblay, [email protected]
Bailey TL, Elkan C (1994). “Fitting a mixture model by expectation maximization to discover motifs in biopolymers.” Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 2, 28-36.
create_sequences(), scan_sequences(), shuffle_sequences()
## Compare to Biostrings version library(Biostrings) seqs.DNA <- create_sequences() bkg.DNA <- get_bkg(seqs.DNA, k = 3) bkg.DNA2 <- oligonucleotideFrequency(seqs.DNA, 3, 1, as.prob = FALSE) bkg.DNA2 <- colSums(bkg.DNA2) all(bkg.DNA$count == bkg.DNA2) ## Create a MEME background file get_bkg(seqs.DNA, k = 1:3, to.meme = stdout(), pseudocount = 1) ## Non-DNA/RNA/AA alphabets seqs.QWERTY <- create_sequences("QWERTY") bkg.QWERTY <- get_bkg(seqs.QWERTY, k = 1:2)## Compare to Biostrings version library(Biostrings) seqs.DNA <- create_sequences() bkg.DNA <- get_bkg(seqs.DNA, k = 3) bkg.DNA2 <- oligonucleotideFrequency(seqs.DNA, 3, 1, as.prob = FALSE) bkg.DNA2 <- colSums(bkg.DNA2) all(bkg.DNA$count == bkg.DNA2) ## Create a MEME background file get_bkg(seqs.DNA, k = 1:3, to.meme = stdout(), pseudocount = 1) ## Non-DNA/RNA/AA alphabets seqs.QWERTY <- create_sequences("QWERTY") bkg.QWERTY <- get_bkg(seqs.QWERTY, k = 1:2)
implant_motifs() overwrites bases in an existing XStringSet with
samples drawn column-by-column from one or more motif PPMs at chosen
positions, returning the modified sequences together (optionally)
with a ground-truth data.frame of where each implant was placed.
This produces benchmark-grade positive sequences for testing
scan_sequences2(), motif_finder(), enrich_motifs2(),
motif_peaks(), or any other discovery / scanning pipeline against
a known answer.
implant_motifs(motifs, sequences, n.per.seq = 1L, rate = NULL, positions = NULL, strand = c("both", "+"), centre.bias = 1L, min.spacing = 0L, max.retries = 100L, return.indices = FALSE)implant_motifs(motifs, sequences, n.per.seq = 1L, rate = NULL, positions = NULL, strand = c("both", "+"), centre.bias = 1L, min.spacing = 0L, max.retries = 100L, return.indices = FALSE)
motifs |
See |
sequences |
|
n.per.seq |
|
rate |
|
positions |
|
strand |
|
centre.bias |
|
min.spacing |
|
max.retries |
|
return.indices |
|
Three insertion modes (mutually exclusive; later non-NULL argument
wins):
Fixed count (n.per.seq): plant exactly n.per.seq
instances in every sequence. Matches HOMER-style ground-truth
simulations.
Poisson rate (rate): per-sequence count drawn from
rpois(1, rate * width(seq)). Density scales naturally with
sequence length.
Explicit positions (positions): a list of length
length(sequences); element i is a 1-based integer vector of
start positions for that sequence. Bypasses centre.bias,
min.spacing, and max.retries.
For each implant the function (a) picks a motif uniformly at random
from motifs, (b) chooses a strand (uniform +/- if
strand = "both"), (c) picks a start position (uniform or centre-
biased via Irwin-Hall, retried up to max.retries times if it
collides with an already-placed implant within min.spacing),
(d) samples one instance from the motif's PPM column-by-column,
reverse-complements if needed, and (e) overwrites that span in the
target sequence.
If return.indices = FALSE: an XStringSet of the same
length, width, and class as sequences, with bases overwritten
at the implant positions. Names are preserved. If
return.indices = TRUE: a list with two elements –
sequences (the modified XStringSet, as above) and indices
(a data.frame with columns sequence.i (1-based),
motif.i (1-based index into motifs), start (1-based),
width, strand ("+" / "-"), and planted (the actual
implanted string, already reverse-complemented when
strand == "-")).
Benjamin Jean-Marie Tremblay, [email protected]
sample_sites(), create_sequences(), scan_sequences2(),
motif_finder(), motif_peaks(), enrich_motifs2()
## Not run: library(universalmotif) data(ArabidopsisMotif) seqs <- create_sequences("DNA", seqnum = 100, seqlen = 500) set.seed(1) res <- implant_motifs(ArabidopsisMotif, seqs, n.per.seq = 1, return.indices = TRUE) out <- res$sequences truth <- res$indices ## Recover-rate check hits <- scan_sequences(ArabidopsisMotif, out, threshold = 0.85, threshold.type = "logodds") ## Density check peaks <- motif_peaks(out, motif = ArabidopsisMotif) plot_motif_peaks(peaks) ## End(Not run)## Not run: library(universalmotif) data(ArabidopsisMotif) seqs <- create_sequences("DNA", seqnum = 100, seqlen = 500) set.seed(1) res <- implant_motifs(ArabidopsisMotif, seqs, n.per.seq = 1, return.indices = TRUE) out <- res$sequences truth <- res$indices ## Recover-rate check hits <- scan_sequences(ArabidopsisMotif, out, threshold = 0.85, threshold.type = "logodds") ## Density check peaks <- motif_peaks(out, motif = ArabidopsisMotif) plot_motif_peaks(peaks) ## End(Not run)
For use with compare_motifs(). The precomputed scores allow for fast P-value estimation.
These scores were
generated using make_DBscores() with the JASPAR2018 CORE motif set.
The scores are organized in a DataFrame.
In this DataFrame is the location and scale
of scores resulting from a statistical
distribution using the the comparisons of JASPAR2018 CORE motifs with
randomized motifs of the specified
subject and target motif length. Created using make_DBscores() from
universalmotif v1.4.0. The parameters used can be seen via
S4Vectors::metadata(JASPAR2018_CORE_DBSCORES).
JASPAR2018_CORE_DBSCORESJASPAR2018_CORE_DBSCORES
DataFrame with function args in the metadata slot.
Generate data used by compare_motifs() for P-value calculations. By default,
compare_motifs() uses an internal database based on the JASPAR2018 core motifs
(Khan et al. 2018). Parameters for distributions are
are estimated for every combination of motif widths.
make_DBscores(db.motifs, method = c("PCC", "EUCL", "SW", "KL", "WEUCL", "ALLR", "BHAT", "HELL", "WPCC", "SEUCL", "MAN", "ALLR_LL"), shuffle.db = TRUE, shuffle.k = 3, shuffle.method = "linear", rand.tries = 1000, widths = 5:30, min.position.ic = 0, normalise.scores = c(FALSE, TRUE), min.overlap = 6, min.mean.ic = 0.25, progress = TRUE, nthreads = 1, tryRC = TRUE, score.strat = c("sum", "a.mean", "g.mean", "median", "wa.mean", "wg.mean", "fzt"))make_DBscores(db.motifs, method = c("PCC", "EUCL", "SW", "KL", "WEUCL", "ALLR", "BHAT", "HELL", "WPCC", "SEUCL", "MAN", "ALLR_LL"), shuffle.db = TRUE, shuffle.k = 3, shuffle.method = "linear", rand.tries = 1000, widths = 5:30, min.position.ic = 0, normalise.scores = c(FALSE, TRUE), min.overlap = 6, min.mean.ic = 0.25, progress = TRUE, nthreads = 1, tryRC = TRUE, score.strat = c("sum", "a.mean", "g.mean", "median", "wa.mean", "wg.mean", "fzt"))
db.motifs |
|
method |
|
shuffle.db |
|
shuffle.k |
|
shuffle.method |
|
rand.tries |
|
widths |
|
min.position.ic |
|
normalise.scores |
|
min.overlap |
|
min.mean.ic |
|
progress |
|
nthreads |
|
tryRC |
|
score.strat |
|
See compare_motifs() for more info on comparison parameters.
To replicate the internal universalmotif DB scores, run
make_DBscores() with the default settings. Note that this will be
a slow process.
Arguments widths, method, normalise.scores and score.strat are
vectorized; all combinations will be attempted.
A DataFrame with score distributions for the
input database. If more than one make_DBscores() run occurs (i.e. args
method, normalise.scores or score.strat are longer than 1), then
the function args are included in the metadata slot.
Benjamin Jean-Marie Tremblay, [email protected]
Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, Bessy A, Cheneby J, Kulkarni SR, Tan G, Baranasic D, Arenillas DJ, Sandelin A, Vandepoele K, Lenhard B, Ballester B, Wasserman WW, Parcy F, Mathelier A (2018). “JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework.” Nucleic Acids Research, 46, D260-D266.
## Not run: library(MotifDb) motifs <- convert_motifs(MotifDb[1:100]) scores <- make_DBscores(motifs, method = "PCC") compare_motifs(motifs, 1:100, db.scores = scores) ## End(Not run)## Not run: library(MotifDb) motifs <- convert_motifs(MotifDb[1:100]) scores <- make_DBscores(motifs, method = "PCC") compare_motifs(motifs, 1:100, db.scores = scores) ## End(Not run)
For each sequence in sequences, match_bkg() draws one or more
background sequences from universe that share its GC fraction and
length. This is the HOMER-style binned matching used by motif
enrichment tools to control for sequence-composition biases (GC,
length) that would otherwise inflate enrichment estimates relative
to shuffle-only nulls. Matching can be extended to additional
external per-sequence covariates (e.g. ChIP signal strength,
conservation, mappability) via covariates, which simply add more
binned axes to the grid.
match_bkg(sequences, universe = NULL, genome = NULL, exclude = NULL, n.candidates = NULL, chromosomes = NULL, n.per.target = 1L, n.bins.gc = 20L, n.bins.length = 10L, covariates = NULL, universe.covariates = NULL, n.bins.covariates = 10L, bin.type.covariates = "quantile", unique = TRUE, return.indices = FALSE)match_bkg(sequences, universe = NULL, genome = NULL, exclude = NULL, n.candidates = NULL, chromosomes = NULL, n.per.target = 1L, n.bins.gc = 20L, n.bins.length = 10L, covariates = NULL, universe.covariates = NULL, n.bins.covariates = 10L, bin.type.covariates = "quantile", unique = TRUE, return.indices = FALSE)
sequences |
|
universe |
|
genome |
A |
exclude |
|
n.candidates |
|
chromosomes |
|
n.per.target |
|
n.bins.gc |
|
n.bins.length |
|
covariates |
|
universe.covariates |
|
n.bins.covariates |
|
bin.type.covariates |
|
unique |
|
return.indices |
|
Algorithm (HOMER-style):
Compute per-sequence GC fraction and length for sequences
and universe (plus any supplied covariates).
Bin the universe into a grid: GC bins equal-width over
[0, 1], length bins quantile-based on the universe widths,
and one extra axis per covariate (quantile- or equal-width
binned, see bin.type.covariates).
For each target, locate its joint bin and sample
n.per.target universe sequences from that bin (without
replacement when unique = TRUE). If the bin is empty or
undersized, expand outward in Manhattan-distance rings.
Use the result as a bkg.sequences argument to enrich_motifs2()
or motif_finder() for a composition-controlled null.
shuffle_sequences() is a complementary alternative that
randomises each input in place (preserves k-let composition);
match_bkg() instead samples real sequences from a larger pool.
If return.indices = FALSE (default): an XStringSet
(same subclass as sequences) of length
length(sequences) * n.per.target. The first n.per.target
entries correspond to sequences[[1]], the next to
sequences[[2]], etc. If return.indices = TRUE: a
data.frame with columns target.i, target.gc,
target.length, universe.i, universe.gc, universe.length,
followed (when covariates is supplied) by a
target.<name> / universe.<name> pair for each covariate.
Benjamin Jean-Marie Tremblay, [email protected]
Heinz S, Benner C, Spann N, et al. (2010). "Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities." Molecular Cell, 38(4):576-589. doi:10.1016/j.molcel.2010.05.004.
shuffle_sequences(), enrich_motifs2(), motif_finder(),
plot_match_bkg()
## Not run: library(universalmotif) data(ArabidopsisPromoters) target <- ArabidopsisPromoters[1:10] universe <- ArabidopsisPromoters[11:50] bkg <- match_bkg(target, universe, n.per.target = 2) ## Use as a null background for enrichment / discovery # enrich_motifs2(motifs, target, bkg.sequences = bkg) # motif_finder(target, bkg.sequences = bkg) plot_match_bkg(target, bkg) ## Additionally match on an external covariate (e.g. a ChIP signal ## score aligned to each sequence). Covariate matching needs an ## explicit universe + universe.covariates. sig.t <- runif(length(target)) sig.u <- runif(length(universe)) idx <- match_bkg(target, universe, covariates = data.frame(signal = sig.t), universe.covariates = data.frame(signal = sig.u), return.indices = TRUE) plot_match_bkg(indices = idx) # GC, length and signal panels ## End(Not run)## Not run: library(universalmotif) data(ArabidopsisPromoters) target <- ArabidopsisPromoters[1:10] universe <- ArabidopsisPromoters[11:50] bkg <- match_bkg(target, universe, n.per.target = 2) ## Use as a null background for enrichment / discovery # enrich_motifs2(motifs, target, bkg.sequences = bkg) # motif_finder(target, bkg.sequences = bkg) plot_match_bkg(target, bkg) ## Additionally match on an external covariate (e.g. a ChIP signal ## score aligned to each sequence). Covariate matching needs an ## explicit universe + universe.covariates. sig.t <- runif(length(target)) sig.u <- runif(length(universe)) idx <- match_bkg(target, universe, covariates = data.frame(signal = sig.t), universe.covariates = data.frame(signal = sig.u), return.indices = TRUE) plot_match_bkg(indices = idx) # GC, length and signal panels ## End(Not run)
Aligns the motifs using compare_motifs(), then averages the
motif PPMs. Currently the multifreq slot, if filled in any of the motifs,
will be dropped. Only 0-order background probabilities will be kept.
Motifs are merged one at a time, starting with the first entry in the
list. For merging regular DNA/RNA motifs, see the simpler and
faster merge_motifs2().
merge_motifs(motifs, method = "ALLR", use.type = "PPM", min.overlap = 6, min.mean.ic = 0.25, tryRC = TRUE, relative_entropy = FALSE, normalise.scores = FALSE, min.position.ic = 0, score.strat = "sum", new.name = NULL)merge_motifs(motifs, method = "ALLR", use.type = "PPM", min.overlap = 6, min.mean.ic = 0.25, tryRC = TRUE, relative_entropy = FALSE, normalise.scores = FALSE, min.position.ic = 0, score.strat = "sum", new.name = NULL)
motifs |
See |
method |
|
use.type |
|
min.overlap |
|
min.mean.ic |
|
tryRC |
|
relative_entropy |
|
normalise.scores |
|
min.position.ic |
|
score.strat |
|
new.name |
|
See compare_motifs() for more info on comparison parameters.
If using a comparison metric where 0s are not allowed (KL, ALLR, ALLR_LL, IS),
then pseudocounts will be added internally. These pseudocounts are only used for
comparison and alignment, and are not used in the final merging step.
Note: score.strat = "a.mean" is NOT recommended, as merge_motifs() will
not discriminate between two alignments with equal mean scores, even if one
alignment is longer than the other.
A single motif object. See convert_motifs() for
available formats.
Benjamin Jean-Marie Tremblay, [email protected]
compare_motifs(), merge_motifs2(), merge_similar(),
trim_motifs()
## Not run: library(MotifDb) merged.motif <- merge_motifs(MotifDb[1:5]) ## End(Not run) m1 <- create_motif("TTAAACCCC", name = "1") m2 <- create_motif("AACC", name = "2") m3 <- create_motif("AACCCCGG", name = "3") view_motifs(merge_motifs(c(m1, m2, m3)))## Not run: library(MotifDb) merged.motif <- merge_motifs(MotifDb[1:5]) ## End(Not run) m1 <- create_motif("TTAAACCCC", name = "1") m2 <- create_motif("AACC", name = "2") m3 <- create_motif("AACCCCGG", name = "3") view_motifs(merge_motifs(c(m1, m2, m3)))
merge_motifs2() is a faster minimalist counterpart to merge_motifs() that
aligns motifs using compare_motifs2()'s alignment finder and averages
their position-probability columns in a single shared coordinate frame.
merge_motifs2(motifs, min.overlap = 5L, RC = TRUE, new.name = NULL, weighted = FALSE, nthreads = 1)merge_motifs2(motifs, min.overlap = 5L, RC = TRUE, new.name = NULL, weighted = FALSE, nthreads = 1)
motifs |
|
min.overlap |
|
RC |
|
new.name |
|
weighted |
|
nthreads |
|
For each non-anchor motif, compare_motifs2() gives the best
alignment offset, strand, and overlap against the anchor. Each
aligned motif (reverse-complemented if its best alignment is -
strand) is then placed in a unified column coordinate frame whose
origin coincides with the anchor's first column. The output's column
at frame position p is the (optionally @nsites-weighted) mean of
the PPM columns from all input motifs that cover p; positions
covered by only some motifs are still emitted (no automatic
IC-trimming). Run trim_motifs() on the result if you want to
shrink low-information edges.
The merged motif's @bkg is the unweighted mean of the input
motifs' @bkg vectors (or uniform if missing). @nsites is the
sum of input @nsites.
A single universalmotif S4 object of type "PPM".
Benjamin Jean-Marie Tremblay, [email protected]
merge_motifs(), merge_similar2(), compare_motifs2(),
trim_motifs()
library(universalmotif) m1 <- create_motif("TTGACATA", name = "a") m2 <- create_motif("CTTGACAT", name = "b") m3 <- create_motif("TGACATAT", name = "c") merged <- merge_motifs2(list(m1, m2, m3)) mergedlibrary(universalmotif) m1 <- create_motif("TTGACATA", name = "a") m2 <- create_motif("CTTGACAT", name = "b") m3 <- create_motif("TGACATAT", name = "c") merged <- merge_motifs2(list(m1, m2, m3)) merged
Given a list of motifs, merge_similar() will identify similar motifs with
compare_motifs(), and merge similar ones with merge_motifs(). For merging
regular DNA/RNA motifs, see the simpler and faster merge_similar2().
merge_similar(motifs, threshold = 0.95, threshold.type = "score.abs", method = "PCC", use.type = "PPM", min.overlap = 6, min.mean.ic = 0, tryRC = TRUE, relative_entropy = FALSE, normalise.scores = FALSE, min.position.ic = 0, score.strat.compare = "a.mean", score.strat.merge = "sum", nthreads = 1, return.clusters = FALSE)merge_similar(motifs, threshold = 0.95, threshold.type = "score.abs", method = "PCC", use.type = "PPM", min.overlap = 6, min.mean.ic = 0, tryRC = TRUE, relative_entropy = FALSE, normalise.scores = FALSE, min.position.ic = 0, score.strat.compare = "a.mean", score.strat.merge = "sum", nthreads = 1, return.clusters = FALSE)
motifs |
See |
threshold |
|
threshold.type |
|
method |
|
use.type |
|
min.overlap |
|
min.mean.ic |
|
tryRC |
|
relative_entropy |
|
normalise.scores |
|
min.position.ic |
|
score.strat.compare |
|
score.strat.merge |
|
nthreads |
|
return.clusters |
|
See compare_motifs() for more info on comparison parameters, and
merge_motifs() for more info on motif merging.
See convert_motifs() for available output formats.
Benjamin Jean-Marie Tremblay, [email protected]
compare_motifs(), merge_motifs(), merge_similar2()
## Not run: library(MotifDb) motifs <- filter_motifs(MotifDb, family = "bHLH")[1:50] length(motifs) motifs <- merge_similar(motifs) length(motifs) ## End(Not run)## Not run: library(MotifDb) motifs <- filter_motifs(MotifDb, family = "bHLH")[1:50] length(motifs) motifs <- merge_similar(motifs) length(motifs) ## End(Not run)
merge_similar2() is a faster minimalist counterpart to merge_similar()
that builds a pairwise similarity-significance graph using
compare_motifs2()'s empirical or parametric null q-values, finds
connected components on that graph, and collapses each component
(cluster) into a single merged motif via merge_motifs2().
merge_similar2(motifs, qvalue = 0.01, null = c("empirical", "parametric"), min.overlap = 5L, RC = TRUE, bkg = NULL, weighted = FALSE, return.clusters = FALSE, nthreads = 1)merge_similar2(motifs, qvalue = 0.01, null = c("empirical", "parametric"), min.overlap = 5L, RC = TRUE, bkg = NULL, weighted = FALSE, return.clusters = FALSE, nthreads = 1)
motifs |
|
qvalue |
|
null |
|
min.overlap |
|
RC |
|
bkg |
|
weighted |
|
return.clusters |
|
nthreads |
|
Unlike merge_similar() (which builds a distance matrix from raw
similarity scores and applies hierarchical clustering at an absolute
score cutoff) this function clusters on statistical
significance: two motifs are linked if their pairwise q-value
meets the user's qvalue cutoff. This matches the STAMP /
TOMTOM-clustering semantics ("group all motifs that are
significantly similar"), is deterministic, has no linkage-method
choice to expose, and lets the user reason about the cutoff in
q-value units rather than abstract distance.
The significance graph is built from a symmetrised q-value matrix:
Qsym[i,j] = pmin(Q[i,j], Q[j,i]). compare_motifs2()'s per-query BH
adjustment makes its native q-value matrix non-symmetric; the min-
symmetrisation defines a pair as linked if either direction meets
the cutoff (more permissive of the two, which is appropriate when
the goal is to group similar motifs).
Clustering is by connected components on the resulting graph,
implemented with a small union-find. Each component (size >= 2)
becomes a single merged motif via merge_motifs2(); singletons
(size == 1) pass through unchanged.
If return.clusters = FALSE (default): a list of
universalmotif S4 objects with similar motifs collapsed (the
list is shorter than the input whenever any merging occurred).
Singletons pass through unchanged.
If return.clusters = TRUE: a universalmotif_df (as produced
by to_df()) describing the input motifs – one row per input
motif, with the motif column holding the motif objects and
name their names – augmented with two extra columns:
motif.i (1-based input index) and cluster (1-based cluster
id; motifs in the same cluster share an id).
Benjamin Jean-Marie Tremblay, [email protected]
merge_similar(), merge_motifs2(), compare_motifs2()
## Not run: library(universalmotif) ## A clear cluster of 3 related motifs + 1 unrelated motif m1 <- create_motif("TTGACATA", name = "a") m2 <- create_motif("CTTGACAT", name = "b") m3 <- create_motif("TGACATAT", name = "c") m4 <- create_motif("GGGCCCCC", name = "unrelated") merged <- merge_similar2(list(m1, m2, m3, m4), qvalue = 0.05) length(merged) # 2: one merged + the unrelated singleton ## End(Not run)## Not run: library(universalmotif) ## A clear cluster of 3 related motifs + 1 unrelated motif m1 <- create_motif("TTGACATA", name = "a") m2 <- create_motif("CTTGACAT", name = "b") m3 <- create_motif("TGACATAT", name = "c") m4 <- create_motif("GGGCCCCC", name = "unrelated") merged <- merge_similar2(list(m1, m2, m3, m4), qvalue = 0.05) length(merged) # 2: one merged + the unrelated singleton ## End(Not run)
motif_coocc() tests every motif pair for over-co-occurrence across
a set of sequences using a one-sided Fisher's exact test on the 2x2
contingency table of per-sequence presence/absence. The result is
the long-format pair table, BH-corrected across all tested pairs.
Optionally, when max.distance is non-NULL, the function also
reports two descriptive spatial columns, both.clustered
(number of co-occurring sequences with a within-max.distance
(A, B) hit pair) and median.distance (the median nearest-pair
spacing), so users can flag heterodimer-like arrangements. The
Fisher p-value itself is always computed on the unfiltered 2x2
(the spatial filter is informational, not a test).
motif_coocc(motifs, sequences = NULL, hits = NULL, n.sequences = NULL, pvalue = 1e-04, RC = TRUE, max.distance = NULL, min.coocc = 1L, pseudocount = 0L, self.pairs = FALSE, nthreads = 1L)motif_coocc(motifs, sequences = NULL, hits = NULL, n.sequences = NULL, pvalue = 1e-04, RC = TRUE, max.distance = NULL, min.coocc = 1L, pseudocount = 0L, self.pairs = FALSE, nthreads = 1L)
motifs |
See |
sequences |
|
hits |
|
n.sequences |
|
pvalue |
|
RC |
|
max.distance |
|
min.coocc |
|
pseudocount |
|
self.pairs |
|
nthreads |
|
Two input paths are supported:
Internal scan (default): pass motifs + sequences and
motif_coocc() will call scan_sequences2() with the given
pvalue and RC arguments to build the hit table. DNA/RNA
only (the restriction comes from scan_sequences2()).
Precomputed hits: pass motifs + hits + n.sequences.
hits is a data.frame or GRanges with motif.i and
sequence.i columns (the native return shape of
scan_sequences() / scan_sequences2()), optionally start
when max.distance is set. This path accepts any motif
alphabet; once a hit table exists, co-occurrence is pure
set arithmetic on integer indices and no sequence bytes are
read.
data.frame with columns:
motif_a, motif_b: motif names (1-based indexing into the
supplied motifs list; deduplicated names if duplicates exist).
a_only, b_only, both, neither: 2x2 contingency cells.
odds_ratio: conditional MLE odds ratio from fisher.test.
pvalue: one-sided Fisher's exact p-value
(alternative = "greater"). NA when both < min.coocc.
qvalue: BH-corrected q-value across all tested pairs.
both.clustered (spatial mode only): subset of both
where a within-max.distance (A, B) hit pair exists.
median.distance (spatial mode only): median of those
nearest-pair distances; NA if no clustered pair.
Rows are sorted by q-value ascending; NA-pvalue rows go last.
Benjamin Jean-Marie Tremblay, [email protected]
Heinz S, Benner C, Spann N, et al. (2010). "Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities." Molecular Cell, 38(4):576-589. doi:10.1016/j.molcel.2010.05.004. (HOMER's pairwise-Fisher co-occurrence convention.)
Whitington T, Frith MC, Johnson J, Bailey TL (2011). "Inferring transcription factor complexes from ChIP-seq data." Nucleic Acids Research, 39(15):e98. doi:10.1093/nar/gkr341. (SpaMo, the spatial-distance motif-pair analysis.)
scan_sequences2(), enrich_motifs2(), motif_peaks()
## Not run: library(universalmotif) data(ArabidopsisMotif) seqs <- create_sequences("DNA", seqnum = 200, seqlen = 500) ## Toy: implant two copies of the same motif jointly in all seqs. set.seed(1) imp <- implant_motifs(list(ArabidopsisMotif, ArabidopsisMotif), seqs, n.per.seq = 1, return.indices = TRUE) co <- motif_coocc(list(ArabidopsisMotif, ArabidopsisMotif), imp$sequences, pvalue = 1e-3) co ## End(Not run)## Not run: library(universalmotif) data(ArabidopsisMotif) seqs <- create_sequences("DNA", seqnum = 200, seqlen = 500) ## Toy: implant two copies of the same motif jointly in all seqs. set.seed(1) imp <- implant_motifs(list(ArabidopsisMotif, ArabidopsisMotif), seqs, n.per.seq = 1, return.indices = TRUE) co <- motif_coocc(list(ArabidopsisMotif, ArabidopsisMotif), imp$sequences, pvalue = 1e-3) co ## End(Not run)
motif_finder() is a minimalist de novo motif discovery function,
whose defaults mirror the command-line tool
yamtk. The pipeline works through a
user-controlled range of motif widths; at each width it enumerates the
over-represented k-mer seeds (using a Fisher's exact test on per-sequence
presence against a background set), aligns the Hamming-1 neighbours of
the best seed to form a PPM, refines that PPM over two re-scan passes,
accepts the motif if its Fisher's exact p-value passes stop.pvalue, and
then masks the covered positions and repeats, up to nmotifs times per
width. Once every width is done, the motifs are deduplicated across
widths by coverage overlap, BH-adjusted, IC-trimmed at the flanks, and
returned in p-value order.
motif_finder(sequences, bkg.sequences = NULL, min.width = 6L, max.width = 15L, nmotifs = 10L, hit.pvalue = 1e-04, stop.pvalue = 0.001, qvalue = 0.001, dedup.overlap = 0.5, RC = TRUE, shuffle.k = 2L, rng.seed = sample.int(1e+06, 1), pseudocount = 1L, nthreads = 1)motif_finder(sequences, bkg.sequences = NULL, min.width = 6L, max.width = 15L, nmotifs = 10L, hit.pvalue = 1e-04, stop.pvalue = 0.001, qvalue = 0.001, dedup.overlap = 0.5, RC = TRUE, shuffle.k = 2L, rng.seed = sample.int(1e+06, 1), pseudocount = 1L, nthreads = 1)
sequences |
|
bkg.sequences |
|
min.width |
|
max.width |
|
nmotifs |
|
hit.pvalue |
|
stop.pvalue |
|
qvalue |
|
dedup.overlap |
|
RC |
|
shuffle.k |
|
rng.seed |
|
pseudocount |
|
nthreads |
|
Algorithm and defaults are a faithful port of yamtk, whose own design is in turn based on the STREME algorithm (Bailey 2021): seed enumeration via word counting with per-sequence Fisher's exact ranking, iterative PPM refinement on positive sequences, per-motif Fisher's exact significance against a shuffled or user-supplied background, and position masking between discoveries.
Motifs with qvalue > qvalue are dropped from the result. To see
all discovered motifs regardless of significance, set
qvalue = 1.
A universalmotif_df (see to_df()) with one row per
discovered motif, sorted by pvalue ascending. The standard
to_df() columns (name, altname, family, organism,
consensus, alphabet, type, nsites, pval, eval, motif,
...) are populated from each discovered motif. The yamtk-specific
stats are carried as additional columns:
rank: 1-based discovery rank
width: motif width (post IC-trim)
seqs_pos: positive sequences with >=1 hit
seqs_neg: background sequences with >=1 hit
sites_pos: total target hits
sites_neg: total background hits
n_pos: number of positive sequences
n_neg: number of background sequences
pvalue: Fisher's exact one-sided p-value
qvalue: Benjamini-Hochberg q-value
Use to_list() on the returned object to recover a plain list of
universalmotif S4 objects.
Benjamin Jean-Marie Tremblay, [email protected]
Bailey TL (2021). "STREME: accurate and versatile sequence motif discovery." Bioinformatics, 37(18), 2834-2840. doi:10.1093/bioinformatics/btab203.
Tremblay BJM (2026). yamtk: Yet Another Motif ToolKit. https://github.com/bjmt/yamtk.
scan_sequences2(), enrich_motifs2(), shuffle_sequences(),
create_motif(), to_df(), to_list()
## Not run: library(universalmotif) library(Biostrings) set.seed(1) planted <- DNAString("TTGACATA") seqs <- create_sequences(seqnum = 100, seqlen = 200, rng.seed = 1) ## Plant the motif at random positions in 80% of sequences: for (i in seq_len(80)) { pos <- sample.int(200 - 8, 1) seqs[[i]][pos:(pos + 7)] <- planted } motifs <- motif_finder(seqs, qvalue = 1, rng.seed = 1) to_list(motifs)[[1]] ## End(Not run)## Not run: library(universalmotif) library(Biostrings) set.seed(1) planted <- DNAString("TTGACATA") seqs <- create_sequences(seqnum = 100, seqlen = 200, rng.seed = 1) ## Plant the motif at random positions in 80% of sequences: for (i in seq_len(80)) { pos <- sample.int(200 - 8, 1) seqs[[i]][pos:(pos + 7)] <- planted } motifs <- motif_finder(seqs, qvalue = 1, rng.seed = 1) to_list(motifs)[[1]] ## End(Not run)
motif_peaks() tests whether motif hits cluster non-uniformly along
input sequences. It implements an analytical CentriMo-style test
(Bailey & Machanick 2012): under the null that hit positions are
uniformly distributed over [1, seq.length], the count of hits
falling inside a candidate window follows a binomial distribution.
For each motif and each candidate window, a one-sided binomial
p-value is computed; the best-scoring window per motif is reported,
Bonferroni-corrected over the number of windows tested, then BH-
adjusted across motifs.
motif_peaks(hits, seq.length = NULL, mode = c("central", "local"), qvalue = 0.1, min.window = 10L, max.window = NULL, window.step = 10L, position.step = 5L, nthreads = 1)motif_peaks(hits, seq.length = NULL, mode = c("central", "local"), qvalue = 0.1, min.window = 10L, max.window = NULL, window.step = 10L, position.step = 5L, nthreads = 1)
hits |
Motif hit table from |
seq.length |
|
mode |
|
qvalue |
|
min.window, max.window
|
|
window.step |
|
position.step |
|
nthreads |
|
Two modes are supported. mode = "central" (default) tests only
windows centred on seq.length / 2, which is the appropriate
hypothesis for centred input sequences (e.g. ChIP-seq peak summits).
mode = "local" additionally varies the window centre, scanning the
whole sequence for any positionally-enriched region; this is more
permissive and uses a larger Bonferroni correction.
All sequences are assumed to be of equal length; the function does not enforce centring on a biological reference, but its results are only meaningful if such centring is in place upstream.
A data.frame with one row per motif passing the q-value
cutoff, sorted by pvalue ascending. Columns: motif,
motif.i, mode, seq.length, nhits, best.window,
best.center, hits.in, hits.out, expected.in,
enrichment, log2.enrichment, pvalue, qvalue, centers
(list-column of integer vectors of hit centres, one entry per
sequence after best-hit-per-sequence dedup; consumed by
plot_motif_peaks()).
Benjamin Jean-Marie Tremblay, [email protected]
Bailey TL, Machanick P (2012). "Inferring direct DNA binding from ChIP-seq." Nucleic Acids Research, 40(17):e128. doi:10.1093/nar/gks433.
scan_sequences2(), plot_motif_peaks()
## Not run: library(universalmotif) library(Biostrings) set.seed(1) seqs <- create_sequences(seqnum = 200, seqlen = 500, rng.seed = 1) planted <- DNAString("TTGACATA") for (i in seq_len(150)) { pos <- 246L + sample.int(8, 1) - 1L subseq(seqs[[i]], start = pos, width = 8) <- planted } m <- create_motif("TTGACATA", name = "test") hits <- scan_sequences2(m, seqs, pvalue = 1e-3, return.granges = TRUE) motif_peaks(hits) ## End(Not run)## Not run: library(universalmotif) library(Biostrings) set.seed(1) seqs <- create_sequences(seqnum = 200, seqlen = 500, rng.seed = 1) planted <- DNAString("TTGACATA") for (i in seq_len(150)) { pos <- 246L + sample.int(8, 1) - 1L subseq(seqs[[i]], start = pos, width = 8) <- planted } m <- create_motif("TTGACATA", name = "test") hits <- scan_sequences2(m, seqs, pvalue = 1e-3, return.granges = TRUE) motif_peaks(hits) ## End(Not run)
For calculating P-values and logodds scores from P-values for any number of motifs.
motif_pvalue(motifs, score, pvalue, bkg.probs, use.freq = 1, k = 8, nthreads = 1, rand.tries = 10, rng.seed = sample.int(10000, 1), allow.nonfinite = FALSE, method = c("dynamic", "exhaustive"))motif_pvalue(motifs, score, pvalue, bkg.probs, use.freq = 1, k = 8, nthreads = 1, rand.tries = 10, rng.seed = sample.int(10000, 1), allow.nonfinite = FALSE, method = c("dynamic", "exhaustive"))
motifs |
See |
score |
|
pvalue |
|
bkg.probs |
|
use.freq |
|
k |
|
nthreads |
|
rand.tries |
|
rng.seed |
|
allow.nonfinite |
|
method |
|
A note regarding vectorizing the calculation when method = "dynamic" (no
vectorization is possible with method = "exhaustive"): to avoid performing
the P-value/score calculation repeatedly for individual motifs, provide the
score/pvalue arguments as a list, with each entry corresponding to the
scores/P-values to be calculated for the respective motifs provided to
motifs. If you simply provide a list of repeating motifs and a single
numeric vector of corresponding input scores/P-values, then motif_pvalue()
will not vectorize. See the Examples section.
One of the algorithms available to motif_pvalue() to calculate scores or
P-values is the dynamic programming algorithm used by FIMO (Grant et al., 2011).
In this method, a small range of possible scores from the possible miminum and maximum
is created and the cumulative probability of each score in this distribution is
incrementally
calculated using the logodds scores and the background probabilities. This
distribution of scores and associated P-values can be used to calculate P-values
or scores for any input, any number of times. This method scales well with large
motifs, and multifreq representations. The only downside is that it is
incompatible with allow.nonfinite = TRUE, as this would not allow for the
creation of the initial range of scores. Although described for a different
purpose, the basic premise of the dynamic programming algorithm is also
described in Gupta et al. (2007).
Calculating P-values exhaustively for motifs can be very computationally intensive. This is due to how P-values must be calculated: for a given score, all possible sequences which score equal or higher must be found, and the probability for each of these sequences (based on background probabilities) summed. For a DNA motif of length 10, the number of possible unique sequences is 4^10 = 1,048,576. Finding all possible sequences higher than a given score can be done very efficiently and quickly with a branch-and-bound algorithm, but as the motif length increases even this calculation becomes impractical. To get around this, the P-value calculation can be approximated.
In order to calculate P-values for longer motifs, this function uses the
approximation proposed by Hartmann et al. (2013), where
the motif is subset, P-values calculated for the subsets, and finally
combined for a total P-value. The smaller the size of the subsets, the
faster the calculation; but also, the bigger the approximation. This can be
controlled by setting k. In fact, for smaller motifs (< 13 positions)
calculating exact P-values can be done individually in reasonable time by
setting k = 12.
To calculate a score from a P-value, all possible scores are calculated
and the (1 - pvalue) * 100 nth percentile score returned.
When k < ncol(motif), the complete set of scores is instead approximated
by randomly adding up all possible scores from each subset.
Note that this approximation
can actually be potentially quite expensive at times and even slower than
the exact version; for jobs requiring lots of repeat calculations, a bit of
benchmarking beforehand can be useful to find the optimal settings.
Please note that bugs are more likely to occur when using the exhaustive
method, as the algorithm contains several times more code compared to the
dynamic method. Unless you have a strong need to use allow.nonfinite = TRUE
then avoid using this method.
numeric, list A vector or list of vectors of scores/P-values.
Benjamin Jean-Marie Tremblay, [email protected]
Grant CE, Bailey TL, Noble WS (2011). "FIMO: scanning for occurrences of a given motif." Bioinformatics, 27, 1017-1018.
Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS (2007). "Quantifying similarity between motifs." Genome Biology, 8, R24.
Hartmann H, Guthohrlein EW, Siebert M, Soding SLJ (2013). “P-value-based regulatory motif discovery using positional weight matrices.” Genome Research, 23, 181-194.
get_matches(), get_scores(), motif_range(), motif_score(),
prob_match(), prob_match_bkg(), score_match()
if (R.Version()$arch != "i386") { ## P-value/score calculations are performed using the PWM version of the ## motif data(examplemotif) ## Get a minimum score based on a P-value motif_pvalue(examplemotif, pvalue = 0.001) ## Get the probability of a particular sequence hit motif_pvalue(examplemotif, score = 0) ## The calculations can be performed for multiple motifs motif_pvalue(c(examplemotif, examplemotif), pvalue = c(0.001, 0.0001)) ## Compare score thresholds and P-value: scores <- motif_score(examplemotif, c(0.6, 0.7, 0.8, 0.9)) motif_pvalue(examplemotif, scores) ## Calculate the probability of getting a certain match or better: TATATAT <- score_match(examplemotif, "TATATAT") TATATAG <- score_match(examplemotif, "TATATAG") motif_pvalue(examplemotif, TATATAT) motif_pvalue(examplemotif, TATATAG) ## Get all possible matches by P-value: get_matches(examplemotif, motif_pvalue(examplemotif, pvalue = 0.0001)) ## Vectorize the calculation for multiple motifs and scores/P-values: m <- create_motif() motif_pvalue(c(examplemotif, m), list(1:5, 2:3)) ## The non-vectorized equivalent: motif_pvalue( c(rep(list(examplemotif), 5), rep(list(m), 2)), c(1:5, 2:3) ) }if (R.Version()$arch != "i386") { ## P-value/score calculations are performed using the PWM version of the ## motif data(examplemotif) ## Get a minimum score based on a P-value motif_pvalue(examplemotif, pvalue = 0.001) ## Get the probability of a particular sequence hit motif_pvalue(examplemotif, score = 0) ## The calculations can be performed for multiple motifs motif_pvalue(c(examplemotif, examplemotif), pvalue = c(0.001, 0.0001)) ## Compare score thresholds and P-value: scores <- motif_score(examplemotif, c(0.6, 0.7, 0.8, 0.9)) motif_pvalue(examplemotif, scores) ## Calculate the probability of getting a certain match or better: TATATAT <- score_match(examplemotif, "TATATAT") TATATAG <- score_match(examplemotif, "TATATAG") motif_pvalue(examplemotif, TATATAT) motif_pvalue(examplemotif, TATATAG) ## Get all possible matches by P-value: get_matches(examplemotif, motif_pvalue(examplemotif, pvalue = 0.0001)) ## Vectorize the calculation for multiple motifs and scores/P-values: m <- create_motif() motif_pvalue(c(examplemotif, m), list(1:5, 2:3)) ## The non-vectorized equivalent: motif_pvalue( c(rep(list(examplemotif), 5), rep(list(m), 2)), c(1:5, 2:3) ) }
For any motif, change the motif slot to it's reverse complement. If the
multifreq slot is filled, then it is also applied. No other slots are
affected.
motif_rc(motifs, ignore.alphabet = FALSE)motif_rc(motifs, ignore.alphabet = FALSE)
motifs |
See |
ignore.alphabet |
|
See convert_motifs() for available output formats.
Benjamin Jean-Marie Tremblay, [email protected]
jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) jaspar.rc <- motif_rc(jaspar)jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) jaspar.rc <- motif_rc(jaspar)
For more powerful motif tree functions, see the motifStack package.
The motif_tree() function compares motifs with compare_motifs() to create
a distance matrix, which is used to generate a phylogeny.
This can be plotted with ggtree::ggtree(). The purpose of this function
is simply to combine the compare_motifs() and ggtree::ggtree() steps
into one. For more control over tree creation, it is recommend to do these
steps separately. See the "Motif comparisons and P-values" vignette for such
a workthrough. This function requires the ape and ggtree packages
to be installed separately. For a leaner alternative using compare_motifs2(),
see motif_tree2().
motif_tree(motifs, layout = "circular", linecol = "family", labels = "none", tipsize = "none", legend = TRUE, branch.length = "none", db.scores, method = "EUCL", use.type = "PPM", min.overlap = 6, min.position.ic = 0, tryRC = TRUE, min.mean.ic = 0, relative_entropy = FALSE, progress = FALSE, nthreads = 1, score.strat = "a.mean", ...)motif_tree(motifs, layout = "circular", linecol = "family", labels = "none", tipsize = "none", legend = TRUE, branch.length = "none", db.scores, method = "EUCL", use.type = "PPM", min.overlap = 6, min.position.ic = 0, tryRC = TRUE, min.mean.ic = 0, relative_entropy = FALSE, progress = FALSE, nthreads = 1, score.strat = "a.mean", ...)
motifs |
|
layout |
|
linecol |
|
labels |
|
tipsize |
|
legend |
|
branch.length |
|
db.scores |
|
method |
|
use.type |
|
min.overlap |
|
min.position.ic |
|
tryRC |
|
min.mean.ic |
|
relative_entropy |
|
progress |
|
nthreads |
|
score.strat |
|
... |
ggtree params. See |
See compare_motifs() for more info on comparison parameters.
ggplot object.
Benjamin Jean-Marie Tremblay, [email protected]
Wickham H (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-0-387-98140-6, <URL: http://ggplot2.org>.
Yu G, Smith D, Zhu H, Guan Y, Lam TT (2017). “ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data.” Methods in Ecology and Evolution, 8, 28-36. doi: 10.1111/2041-210X.12628.
motifStack::motifStack(), compare_motifs(),
ggtree::ggtree(), ggplot2::ggplot(), motif_tree2()
jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) if (requireNamespace("ggtree", quietly = TRUE)) { jaspar.tree <- motif_tree(jaspar, linecol = "none", labels = "name", layout = "rectangular") } ## Not run: ## When inputting a dist object, the linecol and tipsize options are ## not available. To add these manually: library(MotifDb) library(ggtree) library(ggplot2) motifs <- filter_motifs(MotifDb, organism = "Athaliana")[1:50] comparison <- compare_motifs(motifs, method = "PCC", score.strat = "a.mean") comparison <- as.dist(1 - comparison) mot.names <- attr(comparison, "Labels") tree <- motif_tree(comparison) annotations <- data.frame(label = mot.names, icscore = sapply(motifs, function(x) x["icscore"]), family = sapply(motifs, function(x) x["family"])) tree <- tree %<+% annotations + geom_tippoint(aes(size = icscore)) + aes(colour = family) + theme(legend.position = "right", legend.title = element_blank()) ## End(Not run)jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) if (requireNamespace("ggtree", quietly = TRUE)) { jaspar.tree <- motif_tree(jaspar, linecol = "none", labels = "name", layout = "rectangular") } ## Not run: ## When inputting a dist object, the linecol and tipsize options are ## not available. To add these manually: library(MotifDb) library(ggtree) library(ggplot2) motifs <- filter_motifs(MotifDb, organism = "Athaliana")[1:50] comparison <- compare_motifs(motifs, method = "PCC", score.strat = "a.mean") comparison <- as.dist(1 - comparison) mot.names <- attr(comparison, "Labels") tree <- motif_tree(comparison) annotations <- data.frame(label = mot.names, icscore = sapply(motifs, function(x) x["icscore"]), family = sapply(motifs, function(x) x["family"])) tree <- tree %<+% annotations + geom_tippoint(aes(size = icscore)) + aes(colour = family) + theme(legend.position = "right", legend.title = element_blank()) ## End(Not run)
compare_motifs2().motif_tree2() is the leaner counterpart of motif_tree(): it builds the
distance matrix via compare_motifs2() (mean Pearson correlation,
with built-in p-value / q-value machinery) instead of
compare_motifs(). The distance for hclust() is derived from the
mean Pearson correlation matrix as (1 - score) / 2, mapping
similarity in [-1, 1] to a symmetric distance in [0, 1]. The
tree-construction and rendering steps (hclust() followed by
ape::as.phylo() followed by ggtree::ggtree()) match
motif_tree() exactly, so the visual conventions, the linecol /
labels / tipsize slot-mapping arguments, and the dist-input
short-circuit all behave the same.
motif_tree2(motifs, layout = "circular", linecol = "family", labels = "none", tipsize = "none", legend = TRUE, branch.length = "none", min.overlap = 6, tryRC = TRUE, nthreads = 1, progress = FALSE, ...)motif_tree2(motifs, layout = "circular", linecol = "family", labels = "none", tipsize = "none", legend = TRUE, branch.length = "none", min.overlap = 6, tryRC = TRUE, nthreads = 1, progress = FALSE, ...)
motifs |
|
layout |
|
linecol |
|
labels |
|
tipsize |
|
legend |
|
branch.length |
|
min.overlap |
|
tryRC |
|
nthreads |
|
progress |
|
... |
Additional arguments passed through to
|
The PCC-to-distance conversion d = (1 - score) / 2 is symmetric
(the comparison matrix from compare_motifs2() is symmetric in
matrix mode), so as.dist() and hclust() accept it directly.
Anti-correlated motif pairs land at the maximum distance of 1;
identical motifs land at 0.
For more control over tree construction, run compare_motifs2()
directly with matrix.out = "score", convert to a dist object
yourself, and pass that dist to motif_tree2(). The function
detects the dist input and skips the comparison step.
A ggplot (ggtree) object.
Benjamin Jean-Marie Tremblay, [email protected]
motif_tree(), compare_motifs2(), merge_similar2(),
ggtree::ggtree()
## Not run: library(universalmotif) jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) if (requireNamespace("ggtree", quietly = TRUE)) { motif_tree2(jaspar, linecol = "none", labels = "name", layout = "rectangular") } ## Equivalent two-step form, useful when you want to inspect the ## distance matrix before drawing the tree: if (requireNamespace("ggtree", quietly = TRUE)) { score.mat <- compare_motifs2(jaspar, matrix.out = "score") d <- as.dist((1 - score.mat) / 2) motif_tree2(d, labels = "name") } ## End(Not run)## Not run: library(universalmotif) jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) if (requireNamespace("ggtree", quietly = TRUE)) { motif_tree2(jaspar, linecol = "none", labels = "name", layout = "rectangular") } ## Equivalent two-step form, useful when you want to inspect the ## distance matrix before drawing the tree: if (requireNamespace("ggtree", quietly = TRUE)) { score.mat <- compare_motifs2(jaspar, matrix.out = "score") d <- as.dist((1 - score.mat) / 2) motif_tree2(d, labels = "name") } ## End(Not run)
Companion to match_bkg(). Overlays density curves for the target
and matched-background sets in a faceted ggplot, useful for
visually verifying that the binned matching landed where it was
supposed to. Called with two XStringSets it plots GC fraction and
sequence length; called with the indices data.frame from
match_bkg(..., return.indices = TRUE) it plots every matched axis,
including any covariates.
plot_match_bkg(sequences = NULL, bkg = NULL, by = NULL, bins = 30L, indices = NULL)plot_match_bkg(sequences = NULL, bkg = NULL, by = NULL, bins = 30L, indices = NULL)
sequences |
|
bkg |
|
by |
|
bins |
|
indices |
|
A ggplot object.
Benjamin Jean-Marie Tremblay, [email protected]
Companion to motif_peaks(). Returns a ggplot faceted by motif,
showing a histogram of best-hit centre positions and a shaded
rectangle marking the most-enriched window.
plot_motif_peaks(peaks, motifs = NULL, ncol = 2L, bins = 50L, fill.window = "#ffe4a3")plot_motif_peaks(peaks, motifs = NULL, ncol = 2L, bins = 50L, fill.window = "#ffe4a3")
peaks |
|
motifs |
|
ncol |
|
bins |
|
fill.window |
|
A ggplot object.
Benjamin Jean-Marie Tremblay, [email protected]
Import CIS-BP formatted motifs. See http://cisbp.ccbr.utoronto.ca/index.php. Assumed to be DNA motifs.
read_cisbp(file, skip = 0)read_cisbp(file, skip = 0)
file |
|
skip |
|
CIS-BP motifs can be formatted with or without additional header metadata. Motifs without any header start at instances of the word "Pos", whereas motifs with a header start at instances of the word "TF".
list universalmotif objects.
Benjamin Jean-Marie Tremblay, [email protected]
Weirauch MT, Yang A, Albu M, Cote AG, Montenegro-Montero A, Drewe P, Najafabadi HS, Lambert SA, Mann I, Cook K, Zheng H, Goity A, van Bakel H, Lozano JC, Galli M, Lewsey MG, Huang E, Mukherjee T, Chen X, Reece-Hoyes JS, Govindarajan S, Shaulsky G, Walhout AJ, Bouget FY, Ratsch G, Larrondo LF, Ecker JR, Hughes TR (2014). “Determination and inference of eukaryotic transcription factor sequence specificity.” Cell, 158, 1431-1443.
Other read_motifs:
read_homer(),
read_jaspar(),
read_matrix(),
read_meme(),
read_motifs(),
read_transfac(),
read_uniprobe()
cisbp <- read_cisbp(system.file("extdata", "cisbp.txt", package = "universalmotif"))cisbp <- read_cisbp(system.file("extdata", "cisbp.txt", package = "universalmotif"))
Import HOMER formatted motifs. See http://homer.ucsd.edu/homer/motif/.
Assumed to be DNA motifs. Note that HOMER motifs come with a pre-determined
logodds threshold; if you wish to re-create HOMER's motif scanning, then use
it in scan_sequences() (see examples).
read_homer(file, skip = 0)read_homer(file, skip = 0)
file |
|
skip |
|
list universalmotif objects.
Benjamin Jean-Marie Tremblay, [email protected]
Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK (2010). “Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities.” Molecular Cell, 38, 576-589.
Other read_motifs:
read_cisbp(),
read_jaspar(),
read_matrix(),
read_meme(),
read_motifs(),
read_transfac(),
read_uniprobe()
data(ArabidopsisPromoters) homer <- read_homer(system.file("extdata", "homer.txt", package = "universalmotif")) thresholds <- homer |> to_df() |> with(logodds.threshold) |> as.numeric() scan_sequences(homer, ArabidopsisPromoters, threshold = thresholds, threshold.type = "logodds.abs")data(ArabidopsisPromoters) homer <- read_homer(system.file("extdata", "homer.txt", package = "universalmotif")) thresholds <- homer |> to_df() |> with(logodds.threshold) |> as.numeric() scan_sequences(homer, ArabidopsisPromoters, threshold = thresholds, threshold.type = "logodds.abs")
Import JASPAR formatted motifs. See http://jaspar.genereg.net/. Can be either DNA, RNA, or AA motifs.
read_jaspar(file, skip = 0)read_jaspar(file, skip = 0)
file |
|
skip |
|
list universalmotif objects.
Benjamin Jean-Marie Tremblay, [email protected]
Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, Bessy A, Cheneby J, Kulkarni SR, Tan G, Baranasic D, Arenillas DJ, Sandelin A, Vandepoele K, Lenhard B, Ballester B, Wasserman WW, Parcy F, Mathelier A (2018). “JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework.” Nucleic Acids Research, 46, D260-D266.
Other read_motifs:
read_cisbp(),
read_homer(),
read_matrix(),
read_meme(),
read_motifs(),
read_transfac(),
read_uniprobe()
jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif"))jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif"))
Import simply formatted motifs.
read_matrix(file, skip = 0, type, positions = "columns", alphabet = "DNA", sep = "", headers = TRUE, rownames = FALSE, comment = NULL)read_matrix(file, skip = 0, type, positions = "columns", alphabet = "DNA", sep = "", headers = TRUE, rownames = FALSE, comment = NULL)
file |
|
skip |
|
type |
|
positions |
|
alphabet |
|
sep |
|
headers |
|
rownames |
|
comment |
|
list universalmotif objects.
Benjamin Jean-Marie Tremblay, [email protected]
Other read_motifs:
read_cisbp(),
read_homer(),
read_jaspar(),
read_meme(),
read_motifs(),
read_transfac(),
read_uniprobe()
hocomoco <- system.file("extdata", "hocomoco.txt", package = "universalmotif") hocomoco <- read_matrix(hocomoco, headers = ">", positions = "rows") ## Reading a TF-MoDISco-style CWM dump: signed real values, no ## column-sum constraint. Opt in via type = "CWM". ## Not run: cwms <- read_matrix("modisco_cwms.txt", headers = ">", positions = "rows", type = "CWM") ## End(Not run)hocomoco <- system.file("extdata", "hocomoco.txt", package = "universalmotif") hocomoco <- read_matrix(hocomoco, headers = ">", positions = "rows") ## Reading a TF-MoDISco-style CWM dump: signed real values, no ## column-sum constraint. Opt in via type = "CWM". ## Not run: cwms <- read_matrix("modisco_cwms.txt", headers = ">", positions = "rows", type = "CWM") ## End(Not run)
Import MEME formatted motifs, as well as original motif sequences. See
http://meme-suite.org/doc/meme-format.html. Both 'full' and 'minimal'
formats are supported. DREME and STREME motifs can also be imported, but note
that readsites and readsites.meta arguments do nothing.
read_meme(file, skip = 0, readsites = FALSE, readsites.meta = FALSE, readsites.meta.tidy = FALSE, CWM = FALSE)read_meme(file, skip = 0, readsites = FALSE, readsites.meta = FALSE, readsites.meta.tidy = FALSE, CWM = FALSE)
file |
|
skip |
|
readsites |
|
readsites.meta |
|
readsites.meta.tidy |
|
CWM |
|
Please note that the typical number precision limit in R is around 1e-308.
This means that motif P-values in MEME files below this limit are rounded
automatically to 0. To get around this, the E-value is also stored as a
string in the extrainfo slot. If you require a numeric value for analysis,
use the log_string_pval() function to get the log of the string-formatted
p-value.
list universalmotif objects. If readsites = TRUE, a list
comprising of a sub-list of motif objects and a sub-list of
motif sites will be returned. If readsites.meta = TRUE, then two
additional list items will be present, one containing site positions
and P-values, and another containing combined sequence p-values. If
readsites.meta.tidy = TRUE, an additional list entry named
sites.meta.tidy will be added containing a data.frame.
Benjamin Jean-Marie Tremblay, [email protected]
Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS (2009). “MEME SUITE: tools for motif discovery and searching.” Nucleic Acids Research, 37, W202-W208.
Other read_motifs:
read_cisbp(),
read_homer(),
read_jaspar(),
read_matrix(),
read_motifs(),
read_transfac(),
read_uniprobe()
meme.minimal <- read_meme(system.file("extdata", "meme_minimal.txt", package = "universalmotif")) meme.full <- read_meme(system.file("extdata", "meme_full.txt", package = "universalmotif")) ## Get numeric p-value: log_string_pval(meme.minimal[[1]]["extrainfo"]["eval.string"]) ## Read a TF-MoDISco-style MEME file as CWMs (no normalisation, ## values may be negative, type tag = "CWM"): ## Not run: modisco <- read_meme("modisco_cwms.meme", CWM = TRUE) ## End(Not run)meme.minimal <- read_meme(system.file("extdata", "meme_minimal.txt", package = "universalmotif")) meme.full <- read_meme(system.file("extdata", "meme_full.txt", package = "universalmotif")) ## Get numeric p-value: log_string_pval(meme.minimal[[1]]["extrainfo"]["eval.string"]) ## Read a TF-MoDISco-style MEME file as CWMs (no normalisation, ## values may be negative, type tag = "CWM"): ## Not run: modisco <- read_meme("modisco_cwms.meme", CWM = TRUE) ## End(Not run)
Import motifs created from write_motifs(). For optimal storage of
universalmotif class motifs, consider using saveRDS() and
readRDS(). Currently the universalmotif format is YAML-based, but
this is subject to change.
read_motifs(file, skip = 0, progress = FALSE, BP = FALSE)read_motifs(file, skip = 0, progress = FALSE, BP = FALSE)
file |
|
skip |
|
progress |
|
BP |
|
list universalmotif objects.
Benjamin Jean-Marie Tremblay, [email protected]
Other read_motifs:
read_cisbp(),
read_homer(),
read_jaspar(),
read_matrix(),
read_meme(),
read_transfac(),
read_uniprobe()
Import TRANSFAC formatted motifs. Assumed to be DNA motifs, type PCM.
See system.file("extdata", "transfac.txt", pacakge="universalmotif")
for an example motif.
read_transfac(file, skip = 0)read_transfac(file, skip = 0)
file |
|
skip |
|
A few TRANSFAC tags are recognized, including AC, ID, NA, HC and OS. HC will be set to the family slot and OS to the organism slot. If AC, ID and NA are present, then AC will be set as the motif name and NA as the alternate name. If AC is absent, then ID is set as the name. If ID is also absent, then NA is set as the motif name.
list universalmotif objects.
Benjamin Jean-Marie Tremblay, [email protected]
Wingender E, Dietze P, Karas H, Knuppel R (1996). “TRANSFAC: A Database on Transcription Factors and Their DNA Binding Sites.” Nucleic Acids Research, 24, 238-241.
Other read_motifs:
read_cisbp(),
read_homer(),
read_jaspar(),
read_matrix(),
read_meme(),
read_motifs(),
read_uniprobe()
transfac <- read_transfac(system.file("extdata", "transfac.txt", package = "universalmotif"))transfac <- read_transfac(system.file("extdata", "transfac.txt", package = "universalmotif"))
Import UNIPROBE formatted motifs. Assumed DNA.
read_uniprobe(file, skip = 0)read_uniprobe(file, skip = 0)
file |
|
skip |
|
list universalmotif objects.
Benjamin Jean-Marie Tremblay, [email protected]
Hume MA, Barrera LA, Gisselbrecht SS, Bulyk ML (2015). “UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein-DNA interactions.” Nucleic Acids Research, 43, D117-D122.
Other read_motifs:
read_cisbp(),
read_homer(),
read_jaspar(),
read_matrix(),
read_meme(),
read_motifs(),
read_transfac()
uniprobe.minimal <- read_uniprobe(system.file("extdata", "uniprobe_minimal.txt", package = "universalmotif")) uniprobe.full <- read_uniprobe(system.file("extdata", "uniprobe_full.txt", package = "universalmotif"))uniprobe.minimal <- read_uniprobe(system.file("extdata", "uniprobe_minimal.txt", package = "universalmotif")) uniprobe.full <- read_uniprobe(system.file("extdata", "uniprobe_full.txt", package = "universalmotif"))
De novo motif discovery via MEME. For a detailed description of the command,
see http://meme-suite.org/doc/meme.html. For a brief description of
the command parameters, call run_meme() without any arguments. Parameters in
run_meme() which are directly taken from the MEME program are tagged with
[MEME]. This function requires that the processx package be installed
separately.
run_meme(target.sequences, output = NULL, overwrite.dir = FALSE, control.sequences = NULL, weights = NULL, text = FALSE, brief = 1000, objfun = "classic", test = NULL, use_llr = FALSE, shuf = 2, hsfrac = NULL, cefrac = NULL, searchsize = NULL, norand = FALSE, csites = 1000, seed = 0, alph = NULL, revcomp = FALSE, pal = FALSE, mod = "zoops", nmotifs = 3, evt = NULL, nsites = NULL, minsites = NULL, maxsites = NULL, wnsites = 0.8, w = NULL, minw = 8, maxw = 50, allw = NULL, nomatrim = FALSE, wg = 11, ws = 1, noendgaps = FALSE, bfile = NULL, markov_order = 0, psp = NULL, maxiter = 50, distance = 0.001, prior = NULL, b = NULL, plib = NULL, spfuzz = NULL, spmap = NULL, cons = NULL, p = NULL, maxsize = NULL, maxtime = NULL, wd = getwd(), logfile = paste0(wd, "/memerun.log"), readsites = TRUE, echo = FALSE, verbose = 1, timeout = Inf, bin = getOption("meme.bin"))run_meme(target.sequences, output = NULL, overwrite.dir = FALSE, control.sequences = NULL, weights = NULL, text = FALSE, brief = 1000, objfun = "classic", test = NULL, use_llr = FALSE, shuf = 2, hsfrac = NULL, cefrac = NULL, searchsize = NULL, norand = FALSE, csites = 1000, seed = 0, alph = NULL, revcomp = FALSE, pal = FALSE, mod = "zoops", nmotifs = 3, evt = NULL, nsites = NULL, minsites = NULL, maxsites = NULL, wnsites = 0.8, w = NULL, minw = 8, maxw = 50, allw = NULL, nomatrim = FALSE, wg = 11, ws = 1, noendgaps = FALSE, bfile = NULL, markov_order = 0, psp = NULL, maxiter = 50, distance = 0.001, prior = NULL, b = NULL, plib = NULL, spfuzz = NULL, spmap = NULL, cons = NULL, p = NULL, maxsize = NULL, maxtime = NULL, wd = getwd(), logfile = paste0(wd, "/memerun.log"), readsites = TRUE, echo = FALSE, verbose = 1, timeout = Inf, bin = getOption("meme.bin"))
target.sequences |
|
output |
|
overwrite.dir |
|
control.sequences |
|
weights |
|
text |
|
brief |
|
objfun |
|
test |
|
use_llr |
|
shuf |
|
hsfrac |
|
cefrac |
|
searchsize |
|
norand |
|
csites |
|
seed |
|
alph |
|
revcomp |
|
pal |
|
mod |
|
nmotifs |
|
evt |
|
nsites |
|
minsites |
|
maxsites |
|
wnsites |
|
w |
|
minw |
|
maxw |
|
allw |
|
nomatrim |
|
wg |
|
ws |
|
noendgaps |
|
bfile |
|
markov_order |
|
psp |
|
maxiter |
|
distance |
|
prior |
|
b |
|
plib |
|
spfuzz |
|
spmap |
|
cons |
|
p |
|
maxsize |
|
maxtime |
|
wd |
|
logfile |
|
readsites |
|
echo |
|
verbose |
|
timeout |
|
bin |
|
list The output file is read with read_meme().
Benjamin Jean-Marie Tremblay, [email protected]
Bailey TL, Elkan C (1994). “Fitting a mixture model by expectation maximization to discover motifs in biopolymers.” Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 2, 28-36.
read_meme(), create_sequences(), shuffle_sequences(),
processx::run()
## Not run: ## To check that you are properly linking to the binary: run_meme() ## End(Not run)## Not run: ## To check that you are properly linking to the binary: run_meme() ## End(Not run)
Given probabilities for a sequence as represented by a motif, generate random sequences with the same length as the motif.
sample_sites(motif, n = 100, use.freq = 1)sample_sites(motif, n = 100, use.freq = 1)
motif |
See |
n |
|
use.freq |
|
XStringSet object.
Benjamin Jean-Marie Tremblay, [email protected]
create_sequences(), create_motif(), add_multifreq(),
implant_motifs()
motif <- create_motif() sites <- sample_sites(motif)motif <- create_motif() sites <- sample_sites(motif)
For sequences of any alphabet, scan them using the PWM matrices of
a set of input motifs. Note: for regular DNA/RNA scanning, see
scan_sequences2() for a faster and simpler alternative.
scan_sequences(motifs, sequences, threshold = 1e-04, threshold.type = c("pvalue", "qvalue", "logodds", "logodds.abs"), RC = FALSE, use.freq = 1, verbose = 0, nthreads = 1, motif_pvalue.k = 8, use.gaps = TRUE, allow.nonfinite = FALSE, warn.NA = TRUE, calc.pvals = TRUE, return.granges = FALSE, no.overlaps = FALSE, no.overlaps.by.strand = FALSE, no.overlaps.strat = c("score", "order"), respect.strand = FALSE, motif_pvalue.method = c("dynamic", "exhaustive"), calc.qvals = calc.pvals, calc.qvals.method = c("fdr", "BH", "bonferroni"))scan_sequences(motifs, sequences, threshold = 1e-04, threshold.type = c("pvalue", "qvalue", "logodds", "logodds.abs"), RC = FALSE, use.freq = 1, verbose = 0, nthreads = 1, motif_pvalue.k = 8, use.gaps = TRUE, allow.nonfinite = FALSE, warn.NA = TRUE, calc.pvals = TRUE, return.granges = FALSE, no.overlaps = FALSE, no.overlaps.by.strand = FALSE, no.overlaps.strat = c("score", "order"), respect.strand = FALSE, motif_pvalue.method = c("dynamic", "exhaustive"), calc.qvals = calc.pvals, calc.qvals.method = c("fdr", "BH", "bonferroni"))
motifs |
See |
sequences |
|
threshold |
|
threshold.type |
|
RC |
|
use.freq |
|
verbose |
|
nthreads |
|
motif_pvalue.k |
|
use.gaps |
|
allow.nonfinite |
|
warn.NA |
|
calc.pvals |
|
return.granges |
|
no.overlaps |
|
no.overlaps.by.strand |
|
no.overlaps.strat |
|
respect.strand |
|
motif_pvalue.method |
|
calc.qvals |
|
calc.qvals.method |
|
Similar to Biostrings::matchPWM(), the scanning method uses
logodds scoring. (To see the scoring matrix for any motif, simply
run convert_type(motif, "PWM"). For a multifreq scoring
matrix: apply(motif["multifreq"][["2"]], 2, ppm_to_pwm)). In order
to score a sequence, at each position within a sequence of length equal
to the length of the motif, the scores for each base are summed. If the
score sum is above the desired threshold, it is kept.
If threshold.type = 'logodds', then the threshold value is multiplied
by the maximum possible motif scores. To calculate the
maximum possible scores a motif (of type PWM) manually, run
motif_score(motif, 1). If threshold.type = 'pvalue',
then threshold logodds scores are generated using motif_pvalue().
Finally, if threshold.type = 'logodds.abs', then the exact values
provided will be used as thresholds. Finally, if threshold.type = 'qvalue',
then the threshold is calculated as if threshold.type = 'pvalue' and the
final set of hits are filtered based on their calculated Q-value. (Note:
this means that the thresh.score column will be incorrect!) This is done
since most Q-values cannot be calculated prior to scanning. If you are
running a very large job, it may be wise to use a P-value threshold
followed by manually filtering by Q-value; this will avoid the scanning
have to parse the larger number of hits from the internally-lowered threshold.
Non-standard letters (such as "N", "+", "-", ".", etc in DNAString
objects) will be safely ignored, resulting only in a warning and a very
minor performance cost. This can used to scan
masked sequences. See Biostrings::mask()
for masking sequences
(generating MaskedXString objects), and Biostrings::injectHardMask()
to recover masked XStringSet objects for use with scan_sequences().
There is also a provided wrapper function which performs both steps: mask_seqs().
DataFrame, GRanges with each row representing one hit. If the input
sequences are DNAStringSet or RNAStringSet,
then an additional column with the strand is included. Function args are
stored in the metadata slot. If return.granges = TRUE
then a GRanges object is returned.
Benjamin Jean-Marie Tremblay, [email protected]
Grant CE, Bailey TL, Noble WS (2011). "FIMO: scanning for occurrences of a given motif." Bioinformatics, 27, 1017-1018.
Hartmann H, Guthohrlein EW, Siebert M, Soding SLJ (2013). “P-value-based regulatory motif discovery using positional weight matrices.” Genome Research, 23, 181-194.
Noble WS (2009). "How does multiple testing work?" Nature Biotechnology, 27, 1135-1137.
add_multifreq(), Biostrings::matchPWM(),
enrich_motifs(), motif_pvalue(), scan_sequences2()
## any alphabet can be used ## Not run: set.seed(1) alphabet <- paste(c(letters), collapse = "") motif <- create_motif("hello", alphabet = alphabet) sequences <- create_sequences(alphabet, seqnum = 1000, seqlen = 100000) scan_sequences(motif, sequences) ## End(Not run) ## Sequence masking: if (R.Version()$arch != "i386") { library(Biostrings) data(ArabidopsisMotif) data(ArabidopsisPromoters) seq <- mask_seqs(ArabidopsisPromoters, "AAAAA") scan_sequences(ArabidopsisMotif, seq) # A warning regarding the presence of non-standard letters will be given, # but can be safely ignored in this case. }## any alphabet can be used ## Not run: set.seed(1) alphabet <- paste(c(letters), collapse = "") motif <- create_motif("hello", alphabet = alphabet) sequences <- create_sequences(alphabet, seqnum = 1000, seqlen = 100000) scan_sequences(motif, sequences) ## End(Not run) ## Sequence masking: if (R.Version()$arch != "i386") { library(Biostrings) data(ArabidopsisMotif) data(ArabidopsisPromoters) seq <- mask_seqs(ArabidopsisPromoters, "AAAAA") scan_sequences(ArabidopsisMotif, seq) # A warning regarding the presence of non-standard letters will be given, # but can be safely ignored in this case. }
yamtk scan defaults.scan_sequences2() is a deliberately pared-down counterpart to
scan_sequences(), with a default surface that mirrors the command-line
tool yamtk. It exposes a single threshold
(a P-value), always scans both strands by default, always computes a
per-hit P-value, and returns either a GRanges (preferred, when
GenomicRanges is installed) or a data.frame. Use scan_sequences() when
you need any of multifreq scoring, gapped motifs, q-values, exhaustive
P-values, respect.strand, allow.nonfinite, or threshold.types other
than a P-value.
scan_sequences2(motifs, sequences, pvalue = 1e-04, RC = TRUE, nthreads = 1, return.granges = NULL, no.overlaps = FALSE, no.overlaps.by = c("pvalue", "score"), no.overlaps.by.strand = FALSE, no.overlaps.by.motif = FALSE)scan_sequences2(motifs, sequences, pvalue = 1e-04, RC = TRUE, nthreads = 1, return.granges = NULL, no.overlaps = FALSE, no.overlaps.by = c("pvalue", "score"), no.overlaps.by.strand = FALSE, no.overlaps.by.motif = FALSE)
motifs |
See |
sequences |
|
pvalue |
|
RC |
|
nthreads |
|
return.granges |
|
no.overlaps |
|
no.overlaps.by |
|
no.overlaps.by.strand |
|
no.overlaps.by.motif |
|
The scanner re-uses the very same C++ core as scan_sequences(); the
speed advantage simply comes from skipping the R-side bookkeeping for the
features this function does not support.
P-values are computed via the dynamic-programming algorithm of
motif_pvalue() (FIMO-style; Grant et al. 2011), which is also what
yamtk scan uses.
Either a GRanges or a data.frame.
Common fields (both shapes):
motif, motif.i, sequence.i, score, score.pct, match, pvalue.
Coordinate fields:
GRanges: seqnames (sequence name), start, end, strand.
data.frame: sequence, start, end, strand.
Coordinates are always 1-based, inclusive, with start <= end regardless of
strand. For hits on the - strand the match column is the reverse
complement of the sequence substring; i.e. the sequence as matched against
the motif.
Benjamin Jean-Marie Tremblay, [email protected]
Grant CE, Bailey TL, Noble WS (2011). "FIMO: scanning for occurrences of a given motif." Bioinformatics, 27(7), 1017-1018.
Tremblay BJM (2026). yamtk: Yet Another Motif ToolKit. https://github.com/bjmt/yamtk.
scan_sequences(), motif_pvalue(), convert_motifs()
library(universalmotif) motifs <- create_motif(c("TATAAA", "CACGTG")) seqs <- create_sequences(seqnum = 5, seqlen = 200, rng.seed = 1) hits <- scan_sequences2(motifs, seqs, pvalue = 1e-3, return.granges = FALSE) head(hits)library(universalmotif) motifs <- create_motif(c("TATAAA", "CACGTG")) seqs <- create_sequences(seqnum = 5, seqlen = 200, rng.seed = 1) hits <- scan_sequences2(motifs, seqs, pvalue = 1e-3, return.granges = FALSE) head(hits)
Calculate sequence complexity using either the Wootton-Federhen, Trifonov, or DUST algorithms.
sequence_complexity(seqs, window.size = 20, window.overlap = round(window.size/2), method = c("WoottonFederhen", "WoottonFederhenFast", "Trifonov", "TrifonovFast", "DUST"), trifonov.max.word.size = 7, nthreads = 1, return.granges = FALSE)sequence_complexity(seqs, window.size = 20, window.overlap = round(window.size/2), method = c("WoottonFederhen", "WoottonFederhenFast", "Trifonov", "TrifonovFast", "DUST"), trifonov.max.word.size = 7, nthreads = 1, return.granges = FALSE)
seqs |
|
window.size |
|
window.overlap |
|
method |
|
trifonov.max.word.size |
|
nthreads |
|
return.granges |
|
The Wootton-Federhen (Wootton and Federhen, 1993) and Trifonov (Trifonov,
1990) algorithms as well as their faster approximations are well described
within Orlov and Potapov (2004). These algorithms score less complex sequences
closer to 0, and more complex ones closer to 1. Please note that the
'fast' approximation versions of the two methods are not actually faster
within sequence_complexity(), and so speed should not be a major consideration
when choosing which method to use within the universalmotif package.
The DUST algorithm
implementation is described in Morgulis et al. (2006). In this case,
less complex sequences score higher, and more complex ones closer
to 0.
Briefly, the Wootton-Federhen complexity score is a reflection of the
numbers of each unique letter found in the window (e.g. for DNA, the
more of all four letters can be found in the window the higher the
score). An increasing Trifonov score is a relection of the numbers of increasingly
larger k-mers (e.g. the count of possible 1-mers, 2-mers, 3-mers, ...,
until trifonov.max.word.size). Finally, the DUST score approaches 0
as the count of unique 3-mers increases. (See the final section in
the examples to see how different types of sequence compositions affect
the methods.)
Please note that the authors of the different methods recommend various
window sizes and complexity thresholds. The authors of DUST for example,
suggest using a window size of 64 and a threshold of 2 for low complexity.
Wootton and Federhen suggest a window size of 40, though show that 10
and 20 can be appropriate as well (for amino acid sequences). Keep in
mind however that these algorithms were implemented at a time when
computers were much slower; perhaps the authors would suggest different
window sizes today. One thing to note is that the Wootton-Federhen
algorithm has a hard limit due to the need to caculate the product from
1:window.size. This can end up calculating values which are greater
than what a double can hold (e.g. try prod(1:500)). Its approximation
does not suffer from this though, as it skips calculating the product.
In terms of speed, the Wootton-Federhen algorithms are fastest, with DUST being 1-3 times slower and the Trifonov algorithms being several times slower (though the exact amount depends on the max word size).
DataFrame, GRanges with each row getting a complexity score for
each window in each input sequence.
Benjamin Jean-Marie Tremblay, [email protected]
Morgulis A, Gertz EM, Schaffer AA, Agarwala R (2006). "A fast and symmetric DUST implementation to mask low-complexity DNA sequences." Journal of Computational Biology, 13, 1028-1040.
Orlov YL, Potapov VN (2004). "Complexity: an internet resource for analysis of DNA sequence complexity." Nucleic Acids Research, 32, W628-W633.
Trifonov EN (1990). "Making sense of the human genome." In Sarma RH, Sarma MH (Eds), Structure & Methods Adenine Press, Albany, 1, 69-77.
Wootton JC, Federhen S (1993). "Statistics of local complexity in amino acid sequences and sequence databases." Computers & Chemistry, 17, 149-163.
calc_complexity(), count_klets(), get_bkg(), mask_ranges(),
mask_seqs()
## Feel free to play around with different toy sequences to get a feel for ## how the different methods perform library(Biostrings) test.seq <- DNAStringSet(c("AAAAAAAAAAA", "ATGACTGATGC")) sequence_complexity(test.seq, method = "WoottonFederhen") sequence_complexity(test.seq, method = "WoottonFederhenFast") sequence_complexity(test.seq, method = "Trifonov") sequence_complexity(test.seq, method = "TrifonovFast") sequence_complexity(test.seq, method = "DUST") ## You could also use this in conjuction with mask_ranges() to hide ## low complexity regions from scanning, de novo motif discovery, etc if (requireNamespace("GenomicRanges", quiet = TRUE)) { data(ArabidopsisPromoters) # Calculate complexity in 20 bp windows, sliding every 1 bp to.mask <- sequence_complexity(ArabidopsisPromoters, method = "DUST", window.size = 20, window.overlap = 19, return.granges = TRUE) # Get the ranges with a complexity score greater than 3.5 to.mask <- to.mask[to.mask$complexity > 3.5] # See what the low complexity regions look like ArabidopsisPromoters[IRanges::reduce(to.mask)] # Mask them with the default '-' character: mask_ranges(ArabidopsisPromoters, to.mask) } ## To demonstrate how the methods work, consider: ## (These examples use the calc_complexity() utility which utilizes ## the same algorithms and works on character vectors, but lacks ## the ability to use sliding windows.) a <- "ACGT" # For Wootton-Federhen, it can be easily shown it is only dependent # on the counts of individual letters (though do note that the # original paper discusses this method in the context of amino acid # sequences and not DNA): calc_complexity("AAACCCGGGTTT", alph = a) # 0.7707 calc_complexity("AACCGGTTACGT", alph = a) # 0.7707 calc_complexity("ACGTACGTACGT", alph = a) # 0.7707 # As letters start to see drops in counts, the scores go down too: calc_complexity("AAAACCCCGGGG", alph = a) # 0.6284 calc_complexity("AAAAAACCCCCC", alph = a) # 0.4105 calc_complexity("AAAAAAAAAACC", alph = a) # 0.2518 # Trifonov on the other hand is greatly affected by the number # of higher order combinations: calc_complexity("AAACCCGGGTTT", c = "Trifonov", alph = a) # 0.6364 calc_complexity("AACCGGTTACGT", c = "Trifonov", alph = a) # 0.7273 # This next one may seem surprising, but it indeed scores very low. # This is because although it has many of each individual letter, # the number of higher order letter combinations in fact is quite # low due to this particular repeating pattern! calc_complexity("ACGTACGTACGT", c = "Trifonov", alph = a) # 0.01231 # By extension, this means it scores sequences with fewer # counts of individual letters lower too. calc_complexity("AAAACCCCGGGG", c = "Trifonov", alph = a) # 0.2386 calc_complexity("AAAAAACCCCCC", c = "Trifonov", alph = a) # 0.0227 calc_complexity("AAAAAAAAAACC", c = "Trifonov", alph = a) # 0.0011 # As for DUST, it considers the number of 3-mers in the sequence. # The higher the numbers of 3-mers, the lower the score. # (0 = the max possible number of DNA 3-mers for the window size) calc_complexity("AAACCCGGGTTT", c = "DUST", alph = a) # 0 calc_complexity("AACCGGTTACGT", c = "DUST", alph = a) # 0 calc_complexity("ACGTACGTACGT", c = "DUST", alph = a) # 0.8889 calc_complexity("AAAACCCCGGGG", c = "DUST", alph = a) # 0.333 calc_complexity("ACGACGACGACG", c = "DUST", alph = a) # 1.333 calc_complexity("AAAAAACCCCCC", c = "DUST", alph = a) # 1.333 # Similarly to Trifonov, the next one also scores as less complex # compared to the previous one: calc_complexity("ACACACACACAC", c = "DUST", alph = a) # 2.222 calc_complexity("AAAAAAAAAACC", c = "DUST", alph = a) # 3.111 calc_complexity("AAAAAAAAAAAC", c = "DUST", alph = a) # 4 calc_complexity("AAAAAAAAAAAA", c = "DUST", alph = a) # 5 # Just to show once more why the seemingly more complex sequences # such as "ACACACACACAC" score as less complex than "AAAAAACCCCCC" # for the Trifonov and DUST methods: count_klets("ACACACACACAC", k = 3) # Only 2 possible 3-mers count_klets("AAAAAACCCCCC", k = 3) # Now 4 possible 3-mers!## Feel free to play around with different toy sequences to get a feel for ## how the different methods perform library(Biostrings) test.seq <- DNAStringSet(c("AAAAAAAAAAA", "ATGACTGATGC")) sequence_complexity(test.seq, method = "WoottonFederhen") sequence_complexity(test.seq, method = "WoottonFederhenFast") sequence_complexity(test.seq, method = "Trifonov") sequence_complexity(test.seq, method = "TrifonovFast") sequence_complexity(test.seq, method = "DUST") ## You could also use this in conjuction with mask_ranges() to hide ## low complexity regions from scanning, de novo motif discovery, etc if (requireNamespace("GenomicRanges", quiet = TRUE)) { data(ArabidopsisPromoters) # Calculate complexity in 20 bp windows, sliding every 1 bp to.mask <- sequence_complexity(ArabidopsisPromoters, method = "DUST", window.size = 20, window.overlap = 19, return.granges = TRUE) # Get the ranges with a complexity score greater than 3.5 to.mask <- to.mask[to.mask$complexity > 3.5] # See what the low complexity regions look like ArabidopsisPromoters[IRanges::reduce(to.mask)] # Mask them with the default '-' character: mask_ranges(ArabidopsisPromoters, to.mask) } ## To demonstrate how the methods work, consider: ## (These examples use the calc_complexity() utility which utilizes ## the same algorithms and works on character vectors, but lacks ## the ability to use sliding windows.) a <- "ACGT" # For Wootton-Federhen, it can be easily shown it is only dependent # on the counts of individual letters (though do note that the # original paper discusses this method in the context of amino acid # sequences and not DNA): calc_complexity("AAACCCGGGTTT", alph = a) # 0.7707 calc_complexity("AACCGGTTACGT", alph = a) # 0.7707 calc_complexity("ACGTACGTACGT", alph = a) # 0.7707 # As letters start to see drops in counts, the scores go down too: calc_complexity("AAAACCCCGGGG", alph = a) # 0.6284 calc_complexity("AAAAAACCCCCC", alph = a) # 0.4105 calc_complexity("AAAAAAAAAACC", alph = a) # 0.2518 # Trifonov on the other hand is greatly affected by the number # of higher order combinations: calc_complexity("AAACCCGGGTTT", c = "Trifonov", alph = a) # 0.6364 calc_complexity("AACCGGTTACGT", c = "Trifonov", alph = a) # 0.7273 # This next one may seem surprising, but it indeed scores very low. # This is because although it has many of each individual letter, # the number of higher order letter combinations in fact is quite # low due to this particular repeating pattern! calc_complexity("ACGTACGTACGT", c = "Trifonov", alph = a) # 0.01231 # By extension, this means it scores sequences with fewer # counts of individual letters lower too. calc_complexity("AAAACCCCGGGG", c = "Trifonov", alph = a) # 0.2386 calc_complexity("AAAAAACCCCCC", c = "Trifonov", alph = a) # 0.0227 calc_complexity("AAAAAAAAAACC", c = "Trifonov", alph = a) # 0.0011 # As for DUST, it considers the number of 3-mers in the sequence. # The higher the numbers of 3-mers, the lower the score. # (0 = the max possible number of DNA 3-mers for the window size) calc_complexity("AAACCCGGGTTT", c = "DUST", alph = a) # 0 calc_complexity("AACCGGTTACGT", c = "DUST", alph = a) # 0 calc_complexity("ACGTACGTACGT", c = "DUST", alph = a) # 0.8889 calc_complexity("AAAACCCCGGGG", c = "DUST", alph = a) # 0.333 calc_complexity("ACGACGACGACG", c = "DUST", alph = a) # 1.333 calc_complexity("AAAAAACCCCCC", c = "DUST", alph = a) # 1.333 # Similarly to Trifonov, the next one also scores as less complex # compared to the previous one: calc_complexity("ACACACACACAC", c = "DUST", alph = a) # 2.222 calc_complexity("AAAAAAAAAACC", c = "DUST", alph = a) # 3.111 calc_complexity("AAAAAAAAAAAC", c = "DUST", alph = a) # 4 calc_complexity("AAAAAAAAAAAA", c = "DUST", alph = a) # 5 # Just to show once more why the seemingly more complex sequences # such as "ACACACACACAC" score as less complex than "AAAAAACCCCCC" # for the Trifonov and DUST methods: count_klets("ACACACACACAC", k = 3) # Only 2 possible 3-mers count_klets("AAAAAACCCCCC", k = 3) # Now 4 possible 3-mers!
Given a set of motifs, shuffle the columns to create new motifs. Currently does not
support keeping the 'multifreq' slot. Only the 'bkg', 'nsites', 'strand',
and 'bkgsites' slots will be preserved. Uses the same shuffling methods
as shuffle_sequences(). When shuffling more than one motif, all motif
columns are merged into a single pool and shuffled together, finally returning
them as motifs of identical lengths as the input motifs.
To instead shuffle motifs individually, call shuffle_motifs()
using lapply().
shuffle_motifs(motifs, k = 2, method = "linear")shuffle_motifs(motifs, k = 2, method = "linear")
motifs |
See |
k |
|
method |
|
Motifs. See convert_motifs() for available output
formats.
Benjamin Jean-Marie Tremblay, [email protected]
Given a set of input sequences, shuffle the letters within those sequences with any k-let size.
shuffle_sequences(sequences, k = 1, method = "euler", nthreads = 1, rng.seed = sample.int(10000, 1), window = FALSE, window.size = 0.1, window.overlap = 0.01)shuffle_sequences(sequences, k = 1, method = "euler", nthreads = 1, rng.seed = sample.int(10000, 1), window = FALSE, window.size = 0.1, window.overlap = 0.01)
sequences |
|
k |
|
method |
|
nthreads |
|
rng.seed |
|
window |
|
window.size |
|
window.overlap |
|
If method = 'markov', then the Markov model is used to
generate sequences which will maintain (on average) the k-let
frequencies. Please note that this method is not a 'true' shuffling, and
for short sequences (e.g. <100bp) this can result in slightly more
dissimilar sequences versus true shuffling. See
Fitch (1983) for a discussion on the
topic.
If method = 'euler', then the sequence shuffling method proposed by
Altschul and Erickson (1985) is used. As opposed
to the 'markov' method, this one preserves exact k-let frequencies. This
is done by creating a k-let edge graph, then following a
random Eulerian walk through the graph. Not all walks will use up all
available letters however, so the cycle-popping algorithm proposed by
Propp and Wilson (1998) is used to find a
random Eulerian path. A side effect of using this method is that the
starting and ending sequence letters will remain unshuffled.
If method = 'linear', then the input sequences are split linearly
every k letters. For example, for k = 3 'ACAGATAGACCC' becomes
'ACA GAT AGA CCC'; after which these 3-lets are shuffled randomly.
Do note however, that the method parameter is only relevant for k > 1.
For k = 1, a simple shuffling is performed using the shuffle function
from the C++ standard library.
XStringSet The input sequences will be returned with
identical names and lengths.
Benjamin Jean-Marie Tremblay, [email protected]
Altschul SF, Erickson BW (1985). “Significance of Nucleotide Sequence Alignments: A Method for Random Sequence Permutation That Preserves Dinucleotide and Codon Usage.” Molecular Biology and Evolution, 2, 526-538.
Fitch WM (1983). “Random sequences.” Journal of Molecular Biology, 163, 171-176.
Propp JG, Wilson DW (1998). “How to get a perfectly random sample from a generic markov chain and generate a random spanning tree of a directed graph.” Journal of Algorithms, 27, 170-217.
create_sequences(), scan_sequences(), enrich_motifs(),
shuffle_motifs()
if (R.Version()$arch != "i386") { sequences <- create_sequences() sequences.shuffled <- shuffle_sequences(sequences, k = 2) }if (R.Version()$arch != "i386") { sequences <- create_sequences() sequences.shuffled <- shuffle_sequences(sequences, k = 2) }
Convert a motif from DNA to RNA, or RNA to DNA.
switch_alph(motifs)switch_alph(motifs)
motifs |
See |
The DNA/RNA version of the motifs. See convert_motifs() for
acceptable output formats.
Benjamin Jean-Marie Tremblay, [email protected]
DNA.motif <- create_motif() RNA.motif <- switch_alph(DNA.motif)DNA.motif <- create_motif() RNA.motif <- switch_alph(DNA.motif)
Tidy manipulation of motifs.
to_df(motifs, extrainfo = TRUE) update_motifs(motif_df, extrainfo = TRUE, force = FALSE) to_list(motif_df, extrainfo = TRUE, force = FALSE) requires_update(motifs, extrainfo = TRUE)to_df(motifs, extrainfo = TRUE) update_motifs(motif_df, extrainfo = TRUE, force = FALSE) to_list(motif_df, extrainfo = TRUE, force = FALSE) requires_update(motifs, extrainfo = TRUE)
motifs |
List of motifs. |
extrainfo |
Use the |
motif_df |
Motif |
force |
Whether to coerce non-character data types into characters for
inclusion in |
To turn off the informative messages/warnings when printing the object to
the console, set options(universalmotif_df.warning=FALSE).
For to_df(): a data.frame with the exposed slots as columns.
For update_motifs(): the updated data.frame.
For requires_update(): TRUE if the motifs are out of date,
FALSE if otherwise. Note that this function uses identical()
to check for this, which can be quite slow for large datasets. It
is usually just as fast to simply run update_motifs() in such
cases.
For to_list(): a list of motifs.
Benjamin Jean-Marie Tremblay, [email protected]
## Not run: library(universalmotif) library(dplyr) m <- c(create_motif(name = "motif A"), create_motif(name = "motif B")) # Change the names of the motifs using the tidy way: m <- m %>% to_df() %>% mutate(name = paste0(name, "-2")) %>% to_list() # Add your own metadata to be stored in the extrainfo slot: m_df <- to_df(m) m_df$MyMetadata <- c("Info_1", "Info_2") m <- to_list(m_df, extrainfo = TRUE) ## End(Not run)## Not run: library(universalmotif) library(dplyr) m <- c(create_motif(name = "motif A"), create_motif(name = "motif B")) # Change the names of the motifs using the tidy way: m <- m %>% to_df() %>% mutate(name = paste0(name, "-2")) %>% to_list() # Add your own metadata to be stored in the extrainfo slot: m_df <- to_df(m) m_df$MyMetadata <- c("Info_1", "Info_2") m <- to_list(m_df, extrainfo = TRUE) ## End(Not run)
Trim the low-contribution edge columns of one or more
Contribution Weight Matrix (CWM) motifs. CWMs do not have a
probabilistic interpretation, so trim_motifs()'s
information-content threshold is not the natural cutoff;
instead, trim_cwm() drops edge columns whose absolute
column sum (sum_i |cwm[i,j]|) falls below a threshold.
trim_cwm(motifs, trim.threshold = 0.3, abs.threshold = NULL, trim.from = c("both", "left", "right"))trim_cwm(motifs, trim.threshold = 0.3, abs.threshold = NULL, trim.from = c("both", "left", "right"))
motifs |
A CWM motif or list of CWM motifs ( |
trim.threshold |
|
abs.threshold |
|
trim.from |
|
Two threshold modes are supported. By default, trim_cwm()
uses the TF-MoDISco-lite fraction-of-peak rule: drop edge
columns whose absolute column sum is below
trim.threshold * max_j(sum_i |cwm[i,j]|), with
trim.threshold = 0.3 matching TF-MoDISco-lite's
--trim_threshold default. When abs.threshold is supplied
(non-NULL), the function switches to an absolute-value cutoff:
edge columns with absolute column sum below abs.threshold are
dropped, ignoring trim.threshold.
In both modes the trimmer walks inward only from the chosen
edges (trim.from), matching trim_motifs()'s edge-only
semantics; interior low-contribution columns are kept.
A trimmed CWM motif (single input) or list of trimmed
CWM motifs (list input). The type = "CWM" tag is preserved.
Benjamin Jean-Marie Tremblay, [email protected]
trim_motifs(), create_motif(), convert_type()
library(universalmotif) ## Synthetic CWM with a strong centre and weak flanks. m <- matrix(c(0.02, -0.01, 0.01, 0.00, 0.90, -0.30, -0.30, -0.30, -0.20, 0.80, -0.20, -0.20, -0.30, -0.30, -0.30, 0.90, 0.03, 0.00, -0.02, 0.01), nrow = 4, byrow = FALSE, dimnames = list(c("A","C","G","T"), NULL)) cwm <- create_motif(m, type = "CWM", name = "demo") ## Default: drop edge columns where |colsum| < 0.3 * peak. trimmed <- trim_cwm(cwm) ncol(trimmed["motif"]) ## Explicit absolute-value cutoff instead. trim_cwm(cwm, abs.threshold = 0.5) ## One-sided trim. trim_cwm(cwm, trim.from = "left")library(universalmotif) ## Synthetic CWM with a strong centre and weak flanks. m <- matrix(c(0.02, -0.01, 0.01, 0.00, 0.90, -0.30, -0.30, -0.30, -0.20, 0.80, -0.20, -0.20, -0.30, -0.30, -0.30, 0.90, 0.03, 0.00, -0.02, 0.01), nrow = 4, byrow = FALSE, dimnames = list(c("A","C","G","T"), NULL)) cwm <- create_motif(m, type = "CWM", name = "demo") ## Default: drop edge columns where |colsum| < 0.3 * peak. trimmed <- trim_cwm(cwm) ncol(trimmed["motif"]) ## Explicit absolute-value cutoff instead. trim_cwm(cwm, abs.threshold = 0.5) ## One-sided trim. trim_cwm(cwm, trim.from = "left")
Remove edges of motifs with low information content. Currently does
not trim multifreq representations.
trim_motifs(motifs, min.ic = 0.25, trim.from = c("both", "left", "right"))trim_motifs(motifs, min.ic = 0.25, trim.from = c("both", "left", "right"))
motifs |
See |
min.ic |
|
trim.from |
|
Motifs See convert_motifs() for available output
formats.
Benjamin Jean-Marie Tremblay, [email protected]
create_motif(), convert_type()
jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) jaspar.trimmed <- trim_motifs(jaspar)jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) jaspar.trimmed <- trim_motifs(jaspar)
Container for motif objects. See create_motif() for creating
motifs as well as a more detailed description of the slots. For a
brief description of available methods, see examples.
## S4 method for signature 'universalmotif' x[i] ## S4 replacement method for signature 'universalmotif' x[i] <- value ## S4 method for signature 'universalmotif' initialize(.Object, name, altname, family, organism, motif, alphabet = "DNA", type, icscore, nsites, pseudocount = 1, bkg, bkgsites, consensus, strand = "+-", pval, qval, eval, multifreq, extrainfo, gapinfo) ## S4 method for signature 'universalmotif' show(object) ## S4 method for signature 'universalmotif' as.data.frame(x) ## S4 method for signature 'universalmotif' subset(x, select) ## S4 method for signature 'universalmotif' normalize(object) ## S4 method for signature 'universalmotif' rowMeans(x) ## S4 method for signature 'universalmotif' colMeans(x) ## S4 method for signature 'universalmotif' colSums(x) ## S4 method for signature 'universalmotif' rowSums(x) ## S4 method for signature 'universalmotif' nrow(x) ## S4 method for signature 'universalmotif' ncol(x) ## S4 method for signature 'universalmotif' colnames(x) ## S4 method for signature 'universalmotif' rownames(x) ## S4 method for signature 'universalmotif' cbind(..., deparse.level = 0)## S4 method for signature 'universalmotif' x[i] ## S4 replacement method for signature 'universalmotif' x[i] <- value ## S4 method for signature 'universalmotif' initialize(.Object, name, altname, family, organism, motif, alphabet = "DNA", type, icscore, nsites, pseudocount = 1, bkg, bkgsites, consensus, strand = "+-", pval, qval, eval, multifreq, extrainfo, gapinfo) ## S4 method for signature 'universalmotif' show(object) ## S4 method for signature 'universalmotif' as.data.frame(x) ## S4 method for signature 'universalmotif' subset(x, select) ## S4 method for signature 'universalmotif' normalize(object) ## S4 method for signature 'universalmotif' rowMeans(x) ## S4 method for signature 'universalmotif' colMeans(x) ## S4 method for signature 'universalmotif' colSums(x) ## S4 method for signature 'universalmotif' rowSums(x) ## S4 method for signature 'universalmotif' nrow(x) ## S4 method for signature 'universalmotif' ncol(x) ## S4 method for signature 'universalmotif' colnames(x) ## S4 method for signature 'universalmotif' rownames(x) ## S4 method for signature 'universalmotif' cbind(..., deparse.level = 0)
x |
universalmotif Motif. |
i |
|
value |
Object to replace slot with. |
.Object |
universalmotif Final motif. |
name |
|
altname |
|
family |
|
organism |
|
motif |
|
alphabet |
|
type |
|
icscore |
|
nsites |
|
pseudocount |
|
bkg |
|
bkgsites |
|
consensus |
|
strand |
|
pval |
|
qval |
|
eval |
|
multifreq |
|
extrainfo |
|
gapinfo |
|
object |
universalmotif Motif. |
select |
|
... |
universalmotif Motifs. |
deparse.level |
Unused. |
A motif object of class universalmotif.
namecharacter(1)
altnamecharacter(1)
familycharacter(1)
organismcharacter(1)
motifmatrix
alphabetcharacter(1)
typecharacter(1)
icscorenumeric(1) Generated automatically.
nsitesnumeric(1)
pseudocountnumeric(1)
bkgnumeric 0-order probabilities must be provided for all letters.
bkgsitesnumeric(1)
consensuscharacter Generated automatically.
strandcharacter(1)
pvalnumeric(1)
qvalnumeric(1)
evalnumeric(1)
multifreqlist
extrainfocharacter
gapinfouniversalmotif_gapped(1)
Benjamin Jean-Marie Tremblay, [email protected]
## [ ## Access the slots. motif <- create_motif() motif["motif"] # you can also access multiple slots at once, released as a list motif[c("motif", "name")] ## [<- ## Replace the slots. motif["name"] <- "new name" # some slots are protected # motif["consensus"] <- "AAAA" # not allowed ## c ## Assemble a list of motifs. c(motif, motif) ## as.data.frame ## Represent a motif as a data.frame. The actual motif matrix is lost. ## Necessary for `summarise_motifs`. as.data.frame(motif) ## subset ## Subset a motif matrix by column. subset(motif, 3:7) # extract motif core ## normalize ## Apply the pseudocount slot (or `1`, if the slot is set to zero) to the ## motif matrix. motif2 <- create_motif("AAAAA", nsites = 100, pseudocount = 1) normalize(motif2) ## rowMeans ## Calculate motif rowMeans. rowMeans(motif) ## colMeans ## Calculate motif colMeans. colMeans(motif) ## colSums ## Calculate motif colSums colSums(motif) ## rowSums ## Calculate motif rowSums. rowSums(motif) ## nrow ## Count motif rows. nrow(motif) ## ncol ## Count motif columns. ncol(motif) ## colnames ## Get motif colnames. colnames(motif) ## rownames ## Get motif rownames. rownames(motif) ## cbind ## Bind motifs together to create a new motif. cbind(motif, motif2)## [ ## Access the slots. motif <- create_motif() motif["motif"] # you can also access multiple slots at once, released as a list motif[c("motif", "name")] ## [<- ## Replace the slots. motif["name"] <- "new name" # some slots are protected # motif["consensus"] <- "AAAA" # not allowed ## c ## Assemble a list of motifs. c(motif, motif) ## as.data.frame ## Represent a motif as a data.frame. The actual motif matrix is lost. ## Necessary for `summarise_motifs`. as.data.frame(motif) ## subset ## Subset a motif matrix by column. subset(motif, 3:7) # extract motif core ## normalize ## Apply the pseudocount slot (or `1`, if the slot is set to zero) to the ## motif matrix. motif2 <- create_motif("AAAAA", nsites = 100, pseudocount = 1) normalize(motif2) ## rowMeans ## Calculate motif rowMeans. rowMeans(motif) ## colMeans ## Calculate motif colMeans. colMeans(motif) ## colSums ## Calculate motif colSums colSums(motif) ## rowSums ## Calculate motif rowSums. rowSums(motif) ## nrow ## Count motif rows. nrow(motif) ## ncol ## Count motif columns. ncol(motif) ## colnames ## Get motif colnames. colnames(motif) ## rownames ## Get motif rownames. rownames(motif) ## cbind ## Bind motifs together to create a new motif. cbind(motif, motif2)
A collection of utility functions for working with motifs.
Maintainer: Benjamin Jean-Marie Tremblay [email protected] (ORCID)
Other contributors:
Spencer Nystrom (ORCID) [contributor]
Useful links:
Report bugs at https://github.com/bjmt/universalmotif/issues
Utility functions have been split into two categories: those related to motifs ?'utils-motif', and those related to sequences ?'utils-sequence'.
Benjamin Jean-Marie Tremblay, [email protected]
Motif-related utility functions.
add_gap(motif, gaploc = ncol(motif)%/%2, mingap = 1, maxgap = 5) average_ic(motifs, average = c("a.mean", "g.mean", "median", "fzt")) compare_columns(x, y, method, bkg1 = rep(1/length(x), length(x)), bkg2 = rep(1/length(y), length(y)), nsites1 = 100, nsites2 = 100) consensus_to_ppm(letter) consensus_to_ppmAA(letter) get_consensus(position, alphabet = "DNA", type = "PPM", pseudocount = 1) get_consensusAA(position, type = "PPM", pseudocount = 0) get_matches(motif, score, allow.nonfinite = FALSE) get_scores(motif, allow.nonfinite = FALSE) icm_to_ppm(position) motif_range(motif, use.freq = 1, allow.nonfinite = FALSE) motif_score(motif, threshold = c(0, 1), use.freq = 1, allow.nonfinite = FALSE, threshold.type = c("total", "fromzero")) log_string_pval(pval) pcm_to_ppm(position, pseudocount = 0) position_icscore(position, bkg = numeric(), type = "PPM", pseudocount = 1, nsites = 100, relative_entropy = FALSE, schneider_correction = FALSE) ppm_to_icm(position, bkg = numeric(), schneider_correction = FALSE, nsites = 100, relative_entropy = FALSE) ppm_to_pcm(position, nsites = 100) ppm_to_pwm(position, bkg = numeric(), pseudocount = 1, nsites = 100, smooth = TRUE) prob_match(motif, match, allow.zero = TRUE) prob_match_bkg(bkg, match) pwm_to_ppm(position, bkg = numeric()) round_motif(motif, pct.tolerance = 0.05) score_match(motif, match, allow.nonfinite = FALSE) summarise_motifs(motifs, na.rm = TRUE) ungap(motif, delete = FALSE)add_gap(motif, gaploc = ncol(motif)%/%2, mingap = 1, maxgap = 5) average_ic(motifs, average = c("a.mean", "g.mean", "median", "fzt")) compare_columns(x, y, method, bkg1 = rep(1/length(x), length(x)), bkg2 = rep(1/length(y), length(y)), nsites1 = 100, nsites2 = 100) consensus_to_ppm(letter) consensus_to_ppmAA(letter) get_consensus(position, alphabet = "DNA", type = "PPM", pseudocount = 1) get_consensusAA(position, type = "PPM", pseudocount = 0) get_matches(motif, score, allow.nonfinite = FALSE) get_scores(motif, allow.nonfinite = FALSE) icm_to_ppm(position) motif_range(motif, use.freq = 1, allow.nonfinite = FALSE) motif_score(motif, threshold = c(0, 1), use.freq = 1, allow.nonfinite = FALSE, threshold.type = c("total", "fromzero")) log_string_pval(pval) pcm_to_ppm(position, pseudocount = 0) position_icscore(position, bkg = numeric(), type = "PPM", pseudocount = 1, nsites = 100, relative_entropy = FALSE, schneider_correction = FALSE) ppm_to_icm(position, bkg = numeric(), schneider_correction = FALSE, nsites = 100, relative_entropy = FALSE) ppm_to_pcm(position, nsites = 100) ppm_to_pwm(position, bkg = numeric(), pseudocount = 1, nsites = 100, smooth = TRUE) prob_match(motif, match, allow.zero = TRUE) prob_match_bkg(bkg, match) pwm_to_ppm(position, bkg = numeric()) round_motif(motif, pct.tolerance = 0.05) score_match(motif, match, allow.nonfinite = FALSE) summarise_motifs(motifs, na.rm = TRUE) ungap(motif, delete = FALSE)
motif |
Motif object to calculate scores from, or add/remove gap, or round. |
gaploc |
|
mingap |
|
maxgap |
|
motifs |
|
average |
|
x |
|
y |
|
method |
|
bkg1 |
|
bkg2 |
|
nsites1 |
|
nsites2 |
|
letter |
|
position |
|
alphabet |
|
type |
|
pseudocount |
|
score |
|
allow.nonfinite |
|
use.freq |
|
threshold |
|
threshold.type |
|
pval |
|
bkg |
|
nsites |
|
relative_entropy |
|
schneider_correction |
|
smooth |
|
match |
|
allow.zero |
|
pct.tolerance |
|
na.rm |
|
delete |
|
For consensus_to_ppm() and consensus_to_ppmAA(): a numeric
vector of length 4 and 20, respectively.
For get_consensus() and get_consensusAA(): a character vector
of length 1.
For get_matches(): a character vector of motif matches.
For motif_range(): a named numeric vector of motif scores.
For motif_score(): a named numeric vector of motif scores.
For log_string_pval(): a numeric vector of length 1.
For position_icscore(): a numeric vector of length 1.
For ppm_to_icm(), icm_to_ppm(), pcm_to_ppm(),
ppm_to_pcm(), ppm_to_pwm(), and pwm_to_ppm(): a numeric
vector with length equal to input numeric vector.
For prob_match(): a numeric vector of probabilities.
For round_motif(): the input motif, rounded.
For score_match(): a numeric vector with the match motif score.
For summarise_motifs(): a data.frame with columns representing
the universalmotif slots.
Benjamin Jean-Marie Tremblay, [email protected]
data(examplemotif) examplemotif0 <- examplemotif examplemotif0["pseudocount"] <- 0 ####################################################################### ## add_gap ## Add gap information to a motif. m <- create_motif() # Add a gap size 5-8 between positions 4 and 5: m <- add_gap(m, gaploc = 4, mingap = 5, maxgap = 8) ####################################################################### ## average_ic ## Calculate the average information content for a list of motifs. m <- create_motif() average_ic(m, "fzt") ####################################################################### ## compare_columns ## Compare two numeric vectors using the metrics from compare_motifs() compare_columns(c(0.5, 0.1, 0.1, 0.2), c(0.7, 0.1, 0.1, 0.1), "PCC") ####################################################################### ## consensus_to_ppm ## Do the opposite of get_consensus. Note that loss of information is ## inevitable. Generates a sequence matrix. sapply(c("A", "G", "T", "B"), consensus_to_ppm) ####################################################################### ## consensus_to_ppmAA ## Do the opposite of get_consensusAA and generate a motif matrix. sapply(c("V", "A", "L"), consensus_to_ppmAA) ####################################################################### ## get_consensus ## Get a consensus string from a DNA/RNA motif. m <- create_motif()["motif"] apply(m, 2, get_consensus) ####################################################################### ## get_consensusAA ## Get a consensus string from an amino acid motif. Unless each position ## is clearly dominated by a single amino acid, the resulting string will ## likely be useless. m <- create_motif(alphabet = "AA")["motif"] apply(m, 2, get_consensusAA, type = "PPM") ####################################################################### ## get_match ## Get all possible motif matches above input score get_matches(examplemotif, 0) get_matches(examplemotif0, 0, allow.nonfinite = TRUE) ####################################################################### ## get_scores ## Get all possible scores for a motif length(get_scores(examplemotif)) get_scores(examplemotif) get_scores(examplemotif0, allow.nonfinite = TRUE) ####################################################################### ## icm_to_ppm ## Do the opposite of ppm_to_icm. m <- create_motif(type = "ICM")["motif"] apply(m, 2, icm_to_ppm) ####################################################################### ## motif_range ## Calculate the range of possible logodds scores for a motif motif_range(examplemotif) motif_range(examplemotif, allow.nonfinite = TRUE) ####################################################################### ## motif_score ## Calculate motif score from different thresholds m <- normalize(examplemotif) motif_score(m, c(0, 0.8, 1)) motif_score(examplemotif0, c(0, 0.8, 1), allow.nonfinite = TRUE, threshold.type = "fromzero") ####################################################################### ## log_string_pval ## Get the log of a string-formatted p-value log_string_pval("1e-200") ####################################################################### ## pcm_to_ppm ## Go from a count type motif to a probability type motif. m <- create_motif(type = "PCM", nsites = 50)["motif"] apply(m, 2, pcm_to_ppm, pseudocount = 1) ####################################################################### ## position_icscore ## Similar to ppm_to_icm, except this calculates the position sum. m <- create_motif()["motif"] apply(m, 2, position_icscore, type = "PPM", bkg = rep(0.25, 4)) ####################################################################### ## ppm_to_icm ## Convert one column from a probability type motif to an information ## content type motif. m <- create_motif(nsites = 100, pseudocount = 0.8)["motif"] apply(m, 2, ppm_to_icm, nsites = 100, bkg = rep(0.25, 4)) ####################################################################### ## ppm_to_pcm ## Do the opposite of pcm_to_ppm. m <- create_motif()["motif"] apply(m, 2, ppm_to_pcm, nsites = 50) ####################################################################### ## ppm_to_pwm ## Go from a probability type motif to a weight type motif. m <- create_motif()["motif"] apply(m, 2, ppm_to_pwm, nsites = 100, bkg = rep(0.25, 4)) ####################################################################### ## prob_match, prob_match_bkg ## Calculate probability of a particular match based on background ## frequencies prob_match(examplemotif, "TATATAT") ## Since this motif has a uniform background, the probability of ## finding any motif hit within the sequence is equal prob_match(examplemotif, "TATATAG") m <- examplemotif m["bkg"] <- c(0.3, 0.2, 0.2, 0.3) prob_match(m, "TATATAT") ## The prob_match_bkg alternative allows you to simply pass along the ## background frequencies prob_match_bkg(c(A=0.3, C=0.2, G=0.2, T=0.3), c("TATATAT", "TATATAG")) ####################################################################### ## pwm_to_ppm ## Do the opposite of ppm_to_pwm. m <- create_motif(type = "PWM")["motif"] apply(m, 2, pwm_to_ppm, bkg = rep(0.25, 4)) ####################################################################### ## Note that not all type conversions can be done directly; for those ## type conversions which are unavailable, universalmotif just chains ## together others (i.e. from PCM -> ICM => pcm_to_ppm -> ppm_to_icm) ####################################################################### ## round_motif ## Round down letter scores to 0 m <- create_motif() ## Remove letters from positions which are less than 5% of the total ## position: round_motif(m, pct.tolerance = 0.05) ####################################################################### ## score_match ## Calculate score of a particular match score_match(examplemotif, "TATATAT") score_match(examplemotif, "TATATAG") score_match(examplemotif0, "TATATAT", allow.nonfinite = TRUE) score_match(examplemotif0, "TATATAG", allow.nonfinite = TRUE) ####################################################################### ## summarise_motifs ## Create a data.frame of information based on a list of motifs. m1 <- create_motif() m2 <- create_motif() m3 <- create_motif() summarise_motifs(list(m1, m2, m3)) ####################################################################### ## ungap ## Unset motif's gap status. Does not delete actual gap data unless ## delete = TRUE. m <- create_motif() m <- add_gap(m, 3, 2, 4) m <- ungap(m) # Restore gap data: m <- add_gap(m)data(examplemotif) examplemotif0 <- examplemotif examplemotif0["pseudocount"] <- 0 ####################################################################### ## add_gap ## Add gap information to a motif. m <- create_motif() # Add a gap size 5-8 between positions 4 and 5: m <- add_gap(m, gaploc = 4, mingap = 5, maxgap = 8) ####################################################################### ## average_ic ## Calculate the average information content for a list of motifs. m <- create_motif() average_ic(m, "fzt") ####################################################################### ## compare_columns ## Compare two numeric vectors using the metrics from compare_motifs() compare_columns(c(0.5, 0.1, 0.1, 0.2), c(0.7, 0.1, 0.1, 0.1), "PCC") ####################################################################### ## consensus_to_ppm ## Do the opposite of get_consensus. Note that loss of information is ## inevitable. Generates a sequence matrix. sapply(c("A", "G", "T", "B"), consensus_to_ppm) ####################################################################### ## consensus_to_ppmAA ## Do the opposite of get_consensusAA and generate a motif matrix. sapply(c("V", "A", "L"), consensus_to_ppmAA) ####################################################################### ## get_consensus ## Get a consensus string from a DNA/RNA motif. m <- create_motif()["motif"] apply(m, 2, get_consensus) ####################################################################### ## get_consensusAA ## Get a consensus string from an amino acid motif. Unless each position ## is clearly dominated by a single amino acid, the resulting string will ## likely be useless. m <- create_motif(alphabet = "AA")["motif"] apply(m, 2, get_consensusAA, type = "PPM") ####################################################################### ## get_match ## Get all possible motif matches above input score get_matches(examplemotif, 0) get_matches(examplemotif0, 0, allow.nonfinite = TRUE) ####################################################################### ## get_scores ## Get all possible scores for a motif length(get_scores(examplemotif)) get_scores(examplemotif) get_scores(examplemotif0, allow.nonfinite = TRUE) ####################################################################### ## icm_to_ppm ## Do the opposite of ppm_to_icm. m <- create_motif(type = "ICM")["motif"] apply(m, 2, icm_to_ppm) ####################################################################### ## motif_range ## Calculate the range of possible logodds scores for a motif motif_range(examplemotif) motif_range(examplemotif, allow.nonfinite = TRUE) ####################################################################### ## motif_score ## Calculate motif score from different thresholds m <- normalize(examplemotif) motif_score(m, c(0, 0.8, 1)) motif_score(examplemotif0, c(0, 0.8, 1), allow.nonfinite = TRUE, threshold.type = "fromzero") ####################################################################### ## log_string_pval ## Get the log of a string-formatted p-value log_string_pval("1e-200") ####################################################################### ## pcm_to_ppm ## Go from a count type motif to a probability type motif. m <- create_motif(type = "PCM", nsites = 50)["motif"] apply(m, 2, pcm_to_ppm, pseudocount = 1) ####################################################################### ## position_icscore ## Similar to ppm_to_icm, except this calculates the position sum. m <- create_motif()["motif"] apply(m, 2, position_icscore, type = "PPM", bkg = rep(0.25, 4)) ####################################################################### ## ppm_to_icm ## Convert one column from a probability type motif to an information ## content type motif. m <- create_motif(nsites = 100, pseudocount = 0.8)["motif"] apply(m, 2, ppm_to_icm, nsites = 100, bkg = rep(0.25, 4)) ####################################################################### ## ppm_to_pcm ## Do the opposite of pcm_to_ppm. m <- create_motif()["motif"] apply(m, 2, ppm_to_pcm, nsites = 50) ####################################################################### ## ppm_to_pwm ## Go from a probability type motif to a weight type motif. m <- create_motif()["motif"] apply(m, 2, ppm_to_pwm, nsites = 100, bkg = rep(0.25, 4)) ####################################################################### ## prob_match, prob_match_bkg ## Calculate probability of a particular match based on background ## frequencies prob_match(examplemotif, "TATATAT") ## Since this motif has a uniform background, the probability of ## finding any motif hit within the sequence is equal prob_match(examplemotif, "TATATAG") m <- examplemotif m["bkg"] <- c(0.3, 0.2, 0.2, 0.3) prob_match(m, "TATATAT") ## The prob_match_bkg alternative allows you to simply pass along the ## background frequencies prob_match_bkg(c(A=0.3, C=0.2, G=0.2, T=0.3), c("TATATAT", "TATATAG")) ####################################################################### ## pwm_to_ppm ## Do the opposite of ppm_to_pwm. m <- create_motif(type = "PWM")["motif"] apply(m, 2, pwm_to_ppm, bkg = rep(0.25, 4)) ####################################################################### ## Note that not all type conversions can be done directly; for those ## type conversions which are unavailable, universalmotif just chains ## together others (i.e. from PCM -> ICM => pcm_to_ppm -> ppm_to_icm) ####################################################################### ## round_motif ## Round down letter scores to 0 m <- create_motif() ## Remove letters from positions which are less than 5% of the total ## position: round_motif(m, pct.tolerance = 0.05) ####################################################################### ## score_match ## Calculate score of a particular match score_match(examplemotif, "TATATAT") score_match(examplemotif, "TATATAG") score_match(examplemotif0, "TATATAT", allow.nonfinite = TRUE) score_match(examplemotif0, "TATATAG", allow.nonfinite = TRUE) ####################################################################### ## summarise_motifs ## Create a data.frame of information based on a list of motifs. m1 <- create_motif() m2 <- create_motif() m3 <- create_motif() summarise_motifs(list(m1, m2, m3)) ####################################################################### ## ungap ## Unset motif's gap status. Does not delete actual gap data unless ## delete = TRUE. m <- create_motif() m <- add_gap(m, 3, 2, 4) m <- ungap(m) # Restore gap data: m <- add_gap(m)
Sequence-related utility functions.
calc_complexity(string, complexity.method = c("WoottonFederhen", "WoottonFederhenFast", "Trifonov", "TrifonovFast", "DUST"), alph = NULL, trifonov.max.word.size = 7) calc_windows(n, window = 1, overlap = 0, return.incomp = TRUE) count_klets(string, k = 1, alph) get_klets(lets, k = 1) mask_ranges(seqs, ranges, letter = "-") mask_seqs(seqs, pattern, RC = FALSE, letter = "-") meme_alph(core, file = stdout(), complements = NULL, ambiguity = NULL, like = NULL, alph.name = NULL, letter.names = NULL, colours = NULL) shuffle_string(string, k = 1, method = c("euler", "linear", "markov"), rng.seed = sample.int(10000, 1)) slide_fun(string, FUN, FUN.VALUE, window = 1, overlap = 0, return.incomp = TRUE) window_string(string, window = 1, overlap = 0, return.incomp = TRUE, nthreads = 1)calc_complexity(string, complexity.method = c("WoottonFederhen", "WoottonFederhenFast", "Trifonov", "TrifonovFast", "DUST"), alph = NULL, trifonov.max.word.size = 7) calc_windows(n, window = 1, overlap = 0, return.incomp = TRUE) count_klets(string, k = 1, alph) get_klets(lets, k = 1) mask_ranges(seqs, ranges, letter = "-") mask_seqs(seqs, pattern, RC = FALSE, letter = "-") meme_alph(core, file = stdout(), complements = NULL, ambiguity = NULL, like = NULL, alph.name = NULL, letter.names = NULL, colours = NULL) shuffle_string(string, k = 1, method = c("euler", "linear", "markov"), rng.seed = sample.int(10000, 1)) slide_fun(string, FUN, FUN.VALUE, window = 1, overlap = 0, return.incomp = TRUE) window_string(string, window = 1, overlap = 0, return.incomp = TRUE, nthreads = 1)
string |
|
complexity.method |
|
alph |
|
trifonov.max.word.size |
|
n |
|
window |
|
overlap |
|
return.incomp |
|
k |
|
lets |
|
seqs |
|
ranges |
|
letter |
|
pattern |
|
RC |
|
core |
|
file |
Output file. |
complements |
|
ambiguity |
|
like |
|
alph.name |
|
letter.names |
|
colours |
|
method |
|
rng.seed |
|
FUN |
|
FUN.VALUE |
The expected return type for |
nthreads |
|
For calc_complexity(): A vector of numeric values.
For calc_windows(): A data.frame with columns start and stop.
For count_klets(): A data.frame with columns lets and counts.
For get_klets(): A character vector of k-lets.
For mask_ranges(): The masked XStringSet object.
For mask_seqs(): The masked XStringSet object.
For meme_alph(): NULL, invisibly.
For shuffle_string(): A single character string.
For slide_fun(): A vector with type FUN.VALUE.
For window_string(): A character vector.
Benjamin Jean-Marie Tremblay, [email protected]
create_sequences(), get_bkg(), sequence_complexity(),
shuffle_sequences()
####################################################################### ## calc_complexity ## Calculate complexity for abitrary strings calc_complexity("GTGCCCCGCGGGAACCCCGC", c = "WoottonFederhen") calc_complexity("GTGCCCCGCGGGAACCCCGC", c = "WoottonFederhenFast") calc_complexity("GTGCCCCGCGGGAACCCCGC", c = "Trifonov") calc_complexity("GTGCCCCGCGGGAACCCCGC", c = "TrifonovFast") calc_complexity("GTGCCCCGCGGGAACCCCGC", c = "DUST") ####################################################################### ## calc_windows ## Calculate window coordinates for any value 'n'. calc_windows(100, 10, 5) ####################################################################### ## count_klets ## Count k-lets for any string of characters count_klets("GCAAATGTACGCAGGGCCGA", k = 2) ## The default 'k' value (1) counts individual letters count_klets("GCAAATGTACGCAGGGCCGA") ####################################################################### ## get_klets ## Generate all possible k-lets for a set of characters get_klets(c("A", "C", "G", "T"), 3) ## Note that each element in 'lets' is considered a single unit; ## see: get_klets(c("AA", "B"), k = 2) ####################################################################### ## mask_ranges ## Mask arbitrary ranges if (requireNamespace("GenomicRanges", quiet = TRUE)) { ranges <- GenomicRanges::GRanges("A", IRanges::IRanges(1, 5)) seq <- Biostrings::DNAStringSet(c(A = "ATGACTGATTACTTATA")) mask_ranges(seq, ranges, "-") } ####################################################################### ## mask_seqs ## Mask repetitive seqeuences data(ArabidopsisPromoters) mask_seqs(ArabidopsisPromoters, "AAAAAA") ####################################################################### ## meme_alph ## Create MEME custom alphabet definition files meme_alph("ACm", complements = "TGM", alph.name = "MethDNA", letter.names = c(A = "Adenine", C = "Cytosine", G = "Guanine", T = "Thymine", m = "Methylcytosine", M = "mC:Guanine"), like = "DNA", ambiguity = c(N = "ACGTmM")) ####################################################################### ## shuffle_string ## Shuffle any string of characters shuffle_string("ASDADASDASDASD", k = 1) ####################################################################### ## slide_fun ## Apply a function to a character vector along sliding windows FUN <- function(x) grepl("[GC]", x) data.frame( Window = window_string("ATGCATCTATGCA", 2, 1), HasGC = slide_fun("ATGCATCTATGCA", FUN, logical(1), 2, 1) ) ####################################################################### ## window_string ## Get sliding windows for a string of characters window_string("ABCDEFGHIJ", 2, 1)####################################################################### ## calc_complexity ## Calculate complexity for abitrary strings calc_complexity("GTGCCCCGCGGGAACCCCGC", c = "WoottonFederhen") calc_complexity("GTGCCCCGCGGGAACCCCGC", c = "WoottonFederhenFast") calc_complexity("GTGCCCCGCGGGAACCCCGC", c = "Trifonov") calc_complexity("GTGCCCCGCGGGAACCCCGC", c = "TrifonovFast") calc_complexity("GTGCCCCGCGGGAACCCCGC", c = "DUST") ####################################################################### ## calc_windows ## Calculate window coordinates for any value 'n'. calc_windows(100, 10, 5) ####################################################################### ## count_klets ## Count k-lets for any string of characters count_klets("GCAAATGTACGCAGGGCCGA", k = 2) ## The default 'k' value (1) counts individual letters count_klets("GCAAATGTACGCAGGGCCGA") ####################################################################### ## get_klets ## Generate all possible k-lets for a set of characters get_klets(c("A", "C", "G", "T"), 3) ## Note that each element in 'lets' is considered a single unit; ## see: get_klets(c("AA", "B"), k = 2) ####################################################################### ## mask_ranges ## Mask arbitrary ranges if (requireNamespace("GenomicRanges", quiet = TRUE)) { ranges <- GenomicRanges::GRanges("A", IRanges::IRanges(1, 5)) seq <- Biostrings::DNAStringSet(c(A = "ATGACTGATTACTTATA")) mask_ranges(seq, ranges, "-") } ####################################################################### ## mask_seqs ## Mask repetitive seqeuences data(ArabidopsisPromoters) mask_seqs(ArabidopsisPromoters, "AAAAAA") ####################################################################### ## meme_alph ## Create MEME custom alphabet definition files meme_alph("ACm", complements = "TGM", alph.name = "MethDNA", letter.names = c(A = "Adenine", C = "Cytosine", G = "Guanine", T = "Thymine", m = "Methylcytosine", M = "mC:Guanine"), like = "DNA", ambiguity = c(N = "ACGTmM")) ####################################################################### ## shuffle_string ## Shuffle any string of characters shuffle_string("ASDADASDASDASD", k = 1) ####################################################################### ## slide_fun ## Apply a function to a character vector along sliding windows FUN <- function(x) grepl("[GC]", x) data.frame( Window = window_string("ATGCATCTATGCA", 2, 1), HasGC = slide_fun("ATGCATCTATGCA", FUN, logical(1), 2, 1) ) ####################################################################### ## window_string ## Get sliding windows for a string of characters window_string("ABCDEFGHIJ", 2, 1)
This function provides the plotting capabilities of view_motifs() without
requiring universalmotif-class objects. Instead, it takes a numeric matrix
with row names as input. Additionally, columns can be of any height and
letters can be a mix of different character lengths.
view_logo(x, fontDF = NULL, fill = "black", colour.scheme = NULL, min.height = 0.01, x.spacer = 0.04, y.spacer = 0.01, sort.positions = FALSE, sort.positions.decreasing = TRUE, fit.to.height = NULL, flip.neg = FALSE)view_logo(x, fontDF = NULL, fill = "black", colour.scheme = NULL, min.height = 0.01, x.spacer = 0.04, y.spacer = 0.01, sort.positions = FALSE, sort.positions.decreasing = TRUE, fit.to.height = NULL, flip.neg = FALSE)
x |
A numeric matrix with row names. The row names can be a mix of different character lengths. |
fontDF |
|
fill |
|
colour.scheme |
|
min.height |
|
x.spacer |
|
y.spacer |
|
sort.positions |
|
sort.positions.decreasing |
|
fit.to.height |
|
flip.neg |
|
A ggplot object. If you wish to plot the data yourself from
polygon paths, access them using $data on the output object. The
theme theme_void() is applied to the object; apply your own theme
or adjust specific plot parameters with theme() to change this.
Benjamin Jean-Marie Tremblay, [email protected]
## Feel free to mix and match row name character lengths and column sums. data(examplemotif) toplot <- examplemotif["motif"] toplot[4] <- 2 toplot[20] <- -0.5 rownames(toplot)[1] <- "AA" view_logo(toplot)## Feel free to mix and match row name character lengths and column sums. data(examplemotif) toplot <- examplemotif["motif"] toplot[4] <- 2 toplot[20] <- -0.5 rownames(toplot)[1] <- "AA" view_logo(toplot)
Show sequence logo. If given a list of more than one motif, then the motifs
are aligned with the first in the list. For a leaner alternative
in line with compare_motifs2(), see view_motifs2().
view_motifs(motifs, use.type = "ICM", method = "ALLR", tryRC = TRUE, min.overlap = 6, min.mean.ic = 0.25, relative_entropy = FALSE, normalise.scores = FALSE, min.position.ic = 0, score.strat = "sum", return.raw = FALSE, dedup.names = TRUE, show.positions = TRUE, show.positions.once = TRUE, sort.by = c("none", "ic", "similarity"), show.names = TRUE, names.pos = c("top", "right"), use.freq = 1, colour.scheme = NULL, fontDF = NULL, min.height = 0.01, x.spacer = if (use.freq == 1) 0.04 else 0.1, y.spacer = 0.01, sort.positions = !use.type %in% c("PCM", "PPM"), sort.positions.decreasing = TRUE, text.size = 16, fit.to.height = if (use.type == "PPM") 1 else NULL, RC.text = " [RC]", flip.neg = FALSE, ...)view_motifs(motifs, use.type = "ICM", method = "ALLR", tryRC = TRUE, min.overlap = 6, min.mean.ic = 0.25, relative_entropy = FALSE, normalise.scores = FALSE, min.position.ic = 0, score.strat = "sum", return.raw = FALSE, dedup.names = TRUE, show.positions = TRUE, show.positions.once = TRUE, sort.by = c("none", "ic", "similarity"), show.names = TRUE, names.pos = c("top", "right"), use.freq = 1, colour.scheme = NULL, fontDF = NULL, min.height = 0.01, x.spacer = if (use.freq == 1) 0.04 else 0.1, y.spacer = 0.01, sort.positions = !use.type %in% c("PCM", "PPM"), sort.positions.decreasing = TRUE, text.size = 16, fit.to.height = if (use.type == "PPM") 1 else NULL, RC.text = " [RC]", flip.neg = FALSE, ...)
motifs |
See |
use.type |
|
method |
|
tryRC |
|
min.overlap |
|
min.mean.ic |
|
relative_entropy |
|
normalise.scores |
|
min.position.ic |
|
score.strat |
|
return.raw |
|
dedup.names |
|
show.positions |
|
show.positions.once |
|
sort.by |
|
show.names |
|
names.pos |
|
use.freq |
|
colour.scheme |
|
fontDF |
|
min.height |
|
x.spacer |
|
y.spacer |
|
sort.positions |
|
sort.positions.decreasing |
|
text.size |
|
fit.to.height |
|
RC.text |
|
flip.neg |
|
... |
Unused. Was previously in place to allow extra args to be given
to |
See compare_motifs() for more info on comparison parameters.
See view_logo() to plot from a numeric matrix with arbitrary values instead
of a motif object.
Note: score.strat = "a.mean" is NOT recommended, as view_motifs() will
not discriminate between two alignments with equal mean scores, even if one
alignment is longer than the other.
Note: if you want to plot the motifs yourself, you can set
return.raw=TRUE to get the numeric motif matrices and calculate
the polygon paths on your own or access the polygon path data directly from
the final ggplot object using $data.
A ggplot object. If return.raw = TRUE, a list of matrices.
Benjamin Jean-Marie Tremblay, [email protected]
compare_motifs(), add_multifreq(), view_logo()
## Plotting multifreq motifs: data(examplemotif2) view_motifs(examplemotif2, use.freq = 2) ## Generate your own letter set: ## Not run: library(gglogo) # install from CRAN first if needed fontDFtimes <- createPolygons(LETTERS, "Times", 800, scale = TRUE) view_motifs(examplemotif2, fontDF = fontDFtimes) ## Note: setting `scale = TRUE` is necessary to properly align letters ## vertically, but this has the effect of horizontally stretching out ## letters which shouldn't be stretched (such as "I"). If you need to plot ## letters which have been badly horizontally scaled, you can fix them ## manually as demonstrated here: # Retrieve the x-coordinates for the desired letter: tofix <- fontDFtimes$x[fontDFtimes$group == "I"] # Scale the letter x-coordinates: tofix <- tofix * 0.35 # Remember to center the letter around 0.5 again: tofix <- tofix + (1 - max(tofix)) / 2 # Apply the fix: fontDFtimes$x[fontDFtimes$group == "I"] <- tofix view_motifs(create_motif("AIG", alphabet = "AA"), fontDF = fontDFtimes) ## End(Not run)## Plotting multifreq motifs: data(examplemotif2) view_motifs(examplemotif2, use.freq = 2) ## Generate your own letter set: ## Not run: library(gglogo) # install from CRAN first if needed fontDFtimes <- createPolygons(LETTERS, "Times", 800, scale = TRUE) view_motifs(examplemotif2, fontDF = fontDFtimes) ## Note: setting `scale = TRUE` is necessary to properly align letters ## vertically, but this has the effect of horizontally stretching out ## letters which shouldn't be stretched (such as "I"). If you need to plot ## letters which have been badly horizontally scaled, you can fix them ## manually as demonstrated here: # Retrieve the x-coordinates for the desired letter: tofix <- fontDFtimes$x[fontDFtimes$group == "I"] # Scale the letter x-coordinates: tofix <- tofix * 0.35 # Remember to center the letter around 0.5 again: tofix <- tofix + (1 - max(tofix)) / 2 # Apply the fix: fontDFtimes$x[fontDFtimes$group == "I"] <- tofix view_motifs(create_motif("AIG", alphabet = "AA"), fontDF = fontDFtimes) ## End(Not run)
compare_motifs2() alignment (v2).view_motifs2() is the leaner counterpart of view_motifs(): it aligns
the input motifs using the same Pearson-correlation backend used by
compare_motifs2() and merge_motifs2().
The rendering (logo polygons, facet_wrap() layout, sort.by permutation,
names.pos placement) is identical to view_motifs().
view_motifs2(motifs, use.type = "ICM", tryRC = TRUE, min.overlap = 6, relative_entropy = FALSE, nthreads = 1, return.raw = FALSE, dedup.names = TRUE, show.positions = TRUE, show.positions.once = TRUE, sort.by = c("none", "ic", "similarity"), show.names = TRUE, names.pos = c("top", "right"), use.freq = 1, colour.scheme = NULL, fontDF = NULL, min.height = 0.01, x.spacer = if (use.freq == 1) 0.04 else 0.1, y.spacer = 0.01, sort.positions = !use.type %in% c("PCM", "PPM"), sort.positions.decreasing = TRUE, text.size = 16, fit.to.height = if (use.type == "PPM") 1 else NULL, RC.text = " [RC]", flip.neg = FALSE)view_motifs2(motifs, use.type = "ICM", tryRC = TRUE, min.overlap = 6, relative_entropy = FALSE, nthreads = 1, return.raw = FALSE, dedup.names = TRUE, show.positions = TRUE, show.positions.once = TRUE, sort.by = c("none", "ic", "similarity"), show.names = TRUE, names.pos = c("top", "right"), use.freq = 1, colour.scheme = NULL, fontDF = NULL, min.height = 0.01, x.spacer = if (use.freq == 1) 0.04 else 0.1, y.spacer = 0.01, sort.positions = !use.type %in% c("PCM", "PPM"), sort.positions.decreasing = TRUE, text.size = 16, fit.to.height = if (use.type == "PPM") 1 else NULL, RC.text = " [RC]", flip.neg = FALSE)
motifs |
See |
use.type |
|
tryRC |
|
min.overlap |
|
relative_entropy |
|
nthreads |
|
return.raw |
|
dedup.names |
|
show.positions |
|
show.positions.once |
|
sort.by |
|
show.names |
|
names.pos |
|
use.freq |
|
colour.scheme |
|
fontDF |
|
min.height |
|
x.spacer |
|
y.spacer |
|
sort.positions |
|
sort.positions.decreasing |
|
text.size |
|
fit.to.height |
|
RC.text |
|
flip.neg |
|
DNA and RNA only. For amino-acid or custom-alphabet motifs use
view_motifs().
The alignment step uses the same
backend as merge_motifs2() and merge_similar2(). The
highest-information-content input motif is chosen as the anchor,
every other motif is aligned against it once, motifs whose best
alignment lands on the reverse complement strand are
reverse-complemented before display, and all motifs are placed in
a shared column frame with zero-padding on either side as needed.
The padded matrices then flow through the same logo-rendering
pipeline used by view_motifs().
See view_motifs() for the original version, which supports
Euclidean / similarity metrics and arbitrary alphabets.
A ggplot object. If return.raw = TRUE, a list of
matrices.
Benjamin Jean-Marie Tremblay, [email protected]
view_motifs(), view_logo(), compare_motifs2(),
merge_motifs2(), merge_similar2()
## Not run: library(universalmotif) m1 <- create_motif("TTGACATA", name = "a") m2 <- create_motif("CTTGACAT", name = "b") m3 <- create_motif("TGACATAT", name = "c") view_motifs2(list(m1, m2, m3)) view_motifs2(list(m1, m2, m3), sort.by = "similarity", names.pos = "right") ## End(Not run)## Not run: library(universalmotif) m1 <- create_motif("TTGACATA", name = "a") m2 <- create_motif("CTTGACAT", name = "b") m3 <- create_motif("TGACATAT", name = "c") view_motifs2(list(m1, m2, m3)) view_motifs2(list(m1, m2, m3), sort.by = "similarity", names.pos = "right") ## End(Not run)
Convert DNA motifs to HOMER format and write to file. See http://homer.ucsd.edu/homer/motif/.
write_homer(motifs, file, logodds_threshold = NULL, overwrite = FALSE, append = FALSE, threshold = 0.8, threshold.type = c("logodds", "logodds.abs", "pvalue"))write_homer(motifs, file, logodds_threshold = NULL, overwrite = FALSE, append = FALSE, threshold = 0.8, threshold.type = c("logodds", "logodds.abs", "pvalue"))
motifs |
See |
file |
|
logodds_threshold |
Deprecated. If set, |
overwrite |
|
append |
|
threshold |
|
threshold.type |
|
NULL, invisibly.
Benjamin Jean-Marie Tremblay, [email protected]
Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK (2010). “Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities.” Molecular Cell, 38, 576-589.
Other write_motifs:
write_jaspar(),
write_matrix(),
write_meme(),
write_motifs(),
write_transfac()
motif <- create_motif() write_homer(motif, tempfile())motif <- create_motif() write_homer(motif, tempfile())
Convert motifs to JASPAR format and write to file. See http://jaspar.genereg.net/.
write_jaspar(motifs, file, overwrite = FALSE, append = FALSE)write_jaspar(motifs, file, overwrite = FALSE, append = FALSE)
motifs |
See |
file |
|
overwrite |
|
append |
|
NULL, invisibly.
Benjamin Jean-Marie Tremblay, [email protected]
Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, Bessy A, Cheneby J, Kulkarni SR, Tan G, Baranasic D, Arenillas DJ, Sandelin A, Vandepoele K, Lenhard B, Ballester B, Wasserman WW, Parcy F, Mathelier A (2018). “JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework.” Nucleic Acids Research, 46, D260-D266.
Other write_motifs:
write_homer(),
write_matrix(),
write_meme(),
write_motifs(),
write_transfac()
transfac <- read_transfac(system.file("extdata", "transfac.txt", package = "universalmotif")) write_jaspar(transfac, tempfile())transfac <- read_transfac(system.file("extdata", "transfac.txt", package = "universalmotif")) write_jaspar(transfac, tempfile())
Write motifs as simple matrices with optional headers to file.
write_matrix(motifs, file, positions = "columns", rownames = FALSE, type, sep = "", headers = TRUE, overwrite = FALSE, append = FALSE, digits = 6)write_matrix(motifs, file, positions = "columns", rownames = FALSE, type, sep = "", headers = TRUE, overwrite = FALSE, append = FALSE, digits = 6)
motifs |
See |
file |
|
positions |
|
rownames |
|
type |
|
sep |
|
headers |
|
overwrite |
|
append |
|
digits |
|
NULL, invisibly.
Benjamin Jean-Marie Tremblay, [email protected]
Other write_motifs:
write_homer(),
write_jaspar(),
write_meme(),
write_motifs(),
write_transfac()
motif <- create_motif() write_matrix(motif, tempfile(), headers = ">") ## CWM round-trip: write a CWM with its raw signed values, then ## read it back with type = "CWM" preserved. cwm.mat <- matrix(c( 0.1, -0.2, 0.8, -0.1, -0.3, 0.9, -0.2, 0.1, 0.7, -0.1, -0.1, -0.4, -0.2, -0.1, -0.1, 0.9), nrow = 4, byrow = FALSE, dimnames = list(c("A","C","G","T"), NULL)) cwm <- create_motif(cwm.mat, type = "CWM", name = "modisco_demo") f <- tempfile() write_matrix(cwm, f, headers = ">", rownames = TRUE) re <- read_matrix(f, headers = ">", rownames = TRUE, type = "CWM") re["type"]motif <- create_motif() write_matrix(motif, tempfile(), headers = ">") ## CWM round-trip: write a CWM with its raw signed values, then ## read it back with type = "CWM" preserved. cwm.mat <- matrix(c( 0.1, -0.2, 0.8, -0.1, -0.3, 0.9, -0.2, 0.1, 0.7, -0.1, -0.1, -0.4, -0.2, -0.1, -0.1, 0.9), nrow = 4, byrow = FALSE, dimnames = list(c("A","C","G","T"), NULL)) cwm <- create_motif(cwm.mat, type = "CWM", name = "modisco_demo") f <- tempfile() write_matrix(cwm, f, headers = ">", rownames = TRUE) re <- read_matrix(f, headers = ">", rownames = TRUE, type = "CWM") re["type"]
Convert motifs to minimal MEME format and write to file. See http://meme-suite.org/doc/meme-format.html.
write_meme(motifs, file, version = 5, bkg, strand, overwrite = FALSE, append = FALSE, CWM = FALSE)write_meme(motifs, file, version = 5, bkg, strand, overwrite = FALSE, append = FALSE, CWM = FALSE)
motifs |
See |
file |
|
version |
|
bkg |
|
strand |
|
overwrite |
|
append |
|
CWM |
|
NULL, invisibly.
Benjamin Jean-Marie Tremblay, [email protected]
Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS (2009). “MEME SUITE: tools for motif discovery and searching.” Nucleic Acids Research, 37, W202-W208.
Other write_motifs:
write_homer(),
write_jaspar(),
write_matrix(),
write_motifs(),
write_transfac()
transfac <- read_transfac(system.file("extdata", "transfac.txt", package = "universalmotif")) write_meme(transfac, tempfile()) ## CWM round-trip: write a CWM with its raw signed values, then ## recover it with read_meme(..., CWM = TRUE). cwm.mat <- matrix(c( 0.1, -0.2, 0.8, -0.1, -0.3, 0.9, -0.2, 0.1, 0.7, -0.1, -0.1, -0.4, -0.2, -0.1, -0.1, 0.9), nrow = 4, byrow = FALSE, dimnames = list(c("A","C","G","T"), NULL)) cwm <- create_motif(cwm.mat, type = "CWM", name = "modisco_demo") f <- tempfile(fileext = ".meme") write_meme(cwm, f, CWM = TRUE, overwrite = TRUE) re <- read_meme(f, CWM = TRUE) re["type"]transfac <- read_transfac(system.file("extdata", "transfac.txt", package = "universalmotif")) write_meme(transfac, tempfile()) ## CWM round-trip: write a CWM with its raw signed values, then ## recover it with read_meme(..., CWM = TRUE). cwm.mat <- matrix(c( 0.1, -0.2, 0.8, -0.1, -0.3, 0.9, -0.2, 0.1, 0.7, -0.1, -0.1, -0.4, -0.2, -0.1, -0.1, 0.9), nrow = 4, byrow = FALSE, dimnames = list(c("A","C","G","T"), NULL)) cwm <- create_motif(cwm.mat, type = "CWM", name = "modisco_demo") f <- tempfile(fileext = ".meme") write_meme(cwm, f, CWM = TRUE, overwrite = TRUE) re <- read_meme(f, CWM = TRUE) re["type"]
Write motifs as universalmotif objects to file. For optimal storage of
universalmotif class motifs, consider using saveRDS() and
readRDS(). Currently the universalmotif format is YAML-based, but
this is subject to change.
write_motifs(motifs, file, minimal = FALSE, multifreq = TRUE, progress = FALSE, overwrite = FALSE, append = FALSE, BP = FALSE)write_motifs(motifs, file, minimal = FALSE, multifreq = TRUE, progress = FALSE, overwrite = FALSE, append = FALSE, BP = FALSE)
motifs |
See |
file |
|
minimal |
|
multifreq |
|
progress |
|
overwrite |
|
append |
|
BP |
|
NULL, invisibly.
Benjamin Jean-Marie Tremblay, [email protected]
Other write_motifs:
write_homer(),
write_jaspar(),
write_matrix(),
write_meme(),
write_transfac()
Convert motifs to TRANSFAC format and write to file.
write_transfac(motifs, file, overwrite = FALSE, append = FALSE, name.tag = "ID", altname.tag = "NA")write_transfac(motifs, file, overwrite = FALSE, append = FALSE, name.tag = "ID", altname.tag = "NA")
motifs |
See |
file |
|
overwrite |
|
append |
|
name.tag |
|
altname.tag |
|
If the family slot of a motif is not empty, then its contents will included using the HC tag. Similarly for the organism slot using the tag OS. The default name and alternate name tags are ID and NA, respectively, though these can be set manually.
NULL, invisibly.
Benjamin Jean-Marie Tremblay, [email protected]
Wingender E, Dietze P, Karas H, Knuppel R (1996). “TRANSFAC: A Database on Transcription Factors and Their DNA Binding Sites.” Nucleic Acids Research, 24, 238-241.
Other write_motifs:
write_homer(),
write_jaspar(),
write_matrix(),
write_meme(),
write_motifs()
jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) write_transfac(jaspar, tempfile())jaspar <- read_jaspar(system.file("extdata", "jaspar.txt", package = "universalmotif")) write_transfac(jaspar, tempfile())