Title: | motif matching, comparison, and de novo discovery using the MEME Suite |
---|---|
Description: | A seamless interface to the MEME Suite family of tools for motif analysis. 'memes' provides data aware utilities for using GRanges objects as entrypoints to motif analysis, data structures for examining & editing motif lists, and novel data visualizations. 'memes' functions and data structures are amenable to both base R and tidyverse workflows. |
Authors: | Spencer Nystrom [aut, cre, cph] |
Maintainer: | Spencer Nystrom <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.15.0 |
Built: | 2024-10-30 08:44:37 UTC |
Source: | https://github.com/bioc/memes |
Add nucleic acid sequence of regions to metadata column
add_sequence(ranges, genome, name = "sequence")
add_sequence(ranges, genome, name = "sequence")
ranges |
GRanges object |
genome |
BSgenome object or any other valid input to 'Biostrings::getSeq()' (Do 'showMethods(Biostrings::getSeq)' to show valid types) |
name |
name of metadata column to hold sequence information (default: "sequence"). Note, this will overwrite existing columns without warning if the name already exists. |
'ranges' with new metadata column named "sequence" (or another value passed to 'name') holding the DNA or RNA sequence from 'genome'
data(example_peaks, package = "memes") dm.genome <- BSgenome.Dmelanogaster.UCSC.dm3::BSgenome.Dmelanogaster.UCSC.dm3 add_sequence(example_peaks, dm.genome)
data(example_peaks, package = "memes") dm.genome <- BSgenome.Dmelanogaster.UCSC.dm3::BSgenome.Dmelanogaster.UCSC.dm3 add_sequence(example_peaks, dm.genome)
This helper function allows the user to visualize the distribution of their AME results data on different scales to help understand the implications of using different values in 'plot_ame_heatmap()'
ame_compare_heatmap_methods(ame, group, value = -log10(adj.pvalue))
ame_compare_heatmap_methods(ame, group, value = -log10(adj.pvalue))
ame |
ame results data.frame |
group |
optional name of group to split results by |
value |
value to compare to "normalize" method (default: -log10(adj.pvalue)) |
a cowplot 2 panel plot comparing the distribution of 'value' to normalized rank values
data("example_ame", package = "memes") ame_compare_heatmap_methods(example_ame$Decreasing) ame_compare_heatmap_methods(dplyr::bind_rows(example_ame, .id = "type"), type)
data("example_ame", package = "memes") ame_compare_heatmap_methods(example_ame$Decreasing) ame_compare_heatmap_methods(dplyr::bind_rows(example_ame, .id = "type"), type)
In order to use the run* family of functions, memes must detect a local
install of the MEME Suite. MEME is installed in a directory named meme/bin/
which can be located anywhere on the filesystem, but is typically found in ~/meme/bin
.
If the MEME Suite is installed at ~/meme/bin
, memes can autodetect the install. However,
in the case that the MEME Suite is found at a nonstandard location, the user
may specify the location of their meme/bin in three ways:
check_meme_install(meme_path = NULL)
check_meme_install(meme_path = NULL)
meme_path |
path to "meme/bin" (if unset will search |
provide the full path to meme/bin
to the meme_path
argument to each run*
function.
set the meme_bin
option using options(meme_bin = "path/to/meme/bin")
once per R session.
set the MEME_BIN
environment variable either in .Renviron
or ~/.bashrc
with the path to meme/bin
To aid the user in determining if memes can detect their meme/bin
install,
check_meme_install()
will search the aforementioned locations for a valid
meme/bin
, returning green checks for each detected tool, or red X's for
undetected tools. Alternatively, users can run meme_is_installed()
to get a
boolean value indicating whether their MEME Suite can be detected.
check_meme_install()
searches using the following heirarchy. This heirarchy
mimics how all run*
functions search for meme/bin
, thus the paths printed
from check_meme_install()
will indicate the paths used by each run*
function. The heirarchy is as follows:
the meme_path
function argument if set
the meme_bin
option
the MEME_BIN
environment variable
the default location at ~/meme/bin
message indicating which MEME utilities are installed and their location on disk
check_meme_install()
check_meme_install()
Convenience function for dropping all columns created by runTomTom prefixed by "best_match_" and the "best_db_name" column. Keeps the "tomtom" data.frame column. Can be useful if you want to unnest the 'tomtom' data without propagating these columns.
drop_best_match(res)
drop_best_match(res)
res |
results of runTomTom |
'res' without the tomtom best_match_ columns
data("example_dreme_tomtom") names(example_dreme_tomtom) names(drop_best_match(example_dreme_tomtom))
data("example_dreme_tomtom") names(example_dreme_tomtom) names(drop_best_match(example_dreme_tomtom))
Result when running AME using 100bp window around 'example_chip_summits' for "Increasing" and "Decreasing" sites, using "Static" as background.
example_ame
example_ame
A list object of AME results data.frames
'runAme()' Results object for Increasing sites vs Static sites
'runAme()' Results object for Decreasing sites vs Static sites
# Data can be combined into 1 large data.frame using: # where the "behavior" column will hold the "Increasing"/"Decreasing" information dplyr::bind_rows(example_ame, .id = "behavior")
# Data can be combined into 1 large data.frame using: # where the "behavior" column will hold the "Increasing"/"Decreasing" information dplyr::bind_rows(example_ame, .id = "behavior")
AME was run for "ectopic", "entopic", and "orphan" sites using shuffled background.
example_ame_large
example_ame_large
a list of runAme() results data.frames
see 'vignette("integrative_analysis", package = "memes")' for details.
# Data can be combined into 1 large data.frame using: dplyr::bind_rows(example_ame_large, .id = "binding_type")
# Data can be combined into 1 large data.frame using: dplyr::bind_rows(example_ame_large, .id = "binding_type")
ChIP-seq summit positions on Drosophila melanogaster chromosome 3 for the transcription factor E93 using a union set of peaks from third-instar larval wings ("Early") and 24 hour Pupal ("Late") wings.
example_chip_summits
example_chip_summits
A GRanges object of ChIP summit position with 2 metadata columns
Binding profiles between Early and Late E93 were compared. Peaks are annotated as whether they are bound in Early wings only ("ectopic"), both Early and Late wings ("entopic"), or only bound in Late wings ("orphan").
change in chromatin accessibility in response to E93 binding: Increasing, Decreasing, or Static
E93 is a transcription factor normally present only in Late wings. An experimental perturbation precociously expressed E93 during Early stages. Binding profiles between Early and Late E93 were compared. Peaks are annotated as whether they are bound in Early wings only ("ectopic"), both Early and Late wings ("entopic"), or only bound in Late wings ("orphan").
DNA elements can be made "open" or "closed" in response to binding of transcription factors like E93. Accessibility of E93 binding sites before and after E93 expression was measured using FAIRE-seq. ChIP peaks are annotated by how their accessibility changes in response to E93 binding . Peaks can become more open ("Increasing"), more closed ("Decreasing"), or unchanged in accessibility ("Static"). These experiments demonstrate a causal relationship between E93 binding and both opening and closing of DNA elements.
https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE141738&format=file&file=GSE141738
Result when running dreme using 100bp window around example_chip_summits using "Decreasing" sites as foreground, and "Static" sites as background.
example_dreme
example_dreme
a runDreme results data.frame
See vignette("integrative_analysis", package = "memes") for details
example_dreme_by_binding
example_dreme_by_binding
a runDreme results data.frame
Result when running 'runTomTom(example_dreme)' using FlyFactorSurvey as database.
example_dreme_tomtom
example_dreme_tomtom
a runDreme results data.frame with runTomTom results columns
Run using 100bp windows around 'example_chip_summits', using E93 motif as database.
example_fimo
example_fimo
A GRanges object of E93 motif positions within 100bp windows of 'example_chip_summits'
10 ChIP-seq peaks from GSE141738
example_peaks
example_peaks
An object of class GRanges
of length 10.
A small number of transcription factor ChIP-seq peaks as a GRanges object, taken from [GSE141738](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE141738)
https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE141738&format=file&file=GSE141738
These data are a subset of RNAseq counts from the full FPKM table in GSE97956. Includes counts for all Drosophila transcription factors and ~200 other random genes.
example_rnaseq
example_rnaseq
A data.frame of RNAseq FPKM data
The FlyBase gene symbol
Developmental stage of RNAseq experiment
RNAseq count in Fragments per Kilobase Million (FPKM)
"https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE97956&format=file&file=GSE97956
Result when running 'runTomTom(example_dreme$motif)' using FlyFactorSurvey as database
example_tomtom
example_tomtom
a data.frame
Although TomTom assigns a best match, this is not always the most biologically relevant match. In these cases, it is useful to "force" the best match to another lower ranked, but still significant TomTom match. This function allows users to select a new best match motif from the set of lower-ranked matches in the 'tomtom' list column. This function also reorders the 'tomtom' data.frame such that the forced match is the first row of the 'tomtom' entry.
force_best_match(res, matches)
force_best_match(res, matches)
res |
results from runTomTom |
matches |
named vector where name is the input motif name, and value is the match_name to use as the new best match |
'res' with new best_* columns and re-ranked tomtom data in the 'tomtom' list column for the updated entries.
[update_best_match()]
if (meme_is_installed()){ motif <- universalmotif::create_motif("CCRAAAW", name = "example_motif") db <- system.file("extdata", "flyFactorSurvey_cleaned.meme", package = "memes") res <- runTomTom(motif, database = db) res$best_match_name res2 <- force_best_match(res, c("example_motif" = "Eip93F_SANGER_10")) res2$best_match_name }
if (meme_is_installed()){ motif <- universalmotif::create_motif("CCRAAAW", name = "example_motif") db <- system.file("extdata", "flyFactorSurvey_cleaned.meme", package = "memes") res <- runTomTom(motif, database = db) res$best_match_name res2 <- force_best_match(res, c("example_motif" = "Eip93F_SANGER_10")) res2$best_match_name }
A light wrapper around Biostrings::getSeq to return named DNAStringSets, from input genomic coordinates.
get_sequence(regions, genome, score_column, ...)
get_sequence(regions, genome, score_column, ...)
regions |
GRanges, or GRangesList object. Will also accept a data.frame as long as it can be coerced to a GRanges object, or a string in the format: "chr:start-end" (NOTE: use 1-based closed intervals, not BED format 0-based half-open intervals). |
genome |
object of any valid type in 'showMethods(Biostrings::getSeq)'. Commonly a BSgenome object, or fasta file. Used to look up sequences in regions. |
score_column |
optional name of column (in mcols() of 'regions') containing a fasta score that is added to the fasta header of each entry. Used when using [runAme()] in partitioning mode. (default: 'NULL') |
... |
additional arguments passed to Biostrings::getSeq. |
'Biostrings::DNAStringSet' object with names corresponding to genomic coordinates. If input is a list object, output will be a 'Biostrings::BStringSetList' with list names corresponding to input list names.
# using character string as coordinates # using BSgenome object for genome drosophila.genome <- BSgenome.Dmelanogaster.UCSC.dm6::BSgenome.Dmelanogaster.UCSC.dm6 get_sequence("chr2L:100-200", drosophila.genome) # using GRanges object for coordinates data(example_peaks, package = "memes") get_sequence(example_peaks, drosophila.genome)
# using character string as coordinates # using BSgenome object for genome drosophila.genome <- BSgenome.Dmelanogaster.UCSC.dm6::BSgenome.Dmelanogaster.UCSC.dm6 get_sequence("chr2L:100-200", drosophila.genome) # using GRanges object for coordinates data(example_peaks, package = "memes") get_sequence(example_peaks, drosophila.genome)
This function identifies whether any motif matrices in the input universalmotif list or universalmotif_df are identical to each other. Note: this operation is slow on large motif lists
has_duplicate_motifs(x)
has_duplicate_motifs(x)
x |
a universalmotif list or universalmotif_df |
logical value indicating presence or absence of duplicated motif matrices
motif <- universalmotif::create_motif() duplicated <- c(motif, motif) has_duplicate_motifs(duplicated)
motif <- universalmotif::create_motif() duplicated <- c(motif, motif) has_duplicate_motifs(duplicated)
This imports AME results using the "ame.tsv" output, and optionally the "sequences.tsv" output if run with "method = fisher". AME results differ based on the method used, thus this must be set on import or the column names will be incorrect.
importAme( path, method = c("fisher", "ranksum", "dmhg3", "dmhg4", "pearson", "spearman"), sequences = NULL )
importAme( path, method = c("fisher", "ranksum", "dmhg3", "dmhg4", "pearson", "spearman"), sequences = NULL )
path |
path to ame results file ("ame.tsv") |
method |
ame run method used (one of: c("fisher", "ranksum", "dmhg3", "dmhg4", "pearson", "spearman")). Default: "fisher". |
sequences |
NULL/FALSE to skip sequence import, or path to sequences file to import (only valid for method = "fisher") |
data.frame with method-specific results. See [AME results](http://meme-suite.org/doc/ame-output-format.html) webpage for more information. If sequences is set to a path to the sequences.tsv and method = "fisher", the list-column 'sequences' will be appended to resulting data.frame.
[runAme()]
ame_tsv <- system.file("extdata", "ame.tsv", package = "memes", mustWork = TRUE) importAme(ame_tsv)
ame_tsv <- system.file("extdata", "ame.tsv", package = "memes", mustWork = TRUE) importAme(ame_tsv)
Import Dreme output from previous run
importDremeXML(dreme_xml_path)
importDremeXML(dreme_xml_path)
dreme_xml_path |
path to dreme.xml file |
data.frame with statistics for each discovered motif. The 'motifs' column contains a universalmotif object representation in PCM format of each DREME motif. If no motifs are discovered, returns NULL.
[runDreme()]
dreme_xml <- system.file("extdata", "dreme.xml", package = "memes") importDremeXML(dreme_xml)
dreme_xml <- system.file("extdata", "dreme.xml", package = "memes") importDremeXML(dreme_xml)
Import FIMO results
importFimo(fimo_tsv)
importFimo(fimo_tsv)
fimo_tsv |
path to fimo.tsv output file |
GenomicRanges object for each match position. Note unless coordinates are genomic positions, each 'seqnames' entry will be the fasta header, and start/end will be the position within that sequence of the match.
fimo_tsv <- system.file("extdata", "fimo.tsv", package = "memes") importFimo(fimo_tsv)
fimo_tsv <- system.file("extdata", "fimo.tsv", package = "memes") importFimo(fimo_tsv)
This is a light wrapper around [universalmotif::read_meme()] that imports MEME results as universalmotif data.frame. If MEME is run with genomic coordinates in the fasta header, in "chr:start-end" format (base 1 indexed), the genomic coordinates of the motif match from input sequences can be parsed from the header.
importMeme(meme_txt, parse_genomic_coord = FALSE, combined_sites = FALSE)
importMeme(meme_txt, parse_genomic_coord = FALSE, combined_sites = FALSE)
meme_txt |
path to "meme.txt" output |
parse_genomic_coord |
whether to parse sequence headers into genomic coordinates for motif position information, only works if fasta files were written such that the sequence headers are in the form: "chr:start-end", or some variation of this form (delimiters can be any of: "[^[:alnum:]]+" (ie non-alphanumeric characters)), (default = FALSE). |
combined_sites |
whether to add 'combined_sites' output which contains coordinates of each sequence, the motif sequence (if 'parse_genomic_coord = TRUE'), and the 'diagram' column raw output from MEME indicating the relative locations of motifs in the sequence. |
MEME results in universalmotif data.frame format (see: [as_universalmotif_dataframe()]). 'sites_hits' is a nested data.frame column containing the position within each input sequence of matches to the identified motif.
[runMeme()] [universalmotif::read_meme()]
example_meme_txt <- system.file("extdata", "meme_full.txt", package = "universalmotif") importMeme(example_meme_txt)
example_meme_txt <- system.file("extdata", "meme_full.txt", package = "universalmotif") importMeme(example_meme_txt)
Import Streme output from previous run
importStremeXML(streme_xml_path)
importStremeXML(streme_xml_path)
streme_xml_path |
path to streme.xml file |
data.frame with statistics for each discovered motif. The 'motifs' column contains a universalmotif object representation in PCM format of each DREME motif. If no motifs are discovered, returns NULL.
[runStreme()]
streme_xml <- system.file("extdata", "streme.xml", package = "memes") importStremeXML(streme_xml)
streme_xml <- system.file("extdata", "streme.xml", package = "memes") importStremeXML(streme_xml)
Import tomtom data from previous run
importTomTomXML(tomtom_xml_path)
importTomTomXML(tomtom_xml_path)
tomtom_xml_path |
path to tomtom.xml |
tomtom list column format the 'tomtom' list column contains data.frames with the following format: - name: name of query PWM - altname: alternate name of query PWM - match_name: name of matched PWM - match_altname: alt name of matched PWM - match_pval: p-value of match - match_eval: E-value of match - match_qval: q-value of match - match_offset: number of letters the query was offset from the target match - match_strand: whether the motif was found on input strand (+) or as reverse-complement (-) - db_name: database source of matched motif - match_motif: universalmotif object containing the PWM that was matched
will return data.frame with input motifs & results for best match. 'tomtom' list column contains full tomtom data for each input motif. NOTE: if tomtom detects no matches for any input motif, currently will print a message & return NA values for 'tomtom', 'best_match_name', and 'best_match_motif'.
[runTomTom()]
tomtom_xml <- system.file("extdata", "tomtom.xml", package = "memes") importTomTomXML(tomtom_xml)
tomtom_xml <- system.file("extdata", "tomtom.xml", package = "memes") importTomTomXML(tomtom_xml)
Checks for a valid meme install using same heirarchy as check_meme_install()
.
Returns TRUE
if all supported utilities are found in the meme install
location, FALSE
if not.
meme_is_installed(path = NULL)
meme_is_installed(path = NULL)
path |
optional path to "meme/bin/". If unset, will follow the search heirarchy listed above. |
The search heirarchy is as follows:
the meme_path
function argument if set
the meme_bin
option
the MEME_BIN
environment variable
the default location at ~/meme/bin
logical(1)
indicating whether meme is installed with all supported utilities
meme_is_installed()
meme_is_installed()
This is a convienience function for re-nesting the 'tomtom' list column if the user unnests it. Additionally, it will update the best_match information based on the ranking of the resulting 'tomtom' data.frame. This avoids having out-of-date best_match information after manipulating the 'tomtom' entries.
nest_tomtom(data)
nest_tomtom(data)
data |
tomtom results data.frame after unnesting the 'tomtom' column |
**NOTE:** that the resulting columns may not be in the same order, so operations like 'identical()' before & after a nest/renest operation may fail even though the column values are unchanged.
the input data.frame with the match_* columns nested into a column named 'tomtom'
if (meme_is_installed()){ motif <- universalmotif::create_motif("CCRAAAW") db <- system.file("extdata/flyFactorSurvey_cleaned.meme", package = "memes") res <- runTomTom(motif, database = db) data <- tidyr::unnest(res, "tomtom") identical(nest_tomtom(data), res) }
if (meme_is_installed()){ motif <- universalmotif::create_motif("CCRAAAW") db <- system.file("extdata/flyFactorSurvey_cleaned.meme", package = "memes") res <- runTomTom(motif, database = db) data <- tidyr::unnest(res, "tomtom") identical(nest_tomtom(data), res) }
Plot AME heatmap clustered by similarity in detected motifs
plot_ame_heatmap( ame, id = motif_id, group = NULL, value = -log10(adj.pvalue), group_name = NULL, scale_max = NA )
plot_ame_heatmap( ame, id = motif_id, group = NULL, value = -log10(adj.pvalue), group_name = NULL, scale_max = NA )
ame |
ame results data.frame |
id |
column of motif ids to use (default: motif_id). |
group |
grouping column if comparing across multiple ame runs (optional, default: NULL). |
value |
value to display as heatmap intensity. Default: -log10(adj.pvalue). Takes function or column name as input. If set to "normalize", will use normalized rank within 'group' as the heatmap values. **If in doubt**, prefer the -log10(adj.pvalue) plot potentially in conjunction with adjusting 'scale_max'. (See "Normalized rank visualization" section below for more details on how to interpret these data) |
group_name |
when group = NULL, name to use for input regions. Ignored if group is set. |
scale_max |
max heatmap value to limit upper-value of scale. Useful if distribution of 'value's vary greatly between groups. Usually a better to tweak this option than to use value = "normalize". The cumulative distribution plot generated by 'ame_compare_heatmap_methods()' can be useful for selecting this value, try to pick a value which captures the largest fraction of hits across all groups while excluding outliers. |
Normalized rank visualization **NOTE:** The normalized rank visualization eliminates all real values related to statistical significance! Instead, this visualization represents the relative ranks of hits within an AME run, which already pass a significance threshold set during 'runAME()'. This means that even if several motifs have similar or even identical pvalues, their heatmap representation will be a different color value based on their ranked order in the results list. This also means that using the normalized rank visualization will be misleading if there are only a few AME hits; it is only worth using if the number of hits is very large (>100). Both visualizations can be useful and reveal different properties of the data to the user during data exploration. Use 'ame_compare_heatmap_methods()' to help assess differences in the two visualizations. **If in doubt**, prefer the '-log10(adj.pvalue)' representation.
Common mistake: if 'value' is set to a string that is not "normalize", it will return: "Error: Discrete value supplied to continuous scale". To use a column by name, do not quote the column name.
'ggplot' object
data("example_ame", package = "memes") # Plot a single category heatmap plot_ame_heatmap(example_ame$Decreasing) # Plot a multi category heatmap grouped_ame <- dplyr::bind_rows(example_ame, .id = "category") plot_ame_heatmap(grouped_ame, group = category)
data("example_ame", package = "memes") # Plot a single category heatmap plot_ame_heatmap(example_ame$Decreasing) # Plot a multi category heatmap grouped_ame <- dplyr::bind_rows(example_ame, .id = "category") plot_ame_heatmap(grouped_ame, group = category)
Sometimes it is useful to visualize individual motif matches in aggregate to understand how sequence variability contributes to motif matches. This function creates a heatmap where each row represents a single sequence and each column represents a position. Cells are colored by the sequence at that position. Sequences are optionally aggregated into a sequence logo aligned in register with the heatmap to visualize how sequence variability contributes to motif makeup.
plot_sequence_heatmap( sequence, title = NULL, logo = TRUE, alph = c("DNA", "RNA", "AA"), title_hjust = 0, heights = c(1, 5), legend = "none" )
plot_sequence_heatmap( sequence, title = NULL, logo = TRUE, alph = c("DNA", "RNA", "AA"), title_hjust = 0, heights = c(1, 5), legend = "none" )
sequence |
character vector of sequences, plot will be ranked in order of the sequences. Each sequence must be equal length. Alternately, sequence can be a named list in which case each plot will be titled by the names of the list entries. |
title |
title of the plot. Default: NULL. If sequence is a named list of sequences, title defaults to the list entry names. Set to NULL to override this behavior. To use a different title than the list entry name, pass a vector of names to 'title'. |
logo |
whether to include a sequence logo above the heatmap |
alph |
alphabet colorscheme to use. One of: DNA, RNA, AA. |
title_hjust |
value from 0 to 1 determining the horizontal justification of the title. Default: 0. |
heights |
ratio of logo:heatmap heights. Given as: c(logo_height, heatmap_height). Values are not absolute. Ignored when logo = FALSE. |
legend |
passed to ggplot2::theme(legend.position). Default: "none". Values can be: "none", "left", "right", "top", "bottom", or coordinates in c(x,y) format. |
a ggplot object of the sequence heatmap ranked by the order of sequences
runFimo
data(example_fimo, package = "memes") genome <- BSgenome.Dmelanogaster.UCSC.dm3::BSgenome.Dmelanogaster.UCSC.dm3 motifs <- add_sequence(example_fimo, genome) plot_sequence_heatmap(motifs$sequence) # Use on named list sequences <- list("set 1" = motifs$sequence[1:100], "set 2" = motifs$sequence[101:200]) plot_sequence_heatmap(sequences) # Use different titles for list input plot_sequence_heatmap(sequences, title = c("A", "B"))
data(example_fimo, package = "memes") genome <- BSgenome.Dmelanogaster.UCSC.dm3::BSgenome.Dmelanogaster.UCSC.dm3 motifs <- add_sequence(example_fimo, genome) plot_sequence_heatmap(motifs$sequence) # Use on named list sequences <- list("set 1" = motifs$sequence[1:100], "set 2" = motifs$sequence[101:200]) plot_sequence_heatmap(sequences) # Use different titles for list input plot_sequence_heatmap(sequences, title = c("A", "B"))
This function identifies motif matrices which are duplicated in a universalmotif list or universalmotif_df and removes them. This operation ignores motif metadata and operates by removing all entries whose motif matrices are identical. The first instance of a duplicated motif in the input list is the one returned.
remove_duplicate_motifs(x)
remove_duplicate_motifs(x)
x |
a universalmotif list or universalmotif_df |
A deduplicated list or universalmotif_df
motif <- universalmotif::create_motif() duplicated <- c(motif, motif) remove_duplicate_motifs(duplicated)
motif <- universalmotif::create_motif() duplicated <- c(motif, motif) remove_duplicate_motifs(duplicated)
AME identifies known motifs (provided by the user) that are enriched in your input sequences.
## S3 method for class 'list' runAme( input, control = "shuffle", outdir = "auto", method = "fisher", database = NULL, meme_path = NULL, sequences = FALSE, silent = TRUE, ... ) ## S3 method for class 'BStringSetList' runAme( input, control = "shuffle", outdir = "auto", method = "fisher", database = NULL, meme_path = NULL, sequences = FALSE, silent = TRUE, ... ) ## Default S3 method: runAme( input, control = "shuffle", outdir = "auto", method = "fisher", database = NULL, meme_path = NULL, sequences = FALSE, silent = TRUE, ... ) runAme( input, control = "shuffle", outdir = "auto", method = "fisher", database = NULL, meme_path = NULL, sequences = FALSE, silent = TRUE, ... )
## S3 method for class 'list' runAme( input, control = "shuffle", outdir = "auto", method = "fisher", database = NULL, meme_path = NULL, sequences = FALSE, silent = TRUE, ... ) ## S3 method for class 'BStringSetList' runAme( input, control = "shuffle", outdir = "auto", method = "fisher", database = NULL, meme_path = NULL, sequences = FALSE, silent = TRUE, ... ) ## Default S3 method: runAme( input, control = "shuffle", outdir = "auto", method = "fisher", database = NULL, meme_path = NULL, sequences = FALSE, silent = TRUE, ... ) runAme( input, control = "shuffle", outdir = "auto", method = "fisher", database = NULL, meme_path = NULL, sequences = FALSE, silent = TRUE, ... )
input |
path to fasta, or DNAstringset (optional: DNAStringSet object names contain fasta score, required for partitioning mode) |
control |
default: "shuffle", or set to
DNAstringset or path to fasta file to use those sequences in discriminative
mode. If |
outdir |
Path to output directory location to save data files. If set to "auto", will use location of input files if passing file paths, otherwise will write to a temporary directory. default: "auto" |
method |
default: fisher (allowed values: fisher, ranksum, pearson, spearman, 3dmhg, 4dmhg) |
database |
path to .meme format file, universalmotif list object, dreme results data.frame, or list() of multiple of these. If objects are assigned names in the list, that name will be used as the database id in the output. It is highly recommended you set a name if not using a file path so the database name will be informative, otherwise the position in the list will be used as the database id. For example, if the input is: list("motifs.meme", list_of_motifs), the database id's will be: "motifs.meme" and "2". If the input is list("motifs.meme", "customMotifs" = list_of_motifs), the database id's will be "motifs.meme" and "customMotifs". |
meme_path |
path to "meme/bin/" (default: |
sequences |
|
silent |
whether to suppress stdout (default: TRUE), useful for debugging. |
... |
additional arguments passed to AME (see AME Flag table below) |
AME can be run using several modes:
Discriminative mode: to discover motifs enriched relative to shuffled input, or a set of provided control sequences
Partitioning mode: in which AME uses some biological signal to identify the association between the signal and motif presence.
To read more about how AME works, see the AME Tutorial
Additional AME arguments
memes allows passing any valid flag to it's target programs via ...
. For
additional details for all valid AME arguments, see the AME Manual webpage. For convenience, a table
of valid parameters, and brief descriptions of their function are provided
below:
AME Flag | allowed values | default | description |
kmer | integer(1) |
2 | kmer frequency to preserve when shuffling control sequences |
seed | integer(1) |
1 | seed for random number generator when shuffling control sequences |
scoring | "avg", "max", "sum", "totalhits" | "avg" | Method for scoring a sequence for matches to a PWM (avg, max, sum, totalhits) |
hit_lo_fraction | numeric(1) |
0.25 | fraction of hit log odds score to exceed to be considered a "hit" |
evalue_report_threshold | numeric(1) |
10 | E-value threshold for reporting a motif as significantly enriched |
fasta_threshold | numeric(1) |
0.001 | AME will classify sequences with FASTA scores below this value as positives. Only valid when method = "fisher", poslist = "pwm", control = NA, fix_partition = NULL . |
fix_partition | numeric(1) |
NULL |
AME evaluates only the partition of the first N sequences. Only works when control = NA and poslist = "fasta" |
poslist | "pwm", "fasta" | "fasta" | When using paritioning mode (control = NA ), test thresholds on either PWM or Fasta score |
log_fscores | logical(1) |
FALSE | Convert FASTA scores into log-space (only used when method = "pearson" ) |
log_pwmscores | logical(1) |
FALSE | Convert PWM scores into log-space (only used for method = "pearson" or method = "spearman ) |
lingreg_switchxy | logical(1) |
FALSE | Make the x-points FASTA scores and y-points PWM scores (only used for method = "pearson" or method = "spearman ) |
xalph | file path | NULL(1) |
alphabet file to use if input motifs are in different alphabet than input sequences |
bfile | "motif", "motif-file", "uniform", path to file | NULL |
source of 0-order background model. If "motif" or "motif-file" 0-order letter frequencies in the first motif file are used. If "uniform" uses uniform letter frequencies. |
motif_pseudo | numeric(1) |
0.1 | Addd this pseudocount when converting from frequency matrix to log-odds matrix |
inc | character(1) |
NULL |
use only motifs with names matching this regex |
exc | character(1) |
NULL |
exclude motifs with names matching this regex |
a data.frame of AME enrichment results. If run using a BStringsSetList input, returns a list of data.frames.
If you use runAme()
in your analysis, please cite:
Robert McLeay and Timothy L. Bailey, "Motif Enrichment Analysis: A unified framework and method evaluation", BMC Bioinformatics, 11:165, 2010, doi:10.1186/1471-2105-11-165. full text
The MEME Suite is free for non-profit use, but for-profit users should purchase a license. See the MEME Suite Copyright Page for details.
if (meme_is_installed()) { # Create random named sequences as input for example seqs <- universalmotif::create_sequences(rng.seed = 123) names(seqs) <- seq_along(seqs) # An example path to a motif database file in .meme format motif_file <- system.file("extdata", "flyFactorSurvey_cleaned.meme", package = "memes") runAme(seqs, database = motif_file) # Dreme results dataset for example dreme_xml <- system.file("extdata", "dreme.xml", package = "memes") dreme_results <- importDremeXML(dreme_xml) # database can be set to multiple values like so: runAme(seqs, database = list(motif_file, "my_dreme_motifs" = dreme_results)) }
if (meme_is_installed()) { # Create random named sequences as input for example seqs <- universalmotif::create_sequences(rng.seed = 123) names(seqs) <- seq_along(seqs) # An example path to a motif database file in .meme format motif_file <- system.file("extdata", "flyFactorSurvey_cleaned.meme", package = "memes") runAme(seqs, database = motif_file) # Dreme results dataset for example dreme_xml <- system.file("extdata", "dreme.xml", package = "memes") dreme_results <- importDremeXML(dreme_xml) # database can be set to multiple values like so: runAme(seqs, database = list(motif_file, "my_dreme_motifs" = dreme_results)) }
DREME discovers short, ungapped, de-novo motifs that are relatively enriched relative to a control set of sequences. DREME can be run to discover motifs relative to a shuffled set of input sequences, or against a separately provided set of "control" sequences.
runDreme(input, control, outdir = "auto", meme_path = NULL, silent = TRUE, ...)
runDreme(input, control, outdir = "auto", meme_path = NULL, silent = TRUE, ...)
input |
regions to scan for motifs. Can be any of:
|
control |
regions to use as background for motif search. Can be any of:
|
outdir |
path to output directory of dreme files, or "auto" to autogenerate path. Default: location of input fasta in dir named "\<input\>vs\<control\>". If input is DNAstringset, will be temporary path. This means that if you want to save the raw output files, you must use fasta files as input or use an informative (and unique) outdir name. memes will not check if it overwrites files in a directory. Directories will be recursively created if needed. |
meme_path |
optional, path to "meme/bin/" on your local machine. runDreme will search 3 places in order for meme if this flag is unset:
|
silent |
whether to suppress printing dreme stdout as a message when finishing with no errors. Can be useful for troubleshooting in situations where no motifs are discovered, but command completes successfully. (default: TRUE) |
... |
dreme flags can be passed as R function arguments to use
non-default behavior. For a full list of valid arguments, run your local
install of dreme -h, or visit the dreme documentation
website. See list below for aliases
of common flags. To set flags with no values (ex. |
Properly setting the control
parameter is key to discovering biologically
relevant motifs. Often, using control = "shuffle"
will produce a suboptimal
set of motifs; however, some discriminative analysis designs don't have
proper "control" regions other than to shuffle.
As of MEME version 5.2.0, DREME is deprecated. Consider runStreme()
instead.
In addition to allowing any valid flag of dreme to be passed to ...
, we
provide a few user-friendly aliases for common flags which are more readable (see list below).
For example, e = 1 will use a max evalue cutoff of 1. This is equivalent to
setting evalue = 1. For additional details about each DREME flag, see the
DREME Manual Webpage.
List of values which can be passed to ...
:
NOTE: values must be referred to using their name in the "memes alias"
column, not the "DREME Flag" column.
memes alias | DREME Flag | description | default |
nmotifs | m | max number of motifs to discover | NULL |
sec | t | max number of seconds to run | NULL |
evalue | e | max E-value cutoff | 0.05 |
seed | s | random seed if using "shuffle" as control | 1 |
ngen | g | nuber of REs to generalize | 100 |
mink | mink | minimum motif width to search | 3 |
maxk | maxk | maximum motif width to search | 7 |
k | k | set mink and maxk to this value | NULL |
norc | norc | search only the input strand for sequences | FALSE |
dna | dna | use DNA alphabet | TRUE |
rna | rna | use RNA alphabet | FALSE |
protein | protein | use protein alphabet (NOT RECCOMENDED) | FALSE |
universalmotif_df
with statistics for each discovered motif. The motif
column contains a universalmotif object representation in PCM format of
each DREME motif. If no motifs are discovered, returns NULL. The column
values are as follows:
rank = ranked order of discovered motif
name = primary name of motif
altname = alternative name of motif
seq = consensus sequence of the motif
length = length of discovered motif
nsites = number of times the motif is found in input sequences
positive_hits = number of sequences in input containing at least 1 of the motif
negative_hits = number of sequences in control containing at least 1 of the motif
pval = p-value
eval = E-value
unerased_eval = Unerased E-Value
positive_total = number of sequences in input
negative_total = number of sequences in control
pos_frac = fraction of positive sequences with a hit
neg_frac = fraction of negative sequences with a hit
motif = a universalmotif object of the discovered motif
If you use runDreme()
in your analysis, please cite:
Timothy L. Bailey, "DREME: Motif discovery in transcription factor ChIP-seq data", Bioinformatics, 27(12):1653-1659, 2011. full text
The MEME Suite is free for non-profit use, but for-profit users should purchase a license. See the MEME Suite Copyright Page for details.
if (meme_is_installed()) { # Create random named sequences as input for example seqs <- universalmotif::create_sequences(rng.seed = 123) names(seqs) <- seq_along(seqs) # Runs dreme with default settings, shuffles input as background runDreme(seqs, "shuffle") # Runs searching for max 2 motifs, e-value cutoff = 0.1, explicitly using the DNA alphabet runDreme(seqs, "shuffle", nmotifs = 2, e = 0.1, dna = TRUE) }
if (meme_is_installed()) { # Create random named sequences as input for example seqs <- universalmotif::create_sequences(rng.seed = 123) names(seqs) <- seq_along(seqs) # Runs dreme with default settings, shuffles input as background runDreme(seqs, "shuffle") # Runs searching for max 2 motifs, e-value cutoff = 0.1, explicitly using the DNA alphabet runDreme(seqs, "shuffle", nmotifs = 2, e = 0.1, dna = TRUE) }
FIMO scans input sequences to identify the positions of matches to each input motif. FIMO has no sequence length or motif number restrictions.
runFimo( sequences, motifs, bfile = "motif", outdir = "auto", parse_genomic_coord = TRUE, skip_matched_sequence = FALSE, max_strand = TRUE, text = TRUE, meme_path = NULL, silent = TRUE, ... )
runFimo( sequences, motifs, bfile = "motif", outdir = "auto", parse_genomic_coord = TRUE, skip_matched_sequence = FALSE, max_strand = TRUE, text = TRUE, meme_path = NULL, silent = TRUE, ... )
sequences |
path to fasta file, or stringset input. |
motifs |
path to .meme format file, or universalmotif/universalmotif list input. |
bfile |
path to background file, or special values: "motif" to use 0-order frequencies contained in the motif, or "uniform" to use a uniform letter distribution. (default: "motif") |
outdir |
output directory location. Only used if text = FALSE. Default: "auto" to autogenerate directory name. Note: if not using a fasta path as input, this will be a temporary location unless explicity set. |
parse_genomic_coord |
|
skip_matched_sequence |
|
max_strand |
if match is found on both strands, only report strand with best match (default: TRUE). |
text |
|
meme_path |
path to |
silent |
|
... |
additional commandline arguments to fimo. See the FIMO Flag table below. |
Additional arguments passed to ...
. See: Fimo web manual
for a complete description of FIMO flags.
FIMO Flag | allowed values | default | description |
alpha | numeric(1) |
1 | alpha for calculating position-specific priors. Represents fraction of sites that are binding sites of TF of interest. Used in conjunction with psp |
bfile | "motif", "motif-file", "uniform", file path, | "motif" | If "motif" or "motif-file", use 0-order letter frequencies from motif. "uniform" sets uniform letter frequencies. |
max_stored_scores | integer(1) |
NULL | maximum number of scores to be stored for computing q-values. used when text = FALSE , see FIMO webpage for details |
motif_pseudo | numeric(1) |
0.1 | pseudocount added to motif matrix |
no_qvalue | logical(1) |
FALSE | only needed when text = FALSE , do not compute q-value for each p-value |
norc | logical(1) |
FALSE | Do not score reverse complement strand |
prior_dist | file path | NULL | file containing binned distribution of priors |
psp | file path | NULL | file containing position specific priors. Requires prior_dist |
qv_thresh | logical(1) |
FALSE | use q-values for the output threshold |
thresh | numeric(1) |
1e-4 |
output threshold for returning a match, only matches with values less than thresh are returned. |
The MEME Suite is free for non-profit use, but for-profit users should purchase a license. See the MEME Suite Copyright Page for details.
GRanges object containing positions of each match. Note: if
parse_genomic_coords = FALSE
, each seqnames
entry will be the full fasta
header, and start/end will be the relative position within that sequence of the
match. The GRanges object has the following additional mcols
:
* motif_id = primary name of matched motif
* motif_alt_id = alternate name of matched motif
* score = score of match (higher score is a better match)
* pvalue = pvalue of the match
* qvalue = qvalue of the match
* matched_sequence = sequence that matches the motif
If you use runFimo()
in your analysis, please cite:
Charles E. Grant, Timothy L. Bailey, and William Stafford Noble, "FIMO: Scanning for occurrences of a given motif", Bioinformatics, 27(7):1017-1018, 2011. full text
if (meme_is_installed()){ # Generate some example input sequences seq <- universalmotif::create_sequences() # sequences must have values in their fasta headers names(seq) <- seq_along(seq) # Create random example motif to search for motif <- universalmotif::create_motif() # Search for motif in sequences # parse_genomic_coord set to FALSE since fasta headers aren't in "chr:start-end" format. runFimo(seq, motif, parse_genomic_coord = FALSE) }
if (meme_is_installed()){ # Generate some example input sequences seq <- universalmotif::create_sequences() # sequences must have values in their fasta headers names(seq) <- seq_along(seq) # Create random example motif to search for motif <- universalmotif::create_motif() # Search for motif in sequences # parse_genomic_coord set to FALSE since fasta headers aren't in "chr:start-end" format. runFimo(seq, motif, parse_genomic_coord = FALSE) }
MEME performs de-novo discovery of ungapped motifs present in the input sequences. It can be used in both discriminative and non-discriminative modes.
runMeme( input, control = NA, outdir = "auto", alph = "dna", parse_genomic_coord = TRUE, combined_sites = FALSE, silent = TRUE, meme_path = NULL, ... ) ## S3 method for class 'list' runMeme( input, control = NA, outdir = "auto", alph = "dna", parse_genomic_coord = TRUE, combined_sites = FALSE, silent = TRUE, meme_path = NULL, ... ) ## S3 method for class 'BStringSetList' runMeme( input, control = NA, outdir = "auto", alph = "dna", parse_genomic_coord = TRUE, combined_sites = FALSE, silent = TRUE, meme_path = NULL, ... ) ## Default S3 method: runMeme( input, control = NA, outdir = "auto", alph = "dna", parse_genomic_coord = TRUE, combined_sites = FALSE, silent = TRUE, meme_path = NULL, ... )
runMeme( input, control = NA, outdir = "auto", alph = "dna", parse_genomic_coord = TRUE, combined_sites = FALSE, silent = TRUE, meme_path = NULL, ... ) ## S3 method for class 'list' runMeme( input, control = NA, outdir = "auto", alph = "dna", parse_genomic_coord = TRUE, combined_sites = FALSE, silent = TRUE, meme_path = NULL, ... ) ## S3 method for class 'BStringSetList' runMeme( input, control = NA, outdir = "auto", alph = "dna", parse_genomic_coord = TRUE, combined_sites = FALSE, silent = TRUE, meme_path = NULL, ... ) ## Default S3 method: runMeme( input, control = NA, outdir = "auto", alph = "dna", parse_genomic_coord = TRUE, combined_sites = FALSE, silent = TRUE, meme_path = NULL, ... )
input |
path to fasta, Biostrings::BStringSet list, or list of
Biostrings::BStringSet (can generate using |
control |
any data type as in |
outdir |
(default: "auto") Directory where output data will be stored. |
alph |
one of c("dna", "rna", "protein") or path to alphabet file (default: "dna"). |
parse_genomic_coord |
|
combined_sites |
|
silent |
Whether to suppress printing stdout to terminal (default: TRUE) |
meme_path |
path to "meme/bin/". If unset, will use default search behavior:
|
... |
additional arguments passed to MEME (see below) |
Note that MEME can take a long time to run. The more input sequences used, the wider the motifs searched for, and the more motifs MEME is asked to discover will drastically affect runtime. For this reason, MEME usually performs best on a few (<50) short (100-200 bp) sequences, although this is not a requirement. Additional details on how data size affects runtime can be found here.
MEME works best when specifically tuned to the analysis question. The default
settings are unlikely to be ideal. It has several complex arguments
documented here, which runMeme()
accepts as R function arguments (see details below).
If discovering motifs within ChIP-seq, ATAC-seq, or similar peaks, MEME may perform
best if using sequences flaking the summit (the site of maximum signal) of
each peak rather than the center. ChIP-seq or similar data can also benefit
from setting revcomp = TRUE, minw = 5, maxw = 20
. For more tips on using
MEME to analyze ChIP-seq data, see the following
tips page.
runMeme()
accepts all valid arguments to meme as arguments passed to ...
.
For flags without values, pass them as flag = TRUE
. The dna
, rna
, and
protein
flags should instead be passed to the alph
argument of
runMeme()
. The arguments passed to MEME often have many interactions
with each other, for a detailed description of each argument see
MEME Commandline Documentation.
MEME results in universalmotif_df format (see:
universalmotif::to_df()
). sites_hits
is a nested data.frame
column containing the position within each input sequence of matches to the
identified motif.
If you use runMeme()
in your analysis, please cite:
Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994. pdf
The MEME Suite is free for non-profit use, but for-profit users should purchase a license. See the MEME Suite Copyright Page for details.
if (meme_is_installed()) { seqs <- universalmotif::create_sequences("CCRAAAW", seqnum = 4) names(seqs) <- 1:length(seqs) runMeme(seqs, parse_genomic_coord = FALSE) }
if (meme_is_installed()) { seqs <- universalmotif::create_sequences("CCRAAAW", seqnum = 4) names(seqs) <- 1:length(seqs) runMeme(seqs, parse_genomic_coord = FALSE) }
STREME discovers short, ungapped, *de-novo* motifs that are enriched or relatively enriched relative to a control set of sequences. STREME can be run to discover motifs relative to a shuffled set of input sequences, against a separately provided set of "control" sequences, or to determine whether motifs are centrally enriched within input sequences.
runStreme( input, control, outdir = "auto", objfun = "de", alph = "dna", meme_path = NULL, silent = TRUE, ... )
runStreme( input, control, outdir = "auto", objfun = "de", alph = "dna", meme_path = NULL, silent = TRUE, ... )
input |
regions to scan for motifs. If using 'objfun = "cd"' to test for centrally enriched motifs, be sure to include sufficient flanking sequence (e.g. +/- 500bp) for an accurate estimate. Can be any of: - path to fasta file - DNAStringSet object (can be generated from GRanges using 'get_sequence()') - List of DNAStringSet objects (generated from 'get_sequence()') - *NOTE:* if using StringSet inputs, each entry must be named (set with 'names()'). - *NOTE:* If you want to retain the raw streme output files, you must use a path to fasta file as input, or specify an "outdir" |
control |
regions to use as background for motif search. These should have a similar length distribution as the input sequences. Can be any of: - path to fasta file - DNAStringSet object (can be generated from GRanges using get_sequence) - A Biostrings::BStringSetList (generated using 'get_sequence'), in which case all sequences in the list will be combined as the control set. - if 'input' is a list of DNAStringSet objects, a character vector of names in 'input' will use those sequences as background. runstreme will not scan those regions as input. - "shuffle" to use streme's built-in dinucleotide shuffle feature (NOTE: if 'input' is a list object with an entry named "shuffle", the list entry will be used instead). Optionally can also pass 'seed = <any number>' to '...' to use as the random seed during shuffling. If no seed is passed, streme will use 0 as the random seed, so results will be reproducible if rerunning. |
outdir |
path to output directory of streme files, or "auto" to autogenerate path. Default: location of input fasta in dir named "\<input\>_vs_\<control\>". If input is DNAstringset, will be temporary path. This means that if you want to save the raw output files, you must use fasta files as input or use an informative (and unique) outdir name. memes will **not check** if it overwrites files in a directory. Directories will be recursively created if needed. (default: "auto") |
objfun |
one of c("de", "cd"). Default: "de" for differential enrichment. "cd" for central distance (control must be set to NA for "cd"). |
alph |
one of c("dna", "rna", "protein") or a path to a MEME format alph file. (default: "dna") |
meme_path |
path to "meme/bin" |
silent |
Whether to suppress printing stdout & stderr to console (default: TRUE). Warnings are always printed regardless of this setting. |
... |
pass any commandline options as R function arguments. For a complete list of STREME options, see [the STREME manual](https://meme-suite.org/meme/doc/streme.html). |
Properly setting the 'control' parameter is key to discovering biologically relevant motifs. Often, using 'control = "shuffle"' will produce a suboptimal set of motifs; however, some discriminative analysis designs don't have proper "control" regions other than to shuffle.
If you have fewer than 50 sequences, consider using [runMeme()] instead.
# Citation
If you use 'runStreme()' in your analysis, please cite:
Timothy L. Bailey, "STREME: Accurate and versatile sequence motif discovery", Bioinformatics, 2021. https://doi.org/10.1093/bioinformatics/btab203
# Licensing The MEME Suite is free for non-profit use, but for-profit users should purchase a license. See the [MEME Suite Copyright Page](http://meme-suite.org/doc/copyright.html) for details.
a 'universalmotif_df' of STREME Motifs
'?universalmotif::tidy-motifs'
TomTom compares input motifs to a database of known, user-provided motifs to identify matches.
runTomTom( input, database = NULL, outdir = "auto", thresh = 10, min_overlap = 5, dist = "ed", evalue = TRUE, silent = TRUE, meme_path = NULL, ... )
runTomTom( input, database = NULL, outdir = "auto", thresh = 10, min_overlap = 5, dist = "ed", evalue = TRUE, silent = TRUE, meme_path = NULL, ... )
input |
path to .meme format file of motifs, a list of universalmotifs,
or a universalmotif data.frame object (such as the output of |
database |
path to .meme format file to use as reference database (or list of universalmotifs). NOTE: p-value estimates are inaccurate when the database has fewer than 50 entries. |
outdir |
directory to store tomtom results (will be overwritten if exists). Default: location of input fasta file, or temporary location if using universalmotif input. |
thresh |
report matches less than or equal to this value. If evalue = TRUE (default), set an e-value threshold (default = 10). If evalue = FALSE, set a value between 0-1 (default = 0.5). |
min_overlap |
only report matches that overlap by this value or more, unless input motif is shorter, in which case the shorter length is used as the minimum value |
dist |
distance metric. Valid arguments: |
evalue |
whether to use E-value as significance threshold (default:
|
silent |
suppress printing stderr to console (default: TRUE). |
meme_path |
path to "meme/bin/" (optional). If unset, will check R
environment variable "MEME_DB (set in |
... |
additional flags passed to tomtom using cmdfun formating (see table below for details) |
runTomTom will rank matches by significance and return a
best match motif for each input (whose properties are stored in the best_match_*
columns) as well as a ranked list of all possible matches stored in the
tomtom
list column.
Additional arguments
runTomTom() can accept all valid tomtom arguments passed to ...
as described in the
tomtom commandline reference. For
convenience, below is a table of valid arguments, their default values, and
their description.
TomTom Flag | allowed values | default | description |
bfile | file path | NULL |
path to background model for converting frequency matrix to log-odds score (not used when dist is set to "ed", "kullback", "pearson", or "sandelin" |
motif_pseudo | numeric |
0.1 | pseudocount to add to motifs |
xalph | logical |
FALSE | convert alphabet of target database to alphabet of query database |
norc | logical |
FALSE | Do not score reverse complements of motifs |
incomplete_scores | logical |
FALSE | Compute scores using only aligned columns |
thresh | numeric |
0.5 | only report matches with significance values <= this value. Unless evalue = TRUE , this value must be < 1. |
internal | logical |
FALSE | forces the shorter motif to be completely contained in the longer motif |
min_overlap | integer |
1 | only report matches that overlap by this number of positions or more. If query motif is smaller than this value, its width is used as the min overlap for that query |
time | integer |
NULL |
Maximum runtime in CPU seconds (default: no limit) |
data.frame of match results. Contains best_match_motif
column of
universalmotif
objects with the matched PWM from the database, a series
of best_match_*
columns describing the TomTom results of the match, and a
tomtom
list column storing the ranked list of possible matches to each
motif. If a universalmotif data.frame is used as input, these columns are
appended to the data.frame. If no matches are returned, tomtom
and
best_match_motif
columns will be set to NA
and a message indicating
this will print.
If you use runTomTom()
in your analysis, please cite:
Shobhit Gupta, JA Stamatoyannopolous, Timothy Bailey and William Stafford Noble, "Quantifying similarity between motifs", Genome Biology, 8(2):R24, 2007. full text
The MEME Suite is free for non-profit use, but for-profit users should purchase a license. See the MEME Suite Copyright Page for details.
if (meme_is_installed()) { motif <- universalmotif::create_motif("CCRAAAW") database <- system.file("extdata", "flyFactorSurvey_cleaned.meme", package = "memes") runTomTom(motif, database) }
if (meme_is_installed()) { motif <- universalmotif::create_motif("CCRAAAW") database <- system.file("extdata", "flyFactorSurvey_cleaned.meme", package = "memes") runTomTom(motif, database) }
This function updates the best_match columns based on the rankings on the tomtom list data. By re-ordering the entries of a 'tomtom' object, the best_match columns can be updated to reflect the new rankings using [update_best_match()], where the first row of the 'tomtom' data.frame is selected as the best match.
update_best_match(res)
update_best_match(res)
res |
results from runTomTom |
'res' with updated best_* columns
[force_best_match()]
data("example_dreme_tomtom") # best match is "CG2052_SANGER_2.5" example_dreme_tomtom$best_match_name[1] # reorder the `tomtom` data.frame example_dreme_tomtom$tomtom[[1]] <- dplyr::arrange(example_dreme_tomtom$tomtom[[1]], dplyr::desc(match_eval)) # update_best_match will use the new order of rows, taking the top row as the new best match new_res <- update_best_match(example_dreme_tomtom) # best match is now altered: new_res$best_match_name[1]
data("example_dreme_tomtom") # best match is "CG2052_SANGER_2.5" example_dreme_tomtom$best_match_name[1] # reorder the `tomtom` data.frame example_dreme_tomtom$tomtom[[1]] <- dplyr::arrange(example_dreme_tomtom$tomtom[[1]], dplyr::desc(match_eval)) # update_best_match will use the new order of rows, taking the top row as the new best match new_res <- update_best_match(example_dreme_tomtom) # best match is now altered: new_res$best_match_name[1]
Although TomTom does a good job of matching unknown motifs to known motifs, sometimes the top hit is not the correct assignment. It can be useful to manually inspect the hits. This function provides a quick utility to compare matches.
view_tomtom_hits(results, top_n = "all")
view_tomtom_hits(results, top_n = "all")
results |
results data.frame from [runTomTom()] |
top_n |
number of matched motifs to return in plot (default: "all") |
This is intended to be a function used interactively and may not always be the best tool for creating publication-quality figures. Results with matches return ggseqlogo outputs which can be further manipulated using [ggplot2::theme()] calls, but results containing no matches are static plots.
plot of input motif vs the top n number of tomtom matched motifs. If no match found, will plot "No Match". Note: the "No Match" plots are not amenable to ggplot theme() manipulations, while all others are.
results <- importTomTomXML(system.file("extdata", "tomtom.xml", package = "memes")) # show top 3 hits view_tomtom_hits(results, top_n = 3)
results <- importTomTomXML(system.file("extdata", "tomtom.xml", package = "memes")) # show top 3 hits view_tomtom_hits(results, top_n = 3)
Write fasta file from stringset
write_fasta(seq, path = tempfile(fileext = ".fa"))
write_fasta(seq, path = tempfile(fileext = ".fa"))
seq |
a 'Biostrings::XStringSet' |
path |
path of fasta file to write (default: temporary file) |
path to created fasta file
seq <- universalmotif::create_sequences() write_fasta(seq)
seq <- universalmotif::create_sequences() write_fasta(seq)