| Title: | Benchmarking Epigenomic Profiling Methods Using Motif Enrichment |
|---|---|
| Description: | MotifPeeker is used to compare and analyse datasets from epigenomic profiling methods with motif enrichment as the key benchmark. The package outputs an HTML report consisting of three sections: (1. General Metrics) Overview of peaks-related general metrics for the datasets (FRiP scores, peak widths and motif-summit distances). (2. Known Motif Enrichment Analysis) Statistics for the frequency of user-provided motifs enriched in the datasets. (3. Motif Discovery Enrichment Analysis) Statistics for the frequency of ab-initio discovered motifs enriched in the datasets and compared with known motifs. |
| Authors: | Hiranyamaya Dash [cre, aut] (ORCID: <https://orcid.org/0009-0005-5514-505X>), Thomas Roberts [aut] (ORCID: <https://orcid.org/0009-0006-6244-8670>), Maria Weinert [aut] (ORCID: <https://orcid.org/0000-0001-6187-1000>), Nathan Skene [aut] (ORCID: <https://orcid.org/0000-0002-6807-3180>) |
| Maintainer: | Hiranyamaya Dash <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.5.0 |
| Built: | 2026-05-30 07:44:08 UTC |
| Source: | https://github.com/bioc/MotifPeeker |
This function performs bootstrapping to estimate the distribution of mean absolute distances between peak summits and motif positions for a given set of peaks and a specified motif.
bootstrap_distances( peaks, motif, genome_build, samples_n = NULL, samples_len = NULL, out_dir = tempdir(), meme_path = NULL, verbose = FALSE )bootstrap_distances( peaks, motif, genome_build, samples_n = NULL, samples_len = NULL, out_dir = tempdir(), meme_path = NULL, verbose = FALSE )
peaks |
A |
motif |
A |
genome_build |
A |
samples_n |
An integer specifying the number of bootstrap samples to
generate. If |
samples_len |
An integer specifying the number of peaks to sample in
each bootstrap iteration. If |
out_dir |
Location to save the 0-order background file. By default, the background file will be written to a temporary directory. |
meme_path |
path to "meme/bin/" (default: |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
A numeric vector of bootstrapped mean absolute distances
between peak summits and motif positions with length equal to
samples_n.
if (memes::meme_is_installed()) { peak <- system.file("extdata", "CTCF_ChIP_peaks.narrowPeak", package = "MotifPeeker") |> read_peak_file() |> sample(20) motif <- system.file("extdata", "motif_MA1102.3.jaspar", package = "MotifPeeker") |> read_motif_file() if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 distances <- bootstrap_distances( peak = peak, motif = motif, genome_build = genome_build, samples_n = 2, samples_len = NULL, verbose = FALSE ) print(distances) } }if (memes::meme_is_installed()) { peak <- system.file("extdata", "CTCF_ChIP_peaks.narrowPeak", package = "MotifPeeker") |> read_peak_file() |> sample(20) motif <- system.file("extdata", "motif_MA1102.3.jaspar", package = "MotifPeeker") |> read_motif_file() if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 distances <- bootstrap_distances( peak = peak, motif = motif, genome_build = genome_build, samples_n = 2, samples_len = NULL, verbose = FALSE ) print(distances) } }
Calculate the Fraction of Reads in Peak score from the read and peak file of an experiment.
calc_frip(read_file, peak_file, single_end = TRUE, total_reads = NULL)calc_frip(read_file, peak_file, single_end = TRUE, total_reads = NULL)
read_file |
A BamFile object. |
peak_file |
A GRanges object. |
single_end |
A logical value. If TRUE, the reads classified as single-ended. (default = TRUE) |
total_reads |
(optional) The total number of reads in the experiment. Skips counting the total number of reads if provided, saving computation. |
The FRiP score is calculated as follows:
A numeric value indicating the FRiP score.
read_file <- system.file("extdata", "CTCF_ChIP_alignment.bam", package = "MotifPeeker") read_file <- Rsamtools::BamFile(read_file) data("CTCF_ChIP_peaks", package = "MotifPeeker") calc_frip(read_file, CTCF_ChIP_peaks)read_file <- system.file("extdata", "CTCF_ChIP_alignment.bam", package = "MotifPeeker") read_file <- Rsamtools::BamFile(read_file) data("CTCF_ChIP_peaks", package = "MotifPeeker") calc_frip(read_file, CTCF_ChIP_peaks)
Check and get files from ENCODE project. Requires the input to be in ENCODE ID format. Uses BiocFileCache to cache downloads. Only works for files.
check_ENCODE(encode_id, expect_format, verbose = FALSE)check_ENCODE(encode_id, expect_format, verbose = FALSE)
encode_id |
A character string specifying the ENCODE ID. |
expect_format |
A character string (or a vector) specifying the expected format(s) of the file. If the file is not in the expected format, an error is thrown. |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
A character string specifying the path to the downloaded file.
if (requireNamespace("curl", quietly = TRUE) && requireNamespace("jsonlite", quietly = TRUE)) { check_ENCODE("ENCFF920TXI", expect_format = c("bed", "gz")) }if (requireNamespace("curl", quietly = TRUE) && requireNamespace("jsonlite", quietly = TRUE)) { check_ENCODE("ENCFF920TXI", expect_format = c("bed", "gz")) }
Check if the genome build is valid and return an appropriate BSgenome-class object.
check_genome_build(genome_build)check_genome_build(genome_build)
genome_build |
A character string with the abbreviated genome build name, or a BSGenome object. Check check_genome_build details for genome builds which can be imported as abbreviated names. |
Currently, the following genome builds can be specified to 'genome_build' as abbreviated names:
hg38: Human genome build GRCh38
(BSgenome.Hsapiens.UCSC.hg38)
hg19: Human genome build GRCh37
(BSgenome.Hsapiens.UCSC.hg19)
mm10: Mouse genome build GRCm38
(BSgenome.Mmusculus.UCSC.mm10)
mm39: Mouse genome build GRCm39
(BSgenome.Mmusculus.UCSC.mm39)
If the genome build you wish to use is not listed here, please pass a BSgenome-class data object directly.
A BSGenome object.
BSgenome-class for more information on BSGenome objects.
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { check_genome_build("hg38") }if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { check_genome_build("hg38") }
Check and get files from JASPAR. Requires the input to be in JASPAR ID format. Uses BiocFileCache to cache downloads.
check_JASPAR(motif_id, verbose = FALSE)check_JASPAR(motif_id, verbose = FALSE)
motif_id |
A character string specifying the JASPAR motif ID. |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
A character string specifying the path to the downloaded file.
check_JASPAR("MA1930.2")check_JASPAR("MA1930.2")
Human CTCF peak file generated with ChIP-seq using HCT116 cell-line. No control files were used to generate the peak file.
data("CTCF_ChIP_peaks")data("CTCF_ChIP_peaks")
An object of class GRanges of length 209.
To reduce the size of the package, the included peak file focuses on specific genomic regions. The subset region included is chr10:65,654,529-74,841,155.
Human CTCF peak file generated with TIP-seq using HCT116 cell-line. The peak file was generated using the nf-core/cutandrun pipeline. Raw read files were sourced from NIH Sequence Read Archives (ID: SRR16963166).
data("CTCF_TIP_peaks")data("CTCF_TIP_peaks")
An object of class GRanges of length 182.
To reduce the size of the package, the included peak file focuses on specific genomic regions. The subset region included is chr10:65,654,529-74,841,155.
Use STREME from MEME suite to find motifs in the provided sequences. To speed up the process, the sequences can be optionally trimmed to reduce the search space. The result is then optionally filtered to remove motifs with a high number of nucleotide repeats
denovo_motifs( seqs, trim_seq_width, genome_build, discover_motifs_count = 3, minw = 8, maxw = 25, filter_n = 6, out_dir = tempdir(), meme_path = NULL, BPPARAM = BiocParallel::SerialParam(), verbose = FALSE, debug = FALSE, ... )denovo_motifs( seqs, trim_seq_width, genome_build, discover_motifs_count = 3, minw = 8, maxw = 25, filter_n = 6, out_dir = tempdir(), meme_path = NULL, BPPARAM = BiocParallel::SerialParam(), verbose = FALSE, debug = FALSE, ... )
seqs |
A list of |
trim_seq_width |
An integer specifying the width of the sequence to extract around the summit (default = NULL). This sequence is used to search for discovered motifs. If not provided, the entire peak region will be used. This parameter is intended to reduce the search space and speed up motif discovery; therefore, a value less than the average peak width is recommended. Peaks are trimmed symmetrically around the summit while respecting the peak bounds. |
genome_build |
The genome build that the peak sequences should be derived from. |
discover_motifs_count |
An integer specifying the number of motifs to discover. (default = 3) Note that higher values take longer to compute. |
minw |
An integer specifying the minimum width of the motif. (default = 8) |
maxw |
An integer specifying the maximum width of the motif. (default = 25) |
filter_n |
An integer specifying the number of consecutive nucleotide repeats a discovered motif must contain to be filtered out. (default = 6) |
out_dir |
A |
meme_path |
path to "meme/bin/" (default: |
BPPARAM |
A |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
debug |
A logical indicating whether to print debug messages while running the function. (default = FALSE) |
... |
Additional arguments to pass to |
A list of universalmotif objects and
associated metadata.
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 res <- denovo_motifs(list(CTCF_TIP_peaks), trim_seq_width = 50, genome_build = genome_build, discover_motifs_count = 1, filter_n = 6, minw = 8, maxw = 8, out_dir = tempdir()) print(res[[1]]$consensus) } }if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 res <- denovo_motifs(list(CTCF_TIP_peaks), trim_seq_width = 50, genome_build = genome_build, discover_motifs_count = 1, filter_n = 6, minw = 8, maxw = 8, out_dir = tempdir()) print(res[[1]]$consensus) } }
Search through provided motif database to find similar motifs to the input.
Light wrapper around TOMTOM from MEME Suite.
find_motifs( streme_out, motif_db, out_dir = tempdir(), meme_path = NULL, BPPARAM = BiocParallel::bpparam(), verbose = FALSE, debug = FALSE, ... )find_motifs( streme_out, motif_db, out_dir = tempdir(), meme_path = NULL, BPPARAM = BiocParallel::bpparam(), verbose = FALSE, debug = FALSE, ... )
streme_out |
Output from |
motif_db |
Path to |
out_dir |
A |
meme_path |
path to "meme/bin/" (default: |
BPPARAM |
A |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
debug |
A logical indicating whether to print debug messages while running the function. (default = FALSE) |
... |
Additional arguments to pass to |
data.frame of match results. Contains best_match_motif column of
universalmotif objects with the matched PWM from the database, a series
of best_match_* columns describing the TomTom results of the match, and a
tomtom list column storing the ranked list of possible matches to each
motif. If a universalmotif data.frame is used as input, these columns are
appended to the data.frame. If no matches are returned, tomtom and
best_match_motif columns will be set to NA and a message indicating
this will print.
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 res <- denovo_motifs(list(CTCF_TIP_peaks), trim_seq_width = 50, genome_build = genome_build, discover_motifs_count = 1, filter_n = 10, out_dir = tempdir()) res2 <- find_motifs(res, motif_db = get_JASPARCORE(), out_dir = tempdir()) print(res2) } }if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 res <- denovo_motifs(list(CTCF_TIP_peaks), trim_seq_width = 50, genome_build = genome_build, discover_motifs_count = 1, filter_n = 10, out_dir = tempdir()) res2 <- find_motifs(res, motif_db = get_JASPARCORE(), out_dir = tempdir()) print(res2) } }
Wrapper for 'MotifPeeker::summit_to_motif' to get motif-summit distances
for all peaks and motifs, generating a data.frame suitable
for plots.
get_df_distances( result, user_motifs, genome_build, out_dir = tempdir(), BPPARAM = BiocParallel::bpparam(), meme_path = NULL, verbose = FALSE )get_df_distances( result, user_motifs, genome_build, out_dir = tempdir(), BPPARAM = BiocParallel::bpparam(), meme_path = NULL, verbose = FALSE )
result |
A
|
user_motifs |
A
|
genome_build |
A character string with the abbreviated genome build name, or a BSGenome object. Check check_genome_build details for genome builds which can be imported as abbreviated names. |
out_dir |
A |
BPPARAM |
A
IMPORTANT: For each worker, please ensure a minimum of 8GB of
memory (RAM) is available as |
meme_path |
path to |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
A data.frame with the following columns:
Experiment labels.
Experiment types.
Motif indices.
Distances between peak summit and motif.
Other generate data.frames:
get_df_distances_bootstrapped(),
get_df_enrichment()
if (memes::meme_is_installed()) { data("CTCF_ChIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") data("motif_MA1930.2", package = "MotifPeeker") input <- list( peaks = CTCF_ChIP_peaks, exp_type = "ChIP", exp_labels = "CTCF", read_count = 150, peak_count = 100 ) motifs <- list( motifs = list(motif_MA1930.2, motif_MA1102.3), motif_labels = list("MA1930.2", "MA1102.3") ) if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 distances_df <- get_df_distances(input, motifs, genome_build) print(distances_df) } }if (memes::meme_is_installed()) { data("CTCF_ChIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") data("motif_MA1930.2", package = "MotifPeeker") input <- list( peaks = CTCF_ChIP_peaks, exp_type = "ChIP", exp_labels = "CTCF", read_count = 150, peak_count = 100 ) motifs <- list( motifs = list(motif_MA1930.2, motif_MA1102.3), motif_labels = list("MA1930.2", "MA1102.3") ) if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 distances_df <- get_df_distances(input, motifs, genome_build) print(distances_df) } }
Wrapper for 'MotifPeeker::bootstrap_distances' to get bootstrapped
motif-summit distances for given peaks and motifs, generating a
data.frame suitable for plots.
get_df_distances_bootstrapped( result, user_motifs, genome_build, samples_n = NULL, samples_len = NULL, out_dir = tempdir(), BPPARAM = BiocParallel::bpparam(), meme_path = NULL, verbose = FALSE )get_df_distances_bootstrapped( result, user_motifs, genome_build, samples_n = NULL, samples_len = NULL, out_dir = tempdir(), BPPARAM = BiocParallel::bpparam(), meme_path = NULL, verbose = FALSE )
result |
A
|
user_motifs |
A
|
genome_build |
A character string with the abbreviated genome build name, or a BSGenome object. Check check_genome_build details for genome builds which can be imported as abbreviated names. |
samples_n |
An integer specifying the number of bootstrap samples to
generate. If |
samples_len |
An integer specifying the number of peaks to sample in
each bootstrap iteration. If |
out_dir |
A |
BPPARAM |
A
IMPORTANT: For each worker, please ensure a minimum of 8GB of
memory (RAM) is available as |
meme_path |
path to |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
A data.frame with the following columns:
Experiment labels.
Experiment types.
Motif indices.
Bootstrap iteration number.
Mean of absolute distances between peak summit and motif.
Other generate data.frames:
get_df_distances(),
get_df_enrichment()
if (memes::meme_is_installed()) { peak <- system.file("extdata", "CTCF_ChIP_peaks.narrowPeak", package = "MotifPeeker") |> read_peak_file() |> sample(20) motif_MA1102.3 <- system.file("extdata", "motif_MA1102.3.jaspar", package = "MotifPeeker") |> read_motif_file() motif_MA1930.2 <- system.file("extdata", "motif_MA1930.2.jaspar", package = "MotifPeeker") |> read_motif_file() input <- list( peaks = peak, exp_type = "ChIP", exp_labels = "CTCF", read_count = 150, peak_count = 100 ) motifs <- list( motifs = list(motif_MA1930.2, motif_MA1102.3), motif_labels = list("MA1930.2", "MA1102.3") ) if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 distances_df_bootstrapped <- get_df_distances_bootstrapped( input, user_motifs = motifs, genome_build = genome_build, samples_n = NULL, samples_len = NULL, verbose = FALSE ) print(distances_df_bootstrapped) } }if (memes::meme_is_installed()) { peak <- system.file("extdata", "CTCF_ChIP_peaks.narrowPeak", package = "MotifPeeker") |> read_peak_file() |> sample(20) motif_MA1102.3 <- system.file("extdata", "motif_MA1102.3.jaspar", package = "MotifPeeker") |> read_motif_file() motif_MA1930.2 <- system.file("extdata", "motif_MA1930.2.jaspar", package = "MotifPeeker") |> read_motif_file() input <- list( peaks = peak, exp_type = "ChIP", exp_labels = "CTCF", read_count = 150, peak_count = 100 ) motifs <- list( motifs = list(motif_MA1930.2, motif_MA1102.3), motif_labels = list("MA1930.2", "MA1102.3") ) if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 distances_df_bootstrapped <- get_df_distances_bootstrapped( input, user_motifs = motifs, genome_build = genome_build, samples_n = NULL, samples_len = NULL, verbose = FALSE ) print(distances_df_bootstrapped) } }
Wrapper for 'MotifPeeker::motif_enrichment' to get motif enrichment counts
and percentages for all peaks and motifs, generating a data.frame
suitable for plots. The data.frame contains values for all and
segregated peaks.
get_df_enrichment( result, segregated_peaks, user_motifs, genome_build, reference_index = 1, out_dir = tempdir(), BPPARAM = BiocParallel::bpparam(), meme_path = NULL, verbose = FALSE )get_df_enrichment( result, segregated_peaks, user_motifs, genome_build, reference_index = 1, out_dir = tempdir(), BPPARAM = BiocParallel::bpparam(), meme_path = NULL, verbose = FALSE )
result |
A
|
segregated_peaks |
A |
user_motifs |
A
|
genome_build |
A character string with the abbreviated genome build name, or a BSGenome object. Check check_genome_build details for genome builds which can be imported as abbreviated names. |
reference_index |
An integer specifying the index of the peak file to use as the reference dataset for comparison. Indexing starts from 1. (default = 1) |
out_dir |
A |
BPPARAM |
A
IMPORTANT: For each worker, please ensure a minimum of 8GB of
memory (RAM) is available as |
meme_path |
path to |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
A data.frame with the following columns:
Experiment labels.
Experiment types.
Motif indices.
Segregated group- "all", "Common" or "Unique".
"reference" or "comparison" group.
Number of peaks with motif.
Number of peaks without motif.
Percentage of peaks with motif.
Percentage of peaks without motif.
Other generate data.frames:
get_df_distances(),
get_df_distances_bootstrapped()
if (memes::meme_is_installed()) { data("CTCF_ChIP_peaks", package = "MotifPeeker") data("CTCF_TIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") data("motif_MA1930.2", package = "MotifPeeker") input <- list( peaks = list(CTCF_ChIP_peaks, CTCF_TIP_peaks), exp_type = c("ChIP", "TIP"), exp_labels = c("CTCF_ChIP", "CTCF_TIP"), read_count = c(150, 200), peak_count = c(100, 120) ) segregated_input <- segregate_seqs(input$peaks[[1]], input$peaks[[2]]) motifs <- list( motifs = list(motif_MA1930.2, motif_MA1102.3), motif_labels = list("MA1930.2", "MA1102.3") ) reference_index <- 1 if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 enrichment_df <- get_df_enrichment( input, segregated_input, motifs, genome_build, reference_index = 1 ) } }if (memes::meme_is_installed()) { data("CTCF_ChIP_peaks", package = "MotifPeeker") data("CTCF_TIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") data("motif_MA1930.2", package = "MotifPeeker") input <- list( peaks = list(CTCF_ChIP_peaks, CTCF_TIP_peaks), exp_type = c("ChIP", "TIP"), exp_labels = c("CTCF_ChIP", "CTCF_TIP"), read_count = c(150, 200), peak_count = c(100, 120) ) segregated_input <- segregate_seqs(input$peaks[[1]], input$peaks[[2]]) motifs <- list( motifs = list(motif_MA1930.2, motif_MA1102.3), motif_labels = list("MA1930.2", "MA1102.3") ) reference_index <- 1 if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 enrichment_df <- get_df_enrichment( input, segregated_input, motifs, genome_build, reference_index = 1 ) } }
Downloads JASPAR CORE database in meme format for all available
taxonomic groups. Uses BiocFileCache to cache downloads.
get_JASPARCORE(verbose = FALSE)get_JASPARCORE(verbose = FALSE)
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
A character string specifying the path to the downloaded file
(meme format).
get_JASPARCORE()get_JASPARCORE()
motif_enrichment() calculates motif enrichment relative to a set of
background sequences using Analysis of Motif Enrichment (AME) from
memes-package.
motif_enrichment( peak_input, motif, genome_build, out_dir = tempdir(), verbose = FALSE, meme_path = NULL, ... )motif_enrichment( peak_input, motif, genome_build, out_dir = tempdir(), verbose = FALSE, meme_path = NULL, ... )
peak_input |
Either a path to the narrowPeak file or a GRanges peak
object generated by |
motif |
An object of class |
genome_build |
The genome build that the peak sequences should be derived from. |
out_dir |
Location to save the 0-order background file along with the AME output files. |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
meme_path |
path to "meme/bin/" (default: |
... |
Arguments passed on to
|
A list containing a AME results data frame and a numeric referring to the proportion of peaks with a motif.
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") res <- motif_enrichment( peak_input = CTCF_TIP_peaks, motif = motif_MA1102.3, genome_build = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38, ) print(res) }if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") res <- motif_enrichment( peak_input = CTCF_TIP_peaks, motif = motif_MA1102.3, genome_build = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38, ) print(res) }
The motif file contains the JASPAR motif for CTCFL (MA1102.3) for
Homo Sapiens. This is one of the two motif files used to demonstrate
MotifPeeker's known-motif analysis functionality.
data("motif_MA1102.3")data("motif_MA1102.3")
An object of class universalmotif of length 1.
The motif file contains the JASPAR motif for CTCF (MA1930.2) for
Homo Sapiens. This is one of the two motif files used to demonstrate
MotifPeeker's known-motif analysis functionality.
data("motif_MA1930.2")data("motif_MA1930.2")
An object of class universalmotif of length 1.
Compute motif similarity scores between motifs discovered from segregated
sequences. Wrapper around compare_motifs to
compare motifs from different groups of sequences. To see the possible
similarity measures available, refer to details.
motif_similarity( streme_out, method = "PCC", normalise.scores = TRUE, BPPARAM = BiocParallel::bpparam(), ... )motif_similarity( streme_out, method = "PCC", normalise.scores = TRUE, BPPARAM = BiocParallel::bpparam(), ... )
streme_out |
Output from |
method |
|
normalise.scores |
|
BPPARAM |
A |
... |
Arguments passed on to
|
The following metrics are available:
Euclidean distance (EUCL) (Choi et al. 2004)
Weighted Euclidean distance (WEUCL)
Kullback-Leibler divergence (KL) (Kullback and Leibler 1951; Roepcke et al. 2005)
Hellinger distance (HELL) (Hellinger 1909)
Squared Euclidean distance (SEUCL)
Manhattan distance (MAN)
Pearson correlation coefficient (PCC)
Weighted Pearson correlation coefficient (WPCC)
Sandelin-Wasserman similarity (SW), or sum of squared distances (Sandelin and Wasserman 2004)
Average log-likelihood ratio (ALLR) (Wang and Stormo 2003)
Lower limit ALLR (ALLR_LL) (Mahony et al. 2007)
Bhattacharyya coefficient (BHAT) (Bhattacharyya 1943)
Comparisons are calculated between two motifs at a time. All possible alignments
are scored, and the best score is reported. In an alignment scores are calculated
individually between columns. How those scores are combined to generate the final
alignment scores depends on score.strat.
See the "Motif comparisons and P-values" vignette for a description of the
various metrics. Note that PCC, WPCC, SW, ALLR, ALLR_LL and BHAT
are similarities;
higher values mean more similar motifs. For the remaining metrics, values closer
to zero represent more similar motifs.
Small pseudocounts are automatically added when one of the following methods
is used: KL, ALLR, ALLR_LL, IS. This is avoid
zeros in the calculations.
To note regarding p-values: P-values are pre-computed using the
make_DBscores() function. If not given, then uses a set of internal
precomputed P-values from the JASPAR2018 CORE motifs. These precalculated
scores are dependent on the length of the motifs being compared. This takes
into account that comparing small motifs with larger motifs leads to higher
scores, since the probability of finding a higher scoring alignment is
higher.
The default P-values have been precalculated for regular DNA motifs. They
are of little use for motifs with a different number of alphabet letters
(or even the multifreq slot).
A list of matrices containing the similarity scores between motifs from different groups of sequences. The order of comparison is as follows, with first element representing the rows and second element representing the columns of the matrix:
1. Common motifs comparison: Common seqs from reference (1) <-> comparison (2)
2. Unique motifs comparison: Unique seqs from reference (1) <-> comparison (2)
3. Cross motifs comparison 1: Unique seqs from reference (1) <-> comparison (1)
4. Cross motifs comparison 2: Unique seqs from comparison (2) <-> reference (1)
The list is repeated for each set of comparison groups in input.
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") data("CTCF_ChIP_peaks", package = "MotifPeeker") if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 segregated_peaks <- segregate_seqs(CTCF_TIP_peaks, CTCF_ChIP_peaks) denovo_motifs <- denovo_motifs(unlist(segregated_peaks), trim_seq_width = 50, genome_build = genome_build, discover_motifs_count = 1, filter_n = 6, maxw = 8, minw = 8, out_dir = tempdir()) similarity_matrices <- motif_similarity(denovo_motifs) print(similarity_matrices) } }if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") data("CTCF_ChIP_peaks", package = "MotifPeeker") if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 segregated_peaks <- segregate_seqs(CTCF_TIP_peaks, CTCF_ChIP_peaks) denovo_motifs <- denovo_motifs(unlist(segregated_peaks), trim_seq_width = 50, genome_build = genome_build, discover_motifs_count = 1, filter_n = 6, maxw = 8, minw = 8, out_dir = tempdir()) similarity_matrices <- motif_similarity(denovo_motifs) print(similarity_matrices) } }
This function compares different epigenomic datasets using motif enrichment as the key metric. The output is an easy-to-interpret HTML document with the results. The report contains three main sections: (1) General Metrics on peak and alignment files (if provided), (2) Known Motif Enrichment Analysis and (3) Discovered Motif Enrichment Analysis.
MotifPeeker( peak_files, reference_index = 1, alignment_files = NULL, exp_labels = NULL, exp_type = NULL, genome_build, motif_files = NULL, motif_labels = NULL, cell_counts = NULL, distance_bootstrap = TRUE, bootstrap_n = 500, bootstrap_len = NULL, motif_discovery = TRUE, motif_discovery_count = 3, filter_n = 6, trim_seq_width = NULL, motif_db = NULL, download_buttons = TRUE, meme_path = NULL, out_dir = tempdir(), save_runfiles = FALSE, display = if (interactive()) "browser", BPPARAM = BiocParallel::SerialParam(), quiet = TRUE, debug = FALSE, verbose = FALSE )MotifPeeker( peak_files, reference_index = 1, alignment_files = NULL, exp_labels = NULL, exp_type = NULL, genome_build, motif_files = NULL, motif_labels = NULL, cell_counts = NULL, distance_bootstrap = TRUE, bootstrap_n = 500, bootstrap_len = NULL, motif_discovery = TRUE, motif_discovery_count = 3, filter_n = 6, trim_seq_width = NULL, motif_db = NULL, download_buttons = TRUE, meme_path = NULL, out_dir = tempdir(), save_runfiles = FALSE, display = if (interactive()) "browser", BPPARAM = BiocParallel::SerialParam(), quiet = TRUE, debug = FALSE, verbose = FALSE )
peak_files |
A character vector of path to peak files, or a vector of
GRanges objects generated using
ENCODE file IDs can also be provided to automatically fetch peak file(s) from the ENCODE database. |
reference_index |
An integer specifying the index of the peak file to use as the reference dataset for comparison. Indexing starts from 1. (default = 1) |
alignment_files |
A character vector of path to alignment files, or a
vector of |
exp_labels |
A character vector of labels for each peak file. (optional) If not provided, capital letters will be used as labels in the report. |
exp_type |
A character vector of experimental types for each peak file. (optional) Useful for comparison of different methods. If not provided, all datasets will be classified as "unknown" experiment types in the report. Supported experimental types are:
|
genome_build |
A character string with the abbreviated genome build name, or a BSGenome object. Check check_genome_build details for genome builds which can be imported as abbreviated names. |
motif_files |
A character vector of path to motif files, or a vector of
|
motif_labels |
A character vector of labels for each motif file.
(optional) Only used if path to file names are passed in
|
cell_counts |
An integer vector of experiment cell counts for each peak file. (optional) Creates additional comparisons based on cell counts. |
distance_bootstrap |
A logical indicating whether to perform bootstrap
analysis for motif-peak summit distances. (default = TRUE) If |
bootstrap_n |
An integer specifying the number of bootstrap samples to
generate. (default = 500) If |
bootstrap_len |
An integer specifying the length of sequences to sample
in each bootstrap iteration. (default = NULL) If |
motif_discovery |
A logical indicating whether to perform motif discovery for the third section of the report. (default = TRUE) |
motif_discovery_count |
An integer specifying the number of motifs to discover. (default = 3) Note that higher values take longer to compute. |
filter_n |
An integer specifying the number of consecutive nucleotide repeats a discovered motif must contain to be filtered out. (default = 6) |
trim_seq_width |
An integer specifying the width of the sequence to extract around the summit (default = NULL). This sequence is used to search for discovered motifs. If not provided, the entire peak region will be used. This parameter is intended to reduce the search space and speed up motif discovery; therefore, a value less than the average peak width is recommended. Peaks are trimmed symmetrically around the summit while respecting the peak bounds. |
motif_db |
Path to |
download_buttons |
A logical indicating whether to include download buttons for various files within the HTML report. (default = TRUE) |
meme_path |
path to |
out_dir |
A character string specifying the directory to save the
output files. (default = |
save_runfiles |
A logical indicating whether to save intermediate files generated during the run, such as those from FIMO and AME. (default = FALSE) |
display |
A character vector specifying the display mode for the HTML report once it is generated. (default = NULL) Options are:
|
BPPARAM |
A
IMPORTANT: For each worker, please ensure a minimum of 8GB of
memory (RAM) is available as |
quiet |
A logical indicating whether to print markdown knit messages. (default = FALSE) |
debug |
A logical indicating whether to print debug/error messages in the HTML report. (default = FALSE) |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
Runtime guidance: For 4 datasets, the runtime is approximately 3 minutes with motif_discovery disabled. However, motif discovery can take hours to complete. To make computation faster, we highly recommend tuning the following arguments:
BPPARAM=MulticoreParam(x)Running motif discovery in parallel can significantly reduce runtime, but it is very memory-intensive, consuming 10+GB of RAM per thread. Memory starvation can greatly slow the process, so set the number of cores with caution.
motif_discovery_countThe number of motifs to discover per sequence group exponentially increases runtime. We recommend no more than 5 motifs to make a meaningful inference.
trim_seq_widthTrimming sequences before running
motif discovery can significantly reduce the search space. Sequence
length can exponentially increase runtime. We recommend running the
script with motif_discovery = FALSE and studying the
motif-summit distance distribution under general metrics to find the
sequence length that captures most motifs. A good starting point is 150
but it can be reduced further if appropriate.
Path to the output directory.
Running motif discovery is computationally expensive and can
require from minutes to hours. denovo_motifs can widely affect the
runtime (higher values take longer). Setting trim_seq_width to a lower
value can also reduce the runtime significantly.
peaks <- list( system.file("extdata", "CTCF_ChIP_peaks.narrowPeak", package = "MotifPeeker"), system.file("extdata", "CTCF_TIP_peaks.narrowPeak", package = "MotifPeeker") ) alignments <- list( system.file("extdata", "CTCF_ChIP_alignment.bam", package = "MotifPeeker"), system.file("extdata", "CTCF_TIP_alignment.bam", package = "MotifPeeker") ) motifs <- list( system.file("extdata", "motif_MA1930.2.jaspar", package = "MotifPeeker"), system.file("extdata", "motif_MA1102.3.jaspar", package = "MotifPeeker") ) if (memes::meme_is_installed()) { MotifPeeker( peak_files = peaks, reference_index = 2, alignment_files = alignments, exp_labels = c("ChIP", "TIP"), exp_type = c("chipseq", "tipseq"), genome_build = "hg38", motif_files = motifs, motif_labels = NULL, cell_counts = NULL, distance_bootstrap = TRUE, bootstrap_n = 2, bootstrap_len = 40, motif_discovery = TRUE, motif_discovery_count = 2, motif_db = NULL, download_buttons = TRUE, out_dir = tempdir(), debug = FALSE, quiet = TRUE, verbose = FALSE ) }peaks <- list( system.file("extdata", "CTCF_ChIP_peaks.narrowPeak", package = "MotifPeeker"), system.file("extdata", "CTCF_TIP_peaks.narrowPeak", package = "MotifPeeker") ) alignments <- list( system.file("extdata", "CTCF_ChIP_alignment.bam", package = "MotifPeeker"), system.file("extdata", "CTCF_TIP_alignment.bam", package = "MotifPeeker") ) motifs <- list( system.file("extdata", "motif_MA1930.2.jaspar", package = "MotifPeeker"), system.file("extdata", "motif_MA1102.3.jaspar", package = "MotifPeeker") ) if (memes::meme_is_installed()) { MotifPeeker( peak_files = peaks, reference_index = 2, alignment_files = alignments, exp_labels = c("ChIP", "TIP"), exp_type = c("chipseq", "tipseq"), genome_build = "hg38", motif_files = motifs, motif_labels = NULL, cell_counts = NULL, distance_bootstrap = TRUE, bootstrap_n = 2, bootstrap_len = 40, motif_discovery = TRUE, motif_discovery_count = 2, motif_db = NULL, download_buttons = TRUE, out_dir = tempdir(), debug = FALSE, quiet = TRUE, verbose = FALSE ) }
read_motif_file() reads a motif file and converts to a PWM. The
function supports multiple motif formats, including "homer", "jaspar",
"meme", "transfac" and "uniprobe".
read_motif_file(motif_file, file_format = "auto", verbose = FALSE)read_motif_file(motif_file, file_format = "auto", verbose = FALSE)
motif_file |
Path to a motif file or a
|
file_format |
Character string specifying the format of the motif file. The options are "homer", "jaspar", "meme", "transfac" and "uniprobe" |
verbose |
A logical indicating whether to print messages. |
A universalmotif motif object.
motif_file <- system.file("extdata", "motif_MA1930.2.jaspar", package = "MotifPeeker") res <- read_motif_file(motif_file = motif_file, file_format = "jaspar") print(res)motif_file <- system.file("extdata", "motif_MA1930.2.jaspar", package = "MotifPeeker") res <- read_motif_file(motif_file = motif_file, file_format = "jaspar") print(res)
This function reads a MACS2/3 narrowPeak or SEACR BED peak file and returns a GRanges object with the peak coordinates and summit.
read_peak_file(peak_file, file_format = "auto", verbose = FALSE)read_peak_file(peak_file, file_format = "auto", verbose = FALSE)
peak_file |
A character string with the path to the peak file, or a
GRanges object created using |
file_format |
A character string specifying the format of the peak file.
|
verbose |
A logical indicating whether to print messages. |
The summit column is the absolute genomic position of the peak, which is relative to the start position of the sequence range. For SEACR BED files, the summit column is calculated as the midpoint of the max signal region.
A GRanges-class object with the peak coordinates and summit.
GRanges-class for more information on GRanges objects.
macs3_peak_file <- system.file("extdata", "CTCF_ChIP_peaks.narrowPeak", package = "MotifPeeker") macs3_peak_read <- read_peak_file(macs3_peak_file) macs3_peak_readmacs3_peak_file <- system.file("extdata", "CTCF_ChIP_peaks.narrowPeak", package = "MotifPeeker") macs3_peak_read <- read_peak_file(macs3_peak_file) macs3_peak_read
This function saves a peak object to a file in BED4 format. The included
columns are: chr, start, end, and name. Since
no strand data is being included, it is recommended to use this function
only for peak objects that do not have strand information.
save_peak_file( peak_obj, save = TRUE, filename = random_string(10), out_dir = tempdir() )save_peak_file( peak_obj, save = TRUE, filename = random_string(10), out_dir = tempdir() )
peak_obj |
A GRanges object with the peak coordinates. Must include
columns: |
save |
A logical indicating whether to save the peak object to a file. |
filename |
A character string of the file name. If the file extension
is not |
out_dir |
A character string of the output directory. (default = tempdir()) |
If save = FALSE, a data frame with the peak coordinates. If
save = TRUE, the path to the saved file.
data("CTCF_ChIP_peaks", package = "MotifPeeker") out <- save_peak_file(CTCF_ChIP_peaks, save = TRUE, "test_peak_file.bed") print(out)data("CTCF_ChIP_peaks", package = "MotifPeeker") out <- save_peak_file(CTCF_ChIP_peaks, save = TRUE, "test_peak_file.bed") print(out)
This function takes two sets of sequences and segregates them into common and unique sequences. The common sequences are sequences that are present in both sets of sequences. The unique sequences are sequences that are present in only one of the sets of sequences.
segregate_seqs(seqs1, seqs2)segregate_seqs(seqs1, seqs2)
seqs1 |
A set of sequences ( |
seqs2 |
A set of sequences ( |
Sequences are considered common if their base pairs align in any position, even if they vary in length. Consequently, while the number of common sequences remains consistent between both sets, but the length and composition of these sequences may differ. As a result, the function returns distinct sets of common sequences for each input set of sequences.
A list containing the common sequences and unique sequences for each
set of sequences. The list contains the following GRanges objects:
common_seqs1: Common sequences in seqs1
common_seqs2: Common sequences in seqs2
unique_seqs1: Unique sequences in seqs1
unique_seqs2: Unique sequences in seqs2
data("CTCF_ChIP_peaks", package = "MotifPeeker") data("CTCF_TIP_peaks", package = "MotifPeeker") seqs1 <- CTCF_ChIP_peaks seqs2 <- CTCF_TIP_peaks res <- segregate_seqs(seqs1, seqs2) print(res)data("CTCF_ChIP_peaks", package = "MotifPeeker") data("CTCF_TIP_peaks", package = "MotifPeeker") seqs1 <- CTCF_ChIP_peaks seqs2 <- CTCF_TIP_peaks res <- segregate_seqs(seqs1, seqs2) print(res)
summit_to_motif() calculates the distance between each motif and its
nearest peak summit. runFimo from the memes package is used to
recover the locations of each motif.
summit_to_motif( peak_input, motif, fp_rate = 0.05, genome_build, out_dir = tempdir(), meme_path = NULL, verbose = FALSE, ... )summit_to_motif( peak_input, motif, fp_rate = 0.05, genome_build, out_dir = tempdir(), meme_path = NULL, verbose = FALSE, ... )
peak_input |
Either a path to the narrowPeak file or a GRanges peak
object generated by |
motif |
An object of class |
fp_rate |
The desired false-positive rate. A p-value threshold will be selected based on this value. The default false-positive rate is 0.05. |
genome_build |
The genome build that the peak sequences should be derived from. |
out_dir |
Location to save the 0-order background file. By default, the background file will be written to a temporary directory. |
meme_path |
path to "meme/bin/" (default: |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
... |
Arguments passed on to
|
To calculate the p-value threshold for a desired false-positive rate, we use the approximate formula:
(Dervied from FIMO documentation)
A list containing an expanded GRanges peak object with metadata columns relating to motif positions along with a vector of summit-to-motif distances for each valid peak.
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") res <- summit_to_motif( peak_input = CTCF_TIP_peaks, motif = motif_MA1102.3, fp_rate = 5e-02, genome_build = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 ) print(res) }if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") res <- summit_to_motif( peak_input = CTCF_TIP_peaks, motif = motif_MA1102.3, fp_rate = 5e-02, genome_build = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 ) print(res) }