Title: | Benchmarking Epigenomic Profiling Methods Using Motif Enrichment |
---|---|
Description: | MotifPeeker is used to compare and analyse datasets from epigenomic profiling methods with motif enrichment as the key benchmark. The package outputs an HTML report consisting of three sections: (1. General Metrics) Overview of peaks-related general metrics for the datasets (FRiP scores, peak widths and motif-summit distances). (2. Known Motif Enrichment Analysis) Statistics for the frequency of user-provided motifs enriched in the datasets. (3. De-Novo Motif Enrichment Analysis) Statistics for the frequency of de-novo discovered motifs enriched in the datasets and compared with known motifs. |
Authors: | Hiranyamaya Dash [cre, aut] , Thomas Roberts [aut] , Maria Weinert [aut] , Nathan Skene [aut] |
Maintainer: | Hiranyamaya Dash <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.99.13 |
Built: | 2024-12-20 03:35:55 UTC |
Source: | https://github.com/bioc/MotifPeeker |
Calculate the Fraction of Reads in Peak score from the read and peak file of an experiment.
calc_frip(read_file, peak_file, single_end = TRUE, total_reads = NULL)
calc_frip(read_file, peak_file, single_end = TRUE, total_reads = NULL)
read_file |
A BamFile object. |
peak_file |
A GRanges object. |
single_end |
A logical value. If TRUE, the reads classified as single-ended. (default = TRUE) |
total_reads |
(optional) The total number of reads in the experiment. Skips counting the total number of reads if provided, saving computation. |
The FRiP score is calculated as follows:
A numeric value indicating the FRiP score.
read_file <- system.file("extdata", "CTCF_ChIP_alignment.bam", package = "MotifPeeker") read_file <- Rsamtools::BamFile(read_file) data("CTCF_ChIP_peaks", package = "MotifPeeker") calc_frip(read_file, CTCF_ChIP_peaks)
read_file <- system.file("extdata", "CTCF_ChIP_alignment.bam", package = "MotifPeeker") read_file <- Rsamtools::BamFile(read_file) data("CTCF_ChIP_peaks", package = "MotifPeeker") calc_frip(read_file, CTCF_ChIP_peaks)
Check and get files from ENCODE project. Requires the input to be in ENCODE ID format. Uses BiocFileCache to cache downloads. Only works for files.
check_ENCODE(encode_id, expect_format, verbose = FALSE)
check_ENCODE(encode_id, expect_format, verbose = FALSE)
encode_id |
A character string specifying the ENCODE ID. |
expect_format |
A character string (or a vector) specifying the expected format(s) of the file. If the file is not in the expected format, an error is thrown. |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
A character string specifying the path to the downloaded file.
if (requireNamespace("curl", quietly = TRUE) && requireNamespace("jsonlite", quietly = TRUE)) { check_ENCODE("ENCFF920TXI", expect_format = c("bed", "gz")) }
if (requireNamespace("curl", quietly = TRUE) && requireNamespace("jsonlite", quietly = TRUE)) { check_ENCODE("ENCFF920TXI", expect_format = c("bed", "gz")) }
Check if the genome build is valid and return the appropriate BSGenome object.
check_genome_build(genome_build)
check_genome_build(genome_build)
genome_build |
A character string with the abbreviated genome build
name, or a BSGenome object. At the moment, only |
A BSGenome object.
BSgenome-class for more information on BSGenome objects.
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { check_genome_build("hg38") }
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { check_genome_build("hg38") }
Check and get files from JASPAR. Requires the input to be in JASPAR ID format. Uses BiocFileCache to cache downloads.
check_JASPAR(motif_id, verbose = FALSE)
check_JASPAR(motif_id, verbose = FALSE)
motif_id |
A character string specifying the JASPAR motif ID. |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
A character string specifying the path to the downloaded file.
check_JASPAR("MA1930.2")
check_JASPAR("MA1930.2")
Human CTCF peak file generated with ChIP-seq using HCT116 cell-line. No control files were used to generate the peak file.
data("CTCF_ChIP_peaks")
data("CTCF_ChIP_peaks")
An object of class GRanges
of length 209.
To reduce the size of the package, the included peak file focuses on specific genomic regions. The subset region included is chr10:65,654,529-74,841,155.
Human CTCF peak file generated with TIP-seq using HCT116 cell-line. The peak file was generated using the nf-core/cutandrun pipeline. Raw read files were sourced from NIH Sequence Read Archives (ID: SRR16963166).
data("CTCF_TIP_peaks")
data("CTCF_TIP_peaks")
An object of class GRanges
of length 182.
To reduce the size of the package, the included peak file focuses on specific genomic regions. The subset region included is chr10:65,654,529-74,841,155.
Use STREME from MEME suite to find motifs in the provided sequences. To speed up the process, the sequences can be optionally trimmed to reduce the search space. The result is then optionally filtered to remove motifs with a high number of nucleotide repeats
denovo_motifs( seqs, trim_seq_width, genome_build, discover_motifs_count = 3, minw = 8, maxw = 25, filter_n = 6, out_dir = tempdir(), meme_path = NULL, BPPARAM = BiocParallel::SerialParam(), verbose = FALSE, debug = FALSE, ... )
denovo_motifs( seqs, trim_seq_width, genome_build, discover_motifs_count = 3, minw = 8, maxw = 25, filter_n = 6, out_dir = tempdir(), meme_path = NULL, BPPARAM = BiocParallel::SerialParam(), verbose = FALSE, debug = FALSE, ... )
seqs |
A list of |
trim_seq_width |
An integer specifying the width of the sequence to extract around the summit (default = NULL). This sequence is used to search for discovered motifs. If not provided, the entire peak region will be used. This parameter is intended to reduce the search space and speed up motif discovery; therefore, a value less than the average peak width is recommended. Peaks are trimmed symmetrically around the summit while respecting the peak bounds. |
genome_build |
The genome build that the peak sequences should be derived from. |
discover_motifs_count |
An integer specifying the number of motifs to discover. (default = 3) Note that higher values take longer to compute. |
minw |
An integer specifying the minimum width of the motif. (default = 8) |
maxw |
An integer specifying the maximum width of the motif. (default = 25) |
filter_n |
An integer specifying the number of consecutive nucleotide repeats a discovered motif must contain to be filtered out. (default = 6) |
out_dir |
A |
meme_path |
path to "meme/bin/" (default: |
BPPARAM |
A |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
debug |
A logical indicating whether to print debug messages while running the function. (default = FALSE) |
... |
Additional arguments to pass to |
A list of universalmotif
objects and
associated metadata.
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 res <- denovo_motifs(list(CTCF_TIP_peaks), trim_seq_width = 50, genome_build = genome_build, discover_motifs_count = 1, filter_n = 6, minw = 8, maxw = 8, out_dir = tempdir()) print(res[[1]]$consensus) } }
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 res <- denovo_motifs(list(CTCF_TIP_peaks), trim_seq_width = 50, genome_build = genome_build, discover_motifs_count = 1, filter_n = 6, minw = 8, maxw = 8, out_dir = tempdir()) print(res[[1]]$consensus) } }
Search through provided motif database to find similar motifs to the input.
Light wrapper around TOMTOM
from MEME Suite.
find_motifs( streme_out, motif_db, out_dir = tempdir(), meme_path = NULL, BPPARAM = BiocParallel::bpparam(), verbose = FALSE, debug = FALSE, ... )
find_motifs( streme_out, motif_db, out_dir = tempdir(), meme_path = NULL, BPPARAM = BiocParallel::bpparam(), verbose = FALSE, debug = FALSE, ... )
streme_out |
Output from |
motif_db |
Path to |
out_dir |
A |
meme_path |
path to "meme/bin/" (default: |
BPPARAM |
A |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
debug |
A logical indicating whether to print debug messages while running the function. (default = FALSE) |
... |
Additional arguments to pass to |
data.frame of match results. Contains best_match_motif
column of
universalmotif
objects with the matched PWM from the database, a series
of best_match_*
columns describing the TomTom results of the match, and a
tomtom
list column storing the ranked list of possible matches to each
motif. If a universalmotif data.frame is used as input, these columns are
appended to the data.frame. If no matches are returned, tomtom
and
best_match_motif
columns will be set to NA
and a message indicating
this will print.
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 res <- denovo_motifs(list(CTCF_TIP_peaks), trim_seq_width = 50, genome_build = genome_build, discover_motifs_count = 1, filter_n = 10, out_dir = tempdir()) res2 <- find_motifs(res, motif_db = get_JASPARCORE(), out_dir = tempdir()) print(res2) } }
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 res <- denovo_motifs(list(CTCF_TIP_peaks), trim_seq_width = 50, genome_build = genome_build, discover_motifs_count = 1, filter_n = 10, out_dir = tempdir()) res2 <- find_motifs(res, motif_db = get_JASPARCORE(), out_dir = tempdir()) print(res2) } }
Wrapper for 'MotifPeeker::summit_to_motif' to get motif-summit distances
for all peaks and motifs, generating a data.frame
suitable
for plots.
get_df_distances( result, user_motifs, genome_build, out_dir = tempdir(), BPPARAM = BiocParallel::bpparam(), meme_path = NULL, verbose = FALSE )
get_df_distances( result, user_motifs, genome_build, out_dir = tempdir(), BPPARAM = BiocParallel::bpparam(), meme_path = NULL, verbose = FALSE )
result |
A
|
user_motifs |
A
|
genome_build |
A character string with the abbreviated genome build
name, or a BSGenome object. At the moment, only |
out_dir |
A |
BPPARAM |
A
IMPORTANT: For each worker, please ensure a minimum of 8GB of
memory (RAM) is available as |
meme_path |
path to |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
A data.frame
with the following columns:
Experiment labels.
Experiment types.
Motif indices.
Distances between peak summit and motif.
Other generate data.frames:
get_df_enrichment()
if (memes::meme_is_installed()) { data("CTCF_ChIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") data("motif_MA1930.2", package = "MotifPeeker") input <- list( peaks = CTCF_ChIP_peaks, exp_type = "ChIP", exp_labels = "CTCF", read_count = 150, peak_count = 100 ) motifs <- list( motifs = list(motif_MA1930.2, motif_MA1102.3), motif_labels = list("MA1930.2", "MA1102.3") ) if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 distances_df <- get_df_distances(input, motifs, genome_build) print(distances_df) } }
if (memes::meme_is_installed()) { data("CTCF_ChIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") data("motif_MA1930.2", package = "MotifPeeker") input <- list( peaks = CTCF_ChIP_peaks, exp_type = "ChIP", exp_labels = "CTCF", read_count = 150, peak_count = 100 ) motifs <- list( motifs = list(motif_MA1930.2, motif_MA1102.3), motif_labels = list("MA1930.2", "MA1102.3") ) if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 distances_df <- get_df_distances(input, motifs, genome_build) print(distances_df) } }
Wrapper for 'MotifPeeker::motif_enrichment' to get motif enrichment counts
and percentages for all peaks and motifs, generating a data.frame
suitable for plots. The data.frame
contains values for all and
segregated peaks.
get_df_enrichment( result, segregated_peaks, user_motifs, genome_build, reference_index = 1, out_dir = tempdir(), BPPARAM = BiocParallel::bpparam(), meme_path = NULL, verbose = FALSE )
get_df_enrichment( result, segregated_peaks, user_motifs, genome_build, reference_index = 1, out_dir = tempdir(), BPPARAM = BiocParallel::bpparam(), meme_path = NULL, verbose = FALSE )
result |
A
|
segregated_peaks |
A |
user_motifs |
A
|
genome_build |
A character string with the abbreviated genome build
name, or a BSGenome object. At the moment, only |
reference_index |
An integer specifying the index of the peak file to use as the reference dataset for comparison. Indexing starts from 1. (default = 1) |
out_dir |
A |
BPPARAM |
A
IMPORTANT: For each worker, please ensure a minimum of 8GB of
memory (RAM) is available as |
meme_path |
path to |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
A data.frame
with the following columns:
Experiment labels.
Experiment types.
Motif indices.
Segregated group- "all", "Common" or "Unique".
"reference" or "comparison" group.
Number of peaks with motif.
Number of peaks without motif.
Percentage of peaks with motif.
Percentage of peaks without motif.
Other generate data.frames:
get_df_distances()
if (memes::meme_is_installed()) { data("CTCF_ChIP_peaks", package = "MotifPeeker") data("CTCF_TIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") data("motif_MA1930.2", package = "MotifPeeker") input <- list( peaks = list(CTCF_ChIP_peaks, CTCF_TIP_peaks), exp_type = c("ChIP", "TIP"), exp_labels = c("CTCF_ChIP", "CTCF_TIP"), read_count = c(150, 200), peak_count = c(100, 120) ) segregated_input <- segregate_seqs(input$peaks[[1]], input$peaks[[2]]) motifs <- list( motifs = list(motif_MA1930.2, motif_MA1102.3), motif_labels = list("MA1930.2", "MA1102.3") ) reference_index <- 1 if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 enrichment_df <- get_df_enrichment( input, segregated_input, motifs, genome_build, reference_index = 1 ) } }
if (memes::meme_is_installed()) { data("CTCF_ChIP_peaks", package = "MotifPeeker") data("CTCF_TIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") data("motif_MA1930.2", package = "MotifPeeker") input <- list( peaks = list(CTCF_ChIP_peaks, CTCF_TIP_peaks), exp_type = c("ChIP", "TIP"), exp_labels = c("CTCF_ChIP", "CTCF_TIP"), read_count = c(150, 200), peak_count = c(100, 120) ) segregated_input <- segregate_seqs(input$peaks[[1]], input$peaks[[2]]) motifs <- list( motifs = list(motif_MA1930.2, motif_MA1102.3), motif_labels = list("MA1930.2", "MA1102.3") ) reference_index <- 1 if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 enrichment_df <- get_df_enrichment( input, segregated_input, motifs, genome_build, reference_index = 1 ) } }
Downloads JASPAR CORE database in meme
format for all available
taxonomic groups. Uses BiocFileCache to cache downloads.
get_JASPARCORE(verbose = FALSE)
get_JASPARCORE(verbose = FALSE)
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
A character string specifying the path to the downloaded file
(meme
format).
get_JASPARCORE()
get_JASPARCORE()
motif_enrichment()
calculates motif enrichment relative to a set of
background sequences using Analysis of Motif Enrichment (AME) from
memes.
motif_enrichment( peak_input, motif, genome_build, out_dir = tempdir(), verbose = FALSE, meme_path = NULL, ... )
motif_enrichment( peak_input, motif, genome_build, out_dir = tempdir(), verbose = FALSE, meme_path = NULL, ... )
peak_input |
Either a path to the narrowPeak file or a GRanges peak
object generated by |
motif |
An object of class |
genome_build |
The genome build that the peak sequences should be derived from. |
out_dir |
Location to save the 0-order background file along with the AME output files. |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
meme_path |
path to "meme/bin/" (default: |
... |
Arguments passed on to
|
A list containing a AME results data frame and a numeric referring to the proportion of peaks with a motif.
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") res <- motif_enrichment( peak_input = CTCF_TIP_peaks, motif = motif_MA1102.3, genome_build = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38, ) print(res) }
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") res <- motif_enrichment( peak_input = CTCF_TIP_peaks, motif = motif_MA1102.3, genome_build = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38, ) print(res) }
The motif file contains the JASPAR motif for CTCFL (MA1102.3) for
Homo Sapiens. This is one of the two motif files used to demonstrate
MotifPeeker
's known-motif analysis functionality.
data("motif_MA1102.3")
data("motif_MA1102.3")
An object of class universalmotif
of length 1.
The motif file contains the JASPAR motif for CTCF (MA1930.2) for
Homo Sapiens. This is one of the two motif files used to demonstrate
MotifPeeker
's known-motif analysis functionality.
data("motif_MA1930.2")
data("motif_MA1930.2")
An object of class universalmotif
of length 1.
Compute motif similarity scores between motifs discovered from segregated
sequences. Wrapper around compare_motifs
to
compare motifs from different groups of sequences. To see the possible
similarity measures available, refer to details.
motif_similarity( streme_out, method = "PCC", normalise.scores = TRUE, BPPARAM = BiocParallel::bpparam(), ... )
motif_similarity( streme_out, method = "PCC", normalise.scores = TRUE, BPPARAM = BiocParallel::bpparam(), ... )
streme_out |
Output from |
method |
|
normalise.scores |
|
BPPARAM |
A |
... |
Arguments passed on to
|
The following metrics are available:
Euclidean distance (EUCL
) (Choi et al. 2004)
Weighted Euclidean distance (WEUCL
)
Kullback-Leibler divergence (KL
) (Kullback and Leibler 1951; Roepcke et al. 2005)
Hellinger distance (HELL
) (Hellinger 1909)
Squared Euclidean distance (SEUCL
)
Manhattan distance (MAN
)
Pearson correlation coefficient (PCC
)
Weighted Pearson correlation coefficient (WPCC
)
Sandelin-Wasserman similarity (SW
), or sum of squared distances (Sandelin and Wasserman 2004)
Average log-likelihood ratio (ALLR
) (Wang and Stormo 2003)
Lower limit ALLR (ALLR_LL
) (Mahony et al. 2007)
Bhattacharyya coefficient (BHAT
) (Bhattacharyya 1943)
Comparisons are calculated between two motifs at a time. All possible alignments
are scored, and the best score is reported. In an alignment scores are calculated
individually between columns. How those scores are combined to generate the final
alignment scores depends on score.strat
.
See the "Motif comparisons and P-values" vignette for a description of the
various metrics. Note that PCC
, WPCC
, SW
, ALLR
, ALLR_LL
and BHAT
are similarities;
higher values mean more similar motifs. For the remaining metrics, values closer
to zero represent more similar motifs.
Small pseudocounts are automatically added when one of the following methods
is used: KL
, ALLR
, ALLR_LL
, IS
. This is avoid
zeros in the calculations.
To note regarding p-values: P-values are pre-computed using the
make_DBscores()
function. If not given, then uses a set of internal
precomputed P-values from the JASPAR2018 CORE motifs. These precalculated
scores are dependent on the length of the motifs being compared. This takes
into account that comparing small motifs with larger motifs leads to higher
scores, since the probability of finding a higher scoring alignment is
higher.
The default P-values have been precalculated for regular DNA motifs. They
are of little use for motifs with a different number of alphabet letters
(or even the multifreq
slot).
A list of matrices containing the similarity scores between motifs from different groups of sequences. The order of comparison is as follows, with first element representing the rows and second element representing the columns of the matrix:
1. Common motifs comparison: Common seqs from reference (1) <-> comparison (2)
2. Unique motifs comparison: Unique seqs from reference (1) <-> comparison (2)
3. Cross motifs comparison 1: Unique seqs from reference (1) <-> comparison (1)
4. Cross motifs comparison 2: Unique seqs from comparison (2) <-> reference (1)
The list is repeated for each set of comparison groups in input.
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") data("CTCF_ChIP_peaks", package = "MotifPeeker") if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 segregated_peaks <- segregate_seqs(CTCF_TIP_peaks, CTCF_ChIP_peaks) denovo_motifs <- denovo_motifs(unlist(segregated_peaks), trim_seq_width = 50, genome_build = genome_build, discover_motifs_count = 1, filter_n = 6, maxw = 8, minw = 8, out_dir = tempdir()) similarity_matrices <- motif_similarity(denovo_motifs) print(similarity_matrices) } }
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") data("CTCF_ChIP_peaks", package = "MotifPeeker") if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38")) { genome_build <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 segregated_peaks <- segregate_seqs(CTCF_TIP_peaks, CTCF_ChIP_peaks) denovo_motifs <- denovo_motifs(unlist(segregated_peaks), trim_seq_width = 50, genome_build = genome_build, discover_motifs_count = 1, filter_n = 6, maxw = 8, minw = 8, out_dir = tempdir()) similarity_matrices <- motif_similarity(denovo_motifs) print(similarity_matrices) } }
This function compares different epigenomic datasets using motif enrichment as the key metric. The output is an easy-to-interpret HTML document with the results. The report contains three main sections: (1) General Metrics on peak and alignment files (if provided), (2) Known Motif Enrichment Analysis and (3) Discovered Motif Enrichment Analysis.
MotifPeeker( peak_files, reference_index = 1, alignment_files = NULL, exp_labels = NULL, exp_type = NULL, genome_build, motif_files = NULL, motif_labels = NULL, cell_counts = NULL, motif_discovery = TRUE, motif_discovery_count = 3, filter_n = 6, trim_seq_width = NULL, motif_db = NULL, download_buttons = TRUE, meme_path = NULL, out_dir = tempdir(), save_runfiles = FALSE, display = if (interactive()) "browser", BPPARAM = BiocParallel::SerialParam(), quiet = TRUE, debug = FALSE, verbose = FALSE )
MotifPeeker( peak_files, reference_index = 1, alignment_files = NULL, exp_labels = NULL, exp_type = NULL, genome_build, motif_files = NULL, motif_labels = NULL, cell_counts = NULL, motif_discovery = TRUE, motif_discovery_count = 3, filter_n = 6, trim_seq_width = NULL, motif_db = NULL, download_buttons = TRUE, meme_path = NULL, out_dir = tempdir(), save_runfiles = FALSE, display = if (interactive()) "browser", BPPARAM = BiocParallel::SerialParam(), quiet = TRUE, debug = FALSE, verbose = FALSE )
peak_files |
A character vector of path to peak files, or a vector of
GRanges objects generated using
ENCODE file IDs can also be provided to automatically fetch peak file(s) from the ENCODE database. |
reference_index |
An integer specifying the index of the peak file to use as the reference dataset for comparison. Indexing starts from 1. (default = 1) |
alignment_files |
A character vector of path to alignment files, or a
vector of |
exp_labels |
A character vector of labels for each peak file. (optional) If not provided, capital letters will be used as labels in the report. |
exp_type |
A character vector of experimental types for each peak file. (optional) Useful for comparison of different methods. If not provided, all datasets will be classified as "unknown" experiment types in the report. Supported experimental types are:
|
genome_build |
A character string with the abbreviated genome build
name, or a BSGenome object. At the moment, only |
motif_files |
A character vector of path to motif files, or a vector of
|
motif_labels |
A character vector of labels for each motif file.
(optional) Only used if path to file names are passed in
|
cell_counts |
An integer vector of experiment cell counts for each peak file. (optional) Creates additional comparisons based on cell counts. |
motif_discovery |
A logical indicating whether to perform motif discovery for the third section of the report. (default = TRUE) |
motif_discovery_count |
An integer specifying the number of motifs to discover. (default = 3) Note that higher values take longer to compute. |
filter_n |
An integer specifying the number of consecutive nucleotide repeats a discovered motif must contain to be filtered out. (default = 6) |
trim_seq_width |
An integer specifying the width of the sequence to extract around the summit (default = NULL). This sequence is used to search for discovered motifs. If not provided, the entire peak region will be used. This parameter is intended to reduce the search space and speed up motif discovery; therefore, a value less than the average peak width is recommended. Peaks are trimmed symmetrically around the summit while respecting the peak bounds. |
motif_db |
Path to |
download_buttons |
A logical indicating whether to include download buttons for various files within the HTML report. (default = TRUE) |
meme_path |
path to |
out_dir |
A character string specifying the directory to save the
output files. (default = |
save_runfiles |
A logical indicating whether to save intermediate files generated during the run, such as those from FIMO and AME. (default = FALSE) |
display |
A character vector specifying the display mode for the HTML report once it is generated. (default = NULL) Options are:
|
BPPARAM |
A
IMPORTANT: For each worker, please ensure a minimum of 8GB of
memory (RAM) is available as |
quiet |
A logical indicating whether to print markdown knit messages. (default = FALSE) |
debug |
A logical indicating whether to print debug/error messages in the HTML report. (default = FALSE) |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
Runtime guidance: For 4 datasets, the runtime is approximately 3 minutes with motif_discovery disabled. However, motif discovery can take hours to complete. To make computation faster, we highly recommend tuning the following arguments:
BPPARAM=MulticoreParam(x)
Running motif discovery in parallel can significantly reduce runtime, but it is very memory-intensive, consuming 10+GB of RAM per thread. Memory starvation can greatly slow the process, so set the number of cores with caution.
motif_discovery_count
The number of motifs to discover per sequence group exponentially increases runtime. We recommend no more than 5 motifs to make a meaningful inference.
trim_seq_width
Trimming sequences before running
motif discovery can significantly reduce the search space. Sequence
length can exponentially increase runtime. We recommend running the
script with motif_discovery = FALSE
and studying the
motif-summit distance distribution under general metrics to find the
sequence length that captures most motifs. A good starting point is 150
but it can be reduced further if appropriate.
Path to the output directory.
Running motif discovery is computationally expensive and can
require from minutes to hours. denovo_motifs
can widely affect the
runtime (higher values take longer). Setting trim_seq_width
to a lower
value can also reduce the runtime significantly.
peaks <- list( system.file("extdata", "CTCF_ChIP_peaks.narrowPeak", package = "MotifPeeker"), system.file("extdata", "CTCF_TIP_peaks.narrowPeak", package = "MotifPeeker") ) alignments <- list( system.file("extdata", "CTCF_ChIP_alignment.bam", package = "MotifPeeker"), system.file("extdata", "CTCF_TIP_alignment.bam", package = "MotifPeeker") ) motifs <- list( system.file("extdata", "motif_MA1930.2.jaspar", package = "MotifPeeker"), system.file("extdata", "motif_MA1102.3.jaspar", package = "MotifPeeker") ) if (memes::meme_is_installed()) { MotifPeeker( peak_files = peaks, reference_index = 2, alignment_files = alignments, exp_labels = c("ChIP", "TIP"), exp_type = c("chipseq", "tipseq"), genome_build = "hg38", motif_files = motifs, motif_labels = NULL, cell_counts = NULL, motif_discovery = TRUE, motif_discovery_count = 2, motif_db = NULL, download_buttons = TRUE, out_dir = tempdir(), debug = FALSE, quiet = TRUE, verbose = FALSE ) }
peaks <- list( system.file("extdata", "CTCF_ChIP_peaks.narrowPeak", package = "MotifPeeker"), system.file("extdata", "CTCF_TIP_peaks.narrowPeak", package = "MotifPeeker") ) alignments <- list( system.file("extdata", "CTCF_ChIP_alignment.bam", package = "MotifPeeker"), system.file("extdata", "CTCF_TIP_alignment.bam", package = "MotifPeeker") ) motifs <- list( system.file("extdata", "motif_MA1930.2.jaspar", package = "MotifPeeker"), system.file("extdata", "motif_MA1102.3.jaspar", package = "MotifPeeker") ) if (memes::meme_is_installed()) { MotifPeeker( peak_files = peaks, reference_index = 2, alignment_files = alignments, exp_labels = c("ChIP", "TIP"), exp_type = c("chipseq", "tipseq"), genome_build = "hg38", motif_files = motifs, motif_labels = NULL, cell_counts = NULL, motif_discovery = TRUE, motif_discovery_count = 2, motif_db = NULL, download_buttons = TRUE, out_dir = tempdir(), debug = FALSE, quiet = TRUE, verbose = FALSE ) }
read_motif_file()
reads a motif file and converts to a PWM. The
function supports multiple motif formats, including "homer", "jaspar",
"meme", "transfac" and "uniprobe".
read_motif_file( motif_file, motif_id = "Unknown", file_format = "auto", verbose = FALSE )
read_motif_file( motif_file, motif_id = "Unknown", file_format = "auto", verbose = FALSE )
motif_file |
Path to a motif file or a
|
motif_id |
ID of the motif (e.g. "MA1930.1"). |
file_format |
Character string specifying the format of the motif file. The options are "homer", "jaspar", "meme", "transfac" and "uniprobe" |
verbose |
A logical indicating whether to print messages. |
A universalmotif
motif object.
motif_file <- system.file("extdata", "motif_MA1930.2.jaspar", package = "MotifPeeker") res <- read_motif_file(motif_file = motif_file, motif_id = "MA1930.2", file_format = "jaspar") print(res)
motif_file <- system.file("extdata", "motif_MA1930.2.jaspar", package = "MotifPeeker") res <- read_motif_file(motif_file = motif_file, motif_id = "MA1930.2", file_format = "jaspar") print(res)
This function reads a MACS2/3 narrowPeak or SEACR BED peak file and returns a GRanges object with the peak coordinates and summit.
read_peak_file(peak_file, file_format = "auto", verbose = FALSE)
read_peak_file(peak_file, file_format = "auto", verbose = FALSE)
peak_file |
A character string with the path to the peak file, or a
GRanges object created using |
file_format |
A character string specifying the format of the peak file.
|
verbose |
A logical indicating whether to print messages. |
The summit column is the absolute genomic position of the peak, which is relative to the start position of the sequence range. For SEACR BED files, the summit column is calculated as the midpoint of the max signal region.
A GRanges-class object with the peak coordinates and summit.
GRanges-class for more information on GRanges objects.
macs3_peak_file <- system.file("extdata", "CTCF_ChIP_peaks.narrowPeak", package = "MotifPeeker") macs3_peak_read <- read_peak_file(macs3_peak_file) macs3_peak_read
macs3_peak_file <- system.file("extdata", "CTCF_ChIP_peaks.narrowPeak", package = "MotifPeeker") macs3_peak_read <- read_peak_file(macs3_peak_file) macs3_peak_read
This function saves a peak object to a file in BED4 format. The included
columns are: chr
, start
, end
, and name
. Since
no strand data is being included, it is recommended to use this function
only for peak objects that do not have strand information.
save_peak_file( peak_obj, save = TRUE, filename = random_string(10), out_dir = tempdir() )
save_peak_file( peak_obj, save = TRUE, filename = random_string(10), out_dir = tempdir() )
peak_obj |
A GRanges object with the peak coordinates. Must include
columns: |
save |
A logical indicating whether to save the peak object to a file. |
filename |
A character string of the file name. If the file extension
is not |
out_dir |
A character string of the output directory. (default = tempdir()) |
If save = FALSE
, a data frame with the peak coordinates. If
save = TRUE
, the path to the saved file.
data("CTCF_ChIP_peaks", package = "MotifPeeker") out <- save_peak_file(CTCF_ChIP_peaks, save = TRUE, "test_peak_file.bed") print(out)
data("CTCF_ChIP_peaks", package = "MotifPeeker") out <- save_peak_file(CTCF_ChIP_peaks, save = TRUE, "test_peak_file.bed") print(out)
This function takes two sets of sequences and segregates them into common and unique sequences. The common sequences are sequences that are present in both sets of sequences. The unique sequences are sequences that are present in only one of the sets of sequences.
segregate_seqs(seqs1, seqs2)
segregate_seqs(seqs1, seqs2)
seqs1 |
A set of sequences ( |
seqs2 |
A set of sequences ( |
Sequences are considered common if their base pairs align in any position, even if they vary in length. Consequently, while the number of common sequences remains consistent between both sets, but the length and composition of these sequences may differ. As a result, the function returns distinct sets of common sequences for each input set of sequences.
A list containing the common sequences and unique sequences for each
set of sequences. The list contains the following GRanges
objects:
common_seqs1
: Common sequences in seqs1
common_seqs2
: Common sequences in seqs2
unique_seqs1
: Unique sequences in seqs1
unique_seqs2
: Unique sequences in seqs2
data("CTCF_ChIP_peaks", package = "MotifPeeker") data("CTCF_TIP_peaks", package = "MotifPeeker") seqs1 <- CTCF_ChIP_peaks seqs2 <- CTCF_TIP_peaks res <- segregate_seqs(seqs1, seqs2) print(res)
data("CTCF_ChIP_peaks", package = "MotifPeeker") data("CTCF_TIP_peaks", package = "MotifPeeker") seqs1 <- CTCF_ChIP_peaks seqs2 <- CTCF_TIP_peaks res <- segregate_seqs(seqs1, seqs2) print(res)
summit_to_motif()
calculates the distance between each motif and its
nearest peak summit. runFimo
from the memes
package is used to
recover the locations of each motif.
summit_to_motif( peak_input, motif, fp_rate = 0.05, genome_build, out_dir = tempdir(), meme_path = NULL, verbose = FALSE, ... )
summit_to_motif( peak_input, motif, fp_rate = 0.05, genome_build, out_dir = tempdir(), meme_path = NULL, verbose = FALSE, ... )
peak_input |
Either a path to the narrowPeak file or a GRanges peak
object generated by |
motif |
An object of class |
fp_rate |
The desired false-positive rate. A p-value threshold will be selected based on this value. The default false-positive rate is 0.05. |
genome_build |
The genome build that the peak sequences should be derived from. |
out_dir |
Location to save the 0-order background file. By default, the background file will be written to a temporary directory. |
meme_path |
path to "meme/bin/" (default: |
verbose |
A logical indicating whether to print verbose messages while running the function. (default = FALSE) |
... |
Arguments passed on to
|
To calculate the p-value threshold for a desired false-positive rate, we use the approximate formula:
(Dervied from FIMO documentation)
A list containing an expanded GRanges peak object with metadata columns relating to motif positions along with a vector of summit-to-motif distances for each valid peak.
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") res <- summit_to_motif( peak_input = CTCF_TIP_peaks, motif = motif_MA1102.3, fp_rate = 5e-02, genome_build = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 ) print(res) }
if (memes::meme_is_installed()) { data("CTCF_TIP_peaks", package = "MotifPeeker") data("motif_MA1102.3", package = "MotifPeeker") res <- summit_to_motif( peak_input = CTCF_TIP_peaks, motif = motif_MA1102.3, fp_rate = 5e-02, genome_build = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 ) print(res) }