Title: | Perform key tests for binding motifs in sequence data |
---|---|
Description: | Taking a set of sequence motifs as PWMs, test a set of sequences for over-representation of these motifs, as well as any positional features within the set of motifs. Enrichment analysis can be undertaken using multiple statistical approaches. The package also contains core functions to prepare data for analysis, and to visualise results. |
Authors: | Stevie Pederson [aut, cre] |
Maintainer: | Stevie Pederson <[email protected]> |
License: | GPL-3 |
Version: | 1.3.3 |
Built: | 2025-01-20 06:16:14 UTC |
Source: | https://github.com/bioc/motifTestR |
A set of ChIP-Seq peaks where AR and ER were both detected
data("ar_er_peaks")
data("ar_er_peaks")
An object of class GRanges
of length 849.
The subset of peaks found on chr1 which contained signal from at least two of AR, ER and H3K27ac, taken from GSE123767. Peaks were resized to a uniform width of 400bp after downloading
Generation of these ranges is documented in
system.file("scripts/ar_er_peaks.R", package = "motifTestR")
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123767
data("ar_er_peaks") ar_er_peaks
data("ar_er_peaks") ar_er_peaks
The genomic sequences obtained from the ar_er_peaks
data("ar_er_seq")
data("ar_er_seq")
An object of class DNAStringSet
of length 849.
These sequences represent the sequences obtained from
BSgenome.Hsapiens.UCSC.hg19 for thw peaks supplied as ar_er_peaks
Generation of these sequences is documented in
system.file("scripts/ar_er_peaks.R", package = "motifTestR")
data("ar_er_seq") ar_er_seq
data("ar_er_seq") ar_er_seq
Cluster related motifs for testing as a group
clusterMotifs( motifs, type = c("PPM", "ICM"), method = c("PCC", "EUCL", "SW", "KL", "ALLR", "BHAT", "HELL", "SEUCL", "MAN", "ALLR_LL", "WEUCL", "WPCC"), power = 1, agglom = "complete", thresh = 0.2, return_d = FALSE, plot = FALSE, labels = FALSE, cex = 1, linecol = "red", ... )
clusterMotifs( motifs, type = c("PPM", "ICM"), method = c("PCC", "EUCL", "SW", "KL", "ALLR", "BHAT", "HELL", "SEUCL", "MAN", "ALLR_LL", "WEUCL", "WPCC"), power = 1, agglom = "complete", thresh = 0.2, return_d = FALSE, plot = FALSE, labels = FALSE, cex = 1, linecol = "red", ... )
motifs |
A list of universalmotifs or a list of PWMs |
type |
Can be ICM or PPM |
method |
The method to be used for determining similarity/distances |
power |
Raise correlation matrices to this power before converting to a distance matrix. Only applied if method is either "PCC" or "WPCC" |
agglom |
Method to be used for agglomeration by hclust |
thresh |
Tree heights below which motifs are formed into a cluster |
return_d |
logical(1) Return the distance matrices for each cluster |
plot |
Show tree produced by hclust. If requested the value set by thresh will be shown as a horizontal line |
labels , cex
|
Passed to plot.hclust |
linecol |
Passed to abline as the argument |
... |
passed to compare_motifs |
This builds on compare_motifs, enabling the assignment of each PWM to a cluster, and subsequent testing of motifs as a cluster, rather than returning individual results.
Internally all matrices are converted to distance matrices and hclust is used to form clusters. By default, options such as "EUCL", "MAN" produce distances, whilst similarity matrices are produced when choosing "PCC" and other correlation based metrics. In these cases, the distance matrix is obtained by taking 1 - similarity.
By default PWM labels are hidden (labels = FALSE), however these can be shown using labels = NULL as explained in plot.hclust.
Raising the threshold will lead to fewer, larger clusters whilst leaving this value low will return a more conservative approach, with more smaller clusters. The final decision as the best clustering strategy is highly subjective and left to the user. Manual inspection of motifs within a cluster can be performed using view_motifs, as shown in the vignette.
Named vector with numeric values representing which cluster each motif has been assigned to.
If setting return_d = TRUE, a named list will be returned with the clusters
as the element cl
and a list with distance matrices for each cluster as the
element d
# Load the example motifs data("ex_pfm") # Return a vector with each motif assigned a cluster # The default uses Pearson's Correlation Coefficient clusterMotifs(ex_pfm) # Preview the settings noting that showing labels can clutter the plot # with large numbers of motifs. The defaults for Euclidean distance # show the threshold may need raising clusterMotifs(ex_pfm, plot = TRUE, labels = NULL, method = "EUCL")
# Load the example motifs data("ex_pfm") # Return a vector with each motif assigned a cluster # The default uses Pearson's Correlation Coefficient clusterMotifs(ex_pfm) # Preview the settings noting that showing labels can clutter the plot # with large numbers of motifs. The defaults for Euclidean distance # show the threshold may need raising clusterMotifs(ex_pfm, plot = TRUE, labels = NULL, method = "EUCL")
Count the matches to a PWM within an XStringSet
countPwmMatches( pwm, stringset, rc = TRUE, min_score = "80%", mc.cores = 1, ... )
countPwmMatches( pwm, stringset, rc = TRUE, min_score = "80%", mc.cores = 1, ... )
pwm |
A Position Weight Matrix |
stringset |
An XStringSet |
rc |
logical(1) Also find matches using the reverse complement of pwm |
min_score |
The minimum score to return a match |
mc.cores |
Passed to mclapply when analysing a list of PWMs |
... |
Passed to countPWM |
Will simply count the matches within an XStringSet and return an integer. All matches are included.
An integer vector
## Load the example PWM data("ex_pfm") esr1 <- ex_pfm$ESR1 ## Load the example Peaks data("ar_er_seq") countPwmMatches(esr1, ar_er_seq) ## Count all PWMs countPwmMatches(ex_pfm, ar_er_seq)
## Load the example PWM data("ex_pfm") esr1 <- ex_pfm$ESR1 ## Load the example Peaks data("ar_er_seq") countPwmMatches(esr1, ar_er_seq) ## Count all PWMs countPwmMatches(ex_pfm, ar_er_seq)
Example Position Frequency Matrices
data("ex_pfm")
data("ex_pfm")
An object of class list
of length 5.
This object contains 5 PFMs taken from HOCOMOCOv11-coreA for examples and testing
Generation of this motif list is documented in
system.file("scripts/ex_pfm.R", package = "motifTestR")
data("ex_pfm") ex_pfm$ESR1
data("ex_pfm") ex_pfm$ESR1
Find matches from a PWM cluster within a set of sequences
getClusterMatches( cl, stringset, rc = TRUE, min_score = "80%", best_only = FALSE, break_ties = c("all", "random", "first", "last", "central"), mc.cores = 1, ... ) countClusterMatches( cl, stringset, rc = TRUE, min_score = "80%", mc.cores = 1, ... )
getClusterMatches( cl, stringset, rc = TRUE, min_score = "80%", best_only = FALSE, break_ties = c("all", "random", "first", "last", "central"), mc.cores = 1, ... ) countClusterMatches( cl, stringset, rc = TRUE, min_score = "80%", mc.cores = 1, ... )
cl |
A list of Position Weight Matrices, universalmotifs, with each element representing clusters of related matrices |
stringset |
An XStringSet |
rc |
logical(1) Also find matches using the reverse complement of PWMs in the cluster |
min_score |
The minimum score to return a match |
best_only |
logical(1) Only return the best match |
break_ties |
Method for breaking ties when only returning the best match Ignored when all matches are returned (the default) |
mc.cores |
Passed to mclapply |
... |
Passed to matchPWM |
This function extends getPwmMatches by returning a single set of results for set of clustered motifs. This can help remove some of the redundancy in results returned for highly similar PWMs, such as those in the GATA3 family.
Taking a set of sequences as an XStringSet, find all matches above the
supplied score (i.e. threshold) for a list of Position Weight Matrices
(PWMs), which have been clustered together as highly-related motifs.
By default, matches are performed using the PWMs as provided and the reverse
complement, however this can easily be disabled by setting rc = FALSE
.
The function relies heavily on matchPWM and Views for speed.
Where overlapping matches are found for the PWMs within a cluster, only a single match is returned. The motif with the highest relative score (score / maxScore(PWM)) is selected.
When choosing to return the best match (best_only = TRUE
), only the match
with the highest relative score is returned for each sequence.
Should there be tied scores, the best match can be chosen as either the first,
last, most central, all tied matches, or choosing one at random (the default).
Output from getClusterMatches will be a list of DataFrames with columns:
seq
, score
, direction
, start
, end
, from_centre
, seq_width
,
motif
and match
The first three columns describe the sequence with matches, the score of
the match and whether the match was found using the forward or reverse PWM.
The columns start
, end
and width
describe the where the match was found
in the sequence, whilst from_centre
defines the distance between the centre
of the match and the centre of the sequence being queried.
The motif column denotes which individual motif was found to match in this
position, again noting that when matches overlap, only the one with the
highest relative score is returned.
The final column contains the matching fragment of the sequence as an
XStringSet
.
Output from countClusterMatches will be a simple integer vector the same length as the number of clusters
# Load example PFMs data("ex_pfm") # Cluster using default settings cl_ids <- clusterMotifs(ex_pfm) ex_cl <- split(ex_pfm, cl_ids) # Add optional names names(ex_cl) <- vapply(ex_cl, \(x) paste(names(x), collapse = "/"), character(1)) # Load example sequences data("ar_er_seq") # Get all matches for each cluster getClusterMatches(ex_cl, ar_er_seq) # Or Just count them countClusterMatches(ex_cl, ar_er_seq) # Compare this to individual counts countPwmMatches(ex_pfm, ar_er_seq)
# Load example PFMs data("ex_pfm") # Cluster using default settings cl_ids <- clusterMotifs(ex_pfm) ex_cl <- split(ex_pfm, cl_ids) # Add optional names names(ex_cl) <- vapply(ex_cl, \(x) paste(names(x), collapse = "/"), character(1)) # Load example sequences data("ar_er_seq") # Get all matches for each cluster getClusterMatches(ex_cl, ar_er_seq) # Or Just count them countClusterMatches(ex_cl, ar_er_seq) # Compare this to individual counts countPwmMatches(ex_pfm, ar_er_seq)
Find all PWM matches within a set of sequences
getPwmMatches( pwm, stringset, rc = TRUE, min_score = "80%", best_only = FALSE, break_ties = c("all", "random", "first", "last", "central"), mc.cores = 1, ... )
getPwmMatches( pwm, stringset, rc = TRUE, min_score = "80%", best_only = FALSE, break_ties = c("all", "random", "first", "last", "central"), mc.cores = 1, ... )
pwm |
A Position Weight Matrix, list of PWMs or universalmotif list |
stringset |
An XStringSet |
rc |
logical(1) Also find matches using the reverse complement of pwm |
min_score |
The minimum score to return a match |
best_only |
logical(1) Only return the best match |
break_ties |
Method for breaking ties when only returning the best match Ignored when all matches are returned (the default) |
mc.cores |
Passed to mclapply if passing multiple PWMs |
... |
Passed to matchPWM |
Taking a set of sequences as an XStringSet, find all matches above the
supplied score (i.e. threshold) for a single Position Weight Matrix (PWM),
generally representing a transcription factor binding motif.
By default, matches are performed using the PWM as provided and the reverse
complement, however this can easily be disabled by setting rc = FALSE
.
The function relies heavily on matchPWM and Views for speed.
When choosing to return the best match (best_only = TRUE
), only the match
with the highest score is returned for each sequence.
Should there be tied scores, the best match can be chosen as either the first,
last, most central, all tied matches, or choosing one at random (the default).
A DataFrame with columns: seq
, score
, direction
, start
,
end
, from_centre
, seq_width
, and match
The first three columns describe the sequence with matches, the score of
the match and whether the match was found using the forward or reverse PWM.
The columns start
, end
and width
describe the where the match was found
in the sequence, whilst from_centre
defines the distance between the centre
of the match and the centre of the sequence being queried.
The final column contains the matching fragment of the sequence as an
XStringSet
.
When passing a list of PWMs, a list of the above DataFrames will be returned.
## Load the example PWM data("ex_pfm") esr1 <- ex_pfm$ESR1 ## Load the example Peaks data("ar_er_seq") ## Return all matches getPwmMatches(esr1, ar_er_seq) ## Just the best match getPwmMatches(esr1, ar_er_seq, best_only = TRUE) ## Apply multiple PWMs as a list getPwmMatches(ex_pfm, ar_er_seq, best_only = TRUE)
## Load the example PWM data("ex_pfm") esr1 <- ex_pfm$ESR1 ## Load the example Peaks data("ar_er_seq") ## Return all matches getPwmMatches(esr1, ar_er_seq) ## Just the best match getPwmMatches(esr1, ar_er_seq, best_only = TRUE) ## Apply multiple PWMs as a list getPwmMatches(ex_pfm, ar_er_seq, best_only = TRUE)
A GRanges object with regions annotated as telomeres or centromeres
data("hg19_mask")
data("hg19_mask")
An object of class GRanges
of length 345.
The regions defined as centromeres or telomeres in hg19, taken from AnnotationHub objects "AH107360" and "AH107361". These were combined with regions containing Ns from the UCSC 2bit file, and regions with Ns in the BSgenome.Hsapiens.UCSC.hg19 were retained.
Generation of these ranges is documented in
system.file("scripts/hg19_mask.R", package = "motifTestR")
The package AnnotationHub and https://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.masked.gz
data("hg19_mask") hg19_mask
data("hg19_mask") hg19_mask
Form a set of ranges from y which (near) exactly match those in x for use as a background set requiring matching
makeRMRanges(x, y, ...) ## S4 method for signature 'GRanges,GRanges' makeRMRanges( x, y, exclude = GRanges(), n_iter = 1, n_total = NULL, replace = TRUE, ..., force_ol = TRUE ) ## S4 method for signature 'GRangesList,GRangesList' makeRMRanges( x, y, exclude = GRanges(), n_iter = 1, n_total = NULL, replace = TRUE, mc.cores = 1, ..., force_ol = TRUE, unlist = TRUE )
makeRMRanges(x, y, ...) ## S4 method for signature 'GRanges,GRanges' makeRMRanges( x, y, exclude = GRanges(), n_iter = 1, n_total = NULL, replace = TRUE, ..., force_ol = TRUE ) ## S4 method for signature 'GRangesList,GRangesList' makeRMRanges( x, y, exclude = GRanges(), n_iter = 1, n_total = NULL, replace = TRUE, mc.cores = 1, ..., force_ol = TRUE, unlist = TRUE )
x |
GRanges/GRangesList with ranges to be matched |
y |
GRanges/GRangesList with ranges to select random matching ranges from |
... |
Not used |
exclude |
GRanges of ranges to omit from testing |
n_iter |
The number of times to repeat the random selection process |
n_total |
Setting this value will over-ride anything set using n_iter. Can be vector of any length, corresponding to the length of x, when x is a GRangesList |
replace |
logical(1) Sample with our without replacement when creating the set of random ranges. |
force_ol |
logical(1) Enforce an overlap between every site in x and y |
mc.cores |
Passsed to mclapply |
unlist |
logical(1) Return as a sorted GRanges object, or leave as a GRangesList |
This function uses the width distribution of the 'test' ranges (i.e. x
) to
randomly sample a set of ranges with matching width from the ranges provided
in y
. The width distribution will clearly be exact when a set of
fixed-width ranges is passed to x
, whilst random sampling may yield some
variability when matching ranges of variable width.
When both x and y are GRanges objects, they are implicitly assumed to both
represent similar ranges, such as those overlapping a promoter or enhancer.
When passing two GRangesList objects, both objects are expected to contain
ranges annotated as belonging to key features, such that the list elements in
y must encompass all elements in x.
For example if x
contains two elements named 'promoter' and 'intron', y
should also contain elements named 'promoter' and 'intron' and these will
be sampled as matching ranges for the same element in x
. If elements of
x
and y
are not named, they are assumed to be in matching order.
The default behaviour is to assume that randomly-generated ranges are for
iteration, and as such, ranges are randomly formed in multiples of the number
of 'test' ranges provided in x
. The column iteration
will be added to the
returned ranges.
Placing any number into the n_total
argument will instead select a total
number of ranges as specified here. In this case, no iteration
column will
be included in the returned ranges.
Sampling is assumed to be with replacement as this is most suitable for
bootstrapping and related procedures, although this can be disabled by
setting replace = FALSE
A GRanges or GRangesList object
## Load the example peaks data("ar_er_peaks") sq <- seqinfo(ar_er_peaks) ## Now sample size-matched ranges for two iterations from chr1 makeRMRanges(ar_er_peaks, GRanges(sq)[1], n_iter = 2) ## Or simply sample 100 ranges if not planning any iterative analyses makeRMRanges(ar_er_peaks, GRanges(sq)[1], n_total = 100)
## Load the example peaks data("ar_er_peaks") sq <- seqinfo(ar_er_peaks) ## Now sample size-matched ranges for two iterations from chr1 makeRMRanges(ar_er_peaks, GRanges(sq)[1], n_iter = 2) ## Or simply sample 100 ranges if not planning any iterative analyses makeRMRanges(ar_er_peaks, GRanges(sq)[1], n_total = 100)
Plot the distribution of motif matches across sequences
plotMatchPos( matches, binwidth = 10, abs = FALSE, use_totals = FALSE, type = c("density", "cdf", "heatmap"), geom = c("smooth", "line", "point", "col"), cluster = FALSE, w = 0.1, heat_fill = NULL, ... )
plotMatchPos( matches, binwidth = 10, abs = FALSE, use_totals = FALSE, type = c("density", "cdf", "heatmap"), geom = c("smooth", "line", "point", "col"), cluster = FALSE, w = 0.1, heat_fill = NULL, ... )
matches |
Output from getPwmMatches |
binwidth |
Width of bins to use when plotting |
abs |
logical(1) Plot absolute distances from centre |
use_totals |
logical(1). If TRUE, plots will use total counts. The default (FALSE) plots probabilities. |
type |
Plot match density, the CDF or a binned heatmap |
geom |
Type of geom to be used for line plots. Ignored for heatmaps |
cluster |
logical(1) Cluster motifs when drawing a heatmap. If TRUE a dendrogram will be added to the LHS of the plot |
w |
Relative width of the dendrogram on (0, 1) |
heat_fill |
scale_fill_continuous object for heatmaps. If not provided, scale_fill_viridis_c() will be added to the heatmap. |
... |
Passed to individual geom* functions |
Multiple options are provided for showing the distribution of PWM matches within a set of sequences, using either the smoothed probability density, the probability CDF or as a heatmap. Distances can be shown as symmetrical around the centre or using absolute distances from the central position within the sequences.
Heatmaps are only enabled for comparisons across multiple PWMs, with optional clustering enabled. If adding a dendrogram for clustering, the returned plot object will be a patchwork object.
A ggplot2 object
## Load the example PWM data("ex_pfm") esr1 <- ex_pfm$ESR1 ## Load the example sequences from the peaks data("ar_er_seq") ## Just the best match bm <- getPwmMatches(esr1, ar_er_seq, best_only = TRUE) plotMatchPos(bm, se = FALSE) ## Matches can also be shown by distance from centre plotMatchPos(bm, abs = TRUE) ## Cumulative Probability plots are also implemented plotMatchPos(bm, type = "cdf", geom = "line", colour = "red") + geom_abline(intercept = 0.5, slope = 1/ 400)
## Load the example PWM data("ex_pfm") esr1 <- ex_pfm$ESR1 ## Load the example sequences from the peaks data("ar_er_seq") ## Just the best match bm <- getPwmMatches(esr1, ar_er_seq, best_only = TRUE) plotMatchPos(bm, se = FALSE) ## Matches can also be shown by distance from centre plotMatchPos(bm, abs = TRUE) ## Cumulative Probability plots are also implemented plotMatchPos(bm, type = "cdf", geom = "line", colour = "red") + geom_abline(intercept = 0.5, slope = 1/ 400)
Simulate a set of fixed-width sequences using optional TFBMs
simSeq( n, width, pfm = NULL, nt = c("A", "C", "G", "T"), prob = rep(0.25, 4), shape1 = 1, shape2 = 1, as = "DNAStringSet", ... )
simSeq( n, width, pfm = NULL, nt = c("A", "C", "G", "T"), prob = rep(0.25, 4), shape1 = 1, shape2 = 1, as = "DNAStringSet", ... )
n |
The number of sequences to simulate |
width |
Width of sequences to simulate |
pfm |
Probability Weight/Frequency Matrix |
nt |
Nucleotides to include |
prob |
Sampling probablities for each nucleotide |
shape1 , shape2
|
Passed to rbetabinom.ab |
as |
ObjectClass to return objects as. Defaults to DNAStringSet, but other viable options may include 'character', 'CharacterList' or any other class from which a character vector may be coerced. |
... |
Not used |
Using the nucleotide and probabilities provided as set of sequences can be simulated. By default, this will effectively be a set of 'background' sequences, with letters effectively chosen at random.
If a PWM/PFM is supplied, the shape parameters are first passed to rbetabinom.ab to determine the random positions the motif will be placed, with the default parameters representing a discrete uniform distribution. Once positions for the TFBM have been selected, nucleotides will be randomly sampled using the probabilities provided in the PWM and these motifs will be placed at the randomly sample positions
By default a DNAStringSet will be returned. If possible, the position of any randomly sampled motifs will be included in the mcols element of the returned object.
## Randomly generate 10x50nt sequences without any TFBMs present simSeq(10, 50) ## Now place a motif at random positions data('ex_pfm') sim_seq <- simSeq(10, width = 20, pfm = ex_pfm$ESR1) sim_seq ## The position of the motif within each sequence is included in the mcols mcols(sim_seq) ## Use this to extract the random motifs from the random sequences i <- mcols(sim_seq)$pos + cumsum(width(sim_seq)) - width(sim_seq) Views(unlist(sim_seq), start = i, width = 10)
## Randomly generate 10x50nt sequences without any TFBMs present simSeq(10, 50) ## Now place a motif at random positions data('ex_pfm') sim_seq <- simSeq(10, width = 20, pfm = ex_pfm$ESR1) sim_seq ## The position of the motif within each sequence is included in the mcols mcols(sim_seq) ## Use this to extract the random motifs from the random sequences i <- mcols(sim_seq)$pos + cumsum(width(sim_seq)) - width(sim_seq) Views(unlist(sim_seq), start = i, width = 10)
Test for enrichment of any motif within a cluster across a set of sequences using a background set to derive a NULL hypothesis
testClusterEnrich( cl, stringset, bg, var = "iteration", model = c("quasipoisson", "hypergeometric", "poisson", "iteration"), sort_by = c("p", "none"), mc.cores = 1, ... )
testClusterEnrich( cl, stringset, bg, var = "iteration", model = c("quasipoisson", "hypergeometric", "poisson", "iteration"), sort_by = c("p", "none"), mc.cores = 1, ... )
cl |
A list of Position Weight Matrices, universalmotifs, with each element representing clusters of related matrices |
stringset |
An XStringSet with equal sequence widths |
bg |
An XStringSet with the same sequence widths as the test XStringset |
var |
A column in the mcols element of bg, usually denoting an iteration number |
model |
The model used for analysis |
sort_by |
Column to sort results by |
mc.cores |
Passed to mclapply |
... |
Passed to getPwmMatches or countPwmMatches |
This extends the analytic methods offered by testMotifEnrich using PWMs grouped into a set of clusters. As with all cluster-level approaches, hits from multiple PWMs which overlap are counted as a single hit ensuring that duplicated matches are not double-counted, and that only individual positions within the sequences are.
See testMotifEnrich
makeRMRanges()
, getClusterMatches()
, countClusterMatches()
, testMotifEnrich()
## Load the example peaks & the sequences data("ar_er_peaks") data("ar_er_seq") sq <- seqinfo(ar_er_peaks) ## Now sample size-matched ranges 10 times larger. In real-world analyses, ## this set should be sampled as at least 1000x larger, ensuring features ## are matched to your requirements. This example masks regions with known N ## content, including centromeres & telomeres data("hg19_mask") set.seed(305) bg_ranges <- makeRMRanges( ar_er_peaks, GRanges(sq)[1], exclude = hg19_mask, n_iter = 10 ) ## Convert ranges to DNAStringSets library(BSgenome.Hsapiens.UCSC.hg19) genome <- BSgenome.Hsapiens.UCSC.hg19 bg_seq <- getSeq(genome, bg_ranges) ## Test for enrichment of clustered motifs data("ex_pfm") cl <- list(A = ex_pfm[1], B = ex_pfm[2:3]) testClusterEnrich(cl, ar_er_seq, bg_seq, model = "poisson")
## Load the example peaks & the sequences data("ar_er_peaks") data("ar_er_seq") sq <- seqinfo(ar_er_peaks) ## Now sample size-matched ranges 10 times larger. In real-world analyses, ## this set should be sampled as at least 1000x larger, ensuring features ## are matched to your requirements. This example masks regions with known N ## content, including centromeres & telomeres data("hg19_mask") set.seed(305) bg_ranges <- makeRMRanges( ar_er_peaks, GRanges(sq)[1], exclude = hg19_mask, n_iter = 10 ) ## Convert ranges to DNAStringSets library(BSgenome.Hsapiens.UCSC.hg19) genome <- BSgenome.Hsapiens.UCSC.hg19 bg_seq <- getSeq(genome, bg_ranges) ## Test for enrichment of clustered motifs data("ex_pfm") cl <- list(A = ex_pfm[1], B = ex_pfm[2:3]) testClusterEnrich(cl, ar_er_seq, bg_seq, model = "poisson")
Test positional bias for all motifs within a given cluster
testClusterPos( x, stringset, binwidth = 10, abs = FALSE, rc = TRUE, min_score = "80%", break_ties = "all", alt = c("greater", "less", "two.sided"), sort_by = c("p", "none"), mc.cores = 1, ... )
testClusterPos( x, stringset, binwidth = 10, abs = FALSE, rc = TRUE, min_score = "80%", break_ties = "all", alt = c("greater", "less", "two.sided"), sort_by = c("p", "none"), mc.cores = 1, ... )
x |
A Position Weight Matrix, universalmotif object or list thereof.
Alternatively can be a single DataFrame or list of DataFrames as returned
by getClusterMatches with |
stringset |
An XStringSet. Not required if matches are supplied as x |
binwidth |
Width of bins across the range to group data into |
abs |
Use absolute positions around zero to find symmetrical enrichment |
rc |
logical(1) Also find matches using the reverse complement of each PWM |
min_score |
The minimum score to return a match |
break_ties |
Choose how to resolve matches with tied scores |
alt |
Alternative hypothesis for the binomial test |
sort_by |
Column to sort results by |
mc.cores |
Passed to mclapply |
... |
Passed to matchPWM |
This is a reimplementation of testMotifPos for sets of motifs which have been clustered for similarity. The positions test the bias of any motifs within the cluster given that overlapping matches are only counted once, and with the match retained being the one with the highest relative score.
It should also be noted that some motif clusters will contain PWMs of varying length. When finding positional bias, the widest motif is taken as the width for all, and any matches from narrower motifs outside of the range allowed by wider motifs are discarded. This reduction in signal will make a small difference in the outer bins, but is not considered to be problematic for the larger analysis.
A data.frame with columns start
, end
, centre
, width
, total_matches
,
matches_in_region
, expected
, enrichment
, prop_total
, p
and consensus_motif
The total matches represent the total number of matches within the set of
sequences, whilst the number observed in the final region are also given,
along with the proportion of the total this represents.
Enrichment is simply the ratio of observed to expected based on the
expectation of the null hypothesis
The consensus motif across all matches is returned as a Position Frequency Matrix (PFM) using consensusMatrix.
## Load the example PWM data("ex_pfm") ## Load the example sequences data("ar_er_seq") ## Cluster the motifs cl <- list(A = ex_pfm[1], B = ex_pfm[2:3]) ## Get the best match and use this data matches <- getClusterMatches(cl, ar_er_seq, best_only = TRUE) ## Test for enrichment in any position testClusterPos(matches) ## Or just pass the clustered matrices ## Here we've set abs = TRUE to test absolute distance from the centre testClusterPos(cl, ar_er_seq, abs = TRUE, binwidth = 10)
## Load the example PWM data("ex_pfm") ## Load the example sequences data("ar_er_seq") ## Cluster the motifs cl <- list(A = ex_pfm[1], B = ex_pfm[2:3]) ## Get the best match and use this data matches <- getClusterMatches(cl, ar_er_seq, best_only = TRUE) ## Test for enrichment in any position testClusterPos(matches) ## Or just pass the clustered matrices ## Here we've set abs = TRUE to test absolute distance from the centre testClusterPos(cl, ar_er_seq, abs = TRUE, binwidth = 10)
Test for motif enrichment within a set of sequences using a background set to derive a NULL hypothesis
testMotifEnrich( pwm, stringset, bg, var = "iteration", model = c("quasipoisson", "hypergeometric", "poisson", "iteration"), sort_by = c("p", "none"), mc.cores = 1, ... )
testMotifEnrich( pwm, stringset, bg, var = "iteration", model = c("quasipoisson", "hypergeometric", "poisson", "iteration"), sort_by = c("p", "none"), mc.cores = 1, ... )
pwm |
A Position Weight Matrix or list of PWMs |
stringset |
An XStringSet with equal sequence widths |
bg |
An XStringSet with the same sequence widths as the test XStringset |
var |
A column in the mcols element of bg, usually denoting an iteration number |
model |
The model used for analysis |
sort_by |
Column to sort results by |
mc.cores |
Passed to mclapply |
... |
Passed to getPwmMatches or countPwmMatches |
This function offers four alternative models for assessing the enrichment of a motif within a set of sequences, in comparison to a background set of sequences. Selection of the BG sequences plays an important role and, in conjunction with the question being addressed, determines the most appropriate model to use for testing, as described below.
It should also be noted that the larger the BG set of sequences, the larger the computational burden, and results can take far longer to return. For many millions of background sequences, this may run beyond an hour
Hypergeometric tests are best suited to the use case where the test set of sequences represents a subset of a larger set, with a specific feature or behaviour, whilst the BG set may be the remainder of the set without that feature. For example, the test set may represent ChIP-Seq binding sites where signal changes in response to treatment, whilst the BG set represents the sites where no changed signal was observed. Testing is one-sided, for enrichment of motifs within the test set.
Due to these relatively smaller sized datasets, setting model = "hypergeometric", will generally return results quickly
This approach requires a set of background sequences which should be much larger than the test set of sequences. The parameters for a Poisson model are estimated in a per-sequence manner on the set of BG sequences, and the observed rate of motif-matches within the test set is then tested using poisson.test.
This approach assumes that all matches follow a Poisson distribution, which is often true, but data can also be over-dispersed. Given that this model can also return results relatively quickly, is it primarily suitable for data exploration, such as quickly checking for expected behaviours, but not for final results.
The quasipoisson model allows for over-dispersion and will return more conservative results than using the standard Poisson model. Under the method currently implemented here, BG sequences should be divided into blocks (i.e. iterations), identical in size to the test set of sequences. Model parameters are estimated per iteration across the BG set of sequences, with the rate of matches in the test set being compared against these blocks. This ensures more conservative results that if analysing test and bg sequences as collections of individual sequences.
It is expected that the BG set will matched for the features of interest and chosen using makeRMRanges with a large number of iterations, e.g. n_iter = 1000. Due to this parameterisation, quasipoisson approaches can be computationally time-consuming, as this is effectively an iterative approach.
Setting the model as "iteration" performs a non-parametric analysis, with the exception of returning Z-scores under the Central Limit Theorem. Mean and SD of matches is found for each iteration, and used to return Z scores, with p-values returned from both a Z-test and from comparing observed values directly to sampled values obtained from the BG sequences. Sampled values are calculated directly and as such, are limited in precision.
As for the QuasiPoisson model, a very large number of iterations is expected
to be used, to ensure the CLT holds, again making this a computationally
demanding test.
Each iteration/block is expected to be identically-sized to the test set,
and matched for any features as appropriate using makeRMRanges()
.
A data.frame with columns: sequences
, matches
, expected
, enrichment
,
and p
, with additional columns Z
, est_bg_rate
(Poisson), odds_ratio
(Hypergeometric) or Z
, sd_bg
, n_iter
and iter_p
(Iterations).
The numbers of sequences and matches refer to the test set of sequences,
whilst expected is the expected number of matches under the Poisson or
iterative null distribution. The ratio of matches to expected is given as
enrichment
, along with the Z score and p-value. Whilst the Z score is only
representative under the Poisson model, it is used to directly estimate
p-values under the iterative approach.
Under this latter approach, the sd of the matches found in the background
sequences is also given, along with the number of iterations and the p-values
from permutations testing the one-sided hypothesis hypothesis for enrichment.
It may also be worth noting that if producing background sequences using
makeRMRanges with replace = TRUE
and force_ol = TRUE
, the
iterative model corresponds to a bootstrap, given that the test sequences
will overlap the background sequences and background ranges are able to be
sampled with replacement.
makeRMRanges()
, getPwmMatches()
, countPwmMatches()
## Load the example peaks & the sequences data("ar_er_peaks") data("ar_er_seq") sq <- seqinfo(ar_er_peaks) ## Now sample size-matched ranges 10 times larger. In real-world analyses, ## this set should be sampled as at least 1000x larger, ensuring features ## are matched to your requirements. This example masks regions with known N ## content, including centromeres & telomeres data("hg19_mask") set.seed(305) bg_ranges <- makeRMRanges( ar_er_peaks, GRanges(sq)[1], exclude = hg19_mask, n_iter = 10 ) ## Convert ranges to DNAStringSets library(BSgenome.Hsapiens.UCSC.hg19) genome <- BSgenome.Hsapiens.UCSC.hg19 bg_seq <- getSeq(genome, bg_ranges) ## Test for enrichment of the ESR1 motif data("ex_pfm") esr1 <- ex_pfm$ESR1 testMotifEnrich(esr1, ar_er_seq, bg_seq, model = "poisson") ## Test all motifs testMotifEnrich(ex_pfm, ar_er_seq, bg_seq, model = "poisson")
## Load the example peaks & the sequences data("ar_er_peaks") data("ar_er_seq") sq <- seqinfo(ar_er_peaks) ## Now sample size-matched ranges 10 times larger. In real-world analyses, ## this set should be sampled as at least 1000x larger, ensuring features ## are matched to your requirements. This example masks regions with known N ## content, including centromeres & telomeres data("hg19_mask") set.seed(305) bg_ranges <- makeRMRanges( ar_er_peaks, GRanges(sq)[1], exclude = hg19_mask, n_iter = 10 ) ## Convert ranges to DNAStringSets library(BSgenome.Hsapiens.UCSC.hg19) genome <- BSgenome.Hsapiens.UCSC.hg19 bg_seq <- getSeq(genome, bg_ranges) ## Test for enrichment of the ESR1 motif data("ex_pfm") esr1 <- ex_pfm$ESR1 testMotifEnrich(esr1, ar_er_seq, bg_seq, model = "poisson") ## Test all motifs testMotifEnrich(ex_pfm, ar_er_seq, bg_seq, model = "poisson")
Test for a Uniform Distribution across a set of best matches
testMotifPos( x, stringset, binwidth = 10, abs = FALSE, rc = TRUE, min_score = "80%", break_ties = "all", alt = c("greater", "less", "two.sided"), sort_by = c("p", "none"), mc.cores = 1, ... )
testMotifPos( x, stringset, binwidth = 10, abs = FALSE, rc = TRUE, min_score = "80%", break_ties = "all", alt = c("greater", "less", "two.sided"), sort_by = c("p", "none"), mc.cores = 1, ... )
x |
A Position Weight Matrix, universalmotif object or list thereof.
Alternatively can be a single DataFrame or list of DataFrames as returned
by getPwmMatches with |
stringset |
An XStringSet. Not required if matches are supplied as x |
binwidth |
Width of bins across the range to group data into |
abs |
Use absolute positions around zero to find symmetrical enrichment |
rc |
logical(1) Also find matches using the reverse complement of each PWM |
min_score |
The minimum score to return a match |
break_ties |
Choose how to resolve matches with tied scores |
alt |
Alternative hypothesis for the binomial test |
sort_by |
Column to sort results by |
mc.cores |
Passed to mclapply |
... |
Passed to matchPWM |
This function tests for an even positional spread of motif matches across a set of sequences, using the assumption (i.e. H~0~) that if there is no positional bias, matches will be evenly distributed across all positions within a set of sequences. Conversely, if there is positional bias, typically but not necessarily near the centre of a range, this function intends to detect this signal, as a rejection of the null hypothesis.
Input can be provided as the output from getPwmMatches setting
best_only = TRUE
if these matches have already been identified.
If choosing to provide this object to the argument matches
, nothing is required
for the arguments pwm
, stringset
, rc
, min_score
or break_ties
Otherwise, a Position Weight Matrix (PWM) and an XStringSet
are required,
along with the relevant arguments, with best matches identified within the
function.
The set of best matches are then grouped into bins along the range, with the
central bin containing zero, and tallied.
Setting abs
to TRUE
will set all positions from the centre as
absolute values, returning counts purely as bins with distances from zero,
marking this as an inclusive lower bound.
Motif alignments are assigned into bins based on the central position of the
match, as provided in the column from_centre
when calling
getPwmMatches.
The binom.test is performed on each bin using the alternative hypothesis, with the returned p-values across all bins combined using the Harmonic Mean p-value (HMP) (See p.hmp). All bins with raw p-values below the HMP are identified and the returned values for start, end, centre, width, matches in region, expected and enrichment are across this set of bins. The expectation is that where a positional bias is evident, this will be a narrow range containing a non-trivial proportion of the total matches.
It should also be noted that binom.test()
can return p-values of zero, as
beyond machine precision. In these instances, zero p-values are excluded from
calculation of the HMP. This will give a very slight conservative bias, and
assumes that for these extreme cases, neighbouring bins are highly likely to
also return extremely low p-values and no significance will be lost.
A data.frame with columns start
, end
, centre
, width
, total_matches
,
matches_in_region
, expected
, enrichment
, prop_total
, p
and consensus_motif
The total matches represent the total number of matches within the set of
sequences, whilst the number observed in the final region are also given,
along with the proportion of the total this represents.
Enrichment is simply the ratio of observed to expected based on the
expectation of the null hypothesis
The consensus motif across all matches is returned as a Position Frequency Matrix (PFM) using consensusMatrix.
## Load the example PWM data("ex_pfm") esr1 <- ex_pfm$ESR1 ## Load the example sequences data("ar_er_seq") ## Get the best match and use this data matches <- getPwmMatches(esr1, ar_er_seq, best_only = TRUE) ## Test for enrichment in any position testMotifPos(matches) ## Provide a list of PWMs, testing for distance from zero testMotifPos(ex_pfm, ar_er_seq, abs = TRUE, binwidth = 10)
## Load the example PWM data("ex_pfm") esr1 <- ex_pfm$ESR1 ## Load the example sequences data("ar_er_seq") ## Get the best match and use this data matches <- getPwmMatches(esr1, ar_er_seq, best_only = TRUE) ## Test for enrichment in any position testMotifPos(matches) ## Provide a list of PWMs, testing for distance from zero testMotifPos(ex_pfm, ar_er_seq, abs = TRUE, binwidth = 10)
The chr1 subset of candidate enhancers for ZR-75-1 cells
data("zr75_enh")
data("zr75_enh")
An object of class GRanges
of length 5237.
These enhancers are the chr1 subset of enhancer regions for ZR-75-1 cells as identified by EnhancerAtlas 2.0
#' Generation of these ranges is documented in
system.file("scripts/zr75_enh.R", package = "motifTestR")
http://www.enhanceratlas.org/index.php
data("zr75_enh") zr75_enh
data("zr75_enh") zr75_enh