Title: | PWM enrichment analysis |
---|---|
Description: | A toolkit of high-level functions for DNA motif scanning and enrichment analysis built upon Biostrings. The main functionality is PWM enrichment analysis of already known PWMs (e.g. from databases such as MotifDb), but the package also implements high-level functions for PWM scanning and visualisation. The package does not perform "de novo" motif discovery, but is instead focused on using motifs that are either experimentally derived or computationally constructed by other tools. |
Authors: | Robert Stojnic, Diego Diez |
Maintainer: | Diego Diez <[email protected]> |
License: | LGPL (>= 2) |
Version: | 4.43.0 |
Built: | 2024-12-30 03:22:15 UTC |
Source: | https://github.com/bioc/PWMEnrich |
Normalizes the motifs input argument for multiple functions
.inputParamMotifs(motifs)
.inputParamMotifs(motifs)
motifs |
a list of motifs either as frequency matrices (PFM) or as PWM objects. If PFMs are specified they are converted to PWMs using uniform background. |
Normalize the sequences input argument
.inputParamSequences(sequences)
.inputParamSequences(sequences)
sequences |
a set of sequences to be scanned, a list of DNAString or other scannable objects |
Check the frequency matrix input parameter for motifSimilarity
.inputPFMfromMatrixOrPWM(m)
.inputPFMfromMatrixOrPWM(m)
m |
either a PWM object or a matrix |
corresponding PFM
check consistency of bg.seq input parameter
.normalize.bg.seq(bg.seq)
.normalize.bg.seq(bg.seq)
bg.seq |
a set of background sequences, either a list of DNAString object or DNAStringSet object |
This function is from Biostrings package. A Position Frequency Matrix (PFM) is also represented as an ordinary matrix. Unlike a PWM, it must be of type integer (it will typically be the result of consensusMatrix()).
.normargPfm(x)
.normargPfm(x)
x |
a frequency matrix |
This function is from Biostrings package
.normargPriorParams(prior.params)
.normargPriorParams(prior.params)
prior.params |
Typical 'prior.params' vector: c(A=0.25, C=0.25, G=0.25, T=0.25) |
Get the background for a subset of PWMs
## S4 method for signature 'PWMCutoffBackground' x[i, j, ..., drop = TRUE]
## S4 method for signature 'PWMCutoffBackground' x[i, j, ..., drop = TRUE]
x |
the PWMCutoffBackground object |
i |
the indicies of PWMs |
j |
unused |
... |
unused |
drop |
unused |
Get the background for a subset of PWMs
## S4 method for signature 'PWMEmpiricalBackground' x[i, j, ..., drop = TRUE]
## S4 method for signature 'PWMEmpiricalBackground' x[i, j, ..., drop = TRUE]
x |
the PWMEmpiricalBackground object |
i |
the indicies of PWMs |
j |
unused |
... |
unused |
drop |
unused |
Get the background for a subset of PWMs
## S4 method for signature 'PWMGEVBackground' x[i, j, ..., drop = TRUE]
## S4 method for signature 'PWMGEVBackground' x[i, j, ..., drop = TRUE]
x |
the PWMGEVBackground object |
i |
the indicies of PWMs |
j |
unused |
... |
unused |
drop |
unused |
Get the background for a subset of PWMs
## S4 method for signature 'PWMLognBackground' x[i, j, ..., drop = TRUE]
## S4 method for signature 'PWMLognBackground' x[i, j, ..., drop = TRUE]
x |
the PWMLognBackground object |
i |
the indicies of PWMs |
j |
unused |
... |
unused |
drop |
unused |
Calculate total affinity over a set of sequences
affinitySequenceSet(scores, seq.len, pwm.len)
affinitySequenceSet(scores, seq.len, pwm.len)
scores |
affinity scores for individual sequences |
seq.len |
lengths of sequences |
pwm.len |
lengths of PWMs |
Convert a MotifEnrichmentReport into a data.frame object
## S4 method for signature 'MotifEnrichmentReport' as.data.frame(x, row.names = NULL, optional = FALSE, ...)
## S4 method for signature 'MotifEnrichmentReport' as.data.frame(x, row.names = NULL, optional = FALSE, ...)
x |
the MotifEnrichmentReport object |
row.names |
unused |
optional |
unused |
... |
unused |
This function only take one background sequence as input, it also just calculates the P-value so it is more efficient.
cloverPvalue1seq( scores, seq.len, pwm.len, bg.fwd, bg.rev, B = 1000, verbose = TRUE, clover = NULL )
cloverPvalue1seq( scores, seq.len, pwm.len, bg.fwd, bg.rev, B = 1000, verbose = TRUE, clover = NULL )
scores |
the affinity scores for individual sequences |
seq.len |
lengths of sequences |
pwm.len |
lengths of PWMs |
bg.fwd |
the raw score of forward strand |
bg.rev |
the raw scores of reverse strand |
B |
the number of random replicates |
verbose |
if to give verbose progress reports |
clover |
the clover scores if already calculated |
P-value
Calculate the Clover score using the recursive formula from Frith et al
cloverScore(scores, lr3 = FALSE, verbose = FALSE)
cloverScore(scores, lr3 = FALSE, verbose = FALSE)
scores |
a matrix of average odds scores, where columns are motifs, and rows sequences |
lr3 |
if to return a matrix of LR3 scores, where columns correpond to motifs, and rows to subset sizes |
verbose |
if to produce verbose output of progress |
the LR4 score, which is the mean of LR3 scores over subset sizes
Calculate medians of columns
colMedians(x)
colMedians(x)
x |
a matrix |
Calculate standard deviations of columns
colSds(x)
colSds(x)
x |
a matrix |
Concatenata DNA sequences into a single character object
concatenateSequences(sequences)
concatenateSequences(sequences)
sequences |
either a list of DNAString objects, or a DNAStringSet |
a single character string
The Z-score is calculated separately for each sequence
cutoffZscore(scores, seq.len, pwm.len, bg.P)
cutoffZscore(scores, seq.len, pwm.len, bg.P)
scores |
the hit counts for the sequences |
seq.len |
the length distribution of sequences |
pwm.len |
the length distribution of the PWMs |
bg.P |
background probabilities of observing a motif hit at nucleotide resolution (scaled to sequence length, not 2 * length) |
Z-score
The Z-score is calculated as if the sequence came for one very long sequence
cutoffZscoreSequenceSet(scores, seq.len, pwm.len, bg.P)
cutoffZscoreSequenceSet(scores, seq.len, pwm.len, bg.P)
scores |
the hit counts for the sequences |
seq.len |
the length distribution of sequences |
pwm.len |
the length distribution of the PWMs |
bg.P |
background probabilities of observing a motif hit at nucleotide resolution |
Z-score
Divide each row of a matrix with a vector
divideRows(m, v)
divideRows(m, v)
m |
matrix to be divided |
v |
the vector to use for division |
as.list doesn't seem to always work for DNAStringSets, so implementing this ourselves.
DNAStringSetToList(x)
DNAStringSetToList(x)
x |
an object of class DNAStringSet |
This is the new backend function for empirical P-values for either affinity or cutoff. The function only works on single sequences.
empiricalPvalue( scores, seq.len, pwm.len, bg.fwd, bg.rev, cutoff = NULL, B = 10000, verbose = FALSE, exact.length = FALSE )
empiricalPvalue( scores, seq.len, pwm.len, bg.fwd, bg.rev, cutoff = NULL, B = 10000, verbose = FALSE, exact.length = FALSE )
scores |
the scores obtained for the sequence |
seq.len |
the length of the sequence, if a single value will take a single sequence of given length. If a vector of values, will take sequences of given lengths and joint them together |
pwm.len |
the lengths of PWMs |
bg.fwd |
raw odds scores for the forward strand of background |
bg.rev |
raw odds scores for the reverse strand of background |
cutoff |
if not NULL, will use hit count above this cutoff. The cutoff should be specified in log2. |
B |
the number of random replicates |
verbose |
if to give verbose progress reports |
exact.length |
if to take into consideration that the actual sequence lengths differ for different PWMs. For very long sequences (i.e. seq.len >> pwm.len) this make very little difference, however the run time with exact.length is much longer. |
Calculate empirical P-value for a set of sequences, using either affinity or cutoff. When cutoff is used, the score is a number of motif hits above a certain log-odds cutoff.
empiricalPvalueSequenceSet( scores, seq.len, pwm.len, bg.fwd, bg.rev, cutoff = NULL, B = 10000, verbose = FALSE )
empiricalPvalueSequenceSet( scores, seq.len, pwm.len, bg.fwd, bg.rev, cutoff = NULL, B = 10000, verbose = FALSE )
scores |
a matrix of scores, rows for sequences, columns for PWMs |
seq.len |
the lengths of sequences |
pwm.len |
the lengths of PWMs |
bg.fwd |
raw odds scores for the forward strand of background |
bg.rev |
raw odds scores for the reverse strand of background |
cutoff |
if not NULL, will use hit count above this cutoff. The cutoff should be specified in log2. |
B |
the number of random replicates |
verbose |
if to give verbose progress reports |
Estimate the background frequencies of A,C,G,T on a set of promoters from an organism
getBackgroundFrequencies(organism = "dm3", pseudo.count = 1, quick = FALSE)
getBackgroundFrequencies(organism = "dm3", pseudo.count = 1, quick = FALSE)
organism |
either a name of the organisms for which the background should be compiled
(supported names are "dm3", "mm9" and "hg19"), a |
pseudo.count |
the number to which the frequencies sum up to, by default 1 |
quick |
if to preform fitting on a reduced set of 100 promoters. This will not give as good results but is much quicker than fitting to all the promoters (~10k). Usage of this parameter is recommended only for testing and rough estimates. |
Robert Stojnic, Diego Diez
## Not run: getBackgroundFrequencies("dm3") ## End(Not run)
## Not run: getBackgroundFrequencies("dm3") ## End(Not run)
Get the promoter sequences either for a named organism such as "dm3" or a BSgenome object
getPromoters(organismOrGenome)
getPromoters(organismOrGenome)
organismOrGenome |
either organism name, e.g. "dm3", or BSgenome object |
a list of: promoters - DNAStringSet of (unique) promoters; organism - name of species; version - genome version
Apply GEV background normalization per every sequence
gevPerSequence(scores, seq.len, pwm.len, bg.loc, bg.scale, bg.shape)
gevPerSequence(scores, seq.len, pwm.len, bg.loc, bg.scale, bg.shape)
scores |
affinity scores for the PWMs, can contain scores for more than one sequence (as rows), P-values are extracted separately |
seq.len |
the length distribution of the sequences |
pwm.len |
the lengths of PWMs |
bg.loc |
list of linear regression for location parameter |
bg.scale |
list of linear regression for scale parameter |
bg.shape |
list of linear regression for shape parameter |
Generate a motif enrichment report for the whole group of sequences together
## S4 method for signature 'MotifEnrichmentResults' groupReport(obj, top = 0.05, bg = TRUE, by.top.motifs = FALSE, ...)
## S4 method for signature 'MotifEnrichmentResults' groupReport(obj, top = 0.05, bg = TRUE, by.top.motifs = FALSE, ...)
obj |
a MotifEnrichmentResults object |
top |
what proportion of top motifs should be examined in each individual sequence (by default 0.05, i.e. 5%) |
bg |
if to use background corrected P-values to do the ranking (if available) |
by.top.motifs |
if to rank by the proportion of sequences where the motif is within 'top' percentage of motifs |
... |
unused |
a MotifEnrichmentReport object containing a table with the following columns:
'rank' - The rank of the PWM's enrichment in the whole group of sequences together
'target' - The name of the PWM's target gene, transcript or protein complex.
'id' - The unique identifier of the PWM (if set during PWM creation).
'raw.score' - The raw score before P-value calculation
'p.value' - The P-value of motif enrichment (if available)
'top.motif.prop' - The proportion (between 0 and 1) of sequences where the motif is within top
proportion of enrichment motifs.
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # produce a report for all sequences taken together r.default = groupReport(res) # produce a report where the last column takes top 1% motifs r = groupReport(res, top=0.01) # view the results r # plot the top 10 most enriched motifs plot(r[1:10]) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # produce a report for all sequences taken together r.default = groupReport(res) # produce a report where the last column takes top 1% motifs r = groupReport(res, top=0.01) # view the results r # plot the top 10 most enriched motifs plot(r[1:10]) }
Replace all infinite values by 0
keepFinite(x)
keepFinite(x)
x |
a vector of values |
Calculate the P-value from lognormal distribution with background of equal length
logNormPval(scores, seq.len, pwm.len, bg.mean, bg.sd, bg.len, log = FALSE)
logNormPval(scores, seq.len, pwm.len, bg.mean, bg.sd, bg.len, log = FALSE)
scores |
affinity scores for the PWMs, can contain scores for more than one sequence (as rows), P-values are extracted separately |
seq.len |
the length distribution of the sequences |
pwm.len |
the leggths of PWMs |
bg.mean |
the mean values from the background for PWMs |
bg.sd |
the sd values from the background |
bg.len |
the length distribution of the background (we currently support only constant length) |
log |
if to produce log p-values |
Lognormal P-value for a set of sequences
logNormPvalSequenceSet(scores, seq.len, pwm.len, bg.mean, bg.sd, bg.len)
logNormPvalSequenceSet(scores, seq.len, pwm.len, bg.mean, bg.sd, bg.len)
scores |
a matrix of per-sequence affinity scores |
seq.len |
lengths of sequences |
pwm.len |
lengths of pwms |
bg.mean |
mean background at length of bg.len |
bg.sd |
standard deviation of background at length of bg.len |
bg.len |
the length for which mean and sd are calculated |
P-value
This is a convenience front-end function to compile new backgrounds for a set of PFMs. Currently only supports D. melanogaster, but in the future should support other common organisms as well.
makeBackground( motifs, organism = "dm3", type = "logn", quick = FALSE, bg.seq = NULL, ... )
makeBackground( motifs, organism = "dm3", type = "logn", quick = FALSE, bg.seq = NULL, ... )
motifs |
a list of position frequency matrices (4xL matrices) |
organism |
either a name of the organisms for which the background should be compiled
(currently supported names are "dm3", "mm9" and "hg19"), or a |
type |
the type of background to be compiled. Possible types are:
|
quick |
if to preform fitting on a reduced set of 100 promoters. This will not give as good results but is much quicker than fitting to all the promoters (~10k). Usage of this parameter is recommended only for testing and rough estimates. |
bg.seq |
a set of background sequences to use. This parameter overrides the "organism" and "quick" parameters. |
... |
other named parameters that backend function makePWM***Background functions take. |
Robert Stojnic, Diego Diez
# load in the two example de-novo motifs motifs = readMotifs(system.file(package = "PWMEnrich", dir = "extdata", file = "example.transfac"), remove.acc = TRUE) ## Not run: # construct lognormal background bg.logn = makeBackground(motifs, organism="dm3", type="logn") # alternatively, any BSgenome object can also be used if(requireNamespace("BSgenome.Dmelanogaster.UCSC.dm3")) bg.logn = makeBackground(motifs, organism=Dmelanogaster, type="logn") # construct a Z-score of hits with P-value background bg.pval = makeBackground(motifs, organism="dm3", type="pval", p.value=1e-3) # now we can use them to scan for enrichment in sequences (in this case there is a consensus # Tin binding site). motifEnrichment(DNAString("TGCATCAAGTGTGTAGTG"), bg.logn) motifEnrichment(DNAString("TGCATCAAGTGTGTAGTG"), bg.pval) ## End(Not run)
# load in the two example de-novo motifs motifs = readMotifs(system.file(package = "PWMEnrich", dir = "extdata", file = "example.transfac"), remove.acc = TRUE) ## Not run: # construct lognormal background bg.logn = makeBackground(motifs, organism="dm3", type="logn") # alternatively, any BSgenome object can also be used if(requireNamespace("BSgenome.Dmelanogaster.UCSC.dm3")) bg.logn = makeBackground(motifs, organism=Dmelanogaster, type="logn") # construct a Z-score of hits with P-value background bg.pval = makeBackground(motifs, organism="dm3", type="pval", p.value=1e-3) # now we can use them to scan for enrichment in sequences (in this case there is a consensus # Tin binding site). motifEnrichment(DNAString("TGCATCAAGTGTGTAGTG"), bg.logn) motifEnrichment(DNAString("TGCATCAAGTGTGTAGTG"), bg.pval) ## End(Not run)
These priors serve both as background nucleotide frequencies and pseudo-counts for PWMs.
makePriors(bg.seq, bg.pseudo.count)
makePriors(bg.seq, bg.pseudo.count)
bg.seq |
a set of background sequences |
bg.pseudo.count |
the total pseudocount shared between nucleotides |
# some example sequences sequences = list(DNAString("AAAGAGAGTGACCGATGAC"), DNAString("ACGATGAGGATGAC")) # make priors with pseudo-count of 1 shared between them makePriors(sequences, 1)
# some example sequences sequences = list(DNAString("AAAGAGAGTGACCGATGAC"), DNAString("ACGATGAGGATGAC")) # make priors with pseudo-count of 1 shared between them makePriors(sequences, 1)
Make a background based on number of motifs hits above a certain threshold.
makePWMCutoffBackground( bg.seq, motifs, cutoff = log2(exp(4)), bg.pseudo.count = 1, bg.source = "", verbose = TRUE )
makePWMCutoffBackground( bg.seq, motifs, cutoff = log2(exp(4)), bg.pseudo.count = 1, bg.source = "", verbose = TRUE )
bg.seq |
a set of background sequences, either a list of DNAString object or DNAStringSet object |
motifs |
a set of motifs, either a list of frequency matrices, or a list of PWM objects. If frequency matrices are given, the background distribution is fitted from bg.seq. |
cutoff |
the cutoff at which the background should be made, i.e. at which a motif hit is called significant |
bg.pseudo.count |
the pseudo count which is shared between nucleotides when frequency matrices are given |
bg.source |
a free-form textual description of how the background was generated |
verbose |
if to produce verbose output |
## Not run: if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # make background for MotifDb motifs using 2Kb promoters of all D. melanogaster transcripts # using a cutoff of 5 if(requireNamespace("BSgenome.Dmelanogaster.UCSC.dm3")) makePWMCutoffBackground(Dmelanogaster$upstream2000, MotifDb.Dmel.PFM, cutoff=log2(exp(5))) } ## End(Not run)
## Not run: if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # make background for MotifDb motifs using 2Kb promoters of all D. melanogaster transcripts # using a cutoff of 5 if(requireNamespace("BSgenome.Dmelanogaster.UCSC.dm3")) makePWMCutoffBackground(Dmelanogaster$upstream2000, MotifDb.Dmel.PFM, cutoff=log2(exp(5))) } ## End(Not run)
Make a background appropriate for empirical P-value calculation. The provided set of background sequences is contcatenated into a single long sequence which is then scanned with the motifs and raw scores are saved. This object can be very large.
makePWMEmpiricalBackground( bg.seq, motifs, bg.pseudo.count = 1, bg.source = "", verbose = TRUE, ... )
makePWMEmpiricalBackground( bg.seq, motifs, bg.pseudo.count = 1, bg.source = "", verbose = TRUE, ... )
bg.seq |
a set of background sequences, either a list of DNAString object or DNAStringSet object |
motifs |
a set of motifs, either a list of frequency matrices, or a list of PWM objects. If frequency matrices are given, the background distribution is fitted from bg.seq. |
bg.pseudo.count |
the pseudo count which is shared between nucleotides when frequency matrices are given |
bg.source |
a free-form textual description of how the background was generated |
verbose |
if to produce verbose output |
... |
currently unused (this is for convenience for makeBackground function) |
For reliable P-value calculation the size of the background set needs to be at least seq.len / min.P.value. For instance, to get P-values at a resolution of 0.001 for a single sequence of 500bp, we would need a background of at least 500/0.001 = 50kb. This ensures that we can make 1000 independent 500bp samples from this background to properly estimate the P-value. For a group of sequences, we would take seq.len to be the total length of all sequences in a group.
## Not run: if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # make empirical background by saving raw scores for each bp in the sequence. This can be # very large in memory! if(requireNamespace("BSgenome.Dmelanogaster.UCSC.dm3")) makePWMEmpiricalBackground(Dmelanogaster$upstream2000[1:100], MotifDb.Dmel.PFM) } ## End(Not run)
## Not run: if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # make empirical background by saving raw scores for each bp in the sequence. This can be # very large in memory! if(requireNamespace("BSgenome.Dmelanogaster.UCSC.dm3")) makePWMEmpiricalBackground(Dmelanogaster$upstream2000[1:100], MotifDb.Dmel.PFM) } ## End(Not run)
Construct a lognormal background distribution for a set of sequences. Sequences concatenated are binned in 'bg.len' chunks and lognormal distribution fitted to them.
makePWMGEVBackground( bg.seq, motifs, bg.pseudo.count = 1, bg.len = seq(200, 2000, 200), bg.source = "", verbose = TRUE, fit.log = TRUE )
makePWMGEVBackground( bg.seq, motifs, bg.pseudo.count = 1, bg.len = seq(200, 2000, 200), bg.source = "", verbose = TRUE, fit.log = TRUE )
bg.seq |
a set of background sequences, either a list of DNAString object or DNAStringSet object |
motifs |
a set of motifs, either a list of frequency matrices, or a list of PWM objects. If frequency matrices are given, the background distribution is fitted from bg.seq. |
bg.pseudo.count |
the pseudo count which is shared between nucleotides when frequency matrices are given |
bg.len |
the length range of background chunks |
bg.source |
a free-form textual description of how the background was generated |
verbose |
if to produce verbose output |
fit.log |
if to fit log odds (instead of odds) |
## Not run: if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # make background for MotifDb motifs using 2kb promoters of all D. melanogaster transcripts if(requireNamespace("BSgenome.Dmelanogaster.UCSC.dm3")) makePWMGEVBackground(Dmelanogaster$upstream2000, MotifDb.Dmel.PFM) } ## End(Not run)
## Not run: if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # make background for MotifDb motifs using 2kb promoters of all D. melanogaster transcripts if(requireNamespace("BSgenome.Dmelanogaster.UCSC.dm3")) makePWMGEVBackground(Dmelanogaster$upstream2000, MotifDb.Dmel.PFM) } ## End(Not run)
Construct a lognormal background distribution for a set of sequences. Sequences concatenated are binned in 'bg.len' chunks and lognormal distribution fitted to them.
makePWMLognBackground( bg.seq, motifs, bg.pseudo.count = 1, bg.len = 250, bg.len.sizes = 2^(0:4), bg.source = "", verbose = TRUE, algorithm = "default" )
makePWMLognBackground( bg.seq, motifs, bg.pseudo.count = 1, bg.len = 250, bg.len.sizes = 2^(0:4), bg.source = "", verbose = TRUE, algorithm = "default" )
bg.seq |
a set of background sequences, either a list of DNAString object or DNAStringSet object |
motifs |
a set of motifs, either a list of frequency matrices, or a list of PWM objects. If frequency matrices are given, the background distribution is fitted from bg.seq. |
bg.pseudo.count |
the pseudo count which is shared between nucleotides when frequency matrices are given |
bg.len |
background sequences will be split into tiles of this length (default: 250bp) |
bg.len.sizes |
background tiles will be joined into bigger tiles containing this much smaller tiles.
The default is |
bg.source |
a free-form textual description of how the background was generated |
verbose |
if to produce verbose output |
algorithm |
type of algorithm to use, valid values are: "default" and "human". |
## Not run: if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # make background for MotifDb motifs using 2kb promoters of all D. melanogaster transcripts if(requireNamespace("BSgenome.Dmelanogaster.UCSC.dm3")) makePWMLognBackground(Dmelanogaster$upstream2000, MotifDb.Dmel.PFM) } ## End(Not run)
## Not run: if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # make background for MotifDb motifs using 2kb promoters of all D. melanogaster transcripts if(requireNamespace("BSgenome.Dmelanogaster.UCSC.dm3")) makePWMLognBackground(Dmelanogaster$upstream2000, MotifDb.Dmel.PFM) } ## End(Not run)
This function takes already calculated empirical background distribution and chooses cutoff for each motif based on P-value cutoff for individual sites.
makePWMPvalCutoffBackground(bg.p, p.value = 0.001, bg.source = "")
makePWMPvalCutoffBackground(bg.p, p.value = 0.001, bg.source = "")
bg.p |
an object of class PWMEmpiricalBackground |
p.value |
the P-value used to find cuttoffs for each of the motifs |
bg.source |
textual description of background source |
an object of type PWMCutoffBackground
## Not run: if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # make empirical background - here we use only 100 sequences for illustrative purposes if(requireNamespace("BSgenome.Dmelanogaster.UCSC.dm3")) bg.p = makePWMEmpiricalBackground(Dmelanogaster$upstream2000[1:100], MotifDb.Dmel.PFM) # use the empirical background to pick a threshold and make cutoff background makePWMPvalCutoffBackground(bg.p, 0.001) } ## End(Not run)
## Not run: if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # make empirical background - here we use only 100 sequences for illustrative purposes if(requireNamespace("BSgenome.Dmelanogaster.UCSC.dm3")) bg.p = makePWMEmpiricalBackground(Dmelanogaster$upstream2000[1:100], MotifDb.Dmel.PFM) # use the empirical background to pick a threshold and make cutoff background makePWMPvalCutoffBackground(bg.p, 0.001) } ## End(Not run)
This function creates a P-value cutoff background for motif enrichment.
makePWMPvalCutoffBackgroundFromSeq( bg.seq, motifs, p.value = 0.001, bg.pseudo.count = 1, bg.source = "", verbose = TRUE )
makePWMPvalCutoffBackgroundFromSeq( bg.seq, motifs, p.value = 0.001, bg.pseudo.count = 1, bg.source = "", verbose = TRUE )
bg.seq |
a set of background sequences, either a list of DNAString object or DNAStringSet object |
motifs |
a set of motifs, either a list of frequency matrices, or a list of PWM objects. If frequency matrices are given, the background distribution is fitted from bg.seq. |
p.value |
the P-value used to find cuttoffs for each of the motifs |
bg.pseudo.count |
the pseudo count which is shared between nucleotides when frequency matrices are given |
bg.source |
textual description of background source |
verbose |
if to print verbose output |
an object of type PWMCutoffBackground
## Not run: if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # use the empirical background to pick a threshold and make cutoff background makePWMPvalCutoffBackground(Dmelanogaster$upstream2000, 0.001) } ## End(Not run)
## Not run: if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # use the empirical background to pick a threshold and make cutoff background makePWMPvalCutoffBackground(Dmelanogaster$upstream2000, 0.001) } ## End(Not run)
Divide total.len into fragments of length len by providing start,end positions
makeStartEndPos(total.len, len)
makeStartEndPos(total.len, len)
total.len |
total available length to be subdivided |
len |
size of the individual chunk |
a data.frame containing paired up start,end positions
All PWMs are shuffled at the same time. This function would be too slow to produce empirical P-values, thus we return a z-score from a small number of shuffles.
matrixShuffleZscorePerSequence(scores, sequences, pwms, cutoff = NULL, B = 30)
matrixShuffleZscorePerSequence(scores, sequences, pwms, cutoff = NULL, B = 30)
scores |
a set of already calculated scores |
sequences |
either one sequence or a list/set of sequences (objects of type DNAString or DNAStringSet) |
pwms |
a list of PWMs |
cutoff |
if NULL, will use affinity, otherwise will use number of hits over this log2 odds cutoff |
B |
number of replicates, i.e. PWM column shuffles |
The z-scores are calculated for each sequence individually.
This function takes the offset of first motif relative to second and chops off the end of both motifs that are not aligned. It returns a list containing only the columns that align.
maxAligned(m1, m2, offset)
maxAligned(m1, m2, offset)
m1 |
frequency matrix of first motif |
m2 |
frequency matrix of second motif |
offset |
a number of nucleotides by which the first motif is offsetted compared to the second |
a list of column-trimmed motifs m1, m2
Test for differential enrichment between two groups of sequences
motifDiffEnrichment( sequences1, sequences2, pwms, score = "autodetect", bg = "autodetect", cutoff = log2(exp(4)), verbose = TRUE, res1 = NULL, res2 = NULL )
motifDiffEnrichment( sequences1, sequences2, pwms, score = "autodetect", bg = "autodetect", cutoff = log2(exp(4)), verbose = TRUE, res1 = NULL, res2 = NULL )
sequences1 |
First set of sequences. Can be either a single sequence (an object of class DNAString), or a list of DNAString objects, or a DNAStringSet object. |
sequences2 |
Second set of sequences. Can be either a single sequence (an object of class DNAString), or a list of DNAString objects, or a DNAStringSet object. |
pwms |
this parameter can take multiple values depending on the scoring scheme and background correction used.
When the
|
score |
this parameter determines which scoring scheme to use. Following scheme as available:
|
bg |
this parameter determines which background correction to use, if any.
|
cutoff |
the score cutoff for a significant motif hit if scoring scheme "cutoff" is selected. |
verbose |
if to produce verbose output |
res1 |
the output of |
res2 |
the output of |
This function calls motifEnrichment
on two groups of sequences and calculates
the difference statistics when possible.
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ # load the background file for drosophila and lognormal correction data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # get the differential enrichment diff = motifDiffEnrichment(DNAString("TGCATCAAGTGTGTAGTGTGAGATTAGT"), DNAString("TGAACGAGTAGGACGATGAGAGATTGATG"), PWMLogn.dm3.MotifDb.Dmel, verbose=FALSE) # motifs differentially enriched in the first sequence (with lognormal background correction) head(sort(diff$group.bg, decreasing=TRUE)) # motifs differentially enriched in the second sequence (with lognormal background correction) head(sort(diff$group.bg)) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ # load the background file for drosophila and lognormal correction data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # get the differential enrichment diff = motifDiffEnrichment(DNAString("TGCATCAAGTGTGTAGTGTGAGATTAGT"), DNAString("TGAACGAGTAGGACGATGAGAGATTGATG"), PWMLogn.dm3.MotifDb.Dmel, verbose=FALSE) # motifs differentially enriched in the first sequence (with lognormal background correction) head(sort(diff$group.bg, decreasing=TRUE)) # motifs differentially enriched in the second sequence (with lognormal background correction) head(sort(diff$group.bg)) }
Calculate the empirical distribution score distribution for a set of motifs
motifEcdf( motifs, organism = NULL, bg.seq = NULL, quick = FALSE, pseudo.count = 1 )
motifEcdf( motifs, organism = NULL, bg.seq = NULL, quick = FALSE, pseudo.count = 1 )
motifs |
a set of motifs, either a list of frequency matrices, or a list of PWM objects. If frequency matrices are given, the background distribution is fitted from bg.seq. |
organism |
either a name of the organisms for which the background should be compiled
(supported names are "dm3", "mm9" and "hg19"), or a |
bg.seq |
a set of background sequence (either this or organism needs to be specified!). Can be a DNAString or DNAStringSet object. |
quick |
if to do the fitting only on a small subset of the data (only in combination with |
pseudo.count |
the pseudo count which is shared between nucleotides when frequency matrices are given |
a list of ecdf
objects (see help page for ecdf
for usage).
Calculate motif enrichment using one of available scoring algorithms and background corrections.
motifEnrichment( sequences, pwms, score = "autodetect", bg = "autodetect", cutoff = NULL, verbose = TRUE, motif.shuffles = 30, B = 1000, group.only = FALSE )
motifEnrichment( sequences, pwms, score = "autodetect", bg = "autodetect", cutoff = NULL, verbose = TRUE, motif.shuffles = 30, B = 1000, group.only = FALSE )
sequences |
the sequences to be scanned for enrichment. Can be either a single sequence (an object of class DNAString), or a list of DNAString objects, or a DNAStringSet object. |
pwms |
this parameter can take multiple values depending on the scoring scheme and background correction used.
When the
|
score |
this parameter determines which scoring scheme to use. Following scheme as available:
|
bg |
this parameter determines how the raw score is compared to the background distribution.
|
cutoff |
the score cutoff for a significant motif hit if scoring scheme "cutoff" is selected. |
verbose |
if to print verbose output |
motif.shuffles |
number of times to shuffle motifs if using "ms" background correction |
B |
number of replicates when calculating empirical P-value |
group.only |
if to return statistics only for the group of sequences, not individual sequences. In the case of empirical background the P-values for individual sequences are not calculated (thus saving time), for other backgrounds they are calculated but not returned. |
This function provides and interface to all algorithms available in PWMEnrich to find motif enrichment in a single or a group of sequences with/without background correction.
Since for all algorithms the first step involves calculating raw scores without background correction, the output always contains the scores without background correction together with (optional) background-corrected scores.
Unless otherwise specified the scores are returned both separately for each sequence (without/with background) and for the whole group of sequences (without/with background).
To use a background correction you need to supply a set of PWMs with precompiled background distribution parameters
(see function makeBackground
). When such an object is supplied as the pwm
parameter, the scoring
scheme and background correction are automatically determined.
There are additional packages with already pre-computed background (e.g. see package PWMEnrich.Dmelanogaster.background
).
Please refer to (Stojnic & Adryan, 2012) for more details on the algorithms.
a MotifEnrichmentResults object containing a subset following elements:
"score" - scoring scheme used
"bg" - background correction used
"params" - any additional parameters
"sequences" - the set of sequences used
"pwms" - the set of pwms used
"sequence.nobg" - per-sequence scores without any background correction. For "affinity" and "clover" a matrix of mean affinity scores; for "cutoff" number of significant hits above a cutoff
"sequence.bg" - per-sequence scores after background correction. For "logn" and "pval" the P-value (smaller is better); for "z" and "ms" background corrections the z-scores (bigger is better).
"group.nobg" - aggregate scores for the whole group of sequences without background correction. For "affinity" and "clover" the mean affinity over all sequences in the set; for "cutoff" the total number of hits in all sequences.
"group.bg" - aggregate scores for the whole group of sequences with background correction. For "logn" and "pval", the P-value for the whole group (smaller is better); for "z" and "ms" the z-score for the whole set (bigger is better).
"sequence.norm" - (only for "logn") the length-normalized scores for each of the sequences. Currently only implemented for "logn", where it returns the values normalized from LogN(0,1) distribution
"group.norm" - (only for "logn") similar to sequence.norm, but for the whole group of sequences
R. Stojnic & B. Adryan: Identification of functional DNA motifs using a binding affinity lognormal background distribution, submitted.
MC Frith et al: Detection of functional DNA motifs via statistical over-representation, Nucleid Acid Research (2004).
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAGATTGAAGTAGACCAGTC"), DNAString("AGGTAGATAGAACAGTAGGCAATGGGGGAAATTGAGAGTC")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # most enriched in both sequences (lognormal background P-value) head(motifRankingForGroup(res)) # most enriched in both sequences (raw affinity, no background) head(motifRankingForGroup(res, bg=FALSE)) # most enriched in the first sequence (lognormal background P-value) head(motifRankingForSequence(res, 1)) # most enriched in the first sequence (raw affinity, no background) head(motifRankingForSequence(res, 1, bg=FALSE)) ### # Load the pre-compiled background for hit-based motif counts with cutoff of P-value = 0.001 data(PWMPvalueCutoff1e3.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") res.count = motifEnrichment(sequences, PWMPvalueCutoff1e3.dm3.MotifDb.Dmel) # Enrichment in the whole group, z-score for the number of motif hits head(motifRankingForGroup(res)) # First sequence, sorted by number of motif hits with P-value < 0.001 head(motifRankingForSequence(res, 1, bg=FALSE)) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAGATTGAAGTAGACCAGTC"), DNAString("AGGTAGATAGAACAGTAGGCAATGGGGGAAATTGAGAGTC")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # most enriched in both sequences (lognormal background P-value) head(motifRankingForGroup(res)) # most enriched in both sequences (raw affinity, no background) head(motifRankingForGroup(res, bg=FALSE)) # most enriched in the first sequence (lognormal background P-value) head(motifRankingForSequence(res, 1)) # most enriched in the first sequence (raw affinity, no background) head(motifRankingForSequence(res, 1, bg=FALSE)) ### # Load the pre-compiled background for hit-based motif counts with cutoff of P-value = 0.001 data(PWMPvalueCutoff1e3.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") res.count = motifEnrichment(sequences, PWMPvalueCutoff1e3.dm3.MotifDb.Dmel) # Enrichment in the whole group, z-score for the number of motif hits head(motifRankingForGroup(res)) # First sequence, sorted by number of motif hits with P-value < 0.001 head(motifRankingForSequence(res, 1, bg=FALSE)) }
The columns stored in this object will depend on the type of the report (either for group of sequences, or individual sequences).
d
:a DataFrame object that contains the main tabular report data
pwms
:a list of PWM
objects corresponding to rows of d
Note that this is only a wrapper around a list which is the return value in PWMEnrich 1.3 and as such it provides the same interface as a list (for backward compatibility), with some additional methods.
res
:a list of old results with elements such as: sequence.bg, sequence.nobg, group.bg, group.nobg
Information content for a PWM or PFM
motifIC( motif, prior.params = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25), bycol = FALSE )
motifIC( motif, prior.params = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25), bycol = FALSE )
motif |
a matrix of frequencies, or a PWM object |
prior.params |
the prior parameters to use when a matrix is given (ignored if motif is already a PWM) |
bycol |
if to return values separately for each column |
information content in bits (i.e. log2)
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # the nucleotide distribution is taken from the PWM (in this case genomic background) motifIC(MotifDb.Dmel[["ttk"]]) # information content with default uniform background because the input is a matrix, # not PWM object motifIC(MotifDb.Dmel.PFM[["ttk"]]) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # the nucleotide distribution is taken from the PWM (in this case genomic background) motifIC(MotifDb.Dmel[["ttk"]]) # information content with default uniform background because the input is a matrix, # not PWM object motifIC(MotifDb.Dmel.PFM[["ttk"]]) }
Note that this function asssumes that smaller values are better!
motifPrAUC(seq.res)
motifPrAUC(seq.res)
seq.res |
a matrix where each column represents a PWM and each row a result for a different sequence. |
Get a ranking of motifs by their enrichment in the whole set of sequences
## S4 method for signature 'MotifEnrichmentResults' motifRankingForGroup( obj, bg = TRUE, id = FALSE, order = FALSE, rank = FALSE, unique = FALSE, ... )
## S4 method for signature 'MotifEnrichmentResults' motifRankingForGroup( obj, bg = TRUE, id = FALSE, order = FALSE, rank = FALSE, unique = FALSE, ... )
obj |
a MotifEnrichmentResults object |
bg |
if to use background corrected P-values to do the ranking (if available) |
id |
if to show PWM IDs instead of target TF names |
order |
if to output the ordering of PWMs instead of actual P-values or raw values |
rank |
if the output should be rank of a PWM instead of actual P-values or raw values |
unique |
if TRUE, only the best rank is taken for each TF (only when id = FALSE, order = FALSE) |
... |
currently unused |
a vector of P-values or raw enrichments sorted such that the first motif is most enriched
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # most enriched in both sequences (sorted by lognormal background P-value) head(motifRankingForGroup(res)) # Return a non-redundant set of TFs head(motifRankingForGroup(res, unique=TRUE)) # sorted by raw affinity instead of P-value head(motifRankingForGroup(res, bg=FALSE)) # show IDs instead of target TF names head(motifRankingForGroup(res, id=TRUE)) # output the rank instead of P-value head(motifRankingForGroup(res, rank=TRUE)) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # most enriched in both sequences (sorted by lognormal background P-value) head(motifRankingForGroup(res)) # Return a non-redundant set of TFs head(motifRankingForGroup(res, unique=TRUE)) # sorted by raw affinity instead of P-value head(motifRankingForGroup(res, bg=FALSE)) # show IDs instead of target TF names head(motifRankingForGroup(res, id=TRUE)) # output the rank instead of P-value head(motifRankingForGroup(res, rank=TRUE)) }
Get a ranking of motifs by their enrichment in one specific sequence
## S4 method for signature 'MotifEnrichmentResults' motifRankingForSequence( obj, seq.id, bg = TRUE, id = FALSE, order = FALSE, rank = FALSE, unique = FALSE, ... )
## S4 method for signature 'MotifEnrichmentResults' motifRankingForSequence( obj, seq.id, bg = TRUE, id = FALSE, order = FALSE, rank = FALSE, unique = FALSE, ... )
obj |
a MotifEnrichmentResults object |
seq.id |
either the sequence number or sequence name |
bg |
if to use background corrected P-values to do the ranking (if available) |
id |
if to show PWM IDs instead of target TF names |
order |
if to output the ordering of PWMs instead of actual P-values or raw values |
rank |
if the output should be rank of a PWM instead of actual P-values or raw values |
unique |
if TRUE, only the best rank is taken for each TF (only when id = FALSE, order = FALSE) |
... |
currently unused |
a vector of P-values or raw enrichments sorted such that the first motif is most enriched
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # most enriched in the second sequences (sorted by lognormal background P-value) head(motifRankingForSequence(res, 2)) # return unique TFs enriched in sequence 2 head(motifRankingForSequence(res, 2, unique=TRUE)) # sorted by raw affinity instead of P-value head(motifRankingForSequence(res, 2, bg=FALSE)) # show IDs instead of target TF names head(motifRankingForSequence(res, 2, id=TRUE)) # output the rank instead of P-value head(motifRankingForSequence(res, 2, rank=TRUE)) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # most enriched in the second sequences (sorted by lognormal background P-value) head(motifRankingForSequence(res, 2)) # return unique TFs enriched in sequence 2 head(motifRankingForSequence(res, 2, unique=TRUE)) # sorted by raw affinity instead of P-value head(motifRankingForSequence(res, 2, bg=FALSE)) # show IDs instead of target TF names head(motifRankingForSequence(res, 2, id=TRUE)) # output the rank instead of P-value head(motifRankingForSequence(res, 2, rank=TRUE)) }
Note that this function asssumes that smaller values are better!
motifRecoveryAUC(seq.res)
motifRecoveryAUC(seq.res)
seq.res |
a matrix where each column represents a PWM and each row a result for a different sequence. |
Scan a number of sequences either to find overall affinity, or a number of hits over a score threshold.
motifScores( sequences, motifs, raw.scores = FALSE, verbose = TRUE, cutoff = NULL )
motifScores( sequences, motifs, raw.scores = FALSE, verbose = TRUE, cutoff = NULL )
sequences |
a set of sequences to be scanned, a list of DNAString or other scannable objects |
motifs |
a list of motifs either as frequency matrices (PFM) or as PWM objects. If PFMs are specified they are converted to PWMs using uniform background. |
raw.scores |
if to return raw scores (odds) for each position in the sequence. Note that scores for forward and reverse strand are concatenated into a single long vector of scores (twice the length of the sequence) |
verbose |
if to print verbose output |
cutoff |
if not NULL, will count number of matches with score above value specified (instead of returning the average affinity). Can either be one value, or a vector of values for each of the motifs. |
if raw.scores=FALSE, returns a matrix of mean scores (after cutoff if any), where columns are motifs. The returned values are either mean odd scores (not log-odd), or number of hits above a threshold; otherwise if raw.scores=TRUE, returns a list of raw score values (before cutoff)
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # affinity scores affinity = motifScores(DNAString("CGTAGGATAAAGTAACTAGTTGATGATGAAAG"), MotifDb.Dmel) # motif hit count with Patser score of 4 counts = motifScores(DNAString("CGTAGGATAAAGTAACTAGTTGATGATGAAAG"), MotifDb.Dmel, cutoff=log2(exp(4))) print(affinity) print(counts) # scanning multiple sequences sequences = list(DNAString("CGTAGGATAAAGTAACTAGTTGATGATGAAAG"), DNAString("TGAGACGAAGGGGATGAGATGCGGAAGAGTGAAA")) affinity2 = motifScores(sequences, MotifDb.Dmel) print(affinity2) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # affinity scores affinity = motifScores(DNAString("CGTAGGATAAAGTAACTAGTTGATGATGAAAG"), MotifDb.Dmel) # motif hit count with Patser score of 4 counts = motifScores(DNAString("CGTAGGATAAAGTAACTAGTTGATGATGAAAG"), MotifDb.Dmel, cutoff=log2(exp(4))) print(affinity) print(counts) # scanning multiple sequences sequences = list(DNAString("CGTAGGATAAAGTAACTAGTTGATGATGAAAG"), DNAString("TGAGACGAAGGGGATGAGATGCGGAAGAGTGAAA")) affinity2 = motifScores(sequences, MotifDb.Dmel) print(affinity2) }
The parameters and functionality are the same as motifScores
. Please refer to documentation of this function
for detailed explanation of functionality.
motifScoresBigMemory( sequences, motifs, raw.scores = FALSE, verbose = TRUE, cutoff = NULL, seq.all = NULL )
motifScoresBigMemory( sequences, motifs, raw.scores = FALSE, verbose = TRUE, cutoff = NULL, seq.all = NULL )
sequences |
set of input sequences |
motifs |
set of input PWMs or PFMs |
raw.scores |
if to return scores for each base-pair |
verbose |
if to produce verbose output |
cutoff |
the cutoff for calling binding sites (in base 2 log). |
seq.all |
already concatenated sequences if already available (used to internally speed up things) |
This function is not meant to be called directly, but is indirectly called by motifScores() once a global parameters useBigMemory is set.
This function calculates the normalized motif correlation as a measure of motif frequency matrix similarity.
motifSimilarity(m1, m2, trim = 0.4, self.sim = FALSE)
motifSimilarity(m1, m2, trim = 0.4, self.sim = FALSE)
m1 |
matrix with four rows representing the frequency matrix of first motif |
m2 |
matrix with four rows representing the frequency matrix of second motif |
trim |
bases with information content smaller than this value will be trimmed off both motif ends |
self.sim |
if to calculate self similarity (i.e. without including offset=0 in alignment) |
This score is essentially a normalized version of the sum of column correlations as proposed by Pietrokovski (1996). The sum is normalized by the average motif length of m1 and m2, i.e. (ncol(m1)+ncol(m2))/2. Thus, for two idential motifs this score is going to be 1. For unrelated motifs the score is going to be typically around 0.
Motifs need to aligned for this score to be calculated. The current implementation tries all possible ungapped alignment with a minimal of two basepair matching, and the maximal score over all alignments is returned.
Motif 1 is aligned both to Motif 2 and its reverse complement. Thus, the motif similarities are the same if the reverse complement of any of the two motifs is given.
Pietrokovski S. Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res 1996;24:3836-3845.
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # calculate the similarity of tin and vnd motifs (which are almost identical) motifSimilarity(MotifDb.Dmel.PFM[["tin"]], MotifDb.Dmel.PFM[["vnd"]]) # similarity of two unrelated motifs motifSimilarity(MotifDb.Dmel.PFM[["tin"]], MotifDb.Dmel.PFM[["ttk"]]) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # calculate the similarity of tin and vnd motifs (which are almost identical) motifSimilarity(MotifDb.Dmel.PFM[["tin"]], MotifDb.Dmel.PFM[["vnd"]]) # similarity of two unrelated motifs motifSimilarity(MotifDb.Dmel.PFM[["tin"]], MotifDb.Dmel.PFM[["ttk"]]) }
Columns stored in the motif enrichment report
## S4 method for signature 'MotifEnrichmentReport' names(x) ## S4 method for signature 'MotifEnrichmentReport' x$name ## S4 method for signature 'MotifEnrichmentReport' x[i, j, ..., drop = TRUE]
## S4 method for signature 'MotifEnrichmentReport' names(x) ## S4 method for signature 'MotifEnrichmentReport' x$name ## S4 method for signature 'MotifEnrichmentReport' x[i, j, ..., drop = TRUE]
x |
the MotifEnrichmentReport object |
name |
the variable name |
i |
the row selector |
j |
unused |
... |
unused |
drop |
unused (always FALSE) |
the names of the variables
Name of different pieces of information associated with MotifEnrichmentResults
## S4 method for signature 'MotifEnrichmentResults' names(x) ## S4 method for signature 'MotifEnrichmentResults' x$name
## S4 method for signature 'MotifEnrichmentResults' names(x) ## S4 method for signature 'MotifEnrichmentResults' x$name
x |
the MotifEnrichmentResults object |
name |
the variable name |
the names of the variables
Name of different pieces of information associated with PWM
Returns the motif length, i.e. the number of columns in the PWM.
## S4 method for signature 'PWM' names(x) ## S4 method for signature 'PWM' x$name ## S4 method for signature 'PWM' length(x)
## S4 method for signature 'PWM' names(x) ## S4 method for signature 'PWM' x$name ## S4 method for signature 'PWM' length(x)
x |
the PWM object |
name |
the variable name |
the names of the variables
Name of different pieces of information associated with PWMCutoffBackground
## S4 method for signature 'PWMCutoffBackground' names(x) ## S4 method for signature 'PWMCutoffBackground' x$name
## S4 method for signature 'PWMCutoffBackground' names(x) ## S4 method for signature 'PWMCutoffBackground' x$name
x |
the PWMCutoffBackground object |
name |
the variable name |
the names of the variables
Name of different pieces of information associated with PWMEmpiricalBackground
## S4 method for signature 'PWMEmpiricalBackground' names(x) ## S4 method for signature 'PWMEmpiricalBackground' x$name
## S4 method for signature 'PWMEmpiricalBackground' names(x) ## S4 method for signature 'PWMEmpiricalBackground' x$name
x |
the PWMEmpiricalBackground object |
name |
the variable name |
the names of the variables
Name of different pieces of information associated with PWMGEVBackground
## S4 method for signature 'PWMGEVBackground' names(x) ## S4 method for signature 'PWMGEVBackground' x$name
## S4 method for signature 'PWMGEVBackground' names(x) ## S4 method for signature 'PWMGEVBackground' x$name
x |
the PWMGEVBackground object |
name |
the variable name |
the names of the variables
Name of different pieces of information associated with PWMLognBackground
## S4 method for signature 'PWMLognBackground' names(x) ## S4 method for signature 'PWMLognBackground' x$name
## S4 method for signature 'PWMLognBackground' names(x) ## S4 method for signature 'PWMLognBackground' x$name
x |
the PWMLognBackground object |
name |
the variable name |
the names of the variables
Note that this function is deprecated and replaced by toPWM()
.
PFMtoPWM( motifs, id = names(motifs), name = names(motifs), seq.count = NULL, ... )
PFMtoPWM( motifs, id = names(motifs), name = names(motifs), seq.count = NULL, ... )
motifs |
a list of motifs represented as matrices of frequencies (PFM) |
id |
the set of IDs for the motifs (defaults to names of the 'motifs' list) |
name |
the set of names for the motifs (defaults to names of the 'motifs' list) |
seq.count |
if frequencies in the motifs are normalized to 1, provides a vector of sequence counts (e.g. for MotifDb motifs) |
... |
other parameters to PWMUnscaled |
## Not run: if (requireNamespace("PWMEnrich.Dmelanogaster.background")) { data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # convert to PWM with uniform background PFMtoPWM(MotifDb.Dmel.PFM) # get background for drosophila (quick mode on a reduced dataset) prior = getBackgroundFrequencies("dm3", quick=TRUE) # convert with genomic background PFMtoPWM(MotifDb.Dmel.PFM, prior.params=prior) } ## End(Not run)
## Not run: if (requireNamespace("PWMEnrich.Dmelanogaster.background")) { data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") # convert to PWM with uniform background PFMtoPWM(MotifDb.Dmel.PFM) # get background for drosophila (quick mode on a reduced dataset) prior = getBackgroundFrequencies("dm3", quick=TRUE) # convert with genomic background PFMtoPWM(MotifDb.Dmel.PFM, prior.params=prior) } ## End(Not run)
Plots a graphical version of the motif enrichment report. Note that all values are plotted, if you want to plot only a subset of a report, first select this subset (see examples).
## S4 method for signature 'MotifEnrichmentReport,missing' plot( x, y, fontsize = 14, id.fontsize = fontsize, header.fontsize = fontsize, widths = NULL, ... )
## S4 method for signature 'MotifEnrichmentReport,missing' plot( x, y, fontsize = 14, id.fontsize = fontsize, header.fontsize = fontsize, widths = NULL, ... )
x |
a MotifEnrichmentReport object |
y |
unused |
fontsize |
font size to use in the plot |
id.fontsize |
font size to use for the motif IDs |
header.fontsize |
font size of the header |
widths |
the relative widths of columns |
... |
unused if(requireNamespace("PWMEnrich.Dmelanogaster.background")) ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # produce a report for all sequences taken together r = groupReport(res) # plot the top 10 most enriched motifs plot(r[1:10]) |
This function produces a sequence logo (via package seqLogo).
## S4 method for signature 'PWM,missing' plot(x, y, ...)
## S4 method for signature 'PWM,missing' plot(x, y, ...)
x |
the PWM object |
y |
unused |
... |
other parameters to pass to seqLogo's |
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # plot the tinman motif from MotifDb plot(MotifDb.Dmel[["tin"]]) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # plot the tinman motif from MotifDb plot(MotifDb.Dmel[["tin"]]) }
This function visualises the motif scores for one or more sequences. Sequences are drawn as lines, and scores are plotted
as triangles at both sides of the line (corresponding to the two strands). The width of the base of the triangle corresponds to motif width and
the height to the motif log(score)
that is positive and greater than the cutoff
parameter (if specified). All scores
have the same y-axis, so the heights of bars are comparable between sequences and motifs.
plotMotifScores( scores, sel.motifs = NULL, seq.names = NULL, cols = NULL, cutoff = NULL, log.fun = log2, main = "", legend.space = 0.3, max.score = NULL, trans = 0.5, text.cex = 0.9, legend.cex = 0.9, motif.names = NULL, seq.len.spacing = 8, shape = "rectangle" )
plotMotifScores( scores, sel.motifs = NULL, seq.names = NULL, cols = NULL, cutoff = NULL, log.fun = log2, main = "", legend.space = 0.3, max.score = NULL, trans = 0.5, text.cex = 0.9, legend.cex = 0.9, motif.names = NULL, seq.len.spacing = 8, shape = "rectangle" )
scores |
the list of motifs scores. Each element of the list is a matrix of scores for one sequences. The columns in the matrix
correspond to different motifs. Each column contains the odds (not log-odds!) scores over both strands. For example,
for a sequence of length 5, scores for a 3 bp motifs could be: |
sel.motifs |
a vector of motif names. Use this parameter to show the motif hits to only a subset of motifs for which the scores are available. |
seq.names |
a vector of sequence names to show in the graph. If none specified, the sequences will be named Sequence 1, Sequence 2, ... |
cols |
a vector of colours to use to colour code motif hits. If none are specified, the current palette will be used. |
cutoff |
either a single value, or a vector of values. The values are PWM cutoffs after |
log.fun |
the logarithm function to use to calculate log-odds. By default log2 is used for consistency with Biostrings. |
main |
the main title |
legend.space |
the proportion of horizontal space to reserve for the legend. The default is 30%. |
max.score |
the maximal log-odds score used to scale all other scores. By default this values is automatically determined, but it can also be set manually to make multiple plots comparable. |
trans |
the level of transparency. By default 50% transparency to be able to see overlapping binding sites |
text.cex |
the scaling factor for sequence names |
legend.cex |
the scaling factor for the legend |
motif.names |
optional vector of motif names to show instead of those present as column names in |
seq.len.spacing |
the spacing (in bp units) between the end of the sequence line and the text showing the length in bp |
shape |
the shape to use to draw motif occurances, valid values are "rectangle" (default), "line" and "triangle" |
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # Load Drosophila PWMs data(MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # two sequences of interest sequences = list(DNAString("GAAGTATCAAGTGACCAGGTGAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) # select the tinman and snail motifs pwms = MotifDb.Dmel[c("tin", "sna")] # get the raw score that will be plotted scores = motifScores(sequences, pwms, raw.scores=TRUE) # plot the scores in both sequences, green for tin and blue for sna plotMotifScores(scores, cols=c("green", "blue")) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # Load Drosophila PWMs data(MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # two sequences of interest sequences = list(DNAString("GAAGTATCAAGTGACCAGGTGAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) # select the tinman and snail motifs pwms = MotifDb.Dmel[c("tin", "sna")] # get the raw score that will be plotted scores = motifScores(sequences, pwms, raw.scores=TRUE) # plot the scores in both sequences, green for tin and blue for sna plotMotifScores(scores, cols=c("green", "blue")) }
Individual motif logos are plotted on a rows x cols grid. This function is a convenience
interface for the seqLogoGrid
function that deals with viewpoint placement in a
matrix-like grid layout.
plotMultipleMotifs( pwms, titles = names(pwms), rows = ceiling(sqrt(length(pwms))), cols = ceiling(sqrt(length(pwms))), xmargin.scale = 0.4, ymargin.scale = 0.4, ... )
plotMultipleMotifs( pwms, titles = names(pwms), rows = ceiling(sqrt(length(pwms))), cols = ceiling(sqrt(length(pwms))), xmargin.scale = 0.4, ymargin.scale = 0.4, ... )
pwms |
a list of PWM objects or frequency matrices |
titles |
a characater vector of titles for each of the plots |
rows |
number of rows in the grid |
cols |
number or cols in the grid |
xmargin.scale |
the scaling parameter for the X-axis margin. Useful when plotting more than one logo on a page |
ymargin.scale |
the scaling parameter for the Y-axis margin. Useful when plotting more than one logo on a page |
... |
other parameters passed to seqLogoGrid() |
By default will try to make a square grid plot that would fit all the motifs and use list names as captions.
Plot a PFM (not PWM) using seqLogo
plotPFM(pfm, ...)
plotPFM(pfm, ...)
pfm |
a matrix where rows are the four nucleotides |
... |
additional parameters for plot() |
Plot the top N enrichment motifs in a group of sequences
## S4 method for signature 'MotifEnrichmentResults' plotTopMotifsGroup(obj, n, bg = TRUE, id = FALSE, ...)
## S4 method for signature 'MotifEnrichmentResults' plotTopMotifsGroup(obj, n, bg = TRUE, id = FALSE, ...)
obj |
a MotifEnrichmentResults object |
n |
the number of top ranked motifs to plot |
bg |
if to use background corrected P-values to do the ranking (if available) |
id |
if to show PWM IDs instead of target TF names |
... |
other parameters passed to |
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # plot the top 4 motifs in a 2x2 grid plotTopMotifsGroup(res, 4) # plot top 3 motifs in a single row plotTopMotifsGroup(res, 3, row=1, cols=3) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # plot the top 4 motifs in a 2x2 grid plotTopMotifsGroup(res, 4) # plot top 3 motifs in a single row plotTopMotifsGroup(res, 3, row=1, cols=3) }
Plot the top N enrichment motifs in a single sequence
## S4 method for signature 'MotifEnrichmentResults' plotTopMotifsSequence(obj, seq.id, n, bg = TRUE, id = FALSE, ...)
## S4 method for signature 'MotifEnrichmentResults' plotTopMotifsSequence(obj, seq.id, n, bg = TRUE, id = FALSE, ...)
obj |
a MotifEnrichmentResults object |
seq.id |
either the sequence number or sequence name |
n |
the number of top ranked motifs to plot |
bg |
if to use background corrected P-values to do the ranking (if available) |
id |
if to show PWM IDs instead of target TF names |
... |
other parameters passed to |
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # plot the top 4 motifs in a 2x2 grid plotTopMotifsSequence(res, 1, 4) # plot top 3 motifs in a single row plotTopMotifsSequence(res, 1, 3, row=1, cols=3) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # plot the top 4 motifs in a 2x2 grid plotTopMotifsSequence(res, 1, 4) # plot top 3 motifs in a single row plotTopMotifsSequence(res, 1, 3, row=1, cols=3) }
A class that represents a Position Weight Matrix (PWM)
id
:a systematic ID given to this PWM, could include the source, version, etc
name
:the name of the transcription factor (TF) to which the PWM corresponds to
pfm
:Position Frequency Matrix (PFM) from which the PWM is derived
prior.params
:Defines prior frequencies of the four bases (A,C,G,T), a named vector. These will be added to individual values for the PFM and at the same time used as background probabilities
pwm
:Final Position Weight Matrix (PWM) constructed using prior.params with logarithm base 2
Hit count background distribution for a set of PWMs
bg.source
:textual description of where the background distribution is derived from
bg.cutoff
:the cutoff score used to find significant motif hits (in log2 odds), either a single value or a vector of values
bg.P
:the density of significant motif hits per nucleotide in background
pwms
:the pwms for which the background has been compiled
This object contains raw scores for one very long sequence, thus it can be very large.
bg.source
:textual description of where the background distribution is derived from
bg.fwd
:affinity scores (odds) for the forward strand. PWMs as columns
bg.rev
:affinity scores (odds) for the reverse strand. PWMs as columns
pwms
:the pwms for which the background has been compiled
The three parameters of the GEV distribution are fitted by doing linear regression on log of sequence length.
bg.source
:textual description of where the background distribution is derived from
bg.loc
:linear regression model for estimating the location parameter based on log(L), list of lm objects of PWMs
bg.scale
:linear regression model for estimating the scale parameter based on log(L), list of lm objects of PWMs
bg.shape
:linear regression model for estimating the shape parameter based on log(L), list of lm objects of PWMs
pwms
:the pwms for which the background has been compiled
Lognormal background distribution for a set of PWMs
bg.source
:textual description of where the background distribution is derived from
bg.len
:the length to which the background is normalized to. This is a vector of values, can have a different value for each motif.
bg.mean
:the mean value of the lognormal distribution at bg.len
bg.sd
:the standard deviation of the lognormal distribution at bg.len
pwms
:the pwms for which the background has been compiled
The PWM function from Biostrings without unit scaling
PWMUnscaled( x, id = "", name = "", type = c("log2probratio", "prob"), prior.params = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25), pseudo.count = prior.params, unit.scale = FALSE, seq.count = NULL )
PWMUnscaled( x, id = "", name = "", type = c("log2probratio", "prob"), prior.params = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25), pseudo.count = prior.params, unit.scale = FALSE, seq.count = NULL )
x |
the integer count matrix representing the motif, rows as nucleotides |
id |
a systematic ID given to this PWM, could include the source, version, etc |
name |
the name of the transcription factor (TF) to which the PWM corresponds to |
type |
the type of PWM calculation, either as log2-odds, or posterior probability (frequency matrix) |
prior.params |
the pseudocounts for each of the nucleotides |
pseudo.count |
the pseudo-count values if different from priors |
unit.scale |
if to unit.scale the pwm (default is no unit scaling) |
seq.count |
if x is a normalised PFM (i.e. with probabilities instead of sequence counts), then this sequence count
will be used to convert |
By default the Biostrings package scales the log-odds score so it is within 0 and 1. In this function we take a more traditional approach with no unit scaling and offer unit scaling as an additional parameter.
See ?PWM from Biostrings for more information on input arguments.
a new PWM object representing the PWM
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") ttk = MotifDb.Dmel.PFM[["ttk"]] # make a PWM with uniform background PWMUnscaled(ttk, id="ttk-JASPAR", name="ttk") # custom background PWMUnscaled(ttk, id="ttk-JASPAR", name="ttk", prior.params=c("A"= 0.2, "C" = 0.3, "G" = 0.3, "T" = 0.2)) # get background for drosophila (quick mode on a reduced dataset) prior = getBackgroundFrequencies("dm3", quick=TRUE) # convert using genomic background PWMUnscaled(ttk, id="ttk-JASPAR", name="ttk", prior.params=prior) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") ttk = MotifDb.Dmel.PFM[["ttk"]] # make a PWM with uniform background PWMUnscaled(ttk, id="ttk-JASPAR", name="ttk") # custom background PWMUnscaled(ttk, id="ttk-JASPAR", name="ttk", prior.params=c("A"= 0.2, "C" = 0.3, "G" = 0.3, "T" = 0.2)) # get background for drosophila (quick mode on a reduced dataset) prior = getBackgroundFrequencies("dm3", quick=TRUE) # convert using genomic background PWMUnscaled(ttk, id="ttk-JASPAR", name="ttk", prior.params=prior) }
A helper function for motifRankingForGroup and motifRankingForSequence with the common code
rankingProcessAndReturn(res, r, id, order, rank, unique, decreasing)
rankingProcessAndReturn(res, r, id, order, rank, unique, decreasing)
res |
the list of results from MotifEnrichmentResults object |
r |
the vector of raw results that needs to be processed |
id |
if to return IDs instead of names |
order |
if to return the ordering of motifs |
rank |
if to return the rank of motifs |
unique |
if to remove duplicates |
decreasing |
specifies the sorting order |
Read motifs in JASPAR format
readJASPAR(file, remove.ids = FALSE)
readJASPAR(file, remove.ids = FALSE)
file |
the filename |
remove.ids |
if to strip JASPAR ID's from motif names, e.g. "MA0211.1 bap" would become just "bap" |
a list of matrices representing motifs (with four nucleotides as rows)
The format is autodetected based on file format. If the autodetection fail then the file cannot be read.
readMotifs(file, remove.acc = FALSE)
readMotifs(file, remove.acc = FALSE)
file |
the filename |
remove.acc |
if to remove accession numbers. If TRUE, the AC entry in TRANSFAC files is ignored, and the accession is stripped from JASPAR, e.g. motif with name "MA0211.1 bap" would become just "bap". If FALSE, botht he AC and ID are used to generate the TRANSFAC name and the original motif names are preserved in JASPAR files. |
a list of 4xL matrices representing motifs (four nucleotides as rows)
# read in example TRANSFAC motifs without accession codes (just IDs) readMotifs(system.file(package = "PWMEnrich", dir = "extdata", file = "example.transfac"), remove.acc = TRUE) # read in the JASPAR insects motifs provided as example readMotifs(system.file(package = "PWMEnrich", dir = "extdata", file = "jaspar-insecta.jaspar"), remove.acc = TRUE)
# read in example TRANSFAC motifs without accession codes (just IDs) readMotifs(system.file(package = "PWMEnrich", dir = "extdata", file = "example.transfac"), remove.acc = TRUE) # read in the JASPAR insects motifs provided as example readMotifs(system.file(package = "PWMEnrich", dir = "extdata", file = "jaspar-insecta.jaspar"), remove.acc = TRUE)
Read in motifs in TRANSFAC format
readTRANSFAC(file, remove.acc = TRUE)
readTRANSFAC(file, remove.acc = TRUE)
file |
the filename |
remove.acc |
if to ignore transfac accession numbers |
a list of matrices representing motifs (with four nucleotides as rows)
Certain functions (like motif scanning) can be parallelized in PWMEnrich. This function registers a number of parallel cores (via core package parallel) to be used in code that can be parallelized. After this function is called, all further PWMEnrich function calls will run in parallel if possible.
registerCoresPWMEnrich(numCores = NA)
registerCoresPWMEnrich(numCores = NA)
numCores |
number of cores to use (default to take all cores), or NULL if no parallel execution is to be used |
By default parallel execution is turned off. To turn it off after using it, call this function by passing NULL.
## Not run: registerCoresPWMEnrich(4) # use 4 CPU cores in PWMEnrich registerCoresPWMEnrich() # use maximal number of CPUs registerCoresPWMEnrich(NULL) # do not use parallel execution ## End(Not run)
## Not run: registerCoresPWMEnrich(4) # use 4 CPU cores in PWMEnrich registerCoresPWMEnrich() # use maximal number of CPUs registerCoresPWMEnrich(NULL) # do not use parallel execution ## End(Not run)
Finds the reverse complement of the PWM
## S4 method for signature 'PWM' reverseComplement(x, ...)
## S4 method for signature 'PWM' reverseComplement(x, ...)
x |
an object of type PWM |
... |
unused |
an object of type PWM that is reverse complement of x
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") reverseComplement(MotifDb.Dmel.PFM[["ttk"]]) # reverse complement of the ttk PWM }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") reverseComplement(MotifDb.Dmel.PFM[["ttk"]]) # reverse complement of the ttk PWM }
The whole sequence is scanned with a PWM and scores returned beginning at each position. Partial motif matches are not done, thus the last #[length of motif]-1 scores are NA.
scanWithPWM( pwm, dna, pwm.rev = NULL, odds.score = FALSE, both.strands = FALSE, strand.fun = "mean" )
scanWithPWM( pwm, dna, pwm.rev = NULL, odds.score = FALSE, both.strands = FALSE, strand.fun = "mean" )
pwm |
PWM object |
dna |
a DNAString or other sequence from Biostrings |
pwm.rev |
the reverse complement for a pwm (if it is already pre-computed) |
odds.score |
if to return raw scores in odds (not logodds) space |
both.strands |
if to return results on both strands |
strand.fun |
which function to use to summarise values over two strands (default is "mean") |
The function returns either an odds average (*not* log-odds average), maximal score on each strand, or scores on both strands.
The function by default returns the score in log2 following the package Biostrings
.
a vector representing scores starting at each position, or a matrix with score in the two strands
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") ttk = MotifDb.Dmel[["ttk"]] # odds average over the two strands expressed as log2-odds scanWithPWM(ttk, DNAString("CGTAGGATAAAGTAACT")) # log2-odds scores on both strands scanWithPWM(ttk, DNAString("CGTAGGATAAAGTAACT"), both.strands=TRUE) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ data(MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") ttk = MotifDb.Dmel[["ttk"]] # odds average over the two strands expressed as log2-odds scanWithPWM(ttk, DNAString("CGTAGGATAAAGTAACT")) # log2-odds scores on both strands scanWithPWM(ttk, DNAString("CGTAGGATAAAGTAACT"), both.strands=TRUE) }
This function comes from the seqLogo package. It has been modified to remove some unneccessary code as suggested by W Huber (https://stat.ethz.ch/pipermail/bioconductor/2010-September/035267.html).
seqLogoGrid( pwm, ic.scale = TRUE, xaxis = TRUE, yaxis = TRUE, xfontsize = 10, yfontsize = 10, xmargin.scale = 1, ymargin.scale = 1, title = "", titlefontsize = 15 )
seqLogoGrid( pwm, ic.scale = TRUE, xaxis = TRUE, yaxis = TRUE, xfontsize = 10, yfontsize = 10, xmargin.scale = 1, ymargin.scale = 1, title = "", titlefontsize = 15 )
pwm |
numeric The 4xW position weight matrix. |
ic.scale |
logical If TRUE, the height of each column is proportional to its information content. Otherwise, all columns have the same height. |
xaxis |
logical If TRUE, an X-axis will be plotted. |
yaxis |
logical If TRUE, a Y-axis will be plotted. |
xfontsize |
numeric Font size to be used for the X-axis. |
yfontsize |
numeric Font size to be used for the Y-axis. |
xmargin.scale |
the scaling parameter for the X-axis margin. Useful when plotting more than one logo on a page |
ymargin.scale |
the scaling parameter for the Y-axis margin. Useful when plotting more than one logo on a page |
title |
to be shown on the top |
titlefontsize |
the fontsize of the title |
Use this function for more advanced plotting where the viewports are directly set up and maintained (see package grid
).
Generate a motif enrichment report for a single sequence
## S4 method for signature 'MotifEnrichmentResults' sequenceReport(obj, seq.id, bg = TRUE, ...)
## S4 method for signature 'MotifEnrichmentResults' sequenceReport(obj, seq.id, bg = TRUE, ...)
obj |
a MotifEnrichmentResults object |
seq.id |
the sequence index or name |
bg |
if to use background corrected P-values to do the ranking (if available) |
... |
unused |
a MotifEnrichmentReport object containing a table with the following columns:
'rank' - The rank of the PWM's enrichment in the sequence
'target' - The name of the PWM's target gene, transcript or protein complex.
'id' - The unique identifier of the PWM (if set during PWM creation).
'raw.score' - The raw score before P-value calculation
'p.value' - The P-value of motif enrichment (if available)
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # reports for the two sequences r1 = sequenceReport(res, 1) r2 = sequenceReport(res, 2) # view the results r1 r2 # plot the top 10 most enriched motifs in the first, and then second sequence plot(r1[1:10]) plot(r2[1:10]) }
if(requireNamespace("PWMEnrich.Dmelanogaster.background")){ ### # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel, package = "PWMEnrich.Dmelanogaster.background") # scan two sequences for motif enrichment sequences = list(DNAString("GAAGTATCAAGTGACCAGTAAGTCCCAGATGA"), DNAString("AGGTAGATAGAACAGTAGGCAATGAAGCCGATG")) res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) # reports for the two sequences r1 = sequenceReport(res, 1) r2 = sequenceReport(res, 2) # view the results r1 r2 # plot the top 10 most enriched motifs in the first, and then second sequence plot(r1[1:10]) plot(r2[1:10]) }
show method for MotifEnrichmentReport
## S4 method for signature 'MotifEnrichmentReport' show(object)
## S4 method for signature 'MotifEnrichmentReport' show(object)
object |
the MotifEnrichmentReport object |
show method for MotifEnrichmentResults
## S4 method for signature 'MotifEnrichmentResults' show(object)
## S4 method for signature 'MotifEnrichmentResults' show(object)
object |
the MotifEnrichmentResults object |
show method for PWM
## S4 method for signature 'PWM' show(object)
## S4 method for signature 'PWM' show(object)
object |
the PWM object |
show method for PWMCutoffBackground
## S4 method for signature 'PWMCutoffBackground' show(object)
## S4 method for signature 'PWMCutoffBackground' show(object)
object |
the PWMCutoffBackground object |
show method for PWMEmpiricalBackground
## S4 method for signature 'PWMEmpiricalBackground' show(object)
## S4 method for signature 'PWMEmpiricalBackground' show(object)
object |
the PWMEmpiricalBackground object |
show method for PWMGEVBackground
## S4 method for signature 'PWMGEVBackground' show(object)
## S4 method for signature 'PWMGEVBackground' show(object)
object |
the PWMGEVBackground object |
show method for PWMLognBackground
## S4 method for signature 'PWMLognBackground' show(object)
## S4 method for signature 'PWMLognBackground' show(object)
object |
the PWMLognBackground object |
Convert motifs into PWMs
toPWM( motifs, ids = names(motifs), targets = names(motifs), seq.count = 50, prior = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25), ... )
toPWM( motifs, ids = names(motifs), targets = names(motifs), seq.count = 50, prior = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25), ... )
motifs |
a list of motifs either as position probability matrices (PPM) or frequency matirces (PFMs) |
ids |
the set of IDs for the motifs (defaults to names of the 'motifs' list) |
targets |
the set of target TF names for the motifs (defaults to names of the 'motifs' list) |
seq.count |
provides a vector of sequence counts for probability matrices (PPMs). Default it 50. |
prior |
frequencies of the four letters in the genome. Default is uniform background. |
... |
other parameters to PWMUnscaled |
## Not run: if (requireNamespace("PWMEnrich.Dmelanogaster.background")) { data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") toPWM(MotifDb.Dmel.PFM) # convert to PWM with uniform background # get background for drosophila (quick mode on a reduced dataset) prior = getBackgroundFrequencies("dm3", quick=TRUE) toPWM(MotifDb.Dmel.PFM, prior=prior) # convert with genomic background } ## End(Not run)
## Not run: if (requireNamespace("PWMEnrich.Dmelanogaster.background")) { data(MotifDb.Dmel.PFM, package = "PWMEnrich.Dmelanogaster.background") toPWM(MotifDb.Dmel.PFM) # convert to PWM with uniform background # get background for drosophila (quick mode on a reduced dataset) prior = getBackgroundFrequencies("dm3", quick=TRUE) toPWM(MotifDb.Dmel.PFM, prior=prior) # convert with genomic background } ## End(Not run)
This function tries all offsets of motif1 compared to motif2 and returns the maximal (unnormalized) correlation score.
tryAllMotifAlignments(m1, m2, min.align = 2, exclude.zero = FALSE)
tryAllMotifAlignments(m1, m2, min.align = 2, exclude.zero = FALSE)
m1 |
frequency matrix of motif 1 |
m2 |
frequency matrix of motif 2 |
min.align |
minimal number of basepairs that need to align |
exclude.zero |
if to exclude offset=0, useful for calculating self-similarity |
The correlation score is essentially the sum of correlations of individual aligned columns as described in Pietrokovski (1996).
single maximal score
Pietrokovski S. Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res 1996;24:3836-3845.
If to use a faster implementation of motif scanning that requires abount 5 to 10 times more memory
useBigMemoryPWMEnrich(useBigMemory = FALSE)
useBigMemoryPWMEnrich(useBigMemory = FALSE)
useBigMemory |
a boolean value denoting if to use big memory implementation |
## Not run: useBigMemoryPWMEnrich(TRUE) # switch to big memory implementation globally useBigMemoryPWMEnrich(FALSE) # switch back to default implementation ## End(Not run)
## Not run: useBigMemoryPWMEnrich(TRUE) # switch to big memory implementation globally useBigMemoryPWMEnrich(FALSE) # switch back to default implementation ## End(Not run)