Title: | Binned Motif Enrichment Analysis and Visualization |
---|---|
Description: | Useful functions to work with sequence motifs in the analysis of genomics data. These include methods to annotate genomic regions or sequences with predicted motif hits and to identify motifs that drive observed changes in accessibility or expression. Functions to produce informative visualizations of the obtained results are also provided. |
Authors: | Dania Machlab [aut] , Lukas Burger [aut] , Charlotte Soneson [aut] , Dany Mukesha [ctb] , Michael Stadler [aut, cre] |
Maintainer: | Michael Stadler <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.13.0 |
Built: | 2024-11-29 08:33:03 UTC |
Source: | https://github.com/bioc/monaLisa |
create an annotation for a Heatmap
containing sequence logos.
annoSeqlogo( grobL, which = c("column", "row"), space = unit(0.5, "mm"), width = NULL, height = NULL, gp = gpar(fill = NA, col = NA) )
annoSeqlogo( grobL, which = c("column", "row"), space = unit(0.5, "mm"), width = NULL, height = NULL, gp = gpar(fill = NA, col = NA) )
grobL |
A |
which |
Whether it is a column annotation or a row annotation? |
space |
The space around the image to the annotation grid borders. The value should be a unit object. |
width |
Width of the annotation. The value should be an absolute unit. Width is not allowed to be set for column annotation. |
height |
Height of the annotation. The value should be an absolute unit. Height is not allowed to be set for row annotation. |
gp |
Graphic parameters for annotation grids. Can be used to control the background color in the annotation grids. |
An annotation function which can be used in
HeatmapAnnotation
.
if (require(JASPAR2020) && require(TFBSTools) && require(gridExtra)) { pfm1 <- getMatrixByID(JASPAR2020, "MA0139") g1 <- seqLogoGrob(pfm1) anno <- annoSeqlogo(list(g1)) }
if (require(JASPAR2020) && require(TFBSTools) && require(gridExtra)) { pfm1 <- getMatrixByID(JASPAR2020, "MA0139") g1 <- seqLogoGrob(pfm1) anno <- annoSeqlogo(list(g1)) }
x
.bin
groups elements of x
into bins with either a
constant number of elements per bin, a constant bin width or according to
user-provided bin boundaries.
bin( x, binmode = c("equalN", "equalWidth", "breaks"), nElements = round(length(x)/5), nBins = NULL, minAbsX = NULL, breaks = NULL, ... )
bin( x, binmode = c("equalN", "equalWidth", "breaks"), nElements = round(length(x)/5), nBins = NULL, minAbsX = NULL, breaks = NULL, ... )
x |
A numerical vector with the values used for binning. |
binmode |
The algorithm to be used for binning. Possible values are: "equalN" (default), "equalWidth" or "breaks" (see Details). |
nElements |
The number of elements per bin (only for
|
nBins |
The number of bins (only for |
minAbsX |
The minimal absolute value in |
breaks |
Numerical vector with bin boundaries (only for
|
... |
further arguments to be passed to |
Elements are binned according to the values in x
depending on
binmode
:
Items are grouped into a variable
number of bins with nElements
elements each. If minAbsX
is
not NULL
, elements with x
-values in [-minAbsX,minAbsX]
will first be collected in a single bin before binning the remaining
elements. The boundaries of this single bin may be slightly adjusted in
order to respect the nElements
elements in the other bins.
Items are group into nBins
bins with a variable
number of elements each.
Items are grouped into bins using
cut(x, breaks, include.lowest = TRUE)
The return value from cut(x, ...)
, typically a factor of the
same length as x
. Binning mode, bin boundaries and the "neutral" bin
are available from attr(..., "binmode")
, attr(..., "breaks")
and attr(..., "bin0")
. For binmode = "breaks"
, the latter
will be NA
.
cut
which is used internally.
set.seed(1) x <- rnorm(100) summary(bin(x, "equalN", nElements=10)) summary(bin(x, "equalN", nElements=10, minAbsX=0.5)) summary(bin(x, "equalWidth", nBins=5)) summary(bin(x, "breaks", breaks=c(-10,-1,0,1,10)))
set.seed(1) x <- rnorm(100) summary(bin(x, "equalN", nElements=10)) summary(bin(x, "equalN", nElements=10, minAbsX=0.5)) summary(bin(x, "equalWidth", nBins=5)) summary(bin(x, "breaks", breaks=c(-10,-1,0,1,10)))
Given a set of sequences and corresponding bins, identify enriched k-mers (n-grams) in each bin. The sequences can be given either directly or as genomic coordinates.
calcBinnedKmerEnr( seqs, bins = NULL, kmerLen = 5, background = c("otherBins", "allBins", "zeroBin", "genome", "model"), MMorder = 1, test = c("fisher", "binomial"), includeRevComp = TRUE, maxFracN = 0.7, maxKmerSize = 3L, GCbreaks = c(0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8), pseudocount.kmers = 1, pseudocount.log2enr = 8, p.adjust.method = "BH", genome = NULL, genome.regions = NULL, genome.oversample = 2, BPPARAM = SerialParam(), verbose = FALSE )
calcBinnedKmerEnr( seqs, bins = NULL, kmerLen = 5, background = c("otherBins", "allBins", "zeroBin", "genome", "model"), MMorder = 1, test = c("fisher", "binomial"), includeRevComp = TRUE, maxFracN = 0.7, maxKmerSize = 3L, GCbreaks = c(0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8), pseudocount.kmers = 1, pseudocount.log2enr = 8, p.adjust.method = "BH", genome = NULL, genome.regions = NULL, genome.oversample = 2, BPPARAM = SerialParam(), verbose = FALSE )
seqs |
|
bins |
factor of the same length and order as |
kmerLen |
A |
background |
A |
MMorder |
A |
test |
A |
includeRevComp |
A |
maxFracN |
A numeric scalar with the maximal fraction of N bases allowed in a sequence (defaults to 0.7). Sequences with higher fractions are excluded from the analysis. |
maxKmerSize |
the maximum k-mer size to consider, when adjusting background sequence weights for k-mer composition compared to the foreground sequences. The default value (3) will correct for mono-, di- and tri-mer composition. |
GCbreaks |
The breaks between GC bins. The default value is based on the hard-coded bins used in Homer. |
pseudocount.kmers |
A |
pseudocount.log2enr |
A numerical scalar with the pseudocount to add to foreground and background counts when calculating log2 motif enrichments |
p.adjust.method |
A character scalar selecting the p value adjustment
method (used in |
genome |
A |
genome.regions |
An optional |
genome.oversample |
A |
BPPARAM |
An optional |
verbose |
A |
This function implements a binned k-mer enrichment analysis. In each
enrichment analysis, the sequences in a specific bin are used as foreground
sequences to test for k-mer enrichments comparing to background sequences
(defined by background
, see below), similarly as in done for motifs
in calcBinnedMotifEnrR
. Sequences are weighted to correct for
GC and shorter k-mer composition differences between fore- and background
sets.
The background sequences are defined according to the value of the
background
argument:
: sequences from all other bins (excluding the current bin)
: sequences from all bins (including the current bin)
: sequences from the "zero bin", defined by the
maxAbsX
argument of bin
. If bins
does not define a "zero bin", for example because it was created by
bin(..., maxAbsX = NULL)
, selecting this background definition
will abort with an error.
: sequences randomly sampled from the genome (or the
intervals defined in genome.regions
if given). For each
foreground sequence, genome.oversample
background sequences
of the same size are sampled (on average). From these, one per
foreground sequence is selected trying to match the G+C composition.
In order to make the sampling deterministic, a seed number needs to be
provided to the RNGseed
parameter in
SerialParam
or MulticoreParam
when creating the
BiocParallelParam
instance in BPPARAM
.
: a Markov model of the order MMorder
is estimated
from the foreground sequences and used to estimate expected k-mer
frequencies. K-mer enrichments are then calculated comparing observed
to these expected frequencies. In order to make the process
deterministic, a seed number needs to be provided to the
RNGseed
parameter in SerialParam
or
MulticoreParam
when creating the
BiocParallelParam
instance in BPPARAM
.
For each k-mer, the weights of sequences is multiplied with the number
of k-mer occurrences in each sequence and summed, separately for foreground
(sumForegroundWgtWithHits
) and background
(sumBackgroundWgtWithHits
) sequences. The function works in ZOOPS
(Zero-Or-One-Per-Sequence) mode, so at most one occurrence per
sequence is counted, which helps reduce the impact of sequence repeats.
The total foreground (totalWgtForeground
) and background
(totalWgtBackground
) sum of sequence weights is also calculated. If
a k-mer has zero sumForegroundWgtWithHits
and
sumBackgroundWgtWithHits
, then any values (p-values and enrichment)
that are calculated using these two numbers are set to NA.
Two statistical tests for the calculation of enrichment log p-value are
available: test = "fisher"
(default) to perform Fisher's exact
tests, or test = "binomial"
to perform binomial tests, using:
: fisher.test(x = tab, alternative =
"greater")
, where tab
is the contingency table with the summed
weights of sequences in foreground or background sets (rows), and with
or without a occurrences of a particular k-mer (columns).
: pbinom(q = sumForegroundWgtWithHits - 1, size =
totalWgtForeground,
prob = sumBackgroundWgtWithHits / totalWgtBackground,
lower.tail = FALSE, log.p = TRUE)
A SummarizedExperiment
object
with motifs in rows and bins in columns, containing seven assays:
: -log10 P values
: -log10 adjusted P values
: k-mer enrichments as Pearson residuals
: expected number of foreground sequences with motif hits
: k-mer enrichments as log2 ratios
: Sum of foreground sequence weights in a bin that have k-mer occurrences
: Sum of background sequence weights in a bin that have k-mer occurrences
#' The rowData
of the object contains annotations (name, PFMs, PWMs
and GC fraction) for the k-mers, while the colData
slot contains
summary information about the bins.
getKmerFreq
used to calculate k-mer enrichments;
getSeq,BSgenome-method
which is used to extract
sequences from genomepkg
if x
is a GRanges
object;
bplapply
that is used for parallelization;
bin
for binning of regions
seqs <- Biostrings::DNAStringSet(c("GCATGCATGC", "CATGCGCATG")) bins <- factor(1:2) calcBinnedKmerEnr(seqs = seqs, bins = bins, kmerLen = 3)
seqs <- Biostrings::DNAStringSet(c("GCATGCATGC", "CATGCGCATG")) bins <- factor(1:2) calcBinnedKmerEnr(seqs = seqs, bins = bins, kmerLen = 3)
Run complete HOMER motif enrichment analysis, consisting of
calls to prepareHomer
, system2
and
parseHomerOutput
. This function requires HOMER
to be installed (see http://homer.ucsd.edu/homer/index.html)
and the path to the tool to be provided (homerfile
argument).
calcBinnedMotifEnrHomer( gr, b, genomedir, outdir, motifFile, homerfile = findHomer(), regionsize = "given", pseudocount.log2enr = 8, p.adjust.method = "BH", Ncpu = 2L, verbose = FALSE, verbose.Homer = FALSE )
calcBinnedMotifEnrHomer( gr, b, genomedir, outdir, motifFile, homerfile = findHomer(), regionsize = "given", pseudocount.log2enr = 8, p.adjust.method = "BH", Ncpu = 2L, verbose = FALSE, verbose.Homer = FALSE )
gr |
A |
b |
A vector of the same length as |
genomedir |
Directory containing sequence files in Fasta format (one per chromosome). |
outdir |
A path specifying the folder into which the output files will be written. |
motifFile |
A file with HOMER formatted PWMs to be used in the enrichment analysis. |
homerfile |
Path and file name of the |
regionsize |
The peak size to use in HOMER ( |
pseudocount.log2enr |
A numerical scalar with the pseudocount to add to foreground and background counts when calculating log2 motif enrichments |
p.adjust.method |
A character scalar selecting the p value adjustment
method (used in |
Ncpu |
Number of parallel threads that HOMER can use. |
verbose |
A logical scalar. If |
verbose.Homer |
A logical scalar. If |
A SummarizedExperiment
object with motifs in rows and bins
in columns, containing seven assays:
: -log10 P values
: -log10 adjusted P values
: motif enrichments as Pearson residuals
: expected number of foreground sequences with motif hits
: motif enrichments as log2 ratios
: Sum of foreground sequence weights in a bin that have motif hits
: Sum of background sequence weights in a bin that have motif hits
The rowData
of the object contains annotations (name, PFMs, PWMs
and GC fraction) for the motifs, while the colData
slot contains
summary information about the bins.
The functions that are wrapped: prepareHomer
,
system2
and parseHomerOutput
,
bin
for binning of regions
if (!is.na(findHomer())){ # genome genome <- system.file("extdata", "exampleGenome.fa", package = "monaLisa") # create motif file for Homer motiffile <- tempfile() motifIDs <- c("MA0139.1", "MA1102.1", "MA0740.1") dumpJaspar(filename = motiffile, pkg = "JASPAR2020", opts = list(ID = motifIDs)) # GRanges of regions used in binned motif enrichment analysis gr <- GenomicRanges::tileGenome( seqlengths = c(chr1 = 10000L, chr2 = 10000L, chr3 = 10000L), tilewidth = 200, cut.last.tile.in.chrom = TRUE) # create bins (motif enrichment analysis will be per bin) bins <- factor(GenomicRanges::seqnames(gr)) table(bins) # run calcBinnedMotifEnrHomer outdir <- tempfile() se <- calcBinnedMotifEnrHomer(gr = gr, b = bins, genomedir = genome, outdir = outdir, motifFile = motiffile) list.files(outdir) }
if (!is.na(findHomer())){ # genome genome <- system.file("extdata", "exampleGenome.fa", package = "monaLisa") # create motif file for Homer motiffile <- tempfile() motifIDs <- c("MA0139.1", "MA1102.1", "MA0740.1") dumpJaspar(filename = motiffile, pkg = "JASPAR2020", opts = list(ID = motifIDs)) # GRanges of regions used in binned motif enrichment analysis gr <- GenomicRanges::tileGenome( seqlengths = c(chr1 = 10000L, chr2 = 10000L, chr3 = 10000L), tilewidth = 200, cut.last.tile.in.chrom = TRUE) # create bins (motif enrichment analysis will be per bin) bins <- factor(GenomicRanges::seqnames(gr)) table(bins) # run calcBinnedMotifEnrHomer outdir <- tempfile() se <- calcBinnedMotifEnrHomer(gr = gr, b = bins, genomedir = genome, outdir = outdir, motifFile = motiffile) list.files(outdir) }
monaLisa
This function performs a motif enrichment analysis on bins of sequences. For each bin, the sequences in all other bins are used as background.
calcBinnedMotifEnrR( seqs, bins = NULL, pwmL = NULL, background = c("otherBins", "allBins", "zeroBin", "genome"), test = c("fisher", "binomial"), maxFracN = 0.7, maxKmerSize = 3L, min.score = 10, matchMethod = "matchPWM", GCbreaks = c(0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8), pseudocount.log2enr = 8, p.adjust.method = "BH", genome = NULL, genome.regions = NULL, genome.oversample = 2, BPPARAM = SerialParam(), verbose = FALSE, ... )
calcBinnedMotifEnrR( seqs, bins = NULL, pwmL = NULL, background = c("otherBins", "allBins", "zeroBin", "genome"), test = c("fisher", "binomial"), maxFracN = 0.7, maxKmerSize = 3L, min.score = 10, matchMethod = "matchPWM", GCbreaks = c(0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8), pseudocount.log2enr = 8, p.adjust.method = "BH", genome = NULL, genome.regions = NULL, genome.oversample = 2, BPPARAM = SerialParam(), verbose = FALSE, ... )
seqs |
|
bins |
factor of the same length and order as |
pwmL |
PWMatrixList with motifs for which to calculate enrichments. |
background |
A |
test |
A |
maxFracN |
A numeric scalar with the maximal fraction of N bases allowed in a sequence (defaults to 0.7). Sequences with higher fractions are excluded from the analysis. |
maxKmerSize |
the maximum k-mer size to consider, when adjusting background sequence weights for k-mer composition compared to the foreground sequences. The default value (3) will correct for mono-, di- and tri-mer composition. |
min.score |
the minimal score for motif hits, used in
|
matchMethod |
the method used to scan for motif hits, passed to the
|
GCbreaks |
The breaks between GC bins. The default value is based on the hard-coded bins used in Homer. |
pseudocount.log2enr |
A numerical scalar with the pseudocount to add to foreground and background counts when calculating log2 motif enrichments |
p.adjust.method |
A character scalar selecting the p value adjustment
method (used in |
genome |
A |
genome.regions |
An optional |
genome.oversample |
A |
BPPARAM |
An optional |
verbose |
A logical scalar. If |
... |
Additional arguments for |
This function implements a binned motif enrichment analysis. In each
enrichment analysis, the sequences in a specific bin are used as foreground
sequences to test for motif enrichments comparing to background sequences
(defined by background
, see below). The logic follows the
findMotifsGenome.pl
tool from Homer
version 4.11, with
-size given -nomotif -mknown
and additionally -h
if using
test = "fisher"
, and gives very similar results. As in the
Homer
tool, sequences are weighted to correct for GC and k-mer
composition differences between fore- and background sets.
The background sequences are defined according to the value of the
background
argument:
: sequences from all other bins (excluding the current bin)
: sequences from all bins (including the current bin)
: sequences from the "zero bin", defined by the
maxAbsX
argument of bin
. If bins
does not define a "zero bin", for example because it was created by
bin(..., maxAbsX = NULL)
, selecting this background definition
will abort with an error.
: sequences randomly sampled from the genome (or the
intervals defined in genome.regions
if given). For each
foreground sequence, genome.oversample
background sequences
of the same size are sampled (on average). From these, one per
foreground sequence is selected trying to match the G+C composition.
In order to make the sampling deterministic, a seed number needs to be
provided to the RNGseed
parameter in
SerialParam
or MulticoreParam
when creating the
BiocParallelParam
instance in BPPARAM
.
Motif hits are predicted using findMotifHits
and
multiple hits per sequence are counted as just one hit (ZOOPS mode). For
each motif, the weights of sequences that have a hit are summed separately
for foreground (sumForegroundWgtWithHits
) and background
(sumBackgroundWgtWithHits
). The total foreground
(totalWgtForeground
) and background (totalWgtBackground
)
sum of sequence weights is also calculated. If a motif has zero
sumForegroundWgtWithHits
and sumBackgroundWgtWithHits
,
then any values (p-values and enrichment) that are calculated using
these two numbers are set to NA.
Two statistical tests for the calculation of enrichment log p-value are
available: test = "fisher"
(default) to perform Fisher's exact
tests, or test = "binomial"
to perform binomial tests
(default in Homer
), using:
: fisher.test(x = tab, alternative =
"greater")
, where tab
is the contingency table with the summed
weights of sequences in foreground or background sets (rows), and with
or without a hit for a particular motif (columns).
: pbinom(q = sumForegroundWgtWithHits - 1, size =
totalWgtForeground,
prob = sumBackgroundWgtWithHits / totalWgtBackground,
lower.tail = FALSE, log.p = TRUE)
A SummarizedExperiment
object
with motifs in rows and bins in columns, containing seven assays:
: -log10 P values
: -log10 adjusted P values
: motif enrichments as Pearson residuals
: expected number of foreground sequences with motif hits
: motif enrichments as log2 ratios
: Sum of foreground sequence weights in a bin that have motif hits
: Sum of background sequence weights in a bin that have motif hits
The rowData
of the object contains annotations (name, PFMs, PWMs
and GC fraction) for the motifs, while the colData
slot contains
summary information about the bins.
seqs <- Biostrings::DNAStringSet(c("GTCAGTCGATC", "CAGTCTAGCTG", "CGATCGTCAGT", "AGCTGCAGTCT")) bins <- factor(rep(1:2, each = 2)) m <- rbind(A = c(2, 0, 0), C = c(1, 1, 0), G = c(0, 2, 0), T = c(0, 0, 3)) pwms <- TFBSTools::PWMatrixList( TFBSTools::PWMatrix(ID = "m1", profileMatrix = m), TFBSTools::PWMatrix(ID = "m2", profileMatrix = m[, 3:1]) ) calcBinnedMotifEnrR(seqs = seqs, bins = bins, pwmL = pwms, min.score = 3)
seqs <- Biostrings::DNAStringSet(c("GTCAGTCGATC", "CAGTCTAGCTG", "CGATCGTCAGT", "AGCTGCAGTCT")) bins <- factor(rep(1:2, each = 2)) m <- rbind(A = c(2, 0, 0), C = c(1, 1, 0), G = c(0, 2, 0), T = c(0, 0, 3)) pwms <- TFBSTools::PWMatrixList( TFBSTools::PWMatrix(ID = "m1", profileMatrix = m), TFBSTools::PWMatrix(ID = "m2", profileMatrix = m[, 3:1]) ) calcBinnedMotifEnrR(seqs = seqs, bins = bins, pwmL = pwms, min.score = 3)
Get motifs from a Jaspar database package (e.g.
JASPAR2020
) and write them into a HOMER-compatible motif file
as positional probability matrices.
dumpJaspar( filename, pkg = "JASPAR2020", opts = list(tax_group = "vertebrates"), pseudocount = 1, relScoreCutoff = 0.8, verbose = FALSE )
dumpJaspar( filename, pkg = "JASPAR2020", opts = list(tax_group = "vertebrates"), pseudocount = 1, relScoreCutoff = 0.8, verbose = FALSE )
filename |
Name of the output file to be created. |
pkg |
Name of the Jaspar package to use (default: |
opts |
A list with search options used in
|
pseudocount |
A numerical scalar with the pseudocount to be added to each element of the position frequency matrix extracted from Jaspar, before its conversion to a position probability matrix (default: 1.0). |
relScoreCutoff |
Currently ignored. numeric(1) in [0,1] that sets the default motif log-odds score cutof to relScoreCutoff * maximal score for each PWM (default: 0.8). |
verbose |
A logical scalar. If |
TRUE
if successful.
getMatrixSet
for details on the argument
opts
. homerToPFMatrixList
to read a file with
HOMER-formatted motifs into a PFMatrixList
.
dumpJaspar(filename = tempfile(), pkg = "JASPAR2020", opts = list(ID = c("MA0006.1")))
dumpJaspar(filename = tempfile(), pkg = "JASPAR2020", opts = list(ID = c("MA0006.1")))
Find absolute path to HOMER script file.
findHomer(homerfile = "findMotifsGenome.pl", dirs = NULL)
findHomer(homerfile = "findMotifsGenome.pl", dirs = NULL)
homerfile |
Name of the script file to search. |
dirs |
Directory names to look for |
In addition to dirs
, findHomer
will also look in the
directory provided in the environment variable MONALISA_HOMER
.
Absolute path to homerfile
, or NA
if none or several
were found.
homer_path <- findHomer()
homer_path <- findHomer()
findMotifHits
scans sequences (either provided
as a file, an R object or genomic coordinates) for matches to
positional weight matrices (provided as a file or as R objects)
findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'character,character' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'character,DNAString' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'character,DNAStringSet' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrix,character' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrix,DNAString' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrix,DNAStringSet' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrixList,character' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrixList,DNAString' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrixList,DNAStringSet' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrix,GRanges' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrixList,GRanges' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL )
findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'character,character' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'character,DNAString' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'character,DNAStringSet' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrix,character' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrix,DNAString' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrix,DNAStringSet' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrixList,character' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrixList,DNAString' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrixList,DNAStringSet' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrix,GRanges' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL ) ## S4 method for signature 'PWMatrixList,GRanges' findMotifHits( query, subject, min.score, method = c("matchPWM", "homer2"), homerfile = findHomer("homer2"), BPPARAM = SerialParam(), genome = NULL )
query |
The motifs to search for, either a
|
subject |
The sequences to be searched, either a
|
min.score |
The minimum score for counting a match. Can be given as a character string containing a percentage (e.g. "85 highest possible score or as a single number. |
method |
The internal method to use for motif searching. One of
Please note that the two methods might give slightly different results (see details). |
homerfile |
Path and file name of the |
BPPARAM |
An optional |
genome |
|
The implemented methods (matchPWM
and homer2
) are
there for convenience (method="matchPWM"
calls
Biostrings::matchPWM
internally in an optimized fashion, and
method = "homer2"
calls the command line tool from Homer and
therefore requires an installation of Homer).
In general, running findMotifHits
with the same parameters using
any of the methods generates identical results. Some minor differences
could occur that result from rounding errors during the necessary
conversion of PWMs (log2-odd scores) to the probability matrices needed
by Homer, and the conversion of scores from and to the natural log scale
used by Homer. These conversions are implemented transparently for the
user, so that the arguments of findMotifHits
do not have to be
adjusted (e.g. the PWMs should always contain log2-odd scores, and
min.score
is always on the log2 scale).
If there are bases with frequencies of less than 0.001 in a motif, Homer
will set them to 0.001 and adjust the other frequencies at that motif
position accordingly so that they sum to 1.0. This may differ from the
adjustment used when scanning a PWM with matchPWM
(e.g. the
pseudocounts
argument in the toPWM
function), and thus can give rise to differences in reported motif hits
and hit scores (typically only low-scoring hits).
A GRanges
object with the matches to query
in
subject
.
seqs <- Biostrings::DNAStringSet(c(s1 = "GTCAGTCGATC", s2 = "CAGTCTAGCTG", s3 = "CGATCGTCAGT", s4 = "AGCTGCAGTCT")) m <- rbind(A = c(2, 0, 0), C = c(1, 1, 0), G = c(0, 2, 0), T = c(0, 0, 3)) pwms <- TFBSTools::PWMatrixList( TFBSTools::PWMatrix(ID = "m1", profileMatrix = m), TFBSTools::PWMatrix(ID = "m2", profileMatrix = m[, 3:1]) ) findMotifHits(pwms, seqs, min.score = 7)
seqs <- Biostrings::DNAStringSet(c(s1 = "GTCAGTCGATC", s2 = "CAGTCTAGCTG", s3 = "CGATCGTCAGT", s4 = "AGCTGCAGTCT")) m <- rbind(A = c(2, 0, 0), C = c(1, 1, 0), G = c(0, 2, 0), T = c(0, 0, 3)) pwms <- TFBSTools::PWMatrixList( TFBSTools::PWMatrix(ID = "m1", profileMatrix = m), TFBSTools::PWMatrix(ID = "m2", profileMatrix = m[, 3:1]) ) findMotifHits(pwms, seqs, min.score = 7)
Get colors for elements according to their bin.
Colors are assigned to bins forming a gradient from col1
to col2
in the order of levels{b}
. col0
is assigned
to the neutral bin (attribute ""
) if available.
getColsByBin( b, col1 = c("#003C30", "#01665E", "#35978F", "#80CDC1", "#C7EAE5"), col2 = c("#F6E8C3", "#DFC27D", "#BF812D", "#8C510A", "#543005"), col0 = "#F5F5F5" )
getColsByBin( b, col1 = c("#003C30", "#01665E", "#35978F", "#80CDC1", "#C7EAE5"), col2 = c("#F6E8C3", "#DFC27D", "#BF812D", "#8C510A", "#543005"), col0 = "#F5F5F5" )
b |
A factor that groups elements into bins (typically the output of
|
col1 |
First color. |
col2 |
Second color. |
col0 |
Neutral color. |
A character vector with colors for the elements in b
.
bin
.
set.seed(1) x <- rnorm(100) b <- bin(x, "equalN", nElements = 10) cols <- getColsByBin(b)
set.seed(1) x <- rnorm(100) b <- bin(x, "equalN", nElements = 10) cols <- getColsByBin(b)
Given a set of sequences, calculate observed and expected k-mer
frequencies. Expected frequencies are based on a Markov model of order
MMorder
.
getKmerFreq( seqs, kmerLen = 5, MMorder = 1, pseudocount = 1, zoops = TRUE, strata = rep(1L, length(seqs)), p.adjust.method = "BH", includeRevComp = TRUE )
getKmerFreq( seqs, kmerLen = 5, MMorder = 1, pseudocount = 1, zoops = TRUE, strata = rep(1L, length(seqs)), p.adjust.method = "BH", includeRevComp = TRUE )
seqs |
Set of sequences, either a |
kmerLen |
A |
MMorder |
A |
pseudocount |
A |
zoops |
A |
strata |
A |
p.adjust.method |
A character scalar selecting the p value adjustment
method (used in |
includeRevComp |
A |
A list
with observed and expected k-mer frequencies
(freq.obs
and freq.exp
, respectively), and enrichment
statistics for each k-mer.
res <- getKmerFreq(seqs = c("AAAAATT", "AAATTTT"), kmerLen = 3) names(res) head(res$freq.obs) head(res$freq.exp)
res <- getKmerFreq(seqs = c("AAAAATT", "AAATTTT"), kmerLen = 3) names(res) head(res$freq.obs) head(res$freq.exp)
Get and set the zero bin manually
getZeroBin(bins) setZeroBin(bins, zeroBin)
getZeroBin(bins) setZeroBin(bins, zeroBin)
bins |
Factor, typically the return value of
|
zeroBin |
Numeric or character scalar indicating the level to use as the zero bin, or NA. |
For getZeroBin
, the index of the level representing the zero bin.
For setZeroBin
, a modified factor with the zero bin set to the
provided value.
set.seed(1) x <- rnorm(100) bins <- bin(x, "equalN", nElements = 10, minAbsX = 0.5) getZeroBin(bins) bins <- setZeroBin(bins, 2)
set.seed(1) x <- rnorm(100) bins <- bin(x, "equalN", nElements = 10, minAbsX = 0.5) getZeroBin(bins) bins <- setZeroBin(bins, 2)
Read motifs from a file in HOMER format and create a PFMatrixList from them.
homerToPFMatrixList(filename, n = 100L)
homerToPFMatrixList(filename, n = 100L)
filename |
Name of the input file with HOMER-formatted motifs. |
n |
The number of observations (multiplied with base frequencies to create the number of observed bases at each position). |
A PFMatrixList
with motifs from the file.
dumpJaspar
for writing motifs from a Jaspar database
package into a file in HOMER format.
library(JASPAR2020) optsL <- list(ID = c("MA0006.1")) pfm1 <- TFBSTools::getMatrixSet(JASPAR2020, opts = optsL) TFBSTools::Matrix(pfm1) tmpfn <- tempfile() dumpJaspar(filename = tmpfn, pkg = "JASPAR2020", opts = optsL) pfm2 <- homerToPFMatrixList(tmpfn) TFBSTools::Matrix(pfm2) unlink(tmpfn)
library(JASPAR2020) optsL <- list(ID = c("MA0006.1")) pfm1 <- TFBSTools::getMatrixSet(JASPAR2020, opts = optsL) TFBSTools::Matrix(pfm1) tmpfn <- tempfile() dumpJaspar(filename = tmpfn, pkg = "JASPAR2020", opts = optsL) pfm2 <- homerToPFMatrixList(tmpfn) TFBSTools::Matrix(pfm2) unlink(tmpfn)
For each motif, calculate it's similarity to all k-mers of
length kmerLen
, defined as the maximal probability of observing the
k-mer given the base frequencies of the motif (the maximum is taken over
for all possible ungapped alignments between motif and k-mer). If necessary
matrices are padded on the sides with background base frequencies (assuming
all bases to have a frequency of 0.25).
motifKmerSimilarity( x, kmerLen = 5, kmers = NULL, includeRevComp = FALSE, BPPARAM = SerialParam(), verbose = FALSE )
motifKmerSimilarity( x, kmerLen = 5, kmers = NULL, includeRevComp = FALSE, BPPARAM = SerialParam(), verbose = FALSE )
x |
Either a |
kmerLen |
A |
kmers |
Either a character vector of k-mers for which to calculate
the similarity to each motif, or |
includeRevComp |
A |
BPPARAM |
An optional |
verbose |
A logical scalar. If |
A matrix of probabilties for each motif - k-mer pair.
bplapply
used for parallelization.
m <- rbind(A = c(12, 0, 0), C = c( 3, 2, 0), G = c( 0, 14, 0), T = c( 0, 0, 15)) pfms <- TFBSTools::PFMatrixList( TFBSTools::PFMatrix(name = "m1", profileMatrix = m), TFBSTools::PFMatrix(name = "m2", profileMatrix = m[, 3:1]) ) motifKmerSimilarity(pfms, kmerLen = 3)[, c("AGT", "TGA")]
m <- rbind(A = c(12, 0, 0), C = c( 3, 2, 0), G = c( 0, 14, 0), T = c( 0, 0, 15)) pfms <- TFBSTools::PFMatrixList( TFBSTools::PFMatrix(name = "m1", profileMatrix = m), TFBSTools::PFMatrix(name = "m2", profileMatrix = m[, 3:1]) ) motifKmerSimilarity(pfms, kmerLen = 3)[, c("AGT", "TGA")]
For each pair of motifs, calculate the similarity defined as the maximal Pearson's correlation coefficient between base frequencies over all possible shifts (relative positions of the two matrices with at least one overlapping position). If necessary matrices are padded on the sides with background base frequencies (assuming all bases to have a frequency of 0.25) to enable comparison of all positions in both matrices.
motifSimilarity( x, y = NULL, method = c("R", "HOMER"), homerfile = findHomer("compareMotifs.pl"), homerOutfile = NULL, BPPARAM = SerialParam(), verbose = FALSE )
motifSimilarity( x, y = NULL, method = c("R", "HOMER"), homerfile = findHomer("compareMotifs.pl"), homerOutfile = NULL, BPPARAM = SerialParam(), verbose = FALSE )
x |
Either a |
y |
Either a |
method |
A character scalar specifying the method for similarity
calculations. Either |
homerfile |
Path to the HOMER script |
homerOutfile |
A character scalar giving the file to save the similarity
scores (only for |
BPPARAM |
An optional |
verbose |
A logical scalar. If |
A matrix of Pearson's correlation coefficients for each pair of motifs.
bplapply
used for parallelization for
method = "R"
,
documentation of HOMER's compareMotifs.pl
for details on
method = "HOMER"
.
m <- rbind(A = c(12, 0, 0), C = c( 3, 2, 0), G = c( 0, 14, 0), T = c( 0, 0, 15)) pfms <- TFBSTools::PFMatrixList( TFBSTools::PFMatrix(name = "m1", profileMatrix = m), TFBSTools::PFMatrix(name = "m2", profileMatrix = m + 10), TFBSTools::PFMatrix(name = "m3", profileMatrix = m[, 3:1]) ) motifSimilarity(pfms)
m <- rbind(A = c(12, 0, 0), C = c( 3, 2, 0), G = c( 0, 14, 0), T = c( 0, 0, 15)) pfms <- TFBSTools::PFMatrixList( TFBSTools::PFMatrix(name = "m1", profileMatrix = m), TFBSTools::PFMatrix(name = "m2", profileMatrix = m + 10), TFBSTools::PFMatrix(name = "m3", profileMatrix = m[, 3:1]) ) motifSimilarity(pfms)
Parse HOMER output files into R data structures.
parseHomerOutput(infiles, pseudocount.log2enr = 8, p.adjust.method = "BH")
parseHomerOutput(infiles, pseudocount.log2enr = 8, p.adjust.method = "BH")
infiles |
HOMER output files to be parsed. |
pseudocount.log2enr |
A numerical scalar with the pseudocount to add to foreground and background counts when calculating log2 motif enrichments |
p.adjust.method |
A character scalar selecting the p value adjustment
method (used in |
A list of nine components (negLog10P
, negLog10Padj
,
pearsonResid
, expForegroundWgtWithHits
, log2enr
,
sumForegroundWgtWithHits
and sumBackgroundWgtWithHits
),
seven containing each a motif (rows) by bin (columns) matrix with raw
-log10 P values, -log10 adjusted P values, the expected number of
foreground sequences with hits, the observed number of foreground and
background sequences with hits,
and motif enrichments as Pearson residuals (pearsonResid
) and as
log2 ratios (log2enr
), and two containing the total foreground
and background weight (totalWgtForeground
,
totalWgtBackground
).
outfile <- system.file("extdata", "homer_output.txt.gz", package = "monaLisa") res <- parseHomerOutput(infiles = c(bin1 = outfile)) head(res$negLog10P)
outfile <- system.file("extdata", "homer_output.txt.gz", package = "monaLisa") res <- parseHomerOutput(infiles = c(bin1 = outfile)) head(res$negLog10P)
Plot the density of binned elements with binning information.
plotBinDensity( x, b, xlab = deparse(substitute(x, env = as.environment(-1))), ylab = "Density", main = "", legend = "topright", legend.cex = 1, ... )
plotBinDensity( x, b, xlab = deparse(substitute(x, env = as.environment(-1))), ylab = "Density", main = "", legend = "topright", legend.cex = 1, ... )
x |
A numerical vector with the values used for binning. |
b |
A factor that groups elements of |
xlab |
Label for x-axis. |
ylab |
Label for y-axis. |
main |
Main title. |
legend |
If not |
legend.cex |
A scalar that controls the text size in the legend relative
to the current |
... |
Further arguments passed to |
Invisibly the return value of density(x)
that generated the
plot.
set.seed(1) x <- rnorm(100) b <- bin(x, "equalN", nElements = 10) plotBinDensity(x, b)
set.seed(1) x <- rnorm(100) b <- bin(x, "equalN", nElements = 10) plotBinDensity(x, b)
Plot various diagnostics of binned sequences. Three plot types are available:
length
plots the distribution of sequence lengths within each bin.
GCfrac
plots the distribution of GC fractions within each bin.
dinucfreq
plots a heatmap of the relative frequency of each dinucleotide, averaged across the sequences within each bin. The values are centered for each dinucleotide to better highlight differences between the bins. The average relative frequency of each dinucleotide (across the bins) is indicated as well.
plotBinDiagnostics( seqs, bins, aspect = c("length", "GCfrac", "dinucfreq"), ... )
plotBinDiagnostics( seqs, bins, aspect = c("length", "GCfrac", "dinucfreq"), ... )
seqs |
DNAStringSet object with sequences. |
bins |
factor of the same length and order as seqs, indicating the bin
for each sequence. Typically the return value of |
aspect |
The diagnostic to plot. Should be one of |
... |
Additional argument passed to |
For aspect="length"
or "GCfrac"
, returns (invisibly)
the output of vioplot()
, which generates the plot. For
aspect="dinucfreq"
, returns (invisibly) the ComplexHeatmap
object.
seqs <- Biostrings::DNAStringSet( vapply(1:100, function(i) paste(sample(c("A", "C", "G", "T"), 10, replace = TRUE), collapse = ""), "") ) bins <- factor(rep(1:2, each = 50)) plotBinDiagnostics(seqs, bins, aspect = "GCfrac") plotBinDiagnostics(seqs, bins, aspect = "dinucfreq")
seqs <- Biostrings::DNAStringSet( vapply(1:100, function(i) paste(sample(c("A", "C", "G", "T"), 10, replace = TRUE), collapse = ""), "") ) bins <- factor(rep(1:2, each = 50)) plotBinDiagnostics(seqs, bins, aspect = "GCfrac") plotBinDiagnostics(seqs, bins, aspect = "dinucfreq")
Plot a histogram of binned elements with binning information.
plotBinHist( x, b, breaks = 10 * nlevels(b), xlab = deparse(substitute(x, env = as.environment(-1))), ylab = "Frequency", main = "", legend = "topright", legend.cex = 1, ... )
plotBinHist( x, b, breaks = 10 * nlevels(b), xlab = deparse(substitute(x, env = as.environment(-1))), ylab = "Frequency", main = "", legend = "topright", legend.cex = 1, ... )
x |
A numerical vector with the values used for binning. |
b |
A factor that groups elements of |
breaks |
Controls the histogram breaks (passed to |
xlab |
Label for x-axis. |
ylab |
Label for y-axis. |
main |
Main title. |
legend |
If not |
legend.cex |
A scalar that controls the text size in the legend relative
to the current |
... |
Further arguments passed to |
Invisibly the return value of hist(...)
that generated the
plot.
set.seed(1) x <- rnorm(100) b <- bin(x, "equalN", nElements = 10) plotBinHist(x, b)
set.seed(1) x <- rnorm(100) b <- bin(x, "equalN", nElements = 10) plotBinHist(x, b)
Plot a scatter (xy-plot) of binned elements with binning information.
plotBinScatter( x, y, b, cols = getColsByBin(b), xlab = deparse(substitute(x, env = as.environment(-1))), ylab = deparse(substitute(y, env = as.environment(-1))), main = "", legend = "topright", legend.cex = 1, ... )
plotBinScatter( x, y, b, cols = getColsByBin(b), xlab = deparse(substitute(x, env = as.environment(-1))), ylab = deparse(substitute(y, env = as.environment(-1))), main = "", legend = "topright", legend.cex = 1, ... )
x |
A numerical vector with x values. |
y |
A numerical vector with y values (the values used for binning). |
b |
A factor that groups elements of |
cols |
A color vector (will be computed based on |
xlab |
Label for x-axis. |
ylab |
Label for y-axis. |
main |
Main title. |
legend |
If not |
legend.cex |
A scalar that controls the text size in the legend relative
to the current |
... |
Further arguments passed to |
TRUE
(invisibly).
set.seed(1) x <- rnorm(100) y <- rnorm(100) b <- bin(y, "equalN", nElements = 10) plotBinScatter(x, y, b)
set.seed(1) x <- rnorm(100) y <- rnorm(100) b <- bin(y, "equalN", nElements = 10) plotBinScatter(x, y, b)
Plot motif enrichments (e.g. significance or magnitude) as a heatmap.
plotMotifHeatmaps( x, which.plots = c("negLog10P", "pearsonResid", "negLog10Padj", "log2enr"), width = 4, col.enr = c("#053061", "#2166AC", "#4393C3", "#92C5DE", "#D1E5F0", "#F7F7F7", "#FDDBC7", "#F4A582", "#D6604D", "#B2182B", "#67001F"), col.sig = c("#F0F0F0", "#D9D9D9", "#BDBDBD", "#969696", "#737373", "#525252", "#252525", "#000000"), col.gc = c("#F7FCF5", "#E5F5E0", "#C7E9C0", "#A1D99B", "#74C476", "#41AB5D", "#238B45", "#006D2C", "#00441B"), maxEnr = NULL, maxSig = NULL, highlight = NULL, cluster = FALSE, show_dendrogram = FALSE, show_motif_GC = FALSE, show_seqlogo = FALSE, show_bin_legend = FALSE, width.seqlogo = 1.5, use_raster = FALSE, na_col = "white", doPlot = TRUE, ... )
plotMotifHeatmaps( x, which.plots = c("negLog10P", "pearsonResid", "negLog10Padj", "log2enr"), width = 4, col.enr = c("#053061", "#2166AC", "#4393C3", "#92C5DE", "#D1E5F0", "#F7F7F7", "#FDDBC7", "#F4A582", "#D6604D", "#B2182B", "#67001F"), col.sig = c("#F0F0F0", "#D9D9D9", "#BDBDBD", "#969696", "#737373", "#525252", "#252525", "#000000"), col.gc = c("#F7FCF5", "#E5F5E0", "#C7E9C0", "#A1D99B", "#74C476", "#41AB5D", "#238B45", "#006D2C", "#00441B"), maxEnr = NULL, maxSig = NULL, highlight = NULL, cluster = FALSE, show_dendrogram = FALSE, show_motif_GC = FALSE, show_seqlogo = FALSE, show_bin_legend = FALSE, width.seqlogo = 1.5, use_raster = FALSE, na_col = "white", doPlot = TRUE, ... )
x |
A |
which.plots |
Selects which heatmaps to plot (one or several from
|
width |
The width (in inches) of each individual heatmap, without legend. |
col.enr |
Colors used for enrichment heatmap ("pearsonResid" and "log2enr"). |
col.sig |
Colors used for significance hetmaps ("negLog10P" and "negLog10Padj"). |
col.gc |
Colors used for motif GC content (for
|
maxEnr |
Cap color mapping at enrichment = |
maxSig |
Cap color mapping at -log10 P value or -log10 FDR =
|
highlight |
A logical vector indicating motifs to be highlighted. |
cluster |
If |
show_dendrogram |
If |
show_motif_GC |
If |
show_seqlogo |
If |
show_bin_legend |
If |
width.seqlogo |
The width (in inches) for the longest sequence logo (shorter logos are drawn to scale). |
use_raster |
|
na_col |
"white" (default). Passed to |
doPlot |
If |
... |
Further arguments passed to |
The heatmaps are created using the ComplexHeatmap package and plotted side-by-side.
Each heatmap will be width
inches wide, so the total plot needs a
graphics device with a width of at least
length(which.plots) * width
plus the space used for motif names
and legend. The height will be auto-adjusted to the graphics device.
A list of ComplexHeatmap::Heatmap
objects.
Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 2016.
se <- readRDS(system.file("extdata", "results.binned_motif_enrichment_LMRs.rds", package = "monaLisa")) i <- which(SummarizedExperiment::assay(se, "negLog10Padj")[, 8] > 4) plotMotifHeatmaps(se[i, ], which.plots = "pearsonResid", width = 2, show_seqlogo = TRUE)
se <- readRDS(system.file("extdata", "results.binned_motif_enrichment_LMRs.rds", package = "monaLisa")) i <- which(SummarizedExperiment::assay(se, "negLog10Padj")[, 8] > 4) plotMotifHeatmaps(se[i, ], which.plots = "pearsonResid", width = 2, show_seqlogo = TRUE)
This function plots the selection probabilities of predictors (for example the selected motifs), optionally multiplied with either +1 or -1 to give a sense of both the strength and the directionality of the associated effects. The directionality is estimated from the sign of the correlation coefficient between each predictor and the response vector.
plotSelectionProb( se, directional = TRUE, selProbMin = metadata(se)$stabsel.params.cutoff, selProbMinPlot = 0.4, showSelProbMin = TRUE, col = c("cadetblue", "grey", "red"), method = c("pearson", "kendall", "spearman"), ylimext = 0.25, legend = "topright", legend.cex = 1, ... )
plotSelectionProb( se, directional = TRUE, selProbMin = metadata(se)$stabsel.params.cutoff, selProbMinPlot = 0.4, showSelProbMin = TRUE, col = c("cadetblue", "grey", "red"), method = c("pearson", "kendall", "spearman"), ylimext = 0.25, legend = "topright", legend.cex = 1, ... )
se |
The |
directional |
A logical scalar. If |
selProbMin |
A numerical scalar in [0,1]. Predictors with a selection
probability greater than |
selProbMinPlot |
A numerical scalar in [0,1] less than
|
showSelProbMin |
A logical scalar. If |
col |
A color vector giving the three colors used for predictors with
selection probability greater than |
method |
A character scalar with the correlation method to use in the
calculation of predictor-response marginal correlations. One of "pearson",
"kendall" or "spearman" (see |
ylimext |
A numeric scalar defining how much the y axis limits should be expanded beyond the plotted probabilities to allow for space for the bar labels. |
legend |
the position of the legend in the bar plot (will
be passed to |
legend.cex |
A scalar that controls the text size in the legend relative
to the current |
... |
additional parameters passed to |
This function creates a bar plot using the
barplot
function.
Each bar corresponds to a predictor (motif) and the colors correspond to
whether or not it was selected. The y-axis shows the selection
probabilities (directional=FALSE
) or selection probabilities with
the sign of the marginal correlation to the response
(directional=TRUE
).
a matrix
with one column, containing the coordinates of the
bar midpoints, or NULL
if no bar plot is drawn.
## create data set Y <- rnorm(n = 500, mean = 2, sd = 1) X <- matrix(data = NA, nrow = length(Y), ncol = 50) for (i in seq_len(ncol(X))) { X[ ,i] <- runif(n = 500, min = 0, max = 3) } s_cols <- sample(x = seq_len(ncol(X)), size = 10, replace = FALSE) for (i in seq_along(s_cols)) { X[ ,s_cols[i]] <- X[ ,s_cols[i]] + Y } ## reproducible randLassoStabSel() with 1 core set.seed(123) ss <- randLassoStabSel(x = X, y = Y) plotSelectionProb(ss)
## create data set Y <- rnorm(n = 500, mean = 2, sd = 1) X <- matrix(data = NA, nrow = length(Y), ncol = 50) for (i in seq_len(ncol(X))) { X[ ,i] <- runif(n = 500, min = 0, max = 3) } s_cols <- sample(x = seq_len(ncol(X)), size = 10, replace = FALSE) for (i in seq_along(s_cols)) { X[ ,s_cols[i]] <- X[ ,s_cols[i]] + Y } ## reproducible randLassoStabSel() with 1 core set.seed(123) ss <- randLassoStabSel(x = X, y = Y) plotSelectionProb(ss)
Plot the stability paths of each variable (predictor), showing the selection probability as a function of the regularization step.
plotStabilityPaths( se, selProbMin = metadata(se)$stabsel.params.cutoff, col = "cadetblue", lwd = 1, lty = 1, ylim = c(0, 1.1), ... )
plotStabilityPaths( se, selProbMin = metadata(se)$stabsel.params.cutoff, col = "cadetblue", lwd = 1, lty = 1, ylim = c(0, 1.1), ... )
se |
the |
selProbMin |
A numerical scalar in [0,1]. Predictors with a selection
probability greater than |
col |
color of the selected predictors. |
lwd |
line width (default = 1). |
lty |
line type (default = 1). |
ylim |
limits for y-axis (default = c(0,1.1)). |
... |
additional parameters to pass on to |
TRUE
(invisibly).
## create data set Y <- rnorm(n = 500, mean = 2, sd = 1) X <- matrix(data = NA, nrow = length(Y), ncol = 50) for (i in seq_len(ncol(X))) { X[ ,i] <- runif(n = 500, min = 0, max = 3) } s_cols <- sample(x = seq_len(ncol(X)), size = 10, replace = FALSE) for (i in seq_along(s_cols)) { X[ ,s_cols[i]] <- X[ ,s_cols[i]] + Y } ## reproducible randLassoStabSel() with 1 core set.seed(123) ss <- randLassoStabSel(x = X, y = Y) plotStabilityPaths(ss)
## create data set Y <- rnorm(n = 500, mean = 2, sd = 1) X <- matrix(data = NA, nrow = length(Y), ncol = 50) for (i in seq_len(ncol(X))) { X[ ,i] <- runif(n = 500, min = 0, max = 3) } s_cols <- sample(x = seq_len(ncol(X)), size = 10, replace = FALSE) for (i in seq_along(s_cols)) { X[ ,s_cols[i]] <- X[ ,s_cols[i]] + Y } ## reproducible randLassoStabSel() with 1 core set.seed(123) ss <- randLassoStabSel(x = X, y = Y) plotStabilityPaths(ss)
For each bin, write genomic coordinates for foreground and background regions into files for HOMER motif enrichment analysis.
prepareHomer( gr, b, genomedir, outdir, motifFile, homerfile = findHomer(), regionsize = "given", Ncpu = 2L, verbose = FALSE )
prepareHomer( gr, b, genomedir, outdir, motifFile, homerfile = findHomer(), regionsize = "given", Ncpu = 2L, verbose = FALSE )
gr |
A |
b |
A vector of the same length as |
genomedir |
Directory containing sequence files in Fasta format (one per chromosome). |
outdir |
A path specifying the folder into which the output files (two
files per unique value of |
motifFile |
A file with HOMER formatted PWMs to be used in the enrichment analysis. |
homerfile |
Path and file name of the |
regionsize |
The peak size to use in HOMER ( |
Ncpu |
Number of parallel threads that HOMER can use. |
verbose |
A logical scalar. If |
For each bin (unique value of b
) this functions creates two
files in outdir
(outdir/bin_N_foreground.tab
and
outdir/bin_N_background.tab
, where N
is the number of the
bin and foreground/background correspond to the ranges that are/are not
within the current bin). The files are in the HOMER peak file format
(see http://homer.ucsd.edu/homer/ngs/peakMotifs.html for details).
In addition, a shell script file is created containing the shell commands to run the HOMER motif enrichment analysis.
The path and name of the script file to run the HOMER motif enrichment analysis.
# prepare genome directory (here: one dummy chromosome) genomedir <- tempfile() dir.create(genomedir) writeLines(c(">chr1", "ATGCATGCATCGATCGATCGATCGTACGTA"), file.path(genomedir, "chr1.fa")) # prepare motif file, regions and bins motiffile <- tempfile() dumpJaspar(filename = motiffile, pkg = "JASPAR2020", opts = list(ID = c("MA0006.1"))) gr <- GenomicRanges::GRanges("chr1", IRanges::IRanges(1:4, width = 4)) b <- bin(1:4, nElements = 2) # create dummy file (should point to local Homer installation) homerfile <- file.path(tempdir(), "findMotifsGenome.pl") writeLines("dummy", homerfile) # run prepareHomer outdir <- tempfile() prepareHomer(gr = gr, b = b, genomedir = genomedir, outdir = outdir, motifFile = motiffile, homerfile = homerfile, verbose = TRUE) list.files(outdir) # clean up example unlink(c(genomedir, motiffile, homerfile, outdir))
# prepare genome directory (here: one dummy chromosome) genomedir <- tempfile() dir.create(genomedir) writeLines(c(">chr1", "ATGCATGCATCGATCGATCGATCGTACGTA"), file.path(genomedir, "chr1.fa")) # prepare motif file, regions and bins motiffile <- tempfile() dumpJaspar(filename = motiffile, pkg = "JASPAR2020", opts = list(ID = c("MA0006.1"))) gr <- GenomicRanges::GRanges("chr1", IRanges::IRanges(1:4, width = 4)) b <- bin(1:4, nElements = 2) # create dummy file (should point to local Homer installation) homerfile <- file.path(tempdir(), "findMotifsGenome.pl") writeLines("dummy", homerfile) # run prepareHomer outdir <- tempfile() prepareHomer(gr = gr, b = b, genomedir = genomedir, outdir = outdir, motifFile = motiffile, homerfile = homerfile, verbose = TRUE) list.files(outdir) # clean up example unlink(c(genomedir, motiffile, homerfile, outdir))
This function runs randomized lasso stability selection as
presented by Meinshausen and Bühlmann (2010) and with the improved
error bounds introduced by Shah and Samworth (2013). The function
uses the stabsel
function from the stabs
package, but implements the randomized lasso version.
randLassoStabSel( x, y, weakness = 0.8, cutoff = 0.8, PFER = 2, mc.cores = 1L, ... )
randLassoStabSel( x, y, weakness = 0.8, cutoff = 0.8, PFER = 2, mc.cores = 1L, ... )
x |
the predictor matrix. |
y |
the response vector. |
weakness |
value between 0 and 1 (default = 0.8). It affects how strict the method will be in selecting predictors. The closer it is to 0, the more stringent the selection. A weakness value of 1 is identical to performing lasso stability selection (not the randomized version). |
cutoff |
value between 0 and 1 (default = 0.8) which is the cutoff for the selection probability. Any variable with a selection probability that is higher than the set cutoff will be selected. |
PFER |
integer (default = 2) representing the absolute number of false positives that we allow for in the final list of selected variables. For details see Meinshausen and Bühlmann (2010). |
mc.cores |
integer (default = 1) specifying the number of cores to
use in |
... |
additional parameters that can be passed on to
|
Randomized lasso stability selection runs a randomized lasso
regression several times on subsamples of the response variable and
predictor matrix. N/2 elements from the response variable are randomly
chosen in each regression, where N is the length of the vector. The
corresponding section of the predictor matrix is also chosen, and the
internal .glmnetRandomizedLasso
function is applied.
Stability selection results in selection probabilities for each
predictor. The probability of a specific predictor is the number of
times it was selected divided by the total number of subsamples that
were done (total number of times the regression was performed).
We made use of the stabs
package that implements lasso stability
selection, and adapted it to run randomized lasso stability selection.
A SummarizedExperiment
object where the rows are the
observations and the columns the predictors (same dimnames as the
predictor matrix x
).
It contains:
:
: the predictor matrix.
: a DataFrame with columns:
: the response vector.
: a DataFrame with columns:
: the final selection probabilities for the predictors (from the last regularization step).
: logical indicating the predictors that made the selection with the specified cutoff.
: the normalized area under the seletion curve (mean of selection probabilities over regulatization steps).
i
': columns containing the selection probabilities for regularization step i.
: a list of output returned from
stabsel
and randLassoStabSel
:
: probability cutoff set for selection
of predictors (see stabsel
).
: elements with maximal selection
probability greater cutoff
(see stabsel
).
: maximum of selection probabilities
(see stabsel
).
: average number of selected variables
used (see stabsel
).
: (realized) upper bound for the
per-family error rate (see stabsel
).
: specified upper bound for
the per-family error rate (see stabsel
).
: the number of effects subject to
selection (see stabsel
).
: the number of subsamples (see
stabsel
).
: the sampling type used for
stability selection (see stabsel
).
: the assumptions made on the
selection probabilities (see stabsel
).
: stabsel
the call.
: the weakness parameter in the randomized lasso stability selection.
N. Meinshausen and P. Bühlmann (2010), Stability Selection,
Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 72, 417–73.
R.D. Shah and R.J. Samworth (2013), Variable Selection with Error
Control: Another Look at Stability Selection,
Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 75, 55–80.
B. Hofner, L. Boccuto, and M. Göker (2015), Controlling False
Discoveries in High-Dimensional Situations: Boosting with Stability
Selection, BMC Bioinformatics, 16 144.
## create data set Y <- rnorm(n = 500, mean = 2, sd = 1) X <- matrix(data = NA, nrow = length(Y), ncol = 50) for (i in seq_len(ncol(X))) { X[ ,i] <- runif(n = 500, min = 0, max = 3) } s_cols <- sample(x = seq_len(ncol(X)), size = 10, replace = FALSE) for (i in seq_along(s_cols)) { X[ ,s_cols[i]] <- X[ ,s_cols[i]] + Y } ## reproducible randLassoStabSel() with 1 core set.seed(123) ss <- randLassoStabSel(x = X, y = Y) ## reproducible randLassoStabSel() in parallel mode ## (only works on non-windows machines) if (.Platform$OS.type == "unix") { RNGkind("L'Ecuyer-CMRG") set.seed(123) ss <- randLassoStabSel(x = X, y = Y, mc.preschedule = TRUE, mc.set.seed = TRUE, mc.cores = 2L) }
## create data set Y <- rnorm(n = 500, mean = 2, sd = 1) X <- matrix(data = NA, nrow = length(Y), ncol = 50) for (i in seq_len(ncol(X))) { X[ ,i] <- runif(n = 500, min = 0, max = 3) } s_cols <- sample(x = seq_len(ncol(X)), size = 10, replace = FALSE) for (i in seq_along(s_cols)) { X[ ,s_cols[i]] <- X[ ,s_cols[i]] + Y } ## reproducible randLassoStabSel() with 1 core set.seed(123) ss <- randLassoStabSel(x = X, y = Y) ## reproducible randLassoStabSel() in parallel mode ## (only works on non-windows machines) if (.Platform$OS.type == "unix") { RNGkind("L'Ecuyer-CMRG") set.seed(123) ss <- randLassoStabSel(x = X, y = Y, mc.preschedule = TRUE, mc.set.seed = TRUE, mc.cores = 2L) }
Sample random regions from the mappable parts of the genome with a given fraction from CpG islands.
sampleRandomRegions(allowedRegions = NULL, N = 100L, regWidth = 200L)
sampleRandomRegions(allowedRegions = NULL, N = 100L, regWidth = 200L)
allowedRegions |
An unstranded GRanges object of the "allowed" of the genome, usually the mappable regions. |
N |
Number of regions to sample. |
regWidth |
Region width. |
In order to make the results deterministic, set the random
number seed before calling sampleRandomRegions
using
set.seed
.
A GRanges object with randomly sampled mappable regions of width
regWidth
with fractionCGI
coming from CpG islands.
regs <- GenomicRanges::GRanges( seqnames = rep(c("chr1", "chr2"), each = 2), ranges = IRanges::IRanges(start = 1:4, end = 5:8)) set.seed(123) sampleRandomRegions(regs, N = 2, regWidth = 3L)
regs <- GenomicRanges::GRanges( seqnames = rep(c("chr1", "chr2"), each = 2), ranges = IRanges::IRanges(start = 1:4, end = 5:8)) set.seed(123) sampleRandomRegions(regs, N = 2, regWidth = 3L)
Create a simple sequence logo grob (grid-graphics object) for a
transcription factor from a position frequency matrix. The logo drawing
code is a simplified version from seqLogo
and for
example can be used to embedd sequence logos within other plots.
seqLogoGrob(x, xmax = NULL, ymax = 2, xjust = c("left", "center", "right"))
seqLogoGrob(x, xmax = NULL, ymax = 2, xjust = c("left", "center", "right"))
x |
A |
xmax |
A numeric scalar with the maximal width for the logo
(in base-pairs). A value of |
ymax |
A numeric scalar with the maximal height for the logo (in bits)
A value of |
xjust |
A character scalar specifying the horizontal adjustment of the
sequence log withint the viewport; one of |
A polygon grob.
seqLogo
for the original, more flexible
version of this function.
if (require(JASPAR2020) && require(TFBSTools) && require(gridExtra)) { pfm1 <- getMatrixByID(JASPAR2020, "MA0139") pfm2 <- getMatrixByID(JASPAR2020, "MA0531") g1 <- seqLogoGrob(pfm1) g2 <- seqLogoGrob(pfm2) gridExtra::grid.arrange(g1, g2) }
if (require(JASPAR2020) && require(TFBSTools) && require(gridExtra)) { pfm1 <- getMatrixByID(JASPAR2020, "MA0139") pfm2 <- getMatrixByID(JASPAR2020, "MA0531") g1 <- seqLogoGrob(pfm1) g2 <- seqLogoGrob(pfm2) gridExtra::grid.arrange(g1, g2) }