Title: | A Package For Predicting The Disruptiveness Of Single Nucleotide Polymorphisms On Transcription Factor Binding Sites |
---|---|
Description: | We introduce motifbreakR, which allows the biologist to judge in the first place whether the sequence surrounding the polymorphism is a good match, and in the second place how much information is gained or lost in one allele of the polymorphism relative to another. MotifbreakR is both flexible and extensible over previous offerings; giving a choice of algorithms for interrogation of genomes with motifs from public sources that users can choose from; these are 1) a weighted-sum probability matrix, 2) log-probabilities, and 3) weighted by relative entropy. MotifbreakR can predict effects for novel or previously described variants in public databases, making it suitable for tasks beyond the scope of its original design. Lastly, it can be used to interrogate any genome curated within Bioconductor (currently there are 32 species, a total of 109 versions). |
Authors: | Simon Gert Coetzee [aut, cre] , Dennis J. Hazelett [aut] |
Maintainer: | Simon Gert Coetzee <[email protected]> |
License: | GPL-2 |
Version: | 2.21.0 |
Built: | 2024-11-17 03:08:22 UTC |
Source: | https://github.com/bioc/motifbreakR |
Calculate the significance of the matches for the reference and alternate alleles for the for their PWM
calculatePvalue( results, background = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25), granularity = NULL, BPPARAM = BiocParallel::SerialParam() )
calculatePvalue( results, background = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25), granularity = NULL, BPPARAM = BiocParallel::SerialParam() )
results |
The output of |
background |
Numeric Vector; the background probabilities of the nucleotides |
granularity |
Numeric Vector; the granularity to which to round the PWM,
larger values compromise full accuracy for speed of calculation. A value of
|
BPPARAM |
a BiocParallel object see |
This function is intended to be used on a selection of results produced by motifbreakR
, and
this can be (although not always) a very memory and time intensive process if the algorithm doesn't converge rapidly.
a GRanges object. The same Granges object that was input as results
, but with
Refpvalue
and Altpvalue
columns in the output modified from NA
to the p-value
calculated by TFMsc2pv
. Additionally a pvalueEffect
column that indicates "strong"
when the lower p-value (between ref and alt) is an order of magnitude or more different from
the higher p-value, otherwise weak.
H\'el\'ene Touzet and Jean-St\'ephane Varr\'e (2007) Efficient and accurate P-value computation for Position Weight Matrices. Algorithms for Molecular Biology, 2: 15.
See TFMsc2pv
from the TFMPvalue package for
information about how the p-values are calculated.
data(example.results) rs1006140 <- example.results[example.results$SNP_id %in% "rs1006140"] # low granularity for speed; 1e-6 or 1e-7 recommended for accuracy rs1006140 <- calculatePvalue(rs1006140, BPPARAM=BiocParallel::SerialParam(), granularity = 1e-4)
data(example.results) rs1006140 <- example.results[example.results$SNP_id %in% "rs1006140"] # low granularity for speed; 1e-6 or 1e-7 recommended for accuracy rs1006140 <- calculatePvalue(rs1006140, BPPARAM=BiocParallel::SerialParam(), granularity = 1e-4)
From the abstract: "Recent advances in technology have led to a dramatic increase in the number of available transcription factor ChIP-seq and ChIP-chip data sets. Understanding the motif content of these data sets is an important step in understanding the underlying mechanisms of regulation. Here we provide a systematic motif analysis for 427 human ChIP-seq data sets using motifs curated from the literature and also discovered de novo using five established motif discovery tools. We use a systematic pipeline for calculating motif enrichment in each data set, providing a principled way for choosing between motif variants found in the literature and for flagging potentially problematic data sets. Our analysis confirms the known specificity of 41 of the 56 analyzed factor groups and reveals motifs of potential cofactors. We also use cell type-specific binding to find factors active in specific conditions. The resource we provide is accessible both for browsing a small number of factors and for performing large-scale systematic analyses. We provide motif matrices, instances and enrichments in each of the ENCODE data sets. The motifs discovered here have been used in parallel studies to validate the specificity of antibodies, understand cooperativity between data sets and measure the variation of motif binding across individuals and species."
encodemotif
encodemotif
MotifDb
object of length 2064; to access metadata
use mcols(encodemotif)
Name provided by ENCODE
Same as providerName
"ENCODE-motif"
Gene symbol for the transcription factor
Entrez gene id for the transcription factor
"ENTREZ"
UNIPROT id for the transcription factor
"UNIPROT"
"Hsapiens"
NA
not available
Consensus sequence for the motif
NA
incomplete
NA
incomplete
occurs in two forms:
For motifs that were discovered in this study, the format is cellType_source-LabMetadata:MotifFinder#Location
for example H1-hESC_encode-Myers_seq_hsa_v041610.2_r1:MEME#2#Intergenic
.
For motifs that were "known" the format tends to be TF_source_sourceId
for example AP1_jaspar_MA0099.2
.
"24335146"
see Source
for more details
Load with data(encodemotif)
MotifList-class
object
Pouya Kheradpour and Manolis Kellis (2013 December 13) Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Research, doi:10.1093/nar/gkt1249
http://compbio.mit.edu/encode-motifs/
data(encodemotif) encodemotif
data(encodemotif) encodemotif
This contains example results from motifbreaker for use in examples from the help docs
example.results
example.results
GRanges
output from motifbreakR
GRanges
object. See motifbreakR
for information on it's structure.
data(example.results) example.results
data(example.results) example.results
Export motifbreakR variants to bed file
exportMBbed(results, file, name = NULL, color = "effect_size")
exportMBbed(results, file, name = NULL, color = "effect_size")
results |
The output of |
file |
Character; the file name of the destination file |
name |
Character; name for the BED track, defaults to "motifbreakR results" |
color |
Character; one of ref_sig ( |
exportMBbed
produces an output BED file, with diverging color
scale for effect_size (blue representing stronger binding in REF
, red
representing stronger binding in ALT
), or a sequential color scale
otherwise (low values as purple, high values as yellow). The score column is
either the effect_size (alleleDiff
column), the -log10(p-value)
(capped at 10), corresponding to Refpvalue
, Altpvalue
, or the
best match of the two, or the score pctRef
, pctAlt
, or the
highest match of the two. The name column is formatted
SNP_id:REF/ALT:providerId
. Additionally a color key is returned
indicating the range of values for each color output.
See exportMBtable
for the function that exports the
full motifbreakR
results as a tab or comma separated table file.
data(example.results) example.results exportMBbed(example.results, file = "mb_test_output.bed", color = "effect_size")
data(example.results) example.results exportMBbed(example.results, file = "mb_test_output.bed", color = "effect_size")
Export motifbreakR results to csv or tsv
exportMBtable(results, file, format = "tsv")
exportMBtable(results, file, format = "tsv")
results |
The output of |
file |
Character; the file name of the destination file |
format |
Character; one of tsv (tab separated values) or csv (comma separated values) |
exportMBtable
produces an output file containing the output
of the motifbreakR
function.
See exportMBbed
for the function that exports the
motifbreakR
results as a BED file, colored by selected score.
data(example.results) example.results exportMBtable(example.results, file = "mb_test_output.tsv", format = "tsv")
data(example.results) example.results exportMBtable(example.results, file = "mb_test_output.tsv", format = "tsv")
From the abstract: "Chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) has become the dominant technique for mapping transcription factor (TF) binding regions genome-wide. We performed an integrative analysis centered around 457 ChIP-seq data sets on 119 human TFs generated by the ENCODE Consortium. We identified highly enriched sequence motifs in most data sets, revealing new motifs and validating known ones. The motif sites (TF binding sites) are highly conserved evolutionarily and show distinct footprints upon DNase I digestion. We frequently detected secondary motifs in addition to the canonical motifs of the TFs, indicating tethered binding and cobinding between multiple TFs. We observed significant position and orientation preferences between many cobinding TFs. Genes specifically expressed in a cell line are often associated with a greater occurrence of nearby TF binding in that cell line. We observed cell-line-specific secondary motifs that mediate the binding of the histone deacetylase HDAC2 and the enhancer-binding protein EP300. TF binding sites are located in GC-rich, nucleosome-depleted, and DNase I sensitive regions, flanked by well-positioned nucleosomes, and many of these features show cell type specificity. The GC-richness may be beneficial for regulating TF binding because, when unoccupied by a TF, these regions are occupied by nucleosomes in vivo. We present the results of our analysis in a TF-centric web repository Factorbook (http://factorbook.org) and will continually update this repository as more ENCODE data are generated."
factorbook
factorbook
MotifDb
object of length 79; to access metadata
use mcols(factorbook)
Name listed in meme output of 'Supp TableS2.pdf' for the citation indicated below
Same as providerName
"FactorBook"
NA
these motifs don't have a direct 1 to 1 relationship with a transcription factor
NA
NA
NA
NA
"Hsapiens"
NA
Consensus sequence for the motif
NA
NA
NA
"22955990"
see Source
for more details
Load with data(factorbook)
MotifList-class
object
J Wang, J Zhuang, S Iyer, XY Lin, et al. (2012) Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Research, 22 (9), 1798-1812, doi:10.1101/gr.139105.112
data(factorbook) factorbook
data(factorbook) factorbook
Find Corresponding TF Binding From The ReMap2022 Project
findSupportingRemapPeaks(results, genome, TFClass = FALSE)
findSupportingRemapPeaks(results, genome, TFClass = FALSE)
results |
The output of |
genome |
Character; one of:
|
TFClass |
Logical; The user may optionally query an expanded
motif/transcription factor relationship encompassing the entire potential
transcription factor family as implemented by |
TFClass
argument works for objects loaded in from the
MotifDb
package. hg19
and mm39
are data from liftOver.
The ReMap catalogues (2022, 2020, 2018, 2015) are under CC BY-NC 4.0 international license, as described in ReMap.
The CC BY-NC 4.0 license correspond to the following terms: Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. NonCommercial — You may not use the material for commercial purposes. No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
the results GenomicRanges object output by motifbreakR
containing with the additional columns:
matchingBindingEvent |
The name of the transcription factor that binds
over the motif, or |
matchingCellType |
A list corresponding in length to the number of
transcription factors in |
associateTranscriptionFactors
for information about
TFClass. https://remap.univ-amu.fr/ for details about ReMap2022.
data(example.results) example.results <- findSupportingRemapPeaks(example.results, genome = "hg19", TFClass = TRUE)
data(example.results) example.results <- findSupportingRemapPeaks(example.results, genome = "hg19", TFClass = TRUE)
From the abstract: "We present the Homo sapiens comprehensive model collection (HOCOMOCO, http://autosome.ru/HOCOMOCO/, http://cbrc.kaust.edu.sa/hocomoco/) containing carefully hand-curated TFBS models constructed by integration of binding sequences obtained by both low- and high-throughput methods. To construct position weight matrices to represent these TFBS models, we used ChIPMunk software in four computational modes, including newly developed periodic positional prior mode associated with DNA helix pitch. We selected only one TFBS model per TF, unless there was a clear experimental evidence for two rather distinct TFBS models. We assigned a quality rating to each model. HOCOMOCO contains 426 systematically curated TFBS models for 401 human TFs, where 172 models are based on more than one data source."
hocomoco
hocomoco
MotifDb
object of length 426; to access metadata
use mcols(hocomoco)
Name provided by HOCOMOCO
ID provided by HOCOMOCO including experiment type
"HOCOMOCO"
Gene symbol for the transcription factor
Entrez gene id for the transcription factor
"ENTREZ"
UNIPROT id for the transcription factor
"UNIPROT"
"Hsapiens"
Number of sequences evaluated for producing the PWM
Consensus sequence for the motif
NA
incomplete
NA
incomplete
from http://autosome.ru/HOCOMOCO/Details.php#200 quoted here:
"TFBS model identification modes
To construct TFBS models ChIPMunk was run four times: two times (f1) and (f2) with uniform model positional prior and two times (si) and (do) with informative model positional prior.
The min-to-max (f1) model length estimation mode was used with the min length
of 7 bp and increasing it by 1 bp until the default max length of 25 bp was
reached following the optimal length selection procedure as in Kulakovskiy
and Makeev, Biophysics, 2009. For max-to-min (f2) model length estimation
mode we started from 25 bp and searched for the best alignment decreasing the
length by 1 bp until the minimal length of 7 bp. We also used the single (si)
and double box (do) model positional priors in order to simulate DNA helix
turn. For a single box, the positional weights are to be distributed as
cos2( n / T), where T=10.5 is the DNA helix pitch, n is the coordinate
within the alignment, and the center of the alignment of the length L is at
n=0. During the internal cycle of PWM optimization the PWM column scores are
multiplied by prior values so the columns closer to the center of the
alignment (n=0) receive no score penalty while the columns around (n =
5,6,-5,-6) contribute much less to the score of the PWM under optimization.
The single box model prior was used along with the min-to-max length
estimation mode (si). We also used the double box model prior with a shape
prior equal to sin2(
n / T), which was used to search for possibly longer
double box models in the max-to-min length estimation mode (do).
Model quality assignment
The resulting models were rated (from A to F) according to their quality. Model quality rates from A-to-D were assigned to proteins known to be TFs, including those listed in Schaefer et al., Nucleic Acids Research, 2011 with addition of a number of proteins having relevant models and sufficient evidence to be TFs. The ratings were assigned by human curation according to the following criteria:
Relevant distribution of position-specific information content over alignment columns, which means a model LOGO representation displaying well formed core positions with a high information content surrounded by flanking letters with lower information content; the information content at flanking positions decreasing with the distance from the model core.
"Stability", which means that in more than one of the ChIPMunk modes we obtained models with a similar length, consensus, and comparable number of aligned binding sites, along with a similar shape of model LOGO representation. "Similarity" of the model to the binding sequence consensus for this TF given in the UniProt or other databases, which means similarity of the shape of the model LOGO and TFBS lengths to those of other TFs from the same TF family. "A total number of binding sites" was also considered as a quality measure, as a large set of binding regions (mostly but not limited to ChIP-Seq and parallel SELEX) implies that there are many observations of each letter in any position of the alignment, particularly many observations of non-consensus letters in core positions. In positions with low information content, where there is no strong consensus, all variants have many observations, and thus the observed letter frequencies are less dependent on statistical fluctuations.
Quality A was assigned to high confidence models complying with all four criteria listed in the section above. Quality B was assigned to models built from large sequence sets that failed no more than one out of the three remaining criteria. Quality C was assigned to models built from small sequence sets but (with a number of specifically marked exceptions) complying with the three remaining criteria. Quality D models missed part of the known consensus sequence or had no clearly significant core positions in the TFBS model. Quality E (error) was assigned to models for proteins not convincingly shown to be TFs or to models exhibiting an irrelevant LOGO shape or a wrong consensus sequence. Quality F (failure) was assigned to TFs for which there was no reliable model identified."
"23175603"
see Source
for more details
Load with data(hocomoco)
MotifList-class
object
Kulakovskiy,I.V., Medvedeva,Y.A., Schaefer,U., Kasianov,A.S., Vorontsov,I.E., Bajic,V.B. and Makeev,V.J. (2013) HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Research, 41, D195–D202.
http://autosome.ru/HOCOMOCO/ http://cbrc.kaust.edu.sa/hocomoco/
data(hocomoco) hocomoco
data(hocomoco) hocomoco
From the website: "Homer includes several motif databases that are used to help annotate results and conduct searches for known motifs. HOMER contains a custom motif database based on independent analysis of mostly ChIP-Seq data sets which is heavily utilized in the software." See http://homer.salk.edu/homer/motif/motifDatabase.html for more information on how these files were generated, and Homer's sources.
homer
homer
MotifDb
object of length 247; to access metadata
use mcols(homer)
Name provided HOMER
Factor Name provided by HOMER
"HOMER"
Symbol provided by HOMER
Entrez gene id for the transcription factor
"ENTREZ"
UNIPROT id for the transcription factor
"UNIPROT"
"Hsapiens"
NA
Consensus sequence for the motif
DBD provided by HOMER
NA
The Celltype, IP, Assay, and GEO id if applicable for the motif
"20513432"
see Source
for more details
Load with data(homer)
MotifList-class
object
Heinz S, Benner C, Spann N, Bertolino E et al. (2010 May 28) Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell, 38(4):576-589. PMID: 20513432
http://homer.salk.edu/homer/index.html http://homer.salk.edu/homer/motif/motifDatabase.html http://homer.salk.edu/homer/motif/HomerMotifDB/homerResults.html
data(homer) homer
data(homer) homer
Predict The Disruptiveness Of Single Nucleotide Polymorphisms On Transcription Factor Binding Sites.
motifbreakR( snpList, pwmList, threshold = 0.85, filterp = FALSE, method = "default", show.neutral = FALSE, verbose = FALSE, bkg = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25), BPPARAM = bpparam() )
motifbreakR( snpList, pwmList, threshold = 0.85, filterp = FALSE, method = "default", show.neutral = FALSE, verbose = FALSE, bkg = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25), BPPARAM = bpparam() )
snpList |
The output of |
pwmList |
An object of class |
threshold |
Numeric; the maximum p-value for a match to be called or a minimum score threshold |
filterp |
Logical; filter by p-value instead of by pct score. |
method |
Character; one of |
show.neutral |
Logical; include neutral changes in the output |
verbose |
Logical; if running serially, show verbose messages |
bkg |
Numeric Vector; the background probabilites of the nucleotides
used with method= |
BPPARAM |
a BiocParallel object see |
motifbreakR works with position probability matrices (PPM). PPM
are derived as the fractional occurrence of nucleotides A,C,G, and T at
each position of a position frequency matrix (PFM). PFM are simply the
tally of each nucleotide at each position across a set of aligned
sequences. With a PPM, one can generate probabilities based on the
genome, or more practically, create any number of position specific
scoring matrices (PSSM) based on the principle that the PPM contains
information about the likelihood of observing a particular nucleotide at
a particular position of a true transcription factor binding site. What
follows is a discussion of the three different algorithms that may be
employed in calls to the motifbreakR function via the method
argument.
Suppose we have a frequency matrix of width
(i.e. a
PPM as described above). Furthermore, we have a sequence
also of
length
, such that
.
Each column of
contains the frequencies of each letter in each position.
Commonly in the literature sequences are scored as the sum of log probabilities:
Equation 1
where is the background frequency of letter
in
the genome of interest. This method can be specified by the user as
method='log'
.
As an alternative to this method, we introduced a scoring method to
directly weight the score by the importance of the position within the
match sequence. This method of weighting is accessed by specifying
method='ic'
(information content). A general representation
of this scoring method is given by:
Equation 2
where is the scoring vector derived from sequence
and matrix
, and
is a weight vector derived from
. First, we
compute the scoring vector of position scores
Equation 3
and second, for each a constant vector of weights
.
There are two methods for producing . The first, which we
call weighted sum, is the difference in the probabilities for the two
letters of the polymorphism (or variant), i.e.
, or the difference of the maximum and minimum
values for each column of
:
Equation 4.1
The second variation of this theme is to weight by relative entropy.
Thus the relative entropy weight for each column of the matrix is
given by:
Equation 4.2
where is again the background frequency of the letter
.
Thus, there are 3 possible algorithms to apply via the method
argument. The first is the standard summation of log probabilities
(method='log'
). The second and third are the weighted sum and
information content methods (method='default'
and method='ic'
) specified by
equations 4.1 and 4.2, respectively. motifbreakR assumes a
uniform background nucleotide distribution () in equations 1 and
4.2 unless otherwise specified by the user. Since we are primarily
interested in the difference between alleles, background frequency is
not a major factor, although it can change the results. Additionally,
inclusion of background frequency introduces potential bias when
collections of motifs are employed, since motifs are themselves
unbalanced with respect to nucleotide composition. With these cautions
in mind, users may override the uniform distribution if so desired. For
all three methods, motifbreakR scores and reports the reference
and alternate alleles of the sequence
(
and
), and provides the matrix scores
and
of the SNP (or
variant). The scores are scaled as a fraction of scoring range 0-1 of
the motif matrix,
. If either of
and
is greater than a user-specified
threshold (default value of 0.85) the SNP is reported. By default
motifbreakR does not display neutral effects,
(
) but this behaviour can be
overridden.
Additionally, now, with the use of TFMPvalue-package
, we may filter by p-value of the match.
This is unfortunately a two step process. First, by invoking filterp=TRUE
and setting a threshold at
a desired p-value e.g 1e-4, we perform a rough filter on the results by rounding all values in the PWM to two
decimal place, and calculating a scoring threshold based upon that. The second step is to use the function calculatePvalue()
on a selection of results which will change the Refpvalue
and Altpvalue
columns in the output from NA
to the p-value
calculated by TFMsc2pv
. This can be (although not always) a very memory and time intensive process if the algorithm doesn't converge rapidly.
a GRanges object containing:
REF |
the reference allele for the variant |
ALT |
the alternate allele for the variant |
snpPos |
the coordinates of the variant |
motifPos |
The position of the motif relative the the variant |
geneSymbol |
the geneSymbol corresponding to the TF of the TF binding motif |
dataSource |
the source of the TF binding motif |
providerName , providerId
|
the name and id provided by the source |
seqMatch |
the sequence on the 5' -> 3' direction of the "+" strand that corresponds to DNA at the position that the TF binding motif was found. |
pctRef |
The score as determined by the scoring method, when the sequence contains the reference variant allele, normalized to a scale from 0 - 1. If |
pctAlt |
The score as determined by the scoring method, when the sequence contains the alternate variant allele, normalized to a scale from 0 - 1. If |
scoreRef |
The score as determined by the scoring method, when the sequence contains the reference variant allele |
scoreAlt |
The score as determined by the scoring method, when the sequence contains the alternate variant allele |
Refpvalue |
p-value for the match for the pctRef score, initially set to |
Altpvalue |
p-value for the match for the pctAlt score, initially set to |
altPos |
the position, relative to the reference allele, of the alternate allele |
alleleDiff |
The difference between the score on the reference allele and the score on the alternate allele |
alleleEffectSize |
The ratio of the |
effect |
one of weak, strong, or neutral indicating the strength of the effect. |
each SNP in this object may be plotted with plotMB
See snps.from.rsid
and snps.from.file
for
information about how to generate the input to this function and
plotMB
for information on how to visualize it's output
library(BSgenome.Hsapiens.UCSC.hg19) # prepare variants load(system.file("extdata", "pca.enhancer.snps.rda", package = "motifbreakR")) # loads snps.mb pca.enhancer.snps <- sample(snps.mb, 20) # Get motifs to interrogate data(hocomoco) motifs <- sample(hocomoco, 50) # run motifbreakR results <- motifbreakR(pca.enhancer.snps, motifs, threshold = 0.85, method = "ic", BPPARAM=BiocParallel::SerialParam())
library(BSgenome.Hsapiens.UCSC.hg19) # prepare variants load(system.file("extdata", "pca.enhancer.snps.rda", package = "motifbreakR")) # loads snps.mb pca.enhancer.snps <- sample(snps.mb, 20) # Get motifs to interrogate data(hocomoco) motifs <- sample(hocomoco, 50) # run motifbreakR results <- motifbreakR(pca.enhancer.snps, motifs, threshold = 0.85, method = "ic", BPPARAM=BiocParallel::SerialParam())
This object contains all the MotifList-class
objects that were generated
for this package. See the individual help sections for hocomoco
, homer
,
factorbook
, and encodemotif
, for how the data is formatted.
motifbreakR_motif
motifbreakR_motif
MotifDb
object of length 2816; to access metadata
use mcols(motifbreakR_motif)
Load with data(motifbreakR_motif)
MotifList-class
object
Kulakovskiy,I.V., Medvedeva,Y.A., Schaefer,U., Kasianov,A.S., Vorontsov,I.E., Bajic,V.B. and Makeev,V.J. (2013) HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Research, 41, D195–D202.
Heinz S, Benner C, Spann N, Bertolino E et al. (2010 May 28) Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell, 38(4):576-589. PMID: 20513432
J Wang, J Zhuang, S Iyer, XY Lin, et al. (2012) Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Research, 22 (9), 1798-1812, doi:10.1101/gr.139105.112
Pouya Kheradpour and Manolis Kellis (2013 December 13) Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Research, doi:10.1093/nar/gkt1249
hocomoco
, homer
,
factorbook
, and encodemotif
data(motifbreakR_motif) motifbreakR_motif
data(motifbreakR_motif) motifbreakR_motif
Plot a genomic region surrounding a genomic variant, and potentially disrupted motifs
plotMB( results, rsid, reverseMotif = TRUE, effect = c("strong", "weak"), altAllele = NULL )
plotMB( results, rsid, reverseMotif = TRUE, effect = c("strong", "weak"), altAllele = NULL )
results |
The output of |
rsid |
Character; the identifier of the variant to be visualized |
reverseMotif |
Logical; if the motif is on the "-" strand show the
the motifs as reversed |
effect |
Character; show motifs that are strongly effected |
altAllele |
Character; The default value of |
plotMB
produces output showing the location of the SNP on the
chromosome, the surrounding sequence of the + strand, the footprint of any
motif that is disrupted by the SNP or SNV, and the DNA sequence motif(s).
The altAllele
argument is included for variants like rs1006140 where
multiple alternate alleles exist, the reference allele is A, and the alternate
can be G,T, or C. plotMB
only plots one alternate allele at a time.
plots a figure representing the results of motifbreakR
at the
location of a single SNP, returns invisible NULL
.
See motifbreakR
for the function that produces output to be
visualized here, also snps.from.rsid
and snps.from.file
for information about how to generate the input to motifbreakR
function.
data(example.results) example.results library(BSgenome.Hsapiens.UCSC.hg19) plotMB(results = example.results, rsid = "rs1006140", effect = "strong", altAllele = "C")
data(example.results) example.results library(BSgenome.Hsapiens.UCSC.hg19) plotMB(results = example.results, rsid = "rs1006140", effect = "strong", altAllele = "C")
Run Shiny version of the motifbreakR package.
shiny_motifbreakR()
shiny_motifbreakR()
returns a shinyAppDir
that launches the shiny app when printed.
library(motifbreakR) app <- shiny_motifbreakR() if (interactive()) { shiny::runApp(app) }
library(motifbreakR) app <- shiny_motifbreakR() if (interactive()) { shiny::runApp(app) }
Import SNPs from a BED file or VCF file for use in motifbreakR
snps.from.file( file = NULL, dbSNP = NULL, search.genome = NULL, format = "bed", indels = FALSE, biomart.dataset = NULL, check.unnamed.for.rsid = FALSE ) variants.from.file( file = NULL, dbSNP = NULL, search.genome = NULL, biomart.dataset = NULL, format = "bed" )
snps.from.file( file = NULL, dbSNP = NULL, search.genome = NULL, format = "bed", indels = FALSE, biomart.dataset = NULL, check.unnamed.for.rsid = FALSE ) variants.from.file( file = NULL, dbSNP = NULL, search.genome = NULL, biomart.dataset = NULL, format = "bed" )
file |
Character; a character containing the path to a bed file or a vcf file see Details for a description of the required format |
dbSNP |
OPTIONAL; an object of class SNPlocs to lookup rsids; see |
search.genome |
an object of class BSgenome for the species you are interrogating;
see |
format |
Character; one of |
indels |
Logical; allow the import of indels. |
biomart.dataset |
a Mart object from |
check.unnamed.for.rsid |
Logical; check snps in the form chr:pos:ref:alt for corresponding rsid, lookup may be slow, requires either param dbSNP or biomart.dataset. |
snps.from.file
takes a character vector describing the file path
to a bed file that contains the necissary information to generate the input for
motifbreakR
see http://www.genome.ucsc.edu/FAQ/FAQformat.html#format1
for a complete description of the BED format. Our convention deviates in that there
is a required format for the name field. name
is defined as chromosome:start:REF:ALT
or the rsid from dbSNP (if you've included the optional SNPlocs argument).
For example if you were to include rs123 in it's alternate
format it would be entered as chr7:24966446:C:A
a GRanges object containing:
SNP_id |
The rsid of the snp with the "rs" portion stripped |
alleles_as_ambig |
THE IUPAC ambiguity code between the reference and alternate allele for this SNP |
REF |
The reference allele for the SNP |
ALT |
The alternate allele for the SNP |
variants.from.file()
: Allows the use of indels by default
See motifbreakR
for analysis; See snps.from.rsid
for an alternate method for generating a list of variants.
library(BSgenome.Drerio.UCSC.danRer7) library(SNPlocs.Hsapiens.dbSNP155.GRCh37) snps.bed.file <- system.file("extdata", "danRer.bed", package = "motifbreakR") # see the contents read.table(snps.bed.file, header = FALSE) #import the BED file snps.mb <- snps.from.file(snps.bed.file, search.genome = BSgenome.Drerio.UCSC.danRer7, format = "bed")
library(BSgenome.Drerio.UCSC.danRer7) library(SNPlocs.Hsapiens.dbSNP155.GRCh37) snps.bed.file <- system.file("extdata", "danRer.bed", package = "motifbreakR") # see the contents read.table(snps.bed.file, header = FALSE) #import the BED file snps.mb <- snps.from.file(snps.bed.file, search.genome = BSgenome.Drerio.UCSC.danRer7, format = "bed")
Import SNPs from rsid for use in motifbreakR
snps.from.rsid( rsid = NULL, dbSNP = NULL, search.genome = NULL, biomart.dataset = NULL )
snps.from.rsid( rsid = NULL, dbSNP = NULL, search.genome = NULL, biomart.dataset = NULL )
rsid |
Character; a character vector of rsid values from dbSNP |
dbSNP |
an object of class SNPlocs to lookup rsids; see |
search.genome |
an object of class BSgenome for the species you are interrogating;
see |
biomart.dataset |
a Mart object from |
snps.from.rsid
take an rsid, or character vector of rsids and
generates the required object to input into motifbreakR
a GRanges object containing:
SNP_id |
The rsid of the snp with the "rs" portion stripped |
alleles_as_ambig |
THE IUPAC ambiguity code between the reference and alternate allele for this SNP |
REF |
The reference allele for the SNP |
ALT |
The alternate allele for the SNP |
See motifbreakR
for analysis; See snps.from.file
for an alternate method for generating a list of variants.
library(BSgenome.Hsapiens.UCSC.hg19) library(SNPlocs.Hsapiens.dbSNP155.GRCh37) snps.file <- system.file("extdata", "pca.enhancer.snps", package = "motifbreakR") snps <- as.character(read.table(snps.file)[,1]) snps.mb <- snps.from.rsid(snps[1], dbSNP = SNPlocs.Hsapiens.dbSNP155.GRCh37, search.genome = BSgenome.Hsapiens.UCSC.hg19) ## alternatively using biomaRt library(biomaRt) library(BSgenome.Hsapiens.UCSC.hg38) ensembl_snp <- useEnsembl(biomart = "snps", dataset = "hsapiens_snp", version = "112") snps.mb <- snps.from.rsid(snps, biomart.dataset = ensembl_snp, search.genome = BSgenome.Hsapiens.UCSC.hg38)
library(BSgenome.Hsapiens.UCSC.hg19) library(SNPlocs.Hsapiens.dbSNP155.GRCh37) snps.file <- system.file("extdata", "pca.enhancer.snps", package = "motifbreakR") snps <- as.character(read.table(snps.file)[,1]) snps.mb <- snps.from.rsid(snps[1], dbSNP = SNPlocs.Hsapiens.dbSNP155.GRCh37, search.genome = BSgenome.Hsapiens.UCSC.hg19) ## alternatively using biomaRt library(biomaRt) library(BSgenome.Hsapiens.UCSC.hg38) ensembl_snp <- useEnsembl(biomart = "snps", dataset = "hsapiens_snp", version = "112") snps.mb <- snps.from.rsid(snps, biomart.dataset = ensembl_snp, search.genome = BSgenome.Hsapiens.UCSC.hg38)