Title: | An Annotated Collection of Protein-DNA Binding Sequence Motifs |
---|---|
Description: | More than 9900 annotated position frequency matrices from 14 public sources, for multiple organisms. |
Authors: | Paul Shannon, Matt Richards |
Maintainer: | Paul Shannon <[email protected]> |
License: | Artistic-2.0 | file LICENSE |
Version: | 1.49.0 |
Built: | 2024-11-29 07:27:33 UTC |
Source: | https://github.com/bioc/MotifDb |
In the analysis of, or exploration of gene regulatory networks, one often creates a data.frame of possible genomic regulatory sites, genomic locations where a TF binding motif matches some DNA sequence. A common next step is to associate each of these motifs with its related transcription factor/s. We provide two sources for those relationships. When you specify the "MotifDb" source, we return the motif/TF relationships provided by each of the constituent public MotifDb sources. When you specify the "TFClass" source, transcription factor family memberships (described in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383905/) are - sometimes expansively - provided for each motif you supply.
This method uses, and therefore expects, different columns of the incoming data.frame to be used with each method. The MotifDb source uses the "motifName" column of the incoming data.frame. The TFClass source expects a "shortName" column in the incoming database.
A new column, "geneSymbol", is added to the incoming data.frame. This new column identifies the transcription factor associated with the motif for each row in the data.frame.
## S4 method for signature 'MotifList' associateTranscriptionFactors(object, tbl.withMotifs, source, expand.rows, motifColumnName="motifName")
## S4 method for signature 'MotifList' associateTranscriptionFactors(object, tbl.withMotifs, source, expand.rows, motifColumnName="motifName")
object |
a |
tbl.withMotifs |
a |
source |
a |
expand.rows |
a |
motifColumnName |
a |
A data.frame with one column ("geneSymbol") and possibly multiple rows added
Paul Shannon
MotifDb, geneToMotif, motifToGene, subset, query
tbl.tfClassExample <- data.frame(motifName=c("MA0006.1", "MA0042.2", "MA0043.2"), chrom=c("chr1", "chr1", "chr1"), start=c(1000005, 1000085, 1000105), start=c(1000013, 1000092, 1000123), score=c(0.85, 0.92, 0.98), stringsAsFactors=FALSE) # here we illustrate how to add a column with the required name: tbl.tfClassExample$shortMotif <- tbl.tfClassExample$motifName tbl.out <- associateTranscriptionFactors(MotifDb, tbl.tfClassExample, source="TFClass", expand.rows=TRUE) dim(tbl.out) # MANY tfs mapped, mostly FOX family genes tbl.motifDbExample <- data.frame(motifName=c("Mmusculus-jaspar2016-Ahr::Arnt-MA0006.1", "Hsapiens-jaspar2016-FOXI1-MA0042.2", "Hsapiens-jaspar2016-HLF-MA0043.2"), chrom=c("chr1", "chr1", "chr1"), start=c(1000005, 1000085, 1000105), start=c(1000013, 1000092, 1000123), score=c(0.85, 0.92, 0.98), stringsAsFactors=FALSE) tbl.out <- associateTranscriptionFactors(MotifDb, tbl.motifDbExample, source="MotifDb", expand.rows=TRUE) dim(tbl.out) # one new column ("geneSymbol"), no new rows
tbl.tfClassExample <- data.frame(motifName=c("MA0006.1", "MA0042.2", "MA0043.2"), chrom=c("chr1", "chr1", "chr1"), start=c(1000005, 1000085, 1000105), start=c(1000013, 1000092, 1000123), score=c(0.85, 0.92, 0.98), stringsAsFactors=FALSE) # here we illustrate how to add a column with the required name: tbl.tfClassExample$shortMotif <- tbl.tfClassExample$motifName tbl.out <- associateTranscriptionFactors(MotifDb, tbl.tfClassExample, source="TFClass", expand.rows=TRUE) dim(tbl.out) # MANY tfs mapped, mostly FOX family genes tbl.motifDbExample <- data.frame(motifName=c("Mmusculus-jaspar2016-Ahr::Arnt-MA0006.1", "Hsapiens-jaspar2016-FOXI1-MA0042.2", "Hsapiens-jaspar2016-HLF-MA0043.2"), chrom=c("chr1", "chr1", "chr1"), start=c(1000005, 1000085, 1000105), start=c(1000013, 1000092, 1000123), score=c(0.85, 0.92, 0.98), stringsAsFactors=FALSE) tbl.out <- associateTranscriptionFactors(MotifDb, tbl.motifDbExample, source="MotifDb", expand.rows=TRUE) dim(tbl.out) # one new column ("geneSymbol"), no new rows
Writes all matrices in the supplied list, in the specified format, to the specified connection.
## S4 method for signature 'MotifList,connection,character' export(object, con, format, ...) ## S4 method for signature 'MotifList,character,character' export(object, con, format, ...) ## S4 method for signature 'MotifList,missing,character' export(object, con, format, ...)
## S4 method for signature 'MotifList,connection,character' export(object, con, format, ...) ## S4 method for signature 'MotifList,character,character' export(object, con, format, ...) ## S4 method for signature 'MotifList,missing,character' export(object, con, format, ...)
object |
a |
con |
either a |
format |
a |
... |
ignore this |
The matrices list is written to the specified connection in the specified format.
Paul Shannon
MotifDb, query, subset, flyFactorSurvey, hPDI, jaspar, ScerTF, uniprobe
library (MotifDb) # identify all the SOX genes sox.indices = grep ('^sox', values (MotifDb)$geneSymbol, ignore.case=TRUE) matrices = MotifDb [sox.indices] export (matrices, con='SoxGenes-meme.txt', format='meme')
library (MotifDb) # identify all the SOX genes sox.indices = grep ('^sox', values (MotifDb)$geneSymbol, ignore.case=TRUE) matrices = MotifDb [sox.indices] export (matrices, con='SoxGenes-meme.txt', format='meme')
Using either of our two sources ("MotifDb" or "TFClass") retrieve the names of the transcription factor binding motifs associated with the gene symbol for each transcription factor. Slightly different information is returned in each case but the columns "geneSymbol", "motif", "pubmedID", "source" are returned by both sources. The TFClass source is described here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383905/. The MotifDb source is in fact the usually 1:1 gene/motif mapping provided by each of the data sources upon which MotifDb is built.
## S4 method for signature 'MotifList' geneToMotif(object, geneSymbols, source, ignore.case)
## S4 method for signature 'MotifList' geneToMotif(object, geneSymbols, source, ignore.case)
object |
a |
geneSymbols |
a |
source |
a |
ignore.case |
a |
A data.frame with these columns: geneSymbol, motif, pubmedID, source. The MotifDb source alos include dataSource and organism.
Paul Shannon
MotifDb, motifToGene, associateTranscriptionFactors, subset, query
genes <- c("ATF5", "FOS") geneToMotif(MotifDb, genes, source="TFClass") geneToMotif(MotifDb, genes, source="MotifDb")
genes <- c("ATF5", "FOS") geneToMotif(MotifDb, genes, source="TFClass") geneToMotif(MotifDb, genes, source="MotifDb")
Approximately 2000 position frequency matrices collected from public sources, with ample accompanying metadata, and search and export capabilities provided.
MotifDb is an R object of class MotifList, whose entries are numeric
matrices, accompanied by a 'parallel' metadata structure, a DataFrame, in
which each row provides information about the corresponding matrix.
This object is automatically created and fully populated by data from
five public sources (see below) when the package is loaded into your R environment via the
library
call.
The matrices are obtained from six public sources:
FlyFactorSurvey: | 614 |
hPDI: | 437 |
JASPAR_CORE: | 459 |
jolma2013: | 843 |
ScerTF: | 196 |
stamlab: | 683 |
UniPROBE: | 380 |
cisbp 1.02 | 874 |
Representing primarily five organsisms (and 49 total):
Hsapiens: | 2328 |
Dmelanogaster: | 1008 |
Scerevisiae: | 701 |
Mmusculus: | 660 |
Athaliana: | 160 |
Celegans: | 44 |
other: | 177 |
All the matrices are stored as position frequency matrices, in which each columm (each position) sums to 1.0. When the number of sequences which contributed to the motif are known, that number will be found in the matrix's metadata. With this information, one can transform the matrices into either PCM (position count matrices), or PWM (position weight matrices), also known as PSSM (position-specific-scoring matrices). The latter transformation requires that a model of the background distribution be known, or assumed.
The names of the matrices are the same as rownames of the metadata DataFrame, and have been chosen to balance the needs of concision and full description, including the organism in which the motif was discovered, the data source, and the name of the motif in the data source from which it was obtained. For example: "Hsapiens-JASPAR_CORE-SP1-MA0079.2" and "Scerevisiae-ScerTF-GSM1-badis".
Subsets of the Matrices may be obtainted in several ways:
By integer index, eg, MotifDb [[1]]
By query, eg, as.list (query (MotifDb, 'FBgn0000014'))
(Interactively only) by subset as.list (subset (MotifDb,
geneSymbol=='Abda' & !is.na (pubmedID)))
The matrices are stored in a SimpleList
which has semantics very
similar to the familiar list of R base. To examine a matrix, however,
you must sidestep the MotifDb show
method. These three commands
display quite different results:
> MotifDb [1] MotifDb object of length 1 | Created from downloaded public sources: 2012-Jul6 | 1 position frequency matrices from 1 source: | FlyFactorSurvey: 1 | 1 organism/s | Dmelanogaster: 1 Dmelanogaster-FlyFactorSurvey-ab_SANGER_10_FBgn0259750 > MotifDb [[1]] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 A 0.0 0.50 0.20 0.35 0 0 1 0 0 0.55 0.35 0.05 0.20 0.45 0.20 0.10 0.40 0.40 0.25 0.50 0.30 C 0.3 0.15 0.25 0.00 1 1 0 0 0 0.10 0.65 0.70 0.45 0.25 0.10 0.25 0.25 0.10 0.10 0.25 0.25 G 0.4 0.05 0.50 0.65 0 0 0 1 1 0.00 0.00 0.05 0.05 0.15 0.05 0.20 0.05 0.15 0.55 0.15 0.45 T 0.3 0.30 0.05 0.00 0 0 0 0 0 0.35 0.00 0.20 0.30 0.15 0.65 0.45 0.30 0.35 0.10 0.10 0.00 > as.list (MotifDb [1]) $`Dmelanogaster-FlyFactorSurvey-ab_SANGER_10_FBgn0259750` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 A 0.0 0.50 0.20 0.35 0 0 1 0 0 0.55 0.35 0.05 0.20 0.45 0.20 0.10 0.40 0.40 0.25 0.50 0.30 C 0.3 0.15 0.25 0.00 1 1 0 0 0 0.10 0.65 0.70 0.45 0.25 0.10 0.25 0.25 0.10 0.10 0.25 0.25 G 0.4 0.05 0.50 0.65 0 0 0 1 1 0.00 0.00 0.05 0.05 0.15 0.05 0.20 0.05 0.15 0.55 0.15 0.45 T 0.3 0.30 0.05 0.00 0 0 0 0 0 0.35 0.00 0.20 0.30 0.15 0.65 0.45 0.30 0.35 0.10 0.10 0.00
There are fifteen kinds of metadata – though not all matrices have a full complement: not all of the public sources are complete in this regard. The information falls into these categories, using the Dmelanogaster-FlyFactorSurvey-ab_SANGER_10_FBgn0259750 entry as an example (see below for the associated position frequency matrix):
providerName: "ab_SANGER_10_FBgn0259750"
providerId: "FBgn0259750"
dataSource: "FlyFactorSurvey"
geneSymbol: "Ab"
geneId: "FBgn0259750"
geneIdType: "FLYBASE"
proteinId: "E1JHF4"
proteinIdType: "UNIPROT"
organism: "Dmelanogaster"
sequenceCount: 20
bindingSequence: NA
bindingDomain: NA
tfFamily: NA
experimentType: "bacterial 1-hybrid, SANGER sequencing"
pubmedID: NA
Neph S, Stergachis AB, Reynolds A, Sandstrom R, Borenstein E, Stamatoyannopoulos JA. Circuitry and dynamics of human transcription factor regulatory networks. Cell. 2012 Sep 14;150(6):1274-86.
Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, Yusuf D, Lenhard B, Wasserman WW, Sandelin A. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 2010 Jan;38(Database issue):D105-10. Epub 2009 Nov 11.
Robasky K, Bulyk ML. UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res. 2011 Jan;39(Database issue):D124-8. Epub 2010 Oct 30.
Spivak AT, Stormo GD. ScerTF: a comprehensive database of benchmarked position weight matrices for Saccharomyces species. Nucleic Acids Res. 2012 Jan;40(Database issue):D162-8. Epub 2011 Dec 2.
Xie Z, Hu S, Blackshaw S, Zhu H, Qian J. hPDI: a database of experimental human protein-DNA interactions. Bioinformatics. 2010 Jan 15;26(2):287-9. Epub 2009 Nov 9.
Zhu LJ, et al. 2011. FlyFactorSurvey: a database of Drosophila transcription factor binding specificities determined using the bacterial one-hybrid system. Nucleic Acids Res. 2011 Jan;39(Database issue):D111-7. Epub 2010 Nov 19.
Jolma A, et al. 2013. DNA-binding specificities of human transcription factors. Cell 2013 Jan 17.
query, subset, export, flyFactorSurvey, hPDI, jaspar, ScerTF, uniprobe
# are there any matrices for Sox4? we find two mdb.sox4 <- MotifDb [grep ('sox4', values (MotifDb)$geneSymbol, ignore.case=TRUE)] # the same two matrices can be obtained this way also if (interactive ()) mdb.sox4 <- subset (MotifDb, tolower(geneSymbol)=='sox4') # and like this mdb.sox4 <- query (MotifDb, 'sox4') # matches against all fields in the metadata # implicitly invoke the 'show' method mdb.sox4 # get their full names names (mdb.sox4) # examine their metadata values (mdb.sox4) # examine the matrices with names include as.list (mdb.sox4) # export the matrices in meme format destination.file = tempfile () export (mdb.sox4, destination.file, 'meme')
# are there any matrices for Sox4? we find two mdb.sox4 <- MotifDb [grep ('sox4', values (MotifDb)$geneSymbol, ignore.case=TRUE)] # the same two matrices can be obtained this way also if (interactive ()) mdb.sox4 <- subset (MotifDb, tolower(geneSymbol)=='sox4') # and like this mdb.sox4 <- query (MotifDb, 'sox4') # matches against all fields in the metadata # implicitly invoke the 'show' method mdb.sox4 # get their full names names (mdb.sox4) # examine their metadata values (mdb.sox4) # examine the matrices with names include as.list (mdb.sox4) # export the matrices in meme format destination.file = tempfile () export (mdb.sox4, destination.file, 'meme')
A direct subclass of SimpleList, having no extra slots, in which listData is a list of position frequency matrices (PFMs), and the elementMetadata slot is a DataFrame with fifteen columns describing each matrix. Upon loading the MotifDb class, one MotifList object is instantiated and filled with matrices and their metadata. There should be no need for users to explicitly create objects of this class. When you load the MotifDb package, a fully-populated instance of this class is created, with > 2000 matrices with metadata
subset(x)
: extract matrices by metadata.
export(x)
: write matrices
show(x)
: describe matrices compactly
query(x)
: find matrices
Paul Shannon
# Examine the number of matrices contributed by each source. print (table (values (MotifDb)$dataSource))
# Examine the number of matrices contributed by each source. print (table (values (MotifDb)$dataSource))
Using either of our two sources ("MotifDb" or "TFClass") this method retrieves the the transcription factor (its gene symbol) for each of the supplied motifs. Slightly different information is returned in each case but the columns "geneSymbol", "motif", "pubmedID", "source" are returned by both. The TFClass source is described here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383905/. The MotifDb source is in fact the (typically) 1:1 gene/motif mapping provided by each of the data sources upon which MotifDb is built.
## S4 method for signature 'MotifList' motifToGene(object, motifs, source)
## S4 method for signature 'MotifList' motifToGene(object, motifs, source)
object |
a |
motifs |
a |
source |
a |
A data.frame with these columns: geneSymbol, motif, pubmedID, source. The MotifDb source also include dataSource and organism.
Paul Shannon
MotifDb, geneToMotif, associateTranscriptionFactors, subset, query
motifs <- c("MA0592.2", "ELF1.SwissRegulon", "UP00022") motifToGene(MotifDb, motifs, source="TFClass") motifToGene(MotifDb, motifs, source="MotifDb")
motifs <- c("MA0592.2", "ELF1.SwissRegulon", "UP00022") motifToGene(MotifDb, motifs, source="TFClass") motifToGene(MotifDb, motifs, source="MotifDb")
A very general search tool, returning all matrices whose metadata, in ANY column, is matched by the query string.
## S4 method for signature 'MotifList' query(object, andStrings, orStrings, notStrings, ignore.case=TRUE)
## S4 method for signature 'MotifList' query(object, andStrings, orStrings, notStrings, ignore.case=TRUE)
object |
a |
andStrings |
a |
orStrings |
a |
notStrings |
a |
ignore.case |
a |
A list of the matrices
Paul Shannon
MotifDb, subset, export, flyFactorSurvey, hPDI, jaspar, ScerTF, uniprobe
matrices.human <- query(MotifDb, 'hsapiens') matrices.sox4 <- query(MotifDb, 'sox4') uniprobe.sox.matrices <- query(MotifDb, c('uniprobe', 'sox')) # two approaches to selectinve extraction of TFEB matrices tfeb.human.1 <- query(MotifDb, andStrings=c("TFEB", "hsapiens"), notStrings=c("hpdi", "jolma", "cisbp")) tfeb.human.2 <- query(MotifDb, andStrings=c("TFEB", "hsapiens"), orStrings=c("hocomoco", "jaspar", "swissregulon"), notStrings="2016")
matrices.human <- query(MotifDb, 'hsapiens') matrices.sox4 <- query(MotifDb, 'sox4') uniprobe.sox.matrices <- query(MotifDb, c('uniprobe', 'sox')) # two approaches to selectinve extraction of TFEB matrices tfeb.human.1 <- query(MotifDb, andStrings=c("TFEB", "hsapiens"), notStrings=c("hpdi", "jolma", "cisbp")) tfeb.human.2 <- query(MotifDb, andStrings=c("TFEB", "hsapiens"), orStrings=c("hocomoco", "jaspar", "swissregulon"), notStrings="2016")
An analog of the base package subset method, this version will return all the matrices whose metadata match the (possibly intricate) logical expression in the "subset" argument.
Note: just as with the base subset method, this method is unreliable except when used interactively. Batch, script or other programmatic use of this function is to be avoided.
## S4 method for signature 'MotifList' subset(x, subset, select, drop=FALSE, ...)
## S4 method for signature 'MotifList' subset(x, subset, select, drop=FALSE, ...)
x |
a |
subset |
a |
select , drop , ...
|
these are ignored, appearing here only in fidelity to the generic definition of the method. |
A list of the matrices whose metadata satisfies the supplied subset
Paul Shannon
MotifDb, query, export, flyFactorSurvey, hPDI, jaspar, ScerTF, uniprobe
mdb <- MotifDb if (interactive ()) { matrices <- subset (mdb, dataSource=='UniPROBE') egr1.matrices <- subset (mdb, geneSymbol=='Egr1') jaspar.egr1.matrices <- subset (mdb, geneSymbol=='Egr1' & dataSource == 'JASPAR_CORE') # one of the mouse egr1 matrices has a geneSymbol 'Zif268', but # has the proper entrez geneId. all.egr1.matrices <- subset (mdb, geneId=='13653') }
mdb <- MotifDb if (interactive ()) { matrices <- subset (mdb, dataSource=='UniPROBE') egr1.matrices <- subset (mdb, geneSymbol=='Egr1') jaspar.egr1.matrices <- subset (mdb, geneSymbol=='Egr1' & dataSource == 'JASPAR_CORE') # one of the mouse egr1 matrices has a geneSymbol 'Zif268', but # has the proper entrez geneId. all.egr1.matrices <- subset (mdb, geneId=='13653') }