Title: | Software Package for Transcription Factor Binding Site (TFBS) Analysis |
---|---|
Description: | TFBSTools is a package for the analysis and manipulation of transcription factor binding sites. It includes matrices conversion between Position Frequency Matirx (PFM), Position Weight Matirx (PWM) and Information Content Matrix (ICM). It can also scan putative TFBS from sequence/alignment, query JASPAR database and provides a wrapper of de novo motif discovery software. |
Authors: | Ge Tan [aut, cre] |
Maintainer: | Ge Tan <[email protected]> |
License: | GPL-2 |
Version: | 1.45.0 |
Built: | 2024-10-31 05:48:12 UTC |
Source: | https://github.com/bioc/TFBSTools |
TFBS includes a set of tools for transcription factor binding site detection and analysis as well as database interface functions for JASPAR, etc.
Ge Tan
Calculate the conservation score for a pairwise alignment given a smooth window size.
calConservation(aln1, aln2, windowSize=51L, which="1")
calConservation(aln1, aln2, windowSize=51L, which="1")
aln1 |
A |
aln2 |
A |
windowSize |
The size of the sliding window (in nucleotides) for calculating local conservation in the alignment. This should be an odd value. |
which |
The conservation profile of Which sequence in the alignments is computed. It can be "1" or "2". |
A numeric
vector with the same length of alignment is returned.
Ge Tan
The functions to initialize, store matrix or delete matrix in JASPAR database.
## S4 method for signature 'character' deleteMatrixHavingID(x, IDs) ## S4 method for signature 'SQLiteConnection' deleteMatrixHavingID(x, IDs) ## S4 method for signature 'JASPAR2014' deleteMatrixHavingID(x, IDs) ## S4 method for signature 'character,PFMatrixList' storeMatrix(x, pfmList) ## S4 method for signature 'SQLiteConnection,PFMatrixList' storeMatrix(x, pfmList) ## S4 method for signature 'JASPAR2014,PFMatrixList' storeMatrix(x, pfmList) ## S4 method for signature 'character,PFMatrix' storeMatrix(x, pfmList) ## S4 method for signature 'SQLiteConnection,PFMatrix' storeMatrix(x, pfmList) ## S4 method for signature 'JASPAR2014,PFMatrix' storeMatrix(x, pfmList) ## S4 method for signature 'SQLiteConnection' initializeJASPARDB(x, version=c("2014", "2016", "2018", "2020", "2022")) ## S4 method for signature 'character' initializeJASPARDB(x, version=c("2014", "2016", "2018", "2020", "2022")) ## S4 method for signature 'JASPAR2014' initializeJASPARDB(x, version) ## S4 method for signature 'JASPAR2016' initializeJASPARDB(x, version) ## S4 method for signature 'JASPAR2018' initializeJASPARDB(x, version) ## S4 method for signature 'JASPAR2020' initializeJASPARDB(x, version)
## S4 method for signature 'character' deleteMatrixHavingID(x, IDs) ## S4 method for signature 'SQLiteConnection' deleteMatrixHavingID(x, IDs) ## S4 method for signature 'JASPAR2014' deleteMatrixHavingID(x, IDs) ## S4 method for signature 'character,PFMatrixList' storeMatrix(x, pfmList) ## S4 method for signature 'SQLiteConnection,PFMatrixList' storeMatrix(x, pfmList) ## S4 method for signature 'JASPAR2014,PFMatrixList' storeMatrix(x, pfmList) ## S4 method for signature 'character,PFMatrix' storeMatrix(x, pfmList) ## S4 method for signature 'SQLiteConnection,PFMatrix' storeMatrix(x, pfmList) ## S4 method for signature 'JASPAR2014,PFMatrix' storeMatrix(x, pfmList) ## S4 method for signature 'SQLiteConnection' initializeJASPARDB(x, version=c("2014", "2016", "2018", "2020", "2022")) ## S4 method for signature 'character' initializeJASPARDB(x, version=c("2014", "2016", "2018", "2020", "2022")) ## S4 method for signature 'JASPAR2014' initializeJASPARDB(x, version) ## S4 method for signature 'JASPAR2016' initializeJASPARDB(x, version) ## S4 method for signature 'JASPAR2018' initializeJASPARDB(x, version) ## S4 method for signature 'JASPAR2020' initializeJASPARDB(x, version)
x |
A character vector of length 1 for the path of JASPAR SQLite file,
or a |
IDs |
JASPAR stable IDs. |
pfmList |
The PFMatrixList object, or pfm object. |
version |
Which version of JASPAR to create. So far, it supports 2014, 2016 and 2018. |
If the operation works, a "success" will be returned.
initializeJASPARDB("jaspar.sqlite", version="2014") data("MA0043") storeMatrix("jaspar.sqlite", MA0043) deleteMatrixHavingID("jaspar.sqlite","MA0043.1") file.remove("jaspar.sqlite")
initializeJASPARDB("jaspar.sqlite", version="2014") data("MA0043") storeMatrix("jaspar.sqlite", MA0043) deleteMatrixHavingID("jaspar.sqlite","MA0043.1") file.remove("jaspar.sqlite")
This function trains the Dirichlet multinomial mixture models parameters for a set of profile matrices.
dmmEM(x, K=6, alg=c("C", "R"))
dmmEM(x, K=6, alg=c("C", "R"))
x |
x can be a |
K |
The maximal number of components to test in the mixture model when alg is "C". Then an optimal number of components between 1 and K will be chosen based on the fitness of the model. The fixed number of components to use when alg is "R". The default is 6. |
alg |
The algorithm to use.
"C" uses the implementation from |
When using the implementation from DirichletMultinomial
package,
the final number of components can be 1:K.
An internal selection will be made based on the maximum likelihood.
When using the implementation of R, the number of component is fixed to K.
A list of trainned alpha0, pmix and likelihood during the training.
signature(x = "ANY")
signature(x = "matrix")
signature(x = "PFMatrixList")
Ge Tan
data(MA0003.2) data(MA0004.1) pfmList <- PFMatrixList(pfm1=MA0003.2, pfm2=MA0004.1, use.names=TRUE) dmmParameters <- dmmEM(pfmList, K=6, alg="C")
data(MA0003.2) data(MA0004.1) pfmList <- PFMatrixList(pfm1=MA0003.2, pfm2=MA0004.1, use.names=TRUE) dmmParameters <- dmmEM(pfmList, K=6, alg="C")
This function accesses the emission distribution parameters of the TFFM.
getEmissionProb(tffm)
getEmissionProb(tffm)
tffm |
A |
This function accesses the emission distribution parameters for each position of the TFFM. It returns the probability of emitting certain nucleotide based on the nucleotide on the previous site.
A matrix of numeric
with dimensions of 16 * ncol(tffm)
.
Ge Tan
xmlFirst <- file.path(system.file("extdata", package="TFBSTools"), "tffm_first_order.xml") tffmFirst <- readXMLTFFM(xmlFirst, type="First") getEmissionProb(tffmFirst) xmlDetail <- file.path(system.file("extdata", package="TFBSTools"), "tffm_detailed.xml") tffmDetail <- readXMLTFFM(xmlDetail, type="Detail") getEmissionProb(tffmDetail)
xmlFirst <- file.path(system.file("extdata", package="TFBSTools"), "tffm_first_order.xml") tffmFirst <- readXMLTFFM(xmlFirst, type="First") getEmissionProb(tffmFirst) xmlDetail <- file.path(system.file("extdata", package="TFBSTools"), "tffm_detailed.xml") tffmDetail <- readXMLTFFM(xmlDetail, type="Detail") getEmissionProb(tffmDetail)
getMatrixByID
,
getMatrixByName
This method fetches matrix data under the given ID or name from the database and returns a XMatrix object.
## S4 method for signature 'character' getMatrixByID(x, ID) ## S4 method for signature 'SQLiteConnection' getMatrixByID(x, ID) ## S4 method for signature 'JASPAR2014' getMatrixByID(x, ID) ## S4 method for signature 'character' getMatrixByName(x, name) ## S4 method for signature 'SQLiteConnection' getMatrixByName(x, name) ## S4 method for signature 'JASPAR2014' getMatrixByName(x, name)
## S4 method for signature 'character' getMatrixByID(x, ID) ## S4 method for signature 'SQLiteConnection' getMatrixByID(x, ID) ## S4 method for signature 'JASPAR2014' getMatrixByID(x, ID) ## S4 method for signature 'character' getMatrixByName(x, name) ## S4 method for signature 'SQLiteConnection' getMatrixByName(x, name) ## S4 method for signature 'JASPAR2014' getMatrixByName(x, name)
x |
|
ID |
|
name |
|
For getMatrixByID, ID is a string which refers to the stable JASPAR ID (usually something like "MA0001") with or without version numbers. "MA0001" will give the latest version on MA0001, while "MA0001.2" will give the second version, if existing.
For getMatrixByName, according to the current JASPAR data model, name is not necessarily a unique identifier. Also, names change over time. In the case where there are several matrices with the same name in the database, the function fetches the first one and prints a warning. You've been warned. Some matrices have multiple versions. The function will return the latest version. For specific versions, use getMatrixByID(ID.version)
A PFMMatrix object is returned when input ID or name is length 1.
Otherwise, PFMMatrixList
is returned.
Ge Tan
library(JASPAR2014) db <- file.path(system.file("extdata", package="JASPAR2014"), "JASPAR2014.sqlite") ## character and ID pfm <- getMatrixByID(db, ID="MA0003") ## character and IDs pfmList <- getMatrixByID(db, ID=c("MA0003", "MA0004")) ## character and name pfm <- getMatrixByName(db, name="TFAP2A") ## ## character and name pfmList <- getMatrixByName(db, name=c("TFAP2A", "Arnt")) ## JASPAR2014 and ID pfm <- getMatrixByID(JASPAR2014, ID="MA0003")
library(JASPAR2014) db <- file.path(system.file("extdata", package="JASPAR2014"), "JASPAR2014.sqlite") ## character and ID pfm <- getMatrixByID(db, ID="MA0003") ## character and IDs pfmList <- getMatrixByID(db, ID=c("MA0003", "MA0004")) ## character and name pfm <- getMatrixByName(db, name="TFAP2A") ## ## character and name pfmList <- getMatrixByName(db, name=c("TFAP2A", "Arnt")) ## JASPAR2014 and ID pfm <- getMatrixByID(JASPAR2014, ID="MA0003")
get_MatrixSet
This function fetches matrix data for all matrices in the database matching criteria defined by the named arguments and returns a PFMatrixList object
## S4 method for signature 'character' getMatrixSet(x, opts) ## S4 method for signature 'SQLiteConnection' getMatrixSet(x, opts) ## S4 method for signature 'JASPAR2014' getMatrixSet(x, opts)
## S4 method for signature 'character' getMatrixSet(x, opts) ## S4 method for signature 'SQLiteConnection' getMatrixSet(x, opts) ## S4 method for signature 'JASPAR2014' getMatrixSet(x, opts)
x |
a character vector of length 1 for the path of JASPAR SQLite file,
a |
opts |
a search options list. See more details below. |
The search options include three categories:
(1) Database basic criterias:
all=c(TRUE, FALSE)
ID
: a unique identifier for each model.
CORE matrices always have a "MAnnnnIDs.Version".
name
: The name of the transcription factor.
As far as possible, the name is based on the standardized
Entrez gene symbols.
In the case the model describes a transcription factor hetero-dimer,
two names are concatenated, such as RXR-VDR.
In a few cases, different splice forms of the same gene
have different binding specificity: in this case
the splice form information is added to the name,
based on the relevant literature.
collection=c("CORE", "CNE", "PHYLOFACTS", "SPLICE",
"POLII", "FAM", "PBM", "PBM_HOMEO", "PBM_HLH", "UNVALIDATED"
all_versions=c(FALSE,TRUE)
: We constantly update the profiles in
JASPAR. Some profiles may have multiple versions.
By default, only the latest version will be returned.
species
: The species source for the sequences, in Latin (Homo sapiens)
or NCBI tax IDs (9606).
matrixtype=c("PFM", "PWM", "ICM")
(2) Tags based criterias:
class
: Structural class of the transcription factor,
based on the TFCaT system.
Examples: "Zipper-Type"", "Helix-Turn-Helix", etc.
type
: Methodology used for matrix construction:
"SELEX", "ChIP-seq", "PBM", etc.
tax_group
: Group of species,
currently consisting of "plants", "vertebrates", "insects",
"urochordat", "nematodes", "fungi".
family
: Structural sub-class of the transcription factor,
based on the TFCaT system.
Acc
: A representative protein accession number in Genbank
for the transcription factor.
Human takes precedence if several exists.
medline
: relevant publication reporting the sites
used in the mode building.
Pazar_tf_id
: PAZAR database id.
(3) Further criterias:
min_ic
(minimum total information content of the matrix)
length
(minimum sites length)
sites
(minimum average sites number per base)
When all
is TRUE
, it will get all the matrices
and has higher priority over other options.
Then ID has the second highest priority,
and will ignore all the followiing options.
The rest options are combined in search with AND,
while multiple elements under one options have the logical operator OR.
A PFMatrixList
object.
Ge Tan
getMatrixByID
,
getMatrixByName
library(JASPAR2014) db <- file.path(system.file("extdata", package="JASPAR2014"), "JASPAR2014.sqlite") opts <- list() opts[["species"]] <- 9606 opts[["type"]] <- "SELEX" opts[["all_versions"]] <- FALSE siteList <- getMatrixSet(db, opts) siteList2 <- getMatrixSet(JASPAR2014, opts)
library(JASPAR2014) db <- file.path(system.file("extdata", package="JASPAR2014"), "JASPAR2014.sqlite") opts <- list() opts[["species"]] <- 9606 opts[["type"]] <- "SELEX" opts[["all_versions"]] <- FALSE siteList <- getMatrixSet(db, opts) siteList2 <- getMatrixSet(JASPAR2014, opts)
Get the emission probabilities of ACGT at each position of TFFM.
getPosProb(tffm)
getPosProb(tffm)
tffm |
A |
This function calculates the probabilities of emitting nucleotides ACGT at each position of TFFM.
A matrix of numeric
with dimensions of 4 * ncol(tffm)
.
Ge Tan
xmlFirst <- file.path(system.file("extdata", package="TFBSTools"), "tffm_first_order.xml") tffmFirst <- readXMLTFFM(xmlFirst, type="First") getPosProb(tffmFirst) xmlDetail <- file.path(system.file("extdata", package="TFBSTools"), "tffm_detailed.xml") tffmDetail <- readXMLTFFM(xmlDetail, type="Detail") getPosProb(tffmDetail)
xmlFirst <- file.path(system.file("extdata", package="TFBSTools"), "tffm_first_order.xml") tffmFirst <- readXMLTFFM(xmlFirst, type="First") getPosProb(tffmFirst) xmlDetail <- file.path(system.file("extdata", package="TFBSTools"), "tffm_detailed.xml") tffmDetail <- readXMLTFFM(xmlDetail, type="Detail") getPosProb(tffmDetail)
Convert a IUPAC string into a Postion Weight Matirx
IUPAC2Matrix(x)
IUPAC2Matrix(x)
x |
The IUPAC string. |
The mapping between IUPAC Extended Genetic Alphabet and the DNA bases letters
are from IUPAC_CODE_MAP
in Biostrings package.
A matrix with position weight.
Ge Tan
x <- "RMGNV" IUPAC2Matrix(x)
x <- "RMGNV" IUPAC2Matrix(x)
Some example PFM matrices from JASPAR 2014.
data(MA0004.1) data(MA0003.2) data(MA0048) data(MA0043)
data(MA0004.1) data(MA0003.2) data(MA0048) data(MA0043)
The format is: PFMatrix
object.
Some examples PFM matrices from JASPAR 2014.
The PFMatrix
object.
http://jaspar.genereg.net/
data(MA0004.1) data(MA0003.2) data(MA0048) data(MA0043)
data(MA0004.1) data(MA0003.2) data(MA0048) data(MA0043)
On JASPAR web service, "FlatFileDir" includes all the *.pfm and a matrix_list.txt file
makeFlatFileDir(JASPAR)
makeFlatFileDir(JASPAR)
JASPAR |
A JASPAR object. Now it can be JASPAR2014 or JASPAR2016. |
The matrix_list.txt file contains each pfm per line. Each line has the ID, total information content, name, class and tags of one pfm.
The generated files are under "FlatFileDir" directory.
Ge Tan
library(JASPAR2014) makeFlatFileDir(JASPAR2014) unlink("FlatFileDir", recursive = TRUE)
library(JASPAR2014) makeFlatFileDir(JASPAR2014) unlink("FlatFileDir", recursive = TRUE)
"MotifSet"
This MotifSet object is a container for storing the generated motifs from Motif identification softwares, such as MEME.
## Constructor MotifSet(motifList=GRangesList(), motifEvalues=numeric(), subjectSeqs=DNAStringSet())
## Constructor MotifSet(motifList=GRangesList(), motifEvalues=numeric(), subjectSeqs=DNAStringSet())
motifList |
A GRangesList. Each GRanges store the starts, ends, strand, seqnames and scores information of one motif sites sequences. |
motifEvalues |
A numeric vector of the E values generated from MEME for each motif. |
subjectSeqs |
A DNAStringSet object. It stores the original sequences which are scanned by the software. |
A MotifSet
object is returned.
signature(x = "MotifSet")
: Getter
signature(x = "MotifSet")
(x, as.prob = FALSE,
shift = 0L, width = NULL, ...):
Calculate the consensus matrix. Other arguments,
please check the consensusMatrix
in Biostrings
package.
signature(x = "MotifSet")
:
Returns the number of motifs.
signature(x = "MotifSet")
(x, n=10L, type="none"):
Gets the sites sequences.
n
is the number of bases to include from flanking region.
type
controls "all", "left", "right" or "none"
flanking sequences are included.
Ge Tan
## Not run: motifSet <- runMEME(file.path(system.file("extdata", package="TFBSTools"), "crp0.s"), binary="/usr/local/Cellar/meme/4.10.1/bin/meme", arguments=list("-nmotifs"=3)) sitesSeq(motifSet, type="all") sitesSeq(motifSet, type="none") consensusMatrix(motifSet) ## End(Not run)
## Not run: motifSet <- runMEME(file.path(system.file("extdata", package="TFBSTools"), "crp0.s"), binary="/usr/local/Cellar/meme/4.10.1/bin/meme", arguments=list("-nmotifs"=3)) sitesSeq(motifSet, type="all") sitesSeq(motifSet, type="none") consensusMatrix(motifSet) ## End(Not run)
Parse the output file from “MEME”.
parseMEMEOutput(x)
parseMEMEOutput(x)
x |
|
A list of motifs and evalues is returned.
Ge Tan
memeOutput <- file.path(system.file("extdata", package="TFBSTools"), "meme.output") parseMEMEOutput(memeOutput)
memeOutput <- file.path(system.file("extdata", package="TFBSTools"), "meme.output") parseMEMEOutput(memeOutput)
This method simply shuffles the columns in matrices. This can either be done by just shuffling columns within each selected matrix, or by shuffling columns almong all selected matrices.
permuteMatrix(x, type="intra")
permuteMatrix(x, type="intra")
x |
A |
type |
The type of shuffling. It can be "intra" or "inter", which shuffle within each matrix, or between all the matrix. |
A object with shuffled matrix.
Ge Tan
data("MA0043") pfmSubject <- MA0043 data("MA0048") pfmQuery <- MA0048 #opts = list() #opts[["class"]] = "Ig-fold" #pfmList = getMatrixSet(JASPAR2014, opts) pfmList <- PFMatrixList(pfmSubject, pfmQuery) foo = permuteMatrix(pfmQuery) foo1 = permuteMatrix(pfmList, type="intra") foo2 = permuteMatrix(pfmList, type="inter")
data("MA0043") pfmSubject <- MA0043 data("MA0048") pfmQuery <- MA0048 #opts = list() #opts[["class"]] = "Ig-fold" #pfmList = getMatrixSet(JASPAR2014, opts) pfmList <- PFMatrixList(pfmSubject, pfmQuery) foo = permuteMatrix(pfmQuery) foo1 = permuteMatrix(pfmList, type="intra") foo2 = permuteMatrix(pfmList, type="inter")
Given a PFMatrix
or a normal matrix,
align it with another set of PFMatrix
to assess the similarity.
PFMSimilarity(pfmSubject, pfmQuery, openPenalty=3, extPenalty=0.01)
PFMSimilarity(pfmSubject, pfmQuery, openPenalty=3, extPenalty=0.01)
pfmSubject |
A |
pfmQuery |
A |
openPenalty |
The gap open penalty used in the modified Needleman-Wunsch algorithm. By default, it is 3. |
extPenalty |
The gap extension penalty used in the modified Needleman-Wunsch algorithm. By default, it is 0.01. |
For each pfmSubject
, an absolute score and
a relative percentage score is returned.
The maximum absolute score is 2*the width of the smaller matrix
in the comparison pair.
Ge Tan
Sandelin, A., H glund, A., Lenhard, B., & Wasserman, W. W. (2003). Integrated analysis of yeast regulatory sequences for biologically linked clusters of genes. Functional & Integrative Genomics, 3(3), 125-134. doi:10.1007/s10142-003-0086-6
library(Biostrings) library(JASPAR2016) ## Example matrix from JASPAR database profileMatrix <- matrix(as.integer( c(13, 13, 3, 1, 54, 1, 1, 1, 0, 3, 2, 5, 13, 39, 5, 53, 0, 1, 50, 1, 0, 37, 0, 17, 17, 2, 37, 0, 0, 52, 3, 0, 53, 8, 37, 12, 11, 0, 9, 0, 0, 0, 0, 52, 1, 6, 15, 20)), nrow=4, byrow=TRUE, dimnames=list(DNA_BASES)) pfmQuery <- PFMatrix(profileMatrix=profileMatrix) pfmSubjects <- getMatrixSet(JASPAR2016, opts=list(ID=c("MA0500", "MA0499", "MA0521", "MA0697", "MA0048", "MA0751", "MA0832"))) PFMSimilarity(pfmSubjects, pfmQuery)
library(Biostrings) library(JASPAR2016) ## Example matrix from JASPAR database profileMatrix <- matrix(as.integer( c(13, 13, 3, 1, 54, 1, 1, 1, 0, 3, 2, 5, 13, 39, 5, 53, 0, 1, 50, 1, 0, 37, 0, 17, 17, 2, 37, 0, 0, 52, 3, 0, 53, 8, 37, 12, 11, 0, 9, 0, 0, 0, 0, 52, 1, 6, 15, 20)), nrow=4, byrow=TRUE, dimnames=list(DNA_BASES)) pfmQuery <- PFMatrix(profileMatrix=profileMatrix) pfmSubjects <- getMatrixSet(JASPAR2016, opts=list(ID=c("MA0500", "MA0499", "MA0521", "MA0697", "MA0048", "MA0751", "MA0832"))) PFMSimilarity(pfmSubjects, pfmQuery)
This function measures the similarity of two PWM matrix in three measurements: "normalised Euclidean distance", "Pearson correlation" and "Kullback Leibler divergence".
PWMSimilarity(pwmSubject, pwmQuery, method=c("Euclidean", "Pearson", "KL"))
PWMSimilarity(pwmSubject, pwmQuery, method=c("Euclidean", "Pearson", "KL"))
pwmSubject |
A |
pwmQuery |
A |
method |
The method can be "Euclidean", "Pearson", "KL". |
When pwmSubject and pwmQuery have different number of columns, the smaller PWM will be shifted from the start position of larger PWM and compare all the possible alignments. Only the smallest distance, divergence or largest correlation will be reported.
A numeric
value is returned.
signature(pwmSubject = "matrix", pwmQuery = "matrix")
signature(pwmSubject = "matrix", pwmQuery = "PWMatrix")
signature(pwmSubject = "PWMatrix", pwmQuery = "matrix")
signature(pwmSubject = "PWMatrix", pwmQuery = "PWMatrix")
signature(pwmSubject = "PWMatrixList", pwmQuery = "matrix")
signature(pwmSubject = "PWMatrixList", pwmQuery = "PWMatrix")
signature(pwmSubject = "PWMatrixList", pwmQuery = "PWMatrixList")
Linhart, C., Halperin, Y., & Shamir, R. (2008). Transcription factor and microRNA motif discovery: The Amadeus platform and a compendium of metazoan target sets. Genome Research, 18(7), 1180-1189. doi:10.1101/gr.076117.108
data(MA0003.2) data(MA0004.1) pwm1 = toPWM(MA0003.2, type="prob") pwm2 = toPWM(MA0004.1, type="prob") PWMSimilarity(pwm1, pwm2, method="Euclidean")
data(MA0003.2) data(MA0004.1) pwm1 = toPWM(MA0003.2, type="prob") pwm2 = toPWM(MA0004.1, type="prob") PWMSimilarity(pwm1, pwm2, method="Euclidean")
Read a JASPAR format matrix file with ‘individual’ matrix or ‘all’ matrices in one file.
readJASPARMatrix(fn, matrixClass=c("PFM", "PWM", "PWMProb"))
readJASPARMatrix(fn, matrixClass=c("PFM", "PWM", "PWMProb"))
fn |
|
matrixClass |
|
An example of ‘individual’ format matrix file is available at http://jaspar.genereg.net/html/DOWNLOAD/JASPAR_CORE/pfm/individual/MA0001.1.pfm
An exmaple of ‘all’ format matrix file is available at http://jaspar.genereg.net/html/DOWNLOAD/JASPAR_CORE/pfm/nonredundant/pfm_all.txt
A PFMatrixList
or PWMatrixList
object is returned, depending on the
matrix class.
Ge Tan
fn <- file.path(system.file("extdata", package="TFBSTools"), "MA0001.1.pfm") readJASPARMatrix(fn, matrixClass="PFM") fn <- file.path(system.file("extdata", package="TFBSTools"), "pfm_all_example.txt") readJASPARMatrix(fn, matrixClass="PFM")
fn <- file.path(system.file("extdata", package="TFBSTools"), "MA0001.1.pfm") readJASPARMatrix(fn, matrixClass="PFM") fn <- file.path(system.file("extdata", package="TFBSTools"), "pfm_all_example.txt") readJASPARMatrix(fn, matrixClass="PFM")
Read the ouput xml files from Puython module "TFFM" into R.
readXMLTFFM(fn, type=c("First", "Detail"))
readXMLTFFM(fn, type=c("First", "Detail"))
fn |
The path of xml file. |
type |
The type of xml file. It can be one of the two types of xml files, "First" or "Detail". |
A TFFMFirst
object or a TFFMDetail
object is returned.
Ge Tan
xmlFirst <- file.path(system.file("extdata", package="TFBSTools"), "tffm_first_order.xml") tffmFirst <- readXMLTFFM(xmlFirst, type="First")
xmlFirst <- file.path(system.file("extdata", package="TFBSTools"), "tffm_first_order.xml") tffmFirst <- readXMLTFFM(xmlFirst, type="First")
This function samples matrices from trainned Dirichlet mixture model based on selected matrices.
rPWMDmm(x, alpha0, pmix, N=1, W=6)
rPWMDmm(x, alpha0, pmix, N=1, W=6)
x |
x can be a |
alpha0 |
The trained Dirichlet mixture parameters. |
pmix |
The trained mixing proportions of the components. |
N |
The number of matrices to sample. |
W |
The desired width of matrice from the sampling. |
This feature enables the users to generate random Position Frequency Matrices (PFMs) from selected profiles.
We assume that each column in the profile is independent and described by a mixture of Dirichlet multinomials in which the letters are drawn from a multinomial and the multinomial parameters are drawn from a mixture of Dirichlets. Within this model each column has its own set of multinomial parameters but the higher level parameters – those of the mixture prior is assumed to be common to all Jaspar matrices. We can therefore use a maximum likelihood approach to learn these from the observed column counts of all Jaspar matrices. The maximum likelihood approach automatically ensures that matrices receive a weight relative to the number of counts it contains.
Drawing samples from the prior distribution will generate PWMs with the same statistical properties as the Jaspar matrices as a whole. PWMs with statistical properties like those of the selected profiles can be obtained by drawing from a posterior distribution which is proportional to the prior times a multinomial likelihood term with counts taken from one of the columns of the selected profiles.
Each 4-dimensional column is sampled by the following three-step procedure: 1. draw the mixture component according to the distribution of mixing proportions, 2. draw an input column randomly from the concatenated selected profiles and 3. draw the probability vector over nucleotides from a 4-dimensional Dirichlet distribution. The parameter vector alpha of the Dirichlet is equal to the sum of the count (of the drawn input) and the parameters of the Dirichlet prior (of the drawn component).
Draws from a Dirichlet can be obtained in the following way from Gamma distributed samples: (X1,X2,X3,X4) = (Y1/V,Y2/V,Y3/V,Y4/V) ~ Dir(a1,a2,a3,a4) where V = sum(Yi) ~ Gamma(shape = sum(ai), scale = 1).
A list of matrices from the sampling.
signature(x = "PFMatrix")
signature(x = "matrix")
signature(x = "PFMatrixList")
This code is based on the Matlab code original written by Ole Winther, binf.ku.dk, June 2006.
Ge Tan
L. Devroye, "Non-Uniform Random Variate Generation", Springer-Verlag, 1986
Kimura, T., Tokuda, T., Nakada, Y., Nokajima, T., Matsumoto, T., & Doucet, A. (2011). Expectation-maximization algorithms for inference in Dirichlet processes mixture. Pattern Analysis and Applications, 16(1), 55-67. doi:10.1007/s10044-011-0256-4
data(MA0003.2) data(MA0004.1) pfmList <- PFMatrixList(pfm1=MA0003.2, pfm2=MA0004.1, use.names=TRUE) dmmParameters <- dmmEM(pfmList, 6) rPWMDmm(MA0003.2, dmmParameters$alpha0, dmmParameters$pmix, N=1, W=6)
data(MA0003.2) data(MA0004.1) pfmList <- PFMatrixList(pfm1=MA0003.2, pfm2=MA0004.1, use.names=TRUE) dmmParameters <- dmmEM(pfmList, 6) rPWMDmm(MA0003.2, dmmParameters$alpha0, dmmParameters$pmix, N=1, W=6)
This function builds position frequency matrices using an external program MEME written by Bailey and Elkan.
## S4 method for signature 'character' runMEME(x, binary="meme", seqtype="DNA", arguments=list(), tmpdir=tempdir()) ## S4 method for signature 'DNAStringSet' runMEME(x, binary="meme", seqtype="DNA", arguments=list(), tmpdir=tempdir())
## S4 method for signature 'character' runMEME(x, binary="meme", seqtype="DNA", arguments=list(), tmpdir=tempdir()) ## S4 method for signature 'DNAStringSet' runMEME(x, binary="meme", seqtype="DNA", arguments=list(), tmpdir=tempdir())
x |
A |
binary |
|
seqtype |
The sequence type. "AA" and "DNA" are allowed. |
arguments |
A |
tmpdir |
A |
A MotifSet
object is returned.
This wrapper is tested on “MEME” 4.10.1 and 4.12.0.
Ge Tan
Bailey, T. L., Boden, M., Buske, F. A., Frith, M., Grant, C. E., Clementi, L., et al. (2009). MEME SUITE: tools for motif discovery and searching. Nucleic acids research, 37(Web Server issue), W202-8. doi:10.1093/nar/gkp335
## Not run: motifSet <- runMEME(file.path(system.file("extdata", package="TFBSTools"), "crp0.s"), binary="/usr/local/Cellar/meme/4.10.1/bin/meme", arguments=list("-nmotifs"=3)) ## Get the site sequences sitesSeq(motifSet, type="all") sitesSeq(motifSet, type="none") ## Get the consensu matrix, then it can be used as a PFMatrix consensusMatrix(motifSet) ## End(Not run)
## Not run: motifSet <- runMEME(file.path(system.file("extdata", package="TFBSTools"), "crp0.s"), binary="/usr/local/Cellar/meme/4.10.1/bin/meme", arguments=list("-nmotifs"=3)) ## Get the site sequences sitesSeq(motifSet, type="all") sitesSeq(motifSet, type="none") ## Get the consensu matrix, then it can be used as a PFMatrix consensusMatrix(motifSet) ## End(Not run)
Sample ranges with same widths of input rannges from a set of subject ranges.
sampleRanges(inputGRanges, subjectGRanges, ignore.strand=TRUE)
sampleRanges(inputGRanges, subjectGRanges, ignore.strand=TRUE)
inputGRanges |
The input |
subjectGRanges |
The subject |
ignore.strand |
When set to |
A GRanges
object with the same length and widths of
inputGRanges
.
Ge Tan
library(GenomicRanges) inputGRanges <- GRanges(seqnames=c("chr1", "chr2"), range=IRanges(start=c(2L, 10L), end=c(6L, 15L)), strand=c("+", "-")) subjectGRanges <- GRanges( seqnames=c("chr1", "chr1", "chr1", "chr1", "chr2", "chr2"), ranges=IRanges(start=c(20L, 20L, 30L, 30L, 7L, 25L), end=c(50L, 50L, 32L, 32L,9L, 55L)), strand=c("+","-", "+", "-", "+","-")) set.seed(16) sampleRanges(inputGRanges, subjectGRanges, ignore.strand=TRUE) sampleRanges(inputGRanges, subjectGRanges, ignore.strand=FALSE)
library(GenomicRanges) inputGRanges <- GRanges(seqnames=c("chr1", "chr2"), range=IRanges(start=c(2L, 10L), end=c(6L, 15L)), strand=c("+", "-")) subjectGRanges <- GRanges( seqnames=c("chr1", "chr1", "chr1", "chr1", "chr2", "chr2"), ranges=IRanges(start=c(20L, 20L, 30L, 30L, 7L, 25L), end=c(50L, 50L, 32L, 32L,9L, 55L)), strand=c("+","-", "+", "-", "+","-")) set.seed(16) sampleRanges(inputGRanges, subjectGRanges, ignore.strand=TRUE) sampleRanges(inputGRanges, subjectGRanges, ignore.strand=FALSE)
Scans a pairwise alignment of nucleotide sequences with the pattern represented by the PWMatrix. It reports only those hits that are overlapped in the alignment of of the two sequences and exceed a specified threshold score in both, AND are found in regions of the alignment above the specified conservation cutoff value.
searchAln(pwm, aln1, aln2, seqname1="Unknown1", seqname2="Unknown2", min.score="80%", windowSize=51L, cutoff=0.7, strand="*", type="any", conservation=NULL, mc.cores=1L)
searchAln(pwm, aln1, aln2, seqname1="Unknown1", seqname2="Unknown2", min.score="80%", windowSize=51L, cutoff=0.7, strand="*", type="any", conservation=NULL, mc.cores=1L)
pwm |
A PWMatrix object or a PWMatrixList object. |
aln1 |
A |
aln2 |
A |
seqname1 , seqname2
|
A |
min.score |
The minimum score for the hit. Can be given an character string in the format of "80%" or as a single absolute value. When it is percentage value, it means the percentage of the maximal possible from the PWM. |
windowSize |
The size of the sliding window (in nucleotides) for calculating local conservation in the alignment. This should be an odd value. |
cutoff |
The conservation cutoff can be from 0 (0% identity) to 1 (100% identity).
The regions which have lower conservation than the cutoff
will be discarded from the results of the pattern searching.
The conservation is calculated by comparing the alignments
within the |
strand |
When searching the alignment, we can search the positive strand or negative strand. While strand is "*", it will search both strands and return the results based on the positvie strand coordinate. |
type |
This argument can be "any" or "all". When it is "any", one motif will be kept if the maximal conservation value of the motif is larger than the cutoff. When it is "all", one motif will be kept if the minimal conservation value of the motif is larger than the cutoff. |
conservation |
A vector of conservation profile. If not supplied, the conservation profile will be computed internally on the fly. |
mc.cores |
The number of cpu threads to use when searching |
In brief, given a pairwise alignment of two sequences,
first of all, we remove the gaps ("-", "-", ".").
Then we scan both ungapped sequences with the pwm and
return the hits that above min.score
.
Since we only want to keep the conserved hits,
we choose the pair of motifs that overlap most in the alignment.
Finally, the pair of motifs have to be conserved above
the threshold cutoff
.
In the returned SitePairSet
, the coordinates of start, end are based
on the ungapped sequences, instead of the original alignment.
This is due to we are more concerned about the actual location of motif
in the genome rather than in the alignment.
A SitePairSet
object is returned when pwm is a PWMatrix
,
while a SitePairSetList
is returned when pwm is a PWMatrixList
.
Ge Tan
data(MA0003.2) data(MA0004.1) pwm1 <- toPWM(MA0003.2) pwm2 <- toPWM(MA0004.1) pwmList <- PWMatrixList(pwm1=pwm1, pwm2=pwm2) # Two character objects aln1 <- "ACCACATTGCCTCAGGGCAGGTAAGTTGATC---AAAGG---AAACGCAAAGTTTTCAAG" aln2 <- "GTTTCACTACATTGCTTCAGGGCAGTAAATATATAAATATATAAAAATATAATTTTCATC" aln <- c(aln1=aln1, aln2=aln2) library(Biostrings) alnDNAStringSet <- DNAStringSet(c(aln1=aln1, aln2=aln2)) # PWMatrix, character, character ## Only scan the positive strand of the alignments sitePairSet <- searchAln(pwm1, aln1, aln2, seqname1="aln1", seqname2="aln2", min.score="70%", cutoff=0.5, strand="+", type="any") ## Only scan the negative strand of the alignments sitePairSet <- searchAln(pwm1, aln1, aln2, seqname1="aln1", seqname2="aln2", min.score="70%", cutoff=0.5, strand="-", type="any") ## Scan the both strands of the alignments sitePairSet <- searchAln(pwm1, aln1, aln2, seqname1="aln1", seqname2="aln2", min.score="70%", cutoff=0.5, strand="*", type="any") ## Convert the SitePairSet object into other R objects as(sitePairSet, "data.frame") as.data.frame(sitePairSet) as(sitePairSet, "DataFrame") as(sitePairSet, "GRanges") writeGFF3(sitePairSet) writeGFF2(sitePairSet) # PWMatrix, character, missing sitePairSet <- searchAln(pwm1, aln, min.score="70%", cutoff=0.5, strand="*", type="any") # PWMatrix, DNAString, DNAString sitePairSet <- searchAln(pwm1, DNAString(aln1), DNAString(aln2), seqname1="aln1", seqname2="aln2", min.score="70%", cutoff=0.5, strand="*", type="any") # PWMatrix, DNAStringSet, missing sitePairSet <- searchAln(pwm1, alnDNAStringSet, min.score="70%", cutoff=0.5, strand="*", type="any") # PWMatrixList, character, character sitePairSetList <- searchAln(pwmList, aln1, aln2, seqname1="aln1", seqname2="aln2", min.score="70%", cutoff=0.5, strand="*", type="any") ## elementLenths of each pwm hits elementNROWS(sitePairSetList) ## output writeGFF2(sitePairSetList) writeGFF3(sitePairSetList) as(sitePairSetList, "DataFrame") as(sitePairSetList, "data.frame") as.data.frame(sitePairSetList) as(sitePairSetList, "GRanges") # PWMatrix, Axt, missing library(CNEr) axtFilesHg19DanRer7 <- file.path(system.file("extdata", package="TFBSTools"), "hg19.danRer7.net.axt") axtHg19DanRer7 <- readAxt(axtFilesHg19DanRer7) sitePairSetList <- searchAln(pwm1, axtHg19DanRer7, min.score="80%", windowSize=51L, cutoff=0.7, strand="*", type="any", conservation=NULL, mc.cores=1) ## We may want to coordinates of motif in the genome GRangesTFBS <- toGRangesList(sitePairSetList, axtHg19DanRer7)
data(MA0003.2) data(MA0004.1) pwm1 <- toPWM(MA0003.2) pwm2 <- toPWM(MA0004.1) pwmList <- PWMatrixList(pwm1=pwm1, pwm2=pwm2) # Two character objects aln1 <- "ACCACATTGCCTCAGGGCAGGTAAGTTGATC---AAAGG---AAACGCAAAGTTTTCAAG" aln2 <- "GTTTCACTACATTGCTTCAGGGCAGTAAATATATAAATATATAAAAATATAATTTTCATC" aln <- c(aln1=aln1, aln2=aln2) library(Biostrings) alnDNAStringSet <- DNAStringSet(c(aln1=aln1, aln2=aln2)) # PWMatrix, character, character ## Only scan the positive strand of the alignments sitePairSet <- searchAln(pwm1, aln1, aln2, seqname1="aln1", seqname2="aln2", min.score="70%", cutoff=0.5, strand="+", type="any") ## Only scan the negative strand of the alignments sitePairSet <- searchAln(pwm1, aln1, aln2, seqname1="aln1", seqname2="aln2", min.score="70%", cutoff=0.5, strand="-", type="any") ## Scan the both strands of the alignments sitePairSet <- searchAln(pwm1, aln1, aln2, seqname1="aln1", seqname2="aln2", min.score="70%", cutoff=0.5, strand="*", type="any") ## Convert the SitePairSet object into other R objects as(sitePairSet, "data.frame") as.data.frame(sitePairSet) as(sitePairSet, "DataFrame") as(sitePairSet, "GRanges") writeGFF3(sitePairSet) writeGFF2(sitePairSet) # PWMatrix, character, missing sitePairSet <- searchAln(pwm1, aln, min.score="70%", cutoff=0.5, strand="*", type="any") # PWMatrix, DNAString, DNAString sitePairSet <- searchAln(pwm1, DNAString(aln1), DNAString(aln2), seqname1="aln1", seqname2="aln2", min.score="70%", cutoff=0.5, strand="*", type="any") # PWMatrix, DNAStringSet, missing sitePairSet <- searchAln(pwm1, alnDNAStringSet, min.score="70%", cutoff=0.5, strand="*", type="any") # PWMatrixList, character, character sitePairSetList <- searchAln(pwmList, aln1, aln2, seqname1="aln1", seqname2="aln2", min.score="70%", cutoff=0.5, strand="*", type="any") ## elementLenths of each pwm hits elementNROWS(sitePairSetList) ## output writeGFF2(sitePairSetList) writeGFF3(sitePairSetList) as(sitePairSetList, "DataFrame") as(sitePairSetList, "data.frame") as.data.frame(sitePairSetList) as(sitePairSetList, "GRanges") # PWMatrix, Axt, missing library(CNEr) axtFilesHg19DanRer7 <- file.path(system.file("extdata", package="TFBSTools"), "hg19.danRer7.net.axt") axtHg19DanRer7 <- readAxt(axtFilesHg19DanRer7) sitePairSetList <- searchAln(pwm1, axtHg19DanRer7, min.score="80%", windowSize=51L, cutoff=0.7, strand="*", type="any", conservation=NULL, mc.cores=1) ## We may want to coordinates of motif in the genome GRangesTFBS <- toGRangesList(sitePairSetList, axtHg19DanRer7)
Given a chain file for liftover from one genome to another,
it searches two BSgenome
with a PWMatrix
,
and only reports the hits that are presents in two genomes
with equivalent positions.
searchPairBSgenome(pwm, BSgenome1, BSgenome2, chr1, chr2, min.score="80%", strand="*", chain)
searchPairBSgenome(pwm, BSgenome1, BSgenome2, chr1, chr2, min.score="80%", strand="*", chain)
pwm |
A PWMatrix object or a PWMatrixList object. |
BSgenome1 , BSgenome2
|
A |
chr1 , chr2
|
A |
min.score |
The minimum score for the hit. Can be given an character string in th format of "80%" or as a single absolute value. |
strand |
When searching the alignment, we can search the positive "+" strand or negative "-" strand. While strand is "*", it will search both strands and return the results based on the positvie strand coordinate. |
chain |
A |
A SitePairSet
object is returned when pwm is a PWMatrix
,
while a SitePairSetList
is returned when pwm is a PWMatrixList
.
Ge Tan
## Not run: library(rtracklayer) library(JASPAR2014) library(BSgenome.Hsapiens.UCSC.hg19) library(BSgenome.Mmusculus.UCSC.mm10) data("MA0004.1") pfm <- MA0004.1 pwm <- toPWM(pfm) chain <- import.chain("Downloads/hg19ToMm10.over.chain") sitepairset <- searchPairBSgenome(pwm, BSgenome.Hsapiens.UCSC.hg19, BSgenome.Mmusculus.UCSC.mm10, chr1="chr1", chr2="chr1", min.score="90%", strand="+", chain=chain) ## End(Not run)
## Not run: library(rtracklayer) library(JASPAR2014) library(BSgenome.Hsapiens.UCSC.hg19) library(BSgenome.Mmusculus.UCSC.mm10) data("MA0004.1") pfm <- MA0004.1 pwm <- toPWM(pfm) chain <- import.chain("Downloads/hg19ToMm10.over.chain") sitepairset <- searchPairBSgenome(pwm, BSgenome.Hsapiens.UCSC.hg19, BSgenome.Mmusculus.UCSC.mm10, chr1="chr1", chr2="chr1", min.score="90%", strand="+", chain=chain) ## End(Not run)
It scans a nucleotide sequence with the pattern represented by a PWMatrix and identifies putative transcription factor binding sites.
searchSeq(x, subject, seqname="Unknown", strand="*", min.score="80%", mc.cores=1L)
searchSeq(x, subject, seqname="Unknown", strand="*", min.score="80%", mc.cores=1L)
x |
|
subject |
A |
seqname |
This is sequence name of the target sequence.
If subject is a |
strand |
When searching the sequence, we can search the positive strand or negative strand. While strand is "*", it will search both strands and return the results based on the positvie strand coordinate. |
min.score |
The minimum score for the hit. Can be given an character string in the format of "80%" or as a single absolute value between 0 and 1. When it is percentage value, it represents the quantile between the minimal and the maximal possible value from the PWM. |
mc.cores |
|
A SiteSet
object is returned when x is a PWMatrix
object.
A SiteSetList
object is returned
when x is a PWMatrixList
or subject is a DNAStringSet
.
Ge Tan
Wasserman, W. W., & Sandelin, A. (2004). Applied bioinformatics for the identification of regulatory elements. Nature Publishing Group, 5(4), 276-287. doi:10.1038/nrg1315
data(MA0003.2) data(MA0004.1) pwm1 <- toPWM(MA0003.2) pwm2 <- toPWM(MA0004.1) pwmList <- PWMatrixList(pwm1=pwm1, pwm2=pwm2) seq1 <- "GAATTCTCTCTTGTTGTAGCATTGCCTCAGGGCACACGTGCAAAATG" seq2 <- "GTTTCACCATTGCCTCAGGGCATAAATATATAAAAAAATATAATTTTCATC" # PWMatrix, character ## Only scan the positive strand of the input sequence siteset <- searchSeq(pwm1, seq1, seqname="seq1", strand="+", min.score="80%") siteset <- searchSeq(pwm1, seq1, seqname="seq1", strand="+", min.score=0.8) ## Only scan the negative strand of the input sequence siteset <- searchSeq(pwm1, seq1, seqname="seq1", strand="-", min.score="80%") ## Scan both strands of the input sequences siteset <- searchSeq(pwm1, seq1, seqname="seq1", strand="*", min.score="80%") ## Convert the SiteSet object into other R objects as(siteset, "data.frame") as(siteset, "DataFrame") as(siteset, "GRanges") writeGFF3(siteset) writeGFF2(siteset) # PWMatrixList, character sitesetList <- searchSeq(pwmList, seq1, seqname="seq1", strand="*", min.score="80%") sitesetList <- searchSeq(pwmList, seq1, seqname="seq1", strand="*", min.score="80%", mc.cores=1L) ## Convert the SiteSteList object into other R objects as(sitesetList, "data.frame") as(sitesetList, "DataFrame") as(sitesetList, "GRanges") writeGFF3(sitesetList) writeGFF2(sitesetList) # PWMatrix, DNAStringSet library(Biostrings) seqs <- DNAStringSet(c(seq1=seq1, seq2=seq2)) sitesetList <- searchSeq(pwm1, seqs, min.score="80%") # PWMatrixList, DNAStringSet sitesetList <- searchSeq(pwmList, seqs, min.score="80%")
data(MA0003.2) data(MA0004.1) pwm1 <- toPWM(MA0003.2) pwm2 <- toPWM(MA0004.1) pwmList <- PWMatrixList(pwm1=pwm1, pwm2=pwm2) seq1 <- "GAATTCTCTCTTGTTGTAGCATTGCCTCAGGGCACACGTGCAAAATG" seq2 <- "GTTTCACCATTGCCTCAGGGCATAAATATATAAAAAAATATAATTTTCATC" # PWMatrix, character ## Only scan the positive strand of the input sequence siteset <- searchSeq(pwm1, seq1, seqname="seq1", strand="+", min.score="80%") siteset <- searchSeq(pwm1, seq1, seqname="seq1", strand="+", min.score=0.8) ## Only scan the negative strand of the input sequence siteset <- searchSeq(pwm1, seq1, seqname="seq1", strand="-", min.score="80%") ## Scan both strands of the input sequences siteset <- searchSeq(pwm1, seq1, seqname="seq1", strand="*", min.score="80%") ## Convert the SiteSet object into other R objects as(siteset, "data.frame") as(siteset, "DataFrame") as(siteset, "GRanges") writeGFF3(siteset) writeGFF2(siteset) # PWMatrixList, character sitesetList <- searchSeq(pwmList, seq1, seqname="seq1", strand="*", min.score="80%") sitesetList <- searchSeq(pwmList, seq1, seqname="seq1", strand="*", min.score="80%", mc.cores=1L) ## Convert the SiteSteList object into other R objects as(sitesetList, "data.frame") as(sitesetList, "DataFrame") as(sitesetList, "GRanges") writeGFF3(sitesetList) writeGFF2(sitesetList) # PWMatrix, DNAStringSet library(Biostrings) seqs <- DNAStringSet(c(seq1=seq1, seq2=seq2)) sitesetList <- searchSeq(pwm1, seqs, min.score="80%") # PWMatrixList, DNAStringSet sitesetList <- searchSeq(pwmList, seqs, min.score="80%")
This function takes a ICMatrix
or TFFM
object
and plot the sequence logo.
seqLogo(x, ic.scale = TRUE, xaxis = TRUE, yaxis = TRUE, xfontsize = 15, yfontsize = 15)
seqLogo(x, ic.scale = TRUE, xaxis = TRUE, yaxis = TRUE, xfontsize = 15, yfontsize = 15)
x |
|
ic.scale |
A logical value. If TRUE, the total height of one column is proportional to the information content at that position. Otherwise, all the columns will have the same height. Ignored for |
xaxis |
A logical value. If TRUE, the x-axis will be plotted. Ignored for |
yaxis |
A logical value. If TRUE, the y-axis will be plotted. Ignored for |
xfontsize |
A numeric value. The font size for x-axis. |
yfontsize |
A numeric value. The font size for y-axis. |
A sequence logo is a graphical representation of the matrix model,
based on the information content of each position.
The information content ranges from 0 (no base preference) to 2 (only 1 base used).
If ic.scale
is TRUE,
the height of the logo at certain site is proportinal to
the information content value.
And each stacked base (A, C, G, T)'s height is also proportional to
the information content of each base at that position,
and sorted based on the character size.
For a TFFM
object, a novel graphical representation is used for
capturing the dinucleotide dependencies on the TFFM.
For the upper part of the sequence logo,
we represent the nucleotide probabilities
at position p for each possible nucleotide at position p-1.
Hence, each column represents a position within a TFBS and
each row the nucleotide probabilities found at that position.
Each row assumes a specific nucleotide has been emitted
by the previous hidden state.
The intersection between a column corresponding to position p and
row corresponding to nucleotide n gives the probabilities of
getting each nucleotide at position p if n has been seen at position p-1.
The opacity to represent the sequence logo is proportional to
the probablity of
possible row to be used by the TFFM.
No return value.
This function is based on the function seqLogo
from the Bioconductor
package seqLogo
, especially for the plotting code of TFFM.
Ge Tan
T D Schneider, R. M. S. (1990). Sequence logos: a new way to display consensus sequences. Nucleic acids research, 18(20), 6097.
Mathelier, A., and Wasserman, W.W. (2013). The next generation of transcription factor binding site prediction. PLoS Comput. Biol. 9, e1003214.
## ICMatrix data(MA0003.2) icm = toICM(MA0003.2) seqLogo(icm, ic.scale = TRUE) ## TFFM xmlFirst <- file.path(system.file("extdata", package="TFBSTools"), "tffm_first_order.xml") tffmFirst <- readXMLTFFM(xmlFirst, type="First") seqLogo(tffmFirst)
## ICMatrix data(MA0003.2) icm = toICM(MA0003.2) seqLogo(icm, ic.scale = TRUE) ## TFFM xmlFirst <- file.path(system.file("extdata", package="TFBSTools"), "tffm_first_order.xml") tffmFirst <- readXMLTFFM(xmlFirst, type="First") seqLogo(tffmFirst)
This function calculates the Shannon entropy for a discrete random variable with finite n values sample.
shannon.entropy(p)
shannon.entropy(p)
p |
A |
The entropy is calculated by H(x) = -sum_i^n(P(x_i)log_b(P(x_i))).
A numeric
value of entropy is returned.
Ge Tan
x <- c(1, 1, 1, 1) shannon.entropy(x) x <- c(1, 0, 0, 0) shannon.entropy(x)
x <- c(1, 1, 1, 1) shannon.entropy(x) x <- c(1, 0, 0, 0) shannon.entropy(x)
"SitePairSet"
The SitePairSet object is a container for storing two SiteSet objects. Usually it is used to hold the results returned by searchAln.
## Constructor SitePairSet(siteset1, siteset2)
## Constructor SitePairSet(siteset1, siteset2)
siteset1 , siteset2
|
Each SiteSet object is from one sequence in the pairwise alignment. |
A SitePairSet
object.
signature(x = "SitePairSet")
: Gets the first SiteSet object.
signature(x = "SitePairSet")
: Gets the second SiteSet object.
Ge Tan
"SitePairSetList"
The SitePairSetList class is a container for storing a collection of SitePairSet objects. Basically it is a SimpleList and is designed for manipulating the set of SitePairSet objects as a whole.
## Constructors: SitePairSetList(..., use.names=TRUE)
## Constructors: SitePairSetList(..., use.names=TRUE)
... |
The SitePairSet objects are supplied in .... A list of SitePairSet objects is also acceptable. |
use.names |
A logical value. When TRUE,
the names of the |
A SitePairSetList
object.
Ge Tan
"SiteSet"
The SiteSet object is a container for storing a set of putative transcription factor binding sites on a nucleotide sequence (start, end, strand, score, pattern as a PWMatrix, etc.)
## Constructors: SiteSet(views, score, strand="*", seqname="Unknown", sitesource="TFBS", primary="TF binding site", pattern)
## Constructors: SiteSet(views, score, strand="*", seqname="Unknown", sitesource="TFBS", primary="TF binding site", pattern)
views |
Object of class |
score |
Object of class |
strand |
Object of class |
seqname |
Object of class |
sitesource |
Object of class |
primary |
Object of class |
pattern |
Object of class |
The score retuned in SiteSet is the absolute score of each putative TFBS scanned by the corresponding PWM. The way of calculating the score is shown on the refernce, Page 281.
signature(x = "SiteSet")
: Getter function.
signature(x = "SiteSet")
:
The number of binding sites in this SiteSet.
signature(x = "SiteSet")
: Returns the PWMatrix used.
signature(x = "SiteSet")
:
Gets relative score (between 0.0 to 1.0) with respect of the score
range of the associated pattern (PWMatrix).
signature(x = "SiteSet")
:
Returns the score of each site.
signature(x = "SiteSet")
:
Returns the sequence name of the sequence which contains these sites.
signature(x = "SiteSet")
:
Returns the strand information.
signature(x = "SiteSet")
: Returns the views object.
signature(x = "SiteSet")
:
Returns the start coordinates.
signature(x = "SiteSet")
: Returns the end coordinates.
signature(x = "SiteSet")
(x, type=c("TFMPvalue", "sampling")):
Calculates the empirical p-values for the scores with two methods:
the exact method from TFMPaluve package or
implementation of sampling in this package.
The background probability for sampling is based on
the PWM matrix in the SiteSet
object.
Ge Tan
Wasserman, W. W., & Sandelin, A. (2004). Applied bioinformatics for the identification of regulatory elements. Nature Publishing Group, 5(4), 276-287. doi:10.1038/nrg1315
searchSeq
,
searchAln
,
PWMatrix
,
SiteSetList
,
SitePairSet
data(MA0003.2) pwm <- toPWM(MA0003.2) siteset <- searchSeq(pwm, "GAATTCTCTCTTGTTGTAGTCTCTTGACAAAATG", min.score="60%") writeGFF3(siteset, scoreType="absolute") as(siteset, "data.frame") as(siteset, "DataFrame") as(siteset, "GRanges") relScore(siteset) pvalues(siteset, type="TFMPvalue") pvalues(siteset, type="sampling")
data(MA0003.2) pwm <- toPWM(MA0003.2) siteset <- searchSeq(pwm, "GAATTCTCTCTTGTTGTAGTCTCTTGACAAAATG", min.score="60%") writeGFF3(siteset, scoreType="absolute") as(siteset, "data.frame") as(siteset, "DataFrame") as(siteset, "GRanges") relScore(siteset) pvalues(siteset, type="TFMPvalue") pvalues(siteset, type="sampling")
"SiteSetList"
The SiteSetList class is a container for storing a collection of SiteSet objects. Basically it is a SimpleList and is designed for manipulating the set of SiteSet objects as a whole.
## Constructors: SiteSetList(..., use.names=TRUE)
## Constructors: SiteSetList(..., use.names=TRUE)
... |
The SiteSet objects are supplied in .... A list of SiteSet objects is also acceptable. |
use.names |
A logical value. When TRUE, the names of the |
A SiteSetList
object.
signature(x = "SiteSetList")
(x, type=c("TFMPvalue", "sampling")):
Calculates the empirical p-values for the scores.
Ge Tan
data(MA0003.2) data(MA0004.1) pwmList <- PWMatrixList(MA0003.2=toPWM(MA0003.2), MA0004.1=toPWM(MA0004.1)) sitesetList <- searchSeq(pwmList, "GAATTCTCTCTTGTTGTAGTCTCTTGACAAAATG", min.score="50%") ## elementNROWS of each pwm hits library(S4Vectors) elementNROWS(sitesetList) ## Output of SiteSetList writeGFF3(sitesetList, scoreType="absolute") as(sitesetList, "DataFrame") as(sitesetList, "data.frame") as.data.frame(sitesetList) as(sitesetList, "GRanges") ## Calculate the p-values pvalues(sitesetList, type="TFMPvalue") pvalues(sitesetList, type="sampling")
data(MA0003.2) data(MA0004.1) pwmList <- PWMatrixList(MA0003.2=toPWM(MA0003.2), MA0004.1=toPWM(MA0004.1)) sitesetList <- searchSeq(pwmList, "GAATTCTCTCTTGTTGTAGTCTCTTGACAAAATG", min.score="50%") ## elementNROWS of each pwm hits library(S4Vectors) elementNROWS(sitesetList) ## Output of SiteSetList writeGFF3(sitesetList, scoreType="absolute") as(sitesetList, "DataFrame") as(sitesetList, "data.frame") as.data.frame(sitesetList) as(sitesetList, "GRanges") ## Calculate the p-values pvalues(sitesetList, type="TFMPvalue") pvalues(sitesetList, type="sampling")
The TFFM is a virtual class. Two classes are derived from this class: TFFMFirst and TFFMDetail.
TFFMFirst class stands for the first-order TFFMs and TFFMDetail stands for the more detailed and descriptive TFFMs.
## constructors: TFFMFirst(ID="Unknown", name="Unknown", matrixClass="Unknown", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(), profileMatrix=matrix(), type=character(), emission=list(), transition=matrix()) TFFMDetail(ID="Unknown", name="Unknown", matrixClass="Unknown", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(), profileMatrix=matrix(), type=character(), emission=list(), transition=matrix())
## constructors: TFFMFirst(ID="Unknown", name="Unknown", matrixClass="Unknown", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(), profileMatrix=matrix(), type=character(), emission=list(), transition=matrix()) TFFMDetail(ID="Unknown", name="Unknown", matrixClass="Unknown", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(), profileMatrix=matrix(), type=character(), emission=list(), transition=matrix())
ID , name , matrixClass , strand , bg , tags , profileMatrix
|
See |
type |
The type of TFFM. |
emission |
The emission distribution parameters. |
transition |
The transition probability matrix. |
A TFFM
object.
signature(x = "TFFMFirst")
:
Get the length of First-order TFFM.
signature(x = "TFFMDetail")
:
Get the length of detail TFFM.
signature(x = "TFFM")
:
Get the information content at each position.
Ge Tan
Mathelier, A., and Wasserman, W.W. (2013). The next generation of transcription factor binding site prediction. PLoS Comput. Biol. 9, e1003214.
http://cisreg.cmmt.ubc.ca/TFFM/doc/#
xmlFirst <- file.path(system.file("extdata", package="TFBSTools"), "tffm_first_order.xml") tffmFirst <- readXMLTFFM(xmlFirst, type="First") tffm <- getPosProb(tffmFirst)
xmlFirst <- file.path(system.file("extdata", package="TFBSTools"), "tffm_first_order.xml") tffmFirst <- readXMLTFFM(xmlFirst, type="First") tffm <- getPosProb(tffmFirst)
Get the genomic coordinates from SitePairSetList.
A list of two GRanges
objects are returned,
one for the target sequences
and another for query sequences.
In the GRanges
, strand is taken from the Axt
object.
In the meta-data columns, PWM matrix ID, the strand of matrix and
match score are also returned.
signature(x = "SitePairSetList", axt = "Axt")
Convert the relative coordinates to absolute coordinates.
Ge Tan
data(MA0003.2) pwm <- toPWM(MA0003.2) library(CNEr) axtFilesHg19DanRer7 <- file.path(system.file("extdata", package="TFBSTools"), "hg19.danRer7.net.axt") axtHg19DanRer7 <- readAxt(axtFilesHg19DanRer7) sitePairSet <- searchAln(pwm, axtHg19DanRer7, min.score="80%", windowSize=51L, cutoff=0.7, strand="*", type="any", conservation=NULL, mc.cores=1) toGRangesList(sitePairSet, axtHg19DanRer7)
data(MA0003.2) pwm <- toPWM(MA0003.2) library(CNEr) axtFilesHg19DanRer7 <- file.path(system.file("extdata", package="TFBSTools"), "hg19.danRer7.net.axt") axtHg19DanRer7 <- readAxt(axtFilesHg19DanRer7) sitePairSet <- searchAln(pwm, axtHg19DanRer7, min.score="80%", windowSize=51L, cutoff=0.7, strand="*", type="any", conservation=NULL, mc.cores=1) toGRangesList(sitePairSet, axtHg19DanRer7)
Converts a raw frequency matrix (PFMatrix) to a information content matrix (ICMatrix). It takes the bases background frequencies, pseudocounts and schneider as parameters.
toICM(x, pseudocounts=0.8, schneider=FALSE, bg=c(A=0.25, C=0.25, G=0.25, T=0.25))
toICM(x, pseudocounts=0.8, schneider=FALSE, bg=c(A=0.25, C=0.25, G=0.25, T=0.25))
x |
For |
pseudocounts |
A default value 0.8 is used. |
schneider |
This logical parameter controls whether a Schneider correction will be done. See more details below. |
bg |
bg is a vector of background frequencies of four bases with names containing A, C, G, T. When toPWM is applied to a |
The information content matrix has a column sum between 0 (no base preference) and 2 (only 1 base used). Usually this information is used to plot sequence log.
The information content at each position is computed
where D is the total information contect for each position. For detailed procedure of computation, please refer to the vignette.
If a Schneider correction will be done if requested. Please see the reference below for more comprehensive explanation.
A ICMatrix
object which contains the background frequency,
pseudocounts and Schneider correction used.
Ge Tan
Schneider, T. D., Stormo, G. D., Gold, L., & Ehrenfeucht, A. (1986). Information content of binding sites on nucleotide sequences. Journal of molecular biology, 188(3), 415-431.
## Constructor a PFMatrix pfm <- PFMatrix(ID="MA0004.1", name="Arnt", matrixClass="Zipper-Type", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(family="Helix-Loop-Helix", species="10090", tax_group="vertebrates", medline="7592839", type="SELEX", ACC="P53762", pazar_tf_id="TF0000003", TFBSshape_ID="11", TFencyclopedia_ID="580"), profileMatrix=matrix(c(4L, 19L, 0L, 0L, 0L, 0L, 16L, 0L, 20L, 0L, 0L, 0L, 0L, 1L, 0L, 20L, 0L, 20L, 0L, 0L, 0L, 0L, 20L, 0L), byrow=TRUE, nrow=4, dimnames=list(c("A", "C", "G", "T"))) ) ## Convert it into a PWMatrix icm <- toICM(pfm, pseudocounts=0.8, schneider=TRUE) ## Conversion on PWMatrixList data(MA0003.2) data(MA0004.1) pfmList <- PFMatrixList(pfm1=MA0003.2, pfm2=MA0004.1, use.names=TRUE) icmList <- toICM(pfmList, pseudocounts=0.8, schneider=TRUE)
## Constructor a PFMatrix pfm <- PFMatrix(ID="MA0004.1", name="Arnt", matrixClass="Zipper-Type", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(family="Helix-Loop-Helix", species="10090", tax_group="vertebrates", medline="7592839", type="SELEX", ACC="P53762", pazar_tf_id="TF0000003", TFBSshape_ID="11", TFencyclopedia_ID="580"), profileMatrix=matrix(c(4L, 19L, 0L, 0L, 0L, 0L, 16L, 0L, 20L, 0L, 0L, 0L, 0L, 1L, 0L, 20L, 0L, 20L, 0L, 0L, 0L, 0L, 20L, 0L), byrow=TRUE, nrow=4, dimnames=list(c("A", "C", "G", "T"))) ) ## Convert it into a PWMatrix icm <- toICM(pfm, pseudocounts=0.8, schneider=TRUE) ## Conversion on PWMatrixList data(MA0003.2) data(MA0004.1) pfmList <- PFMatrixList(pfm1=MA0003.2, pfm2=MA0004.1, use.names=TRUE) icmList <- toICM(pfmList, pseudocounts=0.8, schneider=TRUE)
Converts a raw frequency matrix (PFMatrix) to a position weight matrix (PWMatrix). It takes the type, bases background frequencies, pseudocounts as parameters.
toPWM(x, type=c("log2probratio", "prob"), pseudocounts=0.8, bg=c(A=0.25, C=0.25, G=0.25, T=0.25))
toPWM(x, type=c("log2probratio", "prob"), pseudocounts=0.8, bg=c(A=0.25, C=0.25, G=0.25, T=0.25))
x |
For |
type |
The type of PWM generated, should be one of "log2probratio" or "prob". "log2probratio" will generate the PWM matrix in log-scale, while "prob" will give the PWM matrix in probability scale of 0 to 1. |
pseudocounts |
pseudocounts is a numeric non-negative vector, which means you can specify different pseudocounts for each site. The values will be recycled if shorter than the length of sites. 0.8 is recommended. See the reference below for more details. In the TFBS perl module, the squared root of the column sum of the matrix, i.e., the number of motifs used to construct the PFM, is used. |
bg |
bg is a vector of background frequencies of four bases
with names containing A, C, G, T.
When toPWM is applied to a |
The raw position frequency matrix (PFM) is usually converted into
a position weight matrix (PWM),
also known as position specific scoring matrix (PSSM).
The PWM provides the probability of each base at certain position and
used for scanning the genomic sequences.
The implementation here is slightly different from PWM
in
Biostrings
package by choosing the pseudocounts.
Pseudocounts is necessary for correcting the small number of counts
or eliminating the zero values before log transformation.
A PWMatrix
object that contains the background frequency and
pseudocounts used.
Ge Tan
Wasserman, W. W., & Sandelin, A. (2004). Applied bioinformatics for the identification of regulatory elements. Nature Publishing Group, 5(4), 276-287. doi:10.1038/nrg1315
Nishida, K., Frith, M. C., & Nakai, K. (2009). Pseudocounts for transcription factor binding sites. Nucleic acids research, 37(3), 939-944. doi:10.1093/nar/gkn1019
## Constructe a PFMatrix pfm <- PFMatrix(ID="MA0004.1", name="Arnt", matrixClass="Zipper-Type", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(family="Helix-Loop-Helix", species="10090", tax_group="vertebrates", medline="7592839", type="SELEX", ACC="P53762", pazar_tf_id="TF0000003", TFBSshape_ID="11", TFencyclopedia_ID="580"), profileMatrix=matrix(c(4L, 19L, 0L, 0L, 0L, 0L, 16L, 0L, 20L, 0L, 0L, 0L, 0L, 1L, 0L, 20L, 0L, 20L, 0L, 0L, 0L, 0L, 20L, 0L), byrow=TRUE, nrow=4, dimnames=list(c("A", "C", "G", "T"))) ) ## Convert it into a PWMatrix pwm <- toPWM(pfm, type="log2probratio", pseudocounts=0.8) ## Conversion on PWMatrixList data(MA0003.2) data(MA0004.1) pfmList <- PFMatrixList(pfm1=MA0003.2, pfm2=MA0004.1, use.names=TRUE) pwmList <- toPWM(pfmList, pseudocounts=0.8)
## Constructe a PFMatrix pfm <- PFMatrix(ID="MA0004.1", name="Arnt", matrixClass="Zipper-Type", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(family="Helix-Loop-Helix", species="10090", tax_group="vertebrates", medline="7592839", type="SELEX", ACC="P53762", pazar_tf_id="TF0000003", TFBSshape_ID="11", TFencyclopedia_ID="580"), profileMatrix=matrix(c(4L, 19L, 0L, 0L, 0L, 0L, 16L, 0L, 20L, 0L, 0L, 0L, 0L, 1L, 0L, 20L, 0L, 20L, 0L, 0L, 0L, 0L, 20L, 0L), byrow=TRUE, nrow=4, dimnames=list(c("A", "C", "G", "T"))) ) ## Convert it into a PWMatrix pwm <- toPWM(pfm, type="log2probratio", pseudocounts=0.8) ## Conversion on PWMatrixList data(MA0003.2) data(MA0004.1) pfmList <- PFMatrixList(pfm1=MA0003.2, pfm2=MA0004.1, use.names=TRUE) pwmList <- toPWM(pfmList, pseudocounts=0.8)
writeGFF3
, writeGFF2
functionswrite the SiteSet
, SitePairSet
, SiteSetList
,
SitePairSetList
into the GFF3 or GFF2 format.
writeGFF3(x, scoreType=c("absolute", "relative")) writeGFF2(x, scoreType=c("absolute", "relative"))
writeGFF3(x, scoreType=c("absolute", "relative")) writeGFF2(x, scoreType=c("absolute", "relative"))
x |
A |
scoreType |
The score column can have absolute value or relative value. |
It returns nothing.
Ge Tan
"XMatrix"
objectsXMatrix is a virtual class. No objects can be created from it directly. Three classes are derived from this class: PFMatrix, PWMatrix and ICMatrix.
PFMatrix is a class whose instances are objects representing raw position frequency matrices (PFMs).
PWMatrix is a class whose instances are objects representing
position weight matrices (PWMs).
Compared with PFMatrix, it has extra slot pseudocounts
.
ICMatrix is a class whose instances are objects representing
information content matrices (ICMs).
Compared with PWMatrix, it has extra slot schneider
.
## Constructors: PFMatrix(ID="Unknown", name="Unknown", matrixClass="Unknown", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(), profileMatrix=matrix()) PWMatrix(ID="Unknown", name="Unknown", matrixClass="Unknown", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(), profileMatrix=matrix(), pseudocounts=numeric()) ICMatrix(ID="Unknown", name="Unknown", matrixClass="Unknown", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(), profileMatrix=matrix(), pseudocounts=numeric(), schneider=logical()) ## Accessor-like methods: ## S4 method for signature 'XMatrix' ID(x) ## S4 method for signature 'XMatrix' bg(x) ## ... and more (see Methods)
## Constructors: PFMatrix(ID="Unknown", name="Unknown", matrixClass="Unknown", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(), profileMatrix=matrix()) PWMatrix(ID="Unknown", name="Unknown", matrixClass="Unknown", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(), profileMatrix=matrix(), pseudocounts=numeric()) ICMatrix(ID="Unknown", name="Unknown", matrixClass="Unknown", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(), profileMatrix=matrix(), pseudocounts=numeric(), schneider=logical()) ## Accessor-like methods: ## S4 method for signature 'XMatrix' ID(x) ## S4 method for signature 'XMatrix' bg(x) ## ... and more (see Methods)
ID |
Object of class |
name |
Object of class |
matrixClass |
Object of class |
strand |
Object of class |
bg |
Object of class |
tags |
Object of class (1) "family": Structural sub-class of the transcription factor, based on the TFCaT system. (2) "species": The species source for the sequences, in NCBI tax IDs. (3) "tax_group": Group of species, currently consisting of 4 larger groups: vertebrate, insect, plant, chordate. (4) "medline": a ID to the relevant publication reporting the sites used in the mode building. (5) "type": Methodology used for matrix construction. (6) "ACC": A representative protein accession number in Genbank for the transcription factor. Human takes precedence if several exists. (6) "pazar_tf_id": a ID to PAZAR database. (7) "TFBSshape_ID": a ID to TFBSshape database. (8) "TFencyclopedia_ID": a ID to the Transcription Factor Encyclopedia. (9) "comment": For some matrices, a curator comment is added. |
profileMatrix |
Object of class |
pseudocounts |
Object of class |
schneider |
Obejct of class |
x |
Object of class |
A XMatrix
object.
signature(x = "XMatrix")
:
Gets the background base frequencies.
signature(x = "XMatrix")
:
Sets the background base frequencies.
signature(x = "XMatrix")
: Gets the ID information.
signature(x = "XMatrix")
: Sets the ID information.
signature(x = "XMatrix")
: Gets the pattern length in nucleotides
(i.e. number of columns in the matrix).
signature(x = "PWMatrix")
:
Generates the reverse complement matrix object.
Note than the strand is XMatrix will also be changed to
the opossite strand.
signature(x = "XMatrix")
: Returns the matrix in the XMatrix class.
signature(x = "ICMatrix")
: Returns the information content vector.
signature(x = "XMatrix")
:
Gets the matrix stored in XMatrix
object.
signature(x = "XMatrix")
: Sets the matrix stored in XMatrix
object.
signature(x = "XMatrix")
:
Gets the matrix type of a XMatrix
object.
signature(x = "XMatrix")
:
Sets the matrix type of a XMatrix
object.
signature(x = "XMatrix")
: Gets the name information.
signature(x = "XMatrix")
: Sets the name information.
signature(x = "XMatrix")
:
Gets the strand information of a XMatrix
object.
signature(x = "XMatrix")
:
Gets a list
object of tags information.
Ge Tan
## Constructorpf PFMatrix ## Note that there is no XMatrix() constructor, ## but an XMatrix family of constructors: PFMatrix(), PWMatrix(), ICMatrix() pfm <- PFMatrix(ID="MA0004.1", name="Arnt", matrixClass="Zipper-Type", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(family="Helix-Loop-Helix", species="10090", tax_group="vertebrates", medline="7592839", type="SELEX", ACC="P53762", pazar_tf_id="TF0000003", TFBSshape_ID="11", TFencyclopedia_ID="580"), profileMatrix=matrix(c(4, 19, 0, 0, 0, 0, 16, 0, 20, 0, 0, 0, 0, 1, 0, 20, 0, 20, 0, 0, 0, 0, 20, 0), byrow=TRUE, nrow=4, dimnames=list(c("A", "C", "G", "T"))) ) ## Construction from a set of binding sites sequences sitesSeqs <- c("Human Gli1"= "GACCACCCA", "hIGFBP-6"= "GACCCCCCA", "HNF-3beta"="GAACACCCA", "hPlakoglobin"= "GACCACCAA", "rIGFBP-6"= "GTCCACCCA", "Sox-9"= "GGCCACCCA") countMatrix <- consensusMatrix(sitesSeqs) pfm <- PFMatrix(ID="Gli-1", name="Gli-1", profileMatrix=countMatrix) ## Coersion as.matrix(pfm) as(pfm, "matrix") ## Methods pwm <- toPWM(pfm) reverseComplement(pwm) length(pfm)
## Constructorpf PFMatrix ## Note that there is no XMatrix() constructor, ## but an XMatrix family of constructors: PFMatrix(), PWMatrix(), ICMatrix() pfm <- PFMatrix(ID="MA0004.1", name="Arnt", matrixClass="Zipper-Type", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(family="Helix-Loop-Helix", species="10090", tax_group="vertebrates", medline="7592839", type="SELEX", ACC="P53762", pazar_tf_id="TF0000003", TFBSshape_ID="11", TFencyclopedia_ID="580"), profileMatrix=matrix(c(4, 19, 0, 0, 0, 0, 16, 0, 20, 0, 0, 0, 0, 1, 0, 20, 0, 20, 0, 0, 0, 0, 20, 0), byrow=TRUE, nrow=4, dimnames=list(c("A", "C", "G", "T"))) ) ## Construction from a set of binding sites sequences sitesSeqs <- c("Human Gli1"= "GACCACCCA", "hIGFBP-6"= "GACCCCCCA", "HNF-3beta"="GAACACCCA", "hPlakoglobin"= "GACCACCAA", "rIGFBP-6"= "GTCCACCCA", "Sox-9"= "GGCCACCCA") countMatrix <- consensusMatrix(sitesSeqs) pfm <- PFMatrix(ID="Gli-1", name="Gli-1", profileMatrix=countMatrix) ## Coersion as.matrix(pfm) as(pfm, "matrix") ## Methods pwm <- toPWM(pfm) reverseComplement(pwm) length(pfm)
"XMatrixList"
The XMatrixList virtual class is a container for storing a collection of XMatrix objects. No object can be constructed directly from this virtual and it has three subclasses: PFMatrixList, PWMatrixList and ICMatrixList. Basically it is a SimpleList and is designed for manipulating the set of XMatrix objects as a whole.
## Constructors: PFMatrixList(..., use.names=TRUE) PWMatrixList(..., use.names=TRUE) ICMatrixList(..., use.names=TRUE) ## Accessor-like methods: ## S4 method for signature 'XMatrixList' ID(x) ## S4 method for signature 'XMatrixList' name(x) ## S4 method for signature 'XMatrixList' bg(x) ## S4 method for signature 'XMatrixList' tags(x) ## S4 method for signature 'XMatrixList' name(x) ## S4 method for signature 'XMatrixList' strand(x)
## Constructors: PFMatrixList(..., use.names=TRUE) PWMatrixList(..., use.names=TRUE) ICMatrixList(..., use.names=TRUE) ## Accessor-like methods: ## S4 method for signature 'XMatrixList' ID(x) ## S4 method for signature 'XMatrixList' name(x) ## S4 method for signature 'XMatrixList' bg(x) ## S4 method for signature 'XMatrixList' tags(x) ## S4 method for signature 'XMatrixList' name(x) ## S4 method for signature 'XMatrixList' strand(x)
... |
The |
use.names |
A logical value. When TRUE, the names of the |
x |
A |
A XMatrixList
object.
Ge Tan
data(MA0003.2) data(MA0004.1) ## Construction of PFMatrixList pfmList <- PFMatrixList(pfm1=MA0003.2, pfm2=MA0004.1, use.names=TRUE) ## Construction of PFM<atrixList from list of PFMatrix pfmList <- do.call(PFMatrixList, list(pfm1=MA0003.2, pfm2=MA0004.1))
data(MA0003.2) data(MA0004.1) ## Construction of PFMatrixList pfmList <- PFMatrixList(pfm1=MA0003.2, pfm2=MA0004.1, use.names=TRUE) ## Construction of PFM<atrixList from list of PFMatrix pfmList <- do.call(PFMatrixList, list(pfm1=MA0003.2, pfm2=MA0004.1))