Title: | High-throughput prediction of DNA shape features |
---|---|
Description: | DNAhapeR is an R/BioConductor package for ultra-fast, high-throughput predictions of DNA shape features. The package allows to predict, visualize and encode DNA shape features for statistical learning. |
Authors: | Tsu-Pei Chiu and Federico Comoglio |
Maintainer: | Tsu-Pei Chiu <[email protected]> |
License: | GPL-2 |
Version: | 1.35.0 |
Built: | 2024-10-30 05:26:22 UTC |
Source: | https://github.com/bioc/DNAshapeR |
The main functions in the package are getFasta
and
getShape
. Shape predictions can be additionally plotted with plotShape
.
See package vignette for examples.
All support questions should be posted to the Bioconductor support site: support.bioconductor.org, using DNAshapeR as a tag.
Tsu-Pei Chiu and Federico Comoglio
DNAshapeR reference:
T.-P. Chiu*, F. Comoglio*, T. Zhou, L. Yang, R. Paro, and R. Rohs: DNAshapeR: an R/Bioconductor package for DNA shape prediction and feature encoding (2016). Bioinformatics http://bioinformatics.oxfordjournals.org/content/early/2016/01/09/bioinformatics.btv735 (*equal contributor in alphabetic order)
Convert fasta file to methylated file format
convertMethFile(fastaFileName, methPositionFileName)
convertMethFile(fastaFileName, methPositionFileName)
fastaFileName |
The name of the input fasta format file, including full path to file if it is located outside the current working directory. |
methPositionFileName |
The name of the input position file indicating the methlation position |
methFileName fasta file containing methylated Cytosine
Satyanarayan Rao & Tsu-Pei Chiu
encode Hbond
encodeKmerHbond (k, dnaStringSet )
encodeKmerHbond (k, dnaStringSet )
k |
k-mer sequence |
dnaStringSet |
dnaStringSet |
featureVector
DNAshapeR can be used to generate feature vectors for a user-defined model. The model can be a k-mer sequence. Sequence is encoded in four binary features (i.e., in terms of 1-mers, 0001 for adenine, 0010 for cytosine, 0100 for guanine, and 1000 for thymine) at each nucleotide position (Zhou, et al., 2015). The function permits an encoding of 2-mers and 3-mers (16 and 64 binary features at each position, respectively).
encodeKMerSeq(k, dnaStringSet)
encodeKMerSeq(k, dnaStringSet)
k |
A number indicating k-mer sequence encoding |
dnaStringSet |
A DNAStringSet object of the inputted fasta file |
featureVector A matrix containing encoded features. Sequence feature is represented as binary numbers
Tsu-Pei Chiu
Encode n-st order shape features DNAshapeR can be used to generate feature vectors for a user-defined model. The model can be a shape model. There are four structural parameters including MGW, Roll, ProT and HelT. The second order shape features are product terms of values for the same category of shape features at adjacent positions.
encodeNstOrderShape(n, shapeMatrix, shapeType)
encodeNstOrderShape(n, shapeMatrix, shapeType)
n |
A number indicating n-st order shape encoding |
shapeMatrix |
A matrix containing DNAshape prediction result |
shapeType |
A character name of shape (MGW, Roll, ProT, HelT) features |
featureVector A matrix containing encoded features. shape feature is represented as continuous numbers
Tsu-Pei Chiu
DNAshapeR can be used to generate feature vectors for a user-defined model. These models can be based on DNA sequence (1-mer, 2-mer, 3-mer) or DNA shape (MGW, Roll, ProT, HelT) features or any combination thereof. Sequence is encoded as four binary features (i.e., 0001 for adenine, 0010 for cytosine, 0100 for guanine, and 1000 for thymine, for encoding of 1-mers) at each nucleotide position (Zhou, et al., 2015). Encoding of 2-mers and 3-mers (16 and 64 binary features at each position, respectively) is also supported. Shape features include first and second order (or higher order) values for the four structural parameters MGW, Roll, ProT and HelT. The second order shape features are product terms of values for the same category of shape features at adjacent positions. The function allows to generate any subset of these features, e.g. a given shape category or first order shape features, and any desired combination of shape and sequence features. Feature encoding returns a feature matrix for a dataset of multiple sequences, in which each sequence generates a concatenated feature vector. The output of this function can be used directly for any statistical machine learning method.
encodeSeqShape(fastaFileName, shapeMatrix, featureNames, normalize)
encodeSeqShape(fastaFileName, shapeMatrix, featureNames, normalize)
fastaFileName |
A character name of the input fasta format file, including full path to file if it is located outside the current working directory. |
shapeMatrix |
A matrix containing DNAshape prediction result |
featureNames |
A vector containing a combination of user-defined sequence and shape parameters. The parameters can be any combination of "k-mer", "n-shape", "n-MGW", "n-ProT", "n-Roll", "n-HelT" (k, n are integers) |
normalize |
A logical indicating whether to perform normalization. Default to TRUE. |
featureVector A matrix containing encoded features. Sequence features are represented as binary numbers, while shape features are represented as real numbers.
Tsu-Pei Chiu
fn <- system.file("extdata", "CGRsample_short.fa", package = "DNAshapeR") pred <- getShape(fn) featureNames <- c("1-shape") featureVector <- encodeSeqShape(fn, pred, featureNames)
fn <- system.file("extdata", "CGRsample_short.fa", package = "DNAshapeR") pred <- getShape(fn) featureNames <- c("1-shape") featureVector <- encodeSeqShape(fn, pred, featureNames)
DNAshapeR can predict DNA shape features from custom FASTA files or directly from genomic coordinates in the form of a GRanges object within BioConductor (see <https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html> for more information).
getFasta(GR, BSgenome, width = 1e3, filename = 'tmp.fa')
getFasta(GR, BSgenome, width = 1e3, filename = 'tmp.fa')
GR |
A GRanges object indicating genomic coordinates |
BSgenome |
A BSgenome object indicating the genome of interest |
width |
A number indicating a fixed width of sequences |
filename |
The Name of the input fasta format file, including full path to file if it is located outside the current working directory |
writes a fasta file
Federico Comoglio
gr <- GRanges(seqnames = c("chrI"), strand = c("+", "-", "+"), ranges = IRanges(start = c(100, 200, 300), width = 100)) library(BSgenome.Scerevisiae.UCSC.sacCer3) getFasta(gr, BSgenome = Scerevisiae, width = 100, filename = "tmp.fa") fn <- "tmp.fa" pred <- getShape(fn)
gr <- GRanges(seqnames = c("chrI"), strand = c("+", "-", "+"), ranges = IRanges(start = c(100, 200, 300), width = 100)) library(BSgenome.Scerevisiae.UCSC.sacCer3) getFasta(gr, BSgenome = Scerevisiae, width = 100, filename = "tmp.fa") fn <- "tmp.fa" pred <- getShape(fn)
The DNA prediction uses a sliding pentamer window where structural features unique to each of the 512 distinct pentamers define a vector of minor groove width (MGW), Roll, propeller twist (ProT), and helix twist (HelT) at each nucleotide position (Zhou, et al., 2013). MGW and ProT define base-pair parameter whereas Roll and HelT represent base pair-step parameters. The values for each DNA shape feature as function of its pentamer sequence were derived from all-atom Monte Carlo simulations where DNA structure is sampled in collective and internal degrees of freedom in combination with explicit counter ions (Zhang, et al., 2014). The Monte Carlo simulations were analyzed with a modified Curves approach (Zhou, et al., 2013). Through data mining, average values for each shape feature were calculated for the on average 44 occurrences of each pentamer in an ensemble of Monte Carlo trajectories for 2,121 DNA fragments of 12-27 base pairs in length. DNAshapeR predicts four DNA shape features, which were observed in various co-crystal structures playing an important role in specific protein-DNA binding. The core prediction algorithm enables ultra-fast, high-throughput predictions of shape features for thousands of genomic sequences and is implemented in C++. Since it is likely that features describing additional structural properties or equivalent features derived from different experimental or computational sources will become available, the package has a flexible modular design that easily allows future expansions. In the latest version, we further added additional 9 DNA shape features beyond our previous set of 4 features, and expanded our available repertoire to a total of 13 features, including 6 inter-base pair or base pair-step parameters (HelT, Rise, Roll, Shift, Slide, and Tilt), 6 intra-base pair or base pair-step parameters (Buckle, Opening, ProT, Shear, Stagger, and Stretch), and MGW.
getShape(filename, shapeType = 'Default', parse = TRUE, methylate = FALSE, methylatedPosFile = NULL)
getShape(filename, shapeType = 'Default', parse = TRUE, methylate = FALSE, methylatedPosFile = NULL)
filename |
The name of input fasta format file, including full path to file if it is located outside the current working directory. |
shapeType |
A character indicating the shape parameters which can be "MGW", "ProT", "Roll", "HelT" or "All" (meaning all four shapes) |
parse |
A logical value indicating whether parse the prediction result |
methylate |
A logical value indicating wheter consider methlatation |
methylatedPosFile |
The name of input postion file indicating methlated position |
Predict biophysical feature
Our previous work explained protein-DNA binding specificity based on correlations between MGW and electrostatic potential (EP) observed in experimentally available structures (Joshi, et al., 2007). However, A/T and C/G base pairs carry different partial charge distributions in the minor groove (due primarily to the guanine amino group), which will affect minor-groove EP. We developed a high-throughput method to predict minor-groove EP based on data mining of results from solving the nonlinear Poisson-Boltzmann calculations (Honig & Nicholls, 1995) on 2,297 DNA structures derived from Monte Carlo simulations. DNAshapeR includes EP as an additional feature.
shapeList A List containing shapre prediction result
Federico Comoglio & Tsu-Pei Chiu
fn <- system.file("extdata", "CGRsample.fa", package = "DNAshapeR") pred <- getShape(fn)
fn <- system.file("extdata", "CGRsample.fa", package = "DNAshapeR") pred <- getShape(fn)
Plot heatmap of DNA shape features
heatShape(shapeMatrix, nBins, ordRow = NULL, useRaster = TRUE, ... )
heatShape(shapeMatrix, nBins, ordRow = NULL, useRaster = TRUE, ... )
shapeMatrix |
A matrix containing DNAshape prediction results. |
nBins |
An integer specifying the number of equally-sized bins in which shape predictions should be aggregated. Summarized predictions can be visualized by setting nBins=1. |
ordRow |
A numeric vector (of the same length as the number of rows of shapeMatrix) defining the permutation of the rows of shapeMatrix to be used for plotting. Default to NULL, i.e. rows are ordered by coefficients of variation. |
useRaster |
Logical, if TRUE a bitmap raster is used to plot the image instead of polygons (see ?graphics::image for details). |
... |
Additional parameters to be passed to the image.plot function (see ?fields::image.plot for details). |
Called for its effects
Federico Comoglio
fn <- system.file("extdata", "CGRsample.fa", package = "DNAshapeR") pred <- getShape(fn) library(fields) heatShape(pred$MGW, 20)
fn <- system.file("extdata", "CGRsample.fa", package = "DNAshapeR") pred <- getShape(fn) library(fields) heatShape(pred$MGW, 20)
Min-Max normalization
normalize(x, max, min)
normalize(x, max, min)
x |
A matrix containing encoded features |
max |
A number maximum number for Min-Max Normalization |
min |
A number minimum number for Min-Max Normalization |
featureVector A matrix containing encoded features. shape feature is represented as continuous numbers
Tsu-Pei Chiu
Normalize n-st order shape features
normalizeShape(featureVector, thOrder, shapeType, normalize)
normalizeShape(featureVector, thOrder, shapeType, normalize)
featureVector |
A matrix containing encoded features. |
thOrder |
A number indicating n-st order shape encoding |
shapeType |
A character name of shape (MGW, Roll, ProT, HelT) features |
normalize |
A logical indicating whether to perform normalization. Default to TRUE. |
featureVector A matrix containing encoded features.
Tsu-Pei Chiu
DNA shape features can be visualized as aggregated line plots (also known as metaprofiles, see Comoglio et al., 2015), heat maps (Yang et al., 2014) and genome browser tracks (Chiu et al., 2014).
plotShape(shapeMatrix, background = NULL, colDots = rgb( 0, 0, 1, 0.1), colDotsBg = rgb( 0, 0, 0, 0.1), colLine = 'steelblue', colLineBg = 'gray50', cex = 0.5, lwd = 2, ylim, ...)
plotShape(shapeMatrix, background = NULL, colDots = rgb( 0, 0, 1, 0.1), colDotsBg = rgb( 0, 0, 0, 0.1), colLine = 'steelblue', colLineBg = 'gray50', cex = 0.5, lwd = 2, ylim, ...)
shapeMatrix |
A matrix containing DNAshape prediction results |
background |
A matrix containing DNAshape prediction results for a set of background regions. Default to NULL, i.e. background not provided. |
colDots |
A character vector specifying the color of the points representing the column mean of shapeMatrix. Default to rgb( 0, 0, 1, 0.1). |
colDotsBg |
A character vector specifying the color of the points representing the column mean of background. Default to rgb( 0, 0, 0, 0.1). |
colLine |
A character string giving the color name of line representing the column mean of shapeMatrix. Default to 'steelblue'. |
colLineBg |
A character string giving the color name of line representing the column mean of background. Default to 'gray50'. |
cex |
A numerical value giving the amount by which plotting text and symbols should be magnified relative to the default. Default to 0.5. |
lwd |
A numerical value specifying the line width. Default to 2. |
ylim |
A numerical vector of size 2 specifying the y-axis plot range. |
... |
Additional parameters to be passed to the R plot function. |
Called for its effects
Federico Comoglio
fn <- system.file("extdata", "CGRsample.fa", package = "DNAshapeR") pred <- getShape(fn) plotShape(pred$MGW) plotShape(pred$ProT) plotShape(pred$Roll) plotShape(pred$HelT)
fn <- system.file("extdata", "CGRsample.fa", package = "DNAshapeR") pred <- getShape(fn) plotShape(pred$MGW) plotShape(pred$ProT) plotShape(pred$Roll) plotShape(pred$HelT)
Read the position fasta file
readNonStandardFastaFile(filename)
readNonStandardFastaFile(filename)
filename |
The name of the input position file indicating the methlation position |
df dataframe
Satyanarayan Rao & Tsu-Pei Chiu
Read DNA shape predictions
readShape(filename)
readShape(filename)
filename |
character name of the file containing shape predictions, including full path to file if it is located outside the current working directory. |
shapeMatrix matrix containing the shape prediction result
Federico Comoglio & Tsu-Pei Chiu
fn <- system.file("extdata", "CGRsample.fa", package = "DNAshapeR") pred <- readShape(fn)
fn <- system.file("extdata", "CGRsample.fa", package = "DNAshapeR") pred <- readShape(fn)
Plot track view of DNA shape features
trackShape( filename, shapeList )
trackShape( filename, shapeList )
filename |
The name of the input fasta format file, including full path to file if it is located outside the current working directory |
shapeList |
A list containing four DNAshape prediction results |
Called for its effects
None.
Tsu-Pei Chiu
fn2 <- system.file("extdata", "SingleSeqsample.fa", package = "DNAshapeR") pred2 <- getShape(fn2) trackShape(fn2, pred2) # Only for single sequence file
fn2 <- system.file("extdata", "SingleSeqsample.fa", package = "DNAshapeR") pred2 <- getShape(fn2) trackShape(fn2, pred2) # Only for single sequence file