Title: | cleanUpdTSeq cleans up artifacts from polyadenylation sites from oligo(dT)-mediated 3' end RNA sequending data |
---|---|
Description: | This package implements a Naive Bayes classifier for accurately differentiating true polyadenylation sites (pA sites) from oligo(dT)-mediated 3' end sequencing such as PAS-Seq, PolyA-Seq and RNA-Seq by filtering out false polyadenylation sites, mainly due to oligo(dT)-mediated internal priming during reverse transcription. The classifer is highly accurate and outperforms other heuristic methods. |
Authors: | Sarah Sheppard, Haibo Liu, Jianhong Ou, Nathan Lawson, Lihua Julie Zhu |
Maintainer: | Jianhong Ou <[email protected]>; Lihua Julie Zhu <[email protected]> |
License: | GPL-2 |
Version: | 1.45.0 |
Built: | 2024-10-30 04:43:05 UTC |
Source: | https://github.com/bioc/cleanUpdTSeq |
Convert to a GRanges object from a (extended) BED6 file with at least six columns: chrom, chromStart, strEnd, name, score and strand, and optional upstream sequences (including pA sites) and downstream sequences of pA sites
BED6WithSeq2GRangesSeq( file, skip = 1L, withSeq = TRUE, upstream.seq.ind = 7L, downstream.seq.ind = 8L )
BED6WithSeq2GRangesSeq( file, skip = 1L, withSeq = TRUE, upstream.seq.ind = 7L, downstream.seq.ind = 8L )
file |
A character(1) vector, representing a path to a extended BED file containing at least six columns in the order of chrom, chromStart, strEnd, name, score and strand. The strand information must be designated as "+", or "-". Optional fields–upstream sequences (including pA sites) and downstream sequences of pA sites–are allowed. For more details about the BED format, see https://genome.ucsc.edu/FAQ/FAQformat.html#format1. |
skip |
A integer(1) vector, indicating how many rows (header lines) to skip when the BED file is read into R. |
withSeq |
A logical(1) vector, indicating that upstream and downstream sequences flanking pA sites are included in the file |
upstream.seq.ind |
An integer(1),vector delineating the column location of upstream sequences of the putative pA site |
downstream.seq.ind |
An integer(1),vector delineating the column location of downstream sequences of the putative pA site |
An object of GRanges
Haibo Liu, Lihua J. Zhu
testFile <- system.file("extdata", "test.bed", package = "cleanUpdTSeq") peaks <- BED6WithSeq2GRangesSeq(file = testFile, skip = 1L, withSeq = TRUE)
testFile <- system.file("extdata", "test.bed", package = "cleanUpdTSeq") peaks <- BED6WithSeq2GRangesSeq(file = testFile, skip = 1L, withSeq = TRUE)
Computes the conditional a-posterior probabilities of a categorical class variable given independent predictor variables using the Bayes rule.
buildClassifier( Ndata.NaiveBayes, Pdata.NaiveBayes, upstream = 40L, downstream = 30L, wordSize = 6L, alphabet = c("ACGT") )
buildClassifier( Ndata.NaiveBayes, Pdata.NaiveBayes, upstream = 40L, downstream = 30L, wordSize = 6L, alphabet = c("ACGT") )
Ndata.NaiveBayes |
A data.frame, containing features for the negative
training data, described further in |
Pdata.NaiveBayes |
A data.frame, containing features for the positive
training data, described further in |
upstream |
An integer(1) vector, length of upstream sequence to retrieve. |
downstream |
An integer(1) vector, length of downstream sequence to retrieve. |
wordSize |
An integer(1) vector, size of the kmer feature for the upstream sequence. wordSize = 6 should always be used. |
alphabet |
A character(1) vector, a string containing DNA bases. By default, "ACTG". |
An object of class "naiveBayes".
Jianhong Ou
if (interactive()){ data(data.NaiveBayes) classifier <- buildClassifier(data.NaiveBayes$Negative, data.NaiveBayes$Positive) }
if (interactive()){ data(data.NaiveBayes) classifier <- buildClassifier(data.NaiveBayes$Negative, data.NaiveBayes$Positive) }
This function creates a data frame. Fields include peak name, upstream sequence, downstream sequence, and features to be used in classifying the putative polyadenylation site.
buildFeatureVector( peaks, genome = Drerio, upstream = 40L, downstream = 30L, wordSize = 6L, alphabet = "ACGT", sampleType = c("TP", "TN", "unknown"), replaceNAdistance = 30L, method = c("NaiveBayes", "SVM"), fetchSeq = FALSE, return_sequences = FALSE )
buildFeatureVector( peaks, genome = Drerio, upstream = 40L, downstream = 30L, wordSize = 6L, alphabet = "ACGT", sampleType = c("TP", "TN", "unknown"), replaceNAdistance = 30L, method = c("NaiveBayes", "SVM"), fetchSeq = FALSE, return_sequences = FALSE )
peaks |
An object of GRanges that may contain the upstream and downstream sequence information. This item is created by the function BED6WithSeq2GRangesSeq. |
genome |
Name of the genome to get sequences from. To find out a list of available genomes, please type BSgenome::available.genomes() in R. |
upstream |
An integer(1) vector, length of upstream sequence to retrieve. |
downstream |
An integer(1) vector, length of downstream sequence to retrieve. |
wordSize |
An integer(1) vector, size of the kmer feature for the upstream sequence. wordSize = 6 should always be used. |
alphabet |
A character(1) vector, a string containing DNA bases. By default, "ACTG". |
sampleType |
A character(1) vector, indicating type of sequences for building feature vectors. Options are TP (true positive) and TN (true negative) for training data, or unknown for test data. |
replaceNAdistance |
An integer(1) vector, specifying an number for avg.distanceA2PeakEnd, the average distance of As to the putative pA site, when there is no A in the downstream sequence. |
method |
A character(1) vector, specifying a machine learning method to to use. Currently, only "NaiveBayes" is implemented. |
fetchSeq |
A logical (1), indicating whether upstream and downstream sequences should be retrieved from the BSgenome object at this step or not. |
return_sequences |
A logical(1) vector, indicating whether upstream and downstream sequences should be included in the output |
An object of "featureVector
"
Sarah Sheppard, Haibo Liu, Jianhong Ou, Nathan Lawson, Lihua J. Zhu
library(BSgenome.Drerio.UCSC.danRer7) testFile <- system.file("extdata", "test.bed", package = "cleanUpdTSeq") peaks <- BED6WithSeq2GRangesSeq(file = testFile, skip = 1L, withSeq = TRUE) ## build the feature vector for the test set with sequence information testSet.NaiveBayes = buildFeatureVector(peaks, genome = Drerio, upstream = 40L, downstream = 30L, wordSize = 6L, alphabet = "ACGT", sampleType = "unknown", replaceNAdistance = 30, method = "NaiveBayes", fetchSeq = FALSE, return_sequences = TRUE) ## convert the test set to GRanges without upstream and downstream ## sequence information peaks <- BED6WithSeq2GRangesSeq(file = testFile, skip = 1L, withSeq = FALSE) #build the feature vector for the test set without sequence information testSet.NaiveBayes = buildFeatureVector(peaks, genome = Drerio, upstream = 40L, downstream = 30L, wordSize = 6L, alphabet = "ACGT", sampleType = "unknown", replaceNAdistance = 30, method = "NaiveBayes", fetchSeq = TRUE, return_sequences = TRUE)
library(BSgenome.Drerio.UCSC.danRer7) testFile <- system.file("extdata", "test.bed", package = "cleanUpdTSeq") peaks <- BED6WithSeq2GRangesSeq(file = testFile, skip = 1L, withSeq = TRUE) ## build the feature vector for the test set with sequence information testSet.NaiveBayes = buildFeatureVector(peaks, genome = Drerio, upstream = 40L, downstream = 30L, wordSize = 6L, alphabet = "ACGT", sampleType = "unknown", replaceNAdistance = 30, method = "NaiveBayes", fetchSeq = FALSE, return_sequences = TRUE) ## convert the test set to GRanges without upstream and downstream ## sequence information peaks <- BED6WithSeq2GRangesSeq(file = testFile, skip = 1L, withSeq = FALSE) #build the feature vector for the test set without sequence information testSet.NaiveBayes = buildFeatureVector(peaks, genome = Drerio, upstream = 40L, downstream = 30L, wordSize = 6L, alphabet = "ACGT", sampleType = "unknown", replaceNAdistance = 30, method = "NaiveBayes", fetchSeq = TRUE, return_sequences = TRUE)
An object of class "naiveBayes" generated from data.NaiveBayes
classifier
classifier
An object of class "PASclassifier
" including
components:
data(classifier) names(classifier)
data(classifier) names(classifier)
3'ends of transcripts have generally been poorly annotated. With the advent of deep sequencing, many methods have been developed to identify 3'ends. The majority of these methods use an oligodT primer which can bind to internal adenine-rich sequences, and lead to artifactual identification of polyadenylation sites. Heuristic filtering methods rely on a certain number of As downstream of a putative polyadenylation site to classify the site as true or oligodT primed. This package provides a robust method to classify putative polyadenylation sites using a Naive Bayes classifier.
Sarah Sheppard, Haibo Liu, Jianhong Ou, Nathan Lawson, Lihua Julie Zhu
A RData containing negative and positive training data
data.NaiveBayes
data.NaiveBayes
A list with 2 data frame, "Negative" and "Positive". Negative has 9219 observations on the following 4120 variables. And Positive is a data frame with 22770 observations on the following 4120 variables. The format is:
'data.frame': 9219 obs. of 4120 variables:
'data.frame': 22770 obs. of 4120 variables:
Both of them have same structure.
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a factor with levels 0
1
a factor with levels
0
1
a vector of sequence string
a vector of sequence string
library(BSgenome.Drerio.UCSC.danRer7) data(data.NaiveBayes) head(str(data.NaiveBayes$Negative)) head(str(data.NaiveBayes$Positive))
library(BSgenome.Drerio.UCSC.danRer7) data(data.NaiveBayes) head(str(data.NaiveBayes$Negative)) head(str(data.NaiveBayes$Positive))
"featureVector"
An object of class "featureVector"
represents the output of
buildFeatureVector
Objects can be created by calls of the form
new("featureVector", data, info)
.
Retrieve upstream and downstream sequences of pA sites from a BSgenome object based on a GRanges object
getContextSequences(peaks, upstream = 40L, downstream = 30L, genome)
getContextSequences(peaks, upstream = 40L, downstream = 30L, genome)
peaks |
An object of GRanges representing pA sites |
upstream |
An integer(1) vector, length of upstream sequence of pA sites, including pA site. |
downstream |
An integer(1) vector, length of downstream sequences of pA sites |
genome |
An object of BSgenome. |
A data.frame containing sequences upstream and downstream pA sites:
sequence upstream pA site, including pA site
sequence downstream pA site
Haibo Liu
library(BSgenome.Drerio.UCSC.danRer7) testFile <- system.file("extdata", "test.bed", package = "cleanUpdTSeq") peaks <- BED6WithSeq2GRangesSeq(file = testFile, skip = 1L, withSeq = FALSE) peaks_seq <- getContextSequences(peaks, upstream = 40L, downstream = 30L, genome = Drerio)
library(BSgenome.Drerio.UCSC.danRer7) testFile <- system.file("extdata", "test.bed", package = "cleanUpdTSeq") peaks <- BED6WithSeq2GRangesSeq(file = testFile, skip = 1L, withSeq = FALSE) peaks_seq <- getContextSequences(peaks, upstream = 40L, downstream = 30L, genome = Drerio)
"modelInfo"
An object of class "modelInfo"
represents the information of sequence
to use in the analysis
Objects can be created by calls of the form
new("modelInfo", upstream, downstream, wordSize, alphabet)
.
"naiveBayes"
An object of class "naiveBayes"
represents the conditional
a-posterior probabilities of a categorical class variable given independent
predictor variables using the Bayes rule.
Objects can be created by calls of the form
new("naiveBayes", apriori, tables, levels, call)
.
"PASclassifier"
An object of class "PASclassifier"
represents the output of
buildClassifier
Objects can be created by calls of the form
new("PASclassifier", classifier, info)
.
data(classifier) classifier$info$upstream classifier$info$wordSize classifier$info$alphabet
data(classifier) classifier$info$upstream classifier$info$wordSize classifier$info$alphabet
classify putative pA sites into true and false bins.
predictTestSet( Ndata.NaiveBayes = NULL, Pdata.NaiveBayes = NULL, testSet.NaiveBayes, classifier = NULL, outputFile = "test-predNaiveBayes.tsv", assignmentCutoff = 0.5, return_sequences = FALSE )
predictTestSet( Ndata.NaiveBayes = NULL, Pdata.NaiveBayes = NULL, testSet.NaiveBayes, classifier = NULL, outputFile = "test-predNaiveBayes.tsv", assignmentCutoff = 0.5, return_sequences = FALSE )
Ndata.NaiveBayes |
A data.frame, containing features for the negative
training data, which is built using the function
|
Pdata.NaiveBayes |
A data.frame, containing features for the positive
training data, which is built using the function
|
testSet.NaiveBayes |
An object of |
classifier |
An object of class PASclassifier. |
outputFile |
A character(1) vector, file name for outputting prediction results. The prediction output is written to the file, tab separated. |
assignmentCutoff |
A numeric(1) vector, specifying the cutoff for classifying a putative pA site into a true or false pA class. It should be any number between 0 and 1. For example, assignmentCutoff = 0.5 will assign an putative pA site with prob_true_pA > 0.5 to the True class (1), and any putative pA site with prob_true_pA < = 0.5 as False (0). |
return_sequences |
A logical(1) vector, indicating whether upstream and downstream sequences should be included in the output |
A data.frame including all info as described below. The upstream and downstream sequence used in assessing the putative pA site might be included when return_sequences = TRUE.
peak_name |
the name of the putative pA site (originally from the 4th field in the bed file). |
prob_fake_pA |
the probability that the putative pA site is false |
prob_true_pA |
the probability that the putative pA site is true |
pred_class |
the predicted class of the putative pA site, based on the assignment cutoff. 0 = Falsee/oligo(dT) internally primed, 1 = True |
upstream_seq |
the upstream sequence of the putative pA site used in the analysis |
downstream_seq |
the downstream sequence of the putative pA site used in the analysis. |
Sarah Sheppard, Haibo Liu, Jianhong Ou, Nathan Lawson, Lihua J. Zhu
Sheppard S, Lawson ND, Zhu LJ. Accurate identification of polyadenylation sites from 3' end deep sequencing using a naive Bayes classifier. Bioinformatics. 2013;29(20):2564-2571.
library(BSgenome.Drerio.UCSC.danRer7) testFile <- system.file("extdata", "test.bed", package = "cleanUpdTSeq") ## convert the test set to GRanges without upstream and downstream sequence ## information peaks <- BED6WithSeq2GRangesSeq(file = testFile, skip = 1L, withSeq = TRUE) ## build the feature vector for the test set without sequence information testSet.NaiveBayes = buildFeatureVector(peaks, genome = Drerio, upstream = 40L, downstream = 30L, wordSize = 6L, alphabet = c("ACGT"), sampleType = "unknown", replaceNAdistance = 30, method = "NaiveBayes", fetchSeq = TRUE, return_sequences = TRUE) data(data.NaiveBayes) ## sample the test data for code testing, DO NOT do this for real data samp <- c(1:22, sample(23:4118, 50), 4119, 4120) Ndata.NaiveBayes <- data.NaiveBayes$Negative[, samp] Pdata.NaiveBayes <- data.NaiveBayes$Positive[, samp] testSet.NaiveBayes@data <- testSet.NaiveBayes@data[, samp[-1]-1] test_out <- predictTestSet(Ndata.NaiveBayes, Pdata.NaiveBayes, testSet.NaiveBayes, outputFile = tempfile(), assignmentCutoff = 0.5)
library(BSgenome.Drerio.UCSC.danRer7) testFile <- system.file("extdata", "test.bed", package = "cleanUpdTSeq") ## convert the test set to GRanges without upstream and downstream sequence ## information peaks <- BED6WithSeq2GRangesSeq(file = testFile, skip = 1L, withSeq = TRUE) ## build the feature vector for the test set without sequence information testSet.NaiveBayes = buildFeatureVector(peaks, genome = Drerio, upstream = 40L, downstream = 30L, wordSize = 6L, alphabet = c("ACGT"), sampleType = "unknown", replaceNAdistance = 30, method = "NaiveBayes", fetchSeq = TRUE, return_sequences = TRUE) data(data.NaiveBayes) ## sample the test data for code testing, DO NOT do this for real data samp <- c(1:22, sample(23:4118, 50), 4119, 4120) Ndata.NaiveBayes <- data.NaiveBayes$Negative[, samp] Pdata.NaiveBayes <- data.NaiveBayes$Positive[, samp] testSet.NaiveBayes@data <- testSet.NaiveBayes@data[, samp[-1]-1] test_out <- predictTestSet(Ndata.NaiveBayes, Pdata.NaiveBayes, testSet.NaiveBayes, outputFile = tempfile(), assignmentCutoff = 0.5)