Title: | de novo Annotation of Pack-TYPE Transposable Elements |
---|---|
Description: | Algorithm and tools for in silico pack-TYPE transposon discovery. Filters a given genome for properties unique to DNA transposons and provides tools for the investigation of returned matches. Sequences are input in DNAString format, and ranges are returned as a dataframe (in the format returned by as.dataframe(GRanges)). |
Authors: | Jack Gisby [aut, cre] , Marco Catoni [aut] |
Maintainer: | Jack Gisby <[email protected]> |
License: | GPL-2 |
Version: | 1.19.0 |
Built: | 2024-12-05 06:15:25 UTC |
Source: | https://github.com/bioc/packFinder |
The chromosome 3 reference sequence for Arabidopsis
thaliana as a
DNAStringSet
.
Can be used as a test data set, as in the associated
introduction vignette. The DNA sequence between bases
10,500,000 and 14,300,000 was extracted for use in this dataset.
data(arabidopsisThalianaRefseq)
data(arabidopsisThalianaRefseq)
A DNAStringSet
object containing a
DNAString
for
Arabidopsis thaliana's chromosome 3 sequence.
Jack Gisby
The Arabidopsis thaliana genome was downloaded from the
NCBI refseq database on 20/SEP/2019, using
getGenome
, and chromosome 3 was
extracted. The genome may also be accessed from the NCBI
ftp server: ftp://ftp.ncbi.nlm.nih.gov/genomes.
getGenome
,
DNAStringSet
,
DNAString
,
packSearch
data(arabidopsisThalianaRefseq) packMatches <- packSearch( Biostrings::DNAString("CACTACAA"), arabidopsisThalianaRefseq, elementLength = c(300, 3500), tsdLength = 3 )
data(arabidopsisThalianaRefseq) packMatches <- packSearch( Biostrings::DNAString("CACTACAA"), arabidopsisThalianaRefseq, elementLength = c(300, 3500), tsdLength = 3 )
Run BLAST against user-specified databases of non-transposon and transposon-related proteins. Can be used to classify transposons based on their internal sequences.
blastAnalysis( packMatches, Genome, blastPath, protDb = NULL, autoDb = NULL, minE = 0.001, blastTask = "blastn-short", maxHits = 100, threads = 1, saveFolder = NULL, tirCutoff = 0 )
blastAnalysis( packMatches, Genome, blastPath, protDb = NULL, autoDb = NULL, minE = 0.001, blastTask = "blastn-short", maxHits = 100, threads = 1, saveFolder = NULL, tirCutoff = 0 )
packMatches |
A dataframe of potential Pack-TYPE transposable elements,
in the format given by |
Genome |
A DNAStringSet object containing sequences referred to
in |
blastPath |
Path to the BLAST+ executable, or name of the BLAST+ application for Linux/MacOS users. |
protDb |
For assigning Pack-TYPE elements.
Path to the blast database containing nucleotide or protein
sequences to be matched against internal transposon
sequences. Can be generated
using BLAST+, or with
|
autoDb |
For assigning autonomous elements.
Path to the blast database containing nucleotide or protein
sequences to be matched against internal transposon
sequences. Can be generated
using BLAST+, or with
|
minE |
Blast results with e values greater than the specified cutoff will be ignored. |
blastTask |
Type of BLAST+ task, defaults to "blastn-short". |
maxHits |
Maximum hits returned by BLAST+ per query. |
threads |
Allowable number of threads to be utilised by BLAST+. |
saveFolder |
Directory to save BLAST+ results in; defaults to the working directory. |
tirCutoff |
How many bases to ignore at the terminal ends of the transposons to prevent hits to TIR sequences. |
No return value; executes BLAST+ to generate hits which are stored in a .blast file in the chosen directory.
Jack Gisby
For further information, see the NCBI BLAST+ application documentation and help pages (https://www.ncbi.nlm.nih.gov/pubmed/20003500?dopt=Citation).
blastAnnotate
,
readBlast
, packBlast
## Not run: packMatches <- data(packMatches) Genome <- data(arabidopsisThalianaRefseq) blastAnalysis(packMatches, Genome, protDb = "C:/data/TAIR10_CDS", autoDb = "C:/data/TAIR10_transposons", blastPath = "C:/blast/bin/blastn.exe") ## End(Not run)
## Not run: packMatches <- data(packMatches) Genome <- data(arabidopsisThalianaRefseq) blastAnalysis(packMatches, Genome, protDb = "C:/data/TAIR10_CDS", autoDb = "C:/data/TAIR10_transposons", blastPath = "C:/blast/bin/blastn.exe") ## End(Not run)
Uses hits, previously generated using blast, to annotate transposon hits. Transposons with non-redundant transposase hits are classed as autonomous ("auto"), while others are classed as "other" or "pack" based on whether the element has non-redundant hits to other proteins.
blastAnnotate(protHits, autoHits, packMatches)
blastAnnotate(protHits, autoHits, packMatches)
protHits |
BLAST results for non-transposon related genes or
proteins (as a |
autoHits |
BLAST results for transposon related genes or
proteins (as a |
packMatches |
A dataframe of potential Pack-TYPE transposable elements,
in the format given by |
Returns the original packMatches
dataframe,
with the addition of a "classification" column
containing one of the following values:
auto - elements that match known transposases or transposon-related proteins are classified as autonomous elements
pack - elements that match other proteins or genic sequences may be classified as Pack-TYPE elements
other - elements that generate no significant hits
Requires that the query ids in the protein and autonomous hits match the row names in packMatches.
Jack Gisby
For further information, see the NCBI BLAST+ application documentation and help pages (https://www.ncbi.nlm.nih.gov/pubmed/20003500?dopt=Citation).
blastAnalysis
,
readBlast
, packBlast
data("packMatches") # read in some protein hits p <- data.frame( query_id = c(2, 3), subject_id = c("prot", "hyp") ) # read in some autonomous hits a <- data.frame( query_id = c(3, 4), subject_id = c("transposase", "mutator") ) blastAnnotate(p, a, packMatches)
data("packMatches") # read in some protein hits p <- data.frame( query_id = c(2, 3), subject_id = c("prot", "hyp") ) # read in some autonomous hits a <- data.frame( query_id = c(3, 4), subject_id = c("transposase", "mutator") ) blastAnnotate(p, a, packMatches)
The sequences predicted by packSearch
often
overlap, which may be due to the presence of closely
interspersed elements or false TIR identification.
In such cases, these elements can be combined using
link[GenomicRanges:GRanges-class]{GRanges}
in order to collapse overlapping elements, preventing
over-estimation of transposon numbers. Also removes
duplicate elements that have been generated in the
case of multiple searches.
collapseSeqs(packMatches, Genome)
collapseSeqs(packMatches, Genome)
packMatches |
A dataframe containing genomic ranges and names referring
to sequences to be extracted. This dataframe is in the format
produced by coercing a
Must contain the following features:
|
Genome |
A DNAStringSet object containing sequences referred to
in |
A set of non-overlapping transposon sequences in the format of the input dataframe.
Jack Gisby
packSearch
,
link[GenomicRanges:GRanges-class]{GRanges}
data(packMatches) data(arabidopsisThalianaRefseq) packMatches$start <- 1 packMatches$end <- 10 collapseSeqs(packMatches, arabidopsisThalianaRefseq)
data(packMatches) data(arabidopsisThalianaRefseq) packMatches$start <- 1 packMatches$end <- 10 collapseSeqs(packMatches, arabidopsisThalianaRefseq)
Takes transposable elements detected by
packSearch
and removes those with large
numbers of wildcard ("N") bases. Used by
packClust
and packAlign
to
remove poor quality sequences that may interfere with
the quality of sequence alignments.
filterWildcards(packMatches, Genome, maxWildcards = 0.05)
filterWildcards(packMatches, Genome, maxWildcards = 0.05)
packMatches |
A dataframe containing genomic ranges and names referring to sequences to be extracted. |
Genome |
The original set of sequences used to generate the
transposons detected by |
maxWildcards |
The maximal allowable proportion of wildcards in the
sequence of each match (defaults to |
The original dataframe, packMatches
, with
sequences removed that are found to contain a proportion
of wildcards ("N") greater than that specified in
maxWildcards
.
Jack Gisby
packClust
, packAlign
, packMatches
,
data(arabidopsisThalianaRefseq)
data(arabidopsisThalianaRefseq) data(packMatches) filteredMatches <- filterWildcards( packMatches, arabidopsisThalianaRefseq, maxWildcards = 0.05 )
data(arabidopsisThalianaRefseq) data(packMatches) filteredMatches <- filterWildcards( packMatches, arabidopsisThalianaRefseq, maxWildcards = 0.05 )
Method to quickly extract the sequences of predicted
Pack-TYPE elements (as created by packSearch
).
getPackSeqs(packMatches, Genome, output = "DNAStringSet")
getPackSeqs(packMatches, Genome, output = "DNAStringSet")
packMatches |
A dataframe containing genomic ranges and names referring
to sequences to be extracted. This dataframe is in the format
produced by coercing a
Must contain the following features:
|
Genome |
A DNAStringSet object containing sequences referred to
in |
output |
The type of object to be returned:
|
transposon sequences extracted from packMatches
.
At default returns the sequences as a
DNAStringSet
or, if output
is set to "character", returns a
character vector.
Jack Gisby
DNAStringSet
,
packSearch
,
DNAString
data(arabidopsisThalianaRefseq) packMatches <- packSearch( Biostrings::DNAString("CACTACAA"), arabidopsisThalianaRefseq, elementLength = c(300, 3500), tsdLength = 3 ) packSeqs <- getPackSeqs(packMatches, arabidopsisThalianaRefseq)
data(arabidopsisThalianaRefseq) packMatches <- packSearch( Biostrings::DNAString("CACTACAA"), arabidopsisThalianaRefseq, elementLength = c(300, 3500), tsdLength = 3 ) packSeqs <- getPackSeqs(packMatches, arabidopsisThalianaRefseq)
Retrieves a dataframe of potential Pack-TYPE elements,
previously saved using packSearch
followed
by packsToCsv
.
getPacksFromCsv(file)
getPacksFromCsv(file)
file |
File path to predicted transposons in CSV format. |
Dataframe in the format used by packSearch
.
Jack Gisby
packsToCsv
, read.table
,
packSearch
data(packMatches) packMatches <- getPacksFromCsv( system.file("extdata", "packMatches.csv", package = "packFinder") )
data(packMatches) packMatches <- getPacksFromCsv( system.file("extdata", "packMatches.csv", package = "packFinder") )
Retrieves a dataframe of potential Pack-TYPE elements,
previously saved using packSearch
followed
by packsToFasta
. Parses the .fasta file
and title field containing:
seqnames - name of origin sequence
start - transposon base start position on origin sequence
end - transposon base end position on origin sequence
width - width of transposon
strand - direction of transposon ("+", "-" or "*")
TSD - terminal site duplication (TSD) sequence
getPacksFromFasta(file)
getPacksFromFasta(file)
file |
Path to predicted transposons in FASTA format. |
Dataframe in the format used by packSearch
.
Jack Gisby
data(arabidopsisThalianaRefseq) data(packMatches) packMatches <- getPacksFromFasta( system.file("extdata", "packMatches.fasta", package = "packFinder") )
data(arabidopsisThalianaRefseq) data(packMatches) packMatches <- getPacksFromFasta( system.file("extdata", "packMatches.fasta", package = "packFinder") )
A link[GenomicRanges:GRanges-class]{GRanges}
object, potentially
generated using packSearch
and
packsToGRanges
, can be converted to a
dataframe. If a GRanges object is supplied without TSD
information, this can be calculated and appended to the
final dataframe.
getPacksFromGRanges(packGRanges, Genome = NULL, tsdLength = NULL)
getPacksFromGRanges(packGRanges, Genome = NULL, tsdLength = NULL)
packGRanges |
|
Genome |
(optional) Sequences referred to by |
tsdLength |
(optional) Length of TSD sequences. |
Dataframe in the format used by packSearch
.
If Genome
and tsdLength
are supplied, then
TSD sequences are retrieved and returned
as part of the dataframe.
Jack Gisby
packsToGRanges
,
link[GenomicRanges:GRanges-class]{GRanges}
,
packSearch
data(packMatches) GRangesObject <- packsToGRanges(packMatches) packMatches <- getPacksFromGRanges(GRangesObject)
data(packMatches) GRangesObject <- packsToGRanges(packMatches) packMatches <- getPacksFromGRanges(GRangesObject)
Gets the flanking TSD sequences of TIRs or predicted
Pack-TYPE transposable elements. A dataframe of these
elements can be in tirMatches
.
getTsds(tirMatches, Genome, tsdLength, strand = "+", output = "character")
getTsds(tirMatches, Genome, tsdLength, strand = "+", output = "character")
tirMatches |
A dataframe containing genomic ranges and names referring
to TIR sequences or predicted Pack-TYPE transposable
elements. Should be in the format used by
|
Genome |
A |
tsdLength |
The length of the TSD region to be retrieved (integer). |
strand |
The strand of the TIR; "+" for forward, "-" for reverse. If the TSD sequences of transposable elements are being predicted, then this parameter can be left as default ("+"); if the TSD sequences of TIRs are being found then the strand direction must be supplied. |
output |
The type of object to be returned:
|
Called by packSearch
. It is recommended
to use the general pipeline function
packSearch
for identification of potential
pack elements, which returns TSD sequences as a feature
of results, however each stage may be called individually.
Flanking TSD sequences as a vector of characters, or if
output is specified as "DNAStringSet", TSD sequences
will be returned as a
DNAStringSet
object.
Jack Gisby
DNAStringSet
,
packSearch
, tirMatches
data(arabidopsisThalianaRefseq) data(packMatches) tsdSeqs <- getTsds(packMatches, arabidopsisThalianaRefseq, 3)
data(arabidopsisThalianaRefseq) data(packMatches) tsdSeqs <- getTsds(packMatches, arabidopsisThalianaRefseq, 3)
Primary filtering stage for the packSearch
algorithm. Identifies potential Pack-TYPE transposable
elements based on proximity of matching inverted repeats
and equality of TSD sequences.
identifyPotentialPackElements( forwardMatches, reverseMatches, Genome, elementLength, tsdMismatch = 0 )
identifyPotentialPackElements( forwardMatches, reverseMatches, Genome, elementLength, tsdMismatch = 0 )
forwardMatches |
A dataframe containing genomic ranges and names referring to forwards-facing TIR sequences and their respective TSD sequences. |
reverseMatches |
A dataframe containing genomic ranges and names referring to reverse-facing TIR sequences and their respective TSD sequences. |
Genome |
A DNAStringSet object containing the matches referred to
in |
elementLength |
A vector of two integers containing the minimum and maximum transposable element length. |
tsdMismatch |
An integer referring to the allowable mismatch
(substitutions or indels) between a transposon's TSD
sequences. |
Used by packSearch
as a primariy filtering
stage. Identifies matches likely to be transposons based
on their TIR region, from identifyTirMatches
,
and their TSD region, from getTsds
. It is
recommended to use the general pipeline function
packSearch
for identification of potential
pack elements, however each stage may be called
individually. Note that only exact TSD matches are
considered, so supplying long sequences for TSD elements
may lead to false-negative results.
A dataframe, packMatches
, containing the locations
of potential Pack-TYPE transposable elements in Genome
.
Jack Gisby
packSearch
data(arabidopsisThalianaRefseq) forwardMatches <- identifyTirMatches( Biostrings::DNAString("CACTACAA"), arabidopsisThalianaRefseq, tsdLength = 3, strand = "+" ) reverseMatches <- identifyTirMatches( Biostrings::reverseComplement(Biostrings::DNAString("CACTACAA")), arabidopsisThalianaRefseq, tsdLength = 3, strand = "-" ) packMatches <- identifyPotentialPackElements( forwardMatches, reverseMatches, arabidopsisThalianaRefseq, c(300, 3500) )
data(arabidopsisThalianaRefseq) forwardMatches <- identifyTirMatches( Biostrings::DNAString("CACTACAA"), arabidopsisThalianaRefseq, tsdLength = 3, strand = "+" ) reverseMatches <- identifyTirMatches( Biostrings::reverseComplement(Biostrings::DNAString("CACTACAA")), arabidopsisThalianaRefseq, tsdLength = 3, strand = "-" ) packMatches <- identifyPotentialPackElements( forwardMatches, reverseMatches, arabidopsisThalianaRefseq, c(300, 3500) )
Searches a DNAStringSet
for potential TIRs based on sequence similarity.
identifyTirMatches( tirSeq, Genome, mismatch = 0, strand = "*", tsdLength, fixed = TRUE )
identifyTirMatches( tirSeq, Genome, mismatch = 0, strand = "*", tsdLength, fixed = TRUE )
tirSeq |
A |
Genome |
A |
mismatch |
The allowable mismatch between |
strand |
The directionality of the search string ("+" or "-"). Note that this does affect the search for tirSeqs, if you wish to search the reverse strand you should use the reverse complement of your sequence. |
tsdLength |
Integer referring to the length of the flanking TSD region. |
fixed |
Logical that will be passed to the 'fixed' argument of
|
Called by packSearch
. Used by
packSearch
as an initial filtering stage.
matchPattern
from Biostrings
is used for pattern matching. It is recommended to use
the general pipeline function packSearch
for identification of potential pack elements, however
each stage may be called individually.
A dataframe, tirMatches
, containing identified
matches. The dataframe is in the format generated by
packSearch
.
Jack Gisby
DNAStringSet
,
packSearch
, matchPattern
,
DNAString
data(arabidopsisThalianaRefseq) forwardMatches <- identifyTirMatches( Biostrings::DNAString("CACTACAA"), arabidopsisThalianaRefseq, tsdLength = 3, strand = "+" )
data(arabidopsisThalianaRefseq) forwardMatches <- identifyTirMatches( Biostrings::DNAString("CACTACAA"), arabidopsisThalianaRefseq, tsdLength = 3, strand = "+" )
Generates a BLAST database to be queried. Required for identifying sequences using the BLAST+ software.
makeBlastDb(fastaFile, dbPath, blastPath, dbType = "nucl")
makeBlastDb(fastaFile, dbPath, blastPath, dbType = "nucl")
fastaFile |
FASTA file containing sequences to generate a BLAST database from. |
dbPath |
Path to save the BLAST database to. |
blastPath |
Path/name of BLAST program to use. Name of the application for Linux/MacOS, absolute path for the executable for windows users. |
dbType |
Type of BLAST database to create, e.g. "nucl" for a nucleotide database. |
No return value; generates a blast database in the chosen directory.
Jack Gisby
For further information, see the NCBI BLAST+ application documentation and help pages (https://www.ncbi.nlm.nih.gov/pubmed/20003500?dopt=Citation).
## Not run: makeBlastDb("genes.fasta", "blastdb.db", "C:/blast.exe") ## End(Not run)
## Not run: makeBlastDb("genes.fasta", "blastdb.db", "C:/blast.exe") ## End(Not run)
A global pairwise alignment of pack-TYPE elements by
sequence similarity. mIt may be useful to run
packClust
to identify groups of similar
transposable elements, before generating alignments of
each group.
packAlign( packMatches, Genome, identity = 0, threads = 1, identityDefinition = 2, maxWildcards = 0.05, saveFolder, vSearchPath = "vsearch" )
packAlign( packMatches, Genome, identity = 0, threads = 1, identityDefinition = 2, maxWildcards = 0.05, saveFolder, vSearchPath = "vsearch" )
packMatches |
A dataframe of potential Pack-TYPE transposable elements,
in the format given by |
Genome |
A DNAStringSet object containing sequences referred to in
|
identity |
The sequence identity of two transposable elements in
|
threads |
The number of threads to be used by VSEARCH. |
identityDefinition |
The pairwise identity definition used by VSEARCH. Defaults to 2, the standard VSEARCH definition. |
maxWildcards |
The maximal allowable proportion of wildcards in the
sequence of each match (defaults to |
saveFolder |
The folder to save saveFolder files (uc, blast6out, FASTA) |
vSearchPath |
When the package is run on windows systems, the location of the VSEARCH executable file must be given; this should be left as default on Linux/MacOS systems. |
Saves alignment information, including a uc
,
blast6out
and a pairwise alignment fasta
file, to the specified location. Returns the uc summary
file generated by the alignment.
In order to align sequences using VSEARCH, the executable file must first be installed.
Jack Gisby
VSEARCH may be downloaded from https://github.com/torognes/vsearch, along with a manual documenting the program's parameters. See https://www.ncbi.nlm.nih.gov/pubmed/27781170 for further information.
tirClust
, packClust
,
readBlast
, readUc
,
filterWildcards
, packSearch
data(arabidopsisThalianaRefseq) data(packMatches) # packAlign run on a Linux/MacOS system ## Not run: packAlign(packMatches, Genome) ## End(Not run) # packAlign run on a Windows system ## Not run: packAlign(packMatches, Genome, vSearchPath = "path/to/vsearch/vsearch.exe") ## End(Not run)
data(arabidopsisThalianaRefseq) data(packMatches) # packAlign run on a Linux/MacOS system ## Not run: packAlign(packMatches, Genome) ## End(Not run) # packAlign run on a Windows system ## Not run: packAlign(packMatches, Genome, vSearchPath = "path/to/vsearch/vsearch.exe") ## End(Not run)
Run BLAST against user-specified databases of non-transposon and transposon-relates proteins. Can be used to classify transposons based on their internal sequences.
packBlast( packMatches, Genome, blastPath, protDb, autoDb, minE = 0.001, blastTask = "blastn-short", maxHits = 100, threads = 1, saveFolder = NULL, tirCutoff = 100, autoCutoff = 1e-05, autoLength = 150, autoIdentity = 70, autoScope = NULL, protCutoff = 1e-05, protLength = 250, protIdentity = 70, protScope = 0.3 )
packBlast( packMatches, Genome, blastPath, protDb, autoDb, minE = 0.001, blastTask = "blastn-short", maxHits = 100, threads = 1, saveFolder = NULL, tirCutoff = 100, autoCutoff = 1e-05, autoLength = 150, autoIdentity = 70, autoScope = NULL, protCutoff = 1e-05, protLength = 250, protIdentity = 70, protScope = 0.3 )
packMatches |
A dataframe of potential Pack-TYPE transposable elements,
in the format given by |
Genome |
A DNAStringSet object containing sequences referred to
in |
blastPath |
Path to the BLAST+ executable, or name of the BLAST+ application for Linux/MacOS users. |
protDb |
For assigning Pack-TYPE elements.
Path to the blast database containing nucleotide or protein
sequences to be matched against internal transposon
sequences. Can be generated
using BLAST+, or with
|
autoDb |
For assigning autonomous elements.
Path to the blast database containing nucleotide or protein
sequences to be matched against internal transposon
sequences. Can be generated
using BLAST+, or with
|
minE |
Blast results with e values greater than the specified cutoff will be ignored. This will be passed to BLASTN and applied to both transposon and non-transposon matches. |
blastTask |
Type of BLAST+ task, defaults to "blastn-short". |
maxHits |
Maximum hits returned by BLAST+ per query. |
threads |
Allowable number of threads to be utilised by BLAST+. |
saveFolder |
Directory to save BLAST+ results in; defaults to the working directory. |
tirCutoff |
How many bases to ignore at the terminal ends of the transposons to prevent hits to TIR sequences. |
autoCutoff |
Blast results for transposon-related elements will be filtered to ignore those with e values above the specified cutoff. |
autoLength |
Blast results for transposon-related elements containing hits with alignment lengths lower than this value will be ignored |
autoIdentity |
Blast results for transposon-related elements containing hits with sequence identities lower than this value will be ignored |
autoScope |
If specified, transposon-related blast results below the specified value will be ignored. Note that the dataframe of transposon matches must also be supplied to calculate scope. Scope is the proportion of the transposon's internal sequence occupied by the BLAST hit. |
protCutoff |
Blast results for genic/other matches will be filtered to ignore those with e values above the specified cutoff. |
protLength |
Blast results for genic/other matches containing hits with alignment lengths lower than this value will be ignored |
protIdentity |
Blast results for genic/other matches containing hits with sequence identities lower than this value will be ignored |
protScope |
If specified, genic/other blast matches below the specified value will be ignored. Note that the dataframe of transposon matches must also be supplied to calculate scope. Scope is the proportion of the transposon's internal sequence occupied by the BLAST hit. |
Returns the original packMatches
dataframe,
with the addition of a "classification" column
containing one of the following values:
auto - elements that match known transposases or transposon-related proteins are classified as autonomous elements
pack - elements that match other proteins or genic sequences may be classified as Pack-TYPE elements
other - elements that generate no significant hits
Jack Gisby
For further information, see the NCBI BLAST+ application documentation and help pages (https://www.ncbi.nlm.nih.gov/pubmed/20003500?dopt=Citation).
blastAnalysis
, packSearch
,
readBlast
, blastAnnotate
## Not run: packMatches <- data(packMatches) Genome <- data(arabidopsisThalianaRefseq) packBlast(packMatches, Genome, protDb = "C:/data/TAIR10_CDS", autoDb = "C:/data/TAIR10_transposons", blastPath = "C:/blast/bin/blastn.exe") ## End(Not run)
## Not run: packMatches <- data(packMatches) Genome <- data(arabidopsisThalianaRefseq) packBlast(packMatches, Genome, protDb = "C:/data/TAIR10_CDS", autoDb = "C:/data/TAIR10_transposons", blastPath = "C:/blast/bin/blastn.exe") ## End(Not run)
Cluster potential pack-TYPE elements by sequence
similarity. Resulting groups may be aligned with
packAlign
, or the clusters may be
analysed with tirClust
packClust( packMatches, Genome, identity = 0.6, threads = 1, identityDefinition = 2, maxWildcards = 0.05, strand = "both", saveFolder = NULL, vSearchPath = "vsearch" )
packClust( packMatches, Genome, identity = 0.6, threads = 1, identityDefinition = 2, maxWildcards = 0.05, strand = "both", saveFolder = NULL, vSearchPath = "vsearch" )
packMatches |
A dataframe of potential Pack-TYPE transposable elements,
in the format given by |
Genome |
A DNAStringSet object containing sequences referred to
in |
identity |
The sequence identity of two transposable elements in
|
threads |
The number of threads to be used by VSEARCH. |
identityDefinition |
The pairwise identity definition used by VSEARCH. Defaults to 2, the standard VSEARCH definition. |
maxWildcards |
The maximal allowable proportion of wildcards in the
sequence of each match (defaults to |
strand |
The strand direction (+, - or *) to be clustered. |
saveFolder |
The folder to save output files (uc, blast6out, FASTA) |
vSearchPath |
When the package is run on windows systems, the location of the VSEARCH executable file must be given; this should be left as default on Linux/MacOS systems. |
Saves cluster information, including a uc
and
blast6out
file, to the specified location. Returns
the given packMatches
dataframe with an additional
column, cluster
, containing cluster IDs.
In order to cluster sequences using VSEARCH, the executable file must first be installed.
Jack Gisby
VSEARCH may be downloaded from https://github.com/torognes/vsearch. See https://www.ncbi.nlm.nih.gov/pubmed/27781170 for further information.
tirClust
, packAlign
,
readBlast
, readUc
,
filterWildcards
, packSearch
data(arabidopsisThalianaRefseq) data(packMatches) # packClust run on a Linux/MacOS system ## Not run: packClust(packMatches, Genome) ## End(Not run) # packClust run on a Windows system ## Not run: packClust(packMatches, Genome, vSearchPath = "path/to/vsearch/vsearch.exe") ## End(Not run)
data(arabidopsisThalianaRefseq) data(packMatches) # packClust run on a Linux/MacOS system ## Not run: packClust(packMatches, Genome) ## End(Not run) # packClust run on a Windows system ## Not run: packClust(packMatches, Genome, vSearchPath = "path/to/vsearch/vsearch.exe") ## End(Not run)
Algorithm and tools for in silico pack-TYPE transposon discovery. Filters a given genome for properties unique to DNA transposons and provides tools for the investigation of returned matches.
The goal of packFinder was to implement a simple tool for the prediction of potential Pack-TYPE elements. packFinder uses the following prior knowledge, provided by the user, to detect transposons:
Terminal Inverted Repeat (TIR) Base Sequence
Length of Terminal Site Duplication (TSD)
Length of the Transposon
These features provide enough information to detect autonomous and pack-TYPE elements. For a transposon to be predicted by packFinder its TSD sequences must be identical to each other, its forward TIR sequence must match the base sequence provided and its reverse TIR sequence must match its reverse complement.
Transposons are therefore predicted by searching a given genome for these characteristics, and further analysis steps can reveal the nature of these elements - while the packFinder tool is sensitive for the detection of transposons, it does not discriminate between autonomous and Pack-TYPE elements. Autonomous elements will contain a transposase gene within the terminal inverted repeats and tend to be larger than their Pack-TYPE counterparts; pack-TYPE elements instead capture sections of host genomes. Following cluster analysis, BLAST can be used to discern which predicted elements are autonomous (transposase-containing) and with are true Pack-TYPE elements.
An example of a standard workflow can be found using
browseVignettes(package = "packFinder")
. The
primary functions include:
packSearch - the packSearch algorithm uses simple pattern matching to detect DNA transposons.
packClust - VSEARCH is used for clustering elements based on sequence similarity.
Having obtained the sequences of transposable elements in a given genome, it is recommende to carry out a BLAST search for each transposon cluster. This can identify which elements are likely autonomous, and which may be Pack-TYPE.
The packFinder
functions report the
position of elements in a given genome using a
dataframe in the format of packMatches
.
This dataframe is in the format produced by
coercing a link[GenomicRanges:GRanges-class]{GRanges}
object to a dataframe: data.frame(GRanges)
.
Jack Gisby
A sample output from packSearch
with
cluster information. This dataframe is in the format produced by
coercing a link[GenomicRanges:GRanges-class]{GRanges}
object to a dataframe: data.frame(GRanges)
.
data(packMatches)
data(packMatches)
A dataframe of 9 obs. and 7 variables.
Was obtained from running packSearch
on the Arabidopsis thaliana chromosome 3 reference
sequence, followed by clustering using
packClust
. Contains the following features:
start - the predicted element's start base sequence position.
end - the predicted element's end base sequence position.
seqnames - character string referring to the
sequence name in Genome
to which start
and end
refer to.
The dataset was generated as in the example below.
packSearch
, data.frame
,
arabidopsisThalianaRefseq
data(arabidopsisThalianaRefseq) packMatches <- packSearch( Biostrings::DNAString("CACTACAA"), arabidopsisThalianaRefseq, elementLength = c(300, 3500), tsdLength = 3 )
data(arabidopsisThalianaRefseq) packMatches <- packSearch( Biostrings::DNAString("CACTACAA"), arabidopsisThalianaRefseq, elementLength = c(300, 3500), tsdLength = 3 )
General use pipeline function for the Pack-TYPE transposon finding algorithm.
packSearch( tirSeq, Genome, mismatch = 0, elementLength, tsdLength, tsdMismatch = 0, fixed = TRUE )
packSearch( tirSeq, Genome, mismatch = 0, elementLength, tsdLength, tsdMismatch = 0, fixed = TRUE )
tirSeq |
A |
Genome |
A |
mismatch |
The maximum edit distance to be considered for TIR
matches (indels + substitions). See
|
elementLength |
The maximum element length to be considered, as a vector
of two integers. E.g. |
tsdLength |
Integer referring to the length of the flanking TSD region. |
tsdMismatch |
An integer referring to the allowable mismatch
(substitutions or indels) between a transposon's TSD
sequences. |
fixed |
Logical that will be passed to the 'fixed' argument of
|
Finds potential pack-TYPE elements based on:
Similarity of TIR sequence to tirSeq
Proximity of potential TIR sequences
Directionality of TIR sequences
Similarity of TSD sequences
The algorithm finds potential forward and reverse TIR
sequences using identifyTirMatches
and
their associated TSD sequence via getTsds
.
The main filtering stage,
identifyPotentialPackElements
, filters
matches to obtain a dataframe of potential PACK elements.
Note that this pipeline does not consider the
possibility of discovered elements being autonomous
elements, so it is recommended to cluster and/or BLAST
elements for further analysis. Furthermore, only exact TSD
matches are considered, so supplying long sequences for
TSD elements may lead to false-negative results.
A dataframe, containing elements identified by thealgorithm. These may be autonomous or pack-TYPE elements. Will contain the following features:
start - the predicted element's start base sequence position.
end - the predicted element's end base sequence position.
seqnames - character string referring to the
sequence name in Genome
to which start
and end
refer to.
width - the width of the predicted element.
strand - the strand direction of the
transposable element. This will be set to "*" as the
packSearch
function does not consider
transposons to have a direction - only TIR sequences.
Passing the packMatches
dataframe to
packClust
will assign a direction to
each predicted Pack-TYPE element.
This dataframe is in the format produced by
coercing a link[GenomicRanges:GRanges-class]{GRanges}
object to a dataframe: data.frame(GRanges)
. Downstream
functions, such as packClust
, use this
dataframe to manipulate predicted transposable elements.
This algorithm does not consider:
Autonomous elements - autonomous elements will
be predicted by this algorithm as there is no BLAST
step. It is recommended that, after clustering
elements using packClust
, the user
analyses each group to determine which predicted
elements are autonomous and which are likely
Pack-TYPE elements. Alternatively, databases such as
Repbase (https://www.girinst.org/repbase/)
supply annotations for autonomous transposable
elements that can be used to filter autonomous matches.
TSD Mismatches - if two TIRs do not have exact matches for their terminal site duplications they will be ignored. Supplying longer TSD sequences will likely lead to a lower false-positive rate, however may also cause a greater rate of false-negative results.
Pattern matching is done via matchPattern
.
Jack Gisby
identifyTirMatches
, getTsds
,
identifyPotentialPackElements
, packClust
,
packMatches
,
DNAStringSet
,
DNAString
,
matchPattern
data(arabidopsisThalianaRefseq) packMatches <- packSearch( Biostrings::DNAString("CACTACAA"), arabidopsisThalianaRefseq, elementLength = c(300, 3500), tsdLength = 3 )
data(arabidopsisThalianaRefseq) packMatches <- packSearch( Biostrings::DNAString("CACTACAA"), arabidopsisThalianaRefseq, elementLength = c(300, 3500), tsdLength = 3 )
Saves a dataframe of potential Pack-TYPE elements,
usually generated via packSearch
. May be
retrieved using getPacksFromCsv
.
packsToCsv(packMatches, file)
packsToCsv(packMatches, file)
packMatches |
A dataframe containing genomic ranges and names referring
to sequences to be extracted. Can be obtained from
|
file |
CSV file save path. |
Save location of csv file.
Jack Gisby
getPacksFromCsv
, write.table
,
packSearch
data(packMatches) packsToCsv( packMatches, system.file("extdata", "packMatches.csv", package = "packFinder") )
data(packMatches) packsToCsv( packMatches, system.file("extdata", "packMatches.csv", package = "packFinder") )
Saves a dataframe of potential Pack-TYPE elements,
usually generated via packSearch
.
May be retrieved using getPacksFromFasta
.
packsToFasta(packMatches, file, Genome)
packsToFasta(packMatches, file, Genome)
packMatches |
taframe containing genomic ranges and names referring
to sequences to be extracted. Can be obtained from
|
file |
FASTA file save path. |
Genome |
A DNAStringSet object containing sequences referred to
in |
Save location of Fasta file.
Jack Gisby
data(arabidopsisThalianaRefseq) data(packMatches) packsToFasta( packMatches, system.file("extdata", "packMatches.fasta", package = "packFinder"), arabidopsisThalianaRefseq )
data(arabidopsisThalianaRefseq) data(packMatches) packsToFasta( packMatches, system.file("extdata", "packMatches.fasta", package = "packFinder"), arabidopsisThalianaRefseq )
A dataframe containing genomic ranges and names referring
to sequences to be extracted, likely obtained from
packSearch
, can be converted to a GRanges
object. Can be converted back to a dataframe using
getPacksFromGRanges
. Additional features,
such as clusters and TSD sequences, will be included in
the object as metadata columns.
packsToGRanges(packMatches)
packsToGRanges(packMatches)
packMatches |
A dataframe containing genomic ranges and names
referring to sequences to be extracted. Can be obtained
from
|
A GRanges object containing the ranges contained in
packMatches
and additional metadata columns. May
be easily converted between dataframe and GRanges format
for use in the packFinder
package and
link[GenomicRanges:GRanges-class]{GRanges}
package. Note that most functions in the packFinder
package require sequence ranges to be provided in
dataframe format.
Jack Gisby
getPacksFromGRanges
,
link[GenomicRanges:GRanges-class]{GRanges}
data(packMatches) packGRanges <- packsToGRanges(packMatches)
data(packMatches) packGRanges <- packsToGRanges(packMatches)
Reads .blast6out files (NCBI Blast Format) generated by the VSEARCH clustering and alignment algorithms.
readBlast( file, minE = 1, length = 0, identity = 0, removeExactMatches = FALSE, scope = NULL, packMatches = NULL )
readBlast( file, minE = 1, length = 0, identity = 0, removeExactMatches = FALSE, scope = NULL, packMatches = NULL )
file |
The file path of the blast file. |
minE |
Blast results with e values greater than the specified cutoff will be ignored. |
length |
Blast results alignment lengths lower below this value will be ignored |
identity |
Blast results with target sequence identities below this value will be ignored. |
removeExactMatches |
If true, matches with 100 be ignored to prevent self-hits. |
scope |
If specified, blast results below the specified value will be ignored. Note that the dataframe of transposon matches must also be supplied to calculate scope. Scope is the proportion of the transposon's internal sequence occupied by the BLAST hit. |
packMatches |
taframe containing genomic ranges and names referring
to sequences to be extracted. Can be obtained from
|
blast6out file is tab-separated text file compatible with NCBI BLAST m8 and NCBI BLAST+ outfmt 6 formats. One cluster/alignment can be found for each line.
A dataframe containing the converted .blast6out file. The file contains the following features:
Query sequence ID
Target sequence ID
Percenty sequence identity
Alignment length
Number of mismatches
Number of gaps
Base position of alignment start in query sequence
Base position of alignment end in query sequence
Base position of alignment start in target sequence
Base position of alignment end in target sequence
E-value
Bit score
Jack Gisby
For further information, see the NCBI BLAST+ application documentation and help pages (https://www.ncbi.nlm.nih.gov/pubmed/20003500?dopt=Citation). VSEARCH may be downloaded from https://github.com/torognes/vsearch; see https://www.ncbi.nlm.nih.gov/pubmed/27781170 for further information.
codeblastAnalysis, codeblastAnnotate, codepackAlign, codereadUc, codepackClust
readBlast(system.file( "extdata", "packMatches.blast6out", package = "packFinder" ))
readBlast(system.file( "extdata", "packMatches.blast6out", package = "packFinder" ))
Reads .uc files (USEARCH Cluster Format) generated by the VSEARCH clustering and alignment algorithms.
readUc(file, output = "cluster")
readUc(file, output = "cluster")
file |
The file path of the .uc file. |
output |
The type of analysis that was carried out to produce the .uc file.
Note that clustering produces one "H" record for each sequence, and one "C" record for each cluster, while an alignment produces an "H" record for each alignment (see details). |
USEARCH cluster format is a tab separated text file that contains clustering and/or alignment information for a set of sequences. For each sequence a record type, "H, C or N", is provided providing information about the type of "hit" in the dataframe. These refer to:
H - Hit - for alignments, indicates an identified alignment of two supplied sequences. For clustering, indicates the cluster assignment for a query.
C - Cluster record - a record for each cluster generated.
N - No hit - indicates that no cluster was assigned or no alignment was found with a target sequence. For clustering, a query with no hits becomes the centroid of a new cluster.
Additionally, for each record a "compressed alignment" is generated. This is the alignment represented in a compact format including the letters "M", "D", and "I". Before each letter, the number of consecutive columns of the given letter type is also given. The letter types are as follows:
"M" - Match - Identical bases between the query and target sequence
"D" - Deletion - A gap in the target sequence
"I" - Insertion - A gap in the query sequence
An example of this would be "13M", referring to 13 consecutive matches between the query and target sequence.
A dataframe containing the converted .uc file. The fields contained within are as follows:
Record type - "H, C or N", see details for further information.
Cluster designation
(output = "cluster"
only)
Sequence length, or cluster size
Percent identity to target
The nucleotide strand
(output = "cluster"
only)
A compressed alignment - see details for further information.
ID of query sequence
ID of target sequence ("H" records only)
Jack Gisby
VSEARCH may be downloaded from https://github.com/torognes/vsearch. See https://www.ncbi.nlm.nih.gov/pubmed/27781170 for further information.
codetirClust, codepackAlign, codereadBlast, codepackClust
readUc(system.file( "extdata", "packMatches.uc", package = "packFinder" ))
readUc(system.file( "extdata", "packMatches.uc", package = "packFinder" ))
Takes transposable elements clustered by VSEARCH,
packClust
, and produces consensus sequences
for the terminal inverted repeats of each. Allows for the
visualisation of TIR similarities between clusters for both
forward and reverse strands.
tirClust( packMatches, Genome, tirLength = 25, plot = TRUE, plotSavePath = NULL, k = 5, output = "consensus" )
tirClust( packMatches, Genome, tirLength = 25, plot = TRUE, plotSavePath = NULL, k = 5, output = "consensus" )
packMatches |
A dataframe containing genomic ranges and names referring
to sequences to be extracted. This dataframe is in the format
produced by coercing a
Must contain the following features:
|
Genome |
A DNAStringSet object containing sequences referred to in
|
tirLength |
The TIR size to be considered. Consensus sequences will
be generated based on the first and last |
plot |
Argument specifying whether the TIR consensus sequences should be plottted as a dendrogram. |
plotSavePath |
File path for the dendrogram plot. If unspecified, the dendrogram plot is not saved. |
k |
The k-mer size to be used for calculating a distance
matrix between TIR consensus sequences. See
|
output |
Controls the output of |
If output
is specified as "consensus" (default),
returns a list of consensus sequences for each cluster
specified in packMatches
as a
DNAStringSet
.
Else if output
is specified as "dendrogram",
returns a dendrogram object used to create hierarchical
clustering diagrams.
Jack Gisby
codepackClust, codepackAlign,
kdistance
,
DNAStringSet
,
as.alignment
, packSearch
data(arabidopsisThalianaRefseq) data(packMatches) tirClust(packMatches, arabidopsisThalianaRefseq)
data(arabidopsisThalianaRefseq) data(packMatches) tirClust(packMatches, arabidopsisThalianaRefseq)