GRanges
tRNAscan-SE (Lowe and Eddy 1997) can be
used for prediction of tRNA genes in whole genomes based on sequence
context and calculated structural features. Many tRNA annotations in
genomes contain or are based on information generated by tRNAscan-SE,
for example the current SGD reference genome sacCer3 for Saccharomyces
cerevisiae. However, not all available information from tRNAscan-SE end
up in the genome annotation. Among these are for example structural
information, additional scores and the information, whether the
conserved CCA-end is encoded in the genomic DNA. To work with this
complete set of information, the tRNAscan-SE output can be parsed into a
more accessible GRanges object using tRNAscanImport
.
The default tRNAscan-SE output, either from running tRNAscan-SE (Lowe and Eddy 1997) locally or retrieving the output from the gtRNADb (Chan and Lowe 2016), consist of a formatted text document containing individual text blocks per tRNA delimited by an empty line.
library(tRNAscanImport)
yeast_file <- system.file("extdata",
file = "yeast.tRNAscan",
package = "tRNAscanImport")
# output for sacCer3
# Before
readLines(con = yeast_file, n = 7L)
## [1] "chrI.trna1 (139152-139254)\tLength: 103 bp"
## [2] "Type: Pro\tAnticodon: TGG at 33-35 (139184-139186)\tScore: 62.1"
## [3] "Possible intron: 37-67 (139188-139218)"
## [4] "HMM Sc=37.90\tSec struct Sc=24.20"
## [5] " * | * | * | * | * | * | * | * | * | * | "
## [6] "Seq: GGGCGTGTGGTCTAGTGGTATGATTCTCGCTTTGGGcgacttcctgattaaacaggaagacaaagcaTGCGAGAGGcCCTGGGTTCAATTCCCAGCTCGCCCC"
## [7] "Str: >>>>>.>..>>>.........<<<.>>>>>......................................<<<<<.....>>>>>.......<<<<<<.<<<<<."
GRanges
To access the information in a BioC context the import as a GRanges
object comes to mind. import.tRNAscanAsGRanges()
performs
this task by evaluating each text block using regular expressions.
## GRanges object with 2 ranges and 18 metadata columns:
## seqnames ranges strand | no tRNA_length tRNA_type
## <Rle> <IRanges> <Rle> | <integer> <integer> <character>
## [1] chrI 139152-139254 + | 1 72 Pro
## [2] chrI 166267-166339 + | 2 73 Ala
## tRNA_anticodon tRNA_anticodon.start tRNA_anticodon.end tRNAscan_score
## <character> <integer> <integer> <numeric>
## [1] TGG 33 35 62.1
## [2] TGC 34 36 76.0
## tRNA_seq tRNA_str tRNA_CCA.end
## <DNAStringSet> <DotBracketStringSet> <logical>
## [1] GGGCGTGTGG...AGCTCGCCCC <<<<<.<..<...>>>.>>>>>. FALSE
## [2] GGGCACATGG...GTTGCGTCCA <<<<.<<..<...>>>>.>>>>. FALSE
## tRNAscan_potential.pseudogene tRNAscan_intron.start tRNAscan_intron.end
## <logical> <integer> <integer>
## [1] FALSE 139188 139218
## [2] FALSE <NA> <NA>
## tRNAscan_intron.locstart tRNAscan_intron.locend tRNAscan_hmm.score
## <integer> <integer> <numeric>
## [1] 37 67 37.9
## [2] <NA> <NA> 53.4
## tRNAscan_sec.str.score tRNAscan_infernal
## <numeric> <numeric>
## [1] 24.2 NA
## [2] 22.6 NA
## -------
## seqinfo: 17 sequences from an unspecified genome; no seqlengths
## [1] TRUE
The result can be used directly in R or saved as gff3/fasta file for further use, including processing the sequences for HTS read mapping or statistical analysis on tRNA content of the analyzed genome.
The tRNAscan-SE information can be visualized using the
gettRNAFeaturePlots()
function of the tRNA
package, returning a named list of ggplot2 plots, which can be plotted
or further modified. Alternatively, gettRNASummary()
returns the aggregated information for further use.
# tRNAscan-SE output for hg38
human_file <- system.file("extdata",
file = "human.tRNAscan",
package = "tRNAscanImport")
# tRNAscan-SE output for E. coli MG1655
eco_file <- system.file("extdata",
file = "ecoli.tRNAscan",
package = "tRNAscanImport")
# import tRNAscan-SE files
gr_human <- import.tRNAscanAsGRanges(human_file)
gr_eco <- import.tRNAscanAsGRanges(eco_file)
# get summary plots
grl <- GRangesList(Sce = gr,
Hsa = gr_human,
Eco = gr_eco)
plots <- gettRNAFeaturePlots(grl)
Since tRNAscan reports the genomic location for tRNAs found,
approximate tRNA precursor sequences can be retrieved by combining a
tRNAscan input object with matching genomic sequences for the function
get.tRNAprecursor
.
library(BSgenome.Scerevisiae.UCSC.sacCer3)
genome <- getSeq(BSgenome.Scerevisiae.UCSC.sacCer3)
# renaming chromosome to match tRNAscan output
names(genome) <- c(names(genome)[-17L],"chrmt")
tRNAprecursor <- get.tRNAprecursor(gr, genome)
head(tRNAprecursor)
## DNAStringSet object of length 6:
## width seq names
## [1] 203 CAATTTGTATATATATACATCTA...AATTAAAGTAGCAGTACTTCAAC pre_chrI.tRNA1
## [2] 173 AGCTTCTAAGCACTTACCATTCC...AATTCGTGAATAGCTGACTGTCA pre_chrI.tRNA2
## [3] 214 GTCAGTGTCCAAATAGTTAAAAC...CATAATCTACGTAGGAATGAAAG pre_chrI.tRNA3
## [4] 182 GTCATACTGACATATCTCATTTT...CCTCTTCAAAGCATACTCATCTT pre_chrI.tRNA4
## [5] 184 GGGTAAAATAGGGTATTTAACTG...ATTAACTAGAATAATAGGGAAAT pre_chrII.tRNA1
## [6] 191 TTTGCTAATAATAAATCTATTTC...TTCATTTCTAGGCCTGTTTCTCC pre_chrII.tRNA2
The length of the overhangs can be defined with the arguments
add.5prime
and add.3prime
, respectively. Both
support individual lengths for each tRNA and require values to be
integer only. In addition, introns can be removed by setting
trim.introns = TRUE
.
Further examples of working with tRNA information can be found in the
vignette of the tRNA
package.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BSgenome.Scerevisiae.UCSC.sacCer3_1.4.0
## [2] BSgenome_1.75.0
## [3] BiocIO_1.17.1
## [4] rtracklayer_1.67.0
## [5] tRNAscanImport_1.27.0
## [6] tRNA_1.25.0
## [7] Structstrings_1.23.1
## [8] Biostrings_2.75.3
## [9] XVector_0.47.1
## [10] GenomicRanges_1.59.1
## [11] GenomeInfoDb_1.43.2
## [12] IRanges_2.41.2
## [13] S4Vectors_0.45.2
## [14] BiocGenerics_0.53.3
## [15] generics_0.1.3
## [16] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.37.0 gtable_0.3.6
## [3] rjson_0.2.23 xfun_0.49
## [5] bslib_0.8.0 ggplot2_3.5.1
## [7] lattice_0.22-6 Biobase_2.67.0
## [9] vctrs_0.6.5 tools_4.4.2
## [11] bitops_1.0-9 curl_6.0.1
## [13] parallel_4.4.2 tibble_3.2.1
## [15] pkgconfig_2.0.3 Matrix_1.7-1
## [17] RColorBrewer_1.1-3 lifecycle_1.0.4
## [19] GenomeInfoDbData_1.2.13 farver_2.1.2
## [21] compiler_4.4.2 stringr_1.5.1
## [23] Rsamtools_2.23.1 munsell_0.5.1
## [25] Modstrings_1.23.0 codetools_0.2-20
## [27] htmltools_0.5.8.1 sys_3.4.3
## [29] buildtools_1.0.0 sass_0.4.9
## [31] RCurl_1.98-1.16 yaml_2.3.10
## [33] pillar_1.10.0 crayon_1.5.3
## [35] jquerylib_0.1.4 BiocParallel_1.41.0
## [37] DelayedArray_0.33.3 cachem_1.1.0
## [39] abind_1.4-8 digest_0.6.37
## [41] stringi_1.8.4 restfulr_0.0.15
## [43] labeling_0.4.3 maketools_1.3.1
## [45] fastmap_1.2.0 grid_4.4.2
## [47] SparseArray_1.7.2 colorspace_2.1-1
## [49] cli_3.6.3 magrittr_2.0.3
## [51] S4Arrays_1.7.1 XML_3.99-0.17
## [53] withr_3.0.2 scales_1.3.0
## [55] UCSC.utils_1.3.0 rmarkdown_2.29
## [57] httr_1.4.7 matrixStats_1.4.1
## [59] evaluate_1.0.1 knitr_1.49
## [61] rlang_1.1.4 glue_1.8.0
## [63] BiocManager_1.30.25 jsonlite_1.8.9
## [65] R6_2.5.1 MatrixGenerics_1.19.0
## [67] GenomicAlignments_1.43.0 zlibbioc_1.52.0