The txdbmaker package provides functions to make TxDb objects from
genomic annotation provided by the UCSC Genome Browser
(https://genome.ucsc.edu/), Ensembl (https://ensembl.org/),
BioMart (http://www.biomart.org/), or directly from a GFF or GTF file.
In this document we will quickly demonstrate the use of these functions.
Note that the package also provides a lower-level utility, makeTxDb(),
for creating TxDb objects from data directly supplied by the user.
Please refer to its man page (?makeTxDb) for more information.
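As a rough illustration of what "data directly supplied by the user" can look like, the sketch below builds a toy TxDb from two hand-written data frames. The coordinates and the names tx_df and splicing_df are made up for this example; only the column names follow the layout described in ?makeTxDb.

```r
library(txdbmaker)

# Hypothetical toy annotation: two transcripts on chr1.
tx_df <- data.frame(
    tx_id     = 1:2,
    tx_chrom  = "chr1",
    tx_strand = c("-", "+"),
    tx_start  = c(1L, 2001L),
    tx_end    = c(999L, 2199L)
)

# One row per exon; exon_rank gives the exon order within a transcript.
splicing_df <- data.frame(
    tx_id      = c(1L, 2L, 2L, 2L),
    exon_rank  = c(1L, 1L, 2L, 3L),
    exon_start = c(1L, 2001L, 2101L, 2151L),
    exon_end   = c(999L, 2085L, 2144L, 2199L)
)

# A warning about missing genome version information is expected here,
# because no chrominfo or metadata is supplied.
txdb <- makeTxDb(tx_df, splicing_df)
txdb
```

Real annotations would normally also carry gene assignments and chromosome information, which this sketch leaves out.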
See the vignette in the GenomicFeatures package for an introduction to
TxDb objects.
txdbmaker package
Install the package with:
    if (!require("BiocManager", quietly=TRUE))
        install.packages("BiocManager")
    BiocManager::install("txdbmaker")
Then load it with:
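The loading step is presumably the usual library() call, shown here as a minimal sketch that assumes the installation above succeeded:

```r
# Attach txdbmaker (this also attaches its dependencies,
# such as GenomicFeatures).
library(txdbmaker)
```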
makeTxDbFromUCSC
The function makeTxDbFromUCSC downloads UCSC Genome Bioinformatics
transcript tables (e.g. knownGene, refGene, ensGene) for a genome build
(e.g. mm9, hg19). Use the supportedUCSCtables utility function to get
the list of tables known to work with makeTxDbFromUCSC.
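A listing like the one that follows can be obtained with a call along these lines. The genome "mm9" is taken from the output shown in this section; the variable name mm9_tables is only illustrative, and the call needs network access to UCSC.

```r
library(txdbmaker)

# List the UCSC tables that makeTxDbFromUCSC() is known to handle
# for the mm9 genome build.
mm9_tables <- supportedUCSCtables(genome="mm9")
mm9_tables
```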
##         tablename              track composite_track
## 1         acembly      AceView Genes            <NA>
## 2    augustusGene           AUGUSTUS            <NA>
## 3        ccdsGene               CCDS            <NA>
## 4         ensGene      Ensembl Genes            <NA>
## 5        exoniphy           Exoniphy            <NA>
## 6          geneid       Geneid Genes            <NA>
## 7         genscan      Genscan Genes            <NA>
## 8       knownGene         UCSC Genes            <NA>
## 9   knownGeneOld4     Old UCSC Genes            <NA>
## 10      nscanGene             N-SCAN            <NA>
## 11   pseudoYale60      Yale Pseudo60            <NA>
## 12        refGene       RefSeq Genes            <NA>
## 13        sgpGene          SGP Genes            <NA>
## 14  transcriptome      Transcriptome            <NA>
## 15 vegaPseudoGene   Vega Pseudogenes      Vega Genes
## 16       vegaGene Vega Protein Genes      Vega Genes
## 17    xenoRefGene       Other RefSeq            <NA>
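The download log and object summary that follow correspond to building a TxDb from the mm9 knownGene table. A sketch of such a call is given here; the genome and table name come from the summary below, while the variable name mm9KG_txdb is an assumption. The call requires network access and may take a while.

```r
library(txdbmaker)

# Download the knownGene table for mm9 from UCSC and assemble a TxDb.
mm9KG_txdb <- makeTxDbFromUCSC(genome="mm9", tablename="knownGene")

# Printing the object shows its metadata summary.
mm9KG_txdb
```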
## Download the knownGene table ... OK
## Download the knownToLocusLink table ... OK
## Extract the 'transcripts' data frame ... OK
## Extract the 'splicings' data frame ... OK
## Download and preprocess the 'chrominfo' data frame ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ...
## Warning in .makeTxDb_normarg_chrominfo(chrominfo): genome version information
## is not available for this TxDb object
## OK
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: mm9
## # Organism: Mus musculus
## # Taxonomy ID: 10090
## # UCSC Table: knownGene
## # UCSC Track: UCSC Genes
## # Resource URL: https://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: NA
## # Nb of transcripts: 55419
## # Db created by: txdbmaker package from Bioconductor
## # Creation time: 2024-11-22 03:36:24 +0000 (Fri, 22 Nov 2024)
## # txdbmaker version at creation time: 1.3.1
## # RSQLite version at creation time: 2.3.8
## # DBSCHEMAVERSION: 1.2
See ?makeTxDbFromUCSC for more information.
makeTxDbFromBiomart
Retrieve data from BioMart by specifying the mart and the data set to
the makeTxDbFromBiomart function (not all BioMart data sets are
currently supported).
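For instance, a call might look like the sketch below. The mouse data set "mmusculus_gene_ensembl" and the variable name are choices made for this example, not something the text prescribes; the download needs network access and can take several minutes.

```r
library(txdbmaker)

# Build a TxDb from the Ensembl genes mart, using the mouse data set
# (an illustrative choice).
mmusculus_txdb <- makeTxDbFromBiomart(
    biomart = "ENSEMBL_MART_ENSEMBL",
    dataset = "mmusculus_gene_ensembl"
)
mmusculus_txdb
```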
Like makeTxDbFromUCSC, the makeTxDbFromBiomart function has a circ_seqs
argument that defaults to the contents of the DEFAULT_CIRC_SEQS vector.
And, as for the UCSC sources, there is a helper function called
getChromInfoFromBiomart that shows what the different chromosomes are
called for a given source.
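A quick way to inspect those chromosome names is sketched here. The mouse data set is again only an example choice, and the call needs network access.

```r
library(txdbmaker)

# Retrieve the chromosome names (and lengths) that BioMart reports
# for the chosen data set, then look at the first few rows.
chrominfo <- getChromInfoFromBiomart(dataset="mmusculus_gene_ensembl")
head(chrominfo)
```

Checking these names first can help when deciding which sequences to list in circ_seqs.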
Using the makeTxDbFromBiomart and makeTxDbFromUCSC functions can take a
while and may also require some bandwidth, as these functions have to
download and then assemble a database from their respective sources.
Most users will not want to repeat this step every time. Instead, we
suggest that you save your annotation objects and label them with an
appropriate time stamp so as to facilitate reproducible research.
See ?makeTxDbFromBiomart for more information.
makeTxDbFromEnsembl
The makeTxDbFromEnsembl function creates a TxDb object for a given
organism by importing the genomic locations of its transcripts, exons,
CDS, and genes from an Ensembl database.
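A minimal sketch of such an import is shown here. The organism is an arbitrary example, picked because its annotation is small; the call connects to a public Ensembl database server and therefore needs network access.

```r
library(txdbmaker)

# Import transcript, exon, CDS, and gene locations for one organism
# from the current Ensembl release (organism chosen for illustration).
yeast_txdb <- makeTxDbFromEnsembl(organism="Saccharomyces cerevisiae")
yeast_txdb
```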
See ?makeTxDbFromEnsembl for more information.
makeTxDbFromGFF
You can also extract transcript information from either GFF3 or GTF
files by using the makeTxDbFromGFF function. Usage is similar to
makeTxDbFromBiomart and makeTxDbFromUCSC.
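The sketch below reads a small GFF3 file. It assumes that the txdbmaker installation ships an example file under extdata/GFF3_files/a.gff3; if that assumption does not hold, substitute the path to any GFF3 or GTF file of your own.

```r
library(txdbmaker)

# Assumption: an example GFF3 file is bundled with the package.
gff_file <- system.file("extdata", "GFF3_files", "a.gff3",
                        package="txdbmaker")
# system.file() returns "" when the file is not found.
stopifnot(nzchar(gff_file))

# Parse the file and assemble a TxDb from its features.
gff_txdb <- makeTxDbFromGFF(gff_file, format="gff3")
gff_txdb
```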
See ?makeTxDbFromGFF for more information.
TxDb Object
Once a TxDb object has been created, it can be saved to avoid the time
and bandwidth costs of recreating it and to make it possible to
reproduce results with identical genomic feature data at a later date.
Since TxDb objects are backed by a SQLite database, the save format is a
SQLite database file (which could be accessed from programs other than R
if desired). Note that it is not possible to serialize a TxDb object
using R’s save function.
A saved TxDb object can later be initialized from its .sqlite file by
simply using loadDb.
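The round trip can be sketched as follows, using saveDb and loadDb from the AnnotationDbi package. The one-transcript TxDb and the date-stamped file name are made up for this example; in practice the object would come from one of the functions described earlier.

```r
library(txdbmaker)

# Hypothetical one-transcript TxDb, just to have something to save.
tx_df <- data.frame(tx_id=1L, tx_chrom="chr1", tx_strand="+",
                    tx_start=1L, tx_end=100L)
splicing_df <- data.frame(tx_id=1L, exon_rank=1L,
                          exon_start=1L, exon_end=100L)
txdb <- makeTxDb(tx_df, splicing_df)

# Put the creation date in the file name to aid reproducibility.
db_file <- tempfile(
    pattern = paste0("toy_txdb_", format(Sys.Date(), "%Y%m%d"), "_"),
    fileext = ".sqlite"
)

# Write the underlying SQLite database to disk, then load it back.
saveDb(txdb, file=db_file)
txdb_reloaded <- loadDb(db_file)
txdb_reloaded
```

A temporary file is used here only to keep the sketch free of side effects; a real workflow would save to a permanent location.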
makeTxDbPackageFromUCSC and makeTxDbPackageFromBiomart
It is often much more convenient to just make an annotation package
out of your annotations. If you find that this is the case, then you
should consider the convenience functions makeTxDbPackageFromUCSC and
makeTxDbPackageFromBiomart. These functions are similar to
makeTxDbFromUCSC and makeTxDbFromBiomart, except that they take the
extra step of wrapping the database up into an annotation package for
you. This package can then be installed and used like one of the
standard TxDb packages found in the Bioconductor repository.
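A sketch of the UCSC variant is given below. The version string, the maintainer and author details, the genome, the table, and the destination directory are all placeholder choices for illustration; the call needs network access.

```r
library(txdbmaker)

# Write the generated package into a scratch directory.
dest_dir <- file.path(tempdir(), "txdb_pkg")
dir.create(dest_dir, showWarnings=FALSE)

# Placeholder metadata; replace with real maintainer details.
makeTxDbPackageFromUCSC(
    version    = "0.01",
    maintainer = "Some One <so@someplace.org>",
    author     = "Some One <so@someplace.org>",
    destDir    = dest_dir,
    genome     = "sacCer2",
    tablename  = "ensGene"
)

# The new package source directory should now be listed here.
list.files(dest_dir)
```

The generated directory is an ordinary source package, so it can be installed with install.packages() using repos=NULL and type="source".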
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] txdbmaker_1.3.1 GenomicFeatures_1.59.1 AnnotationDbi_1.69.0
## [4] Biobase_2.67.0 GenomicRanges_1.59.1 GenomeInfoDb_1.43.1
## [7] IRanges_2.41.1 S4Vectors_0.45.2 BiocGenerics_0.53.3
## [10] generics_0.1.3 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4
## [3] blob_1.2.4 filelock_1.0.3
## [5] Biostrings_2.75.1 bitops_1.0-9
## [7] fastmap_1.2.0 RCurl_1.98-1.16
## [9] BiocFileCache_2.15.0 GenomicAlignments_1.43.0
## [11] XML_3.99-0.17 digest_0.6.37
## [13] timechange_0.3.0 lifecycle_1.0.4
## [15] KEGGREST_1.47.0 RSQLite_2.3.8
## [17] magrittr_2.0.3 compiler_4.4.2
## [19] rlang_1.1.4 sass_0.4.9
## [21] progress_1.2.3 tools_4.4.2
## [23] utf8_1.2.4 yaml_2.3.10
## [25] rtracklayer_1.67.0 knitr_1.49
## [27] prettyunits_1.2.0 S4Arrays_1.7.1
## [29] bit_4.5.0 curl_6.0.1
## [31] DelayedArray_0.33.2 xml2_1.3.6
## [33] abind_1.4-8 BiocParallel_1.41.0
## [35] sys_3.4.3 grid_4.4.2
## [37] fansi_1.0.6 biomaRt_2.63.0
## [39] SummarizedExperiment_1.37.0 cli_3.6.3
## [41] rmarkdown_2.29 crayon_1.5.3
## [43] httr_1.4.7 rjson_0.2.23
## [45] DBI_1.2.3 cachem_1.1.0
## [47] stringr_1.5.1 zlibbioc_1.52.0
## [49] parallel_4.4.2 BiocManager_1.30.25
## [51] XVector_0.47.0 restfulr_0.0.15
## [53] matrixStats_1.4.1 vctrs_0.6.5
## [55] Matrix_1.7-1 jsonlite_1.8.9
## [57] hms_1.1.3 bit64_4.5.2
## [59] maketools_1.3.1 jquerylib_0.1.4
## [61] glue_1.8.0 codetools_0.2-20
## [63] lubridate_1.9.3 stringi_1.8.4
## [65] BiocIO_1.17.1 UCSC.utils_1.3.0
## [67] tibble_3.2.1 pillar_1.9.0
## [69] rappdirs_0.3.3 htmltools_0.5.8.1
## [71] GenomeInfoDbData_1.2.13 R6_2.5.1
## [73] dbplyr_2.5.0 httr2_1.0.6
## [75] evaluate_1.0.1 lattice_0.22-6
## [77] RMariaDB_1.3.3 png_0.1-8
## [79] Rsamtools_2.23.0 memoise_2.0.1
## [81] bslib_0.8.0 SparseArray_1.7.2
## [83] xfun_0.49 MatrixGenerics_1.19.0
## [85] buildtools_1.0.0 pkgconfig_2.0.3