BSgenome data packages are one of the many types of annotation packages available in Bioconductor. They contain the genomic sequences (chromosome sequences and other DNA sequences) of a particular genome assembly for a given organism. For example, BSgenome.Hsapiens.UCSC.hg19 is a BSgenome data package that contains the genomic sequences of the hg19 genome from UCSC. Users can easily and efficiently access the sequences, or portions of the sequences, stored in these packages via a common API implemented in the BSgenome software package.
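As an illustration of the common API mentioned above, the following is a minimal sketch of how sequences might be accessed from a BSgenome data package. It assumes that BSgenome.Hsapiens.UCSC.hg19 is already installed (the code is skipped otherwise), and the chromosome name and coordinates are arbitrary choices made for this example.

```r
# Minimal sketch: accessing sequences through the BSgenome API.
# Assumption: BSgenome.Hsapiens.UCSC.hg19 is installed; skipped otherwise.
pkg <- "BSgenome.Hsapiens.UCSC.hg19"

if (requireNamespace(pkg, quietly = TRUE)) {
    library(BSgenome.Hsapiens.UCSC.hg19)
    genome <- BSgenome.Hsapiens.UCSC.hg19

    # Names, lengths and circularity flags of all sequences in the assembly
    print(seqinfo(genome))

    # A whole chromosome, loaded on demand as a DNAString object
    chr1 <- genome[["chr1"]]
    print(length(chr1))

    # A portion of a sequence (arbitrary 50-bp window chosen for illustration)
    print(getSeq(genome, "chr1", start = 1000001, end = 1000050))
}
```

The same calls should work on any BSgenome object, including one loaded from a package forged with the tools described below.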
Bioconductor currently provides more than 100 BSgenome data packages, for more than 30 organisms. Most of them contain the genomic sequences of UCSC genomes (i.e. genomes supported by the UCSC Genome Browser) or NCBI assemblies. The packages are used in various Bioconductor workflows, as well as in man page examples and vignettes of other Bioconductor packages, typically in conjunction with tools available in the BSgenome and Biostrings software packages. New BSgenome data packages get added on a regular basis, based on user demand.
The BSgenomeForge package provides tools that allow users to make their own BSgenome data package. The two primary tools in the package are the forgeBSgenomeDataPkgFromNCBI and forgeBSgenomeDataPkgFromUCSC functions. These functions allow the user to forge a BSgenome data package for a given NCBI assembly or UCSC genome.
For other genome assemblies, please consult the Advanced BSgenomeForge usage vignettes also provided in this package.
forgeBSgenomeDataPkgFromNCBI()
Example 1: Information about assembly ASM972954v1 can be found at https://www.ncbi.nlm.nih.gov/assembly/GCF_009729545.1/, including the assembly accession (GCA_009729545.1) and the organism name (Acidianus infernus). Assembly ASM972954v1 does not contain any circular sequences, so none need to be specified:
forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCA_009729545.1",
                             pkg_maintainer="Jane Doe <[email protected]>",
                             organism="Acidianus infernus")
## Creating package in ./BSgenome.Ainfernus.NCBI.ASM972954v1
Example 2: Information about assembly ASM836960v1 can be found at https://www.ncbi.nlm.nih.gov/assembly/GCA_008369605.1/, including the assembly accession (GCA_008369605.1) and the organism name (Vibrio cholerae). Assembly ASM836960v1 contains three circular sequences: “1”, “2” and “unnamed”. See CP043554.1, CP043556.1, and CP043555.1 in the NCBI Nucleotide database at https://www.ncbi.nlm.nih.gov/nuccore/. They must be specified as shown in the example below:
forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCA_008369605.1",
                             pkg_maintainer="Jane Doe <[email protected]>",
                             organism="Vibrio cholerae",
                             circ_seqs=c("1", "2", "unnamed"))
## Creating package in ./BSgenome.Vcholerae.NCBI.ASM836960v1
Check ?forgeBSgenomeDataPkgFromNCBI for more information.
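Before forging, it can help to look at the sequence names of an assembly, since these are the names expected in circ_seqs. The sketch below is one possible way to do this, using getChromInfoFromNCBI() from the GenomeInfoDb package. It assumes that GenomeInfoDb is installed and that NCBI is reachable over the network; if either assumption fails, the code does nothing. Note that this only lists the sequences: which of them are circular should still be confirmed from the NCBI records, as described in Example 2.

```r
# Minimal sketch: listing the sequences of an NCBI assembly before forging.
# Assumptions: GenomeInfoDb is installed and NCBI is reachable (network access).
chrominfo <- NULL

if (requireNamespace("GenomeInfoDb", quietly = TRUE)) {
    chrominfo <- tryCatch(
        GenomeInfoDb::getChromInfoFromNCBI("GCA_008369605.1"),
        error = function(e) NULL  # e.g. no network connection
    )
}

if (!is.null(chrominfo)) {
    # The SequenceName column holds the names to use in 'circ_seqs'
    print(chrominfo)
}
```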
forgeBSgenomeDataPkgFromUCSC()
Example 3: Information about genome wuhCor1 can be found at https://genome.ucsc.edu/cgi-bin/hgGateway. This genome belongs to the organism Severe acute respiratory syndrome coronavirus 2. Genome wuhCor1 does not contain any circular sequences, so none need to be specified:
forgeBSgenomeDataPkgFromUCSC(
    genome="wuhCor1",
    organism="Severe acute respiratory syndrome coronavirus 2",
    pkg_maintainer="Jane Doe <[email protected]>"
)
## Creating package in ./BSgenome.Scoronavirus2.UCSC.wuhCor1
Check ?forgeBSgenomeDataPkgFromUCSC for more information.
forgeBSgenomeDataPkgFromNCBI or forgeBSgenomeDataPkgFromUCSC returns the path to the created package at the end of its execution. Use this path to locate the package, then run the following commands to build the package source tarball from the command line (i.e. in a Linux/Unix terminal or Windows PowerShell terminal):
R CMD build <pkgdir>
where <pkgdir> is the path to the source tree of the package. Then check the package with
R CMD check <tarball>
where <tarball> is the path to the tarball produced by R CMD build. Finally install the package with
R CMD INSTALL <tarball>
These operations can also be carried out from within R, using the devtools package.
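One way the three steps could be chained from within R is sketched below. It assumes that devtools is installed and that the package from Example 1 was forged in the current directory. The helper name build_check_install is hypothetical and introduced only for this illustration.

```r
# Minimal sketch: build, check and install a forged package with devtools.
# 'build_check_install' is a hypothetical helper, not part of BSgenomeForge.
build_check_install <- function(pkgdir) {
    stopifnot(dir.exists(pkgdir))
    tarball <- devtools::build(pkgdir)   # counterpart of R CMD build; returns the tarball path
    devtools::check_built(tarball)       # counterpart of R CMD check
    devtools::install_local(tarball)     # counterpart of R CMD INSTALL
    invisible(tarball)
}

# Assumption: Example 1 was run in the current directory; skipped otherwise.
pkgdir <- "./BSgenome.Ainfernus.NCBI.ASM972954v1"
if (dir.exists(pkgdir) && requireNamespace("devtools", quietly = TRUE)) {
    build_check_install(pkgdir)
}
```

Passing the path returned by the forge function as pkgdir avoids having to type the package directory by hand.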
## ── R CMD build ─────────────────────────────────────────────────────────────────
## * checking for file ‘/tmp/RtmpehaFUd/Rbuild1c1868a9a423/BSgenomeForge/vignettes/BSgenome.Ainfernus.NCBI.ASM972954v1/DESCRIPTION’ ... OK
## * preparing ‘BSgenome.Ainfernus.NCBI.ASM972954v1’:
## * checking DESCRIPTION meta-information ... OK
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * building ‘BSgenome.Ainfernus.NCBI.ASM972954v1_1.0.0.tar.gz’
## [1] "/tmp/RtmpehaFUd/Rbuild1c1868a9a423/BSgenomeForge/vignettes/BSgenome.Ainfernus.NCBI.ASM972954v1_1.0.0.tar.gz"
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BSgenomeForge_1.7.0 BSgenome_1.73.1 rtracklayer_1.65.0
## [4] BiocIO_1.15.2 GenomicRanges_1.57.2 Biostrings_2.73.2
## [7] XVector_0.45.0 GenomeInfoDb_1.41.2 IRanges_2.39.2
## [10] S4Vectors_0.43.2 BiocGenerics_0.51.3 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-9 fastmap_1.2.0
## [3] RCurl_1.98-1.16 GenomicAlignments_1.41.0
## [5] promises_1.3.0 XML_3.99-0.17
## [7] digest_0.6.37 mime_0.12
## [9] lifecycle_1.0.4 ellipsis_0.3.2
## [11] processx_3.8.4 magrittr_2.0.3
## [13] compiler_4.4.1 rlang_1.1.4
## [15] sass_0.4.9 tools_4.4.1
## [17] yaml_2.3.10 knitr_1.48
## [19] S4Arrays_1.5.11 htmlwidgets_1.6.4
## [21] pkgbuild_1.4.5 curl_5.2.3
## [23] DelayedArray_0.31.14 pkgload_1.4.0
## [25] abind_1.4-8 BiocParallel_1.39.0
## [27] miniUI_0.1.1.1 purrr_1.0.2
## [29] sys_3.4.3 desc_1.4.3
## [31] grid_4.4.1 urlchecker_1.0.1
## [33] profvis_0.4.0 xtable_1.8-4
## [35] SummarizedExperiment_1.35.5 cli_3.6.3
## [37] rmarkdown_2.28 crayon_1.5.3
## [39] remotes_2.5.0 rstudioapi_0.17.1
## [41] httr_1.4.7 rjson_0.2.23
## [43] sessioninfo_1.2.2 cachem_1.1.0
## [45] zlibbioc_1.51.2 parallel_4.4.1
## [47] BiocManager_1.30.25 restfulr_0.0.15
## [49] matrixStats_1.4.1 vctrs_0.6.5
## [51] devtools_2.4.5 Matrix_1.7-1
## [53] jsonlite_1.8.9 callr_3.7.6
## [55] maketools_1.3.1 jquerylib_0.1.4
## [57] glue_1.8.0 ps_1.8.1
## [59] codetools_0.2-20 later_1.3.2
## [61] UCSC.utils_1.1.0 htmltools_0.5.8.1
## [63] GenomeInfoDbData_1.2.13 R6_2.5.1
## [65] evaluate_1.0.1 shiny_1.9.1
## [67] lattice_0.22-6 Biobase_2.65.1
## [69] Rsamtools_2.21.2 memoise_2.0.1
## [71] httpuv_1.6.15 bslib_0.8.0
## [73] Rcpp_1.0.13 SparseArray_1.5.45
## [75] xfun_0.48 fs_1.6.4
## [77] MatrixGenerics_1.17.1 buildtools_1.0.0
## [79] usethis_3.0.0