Package 'BSgenomeForge' reference manual

Title:	Forge your own BSgenome data package
Description:	A set of tools to forge BSgenome data packages. Supersedes the old seed-based tools from the BSgenome software package. This package allows the user to create a BSgenome data package in one function call, simplifying the old seed-based process.
Authors:	Hervé Pagès [aut, cre], Atuhurira Kirabo Kakopo [aut], Emmanuel Chigozie Elendu [ctb], Prisca Chidimma Maduka [ctb]
Maintainer:	Hervé Pagès <[email protected]>
License:	Artistic-2.0
Version:	1.7.0
Built:	2025-03-02 06:07:55 UTC
Source:	https://github.com/bioc/BSgenomeForge

The BSgenomeForge package

Description

A package that simplifies the process of forging a BSgenome data package, by allowing the user to use one function to create the package.

Details

BSgenomeForge provides two major functions, the forgeBSgenomeDataPkgFromNCBI function and forgeBSgenomeDataPkgFromUCSC function which allow one to forge a BSgenome data package from a NCBI assembly or UCSC genome respectively.

For an overview of the functionality provided by the package, please see the vignette: vignette("QuickBSgenomeForge", package="BSgenomeForge")

Author(s)

Atuhurira Kirabo Kakopo, Hervé Pagès

Maintainer: Hervé Pagès

Examples

## ---------------------------------------------------------------------
## EXAMPLE 1
## ---------------------------------------------------------------------

## Create a BSgenome data package for NCBI assembly GCF_000857545.1
## (organism Torque teno virus 1):
forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCF_000857545.1",
                             pkg_maintainer="Jane Doe <[email protected]>",
                             organism="Torque teno virus 1",
                             circ_seqs="NC_002076.2",
                             destdir=tempdir())

## ---------------------------------------------------------------------
## EXAMPLE 2
## ---------------------------------------------------------------------

## Create a BSgenome data package for UCSC genome wuhCor1 (SARS-CoV-2
## assembly, see https://genome.ucsc.edu/cgi-bin/hgGateway?db=wuhCor1):
forgeBSgenomeDataPkgFromUCSC(
    genome="wuhCor1",
    organism="Severe acute respiratory syndrome coronavirus 2",
    pkg_maintainer="Jane Doe <[email protected]>",
    destdir=tempdir()
)
## ---------------------------------------------------------------------
## EXAMPLE 1
## ---------------------------------------------------------------------

## Create a BSgenome data package for NCBI assembly GCF_000857545.1
## (organism Torque teno virus 1):
forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCF_000857545.1",
                             pkg_maintainer="Jane Doe <[email protected]>",
                             organism="Torque teno virus 1",
                             circ_seqs="NC_002076.2",
                             destdir=tempdir())

## ---------------------------------------------------------------------
## EXAMPLE 2
## ---------------------------------------------------------------------

## Create a BSgenome data package for UCSC genome wuhCor1 (SARS-CoV-2
## assembly, see https://genome.ucsc.edu/cgi-bin/hgGateway?db=wuhCor1):
forgeBSgenomeDataPkgFromUCSC(
    genome="wuhCor1",
    organism="Severe acute respiratory syndrome coronavirus 2",
    pkg_maintainer="Jane Doe <[email protected]>",
    destdir=tempdir()
)

Advanced BSgenomeForge functions

Description

A set of functions for making a BSgenome data package from any genome assembly not covered by forgeBSgenomeDataPkgFromNCBI() or forgeBSgenomeDataPkgFromUCSC().

Usage

## Main functions:

forgeBSgenomeDataPkg(x, seqs_srcdir=".", destdir=".", replace=FALSE, verbose=TRUE)

forgeMaskedBSgenomeDataPkg(x, masks_srcdir=".", destdir=".", verbose=TRUE)

## Low-level helpers:

forgeSeqlengthsRdsFile(seqnames, prefix="", suffix=".fa",
                       seqs_srcdir=".", seqs_destdir=".",
                       genome=NA_character_, verbose=TRUE)

forgeSeqlengthsRdaFile(seqnames, prefix="", suffix=".fa",
                       seqs_srcdir=".", seqs_destdir=".",
                       genome=NA_character_, verbose=TRUE)

forgeSeqFiles(provider, genome,
              seqnames, mseqnames=NULL,
              seqfile_name=NA, prefix="", suffix=".fa",
              seqs_srcdir=".", seqs_destdir=".",
              ondisk_seq_format=c("2bit", "rds", "rda", "fa.rz", "fa"),
              verbose=TRUE)

forgeMasksFiles(seqnames, nmask_per_seq,
                seqs_destdir=".",
                ondisk_seq_format=c("2bit", "rda", "fa.rz", "fa"),
                masks_srcdir=".", masks_destdir=".",
                AGAPSfiles_type="gap", AGAPSfiles_name=NA,
                AGAPSfiles_prefix="", AGAPSfiles_suffix="_gap.txt",
                RMfiles_name=NA, RMfiles_prefix="", RMfiles_suffix=".fa.out",
                TRFfiles_name=NA, TRFfiles_prefix="", TRFfiles_suffix=".bed",
                verbose=TRUE)
## Main functions:

forgeBSgenomeDataPkg(x, seqs_srcdir=".", destdir=".", replace=FALSE, verbose=TRUE)

forgeMaskedBSgenomeDataPkg(x, masks_srcdir=".", destdir=".", verbose=TRUE)

## Low-level helpers:

forgeSeqlengthsRdsFile(seqnames, prefix="", suffix=".fa",
                       seqs_srcdir=".", seqs_destdir=".",
                       genome=NA_character_, verbose=TRUE)

forgeSeqlengthsRdaFile(seqnames, prefix="", suffix=".fa",
                       seqs_srcdir=".", seqs_destdir=".",
                       genome=NA_character_, verbose=TRUE)

forgeSeqFiles(provider, genome,
              seqnames, mseqnames=NULL,
              seqfile_name=NA, prefix="", suffix=".fa",
              seqs_srcdir=".", seqs_destdir=".",
              ondisk_seq_format=c("2bit", "rds", "rda", "fa.rz", "fa"),
              verbose=TRUE)

forgeMasksFiles(seqnames, nmask_per_seq,
                seqs_destdir=".",
                ondisk_seq_format=c("2bit", "rda", "fa.rz", "fa"),
                masks_srcdir=".", masks_destdir=".",
                AGAPSfiles_type="gap", AGAPSfiles_name=NA,
                AGAPSfiles_prefix="", AGAPSfiles_suffix="_gap.txt",
                RMfiles_name=NA, RMfiles_prefix="", RMfiles_suffix=".fa.out",
                TRFfiles_name=NA, TRFfiles_prefix="", TRFfiles_suffix=".bed",
                verbose=TRUE)

Arguments

`x`	For `forgeBSgenomeDataPkg()`: A BSgenomeDataPkgSeed object or the name of a BSgenome data package seed file. For `forgeMaskedBSgenomeDataPkg()`: A MaskedBSgenomeDataPkgSeed object or the name of a masked BSgenome data package seed file. See the "Advanced BSgenomeForge usage" vignette in this package for more information.
`seqs_srcdir`, `masks_srcdir`	Single strings indicating the path to the source directories i.e. to the directories containing the source data files. Only read access to these directories is needed. See the "Advanced BSgenomeForge usage" vignette in this package for more information.
`destdir`	A single string indicating the path to the directory where the source tree of the target package should be created. This directory must already exist. See the "Advanced BSgenomeForge usage" vignette in this package for more information.
`replace`	`TRUE` or `FALSE`. When set to TRUE, `replace` replaces the package directory if it already exists.
`verbose`	`TRUE` or `FALSE`.
`provider`	The provider of the sequence data files e.g. `"UCSC"`, `"NCBI"`, `"BDGP"`, `"FlyBase"`, etc...
`genome`	The name of the genome. Typically the name of an NCBI assembly (e.g. `"GRCh38.p12"`, `"WBcel235"`, `"TAIR10.1"`, `"ARS-UCD1.2"`, etc...) or UCSC genome (e.g. `"hg38"`, `"bosTau9"`, `"galGal6"`, `"ce11"`, etc...).
`seqnames`, `mseqnames`	A character vector containing the names of the single (for `seqnames`) and multiple (for `mseqnames`) sequences to forge. See the "Advanced BSgenomeForge usage" vignette in this package for more information.
`seqfile_name`, `prefix`, `suffix`	See the "Advanced BSgenomeForge usage" vignette in this package for more information, in particular the description of the `seqfile_name`, `seqfiles_prefix` and `seqfiles_suffix` fields of a BSgenome data package seed file.
`seqs_destdir`, `masks_destdir`	During the forging process the source data files are converted into serialized Biostrings objects. `seqs_destdir` and `masks_destdir` must be single strings indicating the path to the directories where these serialized objects should be saved. These directories must already exist. Both `forgeSeqlengthsRdsFile` and `forgeSeqlengthsRdaFile` will produce a single `.rds` or `.rda` file. Both `forgeSeqFiles` and `forgeMasksFiles` will produce one file per sequence (all files being either `.rds` or `.rda` files).
`ondisk_seq_format`	Specifies how the single sequences should be stored in the forged package. Can be `"2bit"`, `"rds"`, `"rda"`, `"fa.rz"`, or `"fa"`. If `"2bit"` (the default), then all the single sequences are stored in a single twoBit file. If `"rds"` or `"rda"`, then each single sequence is stored in a separated serialized XString derivative (one per single sequence). If `"fa.rz"` or `"fa"`, then all the single sequences are stored in a single FASTA file (compressed in the RAZip format if `"fa.rz"`).
`nmask_per_seq`	A single integer indicating the desired number of masks per sequence. See the "Advanced BSgenomeForge usage" vignette in this package for more information.
`AGAPSfiles_type`, `AGAPSfiles_name`, `AGAPSfiles_prefix`, `AGAPSfiles_suffix`, `RMfiles_name`, `RMfiles_prefix`, `RMfiles_suffix`, `TRFfiles_name`, `TRFfiles_prefix`, `TRFfiles_suffix`	These arguments are named accordingly to the corresponding fields of a BSgenome data package seed file. See the "Advanced BSgenomeForge usage" vignette in this package for more information.

Details

These functions are intended for Bioconductor users who want to make a new BSgenome data package, not for regular users of these packages. See the "Advanced BSgenomeForge usage" vignette in this package (vignette("AdvancedBSgenomeForge")) for an extensive coverage of this topic.

Author(s)

H. Pagès

Examples

seqs_srcdir <- system.file("extdata", package="BSgenome")
seqnames <- c("chrX", "chrM")

## Forge .2bit sequence files:
forgeSeqFiles("UCSC", "ce2",
              seqnames, prefix="ce2", suffix=".fa.gz",
              seqs_srcdir=seqs_srcdir,
              seqs_destdir=tempdir(), ondisk_seq_format="2bit")

## Forge .rds sequence files:
forgeSeqFiles("UCSC", "ce2",
              seqnames, prefix="ce2", suffix=".fa.gz",
              seqs_srcdir=seqs_srcdir,
              seqs_destdir=tempdir(), ondisk_seq_format="rds")

## Sanity checks:
library(BSgenome.Celegans.UCSC.ce2)
genome <- BSgenome.Celegans.UCSC.ce2

ce2_sequences <- import(file.path(tempdir(), "single_sequences.2bit"))
ce2_sequences0 <- DNAStringSet(list(chrX=genome$chrX, chrM=genome$chrM))
stopifnot(identical(names(ce2_sequences0), names(ce2_sequences)),
          all(ce2_sequences0 == ce2_sequences))

chrX <- readRDS(file.path(tempdir(), "chrX.rds"))
stopifnot(genome$chrX == chrX)
chrM <- readRDS(file.path(tempdir(), "chrM.rds"))
stopifnot(genome$chrM == chrM)
seqs_srcdir <- system.file("extdata", package="BSgenome")
seqnames <- c("chrX", "chrM")

## Forge .2bit sequence files:
forgeSeqFiles("UCSC", "ce2",
              seqnames, prefix="ce2", suffix=".fa.gz",
              seqs_srcdir=seqs_srcdir,
              seqs_destdir=tempdir(), ondisk_seq_format="2bit")

## Forge .rds sequence files:
forgeSeqFiles("UCSC", "ce2",
              seqnames, prefix="ce2", suffix=".fa.gz",
              seqs_srcdir=seqs_srcdir,
              seqs_destdir=tempdir(), ondisk_seq_format="rds")

## Sanity checks:
library(BSgenome.Celegans.UCSC.ce2)
genome <- BSgenome.Celegans.UCSC.ce2

ce2_sequences <- import(file.path(tempdir(), "single_sequences.2bit"))
ce2_sequences0 <- DNAStringSet(list(chrX=genome$chrX, chrM=genome$chrM))
stopifnot(identical(names(ce2_sequences0), names(ce2_sequences)),
          all(ce2_sequences0 == ce2_sequences))

chrX <- readRDS(file.path(tempdir(), "chrX.rds"))
stopifnot(genome$chrX == chrX)
chrM <- readRDS(file.path(tempdir(), "chrM.rds"))
stopifnot(genome$chrM == chrM)

Download genomic sequences from NCBI

Description

A utility function to download the compressed FASTA file that contains the genomic sequences of a given NCBI assembly.

Usage

downloadGenomicSequencesFromNCBI(assembly_accession, assembly_name=NA,
                                 destdir=".", method, quiet=FALSE)
downloadGenomicSequencesFromNCBI(assembly_accession, assembly_name=NA,
                                 destdir=".", method, quiet=FALSE)

Arguments

`assembly_accession`	A single string containing a GenBank assembly accession (e.g. `"GCA_000001405.15"`) or a RefSeq assembly accession (e.g. `"GCF_000001405.26"`).
`assembly_name`	A single string or NA.
`destdir`	A single string containing the path to the directory where the compressed FASTA file is to be downloaded. This directory must already exist. Note that, by default, the file will be downloaded to the current directory (`"."`).
`method`, `quiet`	Passed to the internal call to `download.file()`. See `?download.file` in the utils package for more information.

Details

This function is intended for Bioconductor users who want to download the compressed FASTA file from NCBI for a given assembly specified by the assembly_accession argument.

Value

The path to the downloaded file as an invisible string.

Author(s)

Prisca Chidimma Maduka

Examples

## Download the compressed FASTA file for NCBI assembly ASM972954v1 (see
## https://www.ncbi.nlm.nih.gov/assembly/GCF_009729545.1/):
downloadGenomicSequencesFromNCBI("GCF_009729545.1")

## Use the 'destdir' argument to specify the directory where to
## download the file:
downloadGenomicSequencesFromNCBI("GCF_009729545.1", destdir=tempdir())

## Download and import the file in R as a DNAStringSet object:
filepath <- downloadGenomicSequencesFromNCBI("GCF_009729545.1",
                                             destdir=tempdir())
genomic_sequences <- readDNAStringSet(filepath)
genomic_sequences
## Download the compressed FASTA file for NCBI assembly ASM972954v1 (see
## https://www.ncbi.nlm.nih.gov/assembly/GCF_009729545.1/):
downloadGenomicSequencesFromNCBI("GCF_009729545.1")

## Use the 'destdir' argument to specify the directory where to
## download the file:
downloadGenomicSequencesFromNCBI("GCF_009729545.1", destdir=tempdir())

## Download and import the file in R as a DNAStringSet object:
filepath <- downloadGenomicSequencesFromNCBI("GCF_009729545.1",
                                             destdir=tempdir())
genomic_sequences <- readDNAStringSet(filepath)
genomic_sequences

Download genomic sequences from UCSC

Description

A utility function to download the 2bit file that contains the genomic sequences of a given UCSC genome.

Usage

downloadGenomicSequencesFromUCSC(
            genome,
            goldenPath.url=getOption("UCSC.goldenPath.url"),
            destdir=".", method, quiet=FALSE)
downloadGenomicSequencesFromUCSC(
            genome,
            goldenPath.url=getOption("UCSC.goldenPath.url"),
            destdir=".", method, quiet=FALSE)

Arguments

`genome`	This is the name of the UCSC genome sequence to be downloaded. It is used to form the download URL.
`goldenPath.url`	A string set to `getOption("UCSC.goldenPath.url")` by default. `getOption("UCSC.goldenPath.url")` returns the goldenPath URL, http://hgdownload.cse.ucsc.edu/goldenPath.
`destdir`	A single string containing the path to the directory where the 2bit file is to be downloaded. This directory must already exist. Note that, by default, the file will be downloaded to the current directory (`"."`).
`method`, `quiet`	Passed to the internal call to `download.file()`. See `?download.file` in the utils package for more information.

Details

This function is intended for Bioconductor users who want to download the 2bit genomic sequence file of a UCSC genome specified by the genome argument.

Value

The path to the downloaded file as an invisible string.

Author(s)

Emmanuel Chigozie Elendu (Simplecodez)

Examples

## Download the 2bit file for UCSC genome sacCer1:
downloadGenomicSequencesFromUCSC("sacCer1")

## Use the 'destdir' argument to specify the directory where to
## download the file:
downloadGenomicSequencesFromUCSC("sacCer1", destdir=tempdir())

## Download and import the file in R as a DNAStringSet object:
filepath <- downloadGenomicSequencesFromUCSC("sacCer1", destdir=tempdir())
genomic_sequences <- import(filepath)
genomic_sequences
## Download the 2bit file for UCSC genome sacCer1:
downloadGenomicSequencesFromUCSC("sacCer1")

## Use the 'destdir' argument to specify the directory where to
## download the file:
downloadGenomicSequencesFromUCSC("sacCer1", destdir=tempdir())

## Download and import the file in R as a DNAStringSet object:
filepath <- downloadGenomicSequencesFromUCSC("sacCer1", destdir=tempdir())
genomic_sequences <- import(filepath)
genomic_sequences

Convert files from FASTA to 2bit

Description

fastaTo2bit is a utility function to convert a FASTA file to the 2bit format.

Usage

fastaTo2bit(origfile, destfile, assembly_accession=NA)
fastaTo2bit(origfile, destfile, assembly_accession=NA)

Arguments

`origfile`	A single string containing the path to the FASTA file (possibly compressed) to read, e.g. `"felCat9.fa"`, `"felCat9.fa.gz"`, or `"path/to/felCat9.fa.gz"`.
`destfile`	A single string containing the path to the 2bit file to be written, e.g. `"felCat9.2bit"` or `"path/to/felCat9.2bit"`.
`assembly_accession`	A single string containing a GenBank assembly accession (e.g. `"GCA_009729545.1"`) or a RefSeq assembly accession (e.g. `"GCF_009729545.1"`). When specified, this uses `getChromInfoFromNCBI` to get chromosome information for the NCBI assembly, which is matched against the corresponding information in the FASTA file, consequently reordering its sequences. The sequences are then renamed from their GenBank or RefSeq accession assembly names, to their corresponding sequence names. If missing, the function does not perform sequence reordering or renaming.

Details

This function is intended for Bioconductor users who want to convert a FASTA file to the 2bit format.

Value

An invisible NULL.

Author(s)

Atuhurira Kirabo Kakopo and Hervé Pagès

Examples

## Most assemblies at NCBI can be accessed using either their GenBank
## or RefSeq assembly accession. For example assembly ASM972954v1 (for
## Acidianus infernus) can be accessed either with GCA_009729545.1
## (GenBank assembly accession) or GCF_009729545.1 (RefSeq assembly
## accession).
## See https://www.ncbi.nlm.nih.gov/assembly/GCA_009729545.1
## or https://www.ncbi.nlm.nih.gov/assembly/GCF_009729545.1 for
## the landing page of this assembly.

## ---------------------------------------------------------------------
## USING FASTA FILE FROM **GenBank** ASSEMBLY
## ---------------------------------------------------------------------

## Download the FASTA file containing the genomic sequences for
## the ASM972954v1 assembly to the tempdir() folder:
fasta_path <- downloadGenomicSequencesFromNCBI("GCA_009729545.1",
                                               destdir=tempdir())

## Use fastaTo2bit() to convert the file to 2bit. We're using the
## function in its simplest form here so there won't be any sequence
## renaming or reordering:
twobit_path1 <- tempfile(fileext=".2bit")
fastaTo2bit(fasta_path, twobit_path1)

## Take a look at the sequence names in the resulting 2bit file:
seqlevels(rtracklayer::TwoBitFile(twobit_path1))

## Note that seqlevels(rtracklayer::TwoBitFile(.)) is equivalent to
## names(rtracklayer::import.2bit(.)) but a lot more efficient because
## it doesn't load the sequences.

## Use fastaTo2bit() again to convert the file to 2bit. However
## this time we want the function to rename and reorder the
## sequences as in getChromInfoFromNCBI("GCA_009729545.1"), so
## we set 'assembly_accession' to "GCA_009729545.1" in the call
## to fastaTo2bit():
twobit_path2 <- tempfile(fileext=".2bit")
fastaTo2bit(fasta_path, twobit_path2, assembly_accession="GCA_009729545.1")

## Take a look at the sequence names in the resulting 2bit file:
seqlevels(rtracklayer::TwoBitFile(twobit_path2))

## ---------------------------------------------------------------------
## USING FASTA FILE FROM **RefSeq** ASSEMBLY
## ---------------------------------------------------------------------

## Same as above but using GCF_009729545.1 instead of GCA_009729545.1

fasta_path <- downloadGenomicSequencesFromNCBI("GCF_009729545.1",
                                               destdir=tempdir())

twobit_path1 <- tempfile(fileext=".2bit")
fastaTo2bit(fasta_path, twobit_path1)
seqlevels(rtracklayer::TwoBitFile(twobit_path1))

twobit_path2 <- tempfile(fileext=".2bit")
fastaTo2bit(fasta_path, twobit_path2, assembly_accession="GCF_009729545.1")
seqlevels(rtracklayer::TwoBitFile(twobit_path2))

## ---------------------------------------------------------------------
## USING A FASTA FILE WITH IUPAC AMBIGUITY LETTERS
## ---------------------------------------------------------------------

## Download the FASTA file containing the genomic sequences for
## for Escherichia coli assembly ASM1484v1 (see
## https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000014845.1/):
fasta_path <- downloadGenomicSequencesFromNCBI("GCF_000014845.1",
                                               destdir=tempdir())

## The DNA sequences in this file contain IUPAC ambiguity letters
## not supported by the 2bit format, so fastaTo2bit() will replace
## them with N's and issue a warning:
twobit_path <- tempfile(fileext=".2bit")
fastaTo2bit(fasta_path, twobit_path)  # warning!

## Use suppressWarnings() to suppress the warning:
suppressWarnings(fastaTo2bit(fasta_path, twobit_path))
## Most assemblies at NCBI can be accessed using either their GenBank
## or RefSeq assembly accession. For example assembly ASM972954v1 (for
## Acidianus infernus) can be accessed either with GCA_009729545.1
## (GenBank assembly accession) or GCF_009729545.1 (RefSeq assembly
## accession).
## See https://www.ncbi.nlm.nih.gov/assembly/GCA_009729545.1
## or https://www.ncbi.nlm.nih.gov/assembly/GCF_009729545.1 for
## the landing page of this assembly.

## ---------------------------------------------------------------------
## USING FASTA FILE FROM **GenBank** ASSEMBLY
## ---------------------------------------------------------------------

## Download the FASTA file containing the genomic sequences for
## the ASM972954v1 assembly to the tempdir() folder:
fasta_path <- downloadGenomicSequencesFromNCBI("GCA_009729545.1",
                                               destdir=tempdir())

## Use fastaTo2bit() to convert the file to 2bit. We're using the
## function in its simplest form here so there won't be any sequence
## renaming or reordering:
twobit_path1 <- tempfile(fileext=".2bit")
fastaTo2bit(fasta_path, twobit_path1)

## Take a look at the sequence names in the resulting 2bit file:
seqlevels(rtracklayer::TwoBitFile(twobit_path1))

## Note that seqlevels(rtracklayer::TwoBitFile(.)) is equivalent to
## names(rtracklayer::import.2bit(.)) but a lot more efficient because
## it doesn't load the sequences.

## Use fastaTo2bit() again to convert the file to 2bit. However
## this time we want the function to rename and reorder the
## sequences as in getChromInfoFromNCBI("GCA_009729545.1"), so
## we set 'assembly_accession' to "GCA_009729545.1" in the call
## to fastaTo2bit():
twobit_path2 <- tempfile(fileext=".2bit")
fastaTo2bit(fasta_path, twobit_path2, assembly_accession="GCA_009729545.1")

## Take a look at the sequence names in the resulting 2bit file:
seqlevels(rtracklayer::TwoBitFile(twobit_path2))

## ---------------------------------------------------------------------
## USING FASTA FILE FROM **RefSeq** ASSEMBLY
## ---------------------------------------------------------------------

## Same as above but using GCF_009729545.1 instead of GCA_009729545.1

fasta_path <- downloadGenomicSequencesFromNCBI("GCF_009729545.1",
                                               destdir=tempdir())

twobit_path1 <- tempfile(fileext=".2bit")
fastaTo2bit(fasta_path, twobit_path1)
seqlevels(rtracklayer::TwoBitFile(twobit_path1))

twobit_path2 <- tempfile(fileext=".2bit")
fastaTo2bit(fasta_path, twobit_path2, assembly_accession="GCF_009729545.1")
seqlevels(rtracklayer::TwoBitFile(twobit_path2))

## ---------------------------------------------------------------------
## USING A FASTA FILE WITH IUPAC AMBIGUITY LETTERS
## ---------------------------------------------------------------------

## Download the FASTA file containing the genomic sequences for
## for Escherichia coli assembly ASM1484v1 (see
## https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000014845.1/):
fasta_path <- downloadGenomicSequencesFromNCBI("GCF_000014845.1",
                                               destdir=tempdir())

## The DNA sequences in this file contain IUPAC ambiguity letters
## not supported by the 2bit format, so fastaTo2bit() will replace
## them with N's and issue a warning:
twobit_path <- tempfile(fileext=".2bit")
fastaTo2bit(fasta_path, twobit_path)  # warning!

## Use suppressWarnings() to suppress the warning:
suppressWarnings(fastaTo2bit(fasta_path, twobit_path))

Create a BSgenome data package from an NCBI assembly

Description

The forgeBSgenomeDataPkgFRomNCBI function allows the user to create a BSgenome data package from an NCBI assembly.

Usage

forgeBSgenomeDataPkgFromNCBI(assembly_accession,
             pkg_maintainer, pkg_author=NA,
             pkg_version="1.0.0", pkg_license="Artistic-2.0",
             organism=NULL, circ_seqs=NULL, destdir=".")
forgeBSgenomeDataPkgFromNCBI(assembly_accession,
             pkg_maintainer, pkg_author=NA,
             pkg_version="1.0.0", pkg_license="Artistic-2.0",
             organism=NULL, circ_seqs=NULL, destdir=".")

Arguments

`assembly_accession`	A single string containing a GenBank assembly accession (e.g. `"GCA_009729545.1"`) or a RefSeq assembly accession (e.g. `"GCF_000857545.1"`). Alternatively, if the assembly is registered in the GenomeInfoDb package (see `?GenomeInfoDb::registered_NCBI_assemblies`), the assembly name (e.g. `"mLoxAfr1.hap2"`) can be supplied instead of its GenBank or RefSeq accession.
`pkg_maintainer`	A single string containing the name and email address of the package maintainer (e.g `"Jane Doe, <[email protected]>"`).
`pkg_author`	A single string containing the name of the package author. When unspecified, this takes the value of `pkg_maintainer`.
`pkg_version`	The version of the package. Set to `"1.0.0"` by default.
`pkg_license`	The license of the package. This must be the name of a software license used for free and open-source packages. Set to `"Artistic-2.0"` by default.
`organism`	The full name of the organism e.g. `"Homo sapiens"`, `"Felis catus"`, `"Loxodonta africana"`, `"Acidianus infernus"`, etc... Only needs to be specified if the assembly is not registered in the GenomeInfoDb package (see `?GenomeInfoDb::registered_NCBI_assemblies`).
`circ_seqs`	NULL (the default), or a character vector containing the names of the circular sequences in the assembly. This only needs to be specified if the assembly is not registered in the GenomeInfoDb package (if the assembly is registered then its circular sequences are known so there's no need to specify `circ_seqs`). Notes: You can use `registered_NCBI_assemblies()` to get the list of assemblies that are registered in the GenomeInfoDb package. Only assembled molecules can be circular. To see the list of assembled molecules for a given assembly, call `getChromInfoFromNCBI(assembly_accession, assembled.molecules.only=TRUE)$SequenceName`. If the assembly is not registered and does not have circular sequences, then `circ_seqs` must be set to `character(0)`.
`destdir`	A single string containing the path to the directory where the BSgenome data package is to be created. This directory must already exist. Note that, by default, the package will be created in the current directory (`"."`).

Details

This function is intended for Bioconductor users who want to forge a BSgenome data package from an NCBI assembly. It typically makes use of the downloadGenomicSequencesFromNCBI utility function to download the compressed FASTA file that contains the genomic sequences of the assembly, and stores it in the working directory. However, if the file already exists in the working directory, then it is used and not downloaded again.

Value

The path to the created package as an invisible string.

Author(s)

Atuhurira Kirabo Kakopo

Examples

## ---------------------------------------------------------------------
## EXAMPLE 1
## ---------------------------------------------------------------------

## Create a BSgenome data package for NCBI assembly GCA_009729545.1
## (organism Acidianus infernus):
forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCA_009729545.1",
                             pkg_maintainer="Jane Doe <[email protected]>",
                             organism="Acidianus infernus",
                             destdir=tempdir())

## ---------------------------------------------------------------------
## EXAMPLE 2
## ---------------------------------------------------------------------

## Create a BSgenome data package for NCBI assembly GCF_000857545.1
## (organism Torque teno virus 1):
forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCF_000857545.1",
                             pkg_maintainer="Jane Doe <[email protected]>",
                             organism="Torque teno virus 1",
                             circ_seqs="NC_002076.2",
                             destdir=tempdir())
## ---------------------------------------------------------------------
## EXAMPLE 1
## ---------------------------------------------------------------------

## Create a BSgenome data package for NCBI assembly GCA_009729545.1
## (organism Acidianus infernus):
forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCA_009729545.1",
                             pkg_maintainer="Jane Doe <[email protected]>",
                             organism="Acidianus infernus",
                             destdir=tempdir())

## ---------------------------------------------------------------------
## EXAMPLE 2
## ---------------------------------------------------------------------

## Create a BSgenome data package for NCBI assembly GCF_000857545.1
## (organism Torque teno virus 1):
forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCF_000857545.1",
                             pkg_maintainer="Jane Doe <[email protected]>",
                             organism="Torque teno virus 1",
                             circ_seqs="NC_002076.2",
                             destdir=tempdir())

Create a BSgenome data package from a 2bit file

Description

The forgeBSgenomeDataPkgFromTwobitFile function allows the user to create a BSgenome data package from a 2bit file.

Usage

forgeBSgenomeDataPkgFromTwobitFile(filepath, organism, provider, genome,
             pkg_maintainer, pkg_author=NA,
             pkg_version="1.0.0", pkg_license="Artistic-2.0",
             seqnames=NULL,
             circ_seqs=NULL,
             destdir=".")
forgeBSgenomeDataPkgFromTwobitFile(filepath, organism, provider, genome,
             pkg_maintainer, pkg_author=NA,
             pkg_version="1.0.0", pkg_license="Artistic-2.0",
             seqnames=NULL,
             circ_seqs=NULL,
             destdir=".")

Arguments

`filepath`	A single string containing a path to a 2bit file.
`organism`	The full name of the organism e.g. `"Escherichia coli"`, `"Mus musculus"`, `"Saccharomyces cerevisiae"`, etc...
`provider`	The provider of the sequence data stored in the 2bit file e.g. `"NCBI"`, `"UCSC"`, `"BDGP"`, `"FlyBase"`, `"WormBase"`, etc... Will be used to make part 3 of the package name.
`genome`	A single string specifying the name of the genome assembly e.g. `"ASM1484v1"`, `"mm39"`, or `"sacCer3"`. Will be used to make part 4 of the package name.
`pkg_maintainer`	A single string containing the name and email address of the package maintainer (e.g `"Jane Doe, <[email protected]>"`).
`pkg_author`	A single string containing the name of the package author. When unspecified, this takes the value of `pkg_maintainer`.
`pkg_version`	The version of the package. Set to `"1.0.0"` by default.
`pkg_license`	The license of the package. This must be the name of a software license used for free and open-source packages. Set to `"Artistic-2.0"` by default.
`seqnames`	NULL (the default), or a character vector containing a subset of the sequence names stored in the 2bit file. Use this to select and/or reorder the sequences that will go in the BSgenome data package. By default (i.e. when `seqnames` is NULL), all the sequences in the file are selected and put in the BSgenome data package in the order that they appear in the 2bit file. Use `seqlevels(rtracklayer::TwoBitFile(filepath))` to see the sequence names stored in the 2bit file.
`circ_seqs`	NULL (the default), or a character vector providing the names of the sequences stored in the 2bit file that are known to be circular. Note that if `seqnames` is supplied then `circ_seqs` must be a subset of it. Set to `circ_seqs` to `character(0)` if none of the sequences in the 2bit file (or in `seqnames`) is known to be circular. By default (i.e. if `circ_seqs` is NULL), then the circular sequences are guessed based on their names. Because this is not really reliable, a warning is issued.
`destdir`	A single string containing the path to the directory where the BSgenome data package is to be created. This directory must already exist. Note that, by default, the package will be created in the current directory (`"."`).

Details

This function is intended for Bioconductor users who want to forge a BSgenome data package from a 2bit file.

Value

The path to the created package as an invisible string.

Author(s)

Hervé Pagès

Examples

## Download the FASTA file containing the genomic sequences for
## for Escherichia coli assembly ASM1484v1 (see
## https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000014845.1/):
fasta_path <- downloadGenomicSequencesFromNCBI("GCF_000014845.1",
                                               destdir=tempdir())

## Use fastaTo2bit() to convert the file to 2bit:
twobit_path <- tempfile(fileext=".2bit")
fastaTo2bit(fasta_path, twobit_path)

## All the DNA sequences in Escherichia coli are circular:
circ_seqs <- seqlevels(rtracklayer::TwoBitFile(twobit_path))

## Note that seqlevels(rtracklayer::TwoBitFile(.)) is equivalent to
## names(rtracklayer::import.2bit(.)) but a lot more efficient because
## it doesn't load the sequences.

## Create a BSgenome data package from the 2bit file:
forgeBSgenomeDataPkgFromTwobitFile(
    filepath=twobit_path,
    organism="Escherichia coli",
    provider="NCBI",
    genome="ASM1484v1",
    pkg_maintainer="Jane Doe <[email protected]>",
    circ_seqs=circ_seqs,
    destdir=tempdir()
)
## Download the FASTA file containing the genomic sequences for
## for Escherichia coli assembly ASM1484v1 (see
## https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000014845.1/):
fasta_path <- downloadGenomicSequencesFromNCBI("GCF_000014845.1",
                                               destdir=tempdir())

## Use fastaTo2bit() to convert the file to 2bit:
twobit_path <- tempfile(fileext=".2bit")
fastaTo2bit(fasta_path, twobit_path)

## All the DNA sequences in Escherichia coli are circular:
circ_seqs <- seqlevels(rtracklayer::TwoBitFile(twobit_path))

## Note that seqlevels(rtracklayer::TwoBitFile(.)) is equivalent to
## names(rtracklayer::import.2bit(.)) but a lot more efficient because
## it doesn't load the sequences.

## Create a BSgenome data package from the 2bit file:
forgeBSgenomeDataPkgFromTwobitFile(
    filepath=twobit_path,
    organism="Escherichia coli",
    provider="NCBI",
    genome="ASM1484v1",
    pkg_maintainer="Jane Doe <[email protected]>",
    circ_seqs=circ_seqs,
    destdir=tempdir()
)

Create a BSgenome data package from a UCSC genome

Description

The forgeBSgenomeDataPkgFromUCSC function allows the user to create a BSgenome data package from a UCSC genome.

Usage

forgeBSgenomeDataPkgFromUCSC(genome, organism,
             pkg_maintainer, pkg_author=NA,
             pkg_version="1.0.0", pkg_license="Artistic-2.0",
             circ_seqs=NULL,
             goldenPath.url=getOption("UCSC.goldenPath.url"),
             destdir=".")
forgeBSgenomeDataPkgFromUCSC(genome, organism,
             pkg_maintainer, pkg_author=NA,
             pkg_version="1.0.0", pkg_license="Artistic-2.0",
             circ_seqs=NULL,
             goldenPath.url=getOption("UCSC.goldenPath.url"),
             destdir=".")

Arguments

`genome`	A single string specifying the name of a UCSC genome (e.g. `"mm39"` or `"sacCer3"`).
`organism`	The full name of the organism e.g. `"Mus musculus"`, `"Saccharomyces cerevisiae"`, etc...
`pkg_maintainer`	A single string containing the name and email address of the package maintainer (e.g `"Jane Doe, <[email protected]>"`).
`pkg_author`	A single string containing the name of the package author. When unspecified, this takes the value of `pkg_maintainer`.
`pkg_version`	The version of the package. Set to `"1.0.0"` by default.
`pkg_license`	The license of the package. This must be the name of a software license used for free and open-source packages. Set to `"Artistic-2.0"` by default.
`circ_seqs`	NULL (the default), or a character vector containing the names of the circular sequences in the UCSC genome. This only needs to be specified if the genome is not registered in the GenomeInfoDb package (if the genome is registered then its circular sequences are known so there's no need to specify `circ_seqs`). Notes: You can use `registered_UCSC_genomes()` to get the list of UCSC genomes that are registered in the GenomeInfoDb package. If the genome is not registered and does not have circular sequences, then `circ_seqs` must be set to `character(0)`.
`goldenPath.url`	A single string specifying the URL to the UCSC goldenPath location where the genomic sequences and chromosome sizes are expected to be found.
`destdir`	A single string containing the path to the directory where the BSgenome data package is to be created. This directory must already exist. Note that, by default, the package will be created in the current directory (`"."`).

Details

This function is intended for Bioconductor users who want to forge a BSgenome data package from a UCSC genome. It typically makes use of the downloadGenomicSequencesFromUCSC utility function to download the 2bit file that contains the genomic sequences of the genome, and stores it in the working directory. However, if the file already exists in the working directory, then it is used and not downloaded again.

Value

The path to the created package as an invisible string.

Author(s)

Hervé Pagès

Examples

## Create a BSgenome data package for UCSC genome wuhCor1 (SARS-CoV-2
## assembly, see https://genome.ucsc.edu/cgi-bin/hgGateway?db=wuhCor1):
forgeBSgenomeDataPkgFromUCSC(
    genome="wuhCor1",
    organism="Severe acute respiratory syndrome coronavirus 2",
    pkg_maintainer="Jane Doe <[email protected]>",
    destdir=tempdir()
)
## Create a BSgenome data package for UCSC genome wuhCor1 (SARS-CoV-2
## assembly, see https://genome.ucsc.edu/cgi-bin/hgGateway?db=wuhCor1):
forgeBSgenomeDataPkgFromUCSC(
    genome="wuhCor1",
    organism="Severe acute respiratory syndrome coronavirus 2",
    pkg_maintainer="Jane Doe <[email protected]>",
    destdir=tempdir()
)

Package 'BSgenomeForge'

Help Index

The BSgenomeForge package

Description

Details

Author(s)

See Also

Examples

Advanced BSgenomeForge functions

Description

Usage

Arguments

Details

Author(s)

See Also

Examples

Download genomic sequences from NCBI

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Download genomic sequences from UCSC

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Convert files from FASTA to 2bit

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Create a BSgenome data package from an NCBI assembly

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Create a BSgenome data package from a 2bit file

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Create a BSgenome data package from a UCSC genome

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples