Title: | Forge your own BSgenome data package |
---|---|
Description: | A set of tools to forge BSgenome data packages. Supersedes the old seed-based tools from the BSgenome software package. This package allows the user to create a BSgenome data package in one function call, simplifying the old seed-based process. |
Authors: | Hervé Pagès [aut, cre], Atuhurira Kirabo Kakopo [aut], Emmanuel Chigozie Elendu [ctb], Prisca Chidimma Maduka [ctb] |
Maintainer: | Hervé Pagès <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.7.0 |
Built: | 2024-10-30 04:28:44 UTC |
Source: | https://github.com/bioc/BSgenomeForge |
A package that simplifies the process of forging a BSgenome data package, by allowing the user to use one function to create the package.
BSgenomeForge provides two major functions, forgeBSgenomeDataPkgFromNCBI and forgeBSgenomeDataPkgFromUCSC, which allow one to forge a BSgenome data package from an NCBI assembly or a UCSC genome, respectively.
For an overview of the functionality provided by the package, please see the
vignette:
vignette("QuickBSgenomeForge", package="BSgenomeForge")
Atuhurira Kirabo Kakopo, Hervé Pagès
Maintainer: Hervé Pagès
The forgeBSgenomeDataPkgFromNCBI function for creating a BSgenome data package from an NCBI assembly.
The forgeBSgenomeDataPkgFromUCSC function for creating a BSgenome data package from a UCSC genome.
    ## ---------------------------------------------------------------------
    ## EXAMPLE 1
    ## ---------------------------------------------------------------------

    ## Create a BSgenome data package for NCBI assembly GCF_000857545.1
    ## (organism Torque teno virus 1):
    forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCF_000857545.1",
                                 pkg_maintainer="Jane Doe <[email protected]>",
                                 organism="Torque teno virus 1",
                                 circ_seqs="NC_002076.2",
                                 destdir=tempdir())

    ## ---------------------------------------------------------------------
    ## EXAMPLE 2
    ## ---------------------------------------------------------------------

    ## Create a BSgenome data package for UCSC genome wuhCor1 (SARS-CoV-2
    ## assembly, see https://genome.ucsc.edu/cgi-bin/hgGateway?db=wuhCor1):
    forgeBSgenomeDataPkgFromUCSC(
        genome="wuhCor1",
        organism="Severe acute respiratory syndrome coronavirus 2",
        pkg_maintainer="Jane Doe <[email protected]>",
        destdir=tempdir()
    )
A set of functions for making a BSgenome data package from any genome assembly not covered by forgeBSgenomeDataPkgFromNCBI() or forgeBSgenomeDataPkgFromUCSC().
    ## Main functions:

    forgeBSgenomeDataPkg(x, seqs_srcdir=".", destdir=".", replace=FALSE,
                         verbose=TRUE)

    forgeMaskedBSgenomeDataPkg(x, masks_srcdir=".", destdir=".", verbose=TRUE)

    ## Low-level helpers:

    forgeSeqlengthsRdsFile(seqnames, prefix="", suffix=".fa",
                           seqs_srcdir=".", seqs_destdir=".",
                           genome=NA_character_, verbose=TRUE)

    forgeSeqlengthsRdaFile(seqnames, prefix="", suffix=".fa",
                           seqs_srcdir=".", seqs_destdir=".",
                           genome=NA_character_, verbose=TRUE)

    forgeSeqFiles(provider, genome, seqnames, mseqnames=NULL,
                  seqfile_name=NA, prefix="", suffix=".fa",
                  seqs_srcdir=".", seqs_destdir=".",
                  ondisk_seq_format=c("2bit", "rds", "rda", "fa.rz", "fa"),
                  verbose=TRUE)

    forgeMasksFiles(seqnames, nmask_per_seq,
                    seqs_destdir=".",
                    ondisk_seq_format=c("2bit", "rda", "fa.rz", "fa"),
                    masks_srcdir=".", masks_destdir=".",
                    AGAPSfiles_type="gap", AGAPSfiles_name=NA,
                    AGAPSfiles_prefix="", AGAPSfiles_suffix="_gap.txt",
                    RMfiles_name=NA, RMfiles_prefix="", RMfiles_suffix=".fa.out",
                    TRFfiles_name=NA, TRFfiles_prefix="", TRFfiles_suffix=".bed",
                    verbose=TRUE)
x |
For For See the "Advanced BSgenomeForge usage" vignette in this package for more information. |
seqs_srcdir , masks_srcdir
|
Single strings indicating the path to the source directories i.e. to the directories containing the source data files. Only read access to these directories is needed. See the "Advanced BSgenomeForge usage" vignette in this package for more information. |
destdir |
A single string indicating the path to the directory where the source tree of the target package should be created. This directory must already exist. See the "Advanced BSgenomeForge usage" vignette in this package for more information. |
replace |
|
verbose |
|
provider |
The provider of the sequence data files e.g.
|
genome |
The name of the genome. Typically the name of an NCBI assembly (e.g.
|
seqnames , mseqnames
|
A character vector containing the names of the single (for |
seqfile_name , prefix , suffix
|
See the "Advanced BSgenomeForge usage" vignette in this package for more
information, in particular the description of the |
seqs_destdir , masks_destdir
|
During the forging process the source data files are converted into
serialized Biostrings objects. Both |
ondisk_seq_format |
Specifies how the single sequences should be stored in the forged package.
Can be |
nmask_per_seq |
A single integer indicating the desired number of masks per sequence. See the "Advanced BSgenomeForge usage" vignette in this package for more information. |
AGAPSfiles_type , AGAPSfiles_name , AGAPSfiles_prefix , AGAPSfiles_suffix , RMfiles_name , RMfiles_prefix , RMfiles_suffix , TRFfiles_name , TRFfiles_prefix , TRFfiles_suffix
|
These arguments are named according to the corresponding fields of a BSgenome data package seed file. See the "Advanced BSgenomeForge usage" vignette in this package for more information. |
These functions are intended for Bioconductor users who want to make a new
BSgenome data package, not for regular users of these packages.
See the "Advanced BSgenomeForge usage" vignette in this package (vignette("AdvancedBSgenomeForge")) for extensive coverage of this topic.
H. Pagès
forgeBSgenomeDataPkgFromNCBI and forgeBSgenomeDataPkgFromUCSC in the BSgenomeForge package.
available.genomes to find BSgenome data packages available in Bioconductor.
BSgenome objects.
    seqs_srcdir <- system.file("extdata", package="BSgenome")
    seqnames <- c("chrX", "chrM")

    ## Forge .2bit sequence files:
    forgeSeqFiles("UCSC", "ce2", seqnames, prefix="ce2", suffix=".fa.gz",
                  seqs_srcdir=seqs_srcdir, seqs_destdir=tempdir(),
                  ondisk_seq_format="2bit")

    ## Forge .rds sequence files:
    forgeSeqFiles("UCSC", "ce2", seqnames, prefix="ce2", suffix=".fa.gz",
                  seqs_srcdir=seqs_srcdir, seqs_destdir=tempdir(),
                  ondisk_seq_format="rds")

    ## Sanity checks:
    library(BSgenome.Celegans.UCSC.ce2)
    genome <- BSgenome.Celegans.UCSC.ce2

    ce2_sequences <- import(file.path(tempdir(), "single_sequences.2bit"))
    ce2_sequences0 <- DNAStringSet(list(chrX=genome$chrX, chrM=genome$chrM))
    stopifnot(identical(names(ce2_sequences0), names(ce2_sequences)),
              all(ce2_sequences0 == ce2_sequences))

    chrX <- readRDS(file.path(tempdir(), "chrX.rds"))
    stopifnot(genome$chrX == chrX)
    chrM <- readRDS(file.path(tempdir(), "chrM.rds"))
    stopifnot(genome$chrM == chrM)
A utility function to download the compressed FASTA file that contains the genomic sequences of a given NCBI assembly.
downloadGenomicSequencesFromNCBI(assembly_accession, assembly_name=NA, destdir=".", method, quiet=FALSE)
assembly_accession |
A single string containing a GenBank assembly accession (e.g.
|
assembly_name |
A single string or NA. |
destdir |
A single string containing the path to the directory where the
compressed FASTA file is to be downloaded. This directory must already
exist.
Note that, by default, the file will be downloaded to the current
directory ( |
method , quiet
|
Passed to the internal call to |
This function is intended for Bioconductor users who want to download the compressed FASTA file from NCBI for a given assembly specified by the assembly_accession argument.
The path to the downloaded file as an invisible string.
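Because the path is returned invisibly, it is not printed at the console unless it is assigned to a variable or wrapped in parentheses. The base R sketch below illustrates that convention with a hypothetical stand-in, fake_download(), which is not part of BSgenomeForge and performs no download.

```r
## Hypothetical stand-in that mimics only the return convention of
## downloadGenomicSequencesFromNCBI(): it creates an empty placeholder
## file and returns its path invisibly. It does not contact NCBI.
fake_download <- function(destdir = ".") {
    stopifnot(dir.exists(destdir))
    filepath <- file.path(destdir, "example_genomic.fna.gz")
    file.create(filepath)
    invisible(filepath)
}

fake_download(tempdir())              # nothing is printed at top level
filepath <- fake_download(tempdir())  # capture the path for later use
file.exists(filepath)
```

Capturing the value, as in the last two lines, is how the downloaded file would typically be handed to a later step such as readDNAStringSet().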
Prisca Chidimma Maduka
The download.file function in the utils package that downloadGenomicSequencesFromNCBI uses internally to download the compressed FASTA file.
The downloadGenomicSequencesFromUCSC function to download genomic sequences from UCSC.
    ## Download the compressed FASTA file for NCBI assembly ASM972954v1 (see
    ## https://www.ncbi.nlm.nih.gov/assembly/GCF_009729545.1/):
    downloadGenomicSequencesFromNCBI("GCF_009729545.1")

    ## Use the 'destdir' argument to specify the directory where to
    ## download the file:
    downloadGenomicSequencesFromNCBI("GCF_009729545.1", destdir=tempdir())

    ## Download and import the file in R as a DNAStringSet object:
    filepath <- downloadGenomicSequencesFromNCBI("GCF_009729545.1",
                                                 destdir=tempdir())
    genomic_sequences <- readDNAStringSet(filepath)
    genomic_sequences
A utility function to download the 2bit file that contains the genomic sequences of a given UCSC genome.
downloadGenomicSequencesFromUCSC( genome, goldenPath.url=getOption("UCSC.goldenPath.url"), destdir=".", method, quiet=FALSE)
genome |
This is the name of the UCSC genome sequence to be downloaded. It is used to form the download URL. |
goldenPath.url |
A string set to |
destdir |
A single string containing the path to the directory where the 2bit file
is to be downloaded. This directory must already exist.
Note that, by default, the file will be downloaded to the current
directory ( |
method , quiet
|
Passed to the internal call to |
This function is intended for Bioconductor users who want to download the 2bit genomic sequence file of a UCSC genome specified by the genome argument.
The path to the downloaded file as an invisible string.
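The genome argument is used to form the download URL. As a rough illustration, and assuming the usual UCSC layout in which the 2bit file sits at <goldenPath.url>/<genome>/bigZips/<genome>.2bit, the URL could be assembled as in the base R sketch below. The helper name ucsc_2bit_url() and the default base URL are assumptions made for this example; the URL that downloadGenomicSequencesFromUCSC actually requests may differ.

```r
## Hypothetical helper (not part of BSgenomeForge). The base URL and the
## bigZips layout are assumptions about how UCSC organizes its downloads.
ucsc_2bit_url <- function(genome,
                          goldenPath.url = "https://hgdownload.soe.ucsc.edu/goldenPath") {
    stopifnot(is.character(genome), length(genome) == 1L, !is.na(genome))
    paste0(goldenPath.url, "/", genome, "/bigZips/", genome, ".2bit")
}

url <- ucsc_2bit_url("sacCer1")
url
```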
Emmanuel Chigozie Elendu (Simplecodez)
The download.file function in the utils package that downloadGenomicSequencesFromUCSC uses internally to download the 2bit file.
The downloadGenomicSequencesFromNCBI function to download genomic sequences from NCBI.
    ## Download the 2bit file for UCSC genome sacCer1:
    downloadGenomicSequencesFromUCSC("sacCer1")

    ## Use the 'destdir' argument to specify the directory where to
    ## download the file:
    downloadGenomicSequencesFromUCSC("sacCer1", destdir=tempdir())

    ## Download and import the file in R as a DNAStringSet object:
    filepath <- downloadGenomicSequencesFromUCSC("sacCer1", destdir=tempdir())
    genomic_sequences <- import(filepath)
    genomic_sequences
fastaTo2bit is a utility function to convert a FASTA file to the 2bit format.
fastaTo2bit(origfile, destfile, assembly_accession=NA)
origfile |
A single string containing the path to the FASTA file
(possibly compressed) to read, e.g. |
destfile |
A single string containing the path to the 2bit file to be
written, e.g. |
assembly_accession |
A single string containing a GenBank assembly accession (e.g.
|
This function is intended for Bioconductor users who want to convert a FASTA file to the 2bit format.
An invisible NULL.
Atuhurira Kirabo Kakopo and Hervé Pagès
The readDNAStringSet function in the Biostrings package that fastaTo2bit uses internally to import the FASTA file.
The export.2bit function in the rtracklayer package that fastaTo2bit uses internally to export the 2bit file.
The getChromInfoFromNCBI function in the GenomeInfoDb package that fastaTo2bit uses internally to get chromosome information for the specified NCBI assembly.
The downloadGenomicSequencesFromNCBI function that downloads genomic sequences from NCBI.
    ## Most assemblies at NCBI can be accessed using either their GenBank
    ## or RefSeq assembly accession. For example assembly ASM972954v1 (for
    ## Acidianus infernus) can be accessed either with GCA_009729545.1
    ## (GenBank assembly accession) or GCF_009729545.1 (RefSeq assembly
    ## accession).
    ## See https://www.ncbi.nlm.nih.gov/assembly/GCA_009729545.1
    ## or https://www.ncbi.nlm.nih.gov/assembly/GCF_009729545.1 for
    ## the landing page of this assembly.

    ## ---------------------------------------------------------------------
    ## USING FASTA FILE FROM **GenBank** ASSEMBLY
    ## ---------------------------------------------------------------------

    ## Download the FASTA file containing the genomic sequences for
    ## the ASM972954v1 assembly to the tempdir() folder:
    fasta_path <- downloadGenomicSequencesFromNCBI("GCA_009729545.1",
                                                   destdir=tempdir())

    ## Use fastaTo2bit() to convert the file to 2bit. We're using the
    ## function in its simplest form here so there won't be any sequence
    ## renaming or reordering:
    twobit_path1 <- tempfile(fileext=".2bit")
    fastaTo2bit(fasta_path, twobit_path1)

    ## Take a look at the sequence names in the resulting 2bit file:
    seqlevels(rtracklayer::TwoBitFile(twobit_path1))

    ## Note that seqlevels(rtracklayer::TwoBitFile(.)) is equivalent to
    ## names(rtracklayer::import.2bit(.)) but a lot more efficient because
    ## it doesn't load the sequences.

    ## Use fastaTo2bit() again to convert the file to 2bit. However
    ## this time we want the function to rename and reorder the
    ## sequences as in getChromInfoFromNCBI("GCA_009729545.1"), so
    ## we set 'assembly_accession' to "GCA_009729545.1" in the call
    ## to fastaTo2bit():
    twobit_path2 <- tempfile(fileext=".2bit")
    fastaTo2bit(fasta_path, twobit_path2,
                assembly_accession="GCA_009729545.1")

    ## Take a look at the sequence names in the resulting 2bit file:
    seqlevels(rtracklayer::TwoBitFile(twobit_path2))

    ## ---------------------------------------------------------------------
    ## USING FASTA FILE FROM **RefSeq** ASSEMBLY
    ## ---------------------------------------------------------------------

    ## Same as above but using GCF_009729545.1 instead of GCA_009729545.1
    fasta_path <- downloadGenomicSequencesFromNCBI("GCF_009729545.1",
                                                   destdir=tempdir())

    twobit_path1 <- tempfile(fileext=".2bit")
    fastaTo2bit(fasta_path, twobit_path1)
    seqlevels(rtracklayer::TwoBitFile(twobit_path1))

    twobit_path2 <- tempfile(fileext=".2bit")
    fastaTo2bit(fasta_path, twobit_path2,
                assembly_accession="GCF_009729545.1")
    seqlevels(rtracklayer::TwoBitFile(twobit_path2))

    ## ---------------------------------------------------------------------
    ## USING A FASTA FILE WITH IUPAC AMBIGUITY LETTERS
    ## ---------------------------------------------------------------------

    ## Download the FASTA file containing the genomic sequences
    ## for Escherichia coli assembly ASM1484v1 (see
    ## https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000014845.1/):
    fasta_path <- downloadGenomicSequencesFromNCBI("GCF_000014845.1",
                                                   destdir=tempdir())

    ## The DNA sequences in this file contain IUPAC ambiguity letters
    ## not supported by the 2bit format, so fastaTo2bit() will replace
    ## them with N's and issue a warning:
    twobit_path <- tempfile(fileext=".2bit")
    fastaTo2bit(fasta_path, twobit_path)  # warning!

    ## Use suppressWarnings() to suppress the warning:
    suppressWarnings(fastaTo2bit(fasta_path, twobit_path))
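The replacement of IUPAC ambiguity letters with N's mentioned in the last example can be pictured with a small base R sketch. It operates on plain character strings rather than on the DNAStringSet object that fastaTo2bit() works with, and the helper name mask_ambiguity_letters() is hypothetical, so treat it as an illustration of the idea and not as the function's actual implementation.

```r
## Hypothetical helper: replace every letter other than A, C, G, T or N
## (in either case) with "N". This only illustrates the substitution;
## it is not how fastaTo2bit() is implemented.
mask_ambiguity_letters <- function(x) {
    stopifnot(is.character(x))
    gsub("[^ACGTNacgtn]", "N", x)
}

seqs <- c(seq1 = "ACGTRYACGT", seq2 = "NNACGTKM")
masked <- mask_ambiguity_letters(seqs)
masked

## Count how many letters were replaced in each sequence:
changed <- mapply(function(a, b) sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]]),
                  seqs, masked)
changed
```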
The forgeBSgenomeDataPkgFromNCBI function allows the user to create a BSgenome data package from an NCBI assembly.
forgeBSgenomeDataPkgFromNCBI(assembly_accession, pkg_maintainer, pkg_author=NA, pkg_version="1.0.0", pkg_license="Artistic-2.0", organism=NULL, circ_seqs=NULL, destdir=".")
assembly_accession |
A single string containing a GenBank assembly accession (e.g.
|
pkg_maintainer |
A single string containing the name and email address of the package
maintainer (e.g |
pkg_author |
A single string containing the name of the package author. When
unspecified, this takes the value of |
pkg_version |
The version of the package. Set to |
pkg_license |
The license of the package. This must be the name of a software license
used for free and open-source packages. Set to |
organism |
The full name of the organism e.g. |
circ_seqs |
NULL (the default), or a character vector containing the names of the
circular sequences in the assembly. This only needs to be specified if
the assembly is not registered in the GenomeInfoDb package
(if the assembly is registered then its circular sequences are known so
there's no need to specify Notes:
|
destdir |
A single string containing the path to the directory where the
BSgenome data package is to be created. This directory must already
exist. Note that, by default, the package will be created in the
current directory ( |
This function is intended for Bioconductor users who want to forge a BSgenome data package from an NCBI assembly. It typically makes use of the downloadGenomicSequencesFromNCBI utility function to download the compressed FASTA file that contains the genomic sequences of the assembly, and stores it in the working directory. However, if the file already exists in the working directory, then it is used and not downloaded again.
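The "use the existing file if present, otherwise download" behavior described above follows a common caching pattern. A minimal base R sketch of that pattern is shown below; download_if_missing() and fake_downloader() are hypothetical names introduced for this example, and the real function's internal logic may be organized differently.

```r
## Hypothetical helper illustrating the caching pattern. 'downloader'
## defaults to utils::download.file() but can be swapped out, which lets
## the demonstration below run without network access.
download_if_missing <- function(url, destfile,
                                downloader = utils::download.file) {
    if (file.exists(destfile)) {
        message("using existing file: ", destfile)
        return(invisible(destfile))
    }
    downloader(url, destfile)
    invisible(destfile)
}

## Demonstration: the fake downloader counts its calls and writes a
## placeholder file instead of fetching anything.
calls <- 0L
fake_downloader <- function(url, destfile) {
    calls <<- calls + 1L
    writeLines("placeholder", destfile)
}

destfile <- tempfile(fileext = ".fna.gz")
download_if_missing("https://example.org/genome.fna.gz", destfile,
                    fake_downloader)
download_if_missing("https://example.org/genome.fna.gz", destfile,
                    fake_downloader)
calls  # the downloader ran only for the first call
```

A practical consequence of this pattern is that a stale or partially downloaded file left in the working directory would be reused, so deleting it forces a fresh download.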
The path to the created package as an invisible string.
Atuhurira Kirabo Kakopo
The registered_NCBI_assemblies and getChromInfoFromNCBI functions defined in the GenomeInfoDb package.
The downloadGenomicSequencesFromNCBI function that forgeBSgenomeDataPkgFromNCBI uses internally to download the genomic sequences from NCBI.
The fastaTo2bit function that forgeBSgenomeDataPkgFromNCBI uses internally to convert the file downloaded by downloadGenomicSequencesFromNCBI from FASTA to 2bit.
The forgeBSgenomeDataPkgFromUCSC function for creating a BSgenome data package from a UCSC genome.
    ## ---------------------------------------------------------------------
    ## EXAMPLE 1
    ## ---------------------------------------------------------------------

    ## Create a BSgenome data package for NCBI assembly GCA_009729545.1
    ## (organism Acidianus infernus):
    forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCA_009729545.1",
                                 pkg_maintainer="Jane Doe <[email protected]>",
                                 organism="Acidianus infernus",
                                 destdir=tempdir())

    ## ---------------------------------------------------------------------
    ## EXAMPLE 2
    ## ---------------------------------------------------------------------

    ## Create a BSgenome data package for NCBI assembly GCF_000857545.1
    ## (organism Torque teno virus 1):
    forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCF_000857545.1",
                                 pkg_maintainer="Jane Doe <[email protected]>",
                                 organism="Torque teno virus 1",
                                 circ_seqs="NC_002076.2",
                                 destdir=tempdir())
The forgeBSgenomeDataPkgFromTwobitFile function allows the user to create a BSgenome data package from a 2bit file.
forgeBSgenomeDataPkgFromTwobitFile(filepath, organism, provider, genome, pkg_maintainer, pkg_author=NA, pkg_version="1.0.0", pkg_license="Artistic-2.0", seqnames=NULL, circ_seqs=NULL, destdir=".")
filepath |
A single string containing a path to a 2bit file. |
organism |
The full name of the organism e.g. |
provider |
The provider of the sequence data stored in the 2bit file e.g.
|
genome |
A single string specifying the name of the genome assembly
e.g. |
pkg_maintainer |
A single string containing the name and email address of the package
maintainer (e.g |
pkg_author |
A single string containing the name of the package author. When
unspecified, this takes the value of |
pkg_version |
The version of the package. Set to |
pkg_license |
The license of the package. This must be the name of a software license
used for free and open-source packages. Set to |
seqnames |
NULL (the default), or a character vector containing a subset of the sequence names stored in the 2bit file. Use this to select and/or reorder the sequences that will go in the BSgenome data package. By default (i.e. when Use |
circ_seqs |
NULL (the default), or a character vector providing the names of the
sequences stored in the 2bit file that are known to be circular.
Note that if Set to By default (i.e. if |
destdir |
A single string containing the path to the directory where the
BSgenome data package is to be created. This directory must already
exist. Note that, by default, the package will be created in the
current directory ( |
This function is intended for Bioconductor users who want to forge a BSgenome data package from a 2bit file.
The path to the created package as an invisible string.
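Since the directory given as destdir must already exist, it can be convenient to create it before calling the function. The base R sketch below shows one way to do that; the directory name "forged_packages" is an arbitrary choice made for this example.

```r
## Make sure the destination directory exists before forging a package.
## "forged_packages" is an arbitrary, illustrative name.
destdir <- file.path(tempdir(), "forged_packages")
if (!dir.exists(destdir))
    dir.create(destdir, recursive = TRUE)
stopifnot(dir.exists(destdir))

## 'destdir' can now be supplied as the 'destdir' argument of
## forgeBSgenomeDataPkgFromTwobitFile().
destdir
```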
Hervé Pagès
The forgeBSgenomeDataPkgFromNCBI function for creating a BSgenome data package from an NCBI assembly (similar to forgeBSgenomeDataPkgFromTwobitFile but slightly more convenient).
The forgeBSgenomeDataPkgFromUCSC function for creating a BSgenome data package from a UCSC genome (similar to forgeBSgenomeDataPkgFromTwobitFile but slightly more convenient).
The fastaTo2bit function to convert a FASTA file to the 2bit format.
    ## Download the FASTA file containing the genomic sequences
    ## for Escherichia coli assembly ASM1484v1 (see
    ## https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000014845.1/):
    fasta_path <- downloadGenomicSequencesFromNCBI("GCF_000014845.1",
                                                   destdir=tempdir())

    ## Use fastaTo2bit() to convert the file to 2bit:
    twobit_path <- tempfile(fileext=".2bit")
    fastaTo2bit(fasta_path, twobit_path)

    ## All the DNA sequences in Escherichia coli are circular:
    circ_seqs <- seqlevels(rtracklayer::TwoBitFile(twobit_path))

    ## Note that seqlevels(rtracklayer::TwoBitFile(.)) is equivalent to
    ## names(rtracklayer::import.2bit(.)) but a lot more efficient because
    ## it doesn't load the sequences.

    ## Create a BSgenome data package from the 2bit file:
    forgeBSgenomeDataPkgFromTwobitFile(
        filepath=twobit_path,
        organism="Escherichia coli",
        provider="NCBI",
        genome="ASM1484v1",
        pkg_maintainer="Jane Doe <[email protected]>",
        circ_seqs=circ_seqs,
        destdir=tempdir()
    )
The forgeBSgenomeDataPkgFromUCSC function allows the user to create a BSgenome data package from a UCSC genome.
forgeBSgenomeDataPkgFromUCSC(genome, organism, pkg_maintainer, pkg_author=NA, pkg_version="1.0.0", pkg_license="Artistic-2.0", circ_seqs=NULL, goldenPath.url=getOption("UCSC.goldenPath.url"), destdir=".")
genome |
A single string specifying the name of a UCSC genome
(e.g. |
organism |
The full name of the organism e.g. |
pkg_maintainer |
A single string containing the name and email address of the package
maintainer (e.g |
pkg_author |
A single string containing the name of the package author. When
unspecified, this takes the value of |
pkg_version |
The version of the package. Set to |
pkg_license |
The license of the package. This must be the name of a software license
used for free and open-source packages. Set to |
circ_seqs |
NULL (the default), or a character vector containing the names of the
circular sequences in the UCSC genome. This only needs to be specified if
the genome is not registered in the GenomeInfoDb package
(if the genome is registered then its circular sequences are known so
there's no need to specify Notes:
|
goldenPath.url |
A single string specifying the URL to the UCSC goldenPath location where the genomic sequences and chromosome sizes are expected to be found. |
destdir |
A single string containing the path to the directory where the
BSgenome data package is to be created. This directory must already
exist. Note that, by default, the package will be created in the
current directory ( |
This function is intended for Bioconductor users who want to forge a BSgenome data package from a UCSC genome. It typically makes use of the downloadGenomicSequencesFromUCSC utility function to download the 2bit file that contains the genomic sequences of the genome, and stores it in the working directory. However, if the file already exists in the working directory, then it is used and not downloaded again.
The path to the created package as an invisible string.
Hervé Pagès
The registered_UCSC_genomes and getChromInfoFromUCSC functions defined in the GenomeInfoDb package.
The downloadGenomicSequencesFromUCSC function that forgeBSgenomeDataPkgFromUCSC uses internally to download the genomic sequences from UCSC.
The forgeBSgenomeDataPkgFromNCBI function for creating a BSgenome data package from an NCBI assembly.
    ## Create a BSgenome data package for UCSC genome wuhCor1 (SARS-CoV-2
    ## assembly, see https://genome.ucsc.edu/cgi-bin/hgGateway?db=wuhCor1):
    forgeBSgenomeDataPkgFromUCSC(
        genome="wuhCor1",
        organism="Severe acute respiratory syndrome coronavirus 2",
        pkg_maintainer="Jane Doe <[email protected]>",
        destdir=tempdir()
    )