Title: | Transform public data resources into Bioconductor Data Structures |
---|---|
Description: | These recipes convert a wide variety and a growing number of public bioinformatic data sets into easily-used standard Bioconductor data structures. |
Authors: | Martin Morgan [ctb], Marc Carlson [ctb], Dan Tenenbaum [ctb], Sonali Arora [ctb], Paul Shannon [ctb], Lori Shepherd [ctb], Bioconductor Package Maintainer [cre] |
Maintainer: | Bioconductor Package Maintainer <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.37.0 |
Built: | 2024-12-29 03:36:49 UTC |
Source: | https://github.com/bioc/AnnotationHubData |
This package provides a set of methods which convert bioinformatic data resources into standard Bioconductor data types. For example, a UCSC genome browser track, expressed as a BED file, is converted into a GRanges object. Not every valuable data resource can be transformed quite so easily; some require more elaborate transformation, and hence a more specialized recipe. Every effort is made to limit the number of recipes required. One strategy that helps with the principle of "zero curation": unless absolutely required, the "cooked" version of the data resource produced by a recipe is a simple and unembellished reflection of the original data in its downloaded form.
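As a rough illustration of what such a conversion involves, the base R sketch below reads two made-up BED records and shifts them to 1-based coordinates. It is not the package's recipe: the records, column names and the `ranges` data.frame are assumptions for illustration, and in practice a function such as `rtracklayer::import()` returns a GRanges object directly.

```r
# Two invented BED records; columns are chrom, chromStart, chromEnd, name.
bed_text <- c("chr1\t1000\t2000\tfeatureA",
              "chr2\t500\t800\tfeatureB")

# Read the tab-separated records into a data.frame.
bed <- read.delim(text = bed_text, header = FALSE,
                  col.names = c("chrom", "chromStart", "chromEnd", "name"),
                  stringsAsFactors = FALSE)

# BED starts are 0-based and ends are exclusive; 1-based closed
# coordinates shift the start by one and keep the end unchanged.
ranges <- data.frame(seqnames = bed$chrom,
                     start    = bed$chromStart + 1L,
                     end      = bed$chromEnd,
                     name     = bed$name,
                     stringsAsFactors = FALSE)
print(ranges)
```

Apart from the coordinate shift, nothing is added to or removed from the input, which is the spirit of the "zero curation" principle described above.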
Dan Tenenbaum, Paul Shannon
AnnotationHubMetadata-class, makeAnnotationHubMetadata
"AnnotationHubMetadata" and methods
AnnotationHubMetadata is used to represent record(s) in the server data base.
AnnotationHubMetadata(AnnotationHubRoot, SourceUrl, SourceType, SourceVersion,
    SourceLastModifiedDate, SourceMd5 = NA_character_, SourceSize, DataProvider,
    Title, Description, Species, TaxonomyId, Genome, Tags, Recipe, RDataClass,
    RDataDateAdded, RDataPath, Maintainer, ...,
    BiocVersion = BiocManager::version(), Coordinate_1_based = TRUE,
    Notes = NA_character_, DispatchClass,
    Location_Prefix = "https://bioconductorhubs.blob.core.windows.net/annotationhub/")
toJson(x)
constructSeqInfo(species, genome)
metadata(x, ...)
hubError(x)
inputFiles(object, ...)
outputFile(object)
ahmToJson(ahm)
deleteResources(id)
getImportPreparerClasses()
makeAnnotationHubResource(objName, makeAnnotationHubMetadataFunction, ..., where)
AnnotationHubRoot |
|
SourceUrl |
|
SourceType |
|
SourceVersion |
|
SourceLastModifiedDate |
|
SourceMd5 |
|
SourceSize |
|
DataProvider |
|
Title |
|
Description |
|
Species |
|
TaxonomyId |
|
Genome |
|
Tags |
|
Recipe |
|
RDataClass |
|
RDataDateAdded |
|
RDataPath |
|
Maintainer |
|
BiocVersion |
|
Coordinate_1_based |
|
DispatchClass |
A number of dispatch classes are pre-defined in AnnotationHub/R/AnnotationHubResource-class.R with the suffix ‘Resource’. For example, if you have sqlite files, the AnnotationHubResource-class.R defines SQLiteFileResource so the DispatchClass would be SQLiteFile. Contact [email protected] if you are not sure which class to use. The function AnnotationHub::DispatchClassList() will output a matrix of currently implemented DispatchClass and brief description of utility. |
Location_Prefix |
|
Notes |
|
ahm |
An instance of class |
x |
An instance of class |
object |
An |
species |
|
genome |
|
id |
An id whose DB record is to be fully deleted. |
objName |
|
makeAnnotationHubMetadataFunction |
|
where |
Environment where function definition is defined. Default value is sufficient. |
... |
Additional arguments passed to methods. |
AnnotationHubMetadata returns an instance of the class.
jsonPath returns a character(1) representation of the full path to the location of the json file associated with this record.
toJson returns the JSON representation of the record.
fromJson returns an instance of the class, as parsed from the JSON file.
Objects can be created by calls to the constructor, AnnotationHubMetadata().
Dan Tenenbaum and Marc Carlson
getClass("AnnotationHubMetadata")
Write logging message to console and a file.
flog(level, ...)
level |
A |
... |
Further arguments. |
Writes the message to the console and to a file.
None.
Dan Tenenbaum
futile.logger
ImportPreparer and generic newResources
The ImportPreparer and derived classes are used for dispatch during data discovery (see newResources). There is one ImportPreparer class for each data source for AnnotationHubMetadata.
newResources is a generic function; with methods implemented for each ImportPreparer.
Martin Morgan
getImportPreparerClasses()
Make AnnotationHubMetadata objects from .csv files located in the "inst/extdata/" package directory of an AnnotationHub package.
makeAnnotationHubMetadata(pathToPackage, fileName=character())
pathToPackage |
Full path to data package including the package name; no trailing slash |
fileName |
Name of metadata file(s) with csv extension. If none are provided, all files with .csv extension in "inst/extdata" will be processed. |
makeAnnotationHubMetadata: Reads the resource metadata from .csv files into an AnnotationHubMetadata object. The AnnotationHubMetadata is inserted in the AnnotationHub database. Intended for internal use or package authors checking the validity of package metadata.
Formatting metadata files:
makeAnnotationHubMetadata reads .csv files of metadata located in "inst/extdata". Internal functions perform checks for required columns and data types and can be used by package authors to validate their metadata before submitting the package for review.
The rows of the .csv file(s) represent individual Hub resources (i.e., data objects) and the columns are the metadata fields. All fields should be a single character string of length 1.
Required Fields in metadata file:
Title: character(1). Name of the resource. This can be the exact file name (if self-describing) or a more complete description.
Description: character(1). Brief description of the resource, similar to the 'Description' field in a package DESCRIPTION file.
BiocVersion: character(1). The first Bioconductor version the resource was made available for. Unless removed from the hub, the resource will be available for all versions greater than or equal to this field. Generally the current devel version of Bioconductor.
Genome: character(1). Genome. Can be NA.
SourceType: character(1). Format of original data, e.g., FASTA, BAM, BigWig, etc. getValidSourceTypes() lists currently acceptable values. If nothing seems appropriate for your data, reach out to [email protected].
SourceUrl: character(1). Optional location of original data files. Multiple urls should be provided as a comma separated string.
SourceVersion: character(1). Version of original data.
Species: character(1). Species. For help on valid species see getSpeciesList, validSpecies, or suggestSpecies. Can be NA.
TaxonomyId: character(1). Taxonomy ID. There are checks for valid taxonomyId given the Species which produce warnings. See GenomeInfoDb::loadTaxonomyDb() for full validation table. Can be NA.
Coordinate_1_based: logical. TRUE if data are 1-based. Can be NA.
DataProvider: character(1). Name of company or institution that supplied the original (raw) data.
Maintainer: character(1). Maintainer name and email in the following format: Maintainer Name <username@address>.
RDataClass: character(1). R / Bioconductor class the data are stored in, e.g., GRanges, SummarizedExperiment, ExpressionSet etc. This is the class of the object obtained when the file is loaded or read into R.
DispatchClass: character(1). Determines how data are loaded into R. The value for this field should be ‘Rda’ if the data were serialized with save() and ‘Rds’ if serialized with saveRDS. The filename should have the appropriate ‘rda’ or ‘rds’ extension. Other DispatchClass types are available. A number of dispatch classes are pre-defined in AnnotationHub/R/AnnotationHubResource-class.R with the suffix ‘Resource’. For example, if you have sqlite files, the AnnotationHubResource-class.R defines SQLiteFileResource so the DispatchClass would be SQLiteFile. Contact [email protected] if you are not sure which class to use. The function AnnotationHub::DispatchClassList() will output a matrix of currently implemented DispatchClass values with a brief description of their utility. If a predefined class does not seem appropriate, contact [email protected]. An all-purpose DispatchClass is FilePath, which, instead of trying to load the file into R, only returns the path to the locally downloaded file.
Location_Prefix: character(1). Do not include this field if data are stored in the Bioconductor AWS S3; it will be generated automatically. If data will be accessed from a location other than AWS S3 this field should be the base url.
RDataPath: character(). This field should be the remainder of the path to the resource. The Location_Prefix will be prepended to RDataPath for the full path to the resource. If the resource is stored in Bioconductor's AWS S3 buckets, it should start with the name of the package associated with the metadata and should not start with a leading slash. It should include the resource file name. For strongly associated files, like a bam file and its index file, the two files should be separated with a colon (:). This will link a single hub id with the multiple files.
Tags: character() vector. ‘Tags’ are search terms used to define a subset of resources in a Hub object, e.g., in a call to query. ‘Tags’ are automatically generated from the ‘biocViews’ in the DESCRIPTION and applied to all resources of the metadata file. Optionally, maintainers can define a ‘Tags’ column in the metadata to define tags for each resource individually. Multiple ‘Tags’ are specified as a colon separated string, e.g., tags for two resources would look like this:
Tags=c("tag1:tag2:tag3", "tag1:tag3")
NOTE: The metadata file can have additional columns beyond the 'Required Fields' listed above. These values are not added to the Hub database but they can be used in package functions to provide an additional level of metadata on the resources.
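The field rules above lend themselves to a quick pre-check before calling makeAnnotationHubMetadata. The sketch below is a minimal base R illustration, not the package's internal validation: `missing_fields`, `split_tags` and `extension_ok` are hypothetical helpers, and treating RDataPath as part of the required set is an assumption based on the field list above.

```r
# Field names copied from the 'Required Fields' list above
# (Location_Prefix and Tags are left out because they may be omitted).
required_fields <- c("Title", "Description", "BiocVersion", "Genome",
                     "SourceType", "SourceUrl", "SourceVersion", "Species",
                     "TaxonomyId", "Coordinate_1_based", "DataProvider",
                     "Maintainer", "RDataClass", "DispatchClass", "RDataPath")

# Hypothetical helper: names of required columns absent from 'meta'.
missing_fields <- function(meta) setdiff(required_fields, names(meta))

# Hypothetical helper: split colon-separated 'Tags' into one character
# vector per resource.
split_tags <- function(tags) strsplit(tags, ":", fixed = TRUE)

# Hypothetical helper: TRUE where an 'Rda'/'Rds' DispatchClass agrees with
# the file extension of RDataPath; other dispatch classes are not checked.
extension_ok <- function(dispatch_class, rdata_path) {
    ext <- tolower(tools::file_ext(rdata_path))
    check <- dispatch_class %in% c("Rda", "Rds")
    !check | ext == tolower(dispatch_class)
}

# A deliberately incomplete, made-up metadata row.
meta <- data.frame(Title = "counts", DispatchClass = "Rds",
                   RDataPath = "MyAnnotation/v1/counts.rds",
                   Tags = "tag1:tag2:tag3", stringsAsFactors = FALSE)

missing_fields(meta)                              # columns still to be added
split_tags(meta$Tags)                             # one vector of tags per row
extension_ok(meta$DispatchClass, meta$RDataPath)  # extension matches 'Rds'
```

A check like this only catches structural slips; the package's own checks for valid species, taxonomy ids and source types still apply.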
More on Location_Prefix and RDataPath. These two fields make up the complete file path url for downloading the data file. If using the Bioconductor AWS S3 bucket the Location_Prefix should not be included in the metadata file(s) as this field will be populated automatically. The RDataPath will be the directory structure you uploaded to S3. If you uploaded a directory ‘MyAnnotation/’, and that directory had a subdirectory ‘v1/’ that contained two files ‘counts.rds’ and ‘coldata.rds’, your metadata file will contain two rows and the RDataPaths would be ‘MyAnnotation/v1/counts.rds’ and ‘MyAnnotation/v1/coldata.rds’. If you host your data on a publicly accessible site you must include a base url as the Location_Prefix. If your data file was at ‘ftp://myinstiututeserver/biostats/project2/counts.rds’, your metadata file will have one row and the Location_Prefix would be ‘ftp://myinstiututeserver/’ and the RDataPath would be ‘biostats/project2/counts.rds’.
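The way the two fields combine can be sketched in a few lines of base R. `resource_urls` is a hypothetical helper written for this illustration, and the `https://example.org/` prefix and BAM file names are invented; the hub's own download code may differ.

```r
# Hypothetical helper: full download URL(s) for one resource. A
# colon-separated RDataPath (e.g. a BAM file and its index) yields one
# URL per file.
resource_urls <- function(location_prefix, rdata_path) {
    paths <- strsplit(rdata_path, ":", fixed = TRUE)[[1]]
    paste0(location_prefix, paths)
}

# The single-file case from the text above.
resource_urls("ftp://myinstiututeserver/", "biostats/project2/counts.rds")

# Two strongly associated files linked to one hub id (invented names).
resource_urls("https://example.org/",
              "MyAnnotation/v1/reads.bam:MyAnnotation/v1/reads.bam.bai")
```

Because the prefix is pasted on without a separator, it needs its trailing slash and the RDataPath must not begin with one.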
A named list the length of fileName. Each element is a list of AnnotationHubMetadata objects created from the .csv file.
## Each row of the metadata file represents a resource added to one of
## the 'Hubs'. This example creates a metadata.csv file for a single resource.
## In the case of multiple resources, the arguments below would be character
## vectors that produced multiple rows in the data.frame.
meta <- data.frame(
    Title = "RNA-Sequencing dataset from study XYZ",
    Description = paste0("RNA-seq data from study XYZ containing 10 normal ",
                         "and 10 tumor samples represented as a ",
                         "SummarizedExperiment"),
    BiocVersion = "3.4",
    Genome = "GRCh38",
    SourceType = "BAM",
    SourceUrl = "http://www.path/to/original/data/file",
    SourceVersion = "Jan 01 2016",
    Species = "Homo sapiens",
    TaxonomyId = 9606,
    Coordinate_1_based = TRUE,
    DataProvider = "GEO",
    Maintainer = "Your Name <[email protected]>",
    RDataClass = "SummarizedExperiment",
    DispatchClass = "Rda",
    ResourceName = "FileName.rda"
)

## Not run:
## Write the data out and put in the inst/extdata directory.
write.csv(meta, file="metadata.csv", row.names=FALSE)

## Test the validity of metadata.csv
makeAnnotationHubMetadata("path/to/mypackage")
## End(Not run)
Transform an Ensembl FASTA file to a Bioconductor FaFile or TwoBitFile.
makeEnsemblFastaToAHM(currentMetadata, baseUrl = "ftp://ftp.ensembl.org/pub/",
    baseDir = "fasta/", release, justRunUnitTest = FALSE,
    BiocVersion = BiocManager::version())
makeEnsemblTwoBitToAHM(currentMetadata, baseUrl = "ftp://ftp.ensembl.org/pub/",
    baseDir = "fasta/", release, justRunUnitTest = FALSE,
    BiocVersion = BiocManager::version())
ensemblFastaToFaFile(ahm)
ensemblFastaToTwoBitFile(ahm)
currentMetadata |
Currently not used. Intended to be a list of metadata to filter, i.e., records that do not need to be processed again. Need to remove or fix. |
baseUrl |
ftp file location. |
baseDir |
ftp file directory. |
release |
Integer version number, e.g., "84". |
justRunUnitTest |
A |
BiocVersion |
A |
ahm |
List of |
makeEnsemblFastaToAHM and makeEnsemblTwoBitToAHM process metadata into a list of AnnotationHubMetadata objects.
ensemblFastaToFaFile unzips a .gz file, creates an index and writes out .rz and .rz.fai files to disk.
ensemblFastaToTwoBitFile converts a fasta file to twobit format and writes the .2bit file out to disk.
makeEnsemblFastaToAHM and makeEnsemblTwoBitToAHM return a list of AnnotationHubMetadata objects.
ensemblFastaToFaFile writes out .rz and .rz.fai files to disk.
ensemblFastaToTwoBitFile writes out a .2bit file to disk.
Bioconductor Core Team
## updateResources() generates metadata, processes records and
## pushes files to AWS S3 buckets. See ?updateResources for details.
## 'release' is passed to makeEnsemblFastaToAHM.
## Not run:
meta <- updateResources("/local/path",
                        BiocVersion = c("3.2", "3.3"),
                        preparerClasses = "EnsemblFastaImportPreparer",
                        metadataOnly = TRUE, insert = FALSE,
                        justRunUnitTest = FALSE, release = "83")
## End(Not run)
Create metadata and process raw Gencode FASTA files for inclusion in AnnotationHub
makeGencodeFastaToAHM(currentMetadata,
    baseUrl="ftp://ftp.ebi.ac.uk/pub/databases/gencode/",
    species=c("Human", "Mouse"), release, justRunUnitTest=FALSE,
    BiocVersion=BiocManager::version())
gencodeFastaToFaFile(ahm)
currentMetadata |
Currently not used. Intended to be a list of metadata to filter, i.e., records that do not need to be processed again. Need to remove or fix. |
baseUrl |
ftp file location. |
species |
A |
release |
A |
justRunUnitTest |
A |
BiocVersion |
A |
ahm |
List of |
http://www.gencodegenes.org/releases/
ftp://ftp.ebi.ac.uk/pub/databases/gencode/. Gencode_human and Gencode_mouse are used.
Code is currently specific for human and mouse. Files chosen for download are described in AnnotationHubData:::.gencodeDescription().
makeGencodeFastaToAHM returns a list of AnnotationHubMetadata instances. gencodeFastaToFaFile returns nothing.
Bioconductor Core Team.
## updateResources() generates metadata, processes records and
## pushes files to AWS S3 buckets.
## To run the GencodeFasta recipe specify
## 'preparerClasses = GencodeFastaImportPreparer'. The 'species' and 'release'
## arguments are passed to makeGencodeFastaToAHM().
## Not run:
meta <- updateResources("/local/path",
                        BiocVersion = c("3.2", "3.3"),
                        preparerClasses = "GencodeFastaImportPreparer",
                        metadataOnly = TRUE, insert = FALSE,
                        justRunUnitTest = FALSE)
## End(Not run)
Add OrgDb and TxDb sqlite files to AnnotationHub
makeStandardOrgDbsToAHM(currentMetadata, justRunUnitTest = FALSE,
    BiocVersion = BiocManager::version(), downloadOrgDbs = TRUE)
makeStandardTxDbsToAHM(currentMetadata, justRunUnitTest = FALSE,
    BiocVersion = BiocManager::version(), TxDbs)
makeNCBIToOrgDbsToAHM(currentMetadata, justRunUnitTest = FALSE,
    BiocVersion = BiocManager::version(),
    baseUrl = "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/")
currentMetadata |
Historically was intended to be a list of metadata to filter, i.e., records that do not need to be processed again. In some recipes this is used as a way to pass additional arguments. Need to remove or make consistent. |
baseUrl |
A |
justRunUnitTest |
A |
BiocVersion |
A |
TxDbs |
Character vector of the |
downloadOrgDbs |
A |
makeStandardOrgDbsToAHM and makeStandardTxDbsToAHM extract the sqlite files from the existing OrgDb and TxDb packages in the Bioconductor repositories and generate associated metadata.
makeNCBIToOrgDbsToAHM creates sqlite files and metadata for 1000 organisms with the makeOrgPackageFromNCBI function. These organisms are less 'mainstream' than those hosted in the Bioconductor repository (makeStandardOrgDbsToAHM) and the databases are less comprehensive because data only come from one source, NCBI.
List of AnnotationHubMetadata objects.
Bioconductor Core Team
## Not run:
## In Bioconductor 3.5, one new TxDb was added and 4 active
## tracks were updated. This piece of code shows how to add these 5
## packages to AnnotationHub.

## Step I: generate metadata
##
## Generate the metadata with the low-level helper for inspection.
TxDbs <- c("TxDb.Ggallus.UCSC.galGal5.refGene",
           "TxDb.Celegans.UCSC.ce11.refGene",
           "TxDb.Rnorvegicus.UCSC.rn5.refGene",
           "TxDb.Dmelanogaster.UCSC.dm6.ensGene",
           "TxDb.Rnorvegicus.UCSC.rn6.refGene")
meta <- makeStandardTxDbsToAHM(currentMetadata=list(AnnotationHubRoot="TxDbs"),
                               justRunUnitTest=FALSE,
                               TxDbs = TxDbs)

## Once the low-level helper runs with no errors, try generating the
## metadata with the high-level wrapper updateResources(). Setting
## metadataOnly=TRUE will generate metadata only and not push resources
## to data bucket. insert=FALSE prevents the metadata from being inserted in the
## database.
##
## The metadata generated by updateResources() will be the same as that
## generated by makeStandardTxDbsToAHM(). Both should be a list the same
## length as the number of TxDbs specified.
meta <- updateResources("TxDbs",
                        preparerClasses="TxDbFromPkgsImportPreparer",
                        metadataOnly=TRUE, insert = FALSE,
                        justRunUnitTest=FALSE, TxDbs = TxDbs)

INFO [2017-04-11 09:12:09] Preparer Class: TxDbFromPkgsImportPreparer
complete!
> length(meta)
[1] 5

## Step II: push resources to Azure
##
## If the metadata looks correct we are ready to push resources to Azure.
## Set metadataOnly=FALSE but keep insert=FALSE.
## export an environment variable with a core generated SAS URL for
## upload example:
## export AZURE_SAS_URL='https://bioconductorhubs.blob.core.windows.net/staginghub?sp=racwl&st=2022-02-08T15:57:00Z&se=2022-02-22T23:57:00Z&spr=https&sv=2020-08-04&sr=c&sig=fBtPzgrw1Akzlz
meta <- updateResources("TxDbs",
                        BiocVersion="3.5",
                        preparerClasses="TxDbFromPkgsImportPreparer",
                        metadataOnly=FALSE, insert = FALSE,
                        justRunUnitTest=FALSE, TxDbs = TxDbs)

## Step III: insert metadata in AnnotationHub production database
##
## Inserting the metadata in the database is usually done as a separate step
## and with the help of the AnnotationHub docker.
## Set metadataOnly=TRUE and insert=TRUE.
meta <- updateResources("TxDbs",
                        BiocVersion="3.5",
                        preparerClasses="TxDbFromPkgsImportPreparer",
                        metadataOnly=TRUE, insert = TRUE,
                        justRunUnitTest=FALSE, TxDbs = TxDbs)
## End(Not run)
Add new resources to AnnotationHub
updateResources(AnnotationHubRoot, BiocVersion = BiocManager::version(),
    preparerClasses = getImportPreparerClasses(), metadataOnly = TRUE,
    insert = FALSE, justRunUnitTest = FALSE, ...)
pushResources(allAhms, uploadToRemote = TRUE, download = TRUE)
pushMetadata(allAhms, url)
AnnotationHubRoot |
Local path where files will be downloaded. |
BiocVersion |
A |
preparerClasses |
One of the |
metadataOnly |
A When FALSE, metadata are generated and data files are downloaded,
processed and pushed to their final location in S3 buckets.
|
insert |
NOTE: This option is for inserting metadata records in the production data base (done by Bioconductor core team member) and is for internal use only. A When |
justRunUnitTest |
A |
allAhms |
List of |
url |
URL of AnnotationHub database where metadata will be inserted. |
uploadToRemote |
A |
download |
A |
... |
Arguments passed to other methods such as |
updateResources:
updateResources is responsible for creating metadata records and downloading, processing and pushing data files to their final resting place. The preparerClasses argument is used in method dispatch to determine which recipe is used.
By manipulating the metadataOnly, insert and justRunUnitTest arguments one can flexibly test the metadata for a small number of records with or without downloading and processing the data files.
global options:
When insert = TRUE the "AH_SERVER_POST_URL" option must be set to the https location of the AnnotationHub db.
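To make the interplay of the three flags and the global option easier to see, here is a small base R sketch. `describe_run` is a hypothetical helper written for this illustration and is not part of AnnotationHubData; it only restates the argument descriptions in this section and never contacts a server.

```r
# Hypothetical helper: summarise what a combination of flags asks
# updateResources() to do, following the descriptions in this section.
describe_run <- function(metadataOnly = TRUE, insert = FALSE,
                         justRunUnitTest = FALSE) {
    # insert = TRUE needs the server URL to be available as a global option.
    if (insert && !nzchar(getOption("AH_SERVER_POST_URL", "")))
        stop("insert = TRUE requires the 'AH_SERVER_POST_URL' option")
    c(records = if (justRunUnitTest) "small test subset" else "all records",
      data    = if (metadataOnly) "metadata only, no files pushed"
                else "files downloaded, processed and pushed",
      db      = if (insert) "metadata inserted in the database"
                else "database untouched")
}

# The cautious combination for a first look at a new recipe.
describe_run(metadataOnly = TRUE, insert = FALSE, justRunUnitTest = TRUE)
```

Starting from the cautious combination and relaxing one flag at a time keeps a faulty recipe from touching the storage bucket or the database.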
A list of AnnotationHubMetadata objects.
Martin Morgan, Marc Carlson
## Not run:

## -----------------------------------------------------------------------
## Inspect metadata:
## -----------------------------------------------------------------------
## A useful first step in testing a new recipe is to generate and
## inspect a small number of metadata records. The combination of
## 'metadataOnly=TRUE', 'insert=FALSE' and 'justRunUnitTest=TRUE'
## generates metadata for the first 5 records and does not download or
## process any data.
meta <- updateResources("/local/path",
                        BiocVersion = "3.3",
                        preparerClasses = "EnsemblFastaImportPreparer",
                        metadataOnly = TRUE, insert = FALSE,
                        justRunUnitTest = TRUE,
                        release = "84")

INFO [2015-11-12 07:58:05] Preparer Class: EnsemblFastaImportPreparer
Ailuropoda_melanoleuca.ailMel1.cdna.all.fa.gz
Ailuropoda_melanoleuca.ailMel1.dna_rm.toplevel.fa.gz
Ailuropoda_melanoleuca.ailMel1.dna_sm.toplevel.fa.gz
Ailuropoda_melanoleuca.ailMel1.dna.toplevel.fa.gz
Ailuropoda_melanoleuca.ailMel1.ncrna.fa.gz

## The return value is a list of metadata for the first 5 records:
> names(meta)
[1] "FASTA cDNA sequence for Ailuropoda melanoleuca"
[2] "FASTA DNA sequence for Ailuropoda melanoleuca"
[3] "FASTA DNA sequence for Ailuropoda melanoleuca"
[4] "FASTA DNA sequence for Ailuropoda melanoleuca"
[5] "FASTA ncRNA sequence for Ailuropoda melanoleuca"

## Each record is of class AnnotationHubMetadata:
> class(meta[[1]])
[1] "AnnotationHubMetadata"
attr(,"package")
[1] "AnnotationHubData"

## -----------------------------------------------------------------------
## Insert metadata in the db and process/push data files:
## -----------------------------------------------------------------------
## This next code chunk creates the metadata and downloads and processes
## the data (metadataOnly=FALSE). If all files are successfully pushed
## to their final resting place, metadata records are inserted in the
## AnnotationHub db (insert=TRUE). Metadata insertion is done by a
## Bioconductor team member; contact [email protected] for help.
meta <- updateResources("local/path",
                        BiocVersion = "3.5",
                        preparerClasses = "EnsemblFastaImportPreparer",
                        metadataOnly = FALSE, insert = TRUE,
                        justRunUnitTest = FALSE,
                        regex = ".*release-81")

## -----------------------------------------------------------------------
## Recovery helpers:
## -----------------------------------------------------------------------
## pushResources() and pushMetadata() are both called from updateResources()
## but can be used solo for testing or completing a run that
## terminated unexpectedly.

## Download, process and push to azure the last 2 files in 'meta':
sub <- meta[(length(meta) - 1):length(meta)]
pushResources(sub)

## Insert metadata in the AnnotationHub db for the last 2 files in 'meta':
pushMetadata(sub, url = getOption("AH_SERVER_POST_URL"))
## End(Not run)
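Selecting the trailing elements of a list such as 'meta' is easy to get wrong because of operator precedence in R. The base R sketch below uses a stand-in list (an assumption; real records would be AnnotationHubMetadata objects) to show why the lower bound of the index range needs parentheses.

```r
# Stand-in for a list of five metadata records.
meta <- list(a = 1, b = 2, c = 3, d = 4, e = 5)
n <- length(meta)

# ':' binds tighter than '-', so this is n - (1:n), i.e. 4 3 2 1 0,
# which selects the first four elements in reverse order.
n - 1:n

# Parenthesising the lower bound gives the intended indices 4 5.
(n - 1):n

last_two <- meta[(n - 1):n]
names(last_two)

# tail() expresses the same selection without index arithmetic.
identical(last_two, tail(meta, 2))
```

Using `tail(meta, 2)` avoids the index arithmetic altogether, which may be the more readable choice when resuming an interrupted run.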
This function is for uploading a file resource to the Microsoft Azure Data Lake.
upload_to_azure(file, sas)
file | The file or directory to upload. |
sas | A SAS URL for the designated destination on Microsoft Azure Data Lake. |
Uses the azcopy Command Line Interface to copy a file to Microsoft Azure Data Lake. Assumes azcopy is properly installed and that the azcopy program is in your PATH. The function performs a recursive copy automatically, so it accepts either a file or a directory for upload. The SAS URL is generated on Azure by someone who has permission to access the desired destination. Be sure to use the SAS URL and not the SAS token. The SAS URL can be provided as an argument; if the argument is not provided, the function looks for the system environment variable 'AZURE_SAS_URL'.
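The lookup order described above (argument first, then the 'AZURE_SAS_URL' environment variable) can be sketched in base R. The helper names `resolve_sas` and `azcopy_available` are hypothetical and are not part of the package; the check that a SAS URL begins with `https://` is an assumed heuristic for telling a URL from a bare token.

```r
# Hypothetical helper (not the package's code): resolve the SAS URL from the
# argument, falling back to the AZURE_SAS_URL environment variable.
resolve_sas <- function(sas = NULL) {
    if (is.null(sas) || !nzchar(sas)) {
        sas <- Sys.getenv("AZURE_SAS_URL", unset = "")
    }
    if (!nzchar(sas)) {
        stop("no SAS URL supplied and AZURE_SAS_URL is not set")
    }
    # Assumed heuristic: a SAS URL carries a scheme and host, a bare token does not.
    if (!grepl("^https://", sas)) {
        stop("expected a SAS URL, not a SAS token")
    }
    sas
}

# Hypothetical pre-flight check that the azcopy program is on the PATH.
azcopy_available <- function() {
    unname(nzchar(Sys.which("azcopy")))
}
```

Running checks like these before calling upload_to_azure() turns a missing URL or a missing azcopy installation into an early, readable error.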
TRUE
on success. If the command fails, the function
will exit with an error.
Lori Shepherd
    ## Not run:
    upload_to_azure("myfile.txt", "https://sasurl")
    ## End(Not run)
This function is for uploading a file resource to the S3 cloud.
upload_to_S3(file, remotename, bucket, profile, acl="public-read")
file | The file to upload. |
remotename | The name this file should have in S3, including any "keys" that are part of the name. This should not start with a slash (if it does, the leading slash will be removed), but can contain forward slashes. |
bucket | Name of the S3 bucket to copy to. |
profile | Corresponds to a profile set in the config file for the AWS CLI (see the documentation). If this argument is omitted, the default profile is used. |
acl | Should be one of |
Uses the AWS Command Line Interface to copy a file to Amazon S3. Assumes the CLI is properly configured and that the aws program is in your PATH. The CLI should be configured with the credentials of a user who has permission to upload to the appropriate bucket. It's recommended to use IAM to set up users with limited permissions.

There is an RAmazonS3 package but it seems to have issues uploading files to S3.
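To make the argument handling concrete, here is a rough sketch of how the arguments could map onto an `aws s3 cp` call. The helper `build_s3_args` is hypothetical and is not the package's implementation; the bucket name `annotationhub` is taken from the example URL in this help page.

```r
# Hypothetical helper (not the package's code): assemble arguments for 'aws s3 cp'.
build_s3_args <- function(file, remotename, bucket,
                          profile = NULL, acl = "public-read") {
    # The remote name should not start with a slash; drop any leading slash(es).
    remotename <- sub("^/+", "", remotename)
    dest <- sprintf("s3://%s/%s", bucket, remotename)
    args <- c("s3", "cp", file, dest, "--acl", acl)
    # Without a profile, the AWS CLI falls back to its default profile.
    if (!is.null(profile)) {
        args <- c(args, "--profile", profile)
    }
    args
}

args <- build_s3_args("myfile.txt", "/foo/bar/baz/yourfile.txt", "annotationhub")
# The vector could then be passed to system2("aws", args); paths containing
# spaces would additionally need shQuote().
```

Building the argument vector separately from the call makes it easy to inspect the destination before anything is uploaded.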
TRUE
on success. If the command fails, the function
will exit with an error.
Dan Tenenbaum
    ## Not run:
    upload_to_S3("myfile.txt", "foo/bar/baz/yourfile.txt")
    # If this is successful, the file should be accessible at
    # http://s3.amazonaws.com/annotationhub/foo/bar/baz/yourfile.txt
    ## End(Not run)
Functions to assist in validating the metadata.csv file created for Hub Resources.
    getSpeciesList(verbose=FALSE)
    validSpecies(species, verbose=TRUE)
    suggestSpecies(query, verbose=FALSE, op=c("|", "&"))
    getValidSourceTypes()
    checkSpeciesTaxId(txid, species, verbose=TRUE)
    validDispatchClass(dc, verbose=TRUE)
species | Species to validate (may be a single value or a list). |
query | Terms to query. Whether terms are combined with AND or OR is determined by the argument op. |
verbose | Should additional information and useful tips be displayed? |
op | Should searching of multiple terms be a conditional OR ("|") or AND ("&")? |
txid | Taxonomy id (single value or list). |
dc | Dispatch class to validate (may be a single value or a list). |
getSpeciesList: Provides a list of valid species as determined by the GenomeInfoDbData package specData.rda file.
validSpecies: TRUE/FALSE indicating whether the argument is considered a valid species based on the list generated by getSpeciesList. A species may be deemed invalid if its capitalization or punctuation does not match. Use suggestSpecies to find similar terms.
suggestSpecies: Based on a term or multiple terms, suggests possible valid species.
getValidSourceTypes: Returns a list of acceptable values for SourceType in metadata.csv. If you think a valid source type should be added to the list, please reach out to [email protected].
checkSpeciesTaxId: Cross-validates a list of species and taxonomy ids against the expected values based on GenomeInfoDb::loadTaxonomyDb(). Warns when there is a mismatch.
validDispatchClass: TRUE/FALSE indicating whether the argument is considered a valid DispatchClass based on the currently available methods in AnnotationHub. Use AnnotationHub::DispatchClassList() to see the table of currently available methods. If a currently available method is not appropriate for your resource, please reach out to Lori Shepherd [email protected] to request that a new method be added.
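The difference between the two `op` settings, and the case-sensitive nature of species validation, can be illustrated with a small base-R sketch over a toy species vector. The helpers `match_terms` and `is_valid` are hypothetical and only mimic the described behaviour; the real functions draw on the GenomeInfoDbData species list and may match differently (for example, case-insensitively for suggestions).

```r
# Toy subset of species names; the real list comes from getSpeciesList().
toy_species <- c("Homo sapiens", "Canis lupus", "Canis lupus familiaris",
                 "Canis latrans", "Mus musculus")

# Hypothetical stand-in for the OR / AND term matching selected by 'op'.
match_terms <- function(query, choices, op = c("|", "&")) {
    op <- match.arg(op)
    # One column of hits per query term, one row per candidate species.
    hits <- vapply(query, function(term) grepl(term, choices, fixed = TRUE),
                   logical(length(choices)))
    hits <- matrix(hits, nrow = length(choices))
    keep <- if (op == "|") rowSums(hits) > 0 else rowSums(hits) == length(query)
    choices[keep]
}

# Hypothetical stand-in for exact, case-sensitive validation: all must match.
is_valid <- function(species, choices) all(species %in% choices)

any_term <- match_terms(c("Canis", "lupus"), toy_species, op = "|")
all_terms <- match_terms(c("Canis", "lupus"), toy_species, op = "&")
```

With OR, any species containing either term is kept; with AND, only species containing every term remain, which is useful for narrowing a long suggestion list.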
For getSpeciesList: character vector of valid species.
For validSpecies: TRUE/FALSE indicating whether all species given as argument are valid.
For suggestSpecies: data.frame of taxonomy id and species name of possible valid species based on the given query key words.
For getValidSourceTypes: character vector of valid source types.
For checkSpeciesTaxId: NULL if the check is verified. If verbose is TRUE, a table of suggested values is displayed along with the warning.
For validDispatchClass: TRUE/FALSE indicating whether all dispatch classes given as argument are valid.
Lori Shepherd
    species = getSpeciesList()

    # following is TRUE
    validSpecies("Homo sapiens")

    # following is FALSE because of the lowercase starting "h"
    validSpecies("homo sapiens")

    # can provide multiple; if any are not valid, the result is FALSE
    # TRUE
    validSpecies(c("Homo sapiens", "Canis domesticus"))

    suggestSpecies("Canis")

    getValidSourceTypes()

    checkSpeciesTaxId(1003232, "Edhazardia aedis")
    checkSpeciesTaxId(9606, "Homo sapiens")

    validDispatchClass("GRanges")