First, one must decide if an ExperimentHub or AnnotationHub package is appropriate.
The AnnotationHubData package provides tools to acquire,
annotate, convert and store data for use in Bioconductor’s AnnotationHub.
BED files from the ENCODE project, GTF files
from Ensembl, or annotation tracks from UCSC, are examples of data that
can be downloaded, described with metadata, transformed to standard
Bioconductor data types, and stored so that they may be
conveniently served up on demand to users via the AnnotationHub client.
While data are often manipulated into a more R-friendly form, the data
themselves retain their raw content and are not normally filtered or
curated like those in ExperimentHub.
Each resource has associated metadata that can be searched through the
AnnotationHub client interface.
ExperimentHubData provides tools to add or modify
resources in Bioconductor’s ExperimentHub. This ‘hub’
houses curated data from courses, publications, or experiments. It is
often convenient to store data to be used in package examples, tests,
or vignettes in the ExperimentHub. The resources can be files of raw
data or, more often, R / Bioconductor
objects such as GRanges, SummarizedExperiment, data.frame etc. Each
resource has associated metadata that can be searched through the
ExperimentHub client interface.
It is advisable to create a separate package for annotations or experiment data rather than an all-encompassing package of data and code. However, it is sometimes reasonable to have a Software package that also serves as the package front end for the hubs. This is generally not recommended; if you think you have a use case, please reach out to [email protected] to confirm before proceeding with a single package rather than the accompanying package approach.
Related resources are added to AnnotationHub or ExperimentHub by creating a package. The package should
minimally contain the resource metadata, man pages describing the
resources, and a vignette. It may also contain supporting R
functions the author wants to provide. This is a similar design to the
existing Bioconductor experimental data packages or
annotation packages except the data is stored in Microsoft Azure Genomic
Data Lake or other publicly accessible sites (like Amazon S3 buckets or
institutional servers) instead of the data/ or inst/extdata/
directory of the package. This keeps the
package lightweight and allows users to download only necessary data
files.
Below are the steps required for creating the package and adding new resources:
Bioconductor team member
The man page and vignette examples in the package will not work until
the data are available in AnnotationHub or ExperimentHub. If you are not hosting the data on a stable
web server (GitHub does not suffice), you may use the Bioconductor
Microsoft Azure Genomic Data Lake. Adding the data to the Data Lake and
the metadata to the production database involves assistance from a
Bioconductor
team member. The metadata.csv file will have
to be created before the data can officially be added to the hub (See
inst/extdata section below). Please read the section on “Storage of Data
Files”.
When a resource is downloaded from one of the hubs, the associated package is loaded in the workspace, making the man pages and vignettes readily available. Because documentation plays an important role in understanding these resources, please take the time to develop clear man pages and a detailed vignette. These documents provide essential background to the user and guide appropriate use of the resources.
Below is an outline of package organization. The files listed are required unless otherwise stated.
inst/extdata/
metadata.csv: This file contains the metadata in the
format of one row per resource to be added to the Hub database (each row
corresponds to one data file uploaded to a publicly hosted data server).
The file should be generated from the code in
inst/scripts/make-metadata.R where the final data are written out with
write.csv(..., row.names=FALSE). The required column names
and data types are specified in
ExperimentHubData::makeExperimentHubMetadata or
AnnotationHubData::makeAnnotationHubMetadata. See
?ExperimentHubData::makeExperimentHubMetadata or
?AnnotationHubData::makeAnnotationHubMetadata for details.
Ensuring that the above function runs without ERROR is also a validation
step for the metadata file.
An example data experiment package metadata.csv file can be found here
If necessary, metadata can be broken up into multiple csv files instead of having all records in a single “metadata.csv”. Each file must contain the required columns and be in csv format.
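As a rough pre-check before running the hub validators, the sketch below reads every csv file in an inst/extdata directory and reports which expected columns are absent from each. The helper name check_metadata_columns() is hypothetical, and the column list is an assumption copied from the example tables in this document; makeExperimentHubMetadata() or makeAnnotationHubMetadata() remains the authoritative check.

```r
# Column names taken from the example metadata tables in this document;
# treat the list as an assumption, not the definitive specification.
expected_cols <- c("Title", "Description", "BiocVersion", "Genome",
                   "SourceType", "SourceUrl", "SourceVersion", "Species",
                   "TaxonomyId", "Coordinate_1_based", "DataProvider",
                   "Maintainer", "RDataClass", "DispatchClass", "RDataPath")

# Hypothetical helper: for each csv file in `extdata_dir`, return the
# expected columns that are not present in that file.
check_metadata_columns <- function(extdata_dir, expected = expected_cols) {
    files <- list.files(extdata_dir, pattern = "\\.csv$", full.names = TRUE)
    absent <- lapply(files, function(fl) {
        setdiff(expected, colnames(read.csv(fl, stringsAsFactors = FALSE)))
    })
    names(absent) <- basename(files)
    absent
}

# Demonstration on a throwaway directory holding two metadata files,
# the second of which deliberately lacks the RDataPath column.
extdata_dir <- file.path(tempdir(), "columns-demo", "inst", "extdata")
dir.create(extdata_dir, recursive = TRUE, showWarnings = FALSE)
complete <- as.data.frame(
    matrix("x", nrow = 1, ncol = length(expected_cols),
           dimnames = list(NULL, expected_cols)),
    stringsAsFactors = FALSE
)
write.csv(complete, file.path(extdata_dir, "metadata_v1.csv"),
          row.names = FALSE)
write.csv(complete[, setdiff(expected_cols, "RDataPath")],
          file.path(extdata_dir, "metadata_v2.csv"), row.names = FALSE)

result <- check_metadata_columns(extdata_dir)
print(result)
```

A check like this only looks at column names; it says nothing about data types or allowed values, which the hub validators do inspect.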
inst/scripts/
make-data.R: A script describing the steps involved
in making the data object(s). It can be code, pseudo-code, or text but
should include where the original data were downloaded from,
pre-processing, and how the final R object was made. Include a
description of any steps performed outside of R with third
party software. Output of the script should be files on disk ready to be
pushed to the data server. If data are to be hosted on a personal web site
instead of Microsoft Azure Genomic Data Lake, this file should explain
any manipulation of the data prior to hosting on the web site. For data
hosted on a public web site with no prior manipulation this file is not
needed. For experimental data objects, it is encouraged, but not strictly
necessary, to serialize data objects with save() using the .rda extension on the
filename. If the data is provided in another
format an appropriate loading method may need to be implemented. Please
advise when reaching out for “Uploading Data to Microsoft Azure Genomic
Data Lake”.
make-metadata.R: A script to make the metadata.csv
file located in inst/extdata of the package. See
?ExperimentHubData::makeExperimentHubMetadata or
?AnnotationHubData::makeAnnotationHubMetadata for a
description of expected fields and data types. The
ExperimentHubData::makeExperimentHubMetadata() or
AnnotationHubData::makeAnnotationHubMetadata() function can be used
to validate the metadata.csv file before submitting the
package.
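A make-metadata.R script along the lines described above might look like the following sketch. The column names and most values are taken from the ExperimentHub example table in this document; the maintainer address is a placeholder, and the output goes to a temporary file here (an assumption made so the sketch runs anywhere) rather than to inst/extdata/metadata.csv.

```r
# Sketch of inst/scripts/make-metadata.R. Values are placeholders
# modelled on the ExperimentHub example table in this document.
meta <- data.frame(
    Title = c("Simulated Expression Data", "Simulated Database"),
    Description = c("Simulated expression values for 12 samples",
                    "Simulated database containing gene mappings"),
    BiocVersion = c("3.9", "3.9"),
    Genome = c(NA, "hg19"),
    SourceType = c("Simulated", "Simulated"),
    SourceUrl = c("http://mylabshomepage",
                  "http://bioconductor.org/packages/myExperimentPackage"),
    SourceVersion = c("v1", "v2"),
    Species = c(NA, "Homo sapiens"),
    TaxonomyId = c(NA, 9606),
    Coordinate_1_based = c(NA, NA),
    DataProvider = "http://bioconductor.org/packages/myExperimentPackage",
    # placeholder address, not a real maintainer
    Maintainer = "Bioconductor Maintainer <maintainer@example.org>",
    RDataClass = c("SummarizedExperiment", "SQLiteConnection"),
    DispatchClass = c("Rda", "SQLiteFile"),
    RDataPath = c("myExperimentPackage/SEobject.rda",
                  "myExperimentPackage/mydatabase.sqlite"),
    stringsAsFactors = FALSE
)

# In a real package the target would be inst/extdata/metadata.csv.
out <- file.path(tempdir(), "metadata.csv")
write.csv(meta, file = out, row.names = FALSE)
```

Writing with row.names=FALSE matters because otherwise write.csv() adds an unnamed first column of row numbers, which is not one of the expected metadata fields.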
vignettes/
R/
R/*.R: Optional. Functions to enhance data
exploration.
For ExperimentHub resources only:
- zzz.R: Optional. You can include a .onLoad()
function in a zzz.R file that exports each resource name (i.e., the
metadata.csv field Title) into a function. This allows the
data to be loaded by name, e.g., resource123().
``` r
.onLoad <- function(libname, pkgname) {
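    ## read the resource titles from the package's metadata.csv
    fl <- system.file("extdata", "metadata.csv", package=pkgname)
    titles <- read.csv(fl, stringsAsFactors=FALSE)$Title
    ## create one accessor function per resource title
    createHubAccessors(pkgname, titles)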
}
```
`ExperimentHub::createHubAccessors()` and
`ExperimentHub:::.hubAccessorFactory()` provide internal
detail. The resource-named function has a single 'metadata'
argument. When metadata=TRUE, the metadata are loaded (equivalent
to single-bracket method on an ExperimentHub object) and when
FALSE the full resource is loaded (equivalent to double-bracket
method).
man/
package man page: The package man page serves as a landing point and should briefly describe all resources associated with the package. There should be an entry for each resource title either on the package man page or individual man pages. While this is optional, it is strongly recommended.
resource man pages: Resources can be documented on the same page, grouped by common type or have their own dedicated man pages. Man page(s) should describe the resource (raw data source, processing, QC steps) and demonstrate how the data can be loaded through the standard hub interface.
Data can be accessed via the standard ExperimentHub or AnnotationHub interface with single and double-bracket methods. Queries are often useful for finding resources. For example, replace "PACKAGENAME" below with the name of the package being developed:
library(ExperimentHub)
eh <- ExperimentHub()
myfiles <- query(eh, "PACKAGENAME")
myfiles[[1]] ## load the first resource in the list
myfiles[["EH123"]] ## load by EH id
NOTE: As a developer, you should access resources within your package using the Hub id, e.g., `myfiles[["EH123"]]`.
You can use multiple search queries to further filter resources. For example, replace “SEARCHTERM*” below with one or more search terms that uniquely identify resources in your package.
library(AnnotationHub)
hub <- AnnotationHub()
myfiles <- query(hub, "SEARCHTERM1", "SEARCHTERM2")
myfiles[[1]] ## load the first resource in the list
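ExperimentHub packages only: If a .onLoad() function is used to export each resource as a
function, also document that method of loading, e.g., resource123(metadata=FALSE) to load the data and resource123(metadata=TRUE) to display its metadata.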
Package authors are encouraged to use the
ExperimentHub::listResources() and
ExperimentHub::loadResource() functions in their man pages
and vignette. These helpers are designed to facilitate data discovery
within a specific package vs within all of ExperimentHub.
DESCRIPTION / NAMESPACE
The package should depend on and fully import AnnotationHub or
ExperimentHub. If using the suggested .onLoad() function
for ExperimentHub, import the utils package in the DESCRIPTION file and
selectively importFrom(utils, read.csv) in the NAMESPACE.
If making an Experiment Data Hub package, the biocViews should
contain terms from ExperimentData and should also contain the term ExperimentHub.
If making an Annotation Hub package, the biocViews should contain
terms from AnnotationData and should also contain the term AnnotationHub.
In the case where a software package was appropriate rather than a
separate annotation or experiment data package, the biocViews
should include only Software terms but must include either AnnotationHubSoftware or
ExperimentHubSoftware.
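The biocViews requirement can be checked mechanically. The sketch below uses base R's read.dcf() on a throwaway DESCRIPTION file; the helper name has_hub_biocviews() and the file contents are hypothetical and only illustrate the rule stated above.

```r
# Hypothetical helper: TRUE if the DESCRIPTION file lists the given
# hub term in its biocViews field.
has_hub_biocviews <- function(description_file,
                              term = c("ExperimentHub", "AnnotationHub",
                                       "ExperimentHubSoftware",
                                       "AnnotationHubSoftware")) {
    term <- match.arg(term)
    dcf <- read.dcf(description_file, fields = "biocViews")
    views <- trimws(strsplit(dcf[1, "biocViews"], ",")[[1]])
    term %in% views
}

# Minimal DESCRIPTION for an imaginary Experiment Data Hub package.
desc <- file.path(tempdir(), "DESCRIPTION-hubdemo")
writeLines(c("Package: myExperimentPackage",
             "Version: 0.99.0",
             "Depends: ExperimentHub",
             "Imports: utils",
             "biocViews: ExperimentData, ExperimentHub"),
           desc)

has_hub_biocviews(desc, "ExperimentHub")
has_hub_biocviews(desc, "AnnotationHub")
```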
Data are not formally part of the software package and are stored separately on a publicly accessible hosted site or by Bioconductor on the Microsoft Azure Genomic Data Lake. The author should read the following section on “Storage of Data Files”.
When you are satisfied with the representation of your resources in
your metadata.csv (or other aptly named csv file) the
Bioconductor team member will add the metadata to the
production database. Confirm the metadata csv files in inst/extdata/ are
valid by running either
ExperimentHubData::makeExperimentHubMetadata() or
AnnotationHubData::makeAnnotationHubMetadata() on your package. Please
address any warnings or errors.
Once the data are in the Genomic Data Lake or on a public site and the metadata have been added to the production database, the man pages and vignette can be finalized. When the package passes R CMD build and check it can be submitted to the package tracker for review. The package should be submitted without any of the data that is now located remotely. This keeps the package lightweight while still providing access to key large data files now stored remotely. If the data files were added to the GitHub repository please see removing large data files and clean git tree to remove the large files and reduce package size.
Many times these data packages are created as a supplement to a software package. There is a process for submitting multiple packages under the same issue.
Metadata for new versions of the data can be added to the same package as they become available.
The titles for the new versions should be unique and not match
the title of any resource currently in the Hub. Good practice would be
to include the version and / or genome build in the title. If the title
is not unique, the AnnotationHub or ExperimentHub
object will list multiple files with the same
title. The user will need to use ‘rdatadateadded’ to determine which is
the most current or infer from the id numbers, which could lead to
confusion.
Make data available: either on publicly accessible site or see section on “Uploading Data to Microsoft Azure Genomic Data Lake”.
Update make-metadata.R with the new metadata information
Generate a new metadata.csv file. The package should contain metadata for all versions of the data in ExperimentHub or AnnotationHub so the old file should remain. When adding a new version it might be helpful to write a new csv file named by version, e.g., metadata_v84.csv, metadata_v85.csv etc.
Bump package version and commit to git
Notify [email protected] that an update is ready and a team member will add the new metadata to the production database; new resources will not be visible in AnnotationHub or ExperimentHub until the metadata are added to the database.
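Before notifying the team, it can help to confirm that titles stay unique across the versioned csv files. The following is a minimal sketch, assuming one csv file per version in a single directory; the helper name duplicated_titles() and the example titles are hypothetical. It only compares titles within the package, not against everything already in the Hub.

```r
# Hypothetical helper: read all metadata csv files in `extdata_dir`
# and return any Title that appears more than once across them.
duplicated_titles <- function(extdata_dir) {
    files <- list.files(extdata_dir, pattern = "\\.csv$", full.names = TRUE)
    titles <- unlist(lapply(files, function(fl) {
        read.csv(fl, stringsAsFactors = FALSE)$Title
    }))
    unique(titles[duplicated(titles)])
}

# Two versioned metadata files; the second reuses one title by mistake.
extdata_dir <- file.path(tempdir(), "version-demo")
dir.create(extdata_dir, recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(Title = c("Cow Annotation v84", "Dog Annotation v84")),
          file.path(extdata_dir, "metadata_v84.csv"), row.names = FALSE)
write.csv(data.frame(Title = c("Cow Annotation v85", "Dog Annotation v84")),
          file.path(extdata_dir, "metadata_v85.csv"), row.names = FALSE)

dups <- duplicated_titles(extdata_dir)
print(dups)
```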
Contact [email protected] or [email protected] with any questions.
experiment data package to utilizing the Hub.
The concepts and directory structure of the package would stay the same. The main steps involved would be
Restructure the inst/extdata and inst/scripts to include
metadata.csv and make-data.R as described in the section above for
creating new packages. Ensure the metadata.csv file is formatted
correctly by running
AnnotationHubData::makeAnnotationHubMetadata() or
ExperimentHubData::makeExperimentHubMetadata() on your
package.
Add biocViews term “AnnotationHub” or “ExperimentHub” to DESCRIPTION
Upload the data to the Data Lake or place it on a publicly accessible site and remove the data from the package. See the section on “Storage of Data Files” below.
Once the data is officially added to the hub, update any code to utilize AnnotationHub or ExperimentHub for retrieving data.
Push all changes with a version bump back to the Bioconductor git.bioconductor.org location.
A bug fix may involve a change to the metadata, data resource or both.
The replacement resource must have the same name as the original and be at the same location (path).
Notify [email protected] that you want to replace the data and make the files available: see section “Uploading Data to Microsoft Azure Genomic Data Lake”.
If a file is replaced on the data lake directly, the old file will no longer be accessible. This could affect reproducibility of end users’ research if the old file has already been utilized. This approach should be done with caution.
New metadata records can be added for new resources but modifying existing records is discouraged. Record modification will only be done in the case of bug fixes and has to be done manually on the database by a core team member.
Update make-metadata.R and regenerate the metadata.csv file if necessary
Bump the package version and commit to git
Notify [email protected] that you want to change the metadata for resources. The core team member will likely need the current AH/EH ids for the resources that need updating and a summary of what fields in the metadata file changed. NOTE: Large changes to the metadata may require the core team member to remove the resources entirely from the database and re-add them, resulting in new AH/EH ids.
Removing resources should be done with caution. The intent is that resources in the Hubs support ‘reproducible’ research by providing a stable snapshot of the data. Data made available in Bioconductor version x.y.z should be available for all versions greater than x.y.z. Unfortunately this is not always possible. If you find it necessary to remove data from AnnotationHub/ExperimentHub please contact [email protected] or [email protected] for assistance.
When a resource is removed from ExperimentHub or AnnotationHub two
things happen: the ‘rdatadateremoved’ field is populated with a date and
the ‘status’ field is populated with a reason why the resource is no
longer available. Once these changes are made, the
ExperimentHub() or AnnotationHub() constructor
will not list the resource among the available ids. An attempt to
extract the resource with ‘[[’ and the EH/AH id will return an error
along with the status message. The function getInfoOnIds()
will display metadata information for any resource including resources
still in the database but no longer available.
In general, resources are only removed when they are no longer available (e.g., moved from web location, no longer provided etc.).
To remove a resource from AnnotationHub
contact [email protected] or [email protected].
Versioning of resources is handled by the maintainer. If you plan to provide incremental updates to a file for the same organism / genome build, we recommend including a version in the title of the resource so it is easy to distinguish which is most current. We also recommend, when uploading the data to the Genomic Data Lake or your publicly accessible site, using a directory structure that accounts for versioning.
If you do not include a version, or make the title unique in some
way, multiple files with the same title will be listed in the
ExperimentHub or AnnotationHub object. The
user will have to use the ‘rdatadateadded’ metadata field to determine
which file is the most current or try to infer from ids, which can lead
to confusion.
Several metadata fields control which resources are visible when a user invokes ExperimentHub()/AnnotationHub(). Records are filtered based on these criteria:
Once a record is added to ExperimentHub/AnnotationHub it is visible from that point forward until stamped with ‘rdatadateremoved’. For example, a record added on May 1, 2017 with ‘biocVersion’ 3.6 will be visible in all snapshots >= May 1, 2017 and in all Bioconductor versions >= 3.6.
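The date and version rule in the example above can be illustrated on a toy table. This is only a sketch of the stated behaviour, not the hubs' actual implementation; the ids, dates, and the helper name visible_ids() are hypothetical, while the field names follow those quoted in the text.

```r
# Toy records using the metadata field names mentioned in the text.
records <- data.frame(
    id = c("EH1", "EH2", "EH3"),
    rdatadateadded = as.Date(c("2017-05-01", "2017-05-01", "2019-01-01")),
    rdatadateremoved = as.Date(c(NA, "2018-01-01", NA)),
    biocVersion = c("3.6", "3.6", "3.9"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: ids visible for a snapshot date and a
# Bioconductor version, following the rule described above.
visible_ids <- function(records, snapshot, bioc) {
    added <- records$rdatadateadded <= snapshot
    not_removed <- is.na(records$rdatadateremoved) |
        records$rdatadateremoved > snapshot
    # numeric_version() compares "3.10" as newer than "3.9"
    version_ok <- numeric_version(records$biocVersion) <=
        numeric_version(bioc)
    records$id[added & not_removed & version_ok]
}

visible_ids(records, as.Date("2018-06-01"), "3.7")
visible_ids(records, as.Date("2019-06-01"), "3.10")
```

Comparing versions with numeric_version() rather than as plain numbers avoids treating 3.10 as smaller than 3.9.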
A special filter for OrgDb is utilized in AnnotationHub. Only one
OrgDb is available per release/devel cycle. Therefore contributed OrgDb
added to a devel cycle are masked until the following release. There are
options for debugging these masked resources. See
?setAnnotationHubOption
The data should not be included in the package. This keeps the package lightweight and quick for a user to install, and allows the user to investigate functions and documentation without downloading large data files, proceeding with the download only when necessary. There are two options for storing data: Bioconductor Microsoft Azure Genomic Data Lake or hosting the data elsewhere on a publicly accessible site. See information below and choose the option that fits best for your situation.
Data can be accessed through the hubs from any publicly accessible
site. The metadata.csv file[s] created will need the column
Location_Prefix to indicate the hosted site. See more in
the description of the metadata columns/fields below but as a quick
example if the link to the data file is
ftp://mylocalserver/singlecellExperiments/dataSet1.Rds, an
example breakdown of the Location_Prefix and RDataPath
for this entry in the metadata.csv file would be
ftp://mylocalserver/ for the Location_Prefix
and singlecellExperiments/dataSet1.Rds for the
RDataPath. GitHub is not an acceptable hosting platform for
data.
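The breakdown in the example above can be reproduced with a small helper. This is a sketch that assumes the split falls right after the host name, as in the example; the function name split_hub_url() is hypothetical.

```r
# Hypothetical helper: split a full URL into the Location_Prefix
# (scheme plus host, with trailing slash) and the RDataPath (the rest).
split_hub_url <- function(url) {
    pattern <- "^([A-Za-z][A-Za-z0-9+.-]*://[^/]+/)(.+)$"
    stopifnot(grepl(pattern, url))
    list(Location_Prefix = sub(pattern, "\\1", url),
         RDataPath = sub(pattern, "\\2", url))
}

parts <- split_hub_url(
    "ftp://mylocalserver/singlecellExperiments/dataSet1.Rds")
print(parts)
```

Pasting the two pieces back together with paste0() gives the original link, which is a quick way to confirm that a metadata row points at the intended file.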
Instead of providing the data files via Dropbox, FTP, GitHub, etc., we will grant temporary access to a temporary Data Lake directory where you can upload your data. Please email [email protected] to obtain a SAS token for identification.
Please upload the data with the appropriate directory structure, including subdirectories as necessary (i.e. top directory must be software package name, then if applicable, subdirectories of versions, …).
Once the upload is complete, email [email protected] to continue the process. To add the data officially the data will need to be uploaded and the metadata.csv file will need to be created in the github repository.
There are a few different options users have for connecting to
Microsoft Azure Genomic Data Lake to upload data. All require obtaining
either a SAS token or SAS URL from the Bioconductor Core Team by
emailing [email protected]. In the examples below, insert the
provided SAS token or SAS URL in place of the <sas token> or <sas url> placeholder.
Data can be uploaded through the R package AzureStor, which avoids having to download anything directly on your computer. Most of the documentation here is an adaptation of the README and documentation provided through the AzureStor package and the AzureStor GitHub repository.
Open R and load the AzureStor package provided through CRAN:
if (!requireNamespace("AzureStor", quietly = TRUE))
    install.packages("AzureStor")
library("AzureStor")
You will need to connect to the temporary storage location with the provided SAS credentials:
sas <- "<sas token>"
url <- "https://bioconductorhubs.blob.core.windows.net"
ep <- storage_endpoint(url, sas = sas)
container <- storage_container(ep, "staginghub")
Now the command to upload will depend on if your data is currently stored locally or in a remote location.
For locally available data use storage_multiupload. If
your data files are in a local path
/home/user/mypackage/data and assuming the name of your
package is mypackage then you would use something like the
following call:
files <- dir("/home/user/mypackage/data", recursive=TRUE)
src <- dir("/home/user/mypackage/data", recursive=TRUE, full.names=TRUE)
dest <- paste0("mypackage/", files)
storage_multiupload(container, src=src, dest=dest)
Please make sure the dest value starts with the name of
your package.
For data that is currently being stored remotely (GitHub, Dropbox,
FTP, etc), use copy_url_to_storage or
multicopy_url_to_storage. As an example, say the data is
stored on a public GitHub repository at MyGithub/MyPackage,
in a package-like directory structure where the data is in a data
directory.
library(httr)
# get the list of files for the repository
response <- GET("https://api.github.com/repos/MyGithub/MyPackage/git/trees/master?recursive=1")
# get the blob urls and file names
src <- sapply(content(response)$tree, function(elt) elt$url)
names <- sapply(content(response)$tree, function(elt) elt$path)
# filter for the files in the data directory
# if you are uploading subdirectories, filter out the github entry for the
# directory name; subdirectories will be created automatically
keep <- grepl("^data/", names)
src <- src[keep]
names <- names[keep]
# we want the data in a directory with the package name
# the data should only have relevant subdirectories;
# sub() with an anchored pattern strips only the leading "data/"
dest <- paste0("MyPackage/", sub("^data/", "", names))
# upload to azure
multicopy_url_to_storage(container, src=src, dest=dest)
Keep in mind that GitHub enforces rate limits. If you have
reached the maximum rate, you might get a
rate limit exceeded error. You would have to check your upload to
see what uploaded correctly and what is missing. You can also use the
argument max_concurrent_transfers to lower the transfer rate.
If you are using AzureStor version > 3.5.2.9000, you have the option of passing an authentication header into the multicopy_url_to_storage function. For GitHub, you would pass a generated Personal Access Token (PAT) with repo level access.
token <- "<github PAT>"
auth_header <- paste("token", token)
multicopy_url_to_storage(container, src=src, dest=dest, auth_header = auth_header)
This allows for secure access and will increase the maximum rate GitHub allows.
The command line interface for upload is through azcopy. Download Microsoft azcopy and unzip/untar it. You can add the location of the azcopy executable to your system PATH so that it can be found anywhere; otherwise the following examples should use the full path to the location where the file was unzipped/untarred. If the directory of data on your system is called MyPackageData, the following command would upload the directory:
azcopy copy --recursive MyPackageData <sas url>
All files should be in a folder that matches your package name. Only upload data files; subdirectories are optionally okay to include to distinguish versions or characteristics of the data (e.g., species, tissue types). Do not upload your entire package directory (i.e., DESCRIPTION, NAMESPACE, R/, etc.).
For a GUI-like experience for uploading data, download the Microsoft Azure Storage Explorer.
Once installed, open the Storage Explorer and follow the steps below.
The Select Resource window should automatically
appear: select Blob Container. If the window does not
automatically appear, see Troubleshooting GUI at the bottom of this
section for instructions on how to make this window appear, how to
navigate if already logged in with a valid, non-expired SAS token, and
what to do if your SAS token has expired or you are seeing an older login
displayed without access.
In the Select Connection Method window, select
Shared access signature URL (SAS) and click Next in the
bottom right corner.
In the Enter Connection Info window, type
staginghub into the Display name and insert the given SAS URL.
On the Summary window, verify and click Connect in
the bottom right corner.
You should now see a GUI version of the storage container.
staginghub is the temporary location to upload data. This
is a shared location and there may be other users' data folders located
here that are visible to you. Your SAS token allows for list and create
options so no user will be able to delete another user's data. Please do
not put your data into someone else’s folder. All files should be in a
folder that matches your package name. Only upload data files;
subdirectories are optionally okay to include to distinguish versions or
characteristics of the data (e.g., species, tissue types). Do not upload
your entire package directory (i.e., DESCRIPTION, NAMESPACE, R/, etc.).
If your data is already in a directory with your package name, use the Upload Folder option. Uploading a folder will automatically upload any subdirectories if utilized.
Choose Upload in the top left and select Upload Folder
Navigate to the appropriate folder on your local file system in
the Selected folder field.
Select Block blob as the Blob type.
Leave the Destination directory as /
Choose Upload in the bottom right
If your data is not in a directory with your package name:
Choose Upload in the top left and select Upload Files
Navigate to and select the appropriate files on your local file system. This option will not allow you to select subdirectories or folders, only files.
Select Block blob as the Blob type.
Change the Destination directory to your package name.
Choose Upload in the bottom right
Troubleshooting GUI
If the connection window did not appear automatically on opening there are a few common issues that might be the cause.
If you know you are not logged into a session, you can click on what looks like an outlet plug to launch the resource connection window (see the beginning of the GUI section).
If you have already logged in and are still connected with a valid, non-expired SAS, you can navigate directly to the storage container using the left navigation pane.
Click on Local & Attached to expand.
Click on Storage Accounts to expand.
Click on Attached Containers to expand.
Click on Blob Containers to expand.
You should see the attached staginghub. If you click
on it, it should be accessible. If you get an error about connection
authentication or at the bottom left in the properties for
Shared Access Signature it says expired, you will have to
detach the session and log in with a valid SAS URL. To detach, right
click on the staginghub in the explorer section and select
detach. In the pop up for verification click “Yes”. Log in again
with a valid SAS URL by clicking on the picture that looks like an
outlet plug in the far.
coming soon!
The best way to validate record metadata is to read
inst/extdata/metadata.csv (or aptly named csv file in inst/extdata)
using AnnotationHubData::makeAnnotationHubMetadata() or
ExperimentHubData::makeExperimentHubMetadata(). If that is
successful the metadata should be valid and able to be entered into the
database.
As described above the metadata.csv file (or multiple metadata.csv
files) will need to be created before the data can be added to the
database. To ensure proper formatting one should run
AnnotationHubData::makeAnnotationHubMetadata or
ExperimentHubData::makeExperimentHubMetadata on the package
with any/all metadata files, and address any ERRORs that occur. Each
data object uploaded to the data server should have an entry (row) in the
metadata file. Briefly, a description of the metadata columns
required:
FilePath that instead of trying to load the file into R,
will only return the path to the locally downloaded file.
Any additional columns in the metadata.csv file will be ignored but could be included for internal reference.
More on Location_Prefix and RDataPath. These two fields make up the
complete file path url for downloading the data file. If using the
Bioconductor Microsoft Azure Genomic Data Lake the Location_Prefix
should not be included in the metadata file[s] as this field will be
populated automatically. The RDataPath will be the directory structure
you uploaded to the Data Lake. If you uploaded a directory
MyAnnotation/, and that directory had a subdirectory
v1/ that contained two files counts.rds and
coldata.rds, your metadata file will contain two rows and
the RDataPaths would be MyAnnotation/v1/counts.rds and
MyAnnotation/v1/coldata.rds. If you host your data on a
publicly accessible site you must include a base url as the
Location_Prefix. If your data file was at
ftp://myinstiututeserver/biostats/project2/counts.rds, your
metadata file will have one row and the Location_Prefix
would be ftp://myinstiututeserver/ and the
RDataPath would be
biostats/project2/counts.rds.
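For Data Lake uploads, the RDataPath values can be derived from the local copy of the uploaded directory. The sketch below recreates the MyAnnotation/v1 layout from the example in a temporary directory (the file contents are dummies) and lists paths relative to the parent of the uploaded directory; it assumes the local layout mirrors what was uploaded.

```r
# Recreate the example layout in a temporary directory.
root <- file.path(tempdir(), "datalake-demo")
dir.create(file.path(root, "MyAnnotation", "v1"),
           recursive = TRUE, showWarnings = FALSE)
saveRDS(1:3, file.path(root, "MyAnnotation", "v1", "counts.rds"))
saveRDS(letters, file.path(root, "MyAnnotation", "v1", "coldata.rds"))

# Paths relative to the parent of the uploaded directory; these are the
# values that would go in the RDataPath column, one row per file.
rdatapath <- list.files(root, recursive = TRUE)
print(rdatapath)
```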
This is a bad example because these annotations are already in the hubs but it should give you an idea of the format for AnnotationHub. Let’s say I have a package myAnnotations and I upload two annotation files for dog and cow with information extracted from Ensembl to Bioconductor’s Data Lake location. You would want the following saved as a csv (comma-separated output), but for easier viewing we show it as a table:
Title | Description | BiocVersion | Genome | SourceType | SourceUrl | SourceVersion | Species | TaxonomyId | Coordinate_1_based | DataProvider | Maintainer | RDataClass | DispatchClass | RDataPath |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Dog Annotation | Gene Annotation for Canis lupus from ensembl | 3.9 | Canis lupus | GTF | ftp://ftp.ensembl.org/pub/release-95/gtf/canis_lupus_dingo/Canis_lupus_dingo.ASM325472v1.95.gtf.gz | release-95 | Canis lupus | 9612 | true | ensembl | Bioconductor Maintainer [email protected] | character | FilePath | myAnnotations/canis_lupus_dingo.ASM325472v1.95.gtf.gz |
Cow Annotation | Gene Annotation for Bos taurus from ensembl | 3.9 | Bos taurus | GTF | ftp://ftp.ensembl.org/pub/release-74/gtf/bos_taurus/Bos_taurus.UMD3.1.74.gtf.gz | release-74 | Bos taurus | 9913 | true | ensembl | Bioconductor Maintainer [email protected] | character | FilePath | myAnnotations/Bos_taurus.UMD3.1.74.gtf.gz |
This is a dummy example but hopefully it will give you an idea of the format for ExperimentHub. Let’s say I have a package myExperimentPackage and I upload two files, one a SummarizedExperiment of expression data saved as a .rda and the other a sqlite database, both considered simulated data. You would want the following saved as a csv (comma-separated output), but for easier viewing we show it as a table:
Title | Description | BiocVersion | Genome | SourceType | SourceUrl | SourceVersion | Species | TaxonomyId | Coordinate_1_based | DataProvider | Maintainer | RDataClass | DispatchClass | RDataPath |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Simulated Expression Data | Simulated Expression values for 12 samples and 12000 probes | 3.9 | NA | Simulated | http://mylabshomepage | v1 | NA | NA | NA | http://bioconductor.org/packages/myExperimentPackage | Bioconductor Maintainer [email protected] | SummarizedExperiment | Rda | myExperimentPackage/SEobject.rda |
Simulated Database | Simulated Database containing gene mappings | 3.9 | hg19 | Simulated | http://bioconductor.org/packages/myExperimentPackage | v2 | Homo sapiens | 9606 | NA | http://bioconductor.org/packages/myExperimentPackage | Bioconductor Maintainer [email protected] | SQLiteConnection | SQLiteFile | myExperimentPackage/mydatabase.sqlite |