Title: | R interface to EBI MGnify metagenomics resource |
---|---|
Description: | Utility package to facilitate integration and analysis of EBI MGnify data in R. The package can be used to import microbial data into, for instance, TreeSummarizedExperiment (TreeSE). In TreeSE format, the data is directly compatible with the miaverse framework. |
Authors: | Tuomas Borman [aut, cre] , Ben Allen [aut], Leo Lahti [aut] |
Maintainer: | Tuomas Borman <[email protected]> |
License: | Artistic-2.0 | file LICENSE |
Version: | 1.3.0 |
Built: | 2024-12-01 04:20:32 UTC |
Source: | https://github.com/bioc/MGnifyR |
MGnifyR Package.
MGnifyR implements an interface to the EBI MGnify database.
See the vignette for a general introduction to this package, About MGnify for general MGnify information, and API documentation for details about the JSONAPI implementation.
Maintainer: Tuomas Borman [email protected] (ORCID)
Authors:
Ben Allen [email protected]
Leo Lahti [email protected] (ORCID)
TreeSummarizedExperiment class
MgnifyClient accessors and mutators
databaseUrl(x)
authTok(x)
useCache(x)
cacheDir(x)
showWarnings(x)
clearCache(x)
verbose(x)

databaseUrl(x) <- value
authTok(x) <- value
useCache(x) <- value
cacheDir(x) <- value
showWarnings(x) <- value
clearCache(x) <- value
verbose(x) <- value

## S4 method for signature 'MgnifyClient'
databaseUrl(x)
## S4 method for signature 'MgnifyClient'
authTok(x)
## S4 method for signature 'MgnifyClient'
useCache(x)
## S4 method for signature 'MgnifyClient'
cacheDir(x)
## S4 method for signature 'MgnifyClient'
showWarnings(x)
## S4 method for signature 'MgnifyClient'
clearCache(x)
## S4 method for signature 'MgnifyClient'
verbose(x)

## S4 replacement method for signature 'MgnifyClient'
databaseUrl(x) <- value
## S4 replacement method for signature 'MgnifyClient'
authTok(x) <- value
## S4 replacement method for signature 'MgnifyClient'
useCache(x) <- value
## S4 replacement method for signature 'MgnifyClient'
cacheDir(x) <- value
## S4 replacement method for signature 'MgnifyClient'
showWarnings(x) <- value
## S4 replacement method for signature 'MgnifyClient'
clearCache(x) <- value
## S4 replacement method for signature 'MgnifyClient'
verbose(x) <- value
x |
A |
value |
A value to be added to a certain slot. |
These functions are for fetching and mutating slots of a MgnifyClient object.
The value of a slot of the MgnifyClient object, or nothing.
mg <- MgnifyClient()
databaseUrl(mg)
showWarnings(mg) <- FALSE
These functions will be deprecated. Please use other functions instead.
mgnify_client(
  username = NULL, password = NULL, usecache = FALSE,
  cache_dir = NULL, warnings = FALSE, use_memcache = FALSE, ...
)

mgnify_query(
  client, qtype = "samples", accession = NULL, asDataFrame = TRUE,
  maxhits = 200, usecache = FALSE, ...
)

mgnify_analyses_from_samples(client, accession, usecache = TRUE, ...)

mgnify_analyses_from_studies(client, accession, usecache = TRUE, ...)

mgnify_get_download_urls(
  client, accessions, accession_type, usecache = TRUE, ...
)

mgnify_download(
  client, url, file = NULL, read_func = NULL,
  usecache = TRUE, Debug = FALSE, ...
)

mgnify_get_analyses_results(
  client = NULL, accessions, retrievelist = c(), compact_results = TRUE,
  usecache = TRUE, bulk_dl = FALSE, ...
)

mgnify_get_analyses_phyloseq(
  client = NULL, accessions, usecache = TRUE, returnLists = FALSE,
  tax_SU = "SSU", get_tree = FALSE, ...
)

mgnify_get_analyses_metadata(client, accessions, usecache = TRUE, ...)

mgnify_retrieve_json(
  client, path = "biomes", complete_url = NULL, qopts = NULL,
  maxhits = 200, usecache = FALSE, Debug = FALSE, ...
)
username |
- |
password |
- |
usecache |
- |
cache_dir |
- |
warnings |
- |
use_memcache |
- |
... |
- |
client |
- |
qtype |
- |
accession |
- |
asDataFrame |
- |
maxhits |
- |
accessions |
- |
accession_type |
- |
url |
- |
file |
- |
read_func |
- |
Debug |
- |
retrievelist |
- |
compact_results |
- |
bulk_dl |
- |
returnLists |
- |
tax_SU |
- |
get_tree |
- |
path |
- |
complete_url |
- |
qopts |
- |
-
Search MGnify database for studies, samples, runs, analyses, biomes, assemblies, and genomes.
doQuery(x, ...)

## S4 method for signature 'MgnifyClient'
doQuery(
  x,
  type = "studies",
  accession = NULL,
  as.df = TRUE,
  max.hits = 200,
  ...
)
x |
A |
... |
Remaining parameter key/value pairs may be supplied to filter
the returned values. Available options differ between |
type |
A single character value specifying the type of objects to
query. Must be one of the following options: |
accession |
A single character value or a vector of character values
specifying MGnify accession identifiers (of type |
as.df |
A single boolean value specifying whether to return the
results as a data.frame or leave as a nested list. In most cases,
|
max.hits |
A single integer value specifying the maximum number of
results to return or FALSE. The actual number of results will be
higher than |
doQuery is a flexible query function, harnessing the "full" power of the JSONAPI MGnify search filters. Search results may be filtered by metadata value, associated study/sample/analysis, etc.
See the API browser for information on MGnify database filters. You can find help on customizing queries from here.
For example the following filters are available:
studies: accession, biome_name, lineage, centre_name, include
samples: accession, experiment_type, biome_name, lineage, geo_loc_name, latitude_gte, latitude_lte, longitude_gte, longitude_lte, species, instrument_model, instrument_platform, metadata_key, metadata_value_gte, metadata_value_lte, metadata_value, environment_material, environment_feature, study_accession, include
runs: accession, experiment_type, biome_name, lineage, species, instrument_platform, instrument_model, metadata_key, metadata_value_gte, metadata_value_lte, metadata_value, sample_accession, study_accession, include
analyses: biome_name, lineage, experiment_type, species, sample_accession, pipeline_version
biomes: depth_gte, depth_lte
assemblies: depth_gte, depth_lte
Unfortunately, in some cases some of these filters do not work as expected, so it is important to check that the returned results match what was requested. Worse, if there is an error in the parameter specification, the query runs as if no filter parameters were present at all. The result will therefore appear superficially correct but will in fact correspond to something completely different. This behaviour will hopefully be fixed in future versions of MGnifyR or the JSONAPI, but for now users should double-check returned values.
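One way to double-check a filter is to test the returned table against the same condition that was requested. The following is a minimal base R sketch that uses a mock data.frame in place of a real doQuery() result. The helper filter_respected() and the column names accession and latitude are hypothetical; the columns returned by the API may be named or typed differently.

```r
# Hypothetical helper (not part of MGnifyR): returns TRUE only if every
# value in `column` satisfies `predicate`.
filter_respected <- function(results, column, predicate) {
    if (!column %in% colnames(results)) {
        stop("Column '", column, "' not found in results")
    }
    values <- results[[column]]
    all(!is.na(values) & predicate(values))
}

# Mock stand-in for the result of a query that used latitude_gte = 66.
# Column names are assumptions made for illustration only.
mock_results <- data.frame(
    accession = c("S1", "S2", "S3"),
    latitude = c(70.1, 12.3, 81.0),
    stringsAsFactors = FALSE
)

# One row lies below 66, which suggests the filter was silently ignored
ok <- filter_respected(mock_results, "latitude", function(v) v >= 66)
if (!ok) {
    message("Returned rows do not match the requested filter")
}
```

If real results store the values as character, they would need converting (for example with as.numeric()) before the comparison.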
It is currently not possible to combine queries of the same type in a single call (for example, to search for samples within a latitude range). However, it is possible to run multiple queries and combine the results using set operations in R to get the desired behaviour.
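The combination step can be sketched in base R as follows. The two data.frames are mock stand-ins for the results of two separate doQuery() calls (one with latitude_gte, one with latitude_lte). The column names accession and latitude are assumptions for illustration; real results may be laid out differently.

```r
# Mock result of a query with a lower latitude bound
samps_gte <- data.frame(
    accession = c("S1", "S2", "S3"),
    latitude = c(61.5, 68.0, 75.2),
    stringsAsFactors = FALSE
)
# Mock result of a query with an upper latitude bound
samps_lte <- data.frame(
    accession = c("S0", "S1", "S2"),
    latitude = c(10.0, 61.5, 68.0),
    stringsAsFactors = FALSE
)

# Samples inside the band: accessions present in both result sets
in_band <- intersect(samps_gte$accession, samps_lte$accession)
samps_band <- samps_gte[samps_gte$accession %in% in_band, , drop = FALSE]

# Samples matching either query: stack the tables and drop duplicate rows
samps_either <- unique(rbind(samps_gte, samps_lte))
```

Matching on the accession rather than on whole rows keeps the intersection robust if the two queries return slightly different metadata columns.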
A nested list or data.frame containing the results of the query.
mg <- MgnifyClient(useCache = FALSE)

# Get a list of studies from the Agricultural Wastewater:
agwaste_studies <- doQuery(
    mg, "studies", biome_name="Agricultural wastewater"
)

## Not run:
# Get all samples from a particular study
samps <- doQuery(mg, "samples", accession="MGYS00004521")

# Search polar samples
samps_np <- doQuery(mg, "samples", latitude_gte=66, max.hits=10)
samps_sp <- doQuery(mg, "samples", latitude_lte=-66, max.hits=10)

# Search studies that have studied drinking water
tbl <- doQuery(
    mg,
    type = "studies",
    biome_name = "root:Environmental:Aquatic:Freshwater:Drinking water",
    max.hits = 10)

## End(Not run)
Versatile function to retrieve raw results
getData(x, ...)

## S4 method for signature 'MgnifyClient'
getData(x, type, accession.type = NULL, accession = NULL, as.df = TRUE, ...)
x |
A |
... |
optional arguments fed to internal functions. |
type |
A single character value specifying the type of data to retrieve.
Must be one of the following options: |
accession.type |
A single character value specifying type of accession
IDs ( |
accession |
A single character value or a vector of character values
specifying accession IDs to return results for.
(By default: |
as.df |
A single boolean value specifying whether to return the
results as a data.frame or leave as a nested list.
(By default: |
This function returns data from the MGnify database. Compared to getResult, this function offers a more flexible framework for fetching data. However, there are drawbacks: for count data, getResult returns an optimally structured data container that is easier to use in downstream analysis, whereas getData returns raw data from the database. On the other hand, if you want to retrieve data on pipelines or publications, for instance, getResult is not suitable, and getData can be used instead.
data.frame or list
# Create a client object
mg <- MgnifyClient(useCache = FALSE)

# Find kegg modules for certain analysis
df <- getData(
    mg, type = "kegg-modules",
    accession = "MGYA00642773", accession.type = "analyses")
Download any MGnify files, also including processed reads and identified protein sequences
Listing files available for download
getFile(x, ...)

searchFile(x, ...)

## S4 method for signature 'MgnifyClient'
getFile(x, url, file = NULL, read.func = NULL, ...)

## S4 method for signature 'MgnifyClient'
searchFile(
  x,
  accession,
  type = c("studies", "samples", "analyses", "assemblies", "genomes", "run"),
  ...
)
x |
A |
... |
Additional arguments; not used currently. |
url |
A single character value specifying the url address of the file we wish to download. |
file |
A single character value or NULL specifying an
optional local filename to use for saving the file. If |
read.func |
A function specifying an optional function to process the
downloaded file and return the results, rather than relying on post
processing. The primary use-case for this parameter is when local disk
space is limited and downloaded files can be quickly processed and
discarded. The function should take a single parameter, the downloaded
filename, and may return any valid R object.
(By default: |
accession |
A single character value or a vector of character values specifying accession IDs to return results for. |
type |
A single character value specifying the type of objects to
query. Must be one of the following options: |
getFile is a convenient wrapper around the generic URL downloading functionality in R, taking care of things like local caching and authentication.
searchFile() is a wrapper function allowing easy enumeration of the downloads available for the given accession IDs. It returns a single data.frame containing all available downloads and associated metadata, including the URL location and description. This can then be filtered to extract the URLs of interest before actually retrieving the files using getFile().
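As described for the read.func argument, the supplied function receives the downloaded filename and may return any R object. Below is a minimal base R sketch of such a function, exercised on a small temporary file rather than a real download. The function name count_fasta_records is hypothetical, and the assumption that the file is (optionally gzipped) FASTA text is made for illustration only.

```r
# Hypothetical read.func: counts FASTA records in a file.
# gzfile() also reads uncompressed files transparently.
count_fasta_records <- function(filename) {
    con <- gzfile(filename, open = "rt")
    on.exit(close(con), add = TRUE)
    sum(startsWith(readLines(con), ">"))
}

# Stand-in for a downloaded file: write a tiny gzipped FASTA to tempdir
demo_file <- tempfile(fileext = ".fasta.gz")
out <- gzfile(demo_file, open = "wt")
writeLines(c(">seq1", "ACGT", ">seq2", "GGCC"), out)
close(out)

n_records <- count_fasta_records(demo_file)
unlink(demo_file)

# With a real client and URL (not run here), the call would look like:
# n_records <- getFile(mg, url, read.func = count_fasta_records)
```

Returning only a small summary object like this fits the limited-disk-space use case, since nothing beyond the count needs to be kept after processing.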
For getFile(), the local filename of the downloaded file, which is either its location in the MGnifyR cache or file. If read.func is used, its result will be returned.
For searchFile(), a data.frame containing all discovered downloads. If multiple accessions are queried, the accessions column may be used to filter the results, since rownames are not set (and wouldn't make sense, as each query will return multiple items).
# Make a client object
mg <- MgnifyClient(useCache = FALSE)

# Create a vector of accession ids - these happen to be analysis
# accessions
accession_vect <- c("MGYA00563876", "MGYA00563877")
downloads <- searchFile(mg, accession_vect, "analyses")

# Filter to find the urls of 16S encoding sequences
url_list <- downloads[
    downloads$attributes.description.label == "Contigs encoding SSU rRNA",
    "download_url"]

# Example 1:
# Download the first file
supplied_filename <- getFile(
    mg, url_list[[1]],
    file="SSU_file.fasta.gz")

## Not run:
# Example 2:
# Just use local caching
cached_filename <- getFile(mg, url_list[[2]])

# Example 3:
# Using read.func to open the reads with readDNAStringSet from
# Biostrings. Without retaining on disk
dna_seqs <- getFile(
    mg, url_list[[3]],
    read.func = readDNAStringSet)

## End(Not run)

# Make a client object
mg <- MgnifyClient(useCache = TRUE)
# Create a vector of accession ids - these happen to be analysis
# accessions
accession_vect <- c(
    "MGYA00563876", "MGYA00563877", "MGYA00563878",
    "MGYA00563879", "MGYA00563880" )
downloads <- searchFile(mg, accession_vect, "analyses")
Get all study, sample and analysis metadata for the supplied analysis accessions
getMetadata(x, ...)

## S4 method for signature 'MgnifyClient'
getMetadata(x, accession, ...)
x |
A |
... |
Optional arguments; not currently used. |
accession |
A single character value or a vector of analysis accession IDs specifying accessions to retrieve data for. |
The function retrieves all study, sample and analysis metadata associated with provided analysis accessions.
A data.frame containing metadata for each analysis in the accession list. Each row represents a single analysis.
# Create a client object
mg <- MgnifyClient(useCache = FALSE)

# Download all associated study/sample and analysis metadata
accession_list <- c("MGYA00377505")
meta_dataframe <- getMetadata(mg, accession_list)
Get microbial and/or functional profiling data for a list of accessions
getResult(x, ...)

## S4 method for signature 'MgnifyClient'
getResult(
  x,
  accession,
  get.taxa = TRUE,
  get.func = TRUE,
  output = "TreeSE",
  ...
)
x |
A |
... |
optional arguments:
|
accession |
A single character value or a vector of character values specifying accession IDs to return results for. |
get.taxa |
A boolean value specifying whether to retrieve taxonomy
data (OTU table). See |
get.func |
A boolean value or a single character value or a vector of
character values specifying functional analysis types to retrieve. If
|
output |
A single character value specifying the format of the output.
Must be one of the following options: |
Given a set of analysis accessions and a collection of annotation types, the function queries the MGnify API and returns the results. This function is convenient for retrieving highly structured (analysis vs counts) data in certain instances. For example, BIOM files are downloaded automatically. If you just want to retrieve raw data from the database, see getData.
If only taxonomy data is retrieved, the result is returned as a TreeSummarizedExperiment object by default. The result can also be returned as a phyloseq object or as a list of data.frames. Note that a phyloseq object can include only one phylogenetic tree, meaning that some taxa might be lost when the data is subsetted based on the tree.
When functional data is retrieved in addition to taxonomy data, the result is returned as a MultiAssayExperiment object. Other options are a list containing a phyloseq object and data.frames, or just data.frames.
Functional data can be returned as a MultiAssayExperiment object or as a list of data.frames.
# Create a client object
mg <- MgnifyClient(useCache = FALSE)

# Get OTU tables as TreeSE
accession_list <- c("MGYA00377505")
tse <- getResult(mg, accession_list, get.func=FALSE, get.taxa=TRUE)

## Not run:
# Get functional data along with OTU tables as MAE
mae <- getResult(mg, accession_list, get.func=TRUE, get.taxa=TRUE)

# Get same data as list
list <- getResult(
    mg, accession_list, get.func=TRUE, get.taxa=TRUE, output = "list",
    as.df = TRUE, use.cache = TRUE)

## End(Not run)
Constructor for creating a MgnifyClient object that allows access to the MGnify database.
A MgnifyClient object
MgnifyClient(
  username = NULL,
  password = NULL,
  useCache = FALSE,
  cacheDir = tempdir(),
  showWarnings = FALSE,
  verbose = TRUE,
  clearCache = FALSE,
  ...
)
username |
A single character value specifying an optional username for
authentication. (By default: |
password |
A single character value specifying an optional password for
authentication. (By default: |
useCache |
A single boolean value specifying whether to enable on-disk
caching of results during this session. In most use cases this should be TRUE.
(By default: |
cacheDir |
A single character value specifying a folder to contain the
local cache. Note that cached files are persistent, so the cache directory
may be reused between sessions, taking advantage of previously downloaded
results. The directory will be created if it doesn't exist already.
(By default: |
showWarnings |
A single boolean value specifying whether to print
warnings during invocation of some MGnifyR functions.
(By default: |
verbose |
A single boolean value specifying whether to print extra
output during invocation of some MGnifyR functions.
(By default: |
clearCache |
A single boolean value specifying whether to clear the
cache. (By default: |
... |
optional arguments:
|
All functions in the MGnifyR package take a MgnifyClient object as their first argument. The object allows the simple handling of both user authentication and access to private data, and manages general options for querying the MGnify database.
An object that is required by the functions of the MGnifyR package.
A MgnifyClient object.
databaseUrl
A single character value specifying the URL address of the database.
authTok
A single character value specifying authentication token.
useCache
A single boolean value specifying whether to use cache.
cacheDir
A single character value specifying cache directory.
showWarnings
A single boolean value specifying whether to show warnings.
clearCache
A single boolean value specifying whether to clear cache.
verbose
A single boolean value specifying whether to show messages.
See MgnifyClient for constructor.
See MgnifyClient-accessors for accessor functions.
my_client <- MgnifyClient(
    useCache = TRUE, cacheDir = "/scratch/MGnify_cache_location"
)

## Not run:
# Use username and password to get access to non-public data
my_client <- MgnifyClient(
    username = "Webin-1122334", password = "SecretPassword",
    useCache = TRUE, cacheDir = "/scratch/MGnify_cache_location"
)

## End(Not run)
Look up analysis accession IDs for one or more study or sample accessions
searchAnalysis(x, ...)

## S4 method for signature 'MgnifyClient'
searchAnalysis(x, type, accession, ...)
x |
A |
... |
Optional arguments; not currently used. |
type |
A single character value specifying a type of
accession IDs specified by |
accession |
A single character value or a vector of character values specifying study or sample accession IDs that are used to retrieve analyses IDs. |
Retrieve analysis accession IDs associated with the supplied study or sample accession. In MGnify, an analysis accession refers to a certain pipeline analysis, such as specific 16S rRNA or shotgun metagenomic mapping. Studies can include multiple samples, and each sample can undergo multiple analyses using these pipelines. Each analysis is identified by a unique accession ID, allowing precise tracking and retrieval of analysis results within the MGnify database.
Vector of analysis accession IDs.
# Create a client object
mg <- MgnifyClient(useCache = FALSE)

# Retrieve analysis ids from study MGYS00005058
result <- searchAnalysis(mg, "studies", c("MGYS00005058"))

## Not run:
# Retrieve all analysis ids from samples
result <- searchAnalysis(
    mg, "samples", c("SRS4392730", "SRS4392743"))

## End(Not run)