---
title: "MGnifyR, extended vignette"
date: "`r Sys.Date()`"
package: MGnifyR
output:
  BiocStyle::html_document:
    fig_height: 7
    fig_width: 10
    toc: yes
    toc_depth: 2
    number_sections: true
vignette: >
  %\VignetteIndexEntry{MGnifyR, extended vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r include = FALSE}
library(knitr)
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    eval = FALSE,
    cache = TRUE
)
```

[MGnifyR homepage](http://github.com/EBI-Metagenomics/MGnifyR)

# Introduction

`MGnifyR` is a package designed to ease access to the EBI's [MGnify](https://www.ebi.ac.uk/metagenomics) resource, allowing searching and retrieval of multiple datasets for downstream analysis.

While MGnify pipelines are undoubtedly useful, as currently implemented they produce results on a strictly per-sample basis. Although some whole-study results are available, comparisons across studies are difficult. The `MGnifyR` package is designed to facilitate cross-study analyses by handling all the per-sample data retrieval and merging details internally, leaving the user free to perform the analysis as they see fit.

The latest version of MGnifyR integrates seamlessly with the [miaverse framework](https://microbiome.github.io/), providing access to tools for downstream microbiome analysis. This integration enables users to leverage optimized and standardized methods for analyzing the microbiome. Additionally, users can benefit from a comprehensive tutorial book that offers valuable guidance and support.

# Installation

`MGnifyR` is currently hosted on [GitHub](http://github.com/EBI-Metagenomics/MGnifyR) and released through Bioconductor; it can be installed with `BiocManager` using the following snippet.

```{r install, eval=FALSE}
BiocManager::install("MGnifyR")
```

# Load `MGnifyR` package

Once installed, `MGnifyR` is made available in the usual way.

```{r load_package}
library(MGnifyR)
```

# Create a client

All functions in `MGnifyR` make use of a `MgnifyClient` object to keep track of the JSONAPI URL, disk cache location and user access tokens. Thus the first thing to do when starting any analysis is to instantiate this object. The following snippet creates this.

```{r create_client}
mg <- MgnifyClient()
mg
```

It's recommended that local caching is enabled with `useCache = TRUE`. Queries to the MGnify API can be quite slow, particularly when retrieving multipage results for many analyses (such as many `Interpro` results). Using a local disk cache can significantly speed up subsequent work, bypassing the need to re-query the API. Use of the cache should be entirely transparent, as the caching occurs at the raw data level. The cache can persist across `MGnifyR` sessions, and can even be used for multiple sessions simultaneously - provided that different sets of accessions are queried at once.

Optionally, a username and password may be specified during client creation, causing `MGnifyR` to attempt retrieval of an authentication token from the API. Doing so gives access to non-public results, such as those currently under an author-imposed embargo period.

```{r create_client_passwd, eval=FALSE}
mg <- MgnifyClient(
    username = "Webin-username",
    password = "your-password",
    useCache = TRUE)
```
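If authenticated access is not needed, caching alone can still be enabled when the client is created. The following is a minimal sketch of the recommended set-up for an anonymous session, using only the `useCache` argument shown above.

```{r create_client_cache, eval=FALSE}
# Anonymous client with the local disk cache enabled; repeated queries for
# the same accessions are then served from disk rather than the API.
mg <- MgnifyClient(useCache = TRUE)
```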
# Functions for fetching the data

## Search data

`MGnifyR` gives users access to the complete range of search functionality implemented in the MGnify JSON API. A single function, `doQuery()`, is used to perform this searching, allowing Studies, Samples, Runs, Analyses and Assemblies to be interrogated from a common interface.

As with all `MGnifyR` functions, the first argument `client` must be a valid `MgnifyClient` instance. The only remaining **required** parameter is `qtype`, specifying the type of data to be queried, and may be one of `studies`, `samples`, `runs`, `analyses` or `assemblies`. Other general parameters include `max.hits`.

Unlike most other `MGnifyR` high-level functions, caching is turned off by default for `doQuery()`. New data and analyses are being added to MGnify all the time, so enabling caching by default may lead to out-of-date search results for long-lived sessions. However, it's easy to switch back on, and may be useful in many cases. Also, given the huge and ever-increasing number of datasets available in MGnify, a limit to the number of results returned may be set using `max.hits`. By default this is set to 200, which for most exploratory queries should be sufficient. It may be increased or decreased by directly specifying `max.hits`, and disabled completely (no limit) by setting `max.hits = NULL`.

In most cases we will want to be more specific about the search, and will also use either an `accession` parameter, or the many filter options available through the API and discussed below. Specifying an `accession` id, which in the case of `samples`, `runs` and `assemblies` may be a vector of ids, returns a data.frame of metadata with one row per matching accession. If `accession` is `NULL` (the default) then the remaining parameters define the filters applied by the API to the search result. Details of these parameters are given in `help(doQuery)`. By way of example though, supposing we are interested in amplicon Illumina samples from the Arctic, we might try the following query:

```{r search_studies}
northpolar <- doQuery(
    mg,
    "samples",
    latitude_gte = 60.0,
    experiment_type = "amplicon",
    biome_name = "Soil",
    instrument_platform = "Illumina",
    max.hits = 10)

head(northpolar)
```

Specifying an `accession` parameter will restrict results to just those matching that particular entry, be it a study, sample or run. For example, to retrieve information for study "MGYS00002891":

```{r search_studies_accession}
study_samples <- doQuery(mg, "studies", accession = "MGYS00002891")

head(study_samples)
```

## Find relevant **analyses** accessions

Having obtained a particular set of search hits, it's now time to retrieve the associated results. General automated analysis is complicated by the MGnify database design, wherein, for example, samples may be shared between multiple studies, or studies analysed multiple times using different versions of the pipeline. Navigating these "many-to-one" relationships can be tricky, so `MGnifyR` resorts to using `analyses` accessions as its canonical identifier. Each analysis corresponds to a single run of a particular pipeline on a single sample in a single study.

The downside of this approach is that queries returning `studies`, `samples` (or anything other than `analyses`) accessions need converting to the corresponding `analyses`. `MGnifyR` therefore provides a helper function to handle this conversion - `searchAnalysis()`. Following on from our previous search, we have a list of `study` accessions, so to convert to the corresponding `analyses` we use:

```{r convert_to_analyses}
analyses_accessions <- searchAnalysis(
    mg, type = "studies", accession = study_samples$accession)

head(analyses_accessions)
```

A useful side effect of the above call is that some attribute metadata for each sample has now been retrieved and stored in the local cache. Thus subsequent API calls for these samples (which will occur multiple times in later steps) will be significantly faster.

It's important to be aware that the results of a `searchAnalysis()` call will not necessarily be a one-to-one match with the input accessions. MGnify analysis runs are sometimes performed multiple times, perhaps using different versions of the pipeline. Thus further filtering of the result list may be required, but this is easily performed and is illustrated in the next section.
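As a quick optional check, the one-to-many relationship between studies and analyses described above can be seen by comparing the number of study accessions supplied with the number of analysis accessions returned; both objects come from the previous code chunks.

```{r compare_counts}
# A single study accession typically maps to many analysis accessions,
# so these two lengths are generally not equal.
length(unique(study_samples$accession))
length(analyses_accessions)
```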
## Fetch metadata

At this point we have a long list of analysis instances (with potential duplicates) corresponding to the samples previously found. We use the `getMetadata()` function to download and combine all associated `sample`, `run` and `study` metadata, which we then filter as required to include only the rows we want.

```{r get_metadata}
analyses_metadata <- getMetadata(mg, analyses_accessions)

head(analyses_metadata)
```

The resulting data.frame has columns with names prefixed with their source type. For example, "sample_xxx" columns correspond to metadata gleaned from querying an accession's `sample` entry. MGnify allows quite flexible specification of arbitrary metadata at submission time, in many cases leading to quite sparse `data.frame` results if accession queries are sourced from more than one study. For instance, if only one sample contains an entry for "sample_soil_PH", entries for other rows will be filled with `NA`. `MGnifyR` does not automatically clean these missing values - instead opting to allow the user to choose the correct action.

The particular study we're looking at is from the marine biome; suppose we were interested in only those samples or analyses for which the sampling depth was known. The following snippet filters the full `data.frame`, selecting only entries which contain a valid `sample_depth`. It's worth noting the `as.numeric` call to ensure the column is converted to `numeric` type before it is checked. *All* sample data from MGnifyR is initially retrieved as type `character`, and it's up to the user to make sure ostensibly numeric entries are converted properly.

```{r filter_metadata}
known_depths <- analyses_metadata[
    !is.na(as.numeric(analyses_metadata$sample_depth)), ]

# How many are left?
dim(known_depths)
```

## Fetch microbiome data

Having selected the analyses we wish to examine further, `getResult()` is used to both download the associated OTU tables and taxonomy, and join all results into a single `r BiocStyle::Biocpkg("TreeSummarizedExperiment")` (`TreeSE`) object. `TreeSE` is becoming a de facto standard for taxonomic abundance *munging* in R. `TreeSE` objects integrate abundance, taxonomic, phylogenetic, sample and sequence data into a single object, with powerful facilities for filtering, processing and plotting the results. Compared to a `r BiocStyle::Biocpkg("phyloseq")` object, `TreeSE` is more scalable and better suited to efficient data analysis. The `miaverse` framework is developed around the `TreeSE` data container and provides tools for analysis and visualization. Moreover, it includes a comprehensive tutorial book called [OMA](https://microbiome.github.io/OMA/).

### Amplicon sequencing

When the dataset includes amplicon sequencing data, i.e., the dataset does not include function predictions, the `getResult()` method returns the dataset as a `TreeSE` by default. See the function documentation for other output types.
```{r get_treese}
tse <- getResult(mg, accession = analyses_accessions, get.func = FALSE)

tse
```

The `TreeSE` object is uniquely positioned to support `r BiocStyle::Biocpkg("SummarizedExperiment")`-based microbiome data manipulation and visualization. Moreover, it enables access to `miaverse` tools. For example, we can estimate the diversity of samples.

```{r calculate_diversity}
library(mia)

tse <- estimateDiversity(tse, index = "shannon")

library(scater)

plotColData(tse, "shannon", x = "sample_geo.loc.name")
```

```{r plot_abundance}
library(miaViz)

plotAbundance(tse[!is.na(rowData(tse)[["Kingdom"]]), ], rank = "Kingdom")
```

If needed, the `TreeSE` can be converted to `phyloseq`.

```{r to_phyloseq}
pseq <- makePhyloseqFromTreeSE(tse)

pseq
```

### Metagenomics

Although the previous queries have been based on the results from `doQuery()`, from now on we will concentrate on combining and comparing results from specific studies. Since newly performed analyses are retrieved first in the `doQuery()` call, it's likely that by the time this vignette is read, the query results will be different. This is principally due to the rapid increase in MGnify submissions, leading to a potential lack of consistency between even closely spaced queries. As mentioned previously, it may be best to create the `MgnifyClient` object with `useCache = FALSE` for `doQuery()` calls, to ensure queries actually return the latest data.

For the remainder of this vignette, however, we'll be comparing three quite different studies: a study of saltmarsh soils from York University, human faecal samples from a survey of healthy Sardinians, and a set of samples from hydrothermal vents in the Mid-Cayman Rise in the Caribbean Sea. To simplify things, only the first 20 samples from each study will be used. Furthermore, the intention is only to demonstrate the functionality of the MGnifyR package, rather than to produce scientifically rigorous results.

```{r get_analyses}
soil <- searchAnalysis(mg, "studies", "MGYS00001447")
human <- searchAnalysis(mg, "studies", "MGYS00001442")
marine <- searchAnalysis(mg, "studies", "MGYS00001282")

# Combine analyses
all_accessions <- c(soil, human, marine)

head(all_accessions)
```

The first step with this new accession list is, as previously, to retrieve the associated metadata using `getMetadata()`. As seen with the `doQuery()` results, the returned `data.frame` contains a large number of columns. Being autogenerated and flexible, the column names can be a little difficult to predict, but examining `colnames(full_metadata)` should make things clearer.

```{r get_new_metadata}
full_metadata <- getMetadata(mg, all_accessions)

colnames(full_metadata)

head(full_metadata)
```

From `full_metadata` we get an idea of the type of data we're dealing with, and can extract useful information such as sequencing platform, source biome, etc. The next code snippet tallies a few of these columns to give an idea about what's available. The boxplot also indicates that while within-study read counts are similar, we probably need to use some sort of normalization procedure when comparing across samples. We might also want to drop particularly low read coverage samples from further analysis - a sketch of such a filter follows the plot below.

```{r full_metadata_explore}
# Load ggplot2
library(ggplot2)

# Distribution of sample source material:
table(full_metadata$`sample_environment-material`)

# What sequencing machine(s) were used?
table(full_metadata$`sample_instrument model`)

# Boxplot of raw read counts:
ggplot(
    full_metadata,
    aes(
        x = study_accession,
        y = log(as.numeric(`analysis_Submitted nucleotide sequences`)))) +
    geom_boxplot(aes(group = study_accession)) +
    theme_bw() +
    ylab("log(submitted reads)")
```
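The read-coverage filter mentioned above is not actually applied in this vignette, but a minimal sketch might look like the following. The `1e5` read-count threshold is entirely arbitrary and chosen for illustration only; subsequent steps continue with the unfiltered `full_metadata`.

```{r filter_low_reads}
# Illustrative only: drop analyses whose submitted read count falls below an
# arbitrary threshold before any cross-study comparison.
read_counts <- as.numeric(
    full_metadata$`analysis_Submitted nucleotide sequences`)
high_coverage <- full_metadata[!is.na(read_counts) & read_counts >= 1e5, ]

dim(high_coverage)
```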
Again, we can fetch the data by calling `getResult()`. Here, `bulk.dl = TRUE` has the potential to significantly speed up data retrieval. MGnify makes its functional results available in two separate ways: either on a per-analysis basis through the web API, or at the whole-study level as large tab-separated (TSV) files, with columns representing the results for each analysis. When `bulk.dl` is `FALSE`, `MGnifyR` queries the web API to get results, which (given that some functional analysis results may consist of thousands of entries) may take significant time. Setting `bulk.dl` to `TRUE` causes `MGnifyR` to determine the source study associated with a particular `analysis` and to instead download and parse its corresponding results file. Since this result file contains entries for all analyses associated with the study, by taking advantage of `MGnifyR`'s local caching this single download provides results for many future analyses. In some cases this affords several orders of magnitude speedup over the API query case.

Unfortunately, column entries in the per-study results files do not always directly correspond to those from a particular analysis run, causing the retrieval to fail. The principal cause of this is believed to be the running of multiple analysis jobs on the same sample. Thus for reliability, `bulk.dl` is `FALSE` by default. As a general recommendation though, you should try setting it to `TRUE` the first time `getResult()` is used on a set of accessions. If this fails, setting `bulk.dl` to `FALSE` will fall back to the more robust approach, allowing the analysis to continue - it might take a while though. Hopefully in the future the sample/analysis correspondence mismatches will be fixed and the default for `bulk.dl` will be switched to `TRUE`.

```{r get_mae}
mae <- getResult(mg, all_accessions, bulk.dl = TRUE)

mae
```

For metagenomic samples, the result is a `r BiocStyle::Biocpkg("MultiAssayExperiment")` (`MAE`) object, which links multiple `TreeSE` objects into one dataset. Taxonomic profiling data and functional data are held in separate experiments, and the experiments are linked to each other by their sample names. You can access an individual experiment by specifying its index or name; the available names can be listed with `names(mae)`.

```{r mae_access}
mae[[2]]
```

We can perform principal coordinate analysis (PCoA) on the microbial profiling data by utilizing miaverse tools.

```{r pcoa}
# Apply relative transformation
mae[[1]] <- transformAssay(mae[[1]], method = "relabundance")

# Perform PCoA
mae[[1]] <- runMDS(
    mae[[1]],
    assay.type = "relabundance",
    FUN = vegan::vegdist,
    method = "bray")

# Plot
plotReducedDim(mae[[1]], "MDS", colour_by = "sample_environment.feature")
```

## Fetch raw files

While `getResult()` can be utilized to retrieve microbial profiling data, `getData()` can be used more flexibly to retrieve any kind of data from the database. It returns the data in a simple data.frame or list format.

```{r fetch_data}
kegg <- getData(
    mg,
    type = "kegg-modules",
    accession = "MGYA00642773",
    accession.type = "analyses")

head(kegg)
```

## Fetch sequence files

Finally, we can use `searchFile()` and `getFile()` to retrieve other MGnify pipeline outputs, such as merged sequence reads, assembled contigs, and details of the functional analyses.
`searchFile()` is a simple wrapper function which, when supplied with a list of accessions, finds the URLs of the files we're after. In most cases we'll want to filter the returned list down to only the files of interest, which is easily done on the resulting data.frame object. In addition to the actual download location (the `download_url` column), extra columns include file type, contents and compression. It's recommended that the `colnames` of the `data.frame` be examined to get a grasp on the available metadata. To demonstrate the process, the code below retrieves a data.frame containing all available downloads for each accession we've been examining previously, and then filters it to retain only the annotated amino acid sequence files.

```{r get_download_urls}
# Find list of available downloads
dl_urls <- searchFile(
    mg, full_metadata$analysis_accession, type = "analyses")

# Filter table
target_urls <- dl_urls[
    dl_urls$attributes.description.label == "Predicted CDS with annotation", ]

head(target_urls)
```

To list the types of available files, and guide the filtering, something like the following might be useful.

```{r list_descriptions}
table(dl_urls$attributes.description.label)
```

Unlike other `MGnifyR` functions, `searchFile()` is not limited to `analyses`; by specifying a different `type`, other result types may be found. For instance, while general `genome` functionality is not yet integrated into `MGnifyR`, we can retrieve the associated files for a particular `genome` accession with the following:

```{r get_genome_urls}
genome_urls <- searchFile(mg, "MGYG000433953", type = "genomes")

genome_urls[, c("id", "attributes.file.format.name", "download_url")]
```

Having found a set of target URLs, the final step is to use `getFile()` to actually retrieve the file. Unlike other functions, this only works with a single URL location at once, so each entry in `target_urls` from above must be downloaded individually - easily done by either looping or `apply`ing over the list, as sketched after the next example. If the files are intended to be used with external programs, it might be easiest to provide a `file` parameter to the function call, which specifies a local filename for writing the file. By default `MGnifyR` will use the local cache, which can make getting to the file afterwards more awkward. Regardless, the default behaviour of `getFile()` is to retrieve the file specified in the parameter `url`, save it to disk, and return the filepath it was saved to.

```{r get_files}
# Just select a single file from the target_urls list for demonstration.
# Default behavior - use local cache.
cached_location1 <- getFile(mg, target_urls$download_url[[1]])

# A second call for the same URL - by default this too goes to the local cache.
cached_location2 <- getFile(
    mg, target_urls$download_url[[1]])

cached_location <- c(cached_location1, cached_location2)

# Where are the files?
cached_location
```
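Retrieving every file in `target_urls`, rather than just the first, is then simply a matter of applying the same call over the `download_url` column. The following is a minimal sketch; each downloaded file ends up in the local cache and the corresponding paths are returned.

```{r get_files_loop}
# Download each annotated CDS file in turn. getFile() returns the local
# filepath, so the result is a character vector of file locations.
all_locations <- vapply(
    target_urls$download_url,
    function(url) getFile(mg, url),
    character(1))

head(all_locations)
```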
A second download option is available, which allows built-in parsing of the file. If we know ahead of time what processing will be performed, it may be possible to integrate it into a function and pass this function to `getFile()` as the `read_func` argument. The function in question should take a single argument (the complete path name of the locally downloaded file), and the result of the call will be returned in place of the usual output file name. Alternatively, the files could first be downloaded in the standard way and then processed using this same function in a loop, so in many cases the `read_func` parameter is redundant.

However, many of the outputs from MGnify can be quite large, meaning local storage of many files may become an issue. By providing a `read_func` parameter (and, to avoid retaining local copies, creating the `MgnifyClient` object with `useCache = FALSE`), analysis of a large number of datasets may be possible with minimal storage requirements.

To illustrate, suppose we were interested in retrieving all detected sequences matching a particular PFAM motif in a set of analyses. The simple function below uses the `Biostrings` package to read an amino acid FASTA file, searches for a matching PFAM tag in the sequence names, and then tallies up the unique sequences into a single data.frame row. In this case the PFAM motif identifies sequences coding for the amoC gene, found in both ammonia- and methane-oxidizing organisms, but any other filtering method could be used.

```{r simple_parse_function}
library(Biostrings)

# Simple function to count unique sequences matching the PFAM amoC/mmoC motif
getAmoCseqs <- function(fname){
    sequences <- readAAStringSet(fname)
    tgtvec <- grepl("PF04896", names(sequences))
    as.data.frame(as.list(table(as.character(sequences[tgtvec]))))
}
```

Having defined the function, it just remains to include it in the call to `getFile()`.

```{r download_with_read}
# Just download a single accession for demonstration, specifying a read_func
amoC_seq_counts <- getFile(
    mg, target_urls$download_url[[1]], read_func = getAmoCseqs)

amoC_seq_counts
```

```{r session_info}
sessionInfo()
```