MGnifyR
is a package designed to ease access to the
EBI’s MGnify resource,
allowing searching and retrieval of multiple datasets for downstream
analysis.
The latest version of MGnifyR seamlessly integrates with the miaverse framework providing access to cutting-edge tools in microbiome down-stream analytics.
MGnifyR
is hosted on Bioconductor, and can be installed
using via BiocManager
.
MGnifyR
packageOnce installed, MGnifyR
is made available in the usual
way.
library(MGnifyR)
#> Loading required package: MultiAssayExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#>
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#>
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#> colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#> colWeightedMeans, colWeightedMedians, colWeightedSds,
#> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#> rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#> rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:generics':
#>
#> intersect, setdiff, setequal, union
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff,
#> setequal, table, tapply, union, unique, unsplit, which.max,
#> which.min
#> Loading required package: S4Vectors
#> Warning: multiple methods tables found for 'setequal'
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Warning: replacing previous import 'BiocGenerics::setequal' by
#> 'S4Vectors::setequal' when loading 'IRanges'
#> Loading required package: GenomeInfoDb
#> Warning: replacing previous import 'BiocGenerics::setequal' by
#> 'S4Vectors::setequal' when loading 'GenomeInfoDb'
#> Warning: replacing previous import 'BiocGenerics::setequal' by
#> 'S4Vectors::setequal' when loading 'GenomicRanges'
#> Warning: replacing previous import 'BiocGenerics::setequal' by
#> 'S4Vectors::setequal' when loading 'XVector'
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#>
#> rowMedians
#> The following objects are masked from 'package:matrixStats':
#>
#> anyMissing, rowMedians
#> Warning: replacing previous import 'BiocGenerics::setequal' by
#> 'S4Vectors::setequal' when loading 'SummarizedExperiment'
#> Warning: replacing previous import 'BiocGenerics::setequal' by
#> 'S4Vectors::setequal' when loading 'S4Arrays'
#> Warning: replacing previous import 'BiocGenerics::setequal' by
#> 'S4Vectors::setequal' when loading 'DelayedArray'
#> Warning: replacing previous import 'BiocGenerics::setequal' by
#> 'S4Vectors::setequal' when loading 'SparseArray'
#> Warning: replacing previous import 'S4Arrays::read_block' by
#> 'DelayedArray::read_block' when loading 'SummarizedExperiment'
#> Warning: replacing previous import 'BiocGenerics::setequal' by
#> 'S4Vectors::setequal' when loading 'MultiAssayExperiment'
#> Loading required package: TreeSummarizedExperiment
#> Loading required package: SingleCellExperiment
#> Loading required package: Biostrings
#> Loading required package: XVector
#> Warning: replacing previous import 'BiocGenerics::setequal' by
#> 'S4Vectors::setequal' when loading 'Biostrings'
#> Warning: multiple methods tables found for 'setequal'
#>
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#>
#> strsplit
All functions in MGnifyR
make use of a
MgnifyClient
object to keep track of the JSONAPI url, disk
cache location and user access tokens. Thus the first thing to do when
starting any analysis is to instantiate this object. The following
snippet creates this.
mg <- MgnifyClient(useCache = TRUE)
mg
#> An object of class "MgnifyClient"
#> Slot "databaseUrl":
#> [1] "https://www.ebi.ac.uk/metagenomics/api/v1"
#>
#> Slot "authTok":
#> [1] NA
#>
#> Slot "useCache":
#> [1] TRUE
#>
#> Slot "cacheDir":
#> [1] "/tmp/RtmpQoJKHV/.MGnifyR_cache"
#>
#> Slot "showWarnings":
#> [1] FALSE
#>
#> Slot "clearCache":
#> [1] FALSE
#>
#> Slot "verbose":
#> [1] TRUE
The MgnifyClient
object contains slots for each of the
previously mentioned settings.
doQuery()
function can be utilized to search results
such as samples and studies from MGnify database. Below, we fetch
information drinking water samples.
# Fetch studies
samples <- doQuery(
mg,
type = "samples",
biome_name = "root:Environmental:Aquatic:Freshwater:Drinking water",
max.hits = 10)
The result is a table containing accession IDs and description – in this case – on samples.
Now we want to find analysis accessions. Each sample might have multiple analyses. Each analysis ID corresponds to a single run of a particular pipeline on a single sample in a single study.
By running the searchAnalysis()
function, we get a
vector of analysis IDs of samples that we fed as an input.
We can now check the metadata to get hint of what kind of data we
have. We use getMetadata()
function to fetch data based on
analysis IDs.
The returned value is a data.frame
that includes
metadata for example on how analysis was conducted and what kind of
samples were analyzed.
After we have selected the data to fetch, we can use
getResult()
The output is TreeSummarizedExperiment
(TreeSE
) or MultiAssayExperiment
(MAE
) depending on the dataset. If the dataset includes
only taxonomic profiling data, the output is a single
TreeSE
. If dataset includes also functional data, the
output is multiple TreeSE
objects that are linked together
by utilizing MAE
.
mae
#> A MultiAssayExperiment object of 6 listed
#> experiments with user-defined names and respective classes.
#> Containing an ExperimentList class object of length 6:
#> [1] microbiota: TreeSummarizedExperiment with 3506 rows and 50 columns
#> [2] go-slim: TreeSummarizedExperiment with 116 rows and 38 columns
#> [3] go-terms: TreeSummarizedExperiment with 3133 rows and 38 columns
#> [4] interpro-identifiers: TreeSummarizedExperiment with 18223 rows and 38 columns
#> [5] taxonomy: TreeSummarizedExperiment with 3617 rows and 50 columns
#> [6] taxonomy-lsu: TreeSummarizedExperiment with 3378 rows and 42 columns
#> Functionality:
#> experiments() - obtain the ExperimentList instance
#> colData() - the primary/phenotype DataFrame
#> sampleMap() - the sample coordination DataFrame
#> `$`, `[`, `[[` - extract colData columns, subset, or experiment
#> *Format() - convert into a long or wide DataFrame
#> assays() - convert ExperimentList to a SimpleList of matrices
#> exportClass() - save data to flat files
You can get access to individual TreeSE
object in
MAE
by specifying index or name.
mae[[1]]
#> class: TreeSummarizedExperiment
#> dim: 3506 50
#> metadata(0):
#> assays(1): counts
#> rownames(3506): 82608 62797 ... 5820 6794
#> rowData names(9): Kingdom Phylum ... taxonomy1 taxonomy
#> colnames(50): MGYA00144458 MGYA00144419 ... MGYA00652185 MGYA00652201
#> colData names(64): analysis_analysis.status analysis_pipeline.version
#> ... sample_geo.loc.name sample_instrument.model
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL
TreeSE
object is uniquely positioned to support
SummarizedExperiment
-based microbiome data manipulation and
visualization. Moreover, it enables access to miaverse
tools. For example, we can estimate diversity of samples…
library(mia)
#> This is mia version 1.15.0
#> - Online documentation and vignettes: https://microbiome.github.io/mia/
#> - Online book 'Orchestrating Microbiome Analysis (OMA)': https://microbiome.github.io/OMA/docs/devel/
mae[[1]] <- estimateDiversity(mae[[1]], index = "shannon")
#> Warning in estimateDiversity(mae[[1]], index = "shannon"): 'estimateDiversity'
#> is deprecated. Use 'addAlpha' instead.
library(scater)
#> Loading required package: scuttle
#> Loading required package: ggplot2
plotColData(mae[[1]], "shannon", x = "sample_environment..biome.")
… and plot abundances of most abundant phyla.
# Agglomerate data
altExps(mae[[1]]) <- splitByRanks(mae[[1]])
library(miaViz)
#> Loading required package: ggraph
#>
#> Attaching package: 'miaViz'
#> The following object is masked from 'package:mia':
#>
#> plotNMDS
# Plot top taxa
top_taxa <- getTopFeatures(altExp(mae[[1]], "Phylum"), 10)
#> Warning in getTopFeatures(altExp(mae[[1]], "Phylum"), 10): 'getTopFeatures' is
#> deprecated. Use 'getTop' instead.
plotAbundance(
altExp(mae[[1]], "Phylum")[top_taxa, ],
rank = "Phylum",
as.relative = TRUE
)
#> Warning: The following values are already present in `metadata` and will be
#> overwritten: 'agglomerated_by_rank'. Consider using the 'name' argument to
#> specify alternative names.
We can also perform other analyses such as principal component analysis to microbial profiling data by utilizing miaverse tools.
While getResult()
can be utilized to retrieve microbial
profiling data, getData()
can be used more flexibly to
retrieve any kind of data from the database. It returns data as simple
data.frame or list format.
colnames(publications) |> head()
#> [1] "document.id" "type"
#> [3] "id" "attributes.pubmed-id"
#> [5] "attributes.pubmed-central-id" "attributes.pub-title"
The result is a data.frame
by default. In this case, it
includes information on publications fetched from the data portal.
Finally, we can use searchFile()
and
getFile()
to retrieve other MGnify pipeline outputs such as
merged sequence reads, assembled contigs, and details of the functional
analyses.
With searchFile()
, we can search files from the
database.
The returned table contains search results related to analyses that we fed as an input. The table contains information on file and also URL address from where the file can be loaded.
target_urls <- dl_urls[
dl_urls$attributes.description.label == "Predicted alpha tmRNA", ]
colnames(target_urls) |> head()
#> [1] "type" "id"
#> [3] "attributes.alias" "attributes.file.format.name"
#> [5] "attributes.file.format.extension" "attributes.file.format.compression"
Finally, we can download the files with getFile()
.
# Just select a single file from the target_urls list for demonstration.
file_url <- target_urls$download_url[[1]]
cached_location <- getFile(mg, file_url)
The function returns a path where the file is stored.
# Where are the files?
cached_location
#> [1] "/.MGnifyR_cache/analyses/MGYA00652201/file/ERZ20300939_alpha_tmRNA.RF01849.fasta.gz"
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] miaViz_1.15.0 ggraph_2.2.1
#> [3] scater_1.35.0 ggplot2_3.5.1
#> [5] scuttle_1.17.0 mia_1.15.0
#> [7] MGnifyR_1.3.0 TreeSummarizedExperiment_2.15.0
#> [9] Biostrings_2.75.0 XVector_0.47.0
#> [11] SingleCellExperiment_1.29.0 MultiAssayExperiment_1.33.0
#> [13] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [15] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
#> [17] IRanges_2.41.0 S4Vectors_0.45.0
#> [19] BiocGenerics_0.53.1 generics_0.1.3
#> [21] MatrixGenerics_1.19.0 matrixStats_1.4.1
#> [23] knitr_1.48 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.4.1 ggplotify_0.1.2
#> [3] urltools_1.7.3 tibble_3.2.1
#> [5] triebeard_0.4.1 polyclip_1.10-7
#> [7] rpart_4.1.23 DirichletMultinomial_1.49.0
#> [9] lifecycle_1.0.4 lattice_0.22-6
#> [11] MASS_7.3-61 SnowballC_0.7.1
#> [13] backports_1.5.0 magrittr_2.0.3
#> [15] Hmisc_5.2-0 sass_0.4.9
#> [17] rmarkdown_2.28 jquerylib_0.1.4
#> [19] yaml_2.3.10 DBI_1.2.3
#> [21] buildtools_1.0.0 minqa_1.2.8
#> [23] abind_1.4-8 zlibbioc_1.51.2
#> [25] purrr_1.0.2 yulab.utils_0.1.7
#> [27] nnet_7.3-19 tweenr_2.0.3
#> [29] sandwich_3.1-1 GenomeInfoDbData_1.2.13
#> [31] ggrepel_0.9.6 tokenizers_0.3.0
#> [33] irlba_2.3.5.1 tidytree_0.4.6
#> [35] maketools_1.3.1 vegan_2.6-8
#> [37] rbiom_1.0.3 tidyjson_0.3.2
#> [39] permute_0.9-7 DelayedMatrixStats_1.29.0
#> [41] codetools_0.2-20 DelayedArray_0.33.1
#> [43] ggforce_0.4.2 tidyselect_1.2.1
#> [45] aplot_0.2.3 UCSC.utils_1.3.0
#> [47] farver_2.1.2 lme4_1.1-35.5
#> [49] ScaledMatrix_1.15.0 viridis_0.6.5
#> [51] base64enc_0.1-3 jsonlite_1.8.9
#> [53] BiocNeighbors_2.1.0 decontam_1.27.0
#> [55] tidygraph_1.3.1 Formula_1.2-5
#> [57] tools_4.4.1 ggnewscale_0.5.0
#> [59] treeio_1.31.0 Rcpp_1.0.13
#> [61] glue_1.8.0 gridExtra_2.3
#> [63] SparseArray_1.7.0 BiocBaseUtils_1.9.0
#> [65] xfun_0.49 mgcv_1.9-1
#> [67] dplyr_1.1.4 withr_3.0.2
#> [69] BiocManager_1.30.25 fastmap_1.2.0
#> [71] boot_1.3-31 bluster_1.17.0
#> [73] fansi_1.0.6 digest_0.6.37
#> [75] rsvd_1.0.5 gridGraphics_0.5-1
#> [77] R6_2.5.1 colorspace_2.1-1
#> [79] lpSolve_5.6.21 utf8_1.2.4
#> [81] tidyr_1.3.1 data.table_1.16.2
#> [83] DECIPHER_3.3.0 graphlayouts_1.2.0
#> [85] httr_1.4.7 htmlwidgets_1.6.4
#> [87] S4Arrays_1.7.1 pkgconfig_2.0.3
#> [89] gtable_0.3.6 sys_3.4.3
#> [91] janeaustenr_1.0.0 htmltools_0.5.8.1
#> [93] scales_1.3.0 ggfun_0.1.7
#> [95] rstudioapi_0.17.1 reshape2_1.4.4
#> [97] checkmate_2.3.2 nlme_3.1-166
#> [99] nloptr_2.1.1 cachem_1.1.0
#> [101] zoo_1.8-12 stringr_1.5.1
#> [103] parallel_4.4.1 vipor_0.4.7
#> [105] foreign_0.8-87 pillar_1.9.0
#> [107] grid_4.4.1 vctrs_0.6.5
#> [109] slam_0.1-54 BiocSingular_1.23.0
#> [111] beachmat_2.23.0 cluster_2.1.6
#> [113] beeswarm_0.4.0 htmlTable_2.4.3
#> [115] evaluate_1.0.1 mvtnorm_1.3-1
#> [117] cli_3.6.3 compiler_4.4.1
#> [119] rlang_1.1.4 crayon_1.5.3
#> [121] tidytext_0.4.2 labeling_0.4.3
#> [123] mediation_4.5.0 plyr_1.8.9
#> [125] fs_1.6.5 ggbeeswarm_0.7.2
#> [127] stringi_1.8.4 viridisLite_0.4.2
#> [129] BiocParallel_1.41.0 assertthat_0.2.1
#> [131] munsell_0.5.1 lazyeval_0.2.2
#> [133] Matrix_1.7-1 patchwork_1.3.0
#> [135] sparseMatrixStats_1.19.0 highr_0.11
#> [137] igraph_2.1.1 memoise_2.0.1
#> [139] RcppParallel_5.1.9 bslib_0.8.0
#> [141] ggtree_3.15.0 ape_5.8