Metabolomics Workbench hosts a metabolomics data repository. It contains over 1000 publicly available studies, including raw data, processed data and metabolite/compound information.
The repository is searchable using a REST service API. The metabolomicsWorkbenchR package makes the endpoints of this service available in R and provides functionality to search the database and import datasets and metabolite information into commonly used formats such as data frames and SummarizedExperiment objects.
In this vignette we will use metabolomicsWorkbenchR to retrieve the uploaded peak matrix for a study. We will then use structToolbox to apply a basic workflow to analyse the data.
To install this package enter:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("metabolomicsWorkbenchR")
For older versions, please refer to the appropriate Bioconductor release.
The API endpoints for Metabolomics Workbench are accessible using the do_query function in metabolomicsWorkbenchR.
The do_query function takes four inputs:
- context: a valid context name (character)
- input_item: a valid input_item name (character)
- input_value: a valid input_value name (character)
- output_item: a valid output_item (character)
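As a minimal sketch of how the four arguments fit together, the call below searches study titles for a keyword and returns summary information for the matching studies. The search term 'Diabetes' is an illustrative choice and is not part of this vignette's workflow; the call queries the live service, so it needs network access and its result will change as studies are added.

```r
library(metabolomicsWorkbenchR)

# search the "study" context by title and return study summaries
# NOTE: 'Diabetes' is an arbitrary example search term
df = do_query(
    context = 'study',
    input_item = 'study_title',
    input_value = 'Diabetes',
    output_item = 'summary'
)

# inspect the first few matching studies
head(df)
```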
Contexts refer to the different database searches available in the API. The reader is referred to the API manual for details of each context.
In metabolomicsWorkbenchR contexts are stored as a list, and a list of valid contexts can be obtained using the names function:
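The call itself is not shown here; a minimal sketch follows, assuming the list of contexts is the object named context provided by the package. The object is referenced with the package prefix so that it cannot be confused with functions of the same name in other attached packages.

```r
# `context` is assumed to be the list of context objects shipped with the package
context_names = names(metabolomicsWorkbenchR::context)
print(context_names)
```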
## [1] "study" "compound" "refmet" "gene" "protein" "moverz"
## [7] "exactmass"
input_item is specific to a context. Valid input items for a context can be listed using the context_inputs function:
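The listing that follows shows both inputs and outputs for the "study" context. A sketch of calls that would produce such a listing is given here; it assumes the package's companion function context_outputs is used for the output items, and the cat() labels are added purely for readability.

```r
library(metabolomicsWorkbenchR)

# valid input items for the "study" context
study_inputs = context_inputs('study')
cat('Valid inputs:\n')
print(study_inputs)

# valid output items for the "study" context
# (assumes context_outputs is the counterpart of context_inputs)
study_outputs = context_outputs('study')
cat('\nValid outputs:\n')
print(study_outputs)
```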
## Valid inputs:
## [1] "study_id" "study_title" "institute" "last_name"
## [5] "analysis_id" "metabolite_id"
##
## Valid outputs:
## [1] "summary" "factors"
## [3] "analysis" "metabolites"
## [5] "mwtab" "source"
## [7] "species" "disease"
## [9] "number_of_metabolites" "data"
## [11] "datatable" "untarg_studies"
## [13] "untarg_factors" "untarg_data"
## [15] "metabolite_info" "SummarizedExperiment"
## [17] "untarg_SummarizedExperiment" "DatasetExperiment"
## [19] "untarg_DatasetExperiment"
First we query the database to return a list of untargeted studies. We use the “study” context in combination with a special case input item called “ignored” that is required for the “untarg_studies” output item.
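library(metabolomicsWorkbenchR)

# list all untargeted studies; the input item and value are placeholders
US = do_query(
    context = 'study',
    input_item = 'ignored',
    input_value = 'ignored',
    output_item = 'untarg_studies'
)

# show the first three columns of the first few rows
head(US[,1:3])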
## study_id analysis_id analysis_display
## 1 ST000009 AN000023 LC/Electro-spray /QTOF positive ion mode
## 2 ST000009 AN000024 LC/Electro-spray /QTOF negative ion mode
## 3 ST000010 AN000025 LC/Electro-spray /QTOF positive ion mode
## 4 ST000010 AN000026 LC/Electro-spray /QTOF negative ion mode
## 5 ST000045 AN000072 MS positive ion mode/C18
## 6 ST000045 AN000073 MS positive ion mode/HILIC
We will pull data for study “ST000010”. We can obtain summary information using the “summary” output item.
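The query that produced the summary is not shown; a minimal sketch follows. It uses the study ID that appears in the output that follows (ST000010) and assumes the result is transposed before printing so that each field appears on its own row.

```r
library(metabolomicsWorkbenchR)

# request summary information for a single study by its study ID
S = do_query(
    context = 'study',
    input_item = 'study_id',
    input_value = 'ST000010',
    output_item = 'summary'
)

# transpose so that each field is printed on its own row (assumed presentation)
t(S)
```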
## [,1]
## study_id "ST000010"
## study_title "Lung Cancer Cells 4"
## study_type "MS analysis (Untargeted)"
## institute "University of Michigan"
## department ""
## last_name "Keshamouni"
## first_name "Venkat"
## email "[email protected]"
## phone ""
## submit_date "2013-04-03"
## study_summary "In cancer cells, the process of epithelial–mesenchymal transition (EMT) confers migratory and invasive capacity, resistance to apoptosis, drug resistance, evasion of host immune surveillance and tumor stem cell traits. Cells undergoing EMT may represent tumor cells with metastatic potential. Characterizing the EMT secretome may identify biomarkers to monitor EMT in tumor progression and provide a prognostic signature to predict patient survival. Utilizing a transforming growth factor-β-induced cell culture model of EMT, we quantitatively profiled differentially secreted proteins, by GeLC-tandem mass spectrometry. Integrating with the corresponding transcriptome, we derived an EMT-associated secretory phenotype (EASP) comprising of proteins that were differentially upregulated both at protein and mRNA levels. Four independent primary tumor-derived gene expression data sets of lung cancers were used for survival analysis by the random survival forests (RSF) method. Analysis of 97-gene EASP expression in human lung adenocarcinoma tumors revealed strong positive correlations with lymph node metastasis, advanced tumor stage and histological grade. RSF analysis built on a training set (n = 442), including age, sex and stage as variables, stratified three independent lung cancer data sets into low-, medium- and high-risk groups with significant differences in overall survival. We further refined EASP to a 20 gene signature (rEASP) based on variable importance scores from RSF analysis. Similar to EASP, rEASP predicted survival of both adenocarcinoma and squamous carcinoma patients. More importantly, it predicted survival in the early-stage cancers. 
These results demonstrate that integrative analysis of the critical biological process of EMT provides mechanism-based and clinically relevant biomarkers with significant prognostic value.\nResearch is published, core data not used but project description is relevant:\nhttp://www.jimmunol.org/content/194/12/5789.long\n"
## subject_species "Homo sapiens"
As there are multiple datasets per study, untargeted data needs to be requested by Analysis ID. We will request DatasetExperiment format so that we can use the data directly with structToolbox.
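A sketch of such a request is given here. The analysis ID AN000025 is an assumption: it is the positive ion mode analysis listed for ST000010 in the table of untargeted studies, and the other analysis for that study could be requested in the same way. The result is stored as DE, the name used by the workflow that follows.

```r
library(metabolomicsWorkbenchR)

# request the untargeted peak matrix for one analysis as a DatasetExperiment
# NOTE: 'AN000025' is assumed here; substitute the analysis ID of interest
DE = do_query(
    context = 'study',
    input_item = 'analysis_id',
    input_value = 'AN000025',
    output_item = 'untarg_DatasetExperiment'
)

# print a short description of the returned object
DE
```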
Now we construct a minimal metabolomics workflow consisting of quality filtering, normalisation, imputation and scaling before applying PCA.
library(structToolbox)

# model sequence
M =
    mv_feature_filter(
        threshold = 40,
        method = 'across',
        factor_name = 'FCS') +
    mv_sample_filter(mv_threshold = 40) +
    vec_norm() +
    knn_impute() +
    log_transform() +
    mean_centre() +
    PCA()

# apply model
M = model_apply(M, DE)

# pca scores plot, using the final (PCA) step of the sequence
C = pca_scores_plot(factor_name = c('FCS'))
chart_plot(C, M[length(M)])
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] structToolbox_1.19.0 struct_1.19.0
## [3] curl_6.1.0 metabolomicsWorkbenchR_1.17.0
## [5] httptest_4.2.2 testthat_3.2.2
## [7] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2
## [3] dplyr_1.1.4 fastmap_1.2.0
## [5] digest_0.6.37 lifecycle_1.0.4
## [7] magrittr_2.0.3 compiler_4.4.2
## [9] rngtools_1.5.2 rlang_1.1.4
## [11] sass_0.4.9 tools_4.4.2
## [13] yaml_2.3.10 data.table_1.16.4
## [15] knitr_1.49 labeling_0.4.3
## [17] S4Arrays_1.7.1 doRNG_1.8.6
## [19] ontologyIndex_2.12 sp_2.1-4
## [21] DelayedArray_0.33.3 plyr_1.8.9
## [23] abind_1.4-8 withr_3.0.2
## [25] purrr_1.0.2 BiocGenerics_0.53.3
## [27] itertools_0.1-3 sys_3.4.3
## [29] stats4_4.4.2 colorspace_2.1-1
## [31] ggplot2_3.5.1 scales_1.3.0
## [33] iterators_1.0.14 MultiAssayExperiment_1.33.4
## [35] SummarizedExperiment_1.37.0 cli_3.6.3
## [37] rmarkdown_2.29 crayon_1.5.3
## [39] generics_0.1.3 reshape2_1.4.4
## [41] httr_1.4.7 BiocBaseUtils_1.9.0
## [43] cachem_1.1.0 stringr_1.5.1
## [45] ggthemes_5.1.0 parallel_4.4.2
## [47] impute_1.81.0 BiocManager_1.30.25
## [49] XVector_0.47.1 matrixStats_1.4.1
## [51] vctrs_0.6.5 Matrix_1.7-1
## [53] jsonlite_1.8.9 IRanges_2.41.2
## [55] S4Vectors_0.45.2 maketools_1.3.1
## [57] foreach_1.5.2 jquerylib_0.1.4
## [59] missForest_1.5 glue_1.8.0
## [61] codetools_0.2-20 stringi_1.8.4
## [63] gtable_0.3.6 GenomeInfoDb_1.43.2
## [65] GenomicRanges_1.59.1 UCSC.utils_1.3.0
## [67] munsell_0.5.1 tibble_3.2.1
## [69] pillar_1.10.0 pcaMethods_1.99.0
## [71] htmltools_0.5.8.1 brio_1.1.5
## [73] pmp_1.19.0 randomForest_4.7-1.2
## [75] GenomeInfoDbData_1.2.13 R6_2.5.1
## [77] evaluate_1.0.1 lattice_0.22-6
## [79] Biobase_2.67.0 bslib_0.8.0
## [81] Rcpp_1.0.13-1 gridExtra_2.3
## [83] SparseArray_1.7.2 xfun_0.49
## [85] MatrixGenerics_1.19.0 buildtools_1.0.0
## [87] pkgconfig_2.0.3