One of the functionalities of the SIAMCAT
package is the
training of statistical machine learning models on metagenomics data. In
this vignette, we demonstrate how such a model can be built on one
dataset and then be applied on another, similarly processed holdout
dataset. This might be of interest when comparing data from two
different studies on the same disease.
In this vignette, we look at two datasets from studies on colorectal cancer (CRC). The first study from Zeller et al. investigated metagenomic markers for CRC in a population in France, while the second study from Yu et al. used samples from China for the same goal. Both datasets were profiled with the same taxonomic profiling tool, yielding the same taxonomic identifiers, which is required for holdout testing.
The datasets can be found on the web repository for public metagenomics datasets from the Zeller group.
library("SIAMCAT")
data.loc <- 'https://zenodo.org/api/files/d81e429c-870f-44e0-a44a-2a4aa541b6c1/'
# this is data from Zeller et al., Mol. Syst. Biol. 2014
fn.meta.fr <- paste0(data.loc, 'meta_Zeller.tsv')
fn.feat.fr <- paste0(data.loc, 'specI_Zeller.tsv')
# this is the external dataset from Yu et al., Gut 2017
fn.meta.cn <- paste0(data.loc, 'meta_Yu.tsv')
fn.feat.cn <- paste0(data.loc, 'specI_Yu.tsv')
First of all, we build a SIAMCAT
object using the data
from the French study in the same way that we have seen before in the
main SIAMCAT
vignette.
# features
# be vary of the defaults in R!!!
feat.fr <- read.table(fn.feat.fr, sep='\t', quote="",
check.names = FALSE, stringsAsFactors = FALSE)
# the features are counts, but we want to work with relative abundances
feat.fr.rel <- prop.table(as.matrix(feat.fr), 2)
# metadata
meta.fr <- read.table(fn.meta.fr, sep='\t', quote="",
check.names=FALSE, stringsAsFactors=FALSE)
# create SIAMCAT object
siamcat.fr <- siamcat(feat=feat.fr.rel, meta=meta.fr,
label='Group', case='CRC')
## + starting create.label
## Label used as case:
## CRC
## Label used as control:
## CTR
## + finished create.label.from.metadata in 0.001 s
## + starting validate.data
## +++ checking overlap between labels and features
## + Keeping labels of 141 sample(s).
## +++ checking sample number per class
## +++ checking overlap between samples and metadata
## + finished validate.data in 0.016 s
We can load the data from the Chinese study in a similar way and also
create a SIAMCAT
object for the holdout dataset.
# features
feat.cn <- read.table(fn.feat.cn, sep='\t', quote="",
check.names = FALSE)
feat.cn.rel <- prop.table(as.matrix(feat.cn), 2)
# metadata
meta.cn <- read.table(fn.meta.cn, sep='\t', quote="",
check.names=FALSE, stringsAsFactors = FALSE)
# SIAMCAT object
siamcat.cn <- siamcat(feat=feat.cn.rel, meta=meta.cn,
label='Group', case='CRC')
## + starting create.label
## Label used as case:
## CRC
## Label used as control:
## CTR
## + finished create.label.from.metadata in 0.001 s
## + starting validate.data
## +++ checking overlap between labels and features
## + Keeping labels of 128 sample(s).
## +++ checking sample number per class
## +++ checking overlap between samples and metadata
## + finished validate.data in 0.015 s
With the French dataset, we perform the complete process of model
building within SIAMCAT
, including data preprocessing steps
like data validation, filtering, and data normalization.
siamcat.fr <- filter.features(
siamcat.fr,
filter.method = 'abundance',
cutoff = 0.001,
rm.unmapped = TRUE,
verbose=2
)
## + starting filter.features
## +++ before filtering, the data have 1754 features
## +++ removed 1 features corresponding to UNMAPPED reads
## +++ removed 1539 features whose values did not exceed 0.001 in any sample (retaining 215)
## + finished filter.features in 0.003 s
siamcat.fr <- normalize.features(
siamcat.fr,
norm.method = "log.std",
norm.param = list(log.n0 = 1e-06, sd.min.q = 0.1),
verbose = 2
)
## + starting normalize.features
## +++ performing de novo normalization using the log.std method
## + feature sparsity before normalization: 46.05%
## +++ feature sparsity after normalization: 0 %
## + finished normalize.features in 0.003 s
Now, we can build the statistical model. We use the same parameters
as in the main SIAMCAT
vignette, where the process is
explained in more detail.
## Features splitted for cross-validation successfully.
## Trained lasso models successfully.
Finally, we can make predictions for each cross-validation fold and
evaluate the predictions as seen in the main SIAMCAT
vignette.
## Made predictions successfully.
## Evaluated predictions successfully.
Now that we have successfully built the model for the French dataset,
we can apply it to the Chinese holdout dataset. First, we will normalize
the Chinese dataset with the same parameters that we used for the French
dataset in order to make the data comparable. For that step, we can use
the frozen normalization functionality in the
normalize.features
function in SIAMCAT
. We
supply to the function all normalization parameters saved in the
siamcat.fr
object, which can be accessed using the
norm_params
accessor.
siamcat.cn <- normalize.features(siamcat.cn,
norm.param=norm_params(siamcat.fr),
feature.type='original',
verbose = 2)
## + starting normalize.features
## + normalizing original features
## + performing frozen log.std normalization using the supplied parameters
## + feature sparsity before normalization: 49.77%
## + feature sparsity after normalization: 0%
## + finished normalize.features in 0.003 s
Next, we apply the trained model to predict the holdout dataset.
siamcat.cn <- make.predictions(
siamcat = siamcat.fr,
siamcat.holdout = siamcat.cn,
normalize.holdout = FALSE)
## Warning in make.external.predictions(siamcat.trained = siamcat,
## siamcat.external = siamcat.holdout, : holdout set is not being normalized!
## Made predictions successfully.
Note that the make.predictions
function can also take
care of the normalization of the holdout dataset.
## Alternative Code, not run here
siamcat.cn <- siamcat(feat=feat.cn.rel, meta=meta.cn,
label='Group', case='CRC')
siamcat.cn <- make.predictions(siamcat = siamcat.fr,
siamcat.holdout = siamcat.cn,
normalize.holdout = TRUE)
Again, we have to evaluate the predictions:
Now, we can compare the performance of the classifier on the original
and the holdout dataset by using the model.evaluation.plot
function. Here, we can supply several SIAMCAT
objects for
which the model evaluation will be plotted in the same plot. Note that
we can supply the objects as named objects in order to print the names
in the legend.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggpubr_0.6.0 SIAMCAT_2.11.0 phyloseq_1.51.0 mlr3_0.22.1
## [5] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [9] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [13] ggplot2_3.5.1 tidyverse_2.0.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3 jsonlite_1.8.9
## [4] shape_1.4.6.1 magrittr_2.0.3 farver_2.1.2
## [7] corrplot_0.95 nloptr_2.1.1 rmarkdown_2.29
## [10] vctrs_0.6.5 multtest_2.63.0 minqa_1.2.8
## [13] PRROC_1.3.1 rstatix_0.7.2 htmltools_0.5.8.1
## [16] progress_1.2.3 curl_6.1.0 broom_1.0.7
## [19] Rhdf5lib_1.29.0 Formula_1.2-5 rhdf5_2.51.2
## [22] pROC_1.18.5 sass_0.4.9 parallelly_1.41.0
## [25] bslib_0.8.0 plyr_1.8.9 palmerpenguins_0.1.1
## [28] mlr3tuning_1.3.0 cachem_1.1.0 uuid_1.2-1
## [31] buildtools_1.0.0 igraph_2.1.3 lifecycle_1.0.4
## [34] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.7-1
## [37] R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [40] rbibutils_2.3 future_1.34.0 digest_0.6.37
## [43] numDeriv_2016.8-1.1 colorspace_2.1-1 S4Vectors_0.45.2
## [46] mlr3misc_0.16.0 vegan_2.6-8 labeling_0.4.3
## [49] timechange_0.3.0 httr_1.4.7 abind_1.4-8
## [52] mgcv_1.9-1 compiler_4.4.2 beanplot_1.3.1
## [55] bit64_4.6.0-1 withr_3.0.2 backports_1.5.0
## [58] carData_3.0-5 ggsignif_0.6.4 LiblineaR_2.10-24
## [61] MASS_7.3-64 biomformat_1.35.0 permute_0.9-7
## [64] tools_4.4.2 ape_5.8-1 glue_1.8.0
## [67] lgr_0.4.4 nlme_3.1-166 rhdf5filters_1.19.0
## [70] grid_4.4.2 checkmate_2.3.2 gridBase_0.4-7
## [73] cluster_2.1.8 reshape2_1.4.4 ade4_1.7-22
## [76] generics_0.1.3 gtable_0.3.6 tzdb_0.4.0
## [79] data.table_1.16.4 hms_1.1.3 car_3.1-3
## [82] XVector_0.47.2 BiocGenerics_0.53.3 foreach_1.5.2
## [85] pillar_1.10.1 vroom_1.6.5 bbotk_1.5.0
## [88] splines_4.4.2 lattice_0.22-6 bit_4.5.0.1
## [91] survival_3.8-3 tidyselect_1.2.1 maketools_1.3.1
## [94] Biostrings_2.75.3 knitr_1.49 reformulas_0.4.0
## [97] infotheo_1.2.0.1 gridExtra_2.3 IRanges_2.41.2
## [100] stats4_4.4.2 xfun_0.50 Biobase_2.67.0
## [103] matrixStats_1.5.0 stringi_1.8.4 UCSC.utils_1.3.1
## [106] yaml_2.3.10 boot_1.3-31 evaluate_1.0.3
## [109] codetools_0.2-20 BiocManager_1.30.25 cli_3.6.3
## [112] Rdpack_2.6.2 munsell_0.5.1 jquerylib_0.1.4
## [115] mlr3learners_0.9.0 Rcpp_1.0.14 GenomeInfoDb_1.43.2
## [118] globals_0.16.3 parallel_4.4.2 prettyunits_1.2.0
## [121] paradox_1.0.1 lme4_1.1-36 listenv_0.9.1
## [124] glmnet_4.1-8 lmerTest_3.1-3 scales_1.3.0
## [127] crayon_1.5.3 rlang_1.1.4 mlr3measures_1.0.0