This vignette illustrates how to read your own data and input it into the SIAMCAT package. We will cover reading text files from disk, formatting them, and using them to create an object of siamcat-class.
The siamcat-class is the centerpiece of the package. All of the input data and results are stored inside of it. The structure of the object is described below in the siamcat-class object section.
Generally, there are three types of input for SIAMCAT: features, metadata, and label.
The features should be a matrix, a data.frame, or an otu_table, organized as follows:
features (in rows) x samples (in columns).
| | Sample_1 | Sample_2 | Sample_3 | Sample_4 | Sample_5 |
|---|---|---|---|---|---|
| Feature_1 | 0.59 | 0.71 | 0.78 | 0.61 | 0.66 |
| Feature_2 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 |
| Feature_3 | 0.02 | 0.00 | 0.00 | 0.00 | 0.20 |
| Feature_4 | 0.34 | 0.00 | 0.13 | 0.07 | 0.00 |
| Feature_5 | 0.06 | 0.16 | 0.00 | 0.00 | 0.00 |
Please note that SIAMCAT is designed to work with relative abundances. Other types of data (e.g. counts) will also work, but not all functions of the package will produce meaningful output.
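If your feature table holds raw counts, one common way to obtain relative abundances is to divide each sample (column) by its total. The following is a minimal base R sketch using a made-up count matrix; the object names counts and feat.rel are illustrative and not part of SIAMCAT.

```r
# Toy count matrix (made-up numbers): features in rows, samples in columns
counts <- matrix(
    c(10, 0, 5, 85,
      20, 30, 0, 50,
      0, 1, 1, 2),
    nrow = 4,
    dimnames = list(paste0("Feature_", 1:4), paste0("Sample_", 1:3))
)

# Divide every column by its column total so that each sample sums to 1
feat.rel <- prop.table(counts, margin = 2)

# Sanity check: every sample should now sum to 1 (up to floating-point error)
stopifnot(all(abs(colSums(feat.rel) - 1) < 1e-12))
```

Whether total-sum scaling is appropriate depends on how the counts were generated, so treat this as a starting point rather than a recommendation.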
An example of a typical feature file is attached to the SIAMCAT package, containing data from a publication investigating the microbiome in colorectal cancer (CRC) patients and controls (Zeller et al.). The metagenomics data were processed with the MOCAT pipeline, returning taxonomic profiles at the species level (specI):
library(SIAMCAT)
fn.in.feat <- system.file(
"extdata",
"feat_crc_zeller_msb_mocat_specI.tsv",
package = "SIAMCAT"
)
One way to load such data into R is to use read.table. (Beware of the defaults in R! They are not always useful…)
feat <- read.table(fn.in.feat, sep='\t',
header=TRUE, quote='',
stringsAsFactors = FALSE, check.names = FALSE)
# look at some features
feat[110:114, 1:2]
## CCIS27304052ST-3-0 CCIS15794887ST-4-0
## Bacteroides caccae [h:1096] 1.557937e-03 1.761949e-03
## Bacteroides eggerthii [h:1097] 2.734527e-05 4.146882e-05
## Bacteroides stercoris [h:1098] 1.173786e-03 2.475838e-03
## Bacteroides clarus [h:1099] 4.830533e-04 4.589747e-06
## Methanohalophilus mahii [h:11] 0.000000e+00 0.000000e+00
The metadata should be either a matrix or a data.frame, organized as follows:
samples (in rows) x metadata (in columns):
| | Age | Gender | BMI |
|---|---|---|---|
| Sample_1 | 52 | 1 | 20 |
| Sample_2 | 37 | 1 | 18 |
| Sample_3 | 66 | 2 | 24 |
| Sample_4 | 54 | 2 | 26 |
| Sample_5 | 65 | 2 | 30 |
The rownames of the metadata should match the colnames of the feature matrix.
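A quick way to verify this before building any SIAMCAT object is to compare the two sets of names directly. The sketch below uses small made-up tables and base R only; it restricts both tables to the shared samples and puts them in the same order.

```r
# Toy feature matrix: features in rows, samples in columns
feat <- matrix(
    c(0.59, 0.41, 0.71, 0.29, 0.78, 0.22),
    nrow = 2,
    dimnames = list(c("Feature_1", "Feature_2"),
                    c("Sample_1", "Sample_2", "Sample_3"))
)

# Toy metadata, deliberately stored in a different sample order
meta <- data.frame(
    Age = c(66, 52, 37),
    BMI = c(24, 20, 18),
    row.names = c("Sample_3", "Sample_1", "Sample_2")
)

# Samples that are present in both tables (order follows the feature matrix)
shared <- intersect(colnames(feat), rownames(meta))

# Restrict both tables to the shared samples, in the same order
feat <- feat[, shared, drop = FALSE]
meta <- meta[shared, , drop = FALSE]

# After alignment the names agree exactly
stopifnot(identical(colnames(feat), rownames(meta)))
```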
Again, an example of such a file is attached to the SIAMCAT package, taken from the same study:
fn.in.meta <- system.file(
"extdata",
"num_metadata_crc_zeller_msb_mocat_specI.tsv",
package = "SIAMCAT"
)
Here, too, read.table can be used to load the data into R.
meta <- read.table(fn.in.meta, sep='\t',
header=TRUE, quote='',
stringsAsFactors = FALSE, check.names = FALSE)
head(meta)
## age gender bmi diagnosis localization crc_stage fobt
## CCIS27304052ST-3-0 52 1 20 0 NA 0 0
## CCIS15794887ST-4-0 37 1 18 0 NA 0 0
## CCIS74726977ST-3-0 66 2 24 1 NA 0 0
## CCIS16561622ST-4-0 54 2 26 0 NA 0 0
## CCIS79210440ST-3-0 65 2 30 0 NA 0 1
## CCIS82507866ST-3-0 57 2 24 0 NA 0 0
## wif_test
## CCIS27304052ST-3-0 0
## CCIS15794887ST-4-0 0
## CCIS74726977ST-3-0 NA
## CCIS16561622ST-4-0 0
## CCIS79210440ST-3-0 0
## CCIS82507866ST-3-0 0
Finally, the label can come in three different flavours:
Named vector: A named vector containing information about cases and controls. The names of the vector should match the rownames of the metadata and the colnames of the feature data. The label can encode cases and controls in different ways, for example as numbers (e.g. 0 and 1) or as character strings (e.g. CTR and IBD).
Metadata column: You can provide the name of a column in the metadata for the creation of the label. See below for an example.
Label file: SIAMCAT has a function called read.label, which will create a label object from a label file. The file should contain a header line of the form
#BINARY:1=[label for cases];-1=[label for controls]
and encode the class labels as 1 for each case and -1 for each control. An example file is attached to the package again, if you want to have a look at it.
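The named-vector flavour can be sketched in base R as follows. The metadata here is a made-up toy table, and label.vec and label.chr are illustrative names; the sketch only shows how the names of the vector are tied to the sample identifiers.

```r
# Toy metadata: samples in rows, with a 0/1 diagnosis column (made-up values)
meta <- data.frame(
    age = c(52, 37, 66, 54, 65),
    diagnosis = c(0, 0, 1, 0, 0),
    row.names = paste0("Sample_", 1:5)
)

# Numeric flavour: values from the metadata column, names from the sample IDs
label.vec <- setNames(meta$diagnosis, rownames(meta))

# Character flavour: the same information with readable group names
label.chr <- setNames(
    ifelse(meta$diagnosis == 1, "cancer", "healthy"),
    rownames(meta)
)

# Number of samples per group
table(label.chr)
```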
For our example dataset, we can create the label object out of the metadata column called diagnosis:
When we later plot the results, it might be nicer to have names for the different groups stored in the label object (instead of 1 and 0). We can also supply them to the create.label function:
label <- create.label(meta=meta, label="diagnosis",
case = 1, control=0,
p.lab = 'cancer', n.lab = 'healthy')
## Label used as case:
## 1
## Label used as control:
## 0
## + finished create.label.from.metadata in 0.001 s
## healthy cancer
## -1 1
Note:
If you have no label information for your dataset, you can still create a SIAMCAT object from your features alone. The SIAMCAT object without label information will contain a TEST label that can be used for making holdout predictions. Other functions, e.g. model training, will not work on such an object.
LEfSe is a tool for identifying associations between microbial features and up to two metadata variables. LEfSe uses LDA (linear discriminant analysis).
The LEfSe input file is a .tsv file. The first few rows contain the metadata. The following row contains the sample names, and the rest of the rows are occupied by features. The first column holds the row names:
label | healthy | healthy | healthy | cancer | cancer |
---|---|---|---|---|---|
age | 52 | 37 | 66 | 54 | 65 |
gender | 1 | 1 | 2 | 2 | 2 |
Sample_info | Sample_1 | Sample_2 | Sample_3 | Sample_4 | Sample_5 |
Feature_1 | 0.59 | 0.71 | 0.78 | 0.61 | 0.66 |
Feature_2 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 |
Feature_3 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 |
Feature_4 | 0.34 | 0.00 | 0.43 | 0.00 | 0.00 |
Feature_5 | 0.56 | 0.56 | 0.00 | 0.00 | 0.00 |
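To make this layout concrete, the following base R sketch splits a small tab-separated table of this shape into a metadata table and a feature matrix by hand. It is only an illustration of the file structure under the assumption that metadata occupy rows 1 to 3 and sample names row 4; it is not how any SIAMCAT function is implemented, and the variable names are made up for this example.

```r
# A tiny LEfSe-style table as tab-separated text (made-up values)
lefse.txt <- paste(
    "label\thealthy\thealthy\thealthy\tcancer\tcancer",
    "age\t52\t37\t66\t54\t65",
    "gender\t1\t1\t2\t2\t2",
    "Sample_info\tSample_1\tSample_2\tSample_3\tSample_4\tSample_5",
    "Feature_1\t0.59\t0.71\t0.78\t0.61\t0.66",
    "Feature_2\t0.00\t0.02\t0.00\t0.00\t0.00",
    sep = "\n"
)

# Read everything as character; the first column holds the row names
raw <- as.matrix(read.table(text = lefse.txt, sep = "\t", header = FALSE,
                            quote = "", row.names = 1,
                            colClasses = "character"))

rows.meta <- 1:3    # assumed: rows holding metadata
row.samples <- 4    # assumed: row holding the sample names

sample.names <- unname(raw[row.samples, ])

# Metadata: transpose to samples (rows) x variables (columns)
meta <- as.data.frame(t(raw[rows.meta, , drop = FALSE]),
                      stringsAsFactors = FALSE)
rownames(meta) <- sample.names
meta$age <- as.numeric(meta$age)

# Features: all remaining rows, converted from character to numeric
feat.rows <- setdiff(seq_len(nrow(raw)), c(rows.meta, row.samples))
feat <- raw[feat.rows, , drop = FALSE]
storage.mode(feat) <- "double"
colnames(feat) <- sample.names
```

The result follows the orientation described earlier: feat has features in rows and samples in columns, while meta has samples in rows.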
An example of such a file is attached to the SIAMCAT package:
SIAMCAT has a dedicated function to read LEfSe format files. The read.lefse function will read in the input file and extract metadata and features:
meta.and.features <- read.lefse(fn.in.lefse,
rows.meta = 1:6, row.samples = 7)
meta <- meta.and.features$meta
feat <- meta.and.features$feat
We can then create a label object from one of the columns of the meta object and create a siamcat object:
## Label used as case:
## cancer
## Label used as control:
## healthy
## + finished create.label.from.metadata in 0.001 s
metagenomeSeq is an R package to determine differentially abundant features between multiple samples.
There are two ways to input data into metagenomeSeq:
Text files, which can be used in SIAMCAT just as described in the SIAMCAT input section, with read.table:
fn.in.feat <- system.file(
"extdata",
"CHK_NAME.otus.count.csv",
package = "metagenomeSeq"
)
feat <- read.table(fn.in.feat, sep='\t',
header=TRUE, quote='', row.names = 1,
stringsAsFactors = FALSE, check.names = FALSE
)
A BIOM format file, which can be used in SIAMCAT as described in the following section.
BIOM format files can be added to SIAMCAT via phyloseq. First, the file should be imported using the phyloseq function import_biom. Then, the resulting phyloseq object can be imported as a siamcat object as described in the next section.
The siamcat object extends the phyloseq object. Therefore, creating a siamcat object from a phyloseq object is straightforward. This can be done with the siamcat constructor function. First, however, we need to create a label object:
data("GlobalPatterns") ## phyloseq example data
label <- create.label(meta=sample_data(GlobalPatterns),
label = "SampleType",
case = c("Freshwater", "Freshwater (creek)", "Ocean"))
## Label used as case:
## Freshwater,Freshwater (creek),Ocean
## Label used as control:
## rest
## + finished create.label.from.metadata in 0.002 s
## + starting validate.data
## +++ checking overlap between labels and features
## + Keeping labels of 26 sample(s).
## +++ checking sample number per class
## Data set has a limited number of training examples:
## rest 18
## Case 8
## Note that a dataset this small/skewed is not necessarily suitable for analysis in this pipeline.
## +++ checking overlap between samples and metadata
## + finished validate.data in 0.053 s
The siamcat-class is the centerpiece of the package. All of the input data and results are stored inside of the object:
In the figure above, rectangles depict slots of the object, and the class of the object stored in each slot is given in the ovals. There are two obligatory slots, marked with thick borders: phyloseq (containing the metadata as sample_data and the original features as otu_table) and label.
The siamcat object is constructed using the siamcat() function. There are two ways to initialize it:
Features: You can provide a feature matrix, data.frame, or otu_table to the function (together with label and metadata information):
phyloseq: The alternative is to create a siamcat object directly out of a phyloseq object:
Please note that you have to provide either feat or phyloseq and that you cannot provide both.
In order to explain the siamcat object better, we will show how each of the slots is filled.
The phyloseq and label slots are obligatory.
The phyloseq slot holds an object of class phyloseq, which is described in the help file of the phyloseq class. Help can be accessed by typing into the R console: help('phyloseq-class').
The otu_table slot in phyloseq (see help('otu_table-class')) stores the original feature table. For SIAMCAT, this slot can be accessed by orig_feat.
The label slot holds the label entries (see help('label-class')) that are automatically generated by the read.label or create.label functions.
The phyloseq, label and orig_feat are filled when the siamcat object is first created with the constructor function.
Other slots are filled during the run of the SIAMCAT workflow:
Each slot in siamcat can be accessed by typing
slot_name(siamcat)
e.g. for the eval_data slot you can type eval_data(siamcat).
There is one notable exception: the phyloseq slot has to be accessed with physeq(siamcat) due to technical reasons.
Slots will be filled during the SIAMCAT workflow by the package’s functions. However, if for any reason a slot needs to be assigned outside of the workflow, the following formula can be used:
slot_name(siamcat) <- object_to_assign
e.g. to assign a new_label object to the label slot, use label(siamcat) <- new_label.
Please note that this may lead to unforeseen consequences…
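For readers unfamiliar with this accessor and assignment syntax, the general S4 pattern behind it can be sketched with a toy class. Everything in the block below (the class toyContainer and the generic label_of) is hypothetical and defined only for this illustration; it is not the siamcat class or any SIAMCAT function.

```r
library(methods)

# Hypothetical toy class with a single slot; NOT the real siamcat class
setClass("toyContainer", representation(label = "numeric"))

# Accessor generic and method, used as: label_of(object)
setGeneric("label_of", function(object) standardGeneric("label_of"))
setMethod("label_of", "toyContainer", function(object) object@label)

# Replacement generic and method, used as: label_of(object) <- value
setGeneric("label_of<-",
           function(object, value) standardGeneric("label_of<-"))
setReplaceMethod("label_of", "toyContainer", function(object, value) {
    object@label <- value
    validObject(object)   # check that the object is still valid
    object                # a replacement method must return the object
})

toy <- new("toyContainer", label = c(Sample_1 = -1, Sample_2 = 1))

# Assign a new value through the replacement method, then read it back
label_of(toy) <- c(Sample_1 = 1, Sample_2 = 1)
label_of(toy)
```

In this toy class the validity check only covers the slot type. A replacement like this does not update anything else that was derived from the old value, which is one way such assignments can have unintended effects.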
There are two slots that have slots inside of them. First, the model_list slot has a models slot that contains the actual list of mlr models (accessible via models(siamcat)) and a model.type slot, which is a character with the name of the method used to train the model (accessible via model_type(siamcat)).
Second, the phyloseq slot has a complex structure. However, unless the phyloseq object is created outside of the SIAMCAT workflow, only two slots of the phyloseq slot will be occupied: the otu_table slot containing the feature table and the sam_data slot containing metadata information. They can be accessed by typing features(siamcat) and meta(siamcat), respectively.
Additional slots inside the phyloseq slot do not have dedicated accessors, but can easily be reached once the phyloseq object is exported from the siamcat object:
## Taxonomy Table: [6 taxa by 7 taxonomic ranks]:
## Kingdom Phylum Class Order Family
## 549322 "Archaea" "Crenarchaeota" "Thermoprotei" NA NA
## 522457 "Archaea" "Crenarchaeota" "Thermoprotei" NA NA
## 951 "Archaea" "Crenarchaeota" "Thermoprotei" "Sulfolobales" "Sulfolobaceae"
## 244423 "Archaea" "Crenarchaeota" "Sd-NA" NA NA
## 586076 "Archaea" "Crenarchaeota" "Sd-NA" NA NA
## 246140 "Archaea" "Crenarchaeota" "Sd-NA" NA NA
## Genus Species
## 549322 NA NA
## 522457 NA NA
## 951 "Sulfolobus" "Sulfolobusacidocaldarius"
## 244423 NA NA
## 586076 NA NA
## 246140 NA NA
If you want to find out more about the phyloseq data structure, head over to the phyloseq BioConductor page.
# Session Info
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggpubr_0.6.0 SIAMCAT_2.11.0 phyloseq_1.51.0 mlr3_0.22.1
## [5] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [9] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [13] ggplot2_3.5.1 tidyverse_2.0.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3 jsonlite_1.8.9
## [4] shape_1.4.6.1 magrittr_2.0.3 farver_2.1.2
## [7] corrplot_0.95 nloptr_2.1.1 rmarkdown_2.29
## [10] zlibbioc_1.52.0 vctrs_0.6.5 multtest_2.63.0
## [13] minqa_1.2.8 PRROC_1.3.1 rstatix_0.7.2
## [16] htmltools_0.5.8.1 progress_1.2.3 curl_6.0.1
## [19] broom_1.0.7 Rhdf5lib_1.29.0 Formula_1.2-5
## [22] rhdf5_2.51.1 pROC_1.18.5 sass_0.4.9
## [25] parallelly_1.40.1 bslib_0.8.0 plyr_1.8.9
## [28] palmerpenguins_0.1.1 mlr3tuning_1.3.0 cachem_1.1.0
## [31] uuid_1.2-1 buildtools_1.0.0 igraph_2.1.2
## [34] lifecycle_1.0.4 iterators_1.0.14 pkgconfig_2.0.3
## [37] Matrix_1.7-1 R6_2.5.1 fastmap_1.2.0
## [40] GenomeInfoDbData_1.2.13 future_1.34.0 digest_0.6.37
## [43] numDeriv_2016.8-1.1 colorspace_2.1-1 S4Vectors_0.45.2
## [46] mlr3misc_0.16.0 vegan_2.6-8 labeling_0.4.3
## [49] timechange_0.3.0 httr_1.4.7 abind_1.4-8
## [52] mgcv_1.9-1 compiler_4.4.2 beanplot_1.3.1
## [55] bit64_4.5.2 withr_3.0.2 backports_1.5.0
## [58] carData_3.0-5 ggsignif_0.6.4 LiblineaR_2.10-24
## [61] MASS_7.3-61 biomformat_1.35.0 permute_0.9-7
## [64] tools_4.4.2 ape_5.8-1 glue_1.8.0
## [67] lgr_0.4.4 nlme_3.1-166 rhdf5filters_1.19.0
## [70] grid_4.4.2 checkmate_2.3.2 gridBase_0.4-7
## [73] cluster_2.1.8 reshape2_1.4.4 ade4_1.7-22
## [76] generics_0.1.3 gtable_0.3.6 tzdb_0.4.0
## [79] data.table_1.16.4 hms_1.1.3 utf8_1.2.4
## [82] car_3.1-3 XVector_0.47.0 BiocGenerics_0.53.3
## [85] foreach_1.5.2 pillar_1.10.0 vroom_1.6.5
## [88] bbotk_1.5.0 splines_4.4.2 lattice_0.22-6
## [91] survival_3.8-3 bit_4.5.0.1 tidyselect_1.2.1
## [94] maketools_1.3.1 Biostrings_2.75.3 knitr_1.49
## [97] infotheo_1.2.0.1 gridExtra_2.3 IRanges_2.41.2
## [100] stats4_4.4.2 xfun_0.49 Biobase_2.67.0
## [103] matrixStats_1.4.1 stringi_1.8.4 UCSC.utils_1.3.0
## [106] yaml_2.3.10 boot_1.3-31 evaluate_1.0.1
## [109] codetools_0.2-20 BiocManager_1.30.25 cli_3.6.3
## [112] munsell_0.5.1 jquerylib_0.1.4 mlr3learners_0.9.0
## [115] Rcpp_1.0.13-1 GenomeInfoDb_1.43.2 globals_0.16.3
## [118] parallel_4.4.2 prettyunits_1.2.0 paradox_1.0.1
## [121] lme4_1.1-35.5 listenv_0.9.1 glmnet_4.1-8
## [124] lmerTest_3.1-3 scales_1.3.0 crayon_1.5.3
## [127] rlang_1.1.4 mlr3measures_1.0.0