phenomis: Postprocessing and univariate statistical analysis of omics data

Introduction

Context

Analysis of a molecular phenotyping (e.g. metabolomics) data sets (i.e. samples times variables table of peak or bucket intensities generated by preprocessing tools such as XCMS) is aimed at mining the data (e.g. detecting trends and outliers) and selecting features of predictive value (biomarker discovery). It comprises multiple steps including:

  • Post-processing (quality control, normalization and/or transformation of intensities)

  • Statistical analysis (univariate hypothesis testing, multivariate modeling, feature selection)

  • Annotation (physico-chemical properties, biological pathways)

The phenomis package focuses on the two first steps, and can be combined with other packages for multivariate modeling, feature selection and annotation, such as the ropls and biosigner and biodb Bioconductor packages. As an example, the tutorial below reproduces the whole workflow described in the sacurine study (Thévenot et al. 2015). The phenomis package has also been used to post-processed the preclinical, proteomics and metabolomics data from the ProMetIS study (Imbert et al. 2021). Proteomics and metabolomics data from this study are provided as a second dataset, named prometis. Examples of statistical analyses of this dataset are provided in the “Working with multi-omics datasets” section.

Methods

Methods from the phenomis, ropls, and biosigner packages for the analysis of omics datasets; specific parameter values used for the sacurine metabolomics dataset described in the ‘Hands-on’ part below are provided as examples.
Methods from the phenomis, ropls, and biosigner packages for the analysis of omics datasets; specific parameter values used for the sacurine metabolomics dataset described in the ‘Hands-on’ part below are provided as examples.

Formats

Managing data and metadata: the SummarizedExperiment and MultiAssayExperiment formats

The standard SummarizedExperiment and MultiAssayExperiment Bioconductor formats for single and multi-omics datasets are used by the phenomis methods. The methods return updated SummarizedExperiment and MultiAssayExperiment objects with either modified assay matrices (e.g. when using the transforming method) or enriched rowData and colData with new columns (e.g. fold changes and p-values when using the hypotesting method).

Note that the phenomis package also supports the ExpressionSet (respectively, MultiDataSet) formats, which are previous (respectively, alternative) formats for the management of single (respectively, multiple) datasets.

Importing from/exporting to tabular files

Input (i.e. preprocessed) data consists of a ‘variable times sample’ matrix of intensities (dataMatrix numeric matrix), in addition to sample and variable metadata (sampleMetadata and variableMetadata data frames). Importantly, the row names from the dataMatrix must be identical to the row names from the vaiableMetadata (feature names), and the column names from the dataMatrix must be identical to the row names from sampleMetadata (sample names).

Note that 3 sample metadata columns should be specified when working with the correcting method, namely:

  • sampleType (character): the following types are handled by the algorithms:

    • sample: biological sample of interest

    • blank: e.g. solvent only in Liquid Chromatography coupled to Mass Spectrometry

    • pool: quality control sample generated by pooling an equal volume of each of the sample of interest from the whole study (i.e. from all batches)

    • poolN: (where N is an integer; e.g. pool2, pool4, …): diluted quality control sample: N indicates the dilution factor

  • injectionOrder (integer): order of the sample injection (e.g. in the LC-MS instrument)

  • batch (character): name of each batch

Text and graphical outputs

Text and graphics can be handled with the phenomis methods by setting the two arguments:

  • report.c: if set to ‘interactive’ [default], messages are displayed interactively; if set to a character corresponding the a filename (with the ‘.txt’ extension), messages are diverted to this file by using the sink function internally (the same file can be appended by successive methods); if set to ‘none’, messages are suppressed

  • figure.c: if set to ‘interactive’ [default], graphics are displayed interactively; if set to a character corresponding the a filename (with the ‘.pdf’ extension), a pdf file with the plot is generated instead; if set to ‘none’, graphics are suppressed

Hands-on

The sacurine cohort study

As an example, we will use the phenomis package to study the sacurine human cohort. The study is aimed at characterizing the physiological variations of the human urine metabolome with age, body mass index (BMI), and gender (Thévenot et al. 2015). Urine samples from 184 volunteers were analyzed by reversed-phase (C18) ultrahigh performance liquid chromatography (UPLC) coupled to high-resolution mass spectrometry (LTQ-Orbitrap). Raw data are publicly available on the MetaboLights repository (MTBLS404).

This vignette describes the statistical analysis of the data set from the negative ionization mode (113 identified metabolites at MSI levels 1 or 2):

  • correcting: Correction of the signal drift by local regression (loess) modeling of the intensity trend in pool samples (Dunn et al. 2011); Adjustment of offset differences between the two analytical batches by using the average of the pool intensities in each batch (Kloet et al. 2009)

  • Variable quality control by discarding features with a coefficient of variation above 30% in pool samples

  • Normalization by the sample osmolality

  • transforming: log10 transformation

  • inspecting: Computing metrics to filter out outlier samples according to the Weighted Hotellings’T2 distance (Tenenhaus 1999), the Z-score of one of the intensity distribution deciles (Alonso et al. 2011), and the Z-score of the number of missing values (Alonso et al. 2011). A 0.001 threshold for all p-values results in the HU_096 sample being discarded

  • hypotesting: Univariate hypothesis testing of significant variations with age, BMI, or between genders (Student’s T test with Benjamini Hochberg correction)

  • PCA exploration of the data set; ropls Bioconductor package (Thévenot et al. 2015)

  • clustering: Hierarchical clustering

  • OPLS(-DA) modeling of age, BMI and gender; ropls Bioconductor package (Thévenot et al. 2015)

  • Selection of the features which significantly contributes to the discrimination between gender with PLS-DA, Random Forest, or Support Vector Machines classifiers; biosigner Bioconductor package (Rinaudo et al. 2016)

A Galaxy version of this analysis is available W4M00001 ‘Sacurine-statistics’ on the workflow4metabolomics.usegalaxy.fr online infrastructure (Guitton et al. 2017).

reading: reading the data

The reading function reads the data sets and builds the SummarizedExperiment object. Additional information on how to build and handle SummarizedExperiment objects (as well as MultiAssayExperiment, ExpressionSet, and MultiDataSet) are provided in the Appendix.

library(phenomis)
sacurine.se <- reading(system.file("extdata/sacurine", package = "phenomis"))
## class: SummarizedExperiment 
## dim: 113 210 
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(113): (2-methoxyethoxy)propanoic acid isomer
##   (gamma)Glu-Leu/Ile ... Valerylglycine isomer 2 Xanthosine
## rowData names(10): database_identifier chemical_formula ...
##   retention_time reliability
## colnames(210): QC1_001 HU_neg_017 ... HU_neg_146_b2 QC1_012_b2
## colData names(10): subset full ... bmi gender

inspecting: looking at the data

The inspecting method provides a numerical and graphical overview of the data. Furthermore, it computes quality metrics which may subsequently be used to filter out some samples or variables.

  • Graphical overview. The data matrix is visualized with a color gradient (top right) and the score plot of the Principal Component Analysis is shown for the two first components (bottom right). The black ellipse corresponds to the area of 95% of the samples, as computed with the Hotelling test. For each sample the mean of variable intensities is shown as a function of the injection order to detect any signal drift and/or batch correction (bottom left). Note that for this coloring to be displayed, the sampleType, injectionOrder and batch columns should be provided in the colData of (each of) the dataset(s). Finally, some metrics are computed regarding the proportion of NAs, 0 values, intensity quantiles, and proportion of features with a coefficient of variation in pool intensities < 30%.

  • Quality metrics (as additional column in the metadata):

    • samples (added in the colData):

      • hotel_pval: p-value from the Hotelling’s T2 test in the first plane of the PC components

      • miss_pval: p-value associated to the z-score of the proportion of missing values (Alonso et al. 2011)

      • deci_pval: p-value associated to the z-score of intensity deciles (Alonso et al. 2011)

        For each test, low p-values highlight samples with extreme behavior.

    • features (included in the rowData)

      • When the type of samples is available (ie the sampleType column is included in the input colData from the individual datasets), variable metrics are computed: sample, pool, and blank mean, sd and coefficient of variation (if the corresponding types are present in the ‘sampleType’ column), as well as ‘blank’ mean / ‘sample’ mean, and ‘pool’ CV / ‘sample’ CV ratio
      • If pool dilutions have been used and are indicated in the ‘sampleType’ column as poolN where N is an integer indicating the dilution factor (eg pool2 for a two-fold dilution of the pool; note that the non-diluted pool remains indicated as ‘pool’) the Pearson correlation (and corresponding p-value) between the intensity and the dilution factor is computed for each variable
sacurine.se <- inspecting(sacurine.se, report.c = "none")

The filtering method may be used to discard samples and/or features with a high proportion of missing values and/or a variance of 0. When filtering the features, sample class may be specified (as the name a column from the sample metadata): in this case, features will be kept when:

  1. the proportion of missing values is below the threshold in at least one class

  2. the variance is above 0 in both classes

Post-processing

correcting: Correcting signal drift and batch effect

For each feature, the signal drift and the batch effect are corrected by using a normalization strategy which relies on the measurements of a pooled (or QC) sample injected periodically: for each variable, a regression model is fitted to the values of the pool and subsequently used to adjust the intensities of the samples of interest (Kloet et al. 2009; Dunn et al. 2011). The sample metadata of each datasets (e.g. colData Data Frames) must contain 3 columns:

  1. sampleType (character): either ‘sample’, ‘blank’, or ‘pool’

  2. injectionOrder (integer): order of injection in the instrument

  3. batch (character): batch name

sacurine.se <- correcting(sacurine.se,
                          reference.vc = "pool",
                          report.c = "none")
## Correction method: loess
## Reference observations: pool
## Processing batch:
## ne1
## ne2

Variable filtering

  • using inspecting to compute the ‘pool_CV’ metric
sacurine.se <- inspecting(sacurine.se,
                          figure.c = "none", report.c = "none")
  • discarding features with ‘pool_CV’ > 0.3
sacurine.se <- sacurine.se[rowData(sacurine.se)[, "pool_CV"] <= 0.3, ]
  • discarding the ‘pool’ observations
sacurine.se <- sacurine.se[, colData(sacurine.se)[, "sampleType"] != "pool"]
print(sacurine.se)
## class: SummarizedExperiment 
## dim: 110 184 
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(110): (2-methoxyethoxy)propanoic acid isomer
##   (gamma)Glu-Leu/Ile ... Valerylglycine isomer 2 Xanthosine
## rowData names(14): database_identifier chemical_formula ... poolDil_cor
##   poolDil_pval
## colnames(184): HU_neg_017 HU_neg_018 ... HU_neg_106_b2 HU_neg_146_b2
## colData names(15): subset full ... miss_pval deci_pval

Normalization

The normalization of each sample by its ‘osmolality’ value does not require any extra method and relies on the classical methods from the SummarizedExperiment package.

assay(sacurine.se) <- sweep(assay(sacurine.se),
                            2,
                            colData(sacurine.se)[, "osmolality"],
                            "/")

Note that a normalizing method is available in the phenomis package to perform Probabilistic Quotient Normalization (Dieterle et al. 2006).

transforming: transforming the data intensities

sacurine.se <- transforming(sacurine.se, method.vc = "log10")
## 'log10' transformation

Sample filtering

sacurine.se <- inspecting(sacurine.se, figure.c = "none", report.c = "none")
sacurine.se <- sacurine.se[, colData(sacurine.se)[, "hotel_pval"] >= 0.001 &
                             colData(sacurine.se)[, "miss_pval"] >= 0.001 &
                             colData(sacurine.se)[, "deci_pval"] >= 0.001]

Final visual check of the data before performing the statistics:

inspecting(sacurine.se, report.c = "none")

hypotesting: univariate hypothesis testing

The hypotesting method is a wrapper of the main R functions for hypothesis testing and corrections for multiple testing. The list of available tests include two sample tests (t-test and Wilcoxon rank test, but also the limma test), analysis of variance (for one and two factors) and Kruskal-Wallis rank test, and correlation tests (by using either the pearson or the spearman correlation).

sacurine.se <- hypotesting(sacurine.se,
                           test.c = "ttest",
                           factor_names.vc = "gender",
                           adjust.c = "BH",
                           adjust_thresh.n = 0.05,
                           report.c = "none")

The phenomis package can be conveniently complemented with other packages for comprehensive statistical analysis and annotation. In the rest of this tutorial, we illustrate how the ropls, biosigner and biodb Bioconductor packages can be included to perform multivariate modeling (Thévenot et al. 2015), feature selection (Rinaudo et al. 2016), and feature annotation (Roger 2022). All these packages support the SummarizedExperiment and MultiAssayExperiment (but also ExpressionSet and MultiDataSet), to provide interoperability.

Unsupervised analysis

We first focus on unsupervised methods for multivariate analysis, by performing Principal Component Analysis and hierarchical clustering.

Principal component analysis: PCA

PCA can be performed by using the opls function from the ropls package. It returns an updated object with scores and loadings included as new columns in the metadata. The PCA model is stored in the metadata from the SummarizedExperiment and can be accessed with the getOpls method:

sacurine.se <- ropls::opls(sacurine.se, info.txt = "none")

sacurine.pca <- ropls::getOpls(sacurine.se)[["PCA"]]
ropls::plot(sacurine.pca,
            parAsColFcVn = colData(sacurine.se)[, "gender"],
            typeVc = "x-score")

ropls::plot(sacurine.pca,
            parAsColFcVn = colData(sacurine.se)[, "age"],
            typeVc = "x-score")

clustering: hierarchical clustering

The clustering method from the phenomis package performs the hierarchical clustering of samples and variables. The default dissimilarity is $1-cor_{pearson}$.

sacurine.se <- clustering(sacurine.se,
                          clusters.vi = c(5, 3))

Supervised modeling

We now address the multivariate modeling of the response of interest (either age, bmi, or gender), by using Partial Least Squares modeling and feature selection.

Partial Least Squares modeling: (O)PLS(-DA)

Partial Least-Squares (PLS), which is a latent variable regression method based on covariance between the predictors and the response, has been shown to efficiently handle datasets with multi-collinear predictors, as in the case of spectrometry measurements (Wold, Sjostrom, and Eriksson 2001).

The opls function from the ropls package can be used to perform Partial Least Squares modeling, by specifying the response to be modeled as the second argument:

sacurine.se <- ropls::opls(sacurine.se, "gender")
## PLS-DA
## 183 samples x 110 variables and 1 response
## standard scaling of predictors and response(s)
##       R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y  pQ2
## Total    0.273     0.73    0.57 0.261   3   0 0.05 0.05

The (O)PLS(-DA) can be obtained from the metadata, e.g. to perform specific plots:

sacurine_gender.plsda <- ropls::getOpls(sacurine.se)[["gender_PLSDA"]]
ropls::plot(sacurine_gender.plsda,
            parAsColFcVn = colData(sacurine.se)[, "gender"],
            typeVc = "x-score")

Feature selection

The biosigner approach is a wrapper method to select the features that significantly contribute to the performance of a binary classifier, namely PLS-DA, random forest, or Support Vector Machine (Rinaudo et al. 2016). We refer the interested reader to the vignette from the biosigner package.

sacurine.biosign <- biosigner::biosign(sacurine.se, "gender", bootI = 5)
## Selecting features for the plsda model
## Selecting features for the randomforest model
## Selecting features for the svm model
## Significant features from 'S' groups:
##                              plsda randomforest svm
## p-Anisic acid                "S"   "S"          "S"
## Testosterone glucuronide     "S"   "S"          "S"
## Oxoglutaric acid             "A"   "S"          "S"
## Pantothenic acid             "A"   "E"          "S"
## Acetylphenylalanine          "B"   "E"          "S"
## Malic acid                   "S"   "E"          "C"
## N2-Acetylaminoadipic acid    "C"   "E"          "S"
## 2-Isopropylmalic acid        "E"   "E"          "S"
## Methoxysalicylic acid isomer "E"   "E"          "S"
## N4-Acetylcytidine            "E"   "E"          "S"
## Pyrroledicarboxylic acid     "E"   "E"          "S"
## Taurine                      "E"   "E"          "S"
## Accuracy:
##      plsda randomforest   svm
## Full 0.867        0.826 0.888
## AS   0.898        0.848 0.910
## S    0.885        0.848 0.941

Please note that the bootI parameter is set to 5 to speed up the vignette building, but that we advice to keep the default 50 number of bootstraps to obtain more stable signatures.

annotating: MS annotation

This method is based on the biodb package (and its companion biodbChebi), which provides access to standard remote chemical and biological databases (ChEBI, KEGG, HMDB, …), as well as to in-house local database files (CSV, SQLite), with easy retrieval of entries, access to web services, search of compounds by mass and/or name (Roger 2022).

Viewing the parameters from the annotating method and their default values:

annotating_parameters()
## Annotating parameters for the query of chebi:
##                  mode                      available   default
## query.type  character c('mz', 'chebi.id', 'kegg.id')        mz
## query.col   character                                       mz
## ms.mode     character                c('pos', 'neg')       pos
## mz.tol        numeric                                       10
## mz.tol.unit character              c('plain', 'ppm')       ppm
## fields      character                      see below see below
## fieldsLimit   numeric                                        1
## max.results   numeric                                        3
## prefix      character                                   chebi.
## sep         character                                        |
## 'fields' possible values:
## 'accession', 'charge', 'chebi.id', 'formula', 'inchi', 'inchikey', 'kegg.compound.id', 'molecular.mass', 'monoisotopic.mass', 'name', 'smiles'
## 'fields' default values:
## 'chebi.id', 'name', 'formula', 'monoisotopic.mass', 'kegg.compound.id'

Querying chemical annotation from the Chemical Entities of Biological Interest ChEBI database:

  1. by using mz values:
sacurine.se <- annotating(sacurine.se,
                          database.c = "chebi",
                          param.ls = list(query.type = "mz",
                                          query.col = "mass_to_charge",
                                          ms.mode = "neg",
                                          mz.tol = 10,
                                          mz.tol.unit = "ppm",
                                          max.results = 3,
                                          prefix = "chebiMZ."),
                          report.c = "none")
knitr::kable(head(rowData(sacurine.se)[, grep("chebiMZ", 
                                              colnames(rowData(sacurine.se)))]))
  1. by using ChEBI identifiers:
sacurine.se <- annotating(sacurine.se, database.c = "chebi",
                          param.ls = list(query.type = "chebi.id",
                                          query.col = "database_identifier",
                                          prefix = "chebiID."))
knitr::kable(head(rowData(sacurine.se)[, grep("chebiID", 
                                              colnames(rowData(sacurine.se)))]))

Adding chemical information from a local database:

  1. loading a local (example) MS database:
msdbDF <- read.table(system.file("extdata/local_ms_db.tsv", package = "phenomis"),
                     header = TRUE,
                     sep = "\t",
                     stringsAsFactors = FALSE)
  1. Querying the local database:
sacurine.se <- annotating(sacurine.se,
                          database.c = "local.ms",
                          param.ls = list(query.type = "mz",
                                          query.col = "mass_to_charge",
                                          ms.mode = "neg",
                                          mz.tol = 5,
                                          mz.tol.unit = "ppm",
                                          local.ms.db = msdbDF,
                                          prefix = "localMS."),
                          report.c = "none")
## INFO  [03:39:43.485] Loading definitions from package biodb version 1.15.0.
## INFO  [03:39:43.659] Loading definitions from package biodbChebi version 1.13.0.
## INFO  [03:39:44.610] Closing BiodbMain instance...
## INFO  [03:39:44.611] Connector "mass.csv.file" deleted.
knitr::kable(rowData(sacurine.se)[!is.na(rowData(sacurine.se)[, "localMS.accession"]), grep("localMS.", colnames(rowData(sacurine.se)), fixed = TRUE)])
localMS.accession localMS.formula localMS.mass.csv.file.id localMS.molecular.mass localMS.ms.mode localMS.name localMS.peak.mz localMS.peak.mztheo
Citric acid KNA00550 C6H8O7 KNA00550 192.027 neg Citric acid 191.019252 191.019252
Malic acid KNA00527 C4H6O5 KNA00527 134.02152 neg L-Malic acid 133.014168 133.014168
Taurine KNA00479 C2H7NO3S KNA00479 125.01466 neg Taurine 124.00727 124.00727
Tryptophan KNA00507 C11H12N2O2 KNA00507 204.08988 neg L-Tryptophan 203.082131 203.082131

writing: Exporting the results

Finally, the writing method can be used to export the SummarizedExperiment object in the 3 tabular file format. In case of a MultiAssayExperiment, one subfolder containing the 3 files will be created for each dataset.

writing(sacurine.se, dir.c = getwd(), prefix.c = "sacurine")

Graphical User Interface

The main features from the phenomis package can also be accessed via a graphical user interface in the Quality Metrics, Batch Correction, Univariate, and Heatmap modules from the workflow4metabolomics.usegalaxy.fr online resource for computational metabolomics, based on the Galaxy environment.

Working with multi-omics datasets

The MultiAssayExperiment format is useful to handle multi-omics datasets (Ramos et al. 2017). The phenomis package (as well as the ropls and biosigner packages) can be used to process the data sets in parallel, such as all methods described before can be directly applied to the MultiAssayExperiment object.

In several methods, specific parameters for each dataset can be passed as a vector: for instance, with two datasets, if one wishes to log2 transform the intensities of the first dataset but not the second one, the method.vc = c("log2", "none") parameter can be used with the transforming method (in contrast, if only one value is provided, it will be used for all datasets).

Here we apply phenomis statistical methods to the ProMetIS study , which addresses the molecular phenotyping of knock-out mouse models by combined proteomics and metabolomics analysis (Imbert et al. 2021). The phenomis dataset has been selected from the original data (publicly available on GitHub: IFB-ElixirFr/ProMetIS) as follows:

  • liver data

  • WT and KO mice for the LAT gene (Linker for Activation of T cells) only

  • the proteomics dataset and the metabolomics dataset obtained with the ZIC-pHILIC chromatographic column

  • post-processed data only

  • 100 features only

The multi-omics dataset is saved as text files and can be loaded with the reading function as a MultiAssayExperiment (or as a MultiDataSet with the output.c = "set" argument)

prometis.mae <- reading(system.file("extdata/prometis",
                                    package = "phenomis"))
## Reading the 'metabo' dataset...
## Reading the 'proteo' dataset...
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 2:
##  [1] metabo: SummarizedExperiment with 100 rows and 28 columns
##  [2] proteo: SummarizedExperiment with 100 rows and 28 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

Let us have a look at the (common) sample metadata:

head(colData(prometis.mae))
## DataFrame with 6 rows and 3 columns
##              gene  mouse_id         sex
##       <character> <integer> <character>
## L818f         LAT       818           F
## L819f         LAT       819           F
## L824m         LAT       824           M
## L825m         LAT       825           M
## L826m         LAT       826           M
## L827m         LAT       827           M

We first explore the datasets:

prometis.mae <- inspecting(prometis.mae, report.c = "none")

We then perform univariate hypothesis testing with the limma approach (Ritchie et al. 2015):

prometis.mae <- hypotesting(prometis.mae, "limma", "gene",
                            report.c = "none")
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

As with single omics studies, the ropls and biosigner can be directly applied to MultiAssayExperiment for multivariate analysis:

prometis.mae <- ropls::opls(prometis.mae, "gene")
## 
## 
## Building the model for the 'metabo' dataset:
## PLS-DA
## 28 samples x 100 variables and 1 response
## standard scaling of predictors and response(s)
##       R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y  pQ2
## Total    0.331    0.941   0.859 0.128   2   0 0.05 0.05

## 
## 
## Building the model for the 'proteo' dataset:
## PLS-DA
## 28 samples x 100 variables and 1 response
## standard scaling of predictors and response(s)
##       R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y  pQ2
## Total    0.137    0.677   0.458 0.294   1   0  0.1 0.05
## Warning: Single component model: only 'overview' and 'permutation' (in case of
## single response (O)PLS(-DA)) plots available

and feature selection:

prometis.mae <- biosigner::biosign(prometis.mae, "gene", bootI = 5)
## 
## 
## Selecting the features for the 'metabo' dataset:
## Selecting features for the plsda model
## Selecting features for the randomforest model
## Selecting features for the svm model
## Significant features from 'S' groups:
##                                     plsda randomforest svm
## ADP                                 "B"   "S"          "S"
## GDP-L-fucose                        "S"   "E"          "S"
## 9H-xanthine                         "B"   "S"          "E"
## deoxycholic acid                    "E"   "E"          "S"
## taurocholic acid                    "E"   "E"          "S"
## 3-phenylpropionic acid              "E"   "E"          "S"
## 4-oxo-4-(pyridin-3-yl)butanoic acid "E"   "E"          "S"
## ferulic acid                        "E"   "E"          "S"
## glycocholate                        "E"   "E"          "S"
## UDP-alpha-D-galactose               "E"   "E"          "S"
## Accuracy:
##      plsda randomforest   svm
## Full 0.980        0.949 0.963
## AS   0.912        0.918 0.952
## S    0.830        0.933 0.980

## 
## 
## Selecting the features for the 'proteo' dataset:
## Selecting features for the plsda model
## Selecting features for the randomforest model
## Selecting features for the svm model
## No significant variable found for the selected classifier(s): 'plsda', 'randomforest', 'svm'

The final MultiAssayExperiment can be exported as text files with:

writing(prometis.mae, dir.c = getwd(), prefix.c = "prometis")

Appendix

Additional examples of application to single and multiple omics data sets

CLL data set

The CLL package contains the chronic lymphocytic leukemia (CLL) gene expression data from 24 samples, that were either classified as progressive or stable in regards to disease progression.

We first load the data (which are stored as an ExpressionSet object), and restrict to the first 1,000 features (to speed up computations):

data(sCLLex, package = "CLL")
cll.eset <- sCLLex[seq_len(1000), ]

We explore the data:

cll.eset <- inspecting(cll.eset, report.c = "none")

Investigate whether clusters can be observed (we first add the letter ‘p’ or ‘s’ in the sample names to indicate the phenotype: p = progressive, s = stable):

Biobase::sampleNames(cll.eset) <- paste0(Biobase::sampleNames(cll.eset),
                                         "_",
                                         substr(Biobase::pData(cll.eset)[, "Disease"], 1, 1))

cll.eset <- clustering(cll.eset)

and perform hypothesis testing with the limma test (the dots must be removed from the ‘progress.’ phenotype):

Biobase::pData(cll.eset)[, "Disease"] <- gsub(".", "",
                                              Biobase::pData(cll.eset)[, "Disease"],
                                              fixed = TRUE)
cll.eset <- hypotesting(cll.eset, "limma", "Disease")

All the subsequent multivariate analysis and feature selection steps described with the sacurine data set can be performed with the cll.eset accordingly.

NCI60_4arrays data set

We provide an example based on the NCI60_4arrays cancer data set from the omicade4 package (Meng et al. 2014), which has been made available in the MultiAssayExperiment format in the ropls package. The data consist of gene transcript expression analysis from NCI-60 cell lines by using 4 microarray platforms (Reinhold et al. 2012).

Getting the NCI60 data set as a MultiAssayExperiment:

data(NCI60, package = "ropls")
nci.mae <- NCI60[["mae"]]

Let’s focus on the LEukemia and MElanoma cancer types, and the agilent and hgu95 arrays:

library(MultiAssayExperiment)
table(nci.mae$cancer)
## 
## BR CN CO LC LE ME OV PR RE 
##  5  6  7  9  6 10  7  2  8
nci.mae <- nci.mae[,
                   nci.mae$cancer %in% c("LE", "ME"),
                   c("agilent", "hgu95")]

We can now e.g. perform clustering on each data set independently:

nci.mae <- clustering(nci.mae)

and/or perform hypothesis testing on each data set:

nci.mae <- hypotesting(nci.mae, "limma", "cancer", report.c = "none")

as well as all the analyses described above for the prometis data set.

Cheat sheets for Bioconductor (multi)omics containers

SummarizedExperiment

The SummarizedExperiment format has been designed by the Bioconductor team to handle (single) omics datasets (Morgan et al. 2022).

An example of SummarizedExperiment creation and a summary of useful methods are provided below.

Please refer to package vignettes for a full description of SummarizedExperiment objects .

# Preparing the data (matrix) and sample and variable metadata (data frames):
data(sacurine, package = "ropls")
data.mn <- sacurine[["dataMatrix"]] # matrix: samples x variables
samp.df <- sacurine[["sampleMetadata"]] # data frame: samples x sample metadata
feat.df <- sacurine[["variableMetadata"]] # data frame: features x feature metadata

# Creating the SummarizedExperiment (package SummarizedExperiment)
library(SummarizedExperiment)
sac.se <- SummarizedExperiment(assays = list(sacurine = t(data.mn)),
                               colData = samp.df,
                               rowData = feat.df)
# note that colData and rowData main format is DataFrame, but data frames are accepted when building the object
stopifnot(validObject(sac.se))

# Viewing the SummarizedExperiment
# ropls::view(sac.se)
Methods Description Returned class
Constructors
SummarizedExperiment Create a SummarizedExperiment object SummarizedExperiment
makeSummarizedExperimentFromExpressionSet SummarizedExperiment
Accessors
assayNames Get or set the names of assay() elements character
assay Get or set the ith (default first) assay element matrix
assays Get the list of experimental data numeric matrices SimpleList
rowData Get or set the row data (features) DataFrame
colData Get or set the column data (samples) DataFrame
metadata Get or set the experiment data list
dim Get the dimensions (features of interest x samples) integer
dimnames Get or set the dimension names list of length 2 character/NULL
rownames Get the feature names character
colnames Get the sample names character
Conversion
as(, "SummarizedExperiment") Convert an ExpressionSet to a SummarizedExperiment SummarizedExperiment

MultiAssayExperiment

The MultiAssayExperiment format has been designed by the Bioconductor team to handle multiomics datasets (Ramos et al. 2017).

An example of MultiAssayExperiment creation and a summary of useful methods are provided below. Please refer to package vignettes or to the original publication for a full description of MultiAssayExperiment objects (Ramos et al. 2017).

Loading the NCI60_4arrays from the omicade4 package:

data("NCI60_4arrays", package = "omicade4")

Creating the MultiAssayExperiment:

library(MultiAssayExperiment)
# Building the individual SummarizedExperiment instances
experiment.ls <- list()
sampleMap.ls <- list()
for (set.c in names(NCI60_4arrays)) {
  # Getting the data and metadata
  assay.mn <- as.matrix(NCI60_4arrays[[set.c]])
  coldata.df <- data.frame(row.names = colnames(assay.mn),
                           .id = colnames(assay.mn),
                           stringsAsFactors = FALSE) # the 'cancer' information will be stored in the main colData of the mae, not the individual SummarizedExperiments
  rowdata.df <- data.frame(row.names = rownames(assay.mn),
                           name = rownames(assay.mn),
                           stringsAsFactors = FALSE)
  # Building the SummarizedExperiment
  assay.ls <- list(se = assay.mn)
  names(assay.ls) <- set.c
  se <- SummarizedExperiment(assays = assay.ls,
                             colData = coldata.df,
                             rowData = rowdata.df)
  experiment.ls[[set.c]] <- se
  sampleMap.ls[[set.c]] <- data.frame(primary = colnames(se),
                                      colname = colnames(se)) # both datasets use identical sample names
}
sampleMap <- listToMap(sampleMap.ls)

# The sample metadata are stored in the colData data frame (both datasets have the same samples)
stopifnot(identical(colnames(NCI60_4arrays[[1]]),
                    colnames(NCI60_4arrays[[2]])))
sample_names.vc <- colnames(NCI60_4arrays[[1]])
colData.df <- data.frame(row.names = sample_names.vc,
                         cancer = substr(sample_names.vc, 1, 2))

nci.mae <- MultiAssayExperiment(experiments = experiment.ls,
                                colData = colData.df,
                                sampleMap = sampleMap)

stopifnot(validObject(nci.mae))
Methods Description Returned class
Constructors
MultiAssayExperiment Create a MultiAssayExperiment object MultiAssayExperiment
ExperimentList Create an ExperimentList from a List or list ExperimentList
Accessors
colData Get or set data that describe the patients/biological units DataFrame
experiments Get or set the list of experimental data objects as original classes experimentList
assays Get the list of experimental data numeric matrices SimpleList
assay Get the first experimental data numeric matrix matrix, matrix-like
sampleMap Get or set the map relating observations to subjects DataFrame
metadata Get or set additional data descriptions list
rownames Get row names for all experiments CharacterList
colnames Get column names for all experiments CharacterList
Subsetting
mae[i, j, k] Get rows, columns, and/or experiments MultiAssayExperiment
mae[i, ,] i: GRanges, character, integer, logical, List, list MultiAssayExperiment
mae[,j,] j: character, integer, logical, List, list MultiAssayExperiment
mae[,,k] k: character, integer, logical MultiAssayExperiment
mae[[i]] Get or set object of arbitrary class from experiments (Varies)
mae[[i]] Character, integer, logical
mae$column Get or set colData column vector (varies)
Management
complete.cases Identify subjects with complete data in all experiments vector (logical)
duplicated Identify subjects with replicate observations per experiment list of LogicalLists
mergeReplicates Merge replicate observations within each experiment MultiAssayExperiment
intersectRows Return features that are present for all experiments MultiAssayExperiment
intersectColumns Return subjects with data available for all experiments MultiAssayExperiment
prepMultiAssay Troubleshoot common problems when constructing main class list
Reshaping
longFormat Return a long and tidy DataFrame with optional colData columns DataFrame
wideFormat Create a wide DataFrame, one row per subject DataFrame
Combining
c Concatenate an experiment MultiAssayExperiment
Viewing
upsetSamples Generalized Venn Diagram analog for sample membership upset
Exporting
exportClass Exporting to flat files csv or tsv files

ExpressionSet

The ExpressionSet format (Biobase package) was designed by the Bioconductor team as the first format to handle (single) omics data. It has now been supplanted by the SummarizedExperiment format.

An example of ExpressionSet creation and a summary of useful methods are provided below. Please refer to ‘An introduction to Biobase and ExpressionSets’ from the documentation from the Biobase package for a full description of ExpressionSet objects.

# Preparing the data (matrix) and sample and variable metadata (data frames):
data(sacurine, package = "ropls")
data.mn <- sacurine[["dataMatrix"]] # matrix: samples x variables
samp.df <- sacurine[["sampleMetadata"]] # data frame: samples x sample metadata
feat.df <- sacurine[["variableMetadata"]] # data frame: features x feature metadata
# Creating the ExpressionSet (package Biobase)
sac.set <- Biobase::ExpressionSet(assayData = t(data.mn))
Biobase::pData(sac.set) <- samp.df
Biobase::fData(sac.set) <- feat.df
stopifnot(validObject(sac.set))
# Viewing the ExpressionSet
# ropls::view(sac.set)
Methods Description Returned class
exprs ‘variable times samples’ numeric matrix matrix
pData sample metadata data.frame
fData variable metadata data.frame
sampleNames sample names character
featureNames variable names character
dims 2x1 matrix of ‘Features’ and ‘Samples’ dimensions matrix
varLabels Column names of the sample metadata data frame character
fvarLabels Column names of the variable metadata data frame character

MultiDataSet

Loading the NCI60_4arrays from the omicade4 package:

data("NCI60_4arrays", package = "omicade4")

Creating the MultiDataSet:

library(MultiDataSet)
# Creating the MultiDataSet instance
nci.mds <- MultiDataSet::createMultiDataSet()
# Adding the two datasets as ExpressionSet instances
for (set.c in names(NCI60_4arrays)) {
  # Getting the data
  expr.mn <- as.matrix(NCI60_4arrays[[set.c]])
  pdata.df <- data.frame(row.names = colnames(expr.mn),
                        cancer = substr(colnames(expr.mn), 1, 2),
                        stringsAsFactors = FALSE)
  fdata.df <- data.frame(row.names = rownames(expr.mn),
                        name = rownames(expr.mn),
                        stringsAsFactors = FALSE)
  # Building the ExpressionSet
  eset <- Biobase::ExpressionSet(assayData = expr.mn,
                                 phenoData = new("AnnotatedDataFrame",
                                                 data = pdata.df),
                                 featureData = new("AnnotatedDataFrame",
                                                   data = fdata.df),
                                 experimentData = new("MIAME",
                                                      title = set.c))
  # Adding to the MultiDataSet
  nci.mds <- MultiDataSet::add_eset(nci.mds, eset, dataset.type = set.c,
                                     GRanges = NA, warnings = FALSE)
}
stopifnot(validObject(nci.mds))
Methods Description Returned class
Constructors
createMultiDataSet Create a MultiDataSet object MultiDataSet
add_eset Create a MultiAssayExperiment object MultiDataSet
Subsetting
mset[i, ] i: character,logical (samples to select) MultiDataSet
mset[, k] k: character (names of datasets to select) MultiDataSet
mset[[k]] k: character (name of the datast to select) ExpressionSet
Accessors
as.list Get the list of data matrices list
pData Get the list of sample metadata list
fData Get the list of variable metadata list
sampleNames Get the list of sample names list
Management
commonSamples Select samples that are present in all datasets MultiDataSet
Conversion
mds2mae Convert a MultiDataSet to a MultiAssayExperiment MultiAssayExperiment

Session info

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] MultiDataSet_1.35.0         MultiAssayExperiment_1.33.1
##  [3] biodb_1.15.0                phenomis_1.9.0             
##  [5] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [7] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
##  [9] IRanges_2.41.2              S4Vectors_0.45.2           
## [11] BiocGenerics_0.53.3         generics_0.1.3             
## [13] MatrixGenerics_1.19.0       matrixStats_1.4.1          
## [15] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.3               formatR_1.14            rlang_1.1.4            
##   [4] magrittr_2.0.3          PMCMRplus_1.9.12        e1071_1.7-16           
##   [7] compiler_4.4.2          RSQLite_2.3.9           BWStest_0.2.3          
##  [10] vctrs_0.6.5             kSamples_1.2-10         stringr_1.5.1          
##  [13] pkgconfig_2.0.3         crayon_1.5.3            fastmap_1.2.0          
##  [16] dbplyr_2.5.0            XVector_0.47.0          labeling_0.4.3         
##  [19] ropls_1.39.0            rmarkdown_2.29          UCSC.utils_1.3.0       
##  [22] purrr_1.0.2             bit_4.5.0.1             xfun_0.49              
##  [25] randomForest_4.7-1.2    zlibbioc_1.52.0         cachem_1.1.0           
##  [28] Rmpfr_1.0-0             jsonlite_1.8.9          progress_1.2.3         
##  [31] SuppDists_1.1-9.8       blob_1.2.4              gmp_0.7-5              
##  [34] DelayedArray_0.33.3     chk_0.9.2               prettyunits_1.2.0      
##  [37] R6_2.5.1                bslib_0.8.0             stringi_1.8.4          
##  [40] RColorBrewer_1.1-3      ranger_0.17.0           limma_3.63.2           
##  [43] jquerylib_0.1.4         Rcpp_1.0.13-1           knitr_1.49             
##  [46] BiocBaseUtils_1.9.0     VennDiagram_1.7.3       igraph_2.1.2           
##  [49] Matrix_1.7-1            tidyselect_1.2.1        abind_1.4-8            
##  [52] yaml_2.3.10             curl_6.0.1              lattice_0.22-6         
##  [55] tibble_3.2.1            plyr_1.8.9              withr_3.0.2            
##  [58] askpass_1.2.1           evaluate_1.0.1          lambda.r_1.2.4         
##  [61] proxy_0.4-27            futile.logger_1.4.3     BiocFileCache_2.15.0   
##  [64] pillar_1.10.0           biodbChebi_1.13.0       BiocManager_1.30.25    
##  [67] filelock_1.0.3          plotly_4.10.4           hms_1.1.3              
##  [70] ggplot2_3.5.1           munsell_0.5.1           scales_1.3.0           
##  [73] calibrate_1.7.7         class_7.3-22            glue_1.8.0             
##  [76] lazyeval_0.2.2          maketools_1.3.1         tools_4.4.2            
##  [79] data.table_1.16.4       sys_3.4.3               buildtools_1.0.0       
##  [82] mvtnorm_1.3-2           XML_3.99-0.17           grid_4.4.2             
##  [85] tidyr_1.3.1             colorspace_2.1-1        GenomeInfoDbData_1.2.13
##  [88] cli_3.6.3               rappdirs_0.3.3          futile.options_1.0.1   
##  [91] viridisLite_0.4.2       S4Arrays_1.7.1          dplyr_1.1.4            
##  [94] gtable_0.3.6            sass_0.4.9              digest_0.6.37          
##  [97] ggrepel_0.9.6           SparseArray_1.7.2       farver_2.1.2           
## [100] htmlwidgets_1.6.4       lgr_0.4.4               memoise_2.0.1          
## [103] htmltools_0.5.8.1       lifecycle_1.0.4         biosigner_1.35.0       
## [106] httr_1.4.7              multcompView_0.1-10     statmod_1.5.0          
## [109] openssl_2.3.0           qqman_0.1.9             bit64_4.5.2            
## [112] MASS_7.3-61

References

Alonso, Arnald, Antonio Julia, Antoni Beltran, Maria Vinaixa, Marta Diaz, Lourdes Ibanez, Xavier Correig, and Sara Marsal. 2011. AStream: An R Package for Annotating LC/MS Metabolomic Data.” Bioinformatics 27 (9): 1339–40. https://doi.org/10.1093/bioinformatics/btr138.
Dieterle, Frank, Alfred Ross, Gotz Schlotterbeck, and Hans Senn. 2006. “Probabilistic Quotient Normalization as Robust Method to Account for Dilution of Complex Biological Mixtures. Application in 1H NMR Metabonomics.” Analytical Chemistry 78 (13): 4281–90. https://doi.org/10.1021/ac051632c.
Dunn, Warwick B, David Broadhurst, Paul Begley, Eva Zelena, Sue Francis-McIntyre, Nadine Anderson, Marie Brown, et al. 2011. “Procedures for Large-Scale Metabolic Profiling of Serum and Plasma Using Gas Chromatography and Liquid Chromatography Coupled to Mass Spectrometry.” Nature Protocols 6 (7): 1060–83. https://doi.org/10.1038/nprot.2011.335.
Guitton, Yann, Marie Tremblay-Franco, Gildas Le Corguillé G., Jean-Francois Martin, Melanie Pétéra, Pierrick Roger-Mele, Alexis Delabrière, et al. 2017. “Create, Run, Share, Publish, and Reference Your LC-MS, FIA-MS, GC-MS, and NMR Data Analysis Workflows with the Workflow4Metabolomics 3.0 Galaxy Online Infrastructure for Metabolomics.” The International Journal of Biochemistry & Cell Biology 93: 89–101. https://doi.org/10.1016/j.biocel.2017.07.002.
Imbert, Alyssa, Magali Rompais, Mohammed Selloum, Florence Castelli, Emmanuelle Mouton-Barbosa, Marion Brandolini-Bunlon, Emeline Chu-Van, et al. 2021. ProMetIS, Deep Phenotyping of Mouse Models by Combined Proteomics and Metabolomics Analysis.” Scientific Data 8 (1). https://doi.org/10.1038/s41597-021-01095-3.
Kloet, Frans M. van der, Ivana Bobeldijk, Elwin R. Verheij, and Renger H. Jellema. 2009. “Analytical Error Reduction Using Single Point Calibration for Accurate and Precise Metabolomic Phenotyping.” Journal of Proteome Research 8 (11): 5132–41. https://doi.org/10.1021/pr900499r.
Meng, Chen, Bernhard Kuster, Aedin C. Culhane, and Amin Moghaddas Gholami. 2014. “A Multivariate Approach to the Integration of Multi-Omics Datasets.” BMC Bioinformatics 15 (1): 1–13. https://doi.org/10.1186/1471-2105-15-162.
Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2022. SummarizedExperiment: SummarizedExperiment Container. https://bioconductor.org/packages/SummarizedExperiment.
Ramos, Marcel, Lucas Schiffer, Angela Re, Rimsha Azhar, Azfar Basunia, Carmen Rodriguez, Tiffany Chan, et al. 2017. “Software for the Integration of Multiomics Experiments in Bioconductor.” Cancer Research 77 (21): e39–42. https://doi.org/10.1158/0008-5472.CAN-17-0344.
Reinhold, William C., Margot Sunshine, Hongfang Liu, Sudhir Varma, Kurt W. Kohn, Joel Morris, James Doroshow, and Yves Pommier. 2012. CellMiner: A Web-Based Suite of Genomic and Pharmacologic Tools to Explore Transcript and Drug Patterns in the NCI-60 Cell Line Set.” Cancer Research 72 (14): 3499–3511. https://doi.org/10.1158/0008-5472.CAN-12-1370.
Rinaudo, Philippe, Samia Boudah, Christophe Junot, and Etienne A Thévenot. 2016. “Biosigner: A New Method for the Discovery of Significant Molecular Signatures from Omics Data.” Frontiers in Molecular Biosciences 3: –. https://doi.org/10.3389/fmolb.2016.00026.
Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47–47. https://doi.org/10.1093/nar/gkv007.
Roger, Pierrick. 2022. Biodb: Biodb, a Library and a Development Framework for Connecting to Chemical and Biological Databases. https://github.com/pkrog/biodb.
Tenenhaus, M. 1999. “L’approche PLS.” Revue de Statistiques Appliquees 47: 5–40.
Thévenot, Etienne A., Aurélie Roux, Ying Xu, Eric Ezan, and Christophe Junot. 2015. “Analysis of the Human Adult Urinary Metabolome Variations with Age, Body Mass Index and Gender by Implementing a Comprehensive Workflow for Univariate and OPLS Statistical Analyses.” Journal of Proteome Research 14 (8): 3322–35. https://doi.org/10.1021/acs.jproteome.5b00354.
Wold, S., M. Sjostrom, and L. Eriksson. 2001. PLS-Regression: A Basic Tool of Chemometrics.” Chemometrics and Intelligent Laboratory Systems 58: 109–30. http://dx.doi.org/10.1016/S0169-7439(01)00155-1.