Analysis of a molecular phenotyping (e.g. metabolomics) data sets (i.e. samples times variables table of peak or bucket intensities generated by preprocessing tools such as XCMS) is aimed at mining the data (e.g. detecting trends and outliers) and selecting features of predictive value (biomarker discovery). It comprises multiple steps including:
Post-processing (quality control, normalization and/or transformation of intensities)
Statistical analysis (univariate hypothesis testing, multivariate modeling, feature selection)
Annotation (physico-chemical properties, biological pathways)
The phenomis
package focuses on the two
first steps, and can be combined with other packages for multivariate
modeling, feature selection and annotation, such as the ropls
and biosigner
and biodb
Bioconductor packages. As an example, the tutorial below reproduces the
whole workflow described in the sacurine
study (Thévenot et al. 2015). The
phenomis
package has also been used to post-processed the
preclinical, proteomics and metabolomics data from the ProMetIS study
(Imbert et al. 2021). Proteomics and
metabolomics data from this study are provided as a second dataset,
named prometis
. Examples of statistical analyses of this
dataset are provided in the “Working with multi-omics datasets”
section.
SummarizedExperiment
and MultiAssayExperiment
formatsThe standard SummarizedExperiment
and
MultiAssayExperiment
Bioconductor formats for single and
multi-omics datasets are used by the phenomis
methods. The
methods return updated SummarizedExperiment
and
MultiAssayExperiment
objects with either modified assay
matrices (e.g. when using the transforming
method) or
enriched rowData
and colData
with new columns
(e.g. fold changes and p-values when using the
hypotesting
method).
Note that the phenomis
package also supports the
ExpressionSet
(respectively, MultiDataSet
)
formats, which are previous (respectively, alternative) formats for the
management of single (respectively, multiple) datasets.
Input (i.e. preprocessed) data consists of a ‘variable times sample’ matrix of intensities (dataMatrix numeric matrix), in addition to sample and variable metadata (sampleMetadata and variableMetadata data frames). Importantly, the row names from the dataMatrix must be identical to the row names from the vaiableMetadata (feature names), and the column names from the dataMatrix must be identical to the row names from sampleMetadata (sample names).
Note that 3 sample metadata columns should be specified when working
with the correcting
method, namely:
sampleType (character): the following types are handled by the algorithms:
sample: biological sample of interest
blank: e.g. solvent only in Liquid Chromatography coupled to Mass Spectrometry
pool: quality control sample generated by pooling an equal volume of each of the sample of interest from the whole study (i.e. from all batches)
poolN: (where N is an integer; e.g. pool2, pool4, …): diluted quality control sample: N indicates the dilution factor
injectionOrder (integer): order of the sample injection (e.g. in the LC-MS instrument)
batch (character): name of each batch
Text and graphics can be handled with the phenomis methods by setting the two arguments:
report.c
: if set to ‘interactive’
[default], messages are displayed interactively; if set to a character
corresponding the a filename (with the ‘.txt’ extension), messages are
diverted to this file by using the sink function internally
(the same file can be appended by successive methods); if set to ‘none’,
messages are suppressed
figure.c
: if set to ‘interactive’
[default], graphics are displayed interactively; if set to a character
corresponding the a filename (with the ‘.pdf’ extension), a pdf file
with the plot is generated instead; if set to ‘none’, graphics are
suppressed
As an example, we will use the phenomis
package to study the sacurine human cohort. The study
is aimed at characterizing the physiological variations of the human
urine metabolome with age, body mass index (BMI), and gender (Thévenot et al. 2015). Urine samples from 184
volunteers were analyzed by reversed-phase (C18) ultrahigh performance
liquid chromatography (UPLC) coupled to high-resolution mass
spectrometry (LTQ-Orbitrap). Raw data are publicly available on the
MetaboLights repository (MTBLS404).
This vignette describes the statistical analysis of the data set from the negative ionization mode (113 identified metabolites at MSI levels 1 or 2):
correcting
: Correction of the
signal drift by local regression (loess) modeling of the intensity trend
in pool samples (Dunn et al. 2011);
Adjustment of offset differences between the two analytical batches by
using the average of the pool intensities in each batch (Kloet et al. 2009)
Variable quality control by discarding features with a coefficient of variation above 30% in pool samples
Normalization by the sample osmolality
transforming
: log10
transformation
inspecting
: Computing metrics to
filter out outlier samples according to the Weighted Hotellings’T2
distance (Tenenhaus 1999), the Z-score of
one of the intensity distribution deciles (Alonso
et al. 2011), and the Z-score of the number of missing values
(Alonso et al. 2011). A 0.001 threshold
for all p-values results in the HU_096 sample being discarded
hypotesting
: Univariate hypothesis
testing of significant variations with age, BMI, or between genders
(Student’s T test with Benjamini Hochberg correction)
PCA exploration of the data set; ropls
Bioconductor package (Thévenot et al.
2015)
clustering
: Hierarchical
clustering
OPLS(-DA) modeling of age, BMI and gender; ropls
Bioconductor package (Thévenot et al.
2015)
Selection of the features which significantly contributes to the
discrimination between gender with PLS-DA, Random Forest, or Support
Vector Machines classifiers; biosigner
Bioconductor package (Rinaudo et al.
2016)
A Galaxy version of this analysis is available W4M00001 ‘Sacurine-statistics’ on the workflow4metabolomics.usegalaxy.fr online infrastructure (Guitton et al. 2017).
reading
: reading the dataThe reading
function reads the data
sets and builds the SummarizedExperiment
object. Additional
information on how to build and handle SummarizedExperiment
objects (as well as MultiAssayExperiment
,
ExpressionSet
, and MultiDataSet
) are provided
in the Appendix.
## class: SummarizedExperiment
## dim: 113 210
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(113): (2-methoxyethoxy)propanoic acid isomer
## (gamma)Glu-Leu/Ile ... Valerylglycine isomer 2 Xanthosine
## rowData names(10): database_identifier chemical_formula ...
## retention_time reliability
## colnames(210): QC1_001 HU_neg_017 ... HU_neg_146_b2 QC1_012_b2
## colData names(10): subset full ... bmi gender
inspecting
: looking at the dataThe inspecting
method provides a numerical and graphical
overview of the data. Furthermore, it computes quality metrics which may
subsequently be used to filter out some samples or variables.
Graphical overview. The data matrix is visualized with a color
gradient (top right) and the score plot of the Principal Component
Analysis is shown for the two first components (bottom right). The black
ellipse corresponds to the area of 95% of the samples, as computed with
the Hotelling test. For each sample the mean of variable intensities is
shown as a function of the injection order to detect any signal drift
and/or batch correction (bottom left). Note that for this coloring to be
displayed, the sampleType,
injectionOrder and batch columns
should be provided in the colData
of (each of) the
dataset(s). Finally, some metrics are computed regarding the proportion
of NAs, 0 values, intensity quantiles, and proportion of features with a
coefficient of variation in pool intensities < 30%.
Quality metrics (as additional column in the metadata):
samples (added in the colData
):
hotel_pval: p-value from the Hotelling’s T2 test in the first plane of the PC components
miss_pval: p-value associated to the z-score of the proportion of missing values (Alonso et al. 2011)
deci_pval: p-value associated to the z-score of intensity deciles (Alonso et al. 2011)
For each test, low p-values highlight samples with extreme behavior.
features (included in the
rowData
)
colData
from the individual datasets), variable metrics are
computed: sample, pool, and blank mean, sd and coefficient of variation
(if the corresponding types are present in the ‘sampleType’ column), as
well as ‘blank’ mean / ‘sample’ mean, and ‘pool’ CV / ‘sample’ CV
ratioThe filtering
method may be used to discard samples
and/or features with a high proportion of missing values and/or a
variance of 0. When filtering the features, sample class may be
specified (as the name a column from the sample metadata): in this case,
features will be kept when:
the proportion of missing values is below the threshold in at least one class
the variance is above 0 in both classes
correcting
: Correcting signal drift
and batch effectFor each feature, the signal drift and the batch effect are corrected
by using a normalization strategy which relies on the measurements of a
pooled (or QC) sample injected periodically: for each variable, a
regression model is fitted to the values of the pool and subsequently
used to adjust the intensities of the samples of interest (Kloet et al. 2009; Dunn et al. 2011). The
sample metadata of each datasets (e.g. colData
Data Frames)
must contain 3 columns:
sampleType (character): either ‘sample’, ‘blank’, or ‘pool’
injectionOrder (integer): order of injection in the instrument
batch (character): batch name
## Correction method: loess
## Reference observations: pool
## Processing batch:
## ne1
## ne2
inspecting
to compute the
‘pool_CV’ metric## class: SummarizedExperiment
## dim: 110 184
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(110): (2-methoxyethoxy)propanoic acid isomer
## (gamma)Glu-Leu/Ile ... Valerylglycine isomer 2 Xanthosine
## rowData names(14): database_identifier chemical_formula ... poolDil_cor
## poolDil_pval
## colnames(184): HU_neg_017 HU_neg_018 ... HU_neg_106_b2 HU_neg_146_b2
## colData names(15): subset full ... miss_pval deci_pval
The normalization of each sample by its ‘osmolality’ value does not
require any extra method and relies on the classical methods from the
SummarizedExperiment
package.
Note that a normalizing
method is available in the
phenomis
package to perform Probabilistic Quotient
Normalization (Dieterle et al. 2006).
transforming
: transforming the data
intensities## 'log10' transformation
sacurine.se <- inspecting(sacurine.se, figure.c = "none", report.c = "none")
sacurine.se <- sacurine.se[, colData(sacurine.se)[, "hotel_pval"] >= 0.001 &
colData(sacurine.se)[, "miss_pval"] >= 0.001 &
colData(sacurine.se)[, "deci_pval"] >= 0.001]
Final visual check of the data before performing the statistics:
hypotesting
: univariate hypothesis
testingThe hypotesting
method is a wrapper of the main R
functions for hypothesis testing and corrections for multiple testing.
The list of available tests include two sample tests (t-test and
Wilcoxon rank test, but also the limma test), analysis of variance (for
one and two factors) and Kruskal-Wallis rank test, and correlation tests
(by using either the pearson or the spearman correlation).
sacurine.se <- hypotesting(sacurine.se,
test.c = "ttest",
factor_names.vc = "gender",
adjust.c = "BH",
adjust_thresh.n = 0.05,
report.c = "none")
The phenomis
package can be conveniently complemented
with other packages for comprehensive statistical analysis and
annotation. In the rest of this tutorial, we illustrate how the ropls
,
biosigner
and biodb
Bioconductor packages can be included to perform multivariate modeling
(Thévenot et al. 2015), feature selection
(Rinaudo et al. 2016), and feature
annotation (Roger 2022). All these
packages support the SummarizedExperiment
and
MultiAssayExperiment
(but also ExpressionSet
and MultiDataSet
), to provide interoperability.
We first focus on unsupervised methods for multivariate analysis, by performing Principal Component Analysis and hierarchical clustering.
PCA can be performed by using the opls
function from the
ropls
package. It returns an updated object with scores and
loadings included as new columns in the metadata. The PCA model is
stored in the metadata from the SummarizedExperiment
and
can be accessed with the getOpls
method:
We now address the multivariate modeling of the response of interest (either age, bmi, or gender), by using Partial Least Squares modeling and feature selection.
Partial Least-Squares (PLS), which is a latent variable regression method based on covariance between the predictors and the response, has been shown to efficiently handle datasets with multi-collinear predictors, as in the case of spectrometry measurements (Wold, Sjostrom, and Eriksson 2001).
The opls
function from the ropls
package
can be used to perform Partial Least Squares modeling, by specifying the
response to be modeled as the second argument:
## PLS-DA
## 183 samples x 110 variables and 1 response
## standard scaling of predictors and response(s)
## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
## Total 0.273 0.73 0.57 0.261 3 0 0.05 0.05
The (O)PLS(-DA) can be obtained from the metadata, e.g. to perform specific plots:
The biosigner
approach is a wrapper method to select the
features that significantly contribute to the performance of a binary
classifier, namely PLS-DA, random forest, or Support Vector Machine
(Rinaudo et al. 2016). We refer the
interested reader to the vignette from the biosigner
package.
## Selecting features for the plsda model
## Selecting features for the randomforest model
## Selecting features for the svm model
## Significant features from 'S' groups:
## plsda randomforest svm
## p-Anisic acid "S" "S" "S"
## Testosterone glucuronide "S" "S" "S"
## Oxoglutaric acid "A" "S" "S"
## Pantothenic acid "A" "E" "S"
## Acetylphenylalanine "B" "E" "S"
## Malic acid "S" "E" "C"
## N2-Acetylaminoadipic acid "C" "E" "S"
## 2-Isopropylmalic acid "E" "E" "S"
## Methoxysalicylic acid isomer "E" "E" "S"
## N4-Acetylcytidine "E" "E" "S"
## Pyrroledicarboxylic acid "E" "E" "S"
## Taurine "E" "E" "S"
## Accuracy:
## plsda randomforest svm
## Full 0.867 0.826 0.888
## AS 0.898 0.848 0.910
## S 0.885 0.848 0.941
Please note that the bootI
parameter is set to 5 to
speed up the vignette building, but that we advice to keep the default
50 number of bootstraps to obtain more stable signatures.
annotating
: MS annotationThis method is based on the biodb package (and its companion biodbChebi), which provides access to standard remote chemical and biological databases (ChEBI, KEGG, HMDB, …), as well as to in-house local database files (CSV, SQLite), with easy retrieval of entries, access to web services, search of compounds by mass and/or name (Roger 2022).
Viewing the parameters from the
annotating
method and their default
values:
## Annotating parameters for the query of chebi:
## mode available default
## query.type character c('mz', 'chebi.id', 'kegg.id') mz
## query.col character mz
## ms.mode character c('pos', 'neg') pos
## mz.tol numeric 10
## mz.tol.unit character c('plain', 'ppm') ppm
## fields character see below see below
## fieldsLimit numeric 1
## max.results numeric 3
## prefix character chebi.
## sep character |
## 'fields' possible values:
## 'accession', 'charge', 'chebi.id', 'formula', 'inchi', 'inchikey', 'kegg.compound.id', 'molecular.mass', 'monoisotopic.mass', 'name', 'smiles'
## 'fields' default values:
## 'chebi.id', 'name', 'formula', 'monoisotopic.mass', 'kegg.compound.id'
Querying chemical annotation from the Chemical Entities of Biological Interest ChEBI database:
sacurine.se <- annotating(sacurine.se,
database.c = "chebi",
param.ls = list(query.type = "mz",
query.col = "mass_to_charge",
ms.mode = "neg",
mz.tol = 10,
mz.tol.unit = "ppm",
max.results = 3,
prefix = "chebiMZ."),
report.c = "none")
knitr::kable(head(rowData(sacurine.se)[, grep("chebiMZ",
colnames(rowData(sacurine.se)))]))
sacurine.se <- annotating(sacurine.se, database.c = "chebi",
param.ls = list(query.type = "chebi.id",
query.col = "database_identifier",
prefix = "chebiID."))
knitr::kable(head(rowData(sacurine.se)[, grep("chebiID",
colnames(rowData(sacurine.se)))]))
Adding chemical information from a local database:
msdbDF <- read.table(system.file("extdata/local_ms_db.tsv", package = "phenomis"),
header = TRUE,
sep = "\t",
stringsAsFactors = FALSE)
sacurine.se <- annotating(sacurine.se,
database.c = "local.ms",
param.ls = list(query.type = "mz",
query.col = "mass_to_charge",
ms.mode = "neg",
mz.tol = 5,
mz.tol.unit = "ppm",
local.ms.db = msdbDF,
prefix = "localMS."),
report.c = "none")
## INFO [04:01:09.693] Loading definitions from package biodb version 1.15.0.
## INFO [04:01:09.862] Loading definitions from package biodbChebi version 1.13.0.
## INFO [04:01:10.797] Closing BiodbMain instance...
## INFO [04:01:10.798] Connector "mass.csv.file" deleted.
knitr::kable(rowData(sacurine.se)[!is.na(rowData(sacurine.se)[, "localMS.accession"]), grep("localMS.", colnames(rowData(sacurine.se)), fixed = TRUE)])
localMS.accession | localMS.formula | localMS.mass.csv.file.id | localMS.molecular.mass | localMS.ms.mode | localMS.name | localMS.peak.mz | localMS.peak.mztheo | |
---|---|---|---|---|---|---|---|---|
Citric acid | KNA00550 | C6H8O7 | KNA00550 | 192.027 | neg | Citric acid | 191.019252 | 191.019252 |
Malic acid | KNA00527 | C4H6O5 | KNA00527 | 134.02152 | neg | L-Malic acid | 133.014168 | 133.014168 |
Taurine | KNA00479 | C2H7NO3S | KNA00479 | 125.01466 | neg | Taurine | 124.00727 | 124.00727 |
Tryptophan | KNA00507 | C11H12N2O2 | KNA00507 | 204.08988 | neg | L-Tryptophan | 203.082131 | 203.082131 |
writing
: Exporting the resultsFinally, the writing
method can be used to export the
SummarizedExperiment
object in the 3 tabular file format.
In case of a MultiAssayExperiment
, one subfolder containing
the 3 files will be created for each dataset.
The main features from the phenomis
package can also be
accessed via a graphical user interface in the Quality
Metrics, Batch Correction,
Univariate, and Heatmap modules from
the workflow4metabolomics.usegalaxy.fr
online resource for computational metabolomics, based on the Galaxy
environment.
The MultiAssayExperiment
format is useful to handle
multi-omics datasets (Ramos et
al. 2017). The phenomis
package (as well as the
ropls
and biosigner
packages) can be used to
process the data sets in parallel, such as all methods described before
can be directly applied to the MultiAssayExperiment
object.
In several methods, specific parameters for each dataset can be
passed as a vector: for instance, with two datasets, if one wishes to
log2 transform the intensities of the first dataset but not the second
one, the method.vc = c("log2", "none")
parameter can be
used with the transforming
method (in contrast, if only one
value is provided, it will be used for all datasets).
Here we apply phenomis
statistical methods to the
ProMetIS study , which addresses the molecular phenotyping of knock-out
mouse models by combined proteomics and metabolomics analysis (Imbert et al. 2021). The phenomis
dataset has been selected from the original data (publicly available on
GitHub: IFB-ElixirFr/ProMetIS)
as follows:
liver data
WT and KO mice for the LAT gene (Linker for Activation of T cells) only
the proteomics dataset and the metabolomics dataset obtained with the ZIC-pHILIC chromatographic column
post-processed data only
100 features only
The multi-omics dataset is saved as text files and can be loaded with
the reading
function as a MultiAssayExperiment
(or as a MultiDataSet
with the
output.c = "set"
argument)
## Reading the 'metabo' dataset...
## Reading the 'proteo' dataset...
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] metabo: SummarizedExperiment with 100 rows and 28 columns
## [2] proteo: SummarizedExperiment with 100 rows and 28 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
Let us have a look at the (common) sample metadata:
## DataFrame with 6 rows and 3 columns
## gene mouse_id sex
## <character> <integer> <character>
## L818f LAT 818 F
## L819f LAT 819 F
## L824m LAT 824 M
## L825m LAT 825 M
## L826m LAT 826 M
## L827m LAT 827 M
We first explore the datasets:
We then perform univariate hypothesis testing with the
limma
approach (Ritchie et al.
2015):
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
As with single omics studies, the ropls
and
biosigner
can be directly applied to
MultiAssayExperiment
for multivariate analysis:
##
##
## Building the model for the 'metabo' dataset:
## PLS-DA
## 28 samples x 100 variables and 1 response
## standard scaling of predictors and response(s)
## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
## Total 0.331 0.941 0.859 0.128 2 0 0.05 0.05
##
##
## Building the model for the 'proteo' dataset:
## PLS-DA
## 28 samples x 100 variables and 1 response
## standard scaling of predictors and response(s)
## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
## Total 0.137 0.677 0.458 0.294 1 0 0.05 0.05
## Warning: Single component model: only 'overview' and 'permutation' (in case of
## single response (O)PLS(-DA)) plots available
and feature selection:
##
##
## Selecting the features for the 'metabo' dataset:
## Selecting features for the plsda model
## Selecting features for the randomforest model
## Selecting features for the svm model
## Significant features from 'S' groups:
## plsda randomforest svm
## ADP "B" "S" "S"
## GDP-L-fucose "S" "E" "S"
## 9H-xanthine "B" "S" "E"
## deoxycholic acid "E" "E" "S"
## taurocholic acid "E" "E" "S"
## 3-phenylpropionic acid "E" "E" "S"
## 4-oxo-4-(pyridin-3-yl)butanoic acid "E" "E" "S"
## ferulic acid "E" "E" "S"
## glycocholate "E" "E" "S"
## UDP-alpha-D-galactose "E" "E" "S"
## Accuracy:
## plsda randomforest svm
## Full 0.980 0.949 0.963
## AS 0.912 0.918 0.952
## S 0.830 0.933 0.980
##
##
## Selecting the features for the 'proteo' dataset:
## Selecting features for the plsda model
## Selecting features for the randomforest model
## Selecting features for the svm model
## No significant variable found for the selected classifier(s): 'plsda', 'randomforest', 'svm'
The final MultiAssayExperiment
can be exported as text
files with:
The CLL
package contains the chronic lymphocytic
leukemia (CLL) gene expression data from 24 samples, that were either
classified as progressive or stable in regards to disease
progression.
We first load the data (which are stored as an
ExpressionSet
object), and restrict to the first 1,000
features (to speed up computations):
We explore the data:
Investigate whether clusters can be observed (we first add the letter ‘p’ or ‘s’ in the sample names to indicate the phenotype: p = progressive, s = stable):
Biobase::sampleNames(cll.eset) <- paste0(Biobase::sampleNames(cll.eset),
"_",
substr(Biobase::pData(cll.eset)[, "Disease"], 1, 1))
cll.eset <- clustering(cll.eset)
and perform hypothesis testing with the limma test (the dots must be removed from the ‘progress.’ phenotype):
Biobase::pData(cll.eset)[, "Disease"] <- gsub(".", "",
Biobase::pData(cll.eset)[, "Disease"],
fixed = TRUE)
cll.eset <- hypotesting(cll.eset, "limma", "Disease")
All the subsequent multivariate analysis and feature selection steps
described with the sacurine
data set can be performed with
the cll.eset
accordingly.
We provide an example based on the NCI60_4arrays
cancer
data set from the omicade4
package (Meng et al. 2014), which has been made
available in the MultiAssayExperiment
format in the
ropls
package. The data consist of gene transcript
expression analysis from NCI-60 cell lines by using 4 microarray
platforms (Reinhold et al. 2012).
Getting the NCI60
data set as a
MultiAssayExperiment
:
Let’s focus on the LEukemia and MElanoma cancer types, and the agilent and hgu95 arrays:
##
## BR CN CO LC LE ME OV PR RE
## 5 6 7 9 6 10 7 2 8
We can now e.g. perform clustering on each data set independently:
and/or perform hypothesis testing on each data set:
as well as all the analyses described above for the
prometis
data set.
SummarizedExperiment
The SummarizedExperiment
format has been designed by the
Bioconductor team to handle (single) omics datasets (Morgan et al. 2022).
An example of SummarizedExperiment
creation and a
summary of useful methods are provided below.
Please refer to package
vignettes for a full description of
SummarizedExperiment
objects .
# Preparing the data (matrix) and sample and variable metadata (data frames):
data(sacurine, package = "ropls")
data.mn <- sacurine[["dataMatrix"]] # matrix: samples x variables
samp.df <- sacurine[["sampleMetadata"]] # data frame: samples x sample metadata
feat.df <- sacurine[["variableMetadata"]] # data frame: features x feature metadata
# Creating the SummarizedExperiment (package SummarizedExperiment)
library(SummarizedExperiment)
sac.se <- SummarizedExperiment(assays = list(sacurine = t(data.mn)),
colData = samp.df,
rowData = feat.df)
# note that colData and rowData main format is DataFrame, but data frames are accepted when building the object
stopifnot(validObject(sac.se))
# Viewing the SummarizedExperiment
# ropls::view(sac.se)
Methods | Description | Returned class |
---|---|---|
Constructors | ||
SummarizedExperiment |
Create a SummarizedExperiment object | SummarizedExperiment |
makeSummarizedExperimentFromExpressionSet |
SummarizedExperiment |
|
Accessors | ||
assayNames |
Get or set the names of assay() elements | character |
assay |
Get or set the ith (default first) assay element | matrix |
assays |
Get the list of experimental data numeric matrices | SimpleList |
rowData |
Get or set the row data (features) | DataFrame |
colData |
Get or set the column data (samples) | DataFrame |
metadata |
Get or set the experiment data | list |
dim |
Get the dimensions (features of interest x samples) | integer |
dimnames |
Get or set the dimension names | list of length 2 character/NULL |
rownames |
Get the feature names | character |
colnames |
Get the sample names | character |
Conversion | ||
as(, "SummarizedExperiment") |
Convert an ExpressionSet to a SummarizedExperiment | SummarizedExperiment |
MultiAssayExperiment
The MultiAssayExperiment
format has been designed by the
Bioconductor team to handle multiomics datasets (Ramos et al. 2017).
An example of MultiAssayExperiment
creation and a
summary of useful methods are provided below. Please refer to package
vignettes or to the original publication for a full description of
MultiAssayExperiment
objects (Ramos
et al. 2017).
Loading the NCI60_4arrays
from the omicade4
package:
Creating the MultiAssayExperiment
:
library(MultiAssayExperiment)
# Building the individual SummarizedExperiment instances
experiment.ls <- list()
sampleMap.ls <- list()
for (set.c in names(NCI60_4arrays)) {
# Getting the data and metadata
assay.mn <- as.matrix(NCI60_4arrays[[set.c]])
coldata.df <- data.frame(row.names = colnames(assay.mn),
.id = colnames(assay.mn),
stringsAsFactors = FALSE) # the 'cancer' information will be stored in the main colData of the mae, not the individual SummarizedExperiments
rowdata.df <- data.frame(row.names = rownames(assay.mn),
name = rownames(assay.mn),
stringsAsFactors = FALSE)
# Building the SummarizedExperiment
assay.ls <- list(se = assay.mn)
names(assay.ls) <- set.c
se <- SummarizedExperiment(assays = assay.ls,
colData = coldata.df,
rowData = rowdata.df)
experiment.ls[[set.c]] <- se
sampleMap.ls[[set.c]] <- data.frame(primary = colnames(se),
colname = colnames(se)) # both datasets use identical sample names
}
sampleMap <- listToMap(sampleMap.ls)
# The sample metadata are stored in the colData data frame (both datasets have the same samples)
stopifnot(identical(colnames(NCI60_4arrays[[1]]),
colnames(NCI60_4arrays[[2]])))
sample_names.vc <- colnames(NCI60_4arrays[[1]])
colData.df <- data.frame(row.names = sample_names.vc,
cancer = substr(sample_names.vc, 1, 2))
nci.mae <- MultiAssayExperiment(experiments = experiment.ls,
colData = colData.df,
sampleMap = sampleMap)
stopifnot(validObject(nci.mae))
Methods | Description | Returned class |
---|---|---|
Constructors | ||
MultiAssayExperiment |
Create a MultiAssayExperiment object | MultiAssayExperiment |
ExperimentList |
Create an ExperimentList from a List or list | ExperimentList |
Accessors | ||
colData |
Get or set data that describe the patients/biological units | DataFrame |
experiments |
Get or set the list of experimental data objects as original classes | experimentList |
assays |
Get the list of experimental data numeric matrices | SimpleList |
assay |
Get the first experimental data numeric matrix | matrix , matrix-like |
sampleMap |
Get or set the map relating observations to subjects | DataFrame |
metadata |
Get or set additional data descriptions | list |
rownames |
Get row names for all experiments | CharacterList |
colnames |
Get column names for all experiments | CharacterList |
Subsetting | ||
mae[i, j, k] |
Get rows, columns, and/or experiments | MultiAssayExperiment |
mae[i, ,] |
i: GRanges, character, integer, logical, List, list | MultiAssayExperiment |
mae[,j,] |
j: character, integer, logical, List, list | MultiAssayExperiment |
mae[,,k] |
k: character, integer, logical | MultiAssayExperiment |
mae[[i]] |
Get or set object of arbitrary class from experiments | (Varies) |
mae[[i]] |
Character, integer, logical | |
mae$column |
Get or set colData column | vector (varies) |
Management | ||
complete.cases |
Identify subjects with complete data in all experiments | vector (logical) |
duplicated |
Identify subjects with replicate observations per experiment | list of LogicalLists |
mergeReplicates |
Merge replicate observations within each experiment | MultiAssayExperiment |
intersectRows |
Return features that are present for all experiments | MultiAssayExperiment |
intersectColumns |
Return subjects with data available for all experiments | MultiAssayExperiment |
prepMultiAssay |
Troubleshoot common problems when constructing main class | list |
Reshaping | ||
longFormat |
Return a long and tidy DataFrame with optional colData columns | DataFrame |
wideFormat |
Create a wide DataFrame, one row per subject | DataFrame |
Combining | ||
c |
Concatenate an experiment | MultiAssayExperiment |
Viewing | ||
upsetSamples |
Generalized Venn Diagram analog for sample membership | upset |
Exporting | ||
exportClass |
Exporting to flat files | csv or tsv files |
ExpressionSet
The ExpressionSet
format (Biobase
package)
was designed by the Bioconductor team as the first format to handle
(single) omics data. It has now been supplanted by the
SummarizedExperiment
format.
An example of ExpressionSet
creation and a summary of
useful methods are provided below. Please refer to ‘An
introduction to Biobase and ExpressionSets’ from the documentation
from the Biobase
package for a full description of ExpressionSet
objects.
# Preparing the data (matrix) and sample and variable metadata (data frames):
data(sacurine, package = "ropls")
data.mn <- sacurine[["dataMatrix"]] # matrix: samples x variables
samp.df <- sacurine[["sampleMetadata"]] # data frame: samples x sample metadata
feat.df <- sacurine[["variableMetadata"]] # data frame: features x feature metadata
# Creating the ExpressionSet (package Biobase)
sac.set <- Biobase::ExpressionSet(assayData = t(data.mn))
Biobase::pData(sac.set) <- samp.df
Biobase::fData(sac.set) <- feat.df
stopifnot(validObject(sac.set))
# Viewing the ExpressionSet
# ropls::view(sac.set)
Methods | Description | Returned class |
---|---|---|
exprs |
‘variable times samples’ numeric matrix | matrix |
pData |
sample metadata | data.frame |
fData |
variable metadata | data.frame |
sampleNames |
sample names | character |
featureNames |
variable names | character |
dims |
2x1 matrix of ‘Features’ and ‘Samples’ dimensions | matrix |
varLabels |
Column names of the sample metadata data frame | character |
fvarLabels |
Column names of the variable metadata data frame | character |
MultiDataSet
Loading the NCI60_4arrays
from the omicade4
package:
Creating the MultiDataSet
:
library(MultiDataSet)
# Creating the MultiDataSet instance
nci.mds <- MultiDataSet::createMultiDataSet()
# Adding the two datasets as ExpressionSet instances
for (set.c in names(NCI60_4arrays)) {
# Getting the data
expr.mn <- as.matrix(NCI60_4arrays[[set.c]])
pdata.df <- data.frame(row.names = colnames(expr.mn),
cancer = substr(colnames(expr.mn), 1, 2),
stringsAsFactors = FALSE)
fdata.df <- data.frame(row.names = rownames(expr.mn),
name = rownames(expr.mn),
stringsAsFactors = FALSE)
# Building the ExpressionSet
eset <- Biobase::ExpressionSet(assayData = expr.mn,
phenoData = new("AnnotatedDataFrame",
data = pdata.df),
featureData = new("AnnotatedDataFrame",
data = fdata.df),
experimentData = new("MIAME",
title = set.c))
# Adding to the MultiDataSet
nci.mds <- MultiDataSet::add_eset(nci.mds, eset, dataset.type = set.c,
GRanges = NA, warnings = FALSE)
}
stopifnot(validObject(nci.mds))
Methods | Description | Returned class |
---|---|---|
Constructors | ||
createMultiDataSet |
Create a MultiDataSet object | MultiDataSet |
add_eset |
Create a MultiAssayExperiment object | MultiDataSet |
Subsetting | ||
mset[i, ] |
i: character,logical (samples to select) | MultiDataSet |
mset[, k] |
k: character (names of datasets to select) | MultiDataSet |
mset[[k]] |
k: character (name of the datast to select) | ExpressionSet |
Accessors | ||
as.list |
Get the list of data matrices | list |
pData |
Get the list of sample metadata | list |
fData |
Get the list of variable metadata | list |
sampleNames |
Get the list of sample names | list |
Management | ||
commonSamples |
Select samples that are present in all datasets | MultiDataSet |
Conversion | ||
mds2mae |
Convert a MultiDataSet to a MultiAssayExperiment | MultiAssayExperiment |
Here is the output of sessionInfo()
on the system on
which this document was compiled:
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] MultiDataSet_1.35.0 MultiAssayExperiment_1.33.0
## [3] biodb_1.15.0 phenomis_1.9.0
## [5] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [7] GenomicRanges_1.59.1 GenomeInfoDb_1.43.1
## [9] IRanges_2.41.1 S4Vectors_0.45.2
## [11] BiocGenerics_0.53.3 generics_0.1.3
## [13] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [15] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 formatR_1.14 rlang_1.1.4
## [4] magrittr_2.0.3 PMCMRplus_1.9.12 e1071_1.7-16
## [7] compiler_4.4.2 RSQLite_2.3.8 BWStest_0.2.3
## [10] vctrs_0.6.5 kSamples_1.2-10 stringr_1.5.1
## [13] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
## [16] dbplyr_2.5.0 XVector_0.47.0 labeling_0.4.3
## [19] utf8_1.2.4 ropls_1.39.0 rmarkdown_2.29
## [22] UCSC.utils_1.3.0 purrr_1.0.2 bit_4.5.0
## [25] xfun_0.49 randomForest_4.7-1.2 zlibbioc_1.52.0
## [28] cachem_1.1.0 Rmpfr_1.0-0 jsonlite_1.8.9
## [31] progress_1.2.3 SuppDists_1.1-9.8 blob_1.2.4
## [34] gmp_0.7-5 DelayedArray_0.33.2 chk_0.9.2
## [37] prettyunits_1.2.0 R6_2.5.1 bslib_0.8.0
## [40] stringi_1.8.4 RColorBrewer_1.1-3 ranger_0.17.0
## [43] limma_3.63.2 jquerylib_0.1.4 Rcpp_1.0.13-1
## [46] knitr_1.49 BiocBaseUtils_1.9.0 VennDiagram_1.7.3
## [49] igraph_2.1.1 Matrix_1.7-1 tidyselect_1.2.1
## [52] abind_1.4-8 yaml_2.3.10 curl_6.0.1
## [55] lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
## [58] withr_3.0.2 askpass_1.2.1 evaluate_1.0.1
## [61] lambda.r_1.2.4 proxy_0.4-27 futile.logger_1.4.3
## [64] BiocFileCache_2.15.0 pillar_1.9.0 biodbChebi_1.13.0
## [67] BiocManager_1.30.25 filelock_1.0.3 plotly_4.10.4
## [70] hms_1.1.3 ggplot2_3.5.1 munsell_0.5.1
## [73] scales_1.3.0 calibrate_1.7.7 class_7.3-22
## [76] glue_1.8.0 lazyeval_0.2.2 maketools_1.3.1
## [79] tools_4.4.2 data.table_1.16.2 sys_3.4.3
## [82] buildtools_1.0.0 mvtnorm_1.3-2 XML_3.99-0.17
## [85] grid_4.4.2 tidyr_1.3.1 colorspace_2.1-1
## [88] GenomeInfoDbData_1.2.13 cli_3.6.3 rappdirs_0.3.3
## [91] futile.options_1.0.1 fansi_1.0.6 viridisLite_0.4.2
## [94] S4Arrays_1.7.1 dplyr_1.1.4 gtable_0.3.6
## [97] sass_0.4.9 digest_0.6.37 ggrepel_0.9.6
## [100] SparseArray_1.7.2 farver_2.1.2 htmlwidgets_1.6.4
## [103] lgr_0.4.4 memoise_2.0.1 htmltools_0.5.8.1
## [106] lifecycle_1.0.4 biosigner_1.35.0 httr_1.4.7
## [109] multcompView_0.1-10 statmod_1.5.0 openssl_2.2.2
## [112] qqman_0.1.9 bit64_4.5.2 MASS_7.3-61
reading
:
reading the datainspecting
:
looking at the datahypotesting
:
univariate hypothesis testingannotating
:
MS annotationwriting
:
Exporting the results