The qmtools package provides basic tools for imputation, normalization, and dimension reduction of metabolomics data stored in the standard SummarizedExperiment class. It also offers several helper functions to assist with data visualization. This vignette gives brief descriptions of these tools with toy examples.
The package can be installed using BiocManager. In an R session, type BiocManager::install("qmtools").
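For example, a complete installation from within R, assuming BiocManager itself is not yet installed, might look like the following sketch:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("qmtools")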
To demonstrate the use of the qmtools
functions, we will
use the FAAH
knockout LC/MS data, containing quantified LC/MS peaks from the
spinal cords of 6 wild-type and 6 FAAH (fatty acid amide hydrolase)
knockout mice.
library(qmtools)
library(SummarizedExperiment)
library(vsn)
library(pls)
library(ggplot2)
library(patchwork)
set.seed(1e8)
data(faahko_se)
## Only keep the first assay for the vignette
assays(faahko_se)[2:4] <- NULL
faahko_se
#> class: SummarizedExperiment
#> dim: 206 12
#> metadata(0):
#> assays(1): raw
#> rownames(206): FT001 FT002 ... FT205 FT206
#> rowData names(10): mzmed mzmin ... WT peakidx
#> colnames(12): ko15.CDF ko16.CDF ... wt21.CDF wt22.CDF
#> colData names(2): sample_name sample_group
Metabolomics data often contains a large number of uninformative
features that can hinder downstream analysis. The
removeFeatures
function attempts to identify such features
and remove them from the data based on missing values, quality control
(QC) replicates, and blank samples with the following methods:
Proportions of missing values: retain features if there is at least one group with a proportion of non-missing values above a cut-off.
Relative standard deviation: remove features if QC replicates show low reproducibility.
Intraclass correlation coefficient (ICC): retain a feature if it shows relatively high variability across biological samples compared to QC replicates.
QC/blank ratio: remove features with low abundance that may have non-biological origin.
The FAAH knockout data does not include QC and blank samples. Here, we just illustrate missing value-based filtering.
dim(faahko_se) # 206 features
#> [1] 206 12
table(colData(faahko_se)$sample_group)
#>
#> KO WT
#> 6 6
## Missing value filter based on 2 groups.
dim(removeFeatures(faahko_se, i = "raw",
                   group = colData(faahko_se)$sample_group,
                   cut = 0.80)) # nothing removed
#> [1] 206 12
dim(removeFeatures(faahko_se, i = "raw",
                   group = colData(faahko_se)$sample_group,
                   cut = 0.85)) # removed 65 features
#> [1] 141 12
## based on "WT" only
dim(removeFeatures(faahko_se, i = "raw",
                   group = colData(faahko_se)$sample_group,
                   levels = "WT", cut = 0.85))
#> [1] 104 12
In this vignette, we kept all features based on the cut-off: at least one group contains >= 80% of non-missing values.
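To make the criterion concrete, the group-wise proportions of non-missing values underlying this filter can be computed directly in base R; the following is only a sketch of the logic, not the removeFeatures implementation:
mat <- assay(faahko_se, "raw")
grp <- colData(faahko_se)$sample_group
## Proportion of non-missing values per feature within each sample group
prop_nonmiss <- sapply(unique(grp), function(lev) {
    rowMeans(!is.na(mat[, grp == lev, drop = FALSE]))
})
## Retain a feature if at least one group exceeds the 80% cut-off
keep <- apply(prop_nonmiss, 1, max) >= 0.80
table(keep)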
Missing values are common in metabolomics data. For example, ions may have a low abundance that does not reach the limit of detection of the instrument. Unexpected stochastic fluctuations and technical errors may also cause missing values even when ions are present at detectable levels. We can use the plotMiss function to explore the mechanisms generating the missing values. The bar plot on the left panel shows the amount of missing values in each sample, and the right panel helps to identify the structure of missing values with a hierarchically-clustered heatmap.
## Sample group information
g <- factor(colData(faahko_se)$sample_group, levels = c("WT", "KO"))
## Visualization of missing values
plotMiss(faahko_se, i = "raw", group = g)
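The per-sample percentages summarized in the left panel can also be checked numerically with a quick base R sketch:
## Percentage of missing values in each sample
round(colMeans(is.na(assay(faahko_se, "raw"))) * 100, 1)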
Overall, the knockout mice have a higher percentage of missing values. The features at the top of the heatmap are generally present only in the knockout mice, suggesting that at least some of the missing values are not random (perhaps due to metabolic alterations caused by the experimental condition). Visualization and inspection of missing values are almost always a time-intensive step, but they greatly improve the ability to uncover the nature of missing values in the data and help to choose an appropriate imputation method.
The imputation of missing values can be done with the imputeIntensity function. Several imputation methods are available, such as k-Nearest Neighbors (kNN), Random Forest (RF), Bayesian PCA, and other methods available in MsCoreUtils. By default, kNN is used to impute missing values using the Gower distance. kNN is a distance-based algorithm that typically requires scaling the data to avoid variance-based weighting. Since the Gower distance is used, the imputation can be performed on the original scales, which may be helpful to non-technical users.
se <- imputeIntensity(faahko_se, i = "raw", name = "knn", method = "knn")
se # The result was stored in assays slot: "knn"
#> class: SummarizedExperiment
#> dim: 206 12
#> metadata(0):
#> assays(2): raw knn
#> rownames(206): FT001 FT002 ... FT205 FT206
#> rowData names(10): mzmed mzmin ... WT peakidx
#> colnames(12): ko15.CDF ko16.CDF ... wt21.CDF wt22.CDF
#> colData names(2): sample_name sample_group
## Standardization of input does not influence the result
m <- assay(faahko_se, "raw")
knn_scaled <- as.data.frame(
    imputeIntensity(scale(m), method = "knn") # Can accept matrix as an input
)
knn_unscaled <- as.data.frame(assay(se, "knn"))
idx <- which(is.na(m[, 1]) | is.na(m[, 2])) # indices for missing values
p1 <- ggplot(knn_unscaled[idx, ], aes(x = ko15.CDF, y = ko16.CDF)) +
    geom_point() + theme_bw()
p2 <- ggplot(knn_scaled[idx, ], aes(x = ko15.CDF, y = ko16.CDF)) +
    geom_point() + theme_bw()
p1 + p2 + plot_annotation(title = "Imputed values: unscaled vs scaled")
In metabolomics, normalization is an important part of data processing to reduce unwanted non-biological variation (e.g., variation due to sample preparation and handling). The normalizeIntensity function provides several data-driven normalization methods such as Probabilistic Quotient Normalization (PQN), Variance-Stabilizing Normalization (VSN), Cyclic LOESS normalization, and other methods available in MsCoreUtils. Here, we will apply the VSN to the imputed intensities. Note that the VSN produces glog-transformed (generalized logarithm) feature intensities. The effect of normalization can be visualized with the plotBox function.
se <- normalizeIntensity(se, i = "knn", name = "knn_vsn", method = "vsn")
se # The result was stored in assays slot: "knn_vsn"
#> class: SummarizedExperiment
#> dim: 206 12
#> metadata(0):
#> assays(3): raw knn knn_vsn
#> rownames(206): FT001 FT002 ... FT205 FT206
#> rowData names(10): mzmed mzmin ... WT peakidx
#> colnames(12): ko15.CDF ko16.CDF ... wt21.CDF wt22.CDF
#> colData names(2): sample_name sample_group
p1 <- plotBox(se, i = "knn", group = g, log2 = TRUE) # before normalization
p2 <- plotBox(se, i = "knn_vsn", group = g) # after normalization
p1 + p2 + plot_annotation(title = "Before vs After normalization")
Metabolomics data generally consist of a large number of features, and dimension-reduction techniques are often used for modeling and visualization to uncover the latent structure underlying many features. The reduceFeatures function can be used to perform dimension reduction of the data. Currently, Principal Component Analysis (PCA), Partial Least Squares-Discriminant Analysis (PLS-DA), and t-distributed Stochastic Neighbor Embedding (t-SNE) are supported. The function returns a matrix containing the dimension-reduced data with several attributes that can be summarized with the summary function.
## PCA
m_pca <- reduceFeatures(se, i = "knn_vsn", method = "pca", ncomp = 2)
summary(m_pca)
#> Reduction method: PCA (SVD)
#> Input centered before PCA: TRUE
#> Input scaled before PCA: FALSE
#> Number of PCs calculated: 2
#> Importance of PC(s):
#> PC1 PC2
#> Proportion of Variance 0.2265 0.1599
#> Cumulative Proportion 0.2265 0.3865
## PLS-DA (requires information about each sample's group)
m_plsda <- reduceFeatures(se, i = "knn_vsn", method = "plsda", ncomp = 2, y = g)
summary(m_plsda)
#> Reduction method: PLS-DA (kernelpls)
#> Y responses: WT KO
#> Input centered before PLS-DA: TRUE
#> Input scaled before PLS-DA: FALSE
#> Number of components considered: 2
#> Amount of X variance explained by each component:
#> Comp 1 Comp 2
#> Explained % 21.64 14.02
#> Cumulative % 21.64 35.65
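t-SNE, the third supported method, can be run in the same way. The call below is a sketch: the perplexity argument is an assumption that extra arguments are forwarded to the underlying t-SNE implementation, lowered here only because of the small number of samples.
## t-SNE (sketch; perplexity assumed to be passed through)
m_tsne <- reduceFeatures(se, i = "knn_vsn", method = "tsne", perplexity = 3)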
The dimension-reduction results can be plotted with the plotReduced function. Each point (label) represents a sample, and data ellipses can be visualized.
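For example, the PCA and PLS-DA results from above could be displayed side by side. This is a sketch: the group, label, and ellipse arguments are assumptions about the plotting options described in the text.
p1 <- plotReduced(m_pca, group = g)                 # points colored by group
p2 <- plotReduced(m_plsda, group = g, label = TRUE, # sample labels
                  ellipse = TRUE)                   # data ellipses
p1 + p2 + plot_annotation(title = "PCA vs PLS-DA")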
For soft ionization methods such as LC/ESI-MS, multiple ion species can be generated from an individual compound upon ionization. Because we are typically interested in compounds rather than individual ion species, identifying features originating from the same compound is necessary. The clusterFeatures function attempts to cluster metabolic features with the following steps:
Clusters features according to their retention times
Based on the initial grouping, clusters features according to the intensity correlations
After the clustering procedures, the function adds the rtime_group and feature_group columns to the rowData of the SummarizedExperiment input.
se <- clusterFeatures(se, i = "knn_vsn", rtime_var = "rtmed",
                      rt_cut = 10, cor_cut = 0.7)
rowData(se)[, c("rtmed", "rtime_group", "feature_group")]
#> DataFrame with 206 rows and 3 columns
#> rtmed rtime_group feature_group
#> <numeric> <factor> <character>
#> FT001 2789.04 FG.01 FG.01.01
#> FT002 2788.31 FG.01 FG.01.02
#> FT003 2718.79 FG.02 FG.02.01
#> FT004 3024.33 FG.03 FG.03.01
#> FT005 3684.80 FG.04 FG.04.01
#> ... ... ... ...
#> FT202 3001.79 FG.14 FG.14.03
#> FT203 3714.93 FG.24 FG.24.07
#> FT204 3403.03 FG.50 FG.50.04
#> FT205 3614.09 FG.49 FG.49.05
#> FT206 3817.51 FG.41 FG.41.02
By default, the retention time-based grouping is performed with hierarchical clustering based on the Manhattan distance (i.e., differences in retention times). The equivalent steps are:
rts <- rowData(se)$rtmed
rt_cut <- 10
fit <- hclust(dist(rts, "manhattan"))
plot(as.dendrogram(fit), leaflab = "none")
rect.hclust(fit, h = rt_cut)
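Cutting the tree at the same height gives the actual group assignments (integer labels analogous to the rtime_group factor):
rt_groups <- cutree(fit, h = rt_cut)
head(rt_groups)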
The retention time-based grouping can also be conducted with the algorithms (groupClosest and groupConsecutive) available in the MsFeatures package.
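As a sketch, these functions can be applied directly to the retention times, assuming their maxDiff argument plays the role of the retention time cut-off used above:
library(MsFeatures)
## Alternative retention time groupings
rt_closest <- groupClosest(rts, maxDiff = 10)
rt_consec <- groupConsecutive(rts, maxDiff = 10)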
Upon the initial grouping, each retention time-based group is further clustered according to the intensity correlations, since features may originate from different co-eluting compounds rather than from a single entity. By default, the function creates a graph where correlations serve as edge weights, while correlations below a user-specified cut-off are ignored. cor_grouping = "connected" simply assigns connected features to the same feature group, whereas cor_grouping = "louvain" further applies the Louvain algorithm to the graph to identify densely connected features. The groupSimilarityMatrix approach from the MsFeatures package is also supported.
The feature clustering results can be visualized with the plotRTgroup function. Features in the same feature group are displayed with the same color. Each vertex represents a feature and each edge weight represents a correlation between features.
se_connected <- clusterFeatures(se, i = "knn_vsn", rtime_var = "rtmed",
                                rt_cut = 10, cor_cut = 0.7,
                                cor_grouping = "connected")
plotRTgroup(se_connected, i = "knn_vsn", group = "FG.22")
se_louvain <- clusterFeatures(se, i = "knn_vsn", rtime_var = "rtmed",
                              rt_cut = 10, cor_cut = 0.7,
                              cor_grouping = "louvain")
plotRTgroup(se_louvain, i = "knn_vsn", group = "FG.22")
More details can be plotted by specifying type = "pairs".
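For instance, the pairwise intensity plots for the same feature group can be requested with:
plotRTgroup(se_connected, i = "knn_vsn", group = "FG.22", type = "pairs")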
The clustering results can be used to address the redundancy of the data with other packages such as QFeatures (aggregation of intensities) and InterpretMSSpectrum (adduct annotation).
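As an illustration of the first point, a minimal sketch of aggregating intensities within each feature group via QFeatures might look as follows; the conversion and aggregation calls are assumptions based on the QFeatures interface, not part of qmtools:
library(QFeatures)
## Wrap the SummarizedExperiment and aggregate features by feature_group
qf <- QFeatures(list(knn_vsn = se))
qf <- aggregateFeatures(qf, i = "knn_vsn", fcol = "feature_group",
                        name = "compounds", fun = colMedians)
qf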
To test which metabolic features are different between two sets of
samples, the compareSamples
function provides a convenient
way to compute empirical Bayes statistics using the limma
package interface. Note that this function expects log-transformed
feature intensities.
## Compute statistics for the contrast: KO - WT
fit <- compareSamples(se, i = "knn_vsn", group = "sample_group",
                      class1 = "WT", class2 = "KO")
## List top 5 features
head(fit, 5)
#> logFC CI.L CI.R AveExpr t P.Value adj.P.Val
#> FT042 3.153118 2.770529 3.535707 20.43317 17.814931 1.771892e-10 3.650097e-08
#> FT039 3.980802 3.403927 4.557677 19.87720 14.916453 1.592198e-09 1.639964e-07
#> FT063 2.217181 1.736336 2.698026 19.20748 9.967198 1.962654e-07 1.347689e-05
#> FT044 1.945845 1.269865 2.621825 18.89815 6.222304 3.185040e-05 1.640296e-03
#> FT098 2.575125 1.559026 3.591224 18.83681 5.478216 1.081528e-04 4.455894e-03
#> B
#> FT042 14.468908
#> FT039 12.375890
#> FT063 7.579018
#> FT044 2.355507
#> FT098 1.096718
Multiple covariates can be included to incorporate important sample and experiment information.
## Include covariates
colData(se)$covar <- c(rep(c("A", "B"), 6))
compareSamples(se, i = "knn_vsn", group = "sample_group",
               covariates = "covar", class1 = "WT", class2 = "KO",
               number = 5)
#> logFC CI.L CI.R AveExpr t P.Value adj.P.Val
#> FT042 3.153118 2.757256 3.548980 20.43317 17.376617 8.402759e-10 1.730968e-07
#> FT039 3.980802 3.396157 4.565447 19.87720 14.854108 5.006787e-09 5.156991e-07
#> FT063 2.217181 1.749823 2.684540 19.20748 10.349506 2.725672e-07 1.871628e-05
#> FT044 1.945845 1.236408 2.655281 18.89815 5.983604 6.682044e-05 3.441253e-03
#> FT098 2.575125 1.576222 3.574028 18.83681 5.623974 1.166301e-04 4.805159e-03
#> B
#> FT042 13.015891
#> FT039 11.311369
#> FT063 7.328577
#> FT044 1.659229
#> FT098 1.082477
For more flexible model specifications (e.g., interaction model, multi-level model), please use a standard workflow outlined in the limma package user’s guide.
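For reference, a minimal sketch of such a standard limma workflow on the normalized intensities, here with an assumed interaction between the genotype and the artificial covariate, could look like this:
library(limma)
## Standard limma workflow with an interaction design
design <- model.matrix(~ sample_group * covar,
                       data = as.data.frame(colData(se)))
lfit <- lmFit(assay(se, "knn_vsn"), design)
lfit <- eBayes(lfit)
## Test the interaction coefficient
topTable(lfit, coef = "sample_groupWT:covarB", number = 5)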
Colin A. Smith (2021). faahKO: Saghatelian et al. (2004) FAAH knockout LC/MS data. http://dx.doi.org/10.1021/bi0480335
Laurent Gatto, Johannes Rainer and Sebastian Gibb (2021). MsCoreUtils: Core Utils for Mass Spectrometry Data. https://github.com/RforMassSpectrometry/MsCoreUtils
Johannes Rainer (2022). MsFeatures: Functionality for Mass Spectrometry Features. https://github.com/RforMassSpectrometry/MsFeatures
Laurent Gatto and Christophe Vanderaa (2021). QFeatures: Quantitative features for mass spectrometry data. https://github.com/RforMassSpectrometry/QFeatures
Jan Lisec (2018). InterpretMSSpectrum: Interpreting High Resolution Mass Spectra. https://CRAN.R-project.org/package=InterpretMSSpectrum
Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47. https://bioconductor.org/packages/limma
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] patchwork_1.3.0 ggplot2_3.5.1
#> [3] pls_2.8-5 vsn_3.75.0
#> [5] qmtools_1.11.0 SummarizedExperiment_1.37.0
#> [7] Biobase_2.67.0 GenomicRanges_1.59.1
#> [9] GenomeInfoDb_1.43.2 IRanges_2.41.2
#> [11] S4Vectors_0.45.2 BiocGenerics_0.53.3
#> [13] generics_0.1.3 MatrixGenerics_1.19.0
#> [15] matrixStats_1.4.1 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] gridExtra_2.3 rlang_1.1.4 magrittr_2.0.3
#> [4] clue_0.3-66 e1071_1.7-16 compiler_4.4.2
#> [7] vctrs_0.6.5 reshape2_1.4.4 stringr_1.5.1
#> [10] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
#> [13] XVector_0.47.0 labeling_0.4.3 ca_0.71.1
#> [16] rmarkdown_2.29 UCSC.utils_1.3.0 preprocessCore_1.69.0
#> [19] purrr_1.0.2 xfun_0.49 zlibbioc_1.52.0
#> [22] cachem_1.1.0 jsonlite_1.8.9 DelayedArray_0.33.3
#> [25] cluster_2.1.8 R6_2.5.1 vcd_1.4-13
#> [28] bslib_0.8.0 stringi_1.8.4 RColorBrewer_1.1-3
#> [31] ranger_0.17.0 limma_3.63.2 boot_1.3-31
#> [34] car_3.1-3 lmtest_0.9-40 jquerylib_0.1.4
#> [37] Rcpp_1.0.13-1 assertthat_0.2.1 iterators_1.0.14
#> [40] knitr_1.49 zoo_1.8-12 igraph_2.1.2
#> [43] Matrix_1.7-1 nnet_7.3-19 tidyselect_1.2.1
#> [46] abind_1.4-8 yaml_2.3.10 viridis_0.6.5
#> [49] TSP_1.2-4 codetools_0.2-20 affy_1.85.0
#> [52] lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
#> [55] withr_3.0.2 evaluate_1.0.1 proxy_0.4-27
#> [58] heatmaply_1.5.0 pillar_1.10.0 affyio_1.77.1
#> [61] BiocManager_1.30.25 carData_3.0-5 VIM_6.2.2
#> [64] foreach_1.5.2 plotly_4.10.4 sp_2.1-4
#> [67] munsell_0.5.1 scales_1.3.0 laeken_0.5.3
#> [70] class_7.3-22 glue_1.8.0 lazyeval_0.2.2
#> [73] maketools_1.3.1 tools_4.4.2 dendextend_1.19.0
#> [76] robustbase_0.99-4-1 sys_3.4.3 data.table_1.16.4
#> [79] webshot_0.5.5 registry_0.5-1 buildtools_1.0.0
#> [82] grid_4.4.2 tidyr_1.3.1 seriation_1.5.7
#> [85] MsCoreUtils_1.19.0 colorspace_2.1-1 GenomeInfoDbData_1.2.13
#> [88] Formula_1.2-5 cli_3.6.3 S4Arrays_1.7.1
#> [91] viridisLite_0.4.2 dplyr_1.1.4 DEoptimR_1.1-3-1
#> [94] gtable_0.3.6 sass_0.4.9 digest_0.6.37
#> [97] SparseArray_1.7.2 htmlwidgets_1.6.4 farver_2.1.2
#> [100] htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7
#> [103] statmod_1.5.0 MASS_7.3-61