| Title: | A High-Performance R Package for Metabolomics Modeling and Analysis |
|---|---|
| Description: | Tools for 1D NMR metabolomics workflows, including import and preprocessing of Bruker experiments, multivariate modeling (PCA, PLS, OPLS) and model analytics and validation (y-permutations, cv-anova). Performance-critical routines are implemented in C++ and use the Armadillo and Eigen linear algebra libraries to improve runtime. |
| Authors: | Torben Kimhofer [aut, cre] (ORCID: <https://orcid.org/0000-0001-7158-9930>) |
| Maintainer: | Torben Kimhofer <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.1.7 |
| Built: | 2026-05-13 05:04:44 UTC |
| Source: | https://github.com/bioc/metabom8 |
Takes the out_df you showed (perm rows + one "non-permuted" row) and returns
observed values, permutation distributions, and permutation p-values.
.perm_test_from_table( out_df, observed_label = "non-permuted", alternative = c("greater", "less"), add_one = TRUE, na_rm = TRUE ).perm_test_from_table( out_df, observed_label = "non-permuted", alternative = c("greater", "less"), add_one = TRUE, na_rm = TRUE )
out_df |
data.frame with columns like q2_comp, r2_comp, aucs_te, aucs_tr, model (with one row "non-permuted"), and optionally r/r_abs. |
observed_label |
character. Label used in |
alternative |
character. "greater" (default) tests obs > perm; "less" tests obs < perm. |
add_one |
logical. If TRUE, uses (1 + count)/(B + 1) p-value correction. |
na_rm |
logical. Drop NA values before computing p-values. |
A list with:
observed: named numeric vector of observed metrics
perm: list of numeric vectors for each metric
p_value: named numeric vector of permutation p-values
B: number of permutations used
"m8_prep" attribute.
The step title is formatted as "note {username}".
The timestamp is stored in params, and the user message
is stored in notes.Add user note to metabom8 provenance
Appends a user annotation to the "m8_prep" attribute.
The step title is formatted as "note {username}".
The timestamp is stored in params, and the user message
is stored in notes.
add_note(x, note, params = NULL)add_note(x, note, params = NULL)
x |
A metabom8 object or matrix with |
note |
Character string describing the annotation. |
params |
Named list providing paramter key-value pairs. |
The input object with updated provenance metadata.
Other provenance:
get_provenance(),
print_provenance()
params <- list( runtime = "docker", image = Sys.getenv("IMAGE", "docker-image-dummy"), workflow = Sys.getenv("M8_WORKFLOW", "std_prof-urine"), agent = paste0("snakemake/", Sys.getenv("SNAKEMAKE_VERSION", "v?")), run_id = Sys.getenv("M8_RUN_ID", "m8-2605-001") ) data(hiit_raw) print_provenance(hiit_raw) hiit_proc <- hiit_raw |> calibrate(type = "tsp") |> excise() |> add_note('dilution-adaptive acquisition mode -> verify snr after normalisation', params) print_provenance(hiit_proc)params <- list( runtime = "docker", image = Sys.getenv("IMAGE", "docker-image-dummy"), workflow = Sys.getenv("M8_WORKFLOW", "std_prof-urine"), agent = paste0("snakemake/", Sys.getenv("SNAKEMAKE_VERSION", "v?")), run_id = Sys.getenv("M8_RUN_ID", "m8-2605-001") ) data(hiit_raw) print_provenance(hiit_raw) hiit_proc <- hiit_raw |> calibrate(type = "tsp") |> excise() |> add_note('dilution-adaptive acquisition mode -> verify snr after normalisation', params) print_provenance(hiit_proc)
Aligns spectra within a specified ppm window using cross-correlation. Only the selected region is aligned; the remainder of the spectra is unchanged.
align_segment( dat, shift, idx_ref = 1, med = TRUE, clim = 0.7, norm = FALSE, lag.max = 20 )align_segment( dat, shift, idx_ref = 1, med = TRUE, clim = 0.7, norm = FALSE, lag.max = 20 )
dat |
Named list with elements:
|
shift |
Numeric vector of length 2 specifying ppm region to align. |
idx_ref |
Integer. Row index to use as reference spectrum. |
med |
Logical. Use row-wise median spectrum as reference? |
clim |
Numeric. Minimum correlation threshold. |
norm |
Logical. Z-scale before alignment. |
lag.max |
Integer. Maximum lag allowed. |
Updated dat list with aligned spectra in selected region.
Other preprocessing:
align_spectra(),
binning(),
calibrate(),
correct_baseline(),
correct_lw(),
pqn(),
print_preprocessing()
data(hiit_raw) plot_spec(hiit_raw, shift=c(1.3,1.4)) hiit_aligned <- align_segment(hiit_raw, c(1.3, 1.35)) plot_spec(hiit_aligned, shift=c(1.3,1.4)) # aligned segment plot_spec(hiit_aligned, shift=c(3, 3.1)) # segment not aligneddata(hiit_raw) plot_spec(hiit_raw, shift=c(1.3,1.4)) hiit_aligned <- align_segment(hiit_raw, c(1.3, 1.35)) plot_spec(hiit_aligned, shift=c(1.3,1.4)) # aligned segment plot_spec(hiit_aligned, shift=c(3, 3.1)) # segment not aligned
Aligns 1D NMR spectra using signal-adaptive ppm intervals derived from the
cohort median spectrum. Intervals are constructed around median peak systems
and aligned locally via cross-correlation.
The function accepts either a metabom8-style dat list or plain matrix/vector with ppm.
align_spectra(x, ppm = NULL, half_win_ppm = 0.007, lag.max = 200)align_spectra(x, ppm = NULL, half_win_ppm = 0.007, lag.max = 200)
x |
A numeric matrix/vector of spectra or a named list with elements
|
ppm |
Numeric vector of chemical shift values (ppm). Ignored if |
half_win_ppm |
Numeric scalar controlling alignment interval width. |
lag.max |
Integer. Maximum allowed lag for cross-correlation alignment. |
Alignment intervals are derived from peaks in the cohort median spectrum.
The half_win_ppm parameter controls interval granularity:
Smaller values (e.g. 0.005–0.007 ppm) generate more, narrower intervals (RSPA-like behaviour).
Larger values (e.g. 0.008–0.015 ppm) merge nearby multiplets into broader regions (icoshift-like behaviour).
Typical 600–800 MHz H NMR data:
0.006–0.008 ppm: stable multiplet-level alignment
<0.005 ppm: may split J-coupled systems
>0.015 ppm: may merge unrelated resonances
Alignment parameters and interval definitions are recorded in
attr(X, "m8_prep").
If input is dat, returns updated dat. Otherwise returns aligned matrix.
Other preprocessing:
align_segment(),
binning(),
calibrate(),
correct_baseline(),
correct_lw(),
pqn(),
print_preprocessing()
data(hiit_raw) plot_spec(hiit_raw, shift=c(1.3,1.4)) hiit_aligns <- align_spectra(hiit_raw) plot_spec(hiit_aligns, shift=c(1.3,1.4)) # aligned segment plot_spec(hiit_aligns, shift=c(3, 3.1)) # aligned segmentdata(hiit_raw) plot_spec(hiit_raw, shift=c(1.3,1.4)) hiit_aligns <- align_spectra(hiit_raw) plot_spec(hiit_aligns, shift=c(1.3,1.4)) # aligned segment plot_spec(hiit_aligns, shift=c(3, 3.1)) # aligned segment
Balanced bootstrap resampling strategy
balanced_boot(k, split, type = c("DA", "R"), probs = NULL)balanced_boot(k, split, type = c("DA", "R"), probs = NULL)
k |
Integer. Number of bootstrap resamples. |
split |
Numeric. Fraction of samples drawn for the training set
(e.g. |
type |
Character. Either |
probs |
Numeric vector of quantile probabilities used to stratify continuous
|
Generates k bootstrap samples (training sets sampled with replacement).
The remaining samples (the out-of-bag set) can be used as a test set.
Balancing ensures equal representation of strata in the training data:
type = "DA"Class labels define the strata, and sampling is balanced across classes.
type = "R"The response is discretised into bins using quantiles defined by
probs, and each bin contributes equally to the training set.
A named list with elements:
List of integer vectors containing training set indices for each resampling iteration.
Character string indicating the resampling strategy.
Integer. Number of samples in the dataset.
Integer. Random seed used to generate the resampling splits, ensuring reproducibility.
Other resampling strategies:
balanced_mc(),
kfold(),
mc(),
stratified_kfold()
n <- 100 # bivariate outcome thr <- 1.5 Y <- c(rnorm(80, thr-3, 0.3), rnorm(20, thr+3, 0.3)) # unbalanced low/high outcome mean(Y>thr) cv_k <- kfold(k = 10) cv_boot <- balanced_boot(k = 10, split = 2/3, type = "R", probs = c(0, 0.8, 1)) k_inst <- metabom8:::.arg_check_cv(cv_pars=cv_k, model_type='R', n=n, Y_prepped=cbind(Y)) b_inst <- metabom8:::.arg_check_cv(cv_pars=cv_boot, model_type='R', n=n, Y_prepped=cbind(Y)) # balanced splits: proportion above global median stays ~0.5 q80 <- quantile(Y, 0.8) round(sapply(k_inst$train, function(i) mean(Y[i] > q80)), 2) # resembles original Y distr. round(sapply(b_inst$train, function(i) mean(Y[i] > q80)), 2) # balanced strata (low/high)n <- 100 # bivariate outcome thr <- 1.5 Y <- c(rnorm(80, thr-3, 0.3), rnorm(20, thr+3, 0.3)) # unbalanced low/high outcome mean(Y>thr) cv_k <- kfold(k = 10) cv_boot <- balanced_boot(k = 10, split = 2/3, type = "R", probs = c(0, 0.8, 1)) k_inst <- metabom8:::.arg_check_cv(cv_pars=cv_k, model_type='R', n=n, Y_prepped=cbind(Y)) b_inst <- metabom8:::.arg_check_cv(cv_pars=cv_boot, model_type='R', n=n, Y_prepped=cbind(Y)) # balanced splits: proportion above global median stays ~0.5 q80 <- quantile(Y, 0.8) round(sapply(k_inst$train, function(i) mean(Y[i] > q80)), 2) # resembles original Y distr. round(sapply(b_inst$train, function(i) mean(Y[i] > q80)), 2) # balanced strata (low/high)
Balanced Monte-Carlo resampling strategy
balanced_mc(k, split, type = c("DA", "R"), probs = NULL)balanced_mc(k, split, type = c("DA", "R"), probs = NULL)
k |
Integer. Number of repeated random splits. |
split |
Numeric. Fraction of samples assigned to the training set
(e.g. |
type |
Character. Either |
probs |
Numeric vector of quantile probabilities used to stratify
continuous |
Generates k Monte-Carlo resampling splits by randomly partitioning
the data into training and test sets without replacement.
Balancing ensures equal representation of strata in the training data:
type = "DA"Class labels define the strata, and sampling is balanced across classes.
type = "R"The response is discretised into bins using quantiles defined by
probs, and each bin contributes equally to the training set.
This strategy can improve robustness of model evaluation in settings with limited samples size and imbalanced or unevenly distributed outcome variables.
A named list with elements:
List of integer vectors containing training set indices for each resampling iteration.
Character string indicating the resampling strategy.
Integer. Number of samples in the dataset.
Integer. Random seed used to generate the resampling splits, ensuring reproducibility.
Other resampling strategies:
balanced_boot(),
kfold(),
mc(),
stratified_kfold()
n <- 100 # bivariate outcome thr <- 1.5 Y <- c(rnorm(80, thr-3, 0.3), rnorm(20, thr+3, 0.3)) # unbalanced low/high outcome mean(Y>thr) cv_k <- kfold(k = 10) cv_mc <- balanced_mc(k = 10, split = 2/3, type = "R", probs = c(0, 0.8, 1)) k_inst <- metabom8:::.arg_check_cv(cv_pars=cv_k, model_type='R', n=n, Y_prepped=cbind(Y)) mc_inst <- metabom8:::.arg_check_cv(cv_pars=cv_mc, model_type='R', n=n, Y_prepped=cbind(Y)) # balanced splits: proportion above global median stays ~0.5 q80 <- quantile(Y, 0.8) round(sapply(k_inst$train, function(i) mean(Y[i] > q80)), 2) # resembles original Y distr. round(sapply(mc_inst$train, function(i) mean(Y[i] > q80)), 2) # balanced strata (low/high)n <- 100 # bivariate outcome thr <- 1.5 Y <- c(rnorm(80, thr-3, 0.3), rnorm(20, thr+3, 0.3)) # unbalanced low/high outcome mean(Y>thr) cv_k <- kfold(k = 10) cv_mc <- balanced_mc(k = 10, split = 2/3, type = "R", probs = c(0, 0.8, 1)) k_inst <- metabom8:::.arg_check_cv(cv_pars=cv_k, model_type='R', n=n, Y_prepped=cbind(Y)) mc_inst <- metabom8:::.arg_check_cv(cv_pars=cv_mc, model_type='R', n=n, Y_prepped=cbind(Y)) # balanced splits: proportion above global median stays ~0.5 q80 <- quantile(Y, 0.8) round(sapply(k_inst$train, function(i) mean(Y[i] > q80)), 2) # resembles original Y distr. round(sapply(mc_inst$train, function(i) mean(Y[i] > q80)), 2) # balanced strata (low/high)
Equidistant binning of spectra by summarising intensities within ppm bins.
binning(X, ppm = NULL, width = NULL, npoints = NULL, fun = sum)binning(X, ppm = NULL, width = NULL, npoints = NULL, fun = sum)
X |
Numeric matrix or data frame with spectra in rows, or a named list as
returned by |
ppm |
Numeric vector of chemical shift positions (length must match
|
width |
Numeric. Bin size in ppm, or |
npoints |
Integer. Desired number of bins per spectrum, or |
fun |
Function. Summary function applied to intensities within each bin.
Must return a single numeric value (e.g. |
If present, preprocessing provenance is appended to attr(X, "m8_prep") using
.m8_stamp(). The ppm axis is updated in attr(X, "m8_axis")$ppm and
column names are set to the bin centres.
When width is specified, spectra are interpolated onto a regular ppm grid and
then aggregated within bins (interp = TRUE in provenance). When npoints
is specified, aggregation is performed by index bins on the original grid
(interp = FALSE in provenance).
Numeric matrix with spectra in rows and binned ppm variables in columns.
Other preprocessing:
align_segment(),
align_spectra(),
calibrate(),
correct_baseline(),
correct_lw(),
pqn(),
print_preprocessing()
set.seed(1) X <- matrix(rnorm(2 * 100), nrow = 2) ppm <- round(seq(10, 0.5, length.out = 100), 3) colnames(X) <- ppm Xb <- binning(X, ppm, width = 0.5) Xb_mean <- binning(X, ppm, width = 0.5, fun = mean) dim(Xb)set.seed(1) X <- matrix(rnorm(2 * 100), nrow = 2) ppm <- round(seq(10, 0.5, length.out = 100), 3) colnames(X) <- ppm Xb <- binning(X, ppm, width = 0.5) Xb_mean <- binning(X, ppm, width = 0.5, fun = mean) dim(Xb)
Aligns 1D H NMR spectra to a reference signal on the existing ppm grid.
Supports singlet (e.g. TSP or custom range) and predefined doublet references
(glucose, alanine).
calibrate(X, ppm = NULL, type = "tsp")calibrate(X, ppm = NULL, type = "tsp")
X |
Numeric matrix/vector of spectra or a metabom8 data list. |
ppm |
Numeric chemical shift vector. If |
type |
Character ( |
In addition to predefined references, custom calibration targets can be supplied as a list with elements:
mode: "singlet" or "doublet"
window: numeric vector of length 2 defining the ppm search region
centre: target ppm position (optional; defaults to mean(window))
j: numeric vector of length 2 specifying expected J-coupling
(required for doublet calibration)
Custom doublet calibration requires the expected J-coupling range (j)
in ppm to distinguish the two peaks of the multiplet.
For example:
calibrate(
X, ppm,
list(
mode = "doublet",
window = c(1.2, 1.35),
j = c(0.007, 0.009)
)
)
Calibrated spectra in the same structure as input.
Other preprocessing:
align_segment(),
align_spectra(),
binning(),
correct_baseline(),
correct_lw(),
pqn(),
print_preprocessing()
data("covid_raw") X=covid_raw$X ppm=covid_raw$ppm X_tsp <- calibrate(X, ppm, type = "tsp") X_glu <- calibrate(X, ppm, type = "glucose") X_custom <- calibrate(X, ppm, type = c(1.9, 2.1))data("covid_raw") X=covid_raw$X ppm=covid_raw$ppm X_tsp <- calibrate(X, ppm, type = "tsp") X_glu <- calibrate(X, ppm, type = "glucose") X_custom <- calibrate(X, ppm, type = c(1.9, 2.1))
Calculates Cliff's delta (Cd) as a non-parametric effect size for
comparing two numeric vectors. Cd quantifies the directional difference
between a reference (ref) and comparison (comp) distribution
based on pairwise comparisons of their values.
Cd ranges from -1 to 1, where:
-1: values in comp are consistently larger than those in ref
0: both distributions are similar, with no systematic difference
1: values in ref are consistently larger than those in comp
cliffs_d(ref, comp) es_cdelta(ref, comp)cliffs_d(ref, comp) es_cdelta(ref, comp)
ref |
Numeric vector representing the reference group. |
comp |
Numeric vector representing the comparator group. |
The effect size is calculated as the scaled difference in dominance between groups. Missing or infinite values are removed with a message. Synonyms are
A single numeric value: Cliff's delta effect size.
Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3), 494–509. doi:10.1037/0033-2909.114.3.494
ref <- rnorm(100, mean = 0) comp <- rnorm(100, mean = 1) hist(ref, col=rgb(0,0,1,0.3), breaks=50, xlim=c(-4,4)) hist(comp, col=rgb(1,0,0,0.3), breaks=50, add=TRUE) legend("topright", legend=c("ref","comp"), fill=c("blue","red")) cliffs_d(ref, comp) cliffs_d(comp, ref) comp1 <- rnorm(100, mean = 10) cliffs_d(ref, comp1) cliffs_d(ref, ref)ref <- rnorm(100, mean = 0) comp <- rnorm(100, mean = 1) hist(ref, col=rgb(0,0,1,0.3), breaks=50, xlim=c(-4,4)) hist(comp, col=rgb(1,0,0,0.3), breaks=50, add=TRUE) legend("topright", legend=c("ref","comp"), fill=c("blue","red")) cliffs_d(ref, comp) cliffs_d(comp, ref) comp1 <- rnorm(100, mean = 10) cliffs_d(ref, comp1) cliffs_d(ref, ref)
Applies baseline correction to each spectrum (row) of a spectral matrix.
Multiple correction algorithms are available and selected via the
method argument.
correct_baseline(X, method = c("asls", "linear"), ...) bline(X, ...)correct_baseline(X, method = c("asls", "linear"), ...) bline(X, ...)
X |
Numeric matrix or metabom8 |
method |
Character specifying the baseline correction algorithm. One of:
|
... |
Additional parameters passed to the selected method. Arguments for
Arguments for
|
Baseline correction is performed independently for each spectrum.
The "asls" method estimates a smooth baseline using asymmetric least
squares smoothing, which is well suited for spectra with positive peaks
such as NMR metabolomics data.
The "linear" method estimates a straight baseline from the edges of
each spectrum and subtracts the fitted trend.
The returned object includes a .stamp attribute recording the
baseline correction method and parameters used.
Numeric matrix containing baseline-corrected spectra with the
same dimensions as X. If X is a metabom8 dat object, the corrected
matrix replaces the X component while preserving associated metadata.
Eilers PHC, Boelens HFM (2005). Baseline correction with asymmetric least squares smoothing.
Other preprocessing:
align_segment(),
align_spectra(),
binning(),
calibrate(),
correct_lw(),
pqn(),
print_preprocessing()
data(hiit_raw) plot_spec(hiit_raw$X[1,], hiit_raw$ppm, shift=c(3.1,4), backend='base') hiit_proc <- hiit_raw |> excise() |> correct_baseline() plot_spec(hiit_proc$X[1,], hiit_proc$ppm, shift=c(3.1,4), backend='base', add=TRUE, col='red')data(hiit_raw) plot_spec(hiit_raw$X[1,], hiit_raw$ppm, shift=c(3.1,4), backend='base') hiit_proc <- hiit_raw |> excise() |> correct_baseline() plot_spec(hiit_proc$X[1,], hiit_proc$ppm, shift=c(3.1,4), backend='base', add=TRUE, col='red')
Applies a multiplicative correction to each spectrum to compensate for linewidth-induced peak height variation, using a reference peak region (e.g. TSP). The correction assumes an empirical power-law relationship between peak height and linewidth:
correct_lw( x, lw_ref = 0.8, beta = 0.6, shift = c(-0.1, 0.1), ppm = NULL, sf = NULL, estimate_beta = TRUE, beta_min_n = 6, only_if_improves = TRUE, notes = NULL )correct_lw( x, lw_ref = 0.8, beta = 0.6, shift = c(-0.1, 0.1), ppm = NULL, sf = NULL, estimate_beta = TRUE, beta_min_n = 6, only_if_improves = TRUE, notes = NULL )
x |
A numeric matrix (samples x variables) or a named list containing
|
lw_ref |
Numeric reference linewidth (FWHM) to which spectra are
normalized. If |
beta |
Numeric exponent governing the linewidth-to-height relationship.
Ignored if |
shift |
Numeric vector of length 2 specifying the ppm region used for
linewidth estimation and peak height measurement (e.g. |
ppm |
Optional numeric ppm axis. If |
sf |
Optional spectrometer frequency (MHz). If |
estimate_beta |
Logical; if |
beta_min_n |
Minimum number of valid samples required to estimate
|
only_if_improves |
Logical; if |
notes |
Optional character string appended to the preprocessing log. |
Spectra are scaled such that all samples are adjusted to a common reference
linewidth lw_ref:
where FWHM is estimated per spectrum within the specified shift
region (typically containing an internal reference signal).
If estimate_beta = TRUE, the exponent beta is estimated from a
log-log regression between peak height (max within shift) and FWHM
across samples. Otherwise, the user-supplied beta is used.
The correction is only applied if it reduces the coefficient of variation (CV)
of the reference peak height across samples when only_if_improves = TRUE.
Input may be either:
a numeric matrix X (samples x variables), with ppm axis supplied
via attr(X, "m8_axis")$ppm or numeric colnames(X), or
a named list with elements X, ppm, and optionally meta.
Attributes "m8_prep", "m8_axis", and "m8_meta" are
preserved and the applied correction is appended to the preprocessing log
via .m8_stamp().
This correction removes systematic peak height variation due to spectral
broadening (e.g. shimming differences) under the assumption that the internal
reference signal has constant concentration across samples. The exponent
beta reflects the empirical sensitivity of peak height to linewidth
in the current acquisition and processing regime.
This adjustment improves comparability of signal intensities across spectra by reducing linewidth-induced amplitude bias, thereby enhancing the detectability of small effects in downstream multivariate analyses (e.g. PLS-type models).
An object of the same type as input:
If input is a matrix, returns a corrected matrix with attributes preserved and updated.
If input is a list, returns the same list with corrected $X.
Other preprocessing:
align_segment(),
align_spectra(),
binning(),
calibrate(),
correct_baseline(),
pqn(),
print_preprocessing()
data(covid_raw) # matrix input with ppm in m8_axis Xcor <- correct_lw(covid_raw, shift = c(-0.1, 0.1))data(covid_raw) # matrix input with ppm in m8_axis Xcor <- correct_lw(covid_raw, shift = c(-0.1, 0.1))
1D proton NMR spectra from SARS-CoV-2–positive patients (n = 10) and healthy controls (n = 13), collected in Perth, Western Australia. Spectra were pre-processed (residual water and signal-free regions excised, baseline corrected, and normalized to account for line-width differences).
A numeric matrix/data frame with 23 rows (samples) and 27,819 columns (chemical shift variables in parts per million, ppm).
FIDs were acquired with a standard 90 RF pulse sequence on a 600 MHz Bruker Avance II
spectrometer using IVDr methods for blood plasma (300 K, 32 scans). The spectrometer was equipped
with a double-resonance broadband (BBI) probe and a refrigerated autosampler.
Australian National Phenome Centre (ANPC), Murdoch University.
Kimhofer et al. (2020) doi:10.1021/acs.jproteome.0c00519
data(covid)data(covid)
1D proton NMR spectra from SARS-CoV-2–positive patients (n = 10) and healthy controls (n = 13),
collected in a research study in Perth, Western Australia. Spectra are raw and require processing
before statistical analysis (see ?covid for processed spectra).
A numeric matrix/data frame with 23 rows and 29,782 columns:
Spectra (samples)
Chemical shift variables in parts per million (ppm)
FIDs were acquired using a Carr–Purcell–Meiboom–Gill (CPMG) pulse sequence on a 600 MHz Bruker
Avance II spectrometer using IVDr methods for blood plasma (300 K, 32 scans). The spectrometer was
equipped with a double-resonance broadband (BBI) probe and a refrigerated autosampler at
4C.
Australian National Phenome Centre (ANPC), Murdoch University.
Kimhofer et al. (2020) doi:10.1021/acs.jproteome.0c00519
data(covid_raw)data(covid_raw)
Performs a cross-validated ANOVA (CV-ANOVA) test for OPLS models. The function compares residuals from a null model and a model using cross-validated predictive scores to assess the significance of the OPLS model.
cv_anova(smod)cv_anova(smod)
smod |
An object of class |
Interpretation of the CV-ANOVA table
The CV-ANOVA compares two nested linear models:
Null model:
Full model:
where represents the cross-validated predictive component
score(s) obtained from the OPLS model.
The ANOVA table contains:
SS – Sum of Squares
DF – Degrees of Freedom
MS – Mean Squares (SS / DF)
F_value – F statistic comparing model vs. null
p_value – P-value from the F-test
The Regression row quantifies the reduction in residual variance achieved by including the cross-validated predictive score(s). The Residual row represents the unexplained variance of the full model.
The F-statistic is computed as:
where and are the residual sums of squares of the
null and full models, respectively.
Predictive interpretation
A small p-value (typically < 0.05) indicates that the cross-validated
predictive score(s) significantly reduce residual variance compared to an
intercept-only model. This suggests that the OPLS model captures
statistically meaningful predictive structure in .
A large p-value indicates that the predictive component does not explain
significantly better than chance, implying weak or unstable
predictive performance.
Importantly, CV-ANOVA evaluates the linear explanatory power of the
cross-validated predictive scores, not the descriptive separation of the
latent space. A model may show visual class separation or moderate R
yet fail CV-ANOVA if predictive performance is weak.
Sample size strongly affects statistical power. With small , even
models with moderate predictive strength may not reach statistical
significance.
CV-ANOVA is intended for continuous response (regression) models.
A data.frame containing:
SS - Sum of Squares
DF - Degrees of Freedom
MS - Mean Squares
F_value - F statistic
p_value - P-value from the F-test
Eriksson, L., et al. (2008). CV-ANOVA for significance testing of PLS and OPLS models. Journal of Chemometrics, 22(11-12), 594–600.
Other model_validation:
dmodx(),
opls_perm()
data("covid") X <- covid$X Y <- as.numeric(factor(covid$an$type)) - 1 scaling <- uv_scaling(center=TRUE) cv <- balanced_mc(10, split=2/3, type='R', probs = c(0, 0.5, 1)) mod <- opls(X, Y, scaling, cv) cv_anova(mod)data("covid") X <- covid$X Y <- as.numeric(factor(covid$an$type)) - 1 scaling <- uv_scaling(center=TRUE) cv <- balanced_mc(10, split=2/3, type='R', probs = c(0, 0.5, 1)) mod <- opls(X, Y, scaling, cv) cv_anova(mod)
Calculates the orthogonal distance of each observation to an OPLS model in X-space. DModX can be used for identifying outliers.
dmodx(mod, plot = TRUE)dmodx(mod, plot = TRUE)
mod |
An OPLS model object of class |
plot |
Logical. If TRUE, a plot of DModX values with an approximate cutoff is shown. |
DModX is computed from the X-residual matrix as a scaled RMSE of residuals. The empirical cutoff (dashed line in plot) uses a t-based confidence interval and assumes approximate normality of DModX values. This assumption may not be satisfied in all datasets, so the resulting threshold should be regarded as a pragmatic heuristic for outlier detection.
A data frame with columns ID, DmodX, and passed.
Other model_validation:
cv_anova(),
opls_perm()
data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) dX <- dmodx(model) print(dX[[1]]) df <- dX[[2]]; head(df)data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) dX <- dmodx(model) print(dX[[1]]) df <- dX[[2]]; head(df)
Generates coordinates of a two-dimensional ellipse corresponding to a Hotelling T^2 region projected onto selected dimensions.
ellipse2d(obj, dims = c(1, 2), npoints = 100)ellipse2d(obj, dims = c(1, 2), npoints = 100)
obj |
A |
dims |
Integer vector of length 2 specifying which dimensions to project
onto. Default is |
npoints |
Integer. Number of points used to approximate the ellipse. Default is 100. |
The ellipse is obtained by projecting the multivariate T^2 ellipsoid onto the specified dimensions. The scaling is derived from the Hotelling T^2 statistic and accounts for sample size and dimensionality.
A data.frame with columns x and y containing
ellipse coordinates.
set.seed(1) X <- cbind(rnorm(100), rnorm(100) + 0.5) t2 <- hotellingsT2(X) ell <- ellipse2d(t2) plot(X[,1], X[,2], asp = 1, pch = 16, xlab = "Score 1", ylab = "Score 2") lines(ell$x, ell$y, col = "red", lwd = 2)set.seed(1) X <- cbind(rnorm(100), rnorm(100) + 0.5) t2 <- hotellingsT2(X) ell <- ellipse2d(t2) plot(X[,1], X[,2], asp = 1, pch = 16, xlab = "Score 1", ylab = "Score 2") lines(ell$x, ell$y, col = "red", lwd = 2)
Removes specified chemical shift regions from 1D H NMR spectra.
By default, commonly excluded metabolomics regions are removed
(upfield/downfield noise, water, urea).
excise(x, ppm = NULL, regions = NULL)excise(x, ppm = NULL, regions = NULL)
x |
Numeric matrix or vector. Spectra in rows and variables in columns. |
ppm |
Numeric vector of chemical shift positions (ppm).
If omitted, |
regions |
Named list of numeric vectors (length 2), specifying ppm regions to remove. Each element must define lower and upper bounds. |
Default regions removed (ppm):
Upfield noise: [min(ppm), 0.25]
Residual water: [4.5, 5.2]
Urea region: [5.5, 6.0]
Downfield noise: [9.7, max(ppm)]
Removed regions are recorded in attr(X, "m8_prep") if present.
Updated ppm values are stored in column names and in
attr(X, "m8_axis")$ppm.
Numeric matrix with specified chemical shift regions removed.
set.seed(1) ppm <- seq(0, 10, length.out = 1000) X <- matrix(rnorm(100 * length(ppm)), nrow = 100) Xe <- excise(X, ppm) dim(Xe) names(attributes(Xe))set.seed(1) ppm <- seq(0, 10, length.out = 1000) X <- matrix(rnorm(100 * length(ppm)), nrow = 100) Xe <- excise(X, ppm) dim(Xe) names(attributes(Xe))
Extract fitted Y values
fitted(object, ...) ## S4 method for signature 'm8_model' fitted(object)fitted(object, ...) ## S4 method for signature 'm8_model' fitted(object)
object |
An object of class |
... |
Additional arguments (currently ignored). |
Numeric vector or matrix containing the fitted response values.
data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) show(model) Y_hat_dummy <- fitted(model)data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) show(model) Y_hat_dummy <- fitted(model)
Returns the indices of the chemical shift vector (ppm) that fall
within the specified range.
get_idx(range = c(1, 5), ppm) get.idx(range = c(1, 5), ppm)get_idx(range = c(1, 5), ppm) get.idx(range = c(1, 5), ppm)
range |
Numeric vector of length 2 specifying lower and upper bounds (order does not matter). |
ppm |
Numeric vector. The full chemical shift axis (in ppm). |
Integer vector of indices corresponding to ppm values within the given range.
Other NMR:
.dynamicIntervalsMedian()
data(covid_raw) X <- covid_raw$X ppm <- covid_raw$ppm idx_tsp <- get_idx(c(-0.1, 0.1), ppm) ppm[range(idx_tsp)] plot(ppm[idx_tsp], X[1, idx_tsp], type = 'l')data(covid_raw) X <- covid_raw$X ppm <- covid_raw$ppm idx_tsp <- get_idx(c(-0.1, 0.1), ppm) ppm[range(idx_tsp)] plot(ppm[idx_tsp], X[1, idx_tsp], type = 'l')
Extracts preprocessing provenance stored in the "m8_prep" attribute.
Allows access to the full processing log, a specific step, or a specific
parameter within a step.
get_provenance(x, step = NULL, param = NULL)get_provenance(x, step = NULL, param = NULL)
x |
A metabom8 object (named list with element |
step |
Optional. Either:
If |
param |
Optional character string specifying a parameter name within the selected step. If provided, only this parameter value is returned. |
Provenance metadata are recorded automatically by metabom8 preprocessing functions and stored as structured attributes on the spectral matrix. This function provides programmatic access to these records.
Depending on the arguments:
Full provenance list (if step = NULL)
A single preprocessing step (if step specified)
A single parameter value (if param specified)
Other provenance:
add_note(),
print_provenance()
data(hiit_raw) hiit_proc <- hiit_raw |> calibrate(type = "tsp") |> excise() # Retrieve full log log <- get_provenance(hiit_proc) # Retrieve specific step get_provenance(hiit_proc, step = 2) # Retrieve parameter from a named step get_provenance(hiit_proc, step = "calibrate", param = "target")data(hiit_raw) hiit_proc <- hiit_raw |> calibrate(type = "tsp") |> excise() # Retrieve full log log <- get_provenance(hiit_proc) # Retrieve specific step get_provenance(hiit_proc, step = 2) # Retrieve parameter from a named step get_provenance(hiit_proc, step = "calibrate", param = "target")
Urine samples collected from a single individual performing a -type
exercise protocol over a time period of 3h.
A list with four elements:
Numeric matrix of spectral intensities (samples x variables).
Numeric vector of chemical shift values corresponding to columns of X.
Data frame containing acquisition metadata.
Data frame containing sample annotations.
The spectra were acquired on a 600 MHz Bruker NMR spectrometer. The dataset is included for demonstration of preprocessing and modelling workflows in metabom8.
Example dataset bundled with the package.
data(hiit_raw)data(hiit_raw)
Computes the Hotelling T^2 statistic defining a multivariate confidence region for a set of observations. The region corresponds to an ellipsoid in p-dimensional space and is commonly visualised as an ellipse in two-dimensional (OPLS) score plots.
hotellingsT2(X, alpha = 0.95)hotellingsT2(X, alpha = 0.95)
X |
Numeric matrix with observations in rows and variables (dimensions) in columns. |
alpha |
Numeric scalar. Confidence level for the T^2 region. Default is 0.95. |
The Hotelling T^2 region is defined as
where .
A named list with elements:
Numeric vector of column means.
Sample covariance matrix.
Squared radius of the Hotelling T^2 region.
set.seed(1) X <- matrix(rnorm(200), ncol = 2) t2 <- hotellingsT2(X) ell <- ellipse2d(t2) plot(X, asp = 1) lines(ell$x, ell$y, col = "red", lwd = 2)set.seed(1) X <- matrix(rnorm(200), ncol = 2) t2 <- hotellingsT2(X) ell <- ellipse2d(t2) plot(X, asp = 1) lines(ell$x, ell$y, col = "red", lwd = 2)
K-fold cross-validation strategy
kfold(k)kfold(k)
k |
Integer number of folds. |
Partitions the data into k folds. Each fold is used once as a test set,
with the remaining folds used for training.
No stratification is applied; folds are created by random partitioning.
A named list with elements:
List of integer vectors containing training set indices for each resampling iteration.
Character string indicating the resampling strategy.
Integer. Number of samples in the dataset.
Integer. Random seed used to generate the resampling splits, ensuring reproducibility.
Other resampling strategies:
balanced_boot(),
balanced_mc(),
mc(),
stratified_kfold()
n <- 100 thr <- 1.5 Y <- c(rnorm(80, thr - 3, 0.3), rnorm(20, thr + 3, 0.3)) # unbalanced outcome mean(Y > thr) cv_k <- kfold(k = 10) k_inst <- metabom8:::.arg_check_cv(cv_pars=cv_k, model_type='R', n=n, Y_prepped=cbind(Y)) sapply(k_inst$train, function(i) length(i))n <- 100 thr <- 1.5 Y <- c(rnorm(80, thr - 3, 0.3), rnorm(20, thr + 3, 0.3)) # unbalanced outcome mean(Y > thr) cv_k <- kfold(k = 10) k_inst <- metabom8:::.arg_check_cv(cv_pars=cv_k, model_type='R', n=n, Y_prepped=cbind(Y)) sapply(k_inst$train, function(i) length(i))
Returns a named character vector describing the preprocessing operations implemented in metabom8. For more information on individual functionalities please refer to the function help pages.
list_preprocessing()list_preprocessing()
A named character vector with preprocessing function identifiers as names and short descriptions as values.
list_preprocessing()list_preprocessing()
Model loadings
## S4 method for signature 'm8_model' loadings(x, orth = FALSE, ...)## S4 method for signature 'm8_model' loadings(x, orth = FALSE, ...)
x |
An object of class |
orth |
Logical indicating whether orthogonal scores should be returned (only applicable for OPLS models). |
... |
Additional arguments (currently ignored). |
Numeric vector or matrix containing loadings.
data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) show(model) P <- loadings(model) Po <- loadings(model, orth = TRUE) dim(P) dim(Po) == dim(P)data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) show(model) P <- loadings(model) Po <- loadings(model, orth = TRUE) dim(P) dim(Po) == dim(P)
Estimates the full width at half maximum (FWHM; line width) of a singlet-like peak within a specified chemical-shift range for each spectrum.
lw(X, ppm = NULL, shift = c(-0.1, 0.1), sf)lw(X, ppm = NULL, shift = c(-0.1, 0.1), sf)
X |
Numeric matrix (spectra in rows) or a named list as returned by
|
ppm |
Numeric vector of chemical shift values (ppm) corresponding to columns of
|
shift |
Numeric vector of length 2. Chemical shift range containing the peak
(e.g., |
sf |
Spectrometer frequency in MHz. Either a single numeric (recycled across spectra)
or a numeric vector of length |
For each spectrum, the function:
extracts the region defined by shift,
finds the peak apex within that region,
computes the half-height level relative to the local baseline (minimum in the window),
estimates the left and right half-height crossing points by linear interpolation,
converts the width from ppm to Hz using sf (MHz), i.e. Hz = ppm * sf.
If no valid half-height crossings can be found (e.g., very low SNR or truncated peak),
NA is returned for that spectrum.
Numeric vector of FWHM values in Hz (length nrow(X)).
The ppm axis may be increasing or decreasing; FWHM is computed as an absolute width and is therefore independent of axis direction.
# Simulated NMR peaks with different linewidths ppm <- seq(-0.2, 0.2, length.out = 1000) # generate peaks with increasing width sds <- seq(0.01, 0.03, length.out = 10) X <- t(sapply(sds, function(s) dnorm(ppm, mean = 0, sd = s) )) sf <- 600 # spectrometer frequency in MHz fwhm_vals <- lw(X, ppm = ppm, shift = c(-0.1, 0.1), sf = sf) plot(sds, fwhm_vals, xlab = "Gaussian sd", ylab = "Estimated FWHM (Hz)", pch = 16)# Simulated NMR peaks with different linewidths ppm <- seq(-0.2, 0.2, length.out = 1000) # generate peaks with increasing width sds <- seq(0.01, 0.03, length.out = 10) X <- t(sapply(sds, function(s) dnorm(ppm, mean = 0, sd = s) )) sf <- 600 # spectrometer frequency in MHz fwhm_vals <- lw(X, ppm = ppm, shift = c(-0.1, 0.1), sf = sf) plot(sds, fwhm_vals, xlab = "Gaussian sd", ylab = "Estimated FWHM (Hz)", pch = 16)
pca(), pls(), and opls().m8_model class
Model object returned by pca(), pls(), and opls().
## S4 method for signature 'm8_model' summary(object) ## S4 method for signature 'm8_model' show(object)## S4 method for signature 'm8_model' summary(object) ## S4 method for signature 'm8_model' show(object)
object |
An object of class |
An object of class m8_model.
summary(m8_model): Summarise model performance and component selection.
show(m8_model): Show a compact model header.
engineCharacter. Model engine ("pca", "pls", "opls").
ctrlList. Engine-specific control and performance information.
fitList. Fitted data (engine specific).
cvResampling instance (may be NULL if not used).
prepScaling and centering information
provenancePreprocessing attributes of spectral matrix X
sessionR-session information
callFunction call
dimsList with n and p.
data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) class(model) show(model)data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) class(model) show(model)
Monte-Carlo cross-validation strategy
mc(k, split)mc(k, split)
k |
Integer. Number of repeated random splits. |
split |
Numeric. Fraction of samples assigned to the
training set (e.g. |
Monte-Carlo cross-validation generates k random train/test splits
without replacement. No stratification is applied; samples are drawn
uniformly at random.
A named list with elements:
List of integer vectors containing training set indices for each resampling iteration.
Character string indicating the resampling strategy.
Integer. Number of samples in the dataset.
Integer. Random seed used to generate the resampling splits, ensuring reproducibility.
Other resampling strategies:
balanced_boot(),
balanced_mc(),
kfold(),
stratified_kfold()
n <- 100 # bivariate outcome thr <- 1.5 Y <- c(rnorm(80, thr-3, 0.3), rnorm(20, thr+3, 0.3)) # unbalanced low/high outcome mean(Y>thr) cv_mc <- mc(k = 10, split = 2/3) mc_inst <- metabom8:::.arg_check_cv(cv_pars=cv_mc, model_type='R', n=n, Y_prepped=cbind(Y)) sapply(mc_inst$train, function(i) length(i))n <- 100 # bivariate outcome thr <- 1.5 Y <- c(rnorm(80, thr-3, 0.3), rnorm(20, thr+3, 0.3)) # unbalanced low/high outcome mean(Y>thr) cv_mc <- mc(k = 10, split = 2/3) mc_inst <- metabom8:::.arg_check_cv(cv_pars=cv_mc, model_type='R', n=n, Y_prepped=cbind(Y)) sapply(mc_inst$train, function(i) length(i))
metabom8 (pronounced metabo-mate) provides pipelines for 1D NMR data import, preprocessing, multivariate modeling (PCA, OPLS), metabolite identification, and visualisation. Core functions are accelerated in C++ via Rcpp, RcppArmadillo, and RcppEigen for improved computational performance.
Import, preprocessing, and analysis of 1D NMR spectra.
Principal Components Analysis (PCA) with back-scaled loadings (projected back to the spectral domain and visualized as spectra).
Orthogonal Partial Least Squares (OPLS) fitted iteratively via the NIPALS algorithm, with an automatic stopping criterion to determine the optimal number of components.
Automatic model selection using cross-validated performance (, ,
and cross-validated AUC for classification), with safeguards against overfitting.
Robust statistical validation for small to large sample sizes: stratified Monte Carlo cross-validation and k-fold CV.
Model diagnostics and validation (DModX, permutation testing).
Metabolite identification via STOCSY and STORM.
Native C++ acceleration via RcppArmadillo and RcppEigen.
Getting Started: vignette("Getting Started")
Maintainer: Torben Kimhofer [email protected] (ORCID)
Useful links:
Scales a numeric vector to the range using min-max normalization.
This is a special case of scRange.
minmax(x, na.rm = FALSE)minmax(x, na.rm = FALSE)
x |
Numeric vector. Input values to be scaled. |
na.rm |
Logical; if |
The scaled values are computed as:
Equivalent to scRange(x, ra = c(0, 1)).
A numeric vector of the same length as x, scaled to the range .
scRange() for flexible output ranges.
x <- rnorm(20) plot(x, type = 'l'); abline(h = range(x), lty = 2) points(minmax(x), type = 'l', col = 'blue') abline(h = c(0, 1), col = 'blue', lty = 2)x <- rnorm(20) plot(x, type = 'l'); abline(h = range(x), lty = 2) points(minmax(x), type = 'l', col = 'blue') abline(h = c(0, 1), col = 'blue', lty = 2)
Estimates the noise standard deviation () for each spectrum from a signal-free ppm region.
The default estimator is robust (MAD-based) and suitable for signal-to-noise calculations.
noise_sd( X, ppm = NULL, where = c(14.6, 14.7), method = c("mad", "sd", "p95"), baseline_correct = FALSE, lambda = 10000, min_points = 50L, ns = NULL, normalise_scans = FALSE )noise_sd( X, ppm = NULL, where = c(14.6, 14.7), method = c("mad", "sd", "p95"), baseline_correct = FALSE, lambda = 10000, min_points = 50L, ns = NULL, normalise_scans = FALSE )
X |
Numeric matrix (spectra in rows), numeric vector (single spectrum), or a named list
as returned by |
ppm |
Numeric vector of chemical shift values (ppm) corresponding to columns of
|
where |
Numeric vector of length 2. ppm range used for noise estimation.
Should be free of metabolite signals (default |
method |
Character. Noise estimator: |
baseline_correct |
Logical. If |
lambda |
Numeric. Smoothing parameter for |
min_points |
Integer. Minimum number of points required in the noise window.
May be a single value (recycled across spectra) or a numeric vector of
length |
ns |
Number of scans / transients (required if normalise_scans=TRUE) |
normalise_scans |
Logical. If |
In NMR spectroscopy, noise scales predictably with the number of scans ().
For otherwise identical acquisition settings:
Signal increases approximately proportional to .
Noise increases approximately proportional to .
Consequently, signal-to-noise ratio (SNR) increases proportional to .
Equivalently, the noise standard deviation scales as:
assuming a fixed underlying signal scale and comparable acquisition conditions.
To compare noise levels across datasets acquired with different numbers of scans, a scan-normalised noise estimate may be used:
Under stable receiver gain and processing conditions, this normalised noise should be approximately constant across runs.
Numeric vector of noise estimates (length nrow(X)).
data(hiit_raw) X <- hiit_raw$X ppm <- hiit_raw$ppm sigma <- noise_sd(X, ppm, where = c(10,11)) plot(hiit_raw$meta$a_NS, sigma, xlab = "Number of scans (NS)", ylab = expression(sigma~"(noise estimate)"), pch = 16) lines(lowess(hiit_raw$meta$a_NS, sigma), col = "red", lwd = 2)data(hiit_raw) X <- hiit_raw$X ppm <- hiit_raw$ppm sigma <- noise_sd(X, ppm, where = c(10,11)) plot(hiit_raw$meta$a_NS, sigma, xlab = "Number of scans (NS)", ylab = expression(sigma~"(noise estimate)"), pch = 16) lines(lowess(hiit_raw$meta$a_NS, sigma), col = "red", lwd = 2)
Normalises 1D NMR spectra using an ERETIC reference signal. By default
the ERETIC position is discovered in a search window and the spectra are
normalised by the integral in a narrow window around the detected position.
Power users can supply pos to enforce a fixed ERETIC position.
norm_eretic( X, integr = FALSE, ppm = NULL, pos = NULL, search = c(10, 16), width = 0.2, warn_absent = TRUE, noise_win = c(9.8, 10.2), snr_factor = 10 )norm_eretic( X, integr = FALSE, ppm = NULL, pos = NULL, search = c(10, 16), width = 0.2, warn_absent = TRUE, noise_win = c(9.8, 10.2), snr_factor = 10 )
X |
Numeric matrix or vector. NMR spectra with spectra in rows.
If |
integr |
Logical. If |
ppm |
Numeric vector of chemical shift positions. If |
pos |
Numeric scalar or |
search |
Numeric vector of length 2. Ppm window used to discover ERETIC when |
width |
Numeric scalar. Full integration window width in ppm (default 0.2 gives |
warn_absent |
Logical. If |
noise_win |
Numeric vector of length 2. Ppm window for estimating noise intensity. |
snr_factor |
Numeric scaler. Minimum Signal-to-Noise Ratio required for detecting the ERETIC peak. |
Binning/normalisation history is recorded in attr(X, "m8_prep") if present.
Ppm axis values are stored in attr(X, "m8_axis")$ppm.
If integr = TRUE, numeric vector of ERETIC integrals.
If integr = FALSE, numeric matrix of normalised spectra.
set.seed(123) n <- 1000 ppm <- seq(0, 14, length.out = n) gauss <- function(x, c, h, w = 0.05) { h * exp(-((x - c)^2) / (2 * w^2)) } heights <- c(100, 80, 60, 40, 20) spectra <- sapply(heights, function(h) rnorm(n, 0, 0.01) + gauss(ppm, 12, h) ) spectra <- t(spectra) # 5 spectra × 1000 variables plot_spec(spectra, ppm, shift = c(11, 13)) X_norm <- norm_eretic(spectra, ppm=ppm) plot_spec(X_norm, ppm, shift=c(11, 13))set.seed(123) n <- 1000 ppm <- seq(0, 14, length.out = n) gauss <- function(x, c, h, w = 0.05) { h * exp(-((x - c)^2) / (2 * w^2)) } heights <- c(100, 80, 60, 40, 20) spectra <- sapply(heights, function(h) rnorm(n, 0, 0.01) + gauss(ppm, 12, h) ) spectra <- t(spectra) # 5 spectra × 1000 variables plot_spec(spectra, ppm, shift = c(11, 13)) X_norm <- norm_eretic(spectra, ppm=ppm) plot_spec(X_norm, ppm, shift=c(11, 13))
Fits a supervised Orthogonal Partial Least Squares (O-PLS) model using a NIPALS-based algorithm with optional cross-validation and automatic component selection.
opls(X, Y, scaling, validation_strategy)opls(X, Y, scaling, validation_strategy)
X |
Numeric matrix of predictors (rows = samples, columns = variables). |
Y |
Numeric matrix or factor vector of responses. |
scaling |
A scaling strategy object (e.g., |
validation_strategy |
A cross-validation strategy object defining how resampling is performed (e.g., k-fold, Monte Carlo). |
O-PLS decomposes the predictor matrix into:
One predictive component capturing variation correlated with Y
Orthogonal components capturing structured variation in X
unrelated to Y
Predictive and orthogonal components are estimated sequentially.
Cross-validated performance metrics (e.g., Q², R², classification AUC)
are computed for each model configuration according to the supplied
validation_strategy.
The model extracts a single predictive component and iteratively
adds orthogonal components until the stopRule indicates
overfitting or maxPCo is reached.
Scaling specified via scaling is applied internally during model
fitting and does not alter the input matrix X. Spectral preprocessing
(e.g., alignment or baseline correction) should be performed prior to
model fitting.
The returned model object stores:
Predictive and orthogonal component models
Cross-validation results
Performance metrics (R², Q², AUC)
Model control parameters
Input data provenance metadata
Session information for reproducibility
An object of class m8_model containing the fitted O-PLS model,
cross-validation results, and performance statistics.
data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) show(model) summary(model) # scores Tp <- scores(model) To <- scores(model, orth=TRUE) t2 <- hotellingsT2(cbind(Tp, To)) ell <-ellipse2d(t2) plot(Tp, To, asp = 1, col = as.factor(covid$an$type), xlim = range(c(Tp, ell$x)), ylim = range(c(To, ell$y)) ) lines(ell$x, ell$y, col = "grey", lty=2) # loadings & vip's Pp <- loadings(model) Po <- loadings(model, orth=TRUE) vips <- vip(model) x=covid$ppm y = Pp * apply(covid$X, 2, sd) palette <- colorRampPalette(c("blue", "cyan", "yellow", "red"))(100) idx <- cut(vips, breaks = 100, labels = FALSE) plot(x, y, type = "n", xlim = rev(range(x)), xlab='ppm', ylab='t_pred_sc') for (i in seq_len(length(x) - 1)) { segments(x[i], y[i], x[i+1], y[i+1], col = palette[idx[i]], lwd = 2) }data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) show(model) summary(model) # scores Tp <- scores(model) To <- scores(model, orth=TRUE) t2 <- hotellingsT2(cbind(Tp, To)) ell <-ellipse2d(t2) plot(Tp, To, asp = 1, col = as.factor(covid$an$type), xlim = range(c(Tp, ell$x)), ylim = range(c(To, ell$y)) ) lines(ell$x, ell$y, col = "grey", lty=2) # loadings & vip's Pp <- loadings(model) Po <- loadings(model, orth=TRUE) vips <- vip(model) x=covid$ppm y = Pp * apply(covid$X, 2, sd) palette <- colorRampPalette(c("blue", "cyan", "yellow", "red"))(100) idx <- cut(vips, breaks = 100, labels = FALSE) plot(x, y, type = "n", xlim = rev(range(x)), xlab='ppm', ylab='t_pred_sc') for (i in seq_len(length(x) - 1)) { segments(x[i], y[i], x[i+1], y[i+1], col = palette[idx[i]], lwd = 2) }
Performs Y-permutation tests to assess the robustness of OPLS models by comparing model performance statistics on real vs. permuted response labels.
opls_perm(smod, n = 10, plot = TRUE, mc = FALSE)opls_perm(smod, n = 10, plot = TRUE, mc = FALSE)
smod |
An OPLS model object of class |
n |
Integer. Number of permutations to perform. |
plot |
Logical. If |
mc |
Logical. If |
Each permutation shuffles the response labels and fits a new OPLS model. The function captures model statistics (R2, Q2, AUC) to compare against the non-permuted model. This helps determine whether the original model performance is better than expected by chance.
A data.frame with model metrics (e.g., R2, Q2, AUROC) from both permuted and original models.
Wiklund, S. et al. (2008). A Tool for Improved Validation of OPLS Models. Journal of Chemometrics, 22(11–12), 594–600.
Other model_validation:
cv_anova(),
dmodx()
Pareto Scaling Leaves variables unscaled. Optional centering.
pareto_scaling(center = FALSE)pareto_scaling(center = FALSE)
center |
Logical. If TRUE, variables are mean-centered before scaling. |
Scales variables by the square root of their standard deviation.
A list with elements:
Numeric matrix containing the scaled data.
List describing the preprocessing, including centering and scaling
parameters (center, scale, X_mean, X_sd).
Other scaling_strategies:
unscaled(),
uv_scaling()
paritalUV <- pareto_scaling(center=TRUE) X <- matrix(c(10,10, 0,0, 0, 10, 0, 1000), ncol=4) X_scaled <- prep_X(paritalUV, X) str(X_scaled) X_scaled$XparitalUV <- pareto_scaling(center=TRUE) X <- matrix(c(10,10, 0,0, 0, 10, 0, 1000), ncol=4) X_scaled <- prep_X(paritalUV, X) str(X_scaled) X_scaled$X
Fits an unsupervised Principal Component Analysis (PCA) model
to a spectral data matrix. Model-internal centering and scaling
are controlled via a ScalingStrategy object and do not
modify the input matrix.
The default backend uses an internal NIPALS implementation.
Alternatively, PCA algorithms from the pcaMethods package
(e.g. "ppca", "svd", "robustPca") can be used.
pca(X, scaling, ncomp, method = "nipals")pca(X, scaling, ncomp, method = "nipals")
X |
Numeric matrix (rows = samples, columns = variables). |
scaling |
A |
ncomp |
Integer. Number of principal components to compute. |
method |
Character. PCA backend. Use |
PCA decomposes into orthogonal score and loading matrices:
where:
contains the principal component scores
contains the loadings
The number of components is fixed by ncomp.
Unlike supervised models, PCA does not use cross-validation
or stopping rules.
Scaling and centering are applied internally during model fitting. The original input matrix is not modified.
An object of class m8_model with engine = "pca".
The object contains:
fit$t: Score matrix (samples × components)
fit$p: Loading matrix (components × variables)
ctrl: Model control information (variance explained, scaling settings)
provenance: Attributes inherited from the input matrix
Other modelling:
opls(),
pls()
data(covid) uv <- uv_scaling(center=TRUE) model <- pca(X=covid$X, scaling=uv, ncomp=2) show(model) summary(model) Tx <- scores(model) Px <- loadings(model) t2 <- hotellingsT2(Tx) ell <-ellipse2d(t2) # scores plot plot(Tx, asp = 1, col = as.factor(covid$an$type), xlim = range(c(Tx[1,], ell$x)), ylim = range(c(Tx[2,], ell$y)) ) lines(ell$x, ell$y, col = "grey", lty=2)data(covid) uv <- uv_scaling(center=TRUE) model <- pca(X=covid$X, scaling=uv, ncomp=2) show(model) summary(model) Tx <- scores(model) Px <- loadings(model) t2 <- hotellingsT2(Tx) ell <-ellipse2d(t2) # scores plot plot(Tx, asp = 1, col = as.factor(covid$an$type), xlim = range(c(Tx[1,], ell$x)), ylim = range(c(Tx[2,], ell$y)) ) lines(ell$x, ell$y, col = "grey", lty=2)
Plot one or multiple 1D ¹H NMR spectra using different rendering backends.
plot_spec( x, ppm = NULL, shift = c(0, 10), backend = c("plotly", "base", "ggplot2"), add = FALSE, ... ) spec(...) matspec(...)plot_spec( x, ppm = NULL, shift = c(0, 10), backend = c("plotly", "base", "ggplot2"), add = FALSE, ... ) spec(...) matspec(...)
x |
Numeric vector (single spectrum) or numeric matrix (spectra in rows). |
ppm |
Numeric vector of chemical shift values. Must match |
shift |
Numeric vector of length 2. Chemical shift window to display
(e.g., |
backend |
Character. Rendering backend. One of |
add |
Logical. If |
... |
Additional arguments passed to the selected backend. |
The function accepts both single spectra and matrices of spectra. Input is internally normalized to matrix form, subset to the selected ppm region, and then rendered using the chosen backend.
For large NMR datasets (e.g., >500 spectra × >10k ppm), the "base" and
"plotly" backends are substantially more memory-efficient than "ggplot",
which requires reshaping to long format.
Chemical shift axes are automatically displayed in decreasing order (NMR convention).
"plotly": a plotly object.
"ggplot": a ggplot2 object.
"base": NULL (invisibly).
data(hiit_raw) plot_spec(hiit_raw) plot_spec(hiit_raw, shift=c(-0.05,0.05), backend='base') plot_spec(hiit_raw, shift=c(-0.05,0.05), backend='ggplot2')data(hiit_raw) plot_spec(hiit_raw) plot_spec(hiit_raw, shift=c(-0.05,0.05), backend='base') plot_spec(hiit_raw, shift=c(-0.05,0.05), backend='ggplot2')
Generates a STOCSY plot (covariance trace coloured by absolute correlation).
plotStocsy(stoc_mod, shift = c(0, 10), title = NULL)plotStocsy(stoc_mod, shift = c(0, 10), title = NULL)
stoc_mod |
An object of class |
shift |
Numeric vector of length 2 specifying the chemical shift range (ppm). |
title |
Optional character plot title. |
A ggplot2 object.
# st <- stocsy(X, ppm, driver = 5.233, plotting = FALSE) # plotStocsy(st, shift = c(5.15, 5.30), title = "Glucose") data(covid) cs = 5.233 # anomeric H of gluc s1 <- stocsy(covid$X, driver=cs, plotting = FALSE) plotStocsy(s1)# st <- stocsy(X, ppm, driver = 5.233, plotting = FALSE) # plotStocsy(st, shift = c(5.15, 5.30), title = "Glucose") data(covid) cs = 5.233 # anomeric H of gluc s1 <- stocsy(covid$X, driver=cs, plotting = FALSE) plotStocsy(s1)
Fits a supervised Partial Least Squares (PLS) model using a NIPALS-based algorithm with optional cross-validation and automatic component selection.
pls(X, Y, scaling, validation_strategy, maxPCo = 5)pls(X, Y, scaling, validation_strategy, maxPCo = 5)
X |
Numeric matrix of predictors (rows = samples, columns = variables). |
Y |
Numeric matrix or factor vector of responses. |
scaling |
A scaling strategy object (e.g., |
validation_strategy |
A cross-validation strategy object defining how resampling is performed (e.g., k-fold, Monte Carlo). |
maxPCo |
Integer. Maximum number of predictive components to evaluate. |
Model components are estimated sequentially using a NIPALS-based algorithm.
For each component, cross-validated performance metrics (e.g., Q², R²,
classification AUC) are computed according to the supplied
validation_strategy. Component extraction stops when the
stopRule indicates overfitting or when maxPCo is reached.
Scaling specified via scaling is applied internally during model
fitting and does not alter the input matrix X. Spectral preprocessing
steps (e.g., alignment, baseline correction) should be performed prior to
model fitting.
The returned model object stores:
Fitted component models
Cross-validation results
Performance metrics (R², Q², AUC)
Model control parameters
Input data provenance metadata
Session information for reproducibility
An object of class m8_model containing the fitted PLS model,
cross-validation results, and performance statistics.
Other modelling:
opls(),
pca()
data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-pls(X=covid$X, Y=covid$an$type, scaling, cv) show(model) summary(model) Tp <- scores(model) Pp <- loadings(model)data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-pls(X=covid$X, Y=covid$an$type, scaling, cv) show(model) summary(model) Tp <- scores(model) Pp <- loadings(model)
Identifies local maxima, minima, or both from smoothed NMR spectra using Savitzky–Golay filtering.
ppick(X, ppm, fil_p = 3, fil_n = 5, type = "max")ppick(X, ppm, fil_p = 3, fil_n = 5, type = "max")
X |
Numeric matrix. NMR data with spectra in rows and chemical shifts in columns. |
ppm |
Numeric vector. Chemical shift values corresponding to columns in |
fil_p |
Integer. Polynomial order of the Savitzky–Golay filter. |
fil_n |
Integer. Filter length (must be odd) of the Savitzky–Golay filter. |
type |
Character. Type of extrema to return: |
The spectra are smoothed using a Savitzky–Golay filter to reduce noise. Extrema are then detected by identifying sign changes in the first derivative of the smoothed signal.
A list of data frames, one per spectrum. Each data frame contains:
idc: Index of the detected peak.
ppm: Chemical shift at the peak.
Int: Intensity at the peak.
Etype: Extrema type: 1 for minima, -1 for maxima.
data(covid) X <- covid$X ppm <- covid$ppm peaklist <- ppick(X, ppm) plot_spec(X[1, ], ppm, shift = c(1.2, 1.4), backend='base') points(peaklist[[1]]$ppm, peaklist[[1]]$Int, col = 'cyan')data(covid) X <- covid$X ppm <- covid$ppm peaklist <- ppick(X, ppm) plot_spec(X[1, ], ppm, shift = c(1.2, 1.4), backend='base') points(peaklist[[1]]$ppm, peaklist[[1]]$Int, col = 'cyan')
Finds local extrema in 1D spectra using Savitzky–Golay first and second derivatives. Candidate peaks are identified at zero-crossings of the first derivative and classified by the sign of the second derivative. Optional filters control peak height, prominence, SNR, curvature and minimum separation.
ppick2( X, ppm = NULL, type = c("max", "min", "both"), fil_p = 3L, fil_n = 11L, noise_win = NULL, min_snr = 10, min_height = NULL, min_prominence = NULL, prom_half_window_ppm = 0.02, min_distance_ppm = 0.005, min_curvature = NULL, keep_cols = c("height", "snr", "curvature", "prominence") )ppick2( X, ppm = NULL, type = c("max", "min", "both"), fil_p = 3L, fil_n = 11L, noise_win = NULL, min_snr = 10, min_height = NULL, min_prominence = NULL, prom_half_window_ppm = 0.02, min_distance_ppm = 0.005, min_curvature = NULL, keep_cols = c("height", "snr", "curvature", "prominence") )
X |
Numeric matrix or vector. Spectra in rows. |
ppm |
Numeric vector or NULL. If NULL, inferred from colnames(X). |
type |
Character. "max", "min", or "both". |
fil_p |
Integer. Polynomial order for SG filter. |
fil_n |
Integer. Window length for SG filter (odd). |
noise_win |
Numeric length-2 vector or NULL. Ppm window used to estimate noise per spectrum. If NULL, a robust noise estimate is computed from the full spectrum (MAD of first differences). |
min_snr |
Numeric. Minimum SNR (peak height / noise). Set NULL to disable. |
min_height |
Numeric. Minimum absolute peak height (in original intensity units). NULL disables. |
min_prominence |
Numeric. Minimum local prominence. NULL disables. |
prom_half_window_ppm |
Numeric. Half-window (ppm) for prominence estimation around each peak. |
min_distance_ppm |
Numeric. Minimum separation between peaks (ppm). NULL disables. |
min_curvature |
Numeric. Minimum absolute curvature at peak (|d2|). NULL disables. |
keep_cols |
Character. Extra columns to keep (default keeps all computed). |
List of data.frames (one per spectrum). Each contains: idc, ppm, Int, Etype (+1 max, -1 min), height, snr, curvature, prominence.
data(covid) X <- covid$X ppm <- covid$ppm peaklist <- ppick2(X[1,], ppm, min_snr=50) plot_spec(X[1, ], ppm, shift = c(3, 4.5), backend='base') points(peaklist[[1]]$ppm, peaklist[[1]]$Int, col = "cyan") head(peaklist[[1]])data(covid) X <- covid$X ppm <- covid$ppm peaklist <- ppick2(X[1,], ppm, min_snr=50) plot_spec(X[1, ], ppm, shift = c(3, 4.5), backend='base') points(peaklist[[1]]$ppm, peaklist[[1]]$Int, col = "cyan") head(peaklist[[1]])
Applies probabilistic quotient normalisation (PQN) to spectra. PQN estimates a sample-specific dilution factor from the median of quotients relative to a reference spectrum and scales each spectrum accordingly.
pqn( X, ref_index = NULL, total_area = FALSE, bin = NULL, iref = NULL, TArea = NULL )pqn( X, ref_index = NULL, total_area = FALSE, bin = NULL, iref = NULL, TArea = NULL )
X |
Numeric matrix or data.frame. Each row is a sample spectrum and each column is a variable (e.g. chemical shift point or bin). |
ref_index |
Integer vector of row indices used to compute the reference
spectrum. If |
total_area |
Logical. If |
bin |
Optional named list controlling binning for reference estimation,
e.g. |
iref |
Deprecated. Use |
TArea |
Deprecated. Use |
Mechanics. Let be spectrum and a reference spectrum.
PQN computes quotients and defines the dilution factor as
. The PQN-normalised spectrum is
.
The reference spectrum is typically the median spectrum across all samples
or across QC samples (ref_index).
If bin is provided, and dilution factors are computed on binned spectra,
but applied to the original spectra.
Dilution factors are stored in attr(X, "m8_pqn")$dilution_factor.
Numeric matrix of PQN-normalised spectra.
Total area normalisation prior to PQN is usually not recommended. Total area scaling removes global intensity differences by enforcing equal total signal per sample. PQN is itself a global scaling method intended to estimate dilution. Applying both can substantially change results because PQN no longer estimates dilution alone, but also compensates compositional distortions introduced by total area scaling.
Situations where total_area = TRUE can be defensible include:
when spectra have large, non-dilution-related amplitude differences caused by acquisition artefacts (receiver gain / baseline offset) and you explicitly want to stabilise the reference estimation step;
when the measured total signal is expected to be constant by design (e.g. strictly controlled sample mass/volume and stable overall metabolite pool), and the main goal is to reduce technical scaling variation before PQN.
In most metabolomics settings, prefer PQN without total area scaling.
#' @section On spectral alignment and binning:
PQN assumes that corresponding variables represent the same chemical signal
across spectra. If spectra are not well aligned, small peak shifts can inflate
the variability of pointwise quotients , leading to unstable
dilution factor estimates.
In such cases, slight binning (e.g. narrow fixed-width bins) prior to reference estimation is recommended. Binning reduces sensitivity to minor misalignments by aggregating neighbouring variables. However, excessive binning may obscure narrow signals and should be avoided.
Alternatively, prior spectral alignment is preferable when available.
Dieterle F, Ross A, Schlotterbeck G, Senn H (2006). Probabilistic Quotient Normalization as Robust Method to Account for Dilution of Complex Biological Mixtures. Analytical Chemistry, 78(13), 4281–4290.
Other preprocessing:
align_segment(),
align_spectra(),
binning(),
calibrate(),
correct_baseline(),
correct_lw(),
print_preprocessing()
set.seed(1) ppm <- seq(0, 10, length.out = 1000) ref <- dnorm(ppm, 3, 0.15) + dnorm(ppm, 6, 0.20) + dnorm(ppm, 7.5, 0.18) dil <- c(1, 0.8, 0.6, 0.4, 0.2) # true dilution factors X <- t(sapply(dil, function(d) d * ref + rnorm(length(ref), 0, 0.005))) plot_spec(X, ppm) Xn <- pqn(X, ref_index=1) dil_est <- attr(Xn, "m8_pqn")$dilution_factor cbind(true = dil, estimated = dil_est)set.seed(1) ppm <- seq(0, 10, length.out = 1000) ref <- dnorm(ppm, 3, 0.15) + dnorm(ppm, 6, 0.20) + dnorm(ppm, 7.5, 0.18) dil <- c(1, 0.8, 0.6, 0.4, 0.2) # true dilution factors X <- t(sapply(dil, function(d) d * ref + rnorm(length(ref), 0, 0.005))) plot_spec(X, ppm) Xn <- pqn(X, ref_index=1) dil_est <- attr(Xn, "m8_pqn")$dilution_factor cbind(true = dil, estimated = dil_est)
Applies a preprocessing strategy to a numeric matrix.
prep_X(preproc_strategy, X)prep_X(preproc_strategy, X)
preproc_strategy |
list with elements 'center' (logical) and 'scale' (character). |
X |
numeric matrix - processing is done column-wise. |
A list containing the processed matrix and parameters.
X <- matrix(c(10,10, 0,0, 0, 10), ncol=3) autoscale <- uv_scaling(center=TRUE) X_scaled <- prep_X(autoscale, X) str(X_scaled)X <- matrix(c(10,10, 0,0, 0, 10), ncol=3) autoscale <- uv_scaling(center=TRUE) X_scaled <- prep_X(autoscale, X) str(X_scaled)
List available preprocessing functions Returns the preprocessing utilities provided by metabom8.
print_preprocessing()print_preprocessing()
A named character vector describing preprocessing functions.
Other preprocessing:
align_segment(),
align_spectra(),
binning(),
calibrate(),
correct_baseline(),
correct_lw(),
pqn()
list_preprocessing()list_preprocessing()
Displays the recorded preprocessing history attached to a metabom8
spectral matrix. The function reads the "m8_prep" attribute
and prints each processing step in chronological order, including
parameters and notes.
print_provenance(x, detail = FALSE, max_items = 8)print_provenance(x, detail = FALSE, max_items = 8)
x |
A metabom8 object or spectral matrix with attached
|
detail |
Prints full information trail, incl. timestamp / versioning |
max_items |
Limits individual list entries (e.g., parameter) to specified number |
The input can be either:
A metabom8-style list containing $X, or
A matrix with metabom8 provenance attributes attached.
If no preprocessing metadata is found, a message is printed.
metabom8 records preprocessing steps as an ordered list of
transformation descriptors stored in the "m8_prep" attribute.
Each step typically contains:
step: Name of the preprocessing operation
params: Parameters used
notes: Optional description
time: Timestamp
pkg: Package name and version
This function provides a compact audit trail of the processing workflow, facilitating reproducibility and provenance inspection.
Invisibly returns NULL. This function is called for
its side effect of printing pipeline information.
Other provenance:
add_note(),
get_provenance()
params <- list( runtime = "docker", image = Sys.getenv("IMAGE", "docker-image-dummy"), workflow = Sys.getenv("M8_WORKFLOW", "std_prof-urine"), agent = paste0("snakemake/", Sys.getenv("SNAKEMAKE_VERSION", "v?")), run_id = Sys.getenv("M8_RUN_ID", "m8-2605-001") ) data(hiit_raw) print_provenance(hiit_raw) hiit_proc <- hiit_raw |> calibrate(type = "tsp") |> excise() |> add_note('dilution-adaptive acquisition mode -> verify snr after normalisation', params) print_provenance(hiit_proc)params <- list( runtime = "docker", image = Sys.getenv("IMAGE", "docker-image-dummy"), workflow = Sys.getenv("M8_WORKFLOW", "std_prof-urine"), agent = paste0("snakemake/", Sys.getenv("SNAKEMAKE_VERSION", "v?")), run_id = Sys.getenv("M8_RUN_ID", "m8-2605-001") ) data(hiit_raw) print_provenance(hiit_raw) hiit_proc <- hiit_raw |> calibrate(type = "tsp") |> excise() |> add_note('dilution-adaptive acquisition mode -> verify snr after normalisation', params) print_provenance(hiit_proc)
Imports TopSpin-processed 1D NMR spectra together with spectrometer acquisition
and TopSpin processing parameters (acqus and procs, respectively).
read1d( path, exp_type = list(pulprog = "noesygppr1d"), n_max = 1000, filter = TRUE, recursive = TRUE, verbose = 1, to_global = FALSE ) read1d_proc( path, exp_type = list(pulprog = "noesygppr1d"), n_max = 1000, filter = TRUE, recursive = TRUE, verbose = 1, to_global = FALSE )read1d( path, exp_type = list(pulprog = "noesygppr1d"), n_max = 1000, filter = TRUE, recursive = TRUE, verbose = 1, to_global = FALSE ) read1d_proc( path, exp_type = list(pulprog = "noesygppr1d"), n_max = 1000, filter = TRUE, recursive = TRUE, verbose = 1, to_global = FALSE )
path |
Character. Directory path containing Bruker NMR experiments. |
exp_type |
Named list. Optional filtering specification based on
acquisition or processing metadata. Each list element must correspond to a
metadata field (e.g.
Multiple fields are combined using logical AND. |
n_max |
Integer. Maximum number of spectra to import. Default: 1000. |
filter |
Logical. Filter out experiments with incomplete file systems. |
recursive |
Logical. Search |
verbose |
Logical or numeric. Verbosity level. |
to_global |
Logical. If |
A named list with three elements:
A numeric matrix of spectra (rows = samples, columns = ppm values).
A numeric vector of chemical shift values (ppm).
A data frame of acquisition and processing metadata,
row-aligned with X.
If to_global = TRUE, objects with the same names in the global
environment will be overwritten.
path <- system.file("extdata", package = "metabom8") read1d_proc(path, exp_type = list(pulprog = "noesygppr1d"), n_max = 2)path <- system.file("extdata", package = "metabom8") read1d_proc(path, exp_type = list(pulprog = "noesygppr1d"), n_max = 2)
Reads Bruker 1D NMR FIDs, corrects the digital filter (group delay), applies apodisation (windowing), optional zero-filling, FFT, phasing and ppm calibration.
Returns either absorption-, dispersion-, or magnitude-mode spectra.
If to_global = TRUE, objects are assigned to the global environment.
read1d_raw( path, exp_type = list(exp = "PROF_PLASMA_CPMG128_3mm", pulprog = "noesygppr1d"), apodisation = list(fun = "exponential", lb = 0.2), zerofill = 1L, mode = c("absorption", "dispersion", "magnitude"), verbose = 1, recursive = TRUE, n_max = 1000, filter = TRUE, to_global = FALSE )read1d_raw( path, exp_type = list(exp = "PROF_PLASMA_CPMG128_3mm", pulprog = "noesygppr1d"), apodisation = list(fun = "exponential", lb = 0.2), zerofill = 1L, mode = c("absorption", "dispersion", "magnitude"), verbose = 1, recursive = TRUE, n_max = 1000, filter = TRUE, to_global = FALSE )
path |
Character. Path to the root directory containing Bruker experiment folders. |
exp_type |
Named list. Acquisition-parameter filter to select experiments
(e.g., |
apodisation |
Named list. Apodisation function and parameters. |
zerofill |
Integer. Zero-filling exponent ( |
mode |
Character. Spectrum type to return: |
verbose |
Integer. Verbosity level: |
recursive |
Logical. Recursively search subdirectories for FIDs. |
n_max |
Integer. Maximum number of experiments to process. |
filter |
Logical. Remove experiments with incomplete file structures. |
to_global |
Logical. If TRUE, objects are also assigned to the global environment; otherwise, only an invisible list is returned. |
FIDs are read from Bruker acquisition folders and processed by the following pipeline:
Digital-filter (group-delay) correction: the initial n complex points
are invalid due to the causal DSP decimation filter and are discarded; n
equals GRPDLY when present, or is looked up from Bruker tables indexed by
DECIM and DSPFVS on older systems.
Apodisation (windowing).
Zero-filling (optional).
FFT to the frequency domain.
Phase correction.
PPM calibration (e.g., to TSP).
A common ppm scale is then interpolated across spectra.
Digital filter note: On newer systems GRPDLY is written in acqus/acqu2s
and should be used directly. For older data sets (GRPDLY < 0 or missing), the group delay
is derived from DECIM and DSPFVS via an internal look-up table.
Apodisation functions:
"uniform"
"cosine",
"sine",
"exponential" (parameter: lb)
"sem" (sine * exponential, parameter: lb)
"gauss" (parameter: lb, 'gb', and 'para')
"expGaus_resyG" (parameter: lb, 'gb', and 'aq_t')
A named list with three elements:
A numeric matrix of spectra (rows = samples, columns = ppm values).
A numeric vector of chemical-shift axis (ppm).
A data frame of acquisition metadata, row-aligned with X.
If to_global = TRUE, these objects are also assigned to the global environment.
In that case, any existing objects with the same names will be overwritten.
read1d_proc for importing TopSpin-processed spectra
path <- system.file("extdata", package = "metabom8") read1d_raw( path, exp_type = list(PULPROG = "noesygppr1d"), apodisation = list(fun = "exponential", lb = 0.2), zerofill = 1, n_max = 3 )path <- system.file("extdata", package = "metabom8") read1d_raw( path, exp_type = list(PULPROG = "noesygppr1d"), apodisation = list(fun = "exponential", lb = 0.2), zerofill = 1, n_max = 3 )
PLS/OPLS model scores
scores(object, ...) ## S4 method for signature 'm8_model' scores(object, orth = FALSE, cv = FALSE)scores(object, ...) ## S4 method for signature 'm8_model' scores(object, orth = FALSE, cv = FALSE)
object |
An object of class |
... |
Additional arguments (currently ignored). |
orth |
Logical indicating whether orthogonal scores should be returned (only applicable for OPLS models). |
cv |
Logical indicating whether cross-validated scores should be returned |
Numeric vector or matrix containing scores.
data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) show(model) scores(model, orth=FALSE) scores(model, orth=TRUE) scores(model, cv=TRUE)data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) show(model) scores(model, orth=FALSE) scores(model, orth=TRUE) scores(model, cv=TRUE)
Rescales a numeric vector to a specified range using min-max scaling. This is a generalized form of min-max normalization allowing any output range.
scRange(x, ra)scRange(x, ra)
x |
Numeric vector. Input values to be scaled. |
ra |
Numeric vector of length 2. Desired output range (e.g., |
The scaled values are computed as:
A numeric vector of the same length as x, scaled to the range ra.
x <- rnorm(20) plot(x, type = 'l'); abline(h = range(x), lty = 2) points(scRange(x, ra = c(5, 10)), type = 'l', col = 'red'); abline(h = c(5, 10), col = 'red', lty = 2)x <- rnorm(20) plot(x, type = 'l'); abline(h = range(x), lty = 2) points(scRange(x, ra = c(5, 10)), type = 'l', col = 'red'); abline(h = c(5, 10), col = 'red', lty = 2)
Performs STOCSY analysis on 1D NMR spectra. Computes correlation and covariance
of all variables in X against a driver signal (internal ppm position or
an external numeric vector).
stocsy(X, ppm = NULL, driver, plotting = TRUE, title = NULL)stocsy(X, ppm = NULL, driver, plotting = TRUE, title = NULL)
X |
Numeric matrix or data frame. Rows are spectra, columns are variables. |
ppm |
Optional numeric vector of chemical shift values (length = ncol(X)).
If missing or NULL, will try to read from |
driver |
Numeric scalar (ppm value, internal driver) or numeric vector (external driver, length = nrow(X)). |
plotting |
Logical. If TRUE, plot the STOCSY spectrum. |
title |
Optional character plot title. |
An object of class m8_stocsy1d (S3), a named list with entries:
version, X, ppm, driver, r, cov, extD.
Other structural_annotation:
storm()
# st <- stocsy(X, ppm, driver = 5.233, plotting = FALSE) # plotStocsy(st, shift = c(5.15, 5.30)) data(covid) cs = 5.233 stocsy(covid$X, driver=cs, plotting = TRUE)# st <- stocsy(X, ppm, driver = 5.233, plotting = FALSE) # plotStocsy(st, shift = c(5.15, 5.30)) data(covid) cs = 5.233 stocsy(covid$X, driver=cs, plotting = TRUE)
Selects an optimal subset of spectra that best match a specified target signal region, improving downstream correlation-based structural analysis such as STOCSY.
storm(X, ppm, b = 30, q = 0.05, idx.refSpec, shift)storm(X, ppm, b = 30, q = 0.05, idx.refSpec, shift)
X |
Numeric matrix (or data.frame) of NMR spectra with samples in rows and spectral variables in columns. |
ppm |
Numeric vector of chemical shift values corresponding to the columns of |
b |
Integer. Half-window size expressed as number of spectral variables (data points). The effective window width therefore depends on the ppm spacing. |
q |
Numeric. P-value threshold for including spectral variables when updating the reference region. |
idx.refSpec |
Integer. Row index of |
shift |
Numeric vector of length 2 giving the ppm range (minimum, maximum) defining the initial target signal region. |
STORM iteratively refines a subset of spectra exhibiting consistent signal position and multiplicity within a specified ppm region.
Starting from an initial reference spectrum and ppm window:
Spectra positively correlated with the current reference signal are retained.
A driver peak (maximum intensity within the reference window) is identified.
Correlation and covariance are evaluated within a local window of size b
around the driver peak.
The reference region is updated using variables that satisfy the p-value
threshold (q) and show positive correlation.
The procedure continues until the selected subset stabilises. The resulting row indices define spectra that most consistently represent the structural pattern of the target signal.
STORM does not perform metabolite identification directly. Instead, it refines the dataset to enhance structural coherence prior to correlation-based interpretation methods such as STOCSY.
Integer vector of row indices of X defining the selected spectral subset.
Posma, J. M., et al. (2012). Subset optimisation by reference matching (STORM):
an optimised statistical approach for recovery of metabolic biomarker structural
information from NMR spectra of biofluids.
Analytical Chemistry, 84(24), 10694–10701.
Other structural_annotation:
stocsy()
## Simulated example with three Gaussian signals set.seed(123) n <- 100 # number of spectra S <- 1000 # number of spectral variables ppm <- seq(10, 0, length.out = S) gauss <- function(x, centre, width, height) { height * exp(-((x - centre)^2) / (2 * width^2)) } ## Generate base signals sig1 <- gauss(ppm, 7, 0.05, 10) sig2 <- gauss(ppm, 3.5, 0.10, 8) sig3 <- gauss(ppm, 1, 0.07, 5) spectra <- matrix(0, n, S) for (i in seq_len(n)) { spectra[i, ] <- sig1 + sig2 + rnorm(S, 0, 0.1) if (i <= 25) { spectra[i, ] <- spectra[i, ] + sig3 } } ## Apply STORM to refine spectra containing the 1 ppm signal idx <- storm( X = spectra, ppm = ppm, b = 30, q = 0.05, idx.refSpec = 1, shift = c(0.75, 1.25) ) length(idx) # number of spectra retained## Simulated example with three Gaussian signals set.seed(123) n <- 100 # number of spectra S <- 1000 # number of spectral variables ppm <- seq(10, 0, length.out = S) gauss <- function(x, centre, width, height) { height * exp(-((x - centre)^2) / (2 * width^2)) } ## Generate base signals sig1 <- gauss(ppm, 7, 0.05, 10) sig2 <- gauss(ppm, 3.5, 0.10, 8) sig3 <- gauss(ppm, 1, 0.07, 5) spectra <- matrix(0, n, S) for (i in seq_len(n)) { spectra[i, ] <- sig1 + sig2 + rnorm(S, 0, 0.1) if (i <= 25) { spectra[i, ] <- spectra[i, ] + sig3 } } ## Apply STORM to refine spectra containing the 1 ppm signal idx <- storm( X = spectra, ppm = ppm, b = 30, q = 0.05, idx.refSpec = 1, shift = c(0.75, 1.25) ) length(idx) # number of spectra retained
Y-stratified k-fold cross-validation strategy
stratified_kfold(k, type = c("DA", "R"), probs = NULL)stratified_kfold(k, type = c("DA", "R"), probs = NULL)
k |
Integer. Number of folds. |
type |
Character. Either |
probs |
Numeric vector of quantile probabilities used
to stratify continuous |
For classification (type = "DA"), folds are generated such that
class proportions are approximately preserved in each fold.
For regression (type = "R"), Y is discretised into bins
defined by probs, and folds are generated to approximately
preserve the bin distribution.
A named list with elements:
List of integer vectors containing training set indices for each resampling iteration.
Character string indicating the resampling strategy.
Integer. Number of samples in the dataset.
Integer. Random seed used to generate the resampling splits, ensuring reproducibility.
Other resampling strategies:
balanced_boot(),
balanced_mc(),
kfold(),
mc()
set.seed(1) n <- 100 thr <- 1.5 Y <- c(rnorm(80, thr - 3, 0.3), rnorm(20, thr + 3, 0.3)) # unbalanced outcome mean(Y > thr) q80 <- quantile(Y, 0.8) # defines the rare "high" stratum (top 20%) cv_k <- kfold(k = 10) cv_sk <- stratified_kfold(k = 10, type = "R", probs = c(0, 0.8, 1)) k_inst <- metabom8:::.arg_check_cv(cv_pars=cv_k, model_type='R', n=n, Y_prepped=cbind(Y)) sk_inst <- metabom8:::.arg_check_cv(cv_pars=cv_sk, model_type='R', n=n, Y_prepped=cbind(Y)) round(sapply(k_inst$train, function(i) mean(Y[i] > q80)), 2) # reflects imbalance round(sapply(sk_inst$train, function(i) mean(Y[i] > q80)), 2) # balanced across strataset.seed(1) n <- 100 thr <- 1.5 Y <- c(rnorm(80, thr - 3, 0.3), rnorm(20, thr + 3, 0.3)) # unbalanced outcome mean(Y > thr) q80 <- quantile(Y, 0.8) # defines the rare "high" stratum (top 20%) cv_k <- kfold(k = 10) cv_sk <- stratified_kfold(k = 10, type = "R", probs = c(0, 0.8, 1)) k_inst <- metabom8:::.arg_check_cv(cv_pars=cv_k, model_type='R', n=n, Y_prepped=cbind(Y)) sk_inst <- metabom8:::.arg_check_cv(cv_pars=cv_sk, model_type='R', n=n, Y_prepped=cbind(Y)) round(sapply(k_inst$train, function(i) mean(Y[i] > q80)), 2) # reflects imbalance round(sapply(sk_inst$train, function(i) mean(Y[i] > q80)), 2) # balanced across strata
prep_X.No Scaling
This function defines a preprocessing strategy that is applied via
prep_X.
unscaled(center = FALSE)unscaled(center = FALSE)
center |
Logical. If TRUE, variables are mean-centered before scaling. |
A list with elements:
Numeric matrix containing the scaled data.
List describing the preprocessing, including centering and scaling
parameters (center, scale, X_mean, X_sd).
#' @details Leaves variables unscaled. Optional centering.
Other scaling_strategies:
pareto_scaling(),
uv_scaling()
just_centering <- unscaled(center=TRUE) X <- matrix(c(10,10, 0,0, 0, 10), ncol=3) X_centered <- prep_X(just_centering, X) str(X_centered) X_centered$Xjust_centering <- unscaled(center=TRUE) X <- matrix(c(10,10, 0,0, 0, 10), ncol=3) X_centered <- prep_X(just_centering, X) str(X_centered) X_centered$X
prep_X.Unit Variance Scaling
This function defines a preprocessing strategy that is applied via
prep_X.
uv_scaling(center = TRUE)uv_scaling(center = TRUE)
center |
Logical. If TRUE, variables are mean-centered before scaling. |
Centers variables and scales each feature to unit variance. UV scaling divides each variable by its standard deviation. This is the default scaling in many multivariate methods such as PCA and PLS.
A list with elements:
Numeric matrix containing the scaled data.
List describing the preprocessing, including centering and scaling
parameters (center, scale, X_mean, X_sd).
Other scaling_strategies:
pareto_scaling(),
unscaled()
autoscale <- uv_scaling(center=TRUE) X <- matrix(c(10,10, 0,0, 0, 10), ncol=3) X_scaled <- prep_X(autoscale, X) str(X_scaled) X_scaled$Xautoscale <- uv_scaling(center=TRUE) X <- matrix(c(10,10, 0,0, 0, 10), ncol=3) X_scaled <- prep_X(autoscale, X) str(X_scaled) X_scaled$X
Variable Importance in Projection (VIP)
vip(object) ## S4 method for signature 'm8_model' vip(object)vip(object) ## S4 method for signature 'm8_model' vip(object)
object |
An object of class |
Numeric vector or matrix containing the variable importance in projection (VIP) scores.
data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) show(model) vip_scores <- vip(model) dim(vip_scores)data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) show(model) vip_scores <- vip(model) dim(vip_scores)
Extract model weights
weights(object, ...) ## S4 method for signature 'm8_model' weights(object, orth = FALSE)weights(object, ...) ## S4 method for signature 'm8_model' weights(object, orth = FALSE)
object |
An object of class |
... |
Additional arguments (currently ignored). |
orth |
Logical indicating whether orthogonal scores should be returned (only applicable for OPLS models). |
Numeric vector or matrix containing model weights.
data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) show(model) W <- weights(model) Wo <- weights(model, orth = TRUE) dim(W) dim(Wo) == dim(W)data(covid) cv <- balanced_mc(k=5, split=2/3) scaling <- uv_scaling(center=TRUE) model <-opls(X=covid$X, Y=covid$an$type, scaling, cv) show(model) W <- weights(model) Wo <- weights(model, orth = TRUE) dim(W) dim(Wo) == dim(W)
Compute X residual matrix Returns the residual matrix (E) of an OPLS model.
xres(object) ## S4 method for signature 'm8_model' xres(object)xres(object) ## S4 method for signature 'm8_model' xres(object)
object |
An object. |
Numeric matrix containing the X residuals.
data(covid) cv <- balanced_mc(k = 5, split = 2/3) scaling <- uv_scaling(center = TRUE) model <- opls(X = covid$X, Y = covid$an$type, scaling, cv) X_res <- xres(model) dim(X_res) == dim(covid$X)data(covid) cv <- balanced_mc(k = 5, split = 2/3) scaling <- uv_scaling(center = TRUE) model <- opls(X = covid$X, Y = covid$an$type, scaling, cv) X_res <- xres(model) dim(X_res) == dim(covid$X)