Introduction to squallms

squallms is a Bioconductor package designed to easily label and remove low-quality chromatographic features from LC/GC-MS analysis. It does this by calculating a few metrics of peak quality from the raw MS data, providing methods for grouping similar peaks together and labeling them, and editing XCMS objects so only the high-quality peaks remain.

Setup

Installation

squallms can be installed from Bioconductor (from release 3.19 onwards):

if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("squallms")

squallms makes extensive use of the tidyverse for data shaping and organization as well as the RaMS package for MS data extraction. It also uses the shiny and plotly packages for interactive data visualization.

library(squallms)

Introduction

Mass-spectrometry coupled to chromatographic separation is a powerful technique for the discovery and quantification of molecules. Recent developments in automated peakpicking software have made it possible to query an entire dataset for molecular features that appear in the data as “peaks”, but these algorithms continue to have unacceptably high false positive rates that require manual review of their performance. Thus, a method for streamlining peak quality annotation in a robust and efficient way is very much in demand. This software package attempts to bridge this gap with a set of tools for the calculation of peak quality metrics and the rapid and interactive labeling of peak lists.

This package leans heavily on existing resources. Input files are required to be mzML/mzXML, mandating the use of Proteowizard’s msconvert or other software to convert proprietary formats into these open-source versions. This package also performs no peakpicking, retention time correction, or correspondence of its own, requiring the use of XCMS, MzMine, MSDIAL, or other software for the extraction of a preliminary peak list. However, the package has been designed to augment these existing resources (in particular XCMS) by accepting multiple input formats for peak lists that are shared across various systems.

At its core, this package takes in a peak list (a data frame containing columns for sample id and feature id) and returns one with only high-quality peaks. It does this by calculating a few metrics of peak quality from the raw MS data, providing methods for rapid feature viewing and labeling, and constructing a logistic model to label any remaining features. These steps are reviewed in detail with associated code entries in the rest of this vignette below.

Peakpicking with XCMS

This demo uses data peakpicked with XCMS but it’s worth noting that any peakpicking software could be used to produce a peak list as long as it can be turned into the proper input format (see “Calculating metrics of peak quality” below). The data files used here for demonstrationg purposes are those provided with the RaMS package.

suppressPackageStartupMessages({
    library(dplyr)
    library(tidyr)
    library(tibble)
    library(ggplot2)
    library(xcms)
    library(RaMS)
})
mzML_files <- system.file("extdata", package = "RaMS") %>%
    list.files(full.names = TRUE, pattern = "[A-F].mzML")

register(BPPARAM = SerialParam())
cwp <- CentWaveParam(snthresh = 0, extendLengthMSW = TRUE, integrate = 2)
obp <- ObiwarpParam(binSize = 0.1, response = 1, distFun = "cor_opt")
pdp <- PeakDensityParam(
    sampleGroups = 1:3, bw = 12, minFraction = 0,
    binSize = 0.001, minSamples = 0
)


xcms_major_version <- as.numeric(substr(as.character(packageVersion("xcms")), 1, 1))
if (xcms_major_version >= 4) {
    library(MSnbase)
    msexp <- readMSData(mzML_files, msLevel. = 1, mode = "onDisk")
    fpp <- FillChromPeaksParam(ppm = 5)
} else {
    library(MsExperiment)
    msexp <- readMsExperiment(mzML_files, msLevel. = 1, mode = "onDisk")
    fpp <- ChromPeakAreaParam()
}

xcms_filled <- msexp %>%
    findChromPeaks(cwp) %>%
    adjustRtime(obp) %>%
    groupChromPeaks(pdp) %>%
    fillChromPeaks(fpp)

This XCMS object is useful for documenting processing steps and interfacing with R-based pipelines but can be challenging to inspect and interact with. squallms provides a function that turns this S4 object into a flat file containing the feature and peak information.

feat_peak_info <- makeXcmsObjFlat(xcms_filled) %>%
    select(feature, starts_with("mz"), starts_with("rt"), filename, filepath)
feat_peak_info %>%
    head() %>%
    mutate(filepath = paste0(substr(filepath, 1, 13), "~")) %>%
    knitr::kable(caption = "Output from makeXcmsObjFlat.")
Output from makeXcmsObjFlat.
feature mz mzmin mzmax rt rtmin rtmax filename filepath
FT001 90.05549 90.05541 90.05556 11.11593 10.91385 11.41188 LB12HL_AB.mzML.gz /github/works~
FT001 90.05549 90.05544 90.05552 11.08192 10.87952 11.41167 LB12HL_CD.mzML.gz /github/works~
FT001 90.05549 90.05540 90.05553 11.05225 10.80270 11.39430 LB12HL_EF.mzML.gz /github/works~
FT002 90.05543 90.05523 90.05553 11.81833 11.53837 11.98883 LB12HL_CD.mzML.gz /github/works~
FT002 90.05546 90.05536 90.05551 11.78373 11.51833 11.98535 LB12HL_EF.mzML.gz /github/works~
FT002 90.05547 90.05518 90.05563 11.44277 11.36553 11.70645 LB12HL_AB.mzML.gz /github/works~

Although comprehensive, the only information needed is actually the grouping info, the peak bounding boxes, and the filepath (feature, mz*, rt*, and filepath) because we’ll recalculate the necessary metrics from the raw peak data pulled in from the mz(X)ML files. Similarly, other peakpicking algorithms tend to produce data in a similar format and can interface with squallms here.

We also can operate over only a subset of the data, which is quite useful for very large datasets. Quality control files or pooled samples with a similar expected quality can be used in lieu of the entire dataset which may not fit into computer memory. The chunk of code below shows a possible way to do this using the dplyr package’s filter command to keep only two of the files, dropping a third from the quality calculations.

feat_peak_info_subset <- feat_peak_info %>%
    filter(grepl("AB|CD", filename))

Note, however, that the retention time bounds must be uncorrected. makeXcmsObjFlat does this automatically for XCMS objects, but if corrected retention times are provided then the data extracted from the mz(X)ML files will be incorrect.

Calculating metrics of peak quality (extractChromMetrics)

The first step is to obtain some kind of information about the chromatographic features that will allow us to distinguish between good and bad. The most intuitive metrics (in my opinion) are similarity to a bell curve and the signal-to-noise ratio. Although many implementations of these exist, the ones I’ve found most useful are described in Kumler et al. (2023). The function for this in squallms is extractChromMetrics, as shown below. I read in the raw MS data with RaMS first because this saves a step and can be reused for quality checks.

msdata <- grabMSdata(unique(feat_peak_info$filepath), verbosity = 0)
shape_metrics <- extractChromMetrics(feat_peak_info, ms1_data = msdata$MS1)
knitr::kable(head(shape_metrics), caption = "Format of the output from extractChromMetrics")
Format of the output from extractChromMetrics
feature med_mz med_rt med_cor med_snr
FT001 90.05549 11.105383 0.9187332 14.425791
FT002 90.05546 11.756300 0.8074746 3.870094
FT003 93.07429 11.144350 0.9552928 17.061047
FT004 104.07101 10.645483 0.8738360 20.505134
FT005 104.07100 8.301808 0.9474947 18.936980
FT006 104.07098 7.373142 0.7835736 5.283807
shape_metrics %>%
    arrange(desc(med_snr)) %>%
    ggplot() +
    geom_point(aes(x = med_rt, y = med_mz, color = med_cor, size = med_snr), alpha = 0.5) +
    theme_bw()
Mass by time diagram of all features returned by XCMS. Points have been colored by their calculated peak shape metric and are sized by their calculated signal-to-noise ratio. Large, pale points tend to be higher quality than small dark ones.

Mass by time diagram of all features returned by XCMS. Points have been colored by their calculated peak shape metric and are sized by their calculated signal-to-noise ratio. Large, pale points tend to be higher quality than small dark ones.

The output from extractChromMetrics is a data frame with peak shape (med_cor) and signal-to-noise (med_snr) calculations as estimated for each feature in the dataset. Unfortunately, many of these features are very low-quality, often a result of noise rather than biological signal, and should not be included in the downstream analysis. We view a random subset of the features that XCMS produced in the below code chunk:

# Set seed for reproducibility, ensuring that slice_sample always returns the
# same features for discussion below.
set.seed(123)

some_random_feats <- shape_metrics %>%
    slice_sample(n = 8)
some_random_feats %>%
    mutate(mzmin = med_mz - med_mz * 5 / 1e6) %>%
    mutate(mzmax = med_mz + med_mz * 5 / 1e6) %>%
    mutate(rtmin = med_rt - 1) %>%
    mutate(rtmax = med_rt + 1) %>%
    left_join(msdata$MS1, join_by(
        between(y$rt, x$rtmin, x$rtmax),
        between(y$mz, x$mzmin, x$mzmax)
    )) %>%
    qplotMS1data(color_col = "med_cor") +
    geom_vline(aes(xintercept = med_rt), color = "red") +
    geom_text(aes(x = Inf, y = Inf, label = paste0("SNR: ", round(med_snr))),
        data = some_random_feats, hjust = 1, vjust = 1, color = "black"
    ) +
    facet_wrap(~feature, scales = "free", nrow = 3) +
    scale_color_continuous(limits = c(0, 1), name = "Peak shape metric") +
    scale_y_continuous(expand = expansion(c(0, 0.25))) +
    theme(legend.position.inside = c(0.82, 0.12), legend.position = "inside") +
    guides(color = guide_colorbar(direction = "horizontal", title.position = "top"))
Eight randomly selected features from the XCMS object shown as extracted ion chromatograms. Colors correspond to the med_cor metric and the med_snr value is shown as a label in the upper right. As shown above, high-quality peaks tend to be pale and have large SNR values but several features show very poor quality.

Eight randomly selected features from the XCMS object shown as extracted ion chromatograms. Colors correspond to the med_cor metric and the med_snr value is shown as a label in the upper right. As shown above, high-quality peaks tend to be pale and have large SNR values but several features show very poor quality.

As expected, many “noise” signals were picked up, even in this precleaned demo dataset. Fortunately, it also looks like our peak quality metrics are doing a reasonable job of distinguishing features that look ok (e.g. FT014, FT179) from those of poor quality (e.g. FT118, FT050) with lower values for med_cor and med_snr in the noisy features.

However, each MS setup, dataset, and analyst will have their own tolerance for peak quality, making it difficult to provide universal recommendations for thresholds at which med_cor and med_snr should be used to filter out poor-quality features. This means that some degree of manual review and labeling is likely always going to be necessary for a high-quality dataset.

Labeling

Labeling chromatographic features is usually a slow and painful step. XCMS’s methods for extracting chromatograms are sluggish and difficult to customize, while data entry usually involves having an Excel workbook or some similar data entry object open to record the values. squallms provides two ways of labeling features a little more easily.

Manual labeling with labelFeatsManual

labelFeatsManual accepts the data frame of peak information constructed above and launches a Shiny application to show one feature at a time. This application detects when arrow keys have been pressed and uses them as a data entry method for peak classification. Assignments can be made sequentially or randomly and specific features can be jumped to for rapid inspection. While the default settings use the arrow keys to label, these keys can be rebound using the “Reassign” button to suit the user’s convenience.

# Chunk not run during vignette build because interactive input required
# Annotations may differ from the saved output loaded below based on user preference
manual_classes <- labelFeatsManual(feat_peak_info, ms1_data = msdata$MS1)
Screenshot of lasso labeling dashboard showing the first feature in the dataset as seen via labelFeatsManual.
Screenshot of lasso labeling dashboard showing the first feature in the dataset as seen via labelFeatsManual.

This first peak looks reasonably good, so I would assign this a “Good” classification with a right-arrow key press which would then trigger the rendering of the next feature.

The labelFeatsManual function returns a vector containing the assigned quality assessments, with NAs for each feature that wasn’t labeled. An example of this output can be found included in the package for demonstration purposes and is loaded here for use in the rest of the vignette. Your annotations may differ from those that I established during development due to differences in quality tolerance, further highlighting the importance of an individually-driven annotation cycle.

manual_classes <- readRDS(system.file("extdata", "intro_manual_labels.rds", package = "squallms"))
table(manual_classes, useNA = "ifany")
#> manual_classes
#>  Bad Good <NA> 
#>   25   11  159

Lasso labeling with labelFeatsLasso

However, one of the big strengths of squallms is its implementation of a strategy for classifying many peaks simultaneously. To do this, the raw MS data for a multi-file feature is coerced to a set of shared retention times. This RT x file peak matrix can then be treated as individual “pixels” in a PCA that extracts the major patterns. For chromatographic peak data, the dominant signal is often noise vs real peak. This operation is performed using the pickyPCA function of the package, though for non-demonstration purposes it’s usually not necessary to call it directly as it’s performed within the labelFeatsLasso function. Here, we call pickyPCA with a small subset of 20 features to visually demonstrate the logic.

pcaoutput <- feat_peak_info %>%
    filter(feature %in% sprintf("FT%03d", 30:50)) %>%
    pickyPCA(
        ms1_data = msdata$MS1, rt_window_width = 1, ppm_window_width = 5,
        verbosity = 0
    )

The similarity between high-quality peaks can be seen visually in the figure below. Good peaks have a distinctly lighter center and darker sides corresponding to high values at the center of the peak and low values at the edges of the window. The pattern’s visible across multiple files as a ridge running vertically down each feature’s small multiple plot. In this case, features FT030, FT040, FT041, and FT044 look like high-quality features. Noise features have no obvious pattern, so the PCA won’t be able to extract signal from them as a dominant component. The other pattern are the features (FT031, FT042, FT048) that are intense at the beginning or at the end, corresponding to the noise peaks that CentWave loves picking out so much.

pcaoutput$interp_df %>%
    ggplot() +
    geom_tile(aes(x = approx_rt, y = filename, fill = approx_int)) +
    facet_wrap(~feature, nrow = 4) +
    scale_x_continuous(
        breaks = c(1, 25, 50), labels = c("0", "0.5", "1"),
        expand = expansion()
    ) +
    scale_y_discrete(expand = expansion()) +
    theme(
        axis.text = element_blank(), axis.ticks = element_blank(),
        legend.position = "none"
    ) +
    labs(x = "Scaled retention time", y = "Individual files")
File by time views of extracted ion chromatograms for a subset of features shown as a pixel matrix. Colors correspond to the relative intensity of the chromatogram across the gridded retention time axis, with good features consistently showing up as a bright streak down the center of the plot, while noise features look like static. XCMS feature IDs are shown in the facet labels.

File by time views of extracted ion chromatograms for a subset of features shown as a pixel matrix. Colors correspond to the relative intensity of the chromatogram across the gridded retention time axis, with good features consistently showing up as a bright streak down the center of the plot, while noise features look like static. XCMS feature IDs are shown in the facet labels.

When the PCA is performed, we then see these features separated from the rest along the first principal component, with features FT030, FT040, FT041, and FT044 all stacking almost on top of each other at the far right side of the principal component plot.

pcaoutput$pcamat %>%
    prcomp() %>%
    .$rotation %>%
    .[, 1:2] %>%
    as.data.frame() %>%
    rownames_to_column("feature") %>%
    ggplot(aes(x = PC1, y = PC2, label = feature, key = feature)) +
    geom_vline(xintercept = 0) +
    geom_hline(yintercept = 0) +
    geom_text() +
    coord_fixed() +
    scale_x_continuous(expand = expansion(0.2)) +
    labs(x = "Principal component 1", y = "Principal component 2") +
    theme_bw()
Principal component analysis of the pixel matrix features. Each 'pixel' in the previous plot was a dimension along which the PCA detected redundancy, allowing for a majority of the variance to be explained by the first axis in which good features separate from bad ones.

Principal component analysis of the pixel matrix features. Each ‘pixel’ in the previous plot was a dimension along which the PCA detected redundancy, allowing for a majority of the variance to be explained by the first axis in which good features separate from bad ones.

Of course, such separation isn’t guaranteed to be consistently to the left or right and isn’t always neatly along one axis. While a line between good and bad could be drawn with complicated math expressions, a much easier way is to simply engage an interactive visualization allowing for arbitrary selection. R’s shiny and plotly packages are perfect for this method, so squallms has a built-in Shiny application that launches in the browser and can be used to select such peaks using the “lasso” tool - thus, the “lasso labeling” part of the package name.

# Chunk not run during vignette build because interactive input required
# Annotations may differ from the saved output loaded below based on user preference
lasso_classes <- labelFeatsLasso(feat_peak_info, ms1_data = msdata$MS1)

In the screenshot below, I’ve moused over the feature IDs in the central plot as they render one at a time in the bottom left. Once I find a cluster where many high-quality features are found, I clicked and dragged to select a region of good features, which rendered the “aggregate” feature in the bottom right. Shortly after, I clicked the “Flag selection as Good” at the bottom of the left sidebar.

Screenshot of lasso labeling dashboard when ‘Good’ features are being lassoed in the principal components graph. Central plot shows the interactive PCA with feature IDs set to react to both mouse hover and lasso selection. Hovering over a given feature shows its extracted ion chromatogram in the lower left plot. Clicking and dragging to select multiple features with the lasso option produces the lower right figure which represents the aggregate chromatogram after normalization. Plot in the upper right shows the aggregate chromatogram for each cluster of the central PCA.
Screenshot of lasso labeling dashboard when ‘Good’ features are being lassoed in the principal components graph. Central plot shows the interactive PCA with feature IDs set to react to both mouse hover and lasso selection. Hovering over a given feature shows its extracted ion chromatogram in the lower left plot. Clicking and dragging to select multiple features with the lasso option produces the lower right figure which represents the aggregate chromatogram after normalization. Plot in the upper right shows the aggregate chromatogram for each cluster of the central PCA.

Of course, a training set needs examples of low-quality peaks as well. I then browsed other features in the central plot until finding a region of low-quality features that I could similarly select using the lasso tool and flag as Bad. Multiple regions can be selected and combined with additional lassos, and of course the rectangle selection tool can be used instead of the freehand.

Screenshot of lasso labeling dashboard when ‘Bad’ features are being lassoed in the principal components graph.
Screenshot of lasso labeling dashboard when ‘Bad’ features are being lassoed in the principal components graph.

In this case, the good/bad signal was so strongly dominant that I had no need for additional principal components and really only needed the first one to capture my internal sense of peak quality. However, other datasets may not identify the quality signal as the first or even second PC, especially if there’s a large biological signal present. To handle this case, I added a k-means clustering algorithm that operates on all PCs and renders their aggregate shape in the upper right. Combined with plotly’s ability to show or hide individual clusters by double-clicking the plot legend, this allows the user to isolate regions in multidimensional space while still only showing the first two PCs in the plot. Both k and the number of PCs used for the k-means clustering can be set in the left sidebar, while a simple barplot provides information about the number of PCs that capture 20, 50, and 80% of the variation in the dataset.

While both the number of clusters and the number of PCs used are arbitrary, the idea is that these provide a way to capture variation beyond the first two principal componenents if the initial distribution is unhelpful. I would recommend increasing the number of PCs used to capture 50+% of the variation, while the number of clusters can be toggled until the groups in the aggregate plots in the upper right look good and an entire cluster can be grabbed at one time.

After both good and bad features have been flagged, clicking the “Return to R” button at the bottom of the sidebar or closing the window will return a character vector consisting of “Good”, “Bad”, and NAs named by the feature with which those labels were associated. As before, I saved the output from a previous session on the demo data using the lassos shown in the screenshots above and have provided it alongside for demonstration purposes. Again, your exact values may differ from those provided here due to variable tolerances for feature quality.

lasso_classes <- readRDS(system.file("extdata", "intro_lasso_labels.rds", package = "squallms"))
table(lasso_classes, useNA = "ifany")
#> lasso_classes
#>  Bad Good <NA> 
#>   49   35  111

By comparing the lasso labels to the manual labels we can estimate the robustness of our labeling using typical metrics such as false positive/negative rates as well as accuracy and F1 scores, all provided concisely by the caret package.

suppressPackageStartupMessages(library(caret))
data.frame(manual = manual_classes, lasso = lasso_classes) %>%
    filter(!is.na(manual) & !is.na(lasso))
#>       manual lasso
#> FT007    Bad   Bad
#> FT008    Bad   Bad
#> FT026    Bad   Bad
#> FT039    Bad   Bad
#> FT072    Bad   Bad
#> FT085    Bad   Bad
#> FT087   Good  Good
#> FT092   Good  Good
#> FT111   Good  Good
#> FT120    Bad   Bad
#> FT153   Good  Good
#> FT171    Bad   Bad
#> FT181   Good  Good

data.frame(manual = factor(manual_classes), lasso = factor(lasso_classes)) %>%
    filter(!is.na(manual)) %>%
    with(confusionMatrix(lasso, manual, positive = "Good"))
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Bad Good
#>       Bad    8    0
#>       Good   0    5
#>                                      
#>                Accuracy : 1          
#>                  95% CI : (0.7529, 1)
#>     No Information Rate : 0.6154     
#>     P-Value [Acc > NIR] : 0.001815   
#>                                      
#>                   Kappa : 1          
#>                                      
#>  Mcnemar's Test P-Value : NA         
#>                                      
#>             Sensitivity : 1.0000     
#>             Specificity : 1.0000     
#>          Pos Pred Value : 1.0000     
#>          Neg Pred Value : 1.0000     
#>              Prevalence : 0.3846     
#>          Detection Rate : 0.3846     
#>    Detection Prevalence : 0.3846     
#>       Balanced Accuracy : 1.0000     
#>                                      
#>        'Positive' Class : Good       
#> 

If there are differences between the two (the demo data happens to align perfectly), we can replace the lasso labels with the (generally) more reliable manual labels.

class_labels <- lasso_classes
class_labels[!is.na(manual_classes)] <- manual_classes[!is.na(manual_classes)]

Building a quality model and removing low-quality peaks

Of course, both labeling methods are designed to return partially-labeled datasets. To get quality estimates for the remaining features, we can use a logistic model trained on a few of our metrics to estimate the quality of each feature. The basic idea here is to use the med_cor and med_snr information extracted with extractChromMetrics as the predictive variables and the labels produced by labelFeatsManual or labelFeatsLasso as the predicted variable in a logistic regression. The default formula uses glm(formula=feat_class~med_cor+med_snr, family = binomial) to fit the values but additional metrics that exist in the shape_metrics data frame can be used as predictive variables in the model as well.

logModelfeatProb aligns the feature metrics with the labels and performs the logistic regression, returning a vector of likelihoods for every feature in the dataset, not just those that were previously labeled. These values range from 0 to 1 and estimate the probability that a given feature would be assigned a “Good” classification if manually reviewed, with values close to 1 being very likely to be “Good” and values close to 0 more likely to be low-quality.

pred_probs <- logModelFeatProb(shape_metrics, class_labels)
#> Logistic model regression coefficients:
#> (Intercept)     med_cor     med_snr 
#>  -7.0349466   4.8379451   0.6066241
Scatterplot of features showing the relationship between med_cor and med_snr, with colors indicating the model likelihood estimation for known good features (filled circles), known bad features (X characters), and unclassified feature (open circles). The logistic fit can be seen as diagonal stripes of a given color descending from the upper left to the bottom right.

Scatterplot of features showing the relationship between med_cor and med_snr, with colors indicating the model likelihood estimation for known good features (filled circles), known bad features (X characters), and unclassified feature (open circles). The logistic fit can be seen as diagonal stripes of a given color descending from the upper left to the bottom right.

The function also returns the diagram seen above when the verbosity argument is set to 2. This plot shows the med_cor and med_snr values calculated for each feature colored by their likelihood and shaped by their assigned class. Hopefully, a majority of the filled circles are green and the X shapes are red, with a clear diagonal through the center where features are of middling quality and there’s a mix of good and bad features. This plot can be helpful in determining the probabilistic quality threshold above which features align with the user’s intuition about acceptable peak quality.

The values returned by logModelFeatProb can then be handled manually, or you can use the wrapper function logModelFeatQuality with an associated likelihood_threshold to return the binary good/bad classification. This also then makes it possible to assess the performance of the model on the training data itself, so the caret package is again used here to provide accuracy metrics and a confusion matrix.

pred_classes <- logModelFeatQuality(shape_metrics, class_labels, verbosity = 1)
#> Logistic model regression coefficients:
#> (Intercept)     med_cor     med_snr 
#>  -7.0349466   4.8379451   0.6066241 
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Bad Good
#>       Bad   61    6
#>       Good   4   35
#>                                           
#>                Accuracy : 0.9057          
#>                  95% CI : (0.8333, 0.9538)
#>     No Information Rate : 0.6132          
#>     P-Value [Acc > NIR] : 1.159e-11       
#>                                           
#>                   Kappa : 0.7993          
#>                                           
#>  Mcnemar's Test P-Value : 0.7518          
#>                                           
#>             Sensitivity : 0.8537          
#>             Specificity : 0.9385          
#>          Pos Pred Value : 0.8974          
#>          Neg Pred Value : 0.9104          
#>              Prevalence : 0.3868          
#>          Detection Rate : 0.3302          
#>    Detection Prevalence : 0.3679          
#>       Balanced Accuracy : 0.8961          
#>                                           
#>        'Positive' Class : Good            
#> 

Finally, once an appropriate threshold has been set and the model’s performance is acceptable the XCMS object itself can be edited to remove the poor-quality peaks.

final_xcms <- updateXcmsObjFeats(xcms_filled, shape_metrics, class_labels,
    verbosity = 0, likelihood_threshold = 0.5
)

This final XCMS object can then be passed to downstream software and pipelines with confidence!

Vignette diagnostics

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] caret_6.0-94        lattice_0.22-6      MSnbase_2.33.0     
#>  [4] ProtGenerics_1.39.0 S4Vectors_0.44.0    mzR_2.41.0         
#>  [7] Rcpp_1.0.13         Biobase_2.67.0      BiocGenerics_0.53.1
#> [10] generics_0.1.3      RaMS_1.4.3          xcms_4.4.0         
#> [13] BiocParallel_1.41.0 ggplot2_3.5.1       tibble_3.2.1       
#> [16] tidyr_1.3.1         dplyr_1.1.4         squallms_1.1.0     
#> [19] BiocStyle_2.35.0   
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          sys_3.4.3                  
#>   [3] jsonlite_1.8.9              MultiAssayExperiment_1.33.0
#>   [5] magrittr_2.0.3              farver_2.1.2               
#>   [7] MALDIquant_1.22.3           rmarkdown_2.28             
#>   [9] fs_1.6.5                    zlibbioc_1.52.0            
#>  [11] vctrs_0.6.5                 base64enc_0.1-3            
#>  [13] htmltools_0.5.8.1           S4Arrays_1.6.0             
#>  [15] progress_1.2.3              pROC_1.18.5                
#>  [17] SparseArray_1.6.0           mzID_1.45.0                
#>  [19] parallelly_1.38.0           sass_0.4.9                 
#>  [21] bslib_0.8.0                 htmlwidgets_1.6.4          
#>  [23] plyr_1.8.9                  lubridate_1.9.3            
#>  [25] impute_1.81.0               plotly_4.10.4              
#>  [27] cachem_1.1.0                buildtools_1.0.0           
#>  [29] igraph_2.1.1                mime_0.12                  
#>  [31] lifecycle_1.0.4             iterators_1.0.14           
#>  [33] pkgconfig_2.0.3             Matrix_1.7-1               
#>  [35] R6_2.5.1                    fastmap_1.2.0              
#>  [37] future_1.34.0               GenomeInfoDbData_1.2.13    
#>  [39] MatrixGenerics_1.19.0       shiny_1.9.1                
#>  [41] clue_0.3-65                 digest_0.6.37              
#>  [43] pcaMethods_1.99.0           colorspace_2.1-1           
#>  [45] GenomicRanges_1.59.0        labeling_0.4.3             
#>  [47] Spectra_1.16.0              timechange_0.3.0           
#>  [49] fansi_1.0.6                 httr_1.4.7                 
#>  [51] abind_1.4-8                 compiler_4.4.1             
#>  [53] proxy_0.4-27                withr_3.0.2                
#>  [55] doParallel_1.0.17           DBI_1.2.3                  
#>  [57] highr_0.11                  lava_1.8.0                 
#>  [59] MASS_7.3-61                 MsExperiment_1.9.0         
#>  [61] DelayedArray_0.33.1         ModelMetrics_1.2.2.2       
#>  [63] tools_4.4.1                 PSMatch_1.10.0             
#>  [65] httpuv_1.6.15               future.apply_1.11.3        
#>  [67] nnet_7.3-19                 glue_1.8.0                 
#>  [69] nlme_3.1-166                QFeatures_1.16.0           
#>  [71] promises_1.3.0              grid_4.4.1                 
#>  [73] cluster_2.1.6               reshape2_1.4.4             
#>  [75] recipes_1.1.0               gtable_0.3.6               
#>  [77] class_7.3-22                preprocessCore_1.69.0      
#>  [79] data.table_1.16.2           hms_1.1.3                  
#>  [81] MetaboCoreUtils_1.15.0      xml2_1.3.6                 
#>  [83] utf8_1.2.4                  XVector_0.46.0             
#>  [85] foreach_1.5.2               pillar_1.9.0               
#>  [87] stringr_1.5.1               limma_3.63.0               
#>  [89] later_1.3.2                 splines_4.4.1              
#>  [91] survival_3.7-0              tidyselect_1.2.1           
#>  [93] maketools_1.3.1             knitr_1.48                 
#>  [95] IRanges_2.41.0              SummarizedExperiment_1.36.0
#>  [97] xfun_0.48                   hardhat_1.4.0              
#>  [99] statmod_1.5.0               timeDate_4041.110          
#> [101] matrixStats_1.4.1           stringi_1.8.4              
#> [103] UCSC.utils_1.2.0            lazyeval_0.2.2             
#> [105] yaml_2.3.10                 evaluate_1.0.1             
#> [107] codetools_0.2-20            MsCoreUtils_1.19.0         
#> [109] BiocManager_1.30.25         cli_3.6.3                  
#> [111] affyio_1.77.0               rpart_4.1.23               
#> [113] xtable_1.8-4                munsell_0.5.1              
#> [115] jquerylib_0.1.4             GenomeInfoDb_1.43.0        
#> [117] MassSpecWavelet_1.73.0      globals_0.16.3             
#> [119] XML_3.99-0.17               parallel_4.4.1             
#> [121] gower_1.0.1                 prettyunits_1.2.0          
#> [123] AnnotationFilter_1.31.0     listenv_0.9.1              
#> [125] viridisLite_0.4.2           keys_0.1.1                 
#> [127] ipred_0.9-15                e1071_1.7-16               
#> [129] prodlim_2024.06.25          MsFeatures_1.15.0          
#> [131] scales_1.3.0                affy_1.85.0                
#> [133] ncdf4_1.23                  purrr_1.0.2                
#> [135] crayon_1.5.3                rlang_1.1.4                
#> [137] vsn_3.74.0

Vignette last built on 2024-10-31