The yamss (yet another mass spectrometry software) package provides tools to preprocess raw mass spectral data files arising from metabolomics experiments with the primary goal of providing high-quality differential analysis. Currently, yamss implements a preprocessing method “bakedpi”, which stands for bivariate approximate kernel density estimation for peak identification.
Alternatives to this package include xcms (Smith et al. 2006) (available on Bioconductor) and MZMine 2 (Pluskal et al. 2010). These packages also provide preprocessing functionality but focus more on feature detection and alignment in individual samples. The input data to yamss are standard metabolomics mass spectral files which can be read by mzR.
The “bakedpi” algorithm is a new preprocessing algorithm for untargeted metabolomics data (Myint et al. 2017). The output of “bakedpi” is essentially a table with dimension peaks (adducts) by samples, containing quantified intensity measurements representing the abundance of metabolites. This table, which is very similar to output in gene expression analysis, can directly be used in standard statistical packages, such as limma, to identify differentially abundant metabolites between conditions. It is critical that all samples which are to be analyzed together are processed together through bakedpi.
Please cite our paper when using yamss (Myint et al. 2017).
We will be looking at data provided in the mtbls2
data
package. In this experiment, investigators exposed wild-type and mutant
Arabidopsis thaliana leaves to silver nitrate and performed
global LC/MS profiling. The experiment was repeated twice, but we will
only be looking at the first replicate. There are 4 wild-type and 4
mutant plants in this experiment.
filepath <- file.path(find.package("mtbls2"), "mzML")
files <- list.files(filepath, pattern = "MSpos-Ex1", recursive = TRUE, full.names = TRUE)
classes <- rep(c("wild-type", "mutant"), each = 4)
The first step is to read the raw mass spec data into an R
representation using readMSdata()
:
colData <- DataFrame(sampClasses = classes, ionmode = "pos")
cmsRaw <- readMSdata(files = files, colData = colData, mzsubset = c(500,520), verbose = TRUE)
## [readRaw]: Reading 8 files
## class: CMSraw
## Representing 8 data files
## Number of scans: 3389
## M/Z: 500.000120 - 519.998100
The output of readMSdata()
is an object of class
CMSraw
representating raw (unprocessed) data. We use the
colData
argument to store phenotype information about the
different samples.
The next step is to use bakedpi()
to preprocess these
samples. This function takes a while to run, so we only run it on a
small slice of M/Z values, using the mzsubset
argument.
This is only done for speed.
cmsProc <- bakedpi(cmsRaw, dbandwidth = c(0.01, 10), dgridstep = c(0.01, 1),
outfileDens = NULL, dortalign = TRUE, mzsubset = c(500, 520), verbose = TRUE)
## [bakedpi] Background correction
## [bakedpi] Retention time alignment
## [bakedpi] Density estimation
## class: CMSproc
## Representing 8 data files
## Number of scans: 3389
## M/Z: 500.000120 - 519.998100
The dbandwidth
and dgridstep
arguments
influence the bivariate kernel density estimation which forms the core
of bakedpi. dgridstep
is a vector of length 2 that
specifies the spacing of the grid upon which the density is estimated.
The first component specifies the spacing in the M/Z direction, and the
second component specifies the spacing in the scan (retention time)
direction. To showcase a fast example, we have specified the M/Z and
scan spacings to be 0.01
and 1
respectively,
but we recommend keeping the defaults of 0.005
and
1
because this will more accurately define the M/Z and scan
bounds of the detected peaks. dbandwidth
is a vector of
length 2 that specifies the kernel density bandwidth in the M/Z and scan
directions in the first and second components respectively. Note that
dbandwidth[1]
should be greater than or equal to
dgridstep[1]
and dbandwidth[2]
should be
greater than or equal to dgridstep[2]
. Because a binning
strategy is used to speed up computation of the density estimate, this
is to ensure that data points falling into the same grid location all
have the same distances associated with them.
For experiments with a wide range of M/Z values or several thousands
of scans, the computation of the density estimate can be time-intensive.
For this reason, it can be useful to save the density estimate in an
external file specified by the outfileDens
argument. If
outfileDens
is set to NULL
, then the density
estimate is not saved and must be recomputed if we wish to process the
data again. Specifying the filename of the saved density estimate in
outfileDens
when rerunning bakedpi()
skips the
density estimation phase which can save a lot of time.
The resulting object is an instance of class CMSproc
which contains the bivariate kernel density estimate as well as some
useful preprocessing metadata. In order to obtain peak bounds and
quantifications, the last step is to run slicepi()
, which
computes a global threshold for the density, uses this threshold to call
peak bounds, and quantifies the peaks. If the cutoff
argument is supplied as a number, then slicepi()
will use
this as the density threshold. Otherwise if cutoff
is left
as the default NULL
, a data-driven threshold will be
identified.
## [slicepi] Computing cutoff
## [slicepi] Computing peak bounds
## [slicepi] Quantifying peaks
## An object of class 'CMSslice'
## Representing 8 data files
## Number of scans: 3389
## M/Z: 500.000120 - 519.998100
## Number of peaks: 72
The output of slicepi()
is an instance of class
CMSslice
and contains the peak bounds and quantifications
as well as sample and preprocessing metadata.
We can access the differential analysis report with
diffrep()
. This is a convenience function, similar to
diffreport()
from the xcms
package. In our case it uses limma to do
differential analysis; the output of diffrep()
is basically
topTable()
from limma.
## logFC AveExpr t P.Value adj.P.Val B
## 25 3.474403 15.32845 70.51488 9.110811e-12 3.650248e-10 17.95487
## 54 -2.156827 17.52813 -69.50470 1.013958e-11 3.650248e-10 17.84146
## 21 1.209102 17.87641 44.72779 2.655826e-10 4.527970e-09 14.29968
## 55 -2.004350 15.66532 -43.91037 3.044188e-10 4.527970e-09 14.14912
## 4 -1.137775 16.23114 -43.71853 3.144424e-10 4.527970e-09 14.11336
## 47 1.252761 17.88478 35.34145 1.515174e-09 1.818209e-08 12.36901
Let’s look at the peak bound information for the peaks with the strongest differential signal. We can store the IDs for the top 10 peaks with
## [1] 25 54 21 55 4 47 46 16 59 17
We can access the peak bound information with
peakBounds()
and select the rows corresponding to
topPeaks
.
## DataFrame with 10 rows and 5 columns
## mzmin mzmax scanmin scanmax peaknum
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 501.04 501.07 795 822 25
## 2 501.25 501.29 2095 2138 54
## 3 516.08 516.12 703 748 21
## 4 502.26 502.29 2104 2126 55
## 5 508.19 508.21 207 233 4
## 6 516.05 516.09 1586 1636 47
## 7 500.09 500.12 1583 1628 46
## 8 501.60 501.63 666 710 16
## 9 513.30 513.31 2813 2831 59
## 10 515.61 515.63 668 686 17
CMSproc
objects contain information useful in exploring
your data.
The bivariate kernel density estimate matrix can be accessed with
densityEstimate()
.
We can make a plot of the density estimate in a particular M/Z and
scan region with plotDensityRegion()
.
mzrange <- c(bounds[idx[1], "mzmin"], bounds[idx[1], "mzmax"])
scanrange <- c(bounds[idx[1], "scanmin"], bounds[idx[1], "scanmax"])
plotDensityRegion(cmsProc, mzrange = mzrange + c(-0.5,0.5), scanrange = scanrange + c(-30,30))
Peaks are called by thresholding the density estimate. If we wish to
investigate the impact of varying this cutoff, we can use
densityCutoff
to obtain the original cutoff and
updatePeaks()
to re-call peaks and quantify them. Quantiles
of the non-zero density values are also available via
densityQuantiles()
.
## [slicepi] Computing peak bounds
## [slicepi] Quantifying peaks
## 0.0% 0.1% 0.2% 0.3% 0.4% 0.5%
## 2.269571e-22 8.181801e-20 1.631771e-19 2.494108e-19 3.400638e-19 4.422775e-19
## [slicepi] Computing peak bounds
## [slicepi] Quantifying peaks
## [1] 72
## [1] 72
## [1] 206
R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS
Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC tzcode source: system (glibc)
attached base packages: [1] stats4 stats graphics grDevices utils
datasets methods
[8] base
other attached packages: [1] mtbls2_1.36.0 yamss_1.33.0
[3] SummarizedExperiment_1.37.0 Biobase_2.67.0
[5] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
[7] IRanges_2.41.2 S4Vectors_0.45.2
[9] MatrixGenerics_1.19.0 matrixStats_1.4.1
[11] BiocGenerics_0.53.3 generics_0.1.3
[13] BiocStyle_2.35.0
loaded via a namespace (and not attached): [1] sass_0.4.9 tiff_0.1-12
bitops_1.0-9
[4] SparseArray_1.7.2 jpeg_0.1-10 lattice_0.22-6
[7] digest_0.6.37 evaluate_1.0.1 grid_4.4.2
[10] fftwtools_0.9-11 fastmap_1.2.0 jsonlite_1.8.9
[13] Matrix_1.7-1 ProtGenerics_1.39.1 mzR_2.41.1
[16] limma_3.63.2 BiocManager_1.30.25 httr_1.4.7
[19] UCSC.utils_1.3.0 codetools_0.2-20 jquerylib_0.1.4
[22] abind_1.4-8 cli_3.6.3 rlang_1.1.4
[25] crayon_1.5.3 XVector_0.47.1 EBImage_4.49.0
[28] cachem_1.1.0 DelayedArray_0.33.3 yaml_2.3.10
[31] S4Arrays_1.7.1 tools_4.4.2 ncdf4_1.23
[34] locfit_1.5-9.10 GenomeInfoDbData_1.2.13 png_0.1-8
[37] buildtools_1.0.0 R6_2.5.1 lifecycle_1.0.4
[40] htmlwidgets_1.6.4 bslib_0.8.0 data.table_1.16.4
[43] Rcpp_1.0.13-1 statmod_1.5.0 xfun_0.49
[46] sys_3.4.3 knitr_1.49 htmltools_0.5.8.1
[49] rmarkdown_2.29 maketools_1.3.1 compiler_4.4.2
[52] RCurl_1.98-1.16