---
title: "Mass Spec Query Language Support to the Spectra Package"
package: SpectraQL
output:
BiocStyle::html_document:
toc_float: true
vignette: >
%\VignetteIndexEntry{Mass Spec Query Language Support to the Spectra Package}
%\VignetteEngine{knitr::rmarkdown}
%%\VignetteKeywords{Mass Spectrometry, MS, MSMS, Metabolomics, Infrastructure}
%\VignetteEncoding{UTF-8}
%\VignetteDepends{Spectra,SpectraQL,msdata,BiocStyle}
---
```{r style, echo = FALSE, results = 'asis'}
library(BiocStyle)
BiocStyle::markdown()
```
**Package**: `r BiocStyle::Biocpkg("SpectraQL")`
**Authors**: `r packageDescription("SpectraQL")[["Author"]] `
**Compiled**: `r date()`
# Introduction
The Mass Spec Query Language
([MassQL](https://mwang87.github.io/MassQueryLanguage_Documentation/)) is a
domain specific language meant to be a succinct way to express a query in a mass
spectrometry (MS) centric fashion. It is inspired by SQL, but it attempts to
bake in assumptions of MS to make querying much more natural for MS users.
The *SpectraQL* package provides support for the MassQL language in R,
for MS data represented by `Spectra` objects defined in Bioconductor's
[Spectra](https://bioconductor.org/packages/Spectra) package.
# Installation
The package can be installed with the `BiocManager` package using:
```{r, eval = FALSE}
## Install BiocManager, if not already installed
install.packages("BiocManager")
## Install the package
BiocManager::install("SpectraQL")
```
# Extracting data from `Spectra` objects with MassQL
The *SpectraQL* package adds support for most of the Mass Spec Query Language
functionality to `Spectra` objects hence allowing to filter and extract data
using platform independent and data storage agnostic queries. At present,
*SpectraQL* supports most, but not all of the conditions and expressions of the
MassQL definition. In particular, numeric operations in query fields such as
`MS2PROD=100+14` are not yet supported.
Below we load MS data from an example mzML file. Note also that `Spectra`
objects can represent MS data from a very large variety of sources including but
not limited to mzML or mzXML files, databases
(e.g. [MsBackendMassbank](https://rformassspectrometry.github.io/MsBackendMassbank/))
or MGF files
([MsBackendMgf](https://rformassspectrometry.github.io/MsBackendMgf/)). The use
of the `r Biocpkg("Spectra")` an hence *SpectraQL* package is thus not
restricted to mzML-backed MS data.
```{r, message = FALSE}
library(Spectra)
library(SpectraQL)
library(msdata)
fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML", package = "msdata")
dda <- Spectra(fl)
```
## MassQL definition
A MassQL query consists of several parts: `"QUERY WHERE
AND FILTER AND "` (see its
[definition](https://mwang87.github.io/MassQueryLanguage_Documentation/#definition-of-a-query)
for more details). `""` defines which data should be extracted
from the selected data set, `""` defines which spectra should be
selected and `""` allows to subset each individual spectrum (i.e. to
which mass peaks each spectrum should be subsetted). See the following sections
for more information on each part of such an MassQL query. *SpectraQL* supports
the MassQL query schema is however case insensitive (i.e. both `"query"` and
`"QUERY"` are accepted) and is insensitive to white spaces (i.e. both
`"RTMIN=10"` and `"RTMIN = 10"` are supported). Also, it should be noted that
not all keywords, filters and operations defined by MassQL are yet supported by
*SpectraQL*. The supported types of data, conditions and filters are listed in
the sections below.
### Type of data
The `""` defines *what* should be returned by the `query`
function. In addition, functions can be specified and applied to `""` to summarize the results. The types of data that are currently supported
by *SpectraQL* are:
- `"*"`: returns the full data, i.e. returns a `Spectra` object.
- `"MS1DATA"` (case insensitive): returns (if no function is defined; see
further below) the peaks data for selected MS1 scans (i.e. a `list` of
matrices with m/z and intensity values).
- `"MS2DATA"` (case insensitive): returns (if no function is defined) the peaks
data for selected MS2 scans (i.e. a `list` of matrices with m/z and intensity
values).
In addition, functions can be used to extract specific information from the
selected spectra. *SpectraQL* supports at present the following functions
defined by MassQL:
- `"scaninfo"`: e.g. `"scaninfo(MS1DATA)"` or `"scaninfo(MS2DATA)"` to extract
spectra information for MS1 or MS2 scans, respectively. This returns the
`spectraData` for the sub-setted `Spectra` object.
- `"scansum"`: e.g. `"scansum(MS1DATA)"` or `"scansum(MS2DATA)"` to extract the
TIC (sum of peak intensities) of the selected spectra.
### Condition
The `""` allows to subset a `Spectra` object. Several conditions can
be combined with `"and"`. The syntax for a condition is `" =
"`, e.g. `"MS2PROD = 144.1"`. Such conditions can be further refined by
additional expressions that allow for example to define acceptable tolerances
for m/z differences (see further below for details). `SpectraQL` supports the
following conditions:
- `"RTMIN"`: minimum retention time (in **seconds**).
- `"RTMAX"`: maximum retention time (in **seconds**).
- `"SCANMIN"`: the minimum scan number (acquisition number).
- `"SCANMAX"`: the maximum scan number (acquisition number).
- `"CHARGE"`: the charge for MS2 spectra.
- `"POLARITY"`: the polarity of the spectra (can be `"positive"`, `"negative"`,
`"pos"` or `"neg"`, case insensitive).
- `"MS2PROD"` or `"MS2MZ"`: allows to select MS2 spectra that contain a peak
with a particular m/z.
- `"MS2PREC"`: allows to select MS2 spectra with the defined precursor m/z.
- `"MS1MZ"`: allows to select MS1 spectra containing peaks with the defined m/z.
- `"MS2NL"`: allows to look for a neutral loss from precursor in MS2 spectra.
All conditions involving m/z values allow to specify a mass accuracy using the
optional fields `"TOLERANCEMZ"` and `"TOLERANCEPPM"` that define the absolute
and m/z-relative acceptable difference in m/z values. One or both fields can be
attached to a *condition* such as
`"MS2PREC=100:TOLERANCEMZ=0.1:TOLERANCEPPM=20"` to select for example all MS2
spectra with a precursor m/z equal to 100 accepting a difference of 0.1 and 20
ppm. Note that in contrast to MassQL, the default tolarance and ppm is 0 for all
calls.
### Filter
Filters allow to subset individual spectra keeping e.g. only peaks that match a
user-defined m/z value. *SpectraQL* supports the following filters:
- `"MS1MZ"`: filters MS1 spectra keeping only peaks with matching m/z values
(tolerance can be specified as above with `"TOLERANCEMZ"` and
`"TOLERANCEPPM"`.
- `"MS2MZ"`: filters MS2 spectra keeping only peaks with matching m/z values
(tolerance can be specified as above with `"TOLERANCEMZ"` and
`"TOLERANCEPPM"`.
### Differences of the `SpectraQL` implementation to the `MassQL` definition
- Retention times (for `"RTMIN"`, `"RTMAX"`) are expressed in seconds, not
minutes.
- Default for `"TOLERANCEMZ"` is `0` instead of `0.1`.
## Examples
In this section we use MassQL queries to subset and extract data from our
`Spectra` object `dda`. This object contains MS1 and MS2 spectra from a data
dependent acquisition. Some general information is provided below.
```{r}
#' Number of MS1 and MS2 spectra
table(msLevel(dda))
#' retention time range
range(rtime(dda))
```
### Filtering and subsetting
To restrict the data to spectra measured between 200 and 300 seconds of the
measurement run we can use a MassQL query with the `"rtmin"` and `"rtmax"`
conditions.
```{r}
dda_rt <- query(dda, "QUERY * WHERE RTMIN = 200 AND RTMAX = 300")
dda_rt
```
Internally, *SpectraQL* translates the MassQL query into filter functions that
are applied to the `Spectra` object. The returned result is now a `Spectra`
object with all spectra measured between 200 and 300 seconds.
```{r}
range(rtime(dda_rt))
```
To restrict to MS2 spectra with their precursor m/z matching a certain value, we
can use the `"MS2PREC"` condition. Conditions on m/z values support also
accuracy specifications. Using `"TOLERANCEMZ"` and `"TOLERANCEPPM"` it is thus
possible to define an absolute and m/z relative acceptable difference. Below we
use such a condition to select MS2 spectra with a precursor m/z matching 278.093
with a tolerance of 0 and ppm of 20. Since *SpectraQL* is case insensitive, we
write the full MassQL query in lower case below.
```{r}
dda_pmz <- query(
dda,
"query * where ms2prec = 278.093:toleancemz=0:toleranceppm=30")
length(dda_pmz)
dda_pmz
```
We thus selected 9 spectra from the data set with a precursor m/z matching our
filter criteria.
```{r}
precursorMz(dda_pmz)
```
We can obviously also combine the two queries into a single MassQL query. This
time we also omit the `"tolerancemz"` field as we set that anyway to a value of
0.
```{r}
dda_rt <- query(dda,
paste0("query * where rtmin = 200 and rtmax = 300 and ",
"ms2prec = 278.093:toleranceppm = 30"))
dda_rt
```
For conditions `"MS2PROD"`, `"MS2PREC"` and `"MS1MZ"` it is also possible to
provide multiple values to select spectra that match any of the provided m/z
values. These values can be separated by `"OR"` (or `"or"`), `"TOLERANCEPPM"`
and `"TOLERANCEMZ"` will be applied to each of these. Below we select all
spectra that contain an (MS1) peak with an m/z matching any of the provided
values.
```{r}
query(dda, "QUERY * WHERE MS1MZ = (123 OR 234.1 OR 300):TOLERANCEMZ=0.05")
```
Note that using `"MS1MZ"` will return only MS1 spectra and `"MS2PROD"`,
`"MS2MZ"` and `"MS2PREC"` only MS2 spectra.
### Choosing which data to return
Thus far we uses `"*"` in all queries, which returns the result as a `Spectra`
object. Alternatively, we can use `"MS1DATA"` or `"MS2DATA"` which will return
the peak data for MS1 or MS2 spectra, respectively. The result is thus not a
`Spectra` object, but a `list` of two-column matrices with the m/z and intensity
values of all selected spectra. Below we use this to extract the peaks data for
all MS2 spectra measured between 200 and 300 seconds.
```{r}
pks <- query(dda, "query ms2data where rtmin = 200 and rtmax = 300")
length(pks)
head(pks[[1L]])
```
We can in addition also use functions to extract only specific data from the
result. `"scansum"` would for example return the sum of intensities per spectra
and would thus allow to extract a total ion chromatogram:
```{r}
tic <- query(dda, "query scansum(ms1data)")
```
```{r}
plot(tic, type = "l", xlab = "scan index")
```
Please note that the x-axis does **not** represent the retention times but the
scan indices.
The function `"scaninfo"` extracts all information from the selected
spectra. For *SpectraQL* this means that the result from `spectraData()` are
returned:
```{r}
si <- query(dda, "query scaninfo(*) where rtmin = 300 and rtmax = 500")
si
```
### Filtering peaks within spectra
In contrast to the *condition* statements (`"WHERE"`), `"FILTER"` allows to
filter the peak data within spectra. We can for example filter all spectra
keeping only peaks with a certain m/z. Below we filter the MS1 data keeping only
peaks with an m/z value matching 219.095 with a tolerance of 0.01 and extract
the ion count for that.
```{r}
ic <- query(
dda, "query scansum(ms1data) filter ms1mz = 219.095:tolerancemz=0.1")
plot(ic, type = "l", xlab = "scan index")
```
We can also combine that with `"QUERY"` to restrict to a certain retention time
range to generate an extracted ion chromatogram (XIC).
```{r}
ic <- query(
dda,
paste0("query scansum(ms1data) where rtmin = 235 and rtmax = 250",
" filter ms1mz = 219.095:tolerancemz=0.1"))
plot(ic, type = "l", xlab = "scan index")
```
Internally, *SpectraQL*'s function *translates* the MassQL query into the
following combination of function calls:
```{r}
res <- dda |>
filterMsLevel(1L) |>
filterRt(c(235, 250)) |>
filterMzValues(219.095, tolerance = 0.1) |>
ionCount()
plot(res, type = "l", xlab = "scan index")
```
# Session information {-}
```{r sessioninfo, echo=FALSE}
sessionInfo()
```