| Title: | Utilities for Annotation of Metabolomics Data |
|---|---|
| Description: | High level functions to assist in annotation of (metabolomics) data sets. These include functions to perform simple tentative annotations based on mass matching but also functions to consider m/z and retention times for annotation of LC-MS features given that respective reference values are available. In addition, the package provides high-level functions to simplify matching of LC-MS/MS spectra against spectral libraries and objects and functionality to represent and manage such matched data. |
| Authors: | Michael Witting [aut] (ORCID: <https://orcid.org/0000-0002-1462-4426>), Johannes Rainer [aut, cre] (ORCID: <https://orcid.org/0000-0002-6977-7147>), Andrea Vicini [aut] (ORCID: <https://orcid.org/0000-0001-9438-6909>), Carolin Huber [aut] (ORCID: <https://orcid.org/0000-0002-9355-8948>), Philippine Louail [aut] (ORCID: <https://orcid.org/0009-0007-5429-6846>), Nir Shachaf [ctb] |
| Maintainer: | Johannes Rainer <[email protected]> |
| License: | Artistic-2.0 |
| Version: | 1.17.0 |
| Built: | 2026-05-30 07:33:30 UTC |
| Source: | https://github.com/bioc/MetaboAnnotation |
Matches between query and target generic objects can be represented by
the Matched object. By default, all data accessors work as
left joins between the query and the target object, i.e. values are
returned for each query object, with duplicated entries (values)
if the query object matches more than one target object. See the
Creation and subsetting and Extracting data sections below for
details.
The Matched object can represent matches between one-dimensional
query and target objects (e.g. numeric or list),
two-dimensional objects (data.frame or matrix) or more complex
structures such as SummarizedExperiments or QFeatures. Combinations of
all these data types are also supported. Matches are represented
between elements of one-dimensional objects, or rows of two-dimensional
objects (including SummarizedExperiment or QFeatures). For
QFeatures::QFeatures() objects, matches to only one of the assays
within the object are supported.
    addMatches(object, ...)

    endoapply(X, FUN, ...)

    filterMatches(object, param, ...)

    matchedData(object, ...)

    queryVariables(object, ...)

    targetVariables(object, ...)

    Matched(
      query = list(),
      target = list(),
      matches = data.frame(query_idx = integer(), target_idx = integer(),
                           score = numeric()),
      queryAssay = character(),
      targetAssay = character(),
      metadata = list()
    )

    ## S4 method for signature 'Matched'
    length(x)

    ## S4 method for signature 'Matched'
    show(object)

    ## S4 method for signature 'Matched,ANY,ANY,ANY'
    x[i, j, ..., drop = FALSE]

    matches(object)

    target(object)

    ## S4 method for signature 'Matched'
    query(x, pattern, ...)

    targetIndex(object)

    queryIndex(object)

    whichTarget(object)

    whichQuery(object)

    ## S4 method for signature 'Matched'
    x$name

    ## S4 method for signature 'Matched'
    colnames(x)

    scoreVariables(object)

    ## S4 method for signature 'Matched'
    queryVariables(object)

    ## S4 method for signature 'Matched'
    targetVariables(object)

    ## S4 method for signature 'Matched'
    matchedData(object, columns = colnames(object), ...)

    pruneTarget(object)

    ## S4 method for signature 'Matched,missing'
    filterMatches(
      object,
      queryValue = integer(),
      targetValue = integer(),
      queryColname = character(),
      targetColname = character(),
      index = integer(),
      keep = TRUE,
      ...
    )

    SelectMatchesParam(
      queryValue = numeric(),
      targetValue = numeric(),
      queryColname = character(),
      targetColname = character(),
      index = integer(),
      keep = TRUE
    )

    TopRankedMatchesParam(n = 1L, decreasing = FALSE)

    ScoreThresholdParam(threshold = 0, above = FALSE, column = "score")

    ## S4 method for signature 'Matched,SelectMatchesParam'
    filterMatches(object, param, ...)

    ## S4 method for signature 'Matched,TopRankedMatchesParam'
    filterMatches(object, param, ...)

    ## S4 method for signature 'Matched,ScoreThresholdParam'
    filterMatches(object, param, ...)

    SingleMatchParam(
      duplicates = c("remove", "closest", "top_ranked"),
      column = "score",
      decreasing = TRUE
    )

    ## S4 method for signature 'Matched,SingleMatchParam'
    filterMatches(object, param, ...)

    ## S4 method for signature 'Matched'
    addMatches(
      object,
      queryValue = integer(),
      targetValue = integer(),
      queryColname = character(),
      targetColname = character(),
      score = rep(NA_real_, length(queryValue)),
      isIndex = FALSE
    )

    ## S4 method for signature 'ANY'
    endoapply(X, FUN, ...)

    ## S4 method for signature 'Matched'
    endoapply(X, FUN, ...)

    ## S4 method for signature 'Matched'
    lapply(X, FUN, ...)
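| Argument | Description |
|---|---|
| object | a Matched object. |
| ... | additional parameters. |
| X | a Matched object. |
| FUN | for lapply and endoapply: user defined function applied to the subset of matches of each query element. For endoapply it has to return a Matched object. |
| param | for filterMatches: parameter object selecting the filter to apply (SelectMatchesParam, TopRankedMatchesParam, ScoreThresholdParam or SingleMatchParam). |
| query | object with the query elements. |
| target | object with the elements against which query has been matched. |
| matches | data.frame with columns "query_idx" (integer), "target_idx" (integer) and "score" (numeric) defining the matches between elements of query and target. |
| queryAssay | character with the name of the assay to use if query is a QFeatures object. |
| targetAssay | character with the name of the assay to use if target is a QFeatures object. |
| metadata | list with optional additional metadata. |
| x | a Matched object. |
| i | for [: integer or logical defining the query elements to keep. |
| j | for [: ignored. |
| drop | for [: ignored. |
| pattern | for query: ignored. |
| name | for $: the name of the variable (column) to extract. |
| columns | for matchedData: character with the names of the variables (columns) to return. Defaults to colnames(object). |
| queryValue | for SelectMatchesParam: values to look up in query (or in column queryColname of query). For addMatches: index in query (isIndex = TRUE) or value in column queryColname of query (isIndex = FALSE) identifying the query element of the match to add. Has to have the same length as targetValue. |
| targetValue | for SelectMatchesParam: values to look up in target (or in column targetColname of target). For addMatches: index in target (isIndex = TRUE) or value in column targetColname of target (isIndex = FALSE) identifying the target element of the match to add. Has to have the same length as queryValue. |
| queryColname | for SelectMatchesParam and addMatches: name of the column in query in which queryValue is looked up. |
| targetColname | for SelectMatchesParam and addMatches: name of the column in target in which targetValue is looked up. |
| index | for SelectMatchesParam: indices of the matches (rows in matches(object)) to keep or remove. |
| keep | for SelectMatchesParam: logical; whether the selected matches are kept (keep = TRUE, the default) or removed (keep = FALSE). |
| n | for TopRankedMatchesParam: integer with the number of best ranked matches to keep for each query element. |
| decreasing | for TopRankedMatchesParam and SingleMatchParam: logical; whether matches with the highest (decreasing = TRUE) or lowest (decreasing = FALSE) values are considered best. |
| threshold | for ScoreThresholdParam: numeric with the threshold applied to the score. |
| above | for ScoreThresholdParam: logical; whether matches with a score strictly above (above = TRUE) or strictly below (above = FALSE) threshold are kept. |
| column | for ScoreThresholdParam and SingleMatchParam: name of the column with the values used for filtering or ranking. Defaults to "score". |
| duplicates | for SingleMatchParam: how to handle query elements with more than one match; one of "remove", "closest" or "top_ranked". |
| score | for addMatches: numeric with the score(s) of the added match(es), or data.frame with the score and additional information. Its length (or number of rows) has to match the length of queryValue. |
| isIndex | for addMatches: logical; whether queryValue and targetValue are integer indices (isIndex = TRUE) or values to look up in query and target (isIndex = FALSE, the default). |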
See individual method description below for details.
A Matched object is returned as result by the matchValues() function.
Alternatively, Matched objects can be created with the Matched
function by providing the query and target objects as well as the matches
data.frame with two columns of integer indices (query_idx and target_idx)
defining which elements from query match which elements from target.
addMatches: add new matches to an existing object. Parameters
queryValue and targetValue define which element(s) in
query and target should be considered matching. If isIndex = TRUE,
both queryValue and targetValue are considered to be integer indices
identifying the matching elements in query and target, respectively.
Alternatively (with isIndex = FALSE), queryValue and targetValue can
be values in columns queryColname and targetColname that
identify the matching elements. Note that in this case
only the first matching pair is added. Parameter score
provides the score for the match. It can be a numeric with the score or a
data.frame with additional information on the manually added matches. In
both cases its length (or number of rows) has to match the length of
queryValue. See examples below for more information.
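The index-based variant (isIndex = TRUE) can be sketched as follows. The toy objects and the score value are made up for illustration; the call uses only the parameters listed in the usage section above.

```r
library(MetaboAnnotation)

## Toy query and target objects (illustrative values only).
q1 <- data.frame(col1 = 1:5, col2 = 6:10)
t1 <- data.frame(col1 = 11:16, col2 = 17:22)
mo <- Matched(
    q1, t1,
    matches = data.frame(query_idx = c(1L, 2L, 2L, 2L, 5L),
                         target_idx = c(2L, 2L, 3L, 4L, 5L),
                         score = seq(0.5, 0.9, by = 0.1)))

## Add a match between query row 3 and target row 6. With
## `isIndex = TRUE` both values are interpreted as integer indices, so
## no `queryColname` or `targetColname` is needed.
mo_idx <- addMatches(mo, queryValue = 3L, targetValue = 6L,
                     score = 0.25, isIndex = TRUE)
matches(mo_idx)
```

Addressing elements by index avoids the restriction of the value-based lookup, which adds only the first matching pair.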
endoapply: applies a user defined function FUN to each subset of
matches in a Matched object corresponding to a query element (i.e. for
each x[i] with i being 1 to length(x)). The results are then combined
in a single Matched object representing updated matches. Note that FUN
has to return a Matched object.
lapply: applies a user defined function FUN to each subset of
matches in a Matched object for each query element (i.e. to each x[i]
with i from 1 to length(x)). It returns a list of length(object)
elements where each element is the output of FUN applied to each subset
of matches.
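As a sketch of a per-query operation with endoapply, the function below keeps, for each query element, only matches whose score lies within a tolerance of that query's best (lowest) score. The helper name within_best and the tolerance of 0.15 are assumptions made for this illustration; the function relies only on the exported matches(), filterMatches() and SelectMatchesParam().

```r
library(MetaboAnnotation)

q1 <- data.frame(col1 = 1:5, col2 = 6:10)
t1 <- data.frame(col1 = 11:16, col2 = 17:22)
mo <- Matched(
    q1, t1,
    matches = data.frame(query_idx = c(1L, 2L, 2L, 2L, 5L),
                         target_idx = c(2L, 2L, 3L, 4L, 5L),
                         score = seq(0.5, 0.9, by = 0.1)))

## Hypothetical helper: `z` is the `Matched` subset for one query element.
## Subsets with fewer than two matches are returned unchanged.
within_best <- function(z, tol = 0.15) {
    m <- matches(z)
    if (nrow(m) < 2L)
        return(z)
    idx <- which(m$score <= min(m$score) + tol)
    filterMatches(z, SelectMatchesParam(index = idx))
}

mo_close <- endoapply(mo, within_best)
matches(mo_close)
```

Because the threshold is relative to each query element's own best score, this cannot be expressed with a single global ScoreThresholdParam, which is where endoapply becomes useful.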
[: subset the object selecting query object elements to keep with
parameter i. The resulting object will contain all the matches
for the selected query elements. The target object will by default be
returned as-is.
filterMatches: filter matches in a Matched object using different
approaches depending on the class of param:
ScoreThresholdParam: keeps only the matches whose score is strictly
above or strictly below a certain threshold (with parameter
above = TRUE and above = FALSE, respectively). The name of the column
containing the scores to be used for the filtering can be specified with
parameter column. The default for column is "score", a variable present
in every Matched object. The names of other score variables (if present)
can be provided as well (the names of all score variables can be obtained
with the scoreVariables() function). For example, column = "score_rt" can be
used to filter matches based on retention time scores for Matched
objects returned by matchValues() when param objects involving a
retention time comparison are used.
SelectMatchesParam: keeps or removes (respectively when parameter
keep = TRUE and keep = FALSE) matches corresponding to certain
indices or values of query and target. If queryValue and
targetValue are provided, matches for these value pairs are kept or
removed. Parameter index allows to filter matches by their index in the
matches() data.frame. Note that filterMatches only removes matches from
the matches() data.frame of the Matched object and does not alter the
query or target in the object. See examples below for more
information.
SingleMatchParam: reduces matches to keep (at most) a
single match per query. The deduplication strategy can be defined with
parameter duplicates:
duplicates = "remove": all matches for query elements matching more
than one target element will be removed.
duplicates = "closest": keep only the closest match for each
query element. The closest match is defined by the value(s) of
score (and score_rt, if present). The one match with
the smallest value for this (these) column(s) is retained. This is
equivalent to TopRankedMatchesParam(n = 1L, decreasing = FALSE).
duplicates = "top_ranked": select the best ranking match for each
query element. Parameter column allows to specify the column by
which matches are ranked (use targetVariables(object) or
scoreVariables(object) to list possible columns). Parameter
decreasing allows to define whether the match with the highest
(decreasing = TRUE) or lowest (decreasing = FALSE) value in
column for each query will be selected.
TopRankedMatchesParam: for each query element the matches are ranked
according to their score and only the n best of them are kept (if n
is larger than the number of matches for a given query element all the
matches are returned). For the ranking (ordering) R's rank function is
used on the absolute values of the scores (variable "score"), thus,
smaller score values (representing e.g. smaller differences between
expected and observed m/z values) are considered better. By
setting parameter decreasing = TRUE matches can be ranked in decreasing
order (i.e. higher scores are ranked higher and are thus selected).
If besides variable "score" also variable "score_rt" is available in
the Matched object (which is the case for the Matched object
returned by matchValues() for param objects involving a retention
time comparison), the ordering of the matches is based on the product of
the ranks of the two variables (ranking of retention time differences
is performed on the absolute value of "score_rt"). Thus, matches with
small (or, depending on parameter decreasing, large) values for
"score" and "score_rt" are returned.
pruneTarget: cleans the object by removing non-matched
target elements.
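The rank-product ordering described for TopRankedMatchesParam can be illustrated with a small base R sketch. This is not the package's implementation: the data.frame m is a made-up stand-in for matches(object), the helper top_ranked is hypothetical, and only the default (decreasing = FALSE) behaviour is mimicked.

```r
## Made-up matches with an m/z score and a retention time score.
m <- data.frame(
    query_idx  = c(1L, 1L, 1L, 2L, 2L),
    target_idx = c(1L, 2L, 3L, 1L, 4L),
    score      = c(-0.002, 0.001, 0.004, 0.003, -0.001),
    score_rt   = c(0.5, -4.0, 1.5, 3.0, 2.0))

## Hypothetical helper: keep the `n` best matches per query element,
## ordering matches by the product of the ranks of |score| and |score_rt|.
top_ranked <- function(m, n = 1L) {
    rows_by_query <- split(seq_len(nrow(m)), m$query_idx)
    keep <- lapply(rows_by_query, function(idx) {
        ## Ranks are computed on absolute values: small differences are best.
        rnk <- rank(abs(m$score[idx])) * rank(abs(m$score_rt[idx]))
        idx[order(rnk)][seq_len(min(n, length(idx)))]
    })
    m[sort(unlist(keep, use.names = FALSE)), , drop = FALSE]
}

best <- top_ranked(m, n = 1L)
best
```

For query 1 the match with the smallest absolute score (target 2) is not selected, because its large retention time difference gives it a worse rank product than the match to target 1.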
$ extracts a single variable from the Matched x. The variables that
can be extracted can be listed using colnames(x). These variables can
belong to query, target or be related to the matches (e.g. the
score of each match). If the query (target) object is two dimensional,
its columns can be extracted (prefix "target_" is used for columns in the
target object) otherwise if query (target) has only a single
dimension (e.g. is a list or a character) the whole object can be
extracted with x$query (x$target). More precisely, when
query (target) is a SummarizedExperiment the columns from
rowData(query) (rowData(target)) are extracted; when query (target)
is a QFeatures::QFeatures() the columns from rowData of the assay
specified in the queryAssay (targetAssay) slot are extracted.
The matching scores
are available as variable "score". Similar to a left join between the
query and target elements, this function returns a value for each query
element, with duplicated values for query elements matching more
than one target element. If variables from the target data.frame are
extracted, an NA is reported for the entries corresponding to query
elements that don't match any target element. See examples below for
more details.
length returns the number of query elements.
matchedData extracts multiple variables contained in the
Matched object as a DataFrame. Parameter columns
defines which columns (or variables) should be returned (defaults to
columns = colnames(object)). Each column in the returned
DataFrame is constructed in the same way as in $. That is, like $,
this function performs a left join of variables from the query and
target objects, returning values for all query elements
(with duplicated entries for query elements matching
multiple target elements) and the values for the target elements matched
to the respective query elements (or NA if the query element is not
matched to any target element).
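The left join semantics shared by $ and matchedData can be mimicked in base R. The sketch below is only an illustration of the described behaviour with made-up data; it does not use or reproduce the package's code, and the object names are arbitrary.

```r
## Stand-ins for the query, the target and the matches data.frame.
query <- data.frame(col1 = 1:5)
target <- data.frame(col1 = 11:16)
matches <- data.frame(query_idx = c(1L, 2L, 2L, 2L, 5L),
                      target_idx = c(2L, 2L, 3L, 4L, 5L),
                      score = seq(0.5, 0.9, by = 0.1))

## Left join: keep every query element, also those without a match.
all_query <- data.frame(query_idx = seq_len(nrow(query)))
joined <- merge(all_query, matches, by = "query_idx", all.x = TRUE)
joined <- joined[order(joined$query_idx, joined$target_idx), ]

## Query values are repeated for multiple matches; target values are NA
## for query elements without any match.
joined$col1 <- query$col1[joined$query_idx]
joined$target_col1 <- target$col1[joined$target_idx]
joined[, c("col1", "target_col1", "score")]
```

Query element 2 appears three times (once per matched target), while query elements 3 and 4 appear once each with NA for the target column and the score.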
matches returns a data.frame with the actual matching information with
columns "query_idx" (index of the element in query), "target_idx"
(index of the element in target), "score" (the score of the match) and
any additional columns.
target returns the target object.
targetIndex returns the indices of the matched targets in the order they
are assigned to the query elements. The length of the returned integer
vector is equal to the total number of matches in the object. targetIndex
and queryIndex are aligned, i.e. elements at the same position in both
vectors represent a matched query-target pair.
query returns the query object.
queryIndex returns the indices of the query elements with matches to
target elements. The length of the returned integer vector is equal to
the total number of matches in the object. targetIndex and queryIndex
are aligned, i.e. elements at the same position in both vectors represent
a matched query-target pair.
queryVariables returns the names of the variables (columns) in query.
scoreVariables returns the names of the score variables stored in the
Matched object (precisely the names of the variables in matches(object)
containing the string "score" in their name ignoring the case).
targetVariables returns the names of the variables (columns) in target
(prefixed with "target_").
whichTarget returns an integer with the indices of the elements in
target that match at least one element in query.
whichQuery returns an integer with the indices of the elements in
query that match at least one element in target.
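How the index accessors relate to the matches data.frame can be sketched in base R. The code below derives analogous values from a made-up data.frame; it is an illustration of the descriptions above, not the package's implementation, and the column score_rt is included only to show the name-based detection of score variables.

```r
## Made-up stand-in for matches(object).
m <- data.frame(
    query_idx  = c(1L, 2L, 2L, 2L, 5L),
    target_idx = c(2L, 2L, 3L, 4L, 5L),
    score      = seq(0.5, 0.9, by = 0.1),
    score_rt   = c(1, -2, 0, 3, 1))

## One entry per match; both vectors are aligned pair by pair.
query_index  <- m$query_idx           # analogue of queryIndex()
target_index <- m$target_idx          # analogue of targetIndex()

## One entry per matched element.
which_query  <- unique(m$query_idx)   # analogue of whichQuery()
which_target <- unique(m$target_idx)  # analogue of whichTarget()

## Score variables: column names containing "score", ignoring case.
score_vars <- grep("score", colnames(m), ignore.case = TRUE, value = TRUE)
```

The first two vectors have one element per match (five here), whereas the which-style vectors list each matched query or target element only once.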
Andrea Vicini, Johannes Rainer
MatchedSpectra() for matched Spectra::Spectra() objects.
    ## Creating a `Matched` object.
    q1 <- data.frame(col1 = 1:5, col2 = 6:10)
    t1 <- data.frame(col1 = 11:16, col2 = 17:22)
    ## Define matches between query row 1 with target row 2 and, query row 2
    ## with target rows 2,3,4 and query row 5 with target row 5.
    mo <- Matched(
        q1, t1, matches = data.frame(query_idx = c(1L, 2L, 2L, 2L, 5L),
                                     target_idx = c(2L, 2L, 3L, 4L, 5L),
                                     score = seq(0.5, 0.9, by = 0.1)))
    mo

    ## Which of the query elements (rows) match at least one target
    ## element (row)?
    whichQuery(mo)

    ## Which target elements (rows) match at least one query element (row)?
    whichTarget(mo)

    ## Extracting variable "col1" from query object.
    mo$col1

    ## We have duplicated values for the entries of `col1` related to query
    ## elements (rows) matched to multiple rows of the target object. The
    ## value of `col1` is returned for each element (row) in the query.

    ## Extracting variable "col1" from target object. To access columns from
    ## target we have to prefix the name of the column by `"target_"`.
    ## Note that only values of `col1` for rows matching at least one query
    ## row are returned and an NA is reported for query rows without matching
    ## target rows.
    mo$target_col1

    ## The 3rd and 4th query rows do not match any target row, thus `NA` is
    ## returned.

    ## `matchedData` can be used to extract all (or selected) columns
    ## from the object. Same as with `$`, a left join between the columns
    ## from the query and the target is performed. Below we extract selected
    ## columns from the object as a DataFrame.
    res <- matchedData(mo, columns = c("col1", "col2", "target_col1",
                                       "target_col2"))
    res
    res$col1
    res$target_col1

    ## With the `queryIndex` and `targetIndex` it is possible to extract the
    ## indices of the matched query-target pairs:
    queryIndex(mo)
    targetIndex(mo)

    ## Hence, the first match is between the query with index 1 to the target
    ## with index 2, then, query with index 2 is matched to target with index 2
    ## and so on.

    ## The example matched object contains all query and all target
    ## elements (rows). Below we subset the object keeping only query rows that
    ## are matched to at least one target row.
    mo_sub <- mo[whichQuery(mo)]

    ## mo_sub contains now only 3 query rows:
    nrow(query(mo_sub))

    ## while the original object contains all 5 query rows:
    nrow(query(mo))

    ## Both objects contain however still the full target object:
    nrow(target(mo))
    nrow(target(mo_sub))

    ## With the `pruneTarget` we can however reduce also the target rows to
    ## only those that match at least one query row
    mo_sub <- pruneTarget(mo_sub)
    nrow(target(mo_sub))

    ########
    ## Creating a `Matched` object with a `data.frame` for `query` and a `vector`
    ## for `target`. The matches are specified in the same way as the example
    ## before.
    q1 <- data.frame(col1 = 1:5, col2 = 6:10)
    t2 <- 11:16
    mo <- Matched(q1, t2, matches = data.frame(query_idx = c(1L, 2L, 2L, 2L, 5L),
                                               target_idx = c(2L, 2L, 3L, 4L, 5L),
                                               score = seq(0.5, 0.9, by = 0.1)))

    ## *target* is a simple vector and has thus no columns. The matched values
    ## from target, if it does not have dimensions and hence column names, can
    ## be retrieved with `$target`
    mo$target

    ## Note that in this case "target" is returned by the function `colnames`
    colnames(mo)

    ## As before, we can extract all data as a `DataFrame`
    res <- matchedData(mo)
    res

    ## Note that the columns of the obtained `DataFrame` are the same as the
    ## corresponding vectors obtained with `$`
    res$col1
    res$target

    ## Also subsetting and pruning works in the same way as the example above.
    mo_sub <- mo[whichQuery(mo)]

    ## mo_sub contains now only 3 query rows:
    nrow(query(mo_sub))

    ## while the original object contains all 5 query rows:
    nrow(query(mo))

    ## Both objects contain however still the full target object:
    length(target(mo))
    length(target(mo_sub))

    ## Reducing the target elements to only those that match at least one query
    ## row
    mo_sub <- pruneTarget(mo_sub)
    length(target(mo_sub))

    ########
    ## Filtering `Matched` with `filterMatches`

    ## Inspecting the matches in `mo`:
    mo$col1
    mo$target

    ## We have thus target *12* matched to both query elements with values 1 and
    ## 2, and query element 2 is matching 3 target elements. Let's assume we want
    ## to resolve these multiple mappings and keep from them only the match
    ## between query 1 (column `"col1"` containing value `1`) and the target
    ## with value `12`, and between query 2 (column `"col1"` containing value
    ## `2`) and the target with value `13`. In addition we also want to keep
    ## query element 5 (value `5` in column `"col1"`) with the target with
    ## value `15`:
    mo_sub <- filterMatches(mo, SelectMatchesParam(queryValue = c(1, 2, 5),
                                                   queryColname = "col1",
                                                   targetValue = c(12, 13, 15)))
    matchedData(mo_sub)

    ## Alternatively to specifying the matches to filter with `queryValue` and
    ## `targetValue` it is also possible to specify directly the index of the
    ## match(es) in the `matches` `data.frame`:
    matches(mo)

    ## To keep only matches like in the example above we could use:
    mo_sub <- filterMatches(mo, SelectMatchesParam(index = c(1, 3, 5)))
    matchedData(mo_sub)

    ## Note also that, instead of keeping the specified matches, it would be
    ## possible to remove them by setting `keep = FALSE`. Below we remove
    ## selected matches from the object:
    mo_sub <- filterMatches(mo, SelectMatchesParam(queryValue = c(2, 2),
                                                   queryColname = "col1",
                                                   targetValue = c(12, 14),
                                                   keep = FALSE))
    mo_sub$col1
    mo_sub$target

    ## As alternative to *manually* selecting matches it is also possible to
    ## filter matches keeping only the *best matches* using the
    ## `TopRankedMatchesParam`. This will rank matches for each query based on
    ## their *score* value and select the best *n* matches with lowest score
    ## values (i.e. smallest difference in m/z values).
    mo_sub <- filterMatches(mo, TopRankedMatchesParam(n = 1L))
    matchedData(mo_sub)

    ## Additionally it is possible to select matches based on a threshold
    ## for their *score*. Below we keep matches with score below 0.75 (one
    ## could select matches with *score* greater than the threshold by setting
    ## `ScoreThresholdParam` parameter `above = TRUE`).
    mo_sub <- filterMatches(mo, ScoreThresholdParam(threshold = 0.75))
    matchedData(mo_sub)

    ########
    ## Selecting the best match for each `query` element with `endoapply`

    ## It is also possible to select for each `query` element the match with the
    ## lowest score using `endoapply`. We manually define a function to select
    ## the best match for each query and give it as input to `endoapply`
    ## together with the `Matched` object itself. We obtain the same results as
    ## in the `TopRankedMatchesParam` example above.
    FUN <- function(x) {
        if(nrow(x@matches) > 1)
            x@matches <- x@matches[order(x@matches$score)[1], , drop = FALSE]
        x
    }
    mo_sub <- endoapply(mo, FUN)
    matchedData(mo_sub)

    ########
    ## Adding matches using `addMatches`

    ## `addMatches` allows to manually add matches. Below we add a new match
    ## between the `query` element with a value of `1` in column `"col1"` and
    ## the target element with a value of `15`. Parameter `score` allows to
    ## assign a score value to the match.
    mo_add <- addMatches(mo, queryValue = 1, queryColname = "col1",
                         targetValue = 15, score = 1.40)
    matchedData(mo_add)

    ## Matches are always sorted by `query`, thus, the new match is listed as
    ## second match.

    ## Alternatively, we can also provide a `data.frame` with parameter `score`
    ## which enables us to add additional information to the added match. Below
    ## we define the score and an additional column specifying that this match
    ## was added manually. This information will then also be available in the
    ## `matchedData`.
    mo_add <- addMatches(mo, queryValue = 1, queryColname = "col1",
                         targetValue = 15,
                         score = data.frame(score = 5, manual = TRUE))
    matchedData(mo_add)

    ## The match will get a score of NA if we're not providing any score.
    mo_add <- addMatches(mo, queryValue = 1, queryColname = "col1",
                         targetValue = 15)
    matchedData(mo_add)

    ########
    ## Creating a `Matched` object with a `SummarizedExperiment` for `query` and
    ## a `data.frame` for `target`. The matches are specified in the same way as
    ## the example before.
    library(SummarizedExperiment)
    q1 <- SummarizedExperiment(
        assays = data.frame(matrix(NA, 5, 2)),
        rowData = data.frame(col1 = 1:5, col2 = 6:10),
        colData = data.frame(cD1 = c(NA, NA), cD2 = c(NA, NA)))
    t1 <- data.frame(col1 = 11:16, col2 = 17:22)
    ## Define matches between row 1 in rowData(q1) with target row 2 and,
    ## rowData(q1) row 2 with target rows 2,3,4 and rowData(q1) row 5 with target
    ## row 5.
    mo <- Matched(
        q1, t1, matches = data.frame(query_idx = c(1L, 2L, 2L, 2L, 5L),
                                     target_idx = c(2L, 2L, 3L, 4L, 5L),
                                     score = seq(0.5, 0.9, by = 0.1)))
    mo

    ## Which of the query elements (rows) match at least one target
    ## element (row)?
    whichQuery(mo)

    ## Which target elements (rows) match at least one query element (row)?
    whichTarget(mo)

    ## Extracting variable "col1" from rowData(q1).
    mo$col1

    ## We have duplicated values for the entries of `col1` related to rows of
    ## rowData(q1) matched to multiple rows of the target data.frame t1. The
    ## value of `col1` is returned for each row in the rowData of query.

    ## Extracting variable "col1" from target object. To access columns from
    ## target we have to prefix the name of the column by `"target_"`.
    ## Note that only values of `col1` for rows matching at least one row in
    ## rowData of query are returned and an NA is reported for those without
    ## matching target rows.
    mo$target_col1

    ## The 3rd and 4th query rows do not match any target row, thus `NA` is
    ## returned.

    ## `matchedData` can be used to extract all (or selected) columns
    ## from the object. Same as with `$`, a left join between the columns
    ## from the query and the target is performed. Below we extract selected
    ## columns from the object as a DataFrame.
    res <- matchedData(mo, columns = c("col1", "col2", "target_col1",
                                       "target_col2"))
    res
    res$col1
    res$target_col1

    ## The example `Matched` object contains all rows in the
    ## `rowData` of the `SummarizedExperiment` and all target rows. Below we
    ## subset the object keeping only rows that are matched to at least one
    ## target row.
    mo_sub <- mo[whichQuery(mo)]

    ## mo_sub contains now a `SummarizedExperiment` with only 3 rows:
    nrow(query(mo_sub))

    ## while the original object contains a `SummarizedExperiment` with all 5
    ## rows:
    nrow(query(mo))

    ## Both objects contain however still the full target object:
    nrow(target(mo))
    nrow(target(mo_sub))

    ## With the `pruneTarget` we can however reduce also the target rows to
    ## only those that match at least one row in the `rowData` of query
    mo_sub <- pruneTarget(mo_sub)
    nrow(target(mo_sub))
The ## value of `col1` is returned for each row in the rowData of query. ## Extracting variable "col1" from target object. To access columns from ## target we have to prefix the name of the column by `"target_"`. ## Note that only values of `col1` for rows matching at least one row in ## rowData of query are returned and an NA is reported for those without ## matching target rows. mo$target_col1 ## The 3rd and 4th query rows do not match any target row, thus `NA` is ## returned. ## `matchedData` can be used to extract all (or selected) columns ## from the object. Same as with `$`, a left join between the columns ## from the query and the target is performed. Below we extract selected ## columns from the object as a DataFrame. res <- matchedData(mo, columns = c("col1", "col2", "target_col1", "target_col2")) res res$col1 res$target_col1 ## The example `Matched` object contains all rows in the ## `rowData` of the `SummarizedExperiment` and all target rows. Below we ## subset the object keeping only rows that are matched to at least one ## target row. mo_sub <- mo[whichQuery(mo)] ## mo_sub contains now a `SummarizedExperiment` with only 3 rows: nrow(query(mo_sub)) ## while the original object contains a `SummarizedExperiment` with all 5 ## rows: nrow(query(mo)) ## Both objects contain however still the full target object: nrow(target(mo)) nrow(target(mo_sub)) ## With the `pruneTarget` we can however reduce also the target rows to ## only those that match at least one in the `rowData` of query mo_sub <- pruneTarget(mo_sub) nrow(target(mo_sub))
CompAnnotationSources (i.e. classes extending the base virtual
CompAnnotationSource class) define and provide access to a (potentially
remote) compound annotation resource. This aims to simplify the integration
of external annotation resources by hiding the actual connection
(or data resource download) process from the user. In addition, since the
reference resource is not directly exposed to the user, it allows integration
of annotation resources that do not allow access to the full data.
Objects extending CompAnnotationSource available in this package are:
CompDbSource(): annotation source referencing an annotation source in the
CompoundDb::CompDb() format (from the CompoundDb Bioconductor
package).
Classes extending CompAnnotationSource need to implement the matchSpectra
method with parameters query, target and param where query is
the Spectra object with the (experimental) query spectra, target the
object extending the CompAnnotationSource and param the parameter object
defining the similarity calculation (e.g. CompareSpectraParam()). The method
is expected to return a MatchedSpectra object.
CompAnnotationSource objects are not expected to contain any annotation
data. Access to the annotation data (in form of a Spectra object) is
suggested to be only established within the object's matchSpectra method.
This would also enable parallel processing of annotations, as no resource
handle (e.g. a database connection) would have to be shared across processes.
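The design described above (a source object that stores only a reference and loads its data inside the matching method) can be sketched with plain S4 classes. This is a toy illustration in base R only: the class names `ToyAnnotationSource` and `ToyCsvSource`, the generic `toyMatch` and the CSV layout are all hypothetical and are not part of MetaboAnnotation.

```r
library(methods)

## Hypothetical toy classes, NOT the MetaboAnnotation classes.
setClass("ToyAnnotationSource", representation = "VIRTUAL")
setClass("ToyCsvSource", contains = "ToyAnnotationSource",
         slots = c(path = "character"))

## Hypothetical generic standing in for a matching method.
setGeneric("toyMatch",
           function(query, target, ...) standardGeneric("toyMatch"))

setMethod("toyMatch", signature("numeric", "ToyCsvSource"),
          function(query, target, tolerance = 0.01, ...) {
    ## The reference data is loaded only here, inside the method. The
    ## source object itself carries just the file path.
    ref <- read.csv(target@path)
    hits <- lapply(seq_along(query), function(i) {
        idx <- which(abs(ref$mz - query[i]) <= tolerance)
        data.frame(query_idx = rep(i, length(idx)), target_idx = idx)
    })
    do.call(rbind, hits)
})

## Write a small reference table to a temporary file and match against it.
ref_file <- tempfile(fileext = ".csv")
write.csv(data.frame(name = c("A", "B"), mz = c(100.05, 200.10)),
          ref_file, row.names = FALSE)
src <- new("ToyCsvSource", path = ref_file)
res <- toyMatch(c(100.051, 150, 200.1), src)
res
```

Because `src` holds only a file path, it can be copied to worker processes cheaply; each worker would open the resource itself when the method is called.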

    ## S4 method for signature 'Spectra,CompAnnotationSource,Param'
    matchSpectra(query, target, param, ...)

    ## S4 method for signature 'CompAnnotationSource'
    show(object)

    ## S4 method for signature 'CompAnnotationSource'
    metadata(x, ...)

query |
for |
target |
for |
param |
for |
... |
additional parameters passed to |
object |
A |
x |
A |
For an example implementation see CompDbSource().
matchSpectra: function to match experimental MS2 spectra against the
annotation source. See matchSpectra() for parameters.
metadata: function to provide metadata on the annotation resource (host,
source, version etc).
show (optional): method to provide general information on the data
source.
Johannes Rainer, Nir Shachaf
CompDbSource objects represent references to CompoundDb::CompDb
database-backed annotation resources. Instances are expected to be
created with the dedicated construction functions such as
MassBankSource or the generic CompDbSource. The annotation data is
not stored within the object but will be accessed/loaded within the
object's matchSpectra method.
New CompDbSource objects can be created using the functions:
CompDbSource: create a new CompDbSource object from an existing
CompDb database. The (SQLite) database file (including the full path)
needs to be provided with parameter dbfile.
MassBankSource: retrieves a CompDb database for the specified MassBank
release from Bioconductor's online AnnotationHub (if it exists) and
uses that. Note that AnnotationHub resources are cached locally and thus
only downloaded the first time.
The function has parameters release which allows to define the desired
MassBank release (e.g. release = "2021.03" or release = "2022.06")
and ... which allows to pass optional parameters to the AnnotationHub
constructor function, such as localHub = TRUE to use only the cached
data and avoid updating/retrieving updates from the internet.
Other functions:
metadata: get metadata (information) on the annotation resource.

    CompDbSource(dbfile = character())

    ## S4 method for signature 'CompDbSource'
    metadata(x, ...)

    ## S4 method for signature 'CompDbSource'
    show(object)

    MassBankSource(release = "2021.03", ...)

dbfile |
|
x |
A |
... |
For |
object |
A |
release |
A |
Johannes Rainer

    ## Locate a CompDb SQLite database file. For this example we use the test
    ## database from the `CompoundDb` package.
    fl <- system.file("sql", "CompDb.MassBank.sql", package = "CompoundDb")
    ann_src <- CompDbSource(fl)

    ## The object contains only the reference/link to the annotation resource.
    ann_src

    ## Retrieve a CompDb with MassBank data for a certain MassBank release
    mb_src <- MassBankSource("2021.03")
    mb_src

The createStandardMixes function defines groups (mixes) of compounds
(standards) with dissimilar m/z values. The expected size of the groups can
be defined with parameters max_nstd and min_nstd and the minimum required
difference between m/z values within each group with parameter min_diff.
The group assignment will be reported in an additional column in the result
data frame.
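The `min_diff` constraint can be checked after the fact with a few lines of base R. The sketch below is only an illustration: the helper `min_within_group_diff` is hypothetical (not part of the package), the group assignment is made up by hand rather than produced by `createStandardMixes`, and it assumes that "difference within a group" means the smallest absolute m/z difference between any two different compounds of that group, across all adduct columns.

```r
## Hypothetical helper: smallest m/z difference between any two different
## compounds within each group. `mz` is a numeric matrix (rows = compounds,
## columns = adducts), `group` the group assignment per row.
min_within_group_diff <- function(mz, group) {
    vapply(split(seq_len(nrow(mz)), group), function(rows) {
        if (length(rows) < 2)
            return(Inf)                    # a single compound has no neighbour
        pairs <- combn(rows, 2)            # all pairs of compounds in the group
        min(apply(pairs, 2, function(p)
            min(abs(outer(mz[p[1], ], mz[p[2], ], "-")))))
    }, numeric(1))
}

mz <- matrix(c(135.0288, 157.0107,
               184.0604, 206.0424,
               169.0356, 191.0176,
               178.0532, 200.0352), ncol = 2, byrow = TRUE)
group <- c(1, 1, 2, 2)                     # made-up assignment for illustration

d <- min_within_group_diff(mz, group)
d
all(d >= 2)                                # would every group satisfy min_diff = 2?
```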

    createStandardMixes(
      x,
      max_nstd = 10,
      min_nstd = 5,
      min_diff = 2,
      iterativeRandomization = FALSE
    )

x |
|
max_nstd |
|
min_nstd |
|
min_diff |
|
iterativeRandomization |
|
Users should be aware that, because the function iterates through x, the
compounds at the bottom of the matrix are harder to group, and
there is a possibility that some compounds will not be grouped with others.
We advise specifying iterativeRandomization = TRUE even if it takes more
time.
data.frame created by adding a column group to the input x
matrix, comprising the group number for each compound.
Philippine Louail

    ## Iterative grouping only
    x <- matrix(c(135.0288, 157.0107, 184.0604, 206.0424, 265.1118, 287.0937,
                  169.0356, 191.0176, 468.9809, 490.9628, 178.0532, 200.0352),
                ncol = 2, byrow = TRUE,
                dimnames = list(c("Malic Acid", "Pyridoxic Acid", "Thiamine",
                                  "Uric acid", "dUTP", "N-Formyl-L-methionine"),
                                c("adduct_1", "adduct_2")))
    result <- createStandardMixes(x, max_nstd = 3, min_diff = 2)

    ## Randomize grouping
    set.seed(123)
    x <- matrix(c(349.0544, 371.0363, 325.0431, 347.0251, 581.0416, 603.0235,
                  167.0564, 189.0383, 150.0583, 172.0403, 171.0053, 192.9872,
                  130.0863, 152.0682, 768.1225, 790.1044),
                ncol = 2, byrow = TRUE,
                dimnames = list(c("IMP", "UMP", "UDP-glucuronate",
                                  "1-Methylxanthine", "Methionine",
                                  "Dihydroxyacetone phosphate",
                                  "Pipecolic acid", "CoA"),
                                c("[M+H]+", "[M+Na]+")))
    result <- createStandardMixes(x, max_nstd = 4, min_nstd = 3, min_diff = 2,
                                  iterativeRandomization = TRUE)

Matches between query and target spectra can be represented by the
MatchedSpectra object. Functions like the matchSpectra() function will
return this type of object. By default, all data accessors work as
left joins between the query and the target spectra, i.e. values are
returned for each query spectrum with eventual duplicated entries (values)
if the query spectrum matches more than one target spectrum.
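The left-join behaviour described above can be illustrated with plain data frames. The sketch below uses base R only; the objects `query`, `target` and `matches` and their columns are made up for illustration and are not `Spectra` objects, and the code is not how the package implements its accessors.

```r
## Toy stand-ins for query spectra, target spectra and the matches table.
query <- data.frame(query_idx = 1:4, rtime = c(10, 20, 30, 40))
target <- data.frame(target_idx = 1:3, target_rtime = c(11, 21, 31))
matches <- data.frame(query_idx = c(1L, 1L, 2L),
                      target_idx = c(1L, 2L, 2L),
                      score = c(0.9, 0.8, 0.7))

## Left join: keep every query row, duplicate it per match.
res <- merge(query, matches, by = "query_idx", all.x = TRUE)
res <- res[order(res$query_idx, res$target_idx), ]

## Look up the target value for each match; unmatched queries get NA.
res$target_rtime <- target$target_rtime[match(res$target_idx,
                                              target$target_idx)]
res
```

Query 1 appears twice (it has two matches), while queries 3 and 4 appear once each with `NA` in the target column.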

    MatchedSpectra(
      query = Spectra(),
      target = Spectra(),
      matches = data.frame(query_idx = integer(), target_idx = integer(),
                           score = numeric())
    )

    ## S4 method for signature 'MatchedSpectra'
    spectraVariables(object)

    ## S4 method for signature 'MatchedSpectra'
    queryVariables(object)

    ## S4 method for signature 'MatchedSpectra'
    targetVariables(object)

    ## S4 method for signature 'MatchedSpectra'
    colnames(x)

    ## S4 method for signature 'MatchedSpectra'
    x$name

    ## S4 method for signature 'MatchedSpectra'
    spectraData(object, columns = spectraVariables(object))

    ## S4 method for signature 'MatchedSpectra'
    matchedData(object, columns = spectraVariables(object), ...)

    ## S4 method for signature 'MatchedSpectra'
    addProcessing(object, FUN, ..., spectraVariables = character())

    ## S4 method for signature 'MatchedSpectra'
    plotSpectraMirror(
      x,
      xlab = "m/z",
      ylab = "intensity",
      main = "",
      scalePeaks = FALSE,
      ...
    )

    ## S4 method for signature 'MatchedSpectra,MsBackend'
    setBackend(object, backend, ...)

query |
|
target |
|
matches |
|
object |
|
x |
|
name |
for |
columns |
for |
... |
for |
FUN |
for |
spectraVariables |
for |
xlab |
for |
ylab |
for |
main |
for |
scalePeaks |
for |
backend |
for |
See individual method description above for details.
MatchedSpectra objects are the result objects returned by matchSpectra().
While generally not needed, MatchedSpectra objects can also be created
with the MatchedSpectra function providing the query and target
Spectra objects as well as a data.frame with the matches between
query and target elements. This data frame is expected to have columns
"query_idx", "target_idx" with the integer indices of query and
target objects that are matched and a column "score" with a numeric
score for the match.
MatchedSpectra objects can be subset using:
[ subset the MatchedSpectra selecting query spectra to keep with
parameter i. The target spectra will by default be returned as-is.
pruneTarget cleans the MatchedSpectra object by removing non-matched
target spectra.
In addition, MatchedSpectra can be filtered with any of the filtering
approaches defined for Matched() objects: SelectMatchesParam(),
TopRankedMatchesParam() or ScoreThresholdParam().
$ extracts a single spectra variable from the MatchedSpectra x. Use
spectraVariables to get all available spectra variables. Prefix
"target_" is used for spectra variables from the target Spectra. The
matching scores are available as spectra variable "score".
Similar to a left join between the query and target spectra, this function
returns a value for each query spectrum with eventual duplicated values for
query spectra matching more than one target spectrum. If spectra variables
from the target spectra are extracted, an NA is reported for query
spectra that don't match any target spectra. See examples below for more
details.
length returns the number of query spectra.
matchedData same as spectraData below.
query returns the query Spectra.
queryVariables returns the spectraVariables of query.
spectraData returns spectra variables from the query and/or target
Spectra as a DataFrame. Parameter columns allows to define which
variables should be returned (defaults to
columns = spectraVariables(object)), spectra variable names of the target
spectra need to be prefixed with target_ (e.g. target_msLevel to get
the MS level from target spectra). The score from the matching function is
returned as spectra variable "score". Similar to $, this function
performs a left join of spectra variables from the query and target
spectra returning all values for all query spectra (eventually returning
duplicated elements for query spectra matching multiple target spectra)
and the values for the target spectra matched to the respective query
spectra. See help on $ above or examples below for details.
spectraVariables returns all available spectra variables in the query
and target spectra. The prefix "target_" is used to label spectra
variables of target spectra (e.g. the name of the spectra variable for the
MS level of target spectra is called "target_msLevel").
target returns the target Spectra.
targetVariables returns the spectraVariables of target (prefixed
with "target_").
whichTarget returns an integer with the indices of the spectra in
target that match at least one spectrum in query.
whichQuery returns an integer with the indices of the spectra in
query that match at least one spectrum in target.
addProcessing: add a processing step to both the query and target
Spectra in object. Additional parameters for FUN can be passed via
.... See addProcessing documentation in Spectra::Spectra() for more
information.
plotSpectraMirror: creates a mirror plot between the query and each
matching target spectrum. Can only be applied to a MatchedSpectra with a
single query spectrum. Setting parameter scalePeaks = TRUE will scale
the peak intensities per spectrum to a total sum of one for a better
graphical visualization. Additional plotting parameters can be passed
through .... The parameters ppm and tolerance can be used to
define the m/z tolerance for matching peaks between the query and target
spectra. If not provided by the user, the values from the param
object used to create the MatchedSpectra object are used; if these are
missing, the default values (ppm = 20 and tolerance = 0) are used.
setBackend: allows to change the backend of both the query and target
Spectra::Spectra() object. The function will return a MatchedSpectra
object with the query and target Spectra changed to the specified
backend, which can be any backend extending Spectra::MsBackend.
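The role of the `ppm` and `tolerance` parameters mentioned for `plotSpectraMirror` can be sketched in a few lines of base R. This is a minimal illustration, assuming that the accepted m/z difference is the sum of the absolute `tolerance` and the relative `ppm` term; the helper `peaks_match` is hypothetical and not a function of this package.

```r
## Hypothetical helper: are two m/z values considered matching?
## Assumption: allowed difference = tolerance + ppm * m/z / 1e6.
peaks_match <- function(query_mz, target_mz, ppm = 20, tolerance = 0) {
    abs(query_mz - target_mz) <= tolerance + ppm * target_mz / 1e6
}

## At m/z 100, 20 ppm corresponds to an allowed difference of 0.002.
a <- peaks_match(100.001, 100)                     # difference 0.001
b <- peaks_match(100.003, 100)                     # difference 0.003
c <- peaks_match(100.003, 100, tolerance = 0.005)  # allowed 0.007
c(a, b, c)
```

The relative term grows with m/z, so the same `ppm` value accepts larger absolute differences for heavier ions, while `tolerance` adds a constant margin.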
Johannes Rainer
Matched() for additional functions available for MatchedSpectra.

    ## Creating a dummy MatchedSpectra object.
    library(Spectra)
    df1 <- DataFrame(
        msLevel = 2L, rtime = 1:10,
        spectrum_id = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"))
    df2 <- DataFrame(
        msLevel = 2L, rtime = rep(1:10, 20),
        spectrum_id = rep(c("A", "B", "C", "D", "E"), 20))
    sp1 <- Spectra(df1)
    sp2 <- Spectra(df2)

    ## Define matches between query spectrum 1 with target spectra 2 and 5,
    ## query spectrum 2 with target spectrum 2 and query spectrum 4 with target
    ## spectra 8, 12 and 15.
    ms <- MatchedSpectra(
        sp1, sp2, matches = data.frame(query_idx = c(1L, 1L, 2L, 4L, 4L, 4L),
                                       target_idx = c(2L, 5L, 2L, 8L, 12L, 15L),
                                       score = 1:6))

    ## Which of the query spectra match at least one target spectrum?
    whichQuery(ms)

    ## Extracting spectra variables: accessor methods for spectra variables act
    ## as "left joins", i.e. they return a value for each query spectrum, with
    ## eventually duplicated elements if one query spectrum matches more than
    ## one target spectrum.

    ## Which target spectrum matches at least one query spectrum?
    whichTarget(ms)

    ## Extracting the retention times of the query spectra.
    ms$rtime

    ## We have duplicated retention times for query spectrum 1 (matches 2 target
    ## spectra) and 4 (matches 3 target spectra). The retention time is returned
    ## for each query spectrum.

    ## Extracting retention times of the target spectra. Note that only retention
    ## times for target spectra matching at least one query spectrum are returned
    ## and an NA is reported for query spectra without matching target spectrum.
    ms$target_rtime

    ## The first query spectrum matches target spectra 2 and 5, thus their
    ## retention times are returned as well as the retention time of the second
    ## target spectrum that matches also query spectrum 2. The 3rd query spectrum
    ## does not match any target spectrum, thus `NA` is returned. Query spectrum 4
    ## matches target spectra 8, 12, and 15, thus the next reported retention
    ## times are those from these 3 target spectra. None of the remaining 6 query
    ## spectra matches any target spectra and thus `NA` is reported for each of
    ## them.

    ## With `queryIndex` and `targetIndex` it is possible to extract the indices
    ## of the matched query-target pairs
    queryIndex(ms)
    targetIndex(ms)

    ## The first match is between query index 1 and target index 2, the second
    ## match between query index 1 and target index 5 and so on.
    ## We could use these indices to extract a `Spectra` object containing only
    ## matched target spectra and assign a spectra variable with the indices of
    ## the query spectra
    matched_target <- target(ms)[targetIndex(ms)]
    matched_target$query_index <- queryIndex(ms)

    ## This `Spectra` object thus contains information from the matching, but
    ## is a *conventional* `Spectra` object that could be used for further
    ## analyses.

    ## `spectraData` can be used to extract all (or selected) spectra variables
    ## from the object. Same as with `$`, a left join between the spectra
    ## variables from the query spectra and the target spectra is performed. The
    ## prefix `"target_"` is used to label the spectra variables from the target
    ## spectra. Below we extract selected spectra variables from the object.
    res <- spectraData(ms, columns = c("rtime", "spectrum_id",
        "target_rtime", "target_spectrum_id"))
    res
    res$spectrum_id
    res$target_spectrum_id

    ## Again, all values for query spectra are returned and for query spectra not
    ## matching any target spectrum NA is reported as value for the respective
    ## variable.

    ## The example matched spectra object contains all query and all target
    ## spectra. Below we subset the object keeping only query spectra that are
    ## matched to at least one target spectrum.
    ms_sub <- ms[whichQuery(ms)]

    ## ms_sub contains now only 3 query spectra:
    length(query(ms_sub))

    ## while the original object contains all 10 query spectra:
    length(query(ms))

    ## Both objects contain however still the full target `Spectra`:
    length(target(ms))
    length(target(ms_sub))

    ## With the `pruneTarget` we can however reduce also the target spectra to
    ## only those that match at least one query spectrum
    ms_sub <- pruneTarget(ms_sub)
    length(target(ms_sub))

The matchFormula method matches chemical formulas from different inputs
(parameter query and target). Before comparison all formulas are
normalized using MetaboCoreUtils::standardizeFormula(). Inputs can be
either a character or data.frame containing a column with formulas.
In case of data.frames parameter formulaColname needs to be used to
specify the name of the column containing the chemical formulas.
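Why normalization matters can be seen with a small base R sketch: "H12C6O6" and "C6H12O6" are different strings but describe the same element counts. The helper `formula_key` below is hypothetical and deliberately simple (no brackets, isotopes or charges); it is not the algorithm used by `MetaboCoreUtils::standardizeFormula()` and does not produce the same canonical strings.

```r
## Hypothetical helper: reduce a simple formula to a canonical key by
## counting atoms per element and sorting elements alphabetically.
formula_key <- function(f) {
    tokens <- regmatches(f, gregexpr("[A-Z][a-z]?[0-9]*", f))[[1]]
    element <- sub("[0-9]*$", "", tokens)   # e.g. "H12" -> "H"
    n <- sub("^[A-Za-z]+", "", tokens)      # e.g. "H12" -> "12"
    n[n == ""] <- "1"                       # no number means one atom
    counts <- tapply(as.integer(n), element, sum)
    paste0(names(counts), counts, collapse = "")
}

query <- c("H12C6O6", "C11H12O2", "HN3")
target <- c("HCl", "C2H4O", "C6H12O6")

query_key <- vapply(query, formula_key, character(1), USE.NAMES = FALSE)
target_key <- vapply(target, formula_key, character(1), USE.NAMES = FALSE)

## Index of the matching target formula for each query formula (NA = none).
idx <- match(query_key, target_key)
idx
```

Only the first query formula finds a partner, the third target formula, once both are reduced to the same key.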

    matchFormula(query, target, ...)

    ## S4 method for signature 'character,character'
    matchFormula(query, target, BPPARAM = SerialParam())

    ## S4 method for signature 'data.frameOrSimilar,data.frameOrSimilar'
    matchFormula(
      query,
      target,
      formulaColname = c("formula", "formula"),
      BPPARAM = SerialParam()
    )

    ## S4 method for signature 'character,data.frameOrSimilar'
    matchFormula(
      query,
      target,
      formulaColname = "formula",
      BPPARAM = SerialParam()
    )

    ## S4 method for signature 'data.frameOrSimilar,character'
    matchFormula(
      query,
      target,
      formulaColname = "formula",
      BPPARAM = SerialParam()
    )

query |
|
target |
|
... |
currently ignored |
BPPARAM |
parallel processing setup. See |
formulaColname |
|
Matched object representing the result.
Michael Witting

    ## input formula
    query <- c("H12C6O6", "C11H12O2", "HN3")
    target <- c("HCl", "C2H4O", "C6H12O6")

    query_df <- data.frame(
        formula = c("H12C6O6", "C11H12O2", "HN3"),
        name = c("A", "B", "C")
    )
    target_df <- data.frame(
        formula = c("HCl", "C2H4O", "C6H12O6"),
        name = c("D", "E", "F")
    )

    ## character vs character
    matches <- matchFormula(query, target)
    matchedData(matches)

    ## data.frame vs data.frame
    matches <- matchFormula(query_df, target_df)
    matchedData(matches)

    ## data.frame vs character
    matches <- matchFormula(query_df, target)
    matchedData(matches)

    ## character vs data.frame
    matches <- matchFormula(query, target_df)
    matchedData(matches)

The matchSpectra method matches (compares) spectra from query with those
from target based on settings specified with param and returns the result
from this as a MatchedSpectra object.

    matchSpectra(query, target, param, ...)

| Argument | Description |
|---|---|
| query | Spectra::Spectra object with the (experimental) spectra. |
| target | MS data to compare against. Can be another Spectra::Spectra. |
| param | parameter object containing the settings for the matching (e.g. eventual prefiltering settings, cut-off value for similarity above which spectra are considered matching etc). |
| ... | optional parameters. |
a MatchedSpectra object with the spectra matching results.
Johannes Rainer
CompareSpectraParam() for the comparison between
Spectra::Spectra objects.
matchSpectra compares experimental (query) MS2 spectra against
reference (target) MS2 spectra and reports matches with a similarity
passing a specified threshold. The function calculates the similarity
between each query spectrum and each target spectrum.
Parameters query and target define the query and target
spectra, respectively, while parameter param defines and configures
the similarity calculation and matching condition. Parameter query takes
a Spectra::Spectra object while target can be either a
Spectra::Spectra object, a CompoundDb::CompDb (reference library)
object defined in the CompoundDb package or
a CompAnnotationSource (e.g. a CompDbSource())
with the reference or connection information to a supported annotation
resource.
Some notes on performance and information on parallel processing are provided in the vignette.
Currently supported parameter objects defining the matching are:
CompareSpectraParam: the generic parameter object allowing to set all
settings for the Spectra::compareSpectra() call that is used to
perform the similarity calculation.
This includes MAPFUN and FUN defining the peak-mapping and similarity
calculation functions and ppm and tolerance to define an acceptable
difference between m/z values of the compared peaks. Parameter
matchedPeaksCount is also passed to compareSpectra() and, if set to
TRUE (default is FALSE) will report the number of peaks defined to be
matching by the MAPFUN.
Additional parameters to the compareSpectra call can be passed along
with .... See the help of Spectra::Spectra() for more information on
these parameters. Importantly, if msentropy or a GNPS-like similarity
calculation is used, MAPFUN should be selected accordingly (see section
Using alternative spectra similarity functions in the package vignette
for more information).
By default, parameters ppm and tolerance are passed to the similarity
calculation function, but if this function uses different parameters
(e.g., msentropy_similarity() uses ms2_tolerance_in_ppm instead of
ppm), these should be submitted to the CompareSpectraParam() function
through the ... parameter.
Parameters requirePrecursor (default TRUE) and requirePrecursorPeak
(default FALSE) allow to pre-filter the target spectra prior to the
actual similarity calculation for each individual query spectrum.
Parameters ppm and tolerance are also used to define the maximal
acceptable difference in precursor m/z if requirePrecursor or
requirePrecursorPeak are set to TRUE.
Target spectra can also be pre-filtered based on
retention time if parameter toleranceRt is set to a value different than
the default toleranceRt = Inf. Only target spectra with a retention time
within the query's retention time +/- (toleranceRt + percentRt% of the
query's retention time) are considered. Note that while for ppm and
tolerance only a single value is accepted, toleranceRt and percentRt
can also be of length equal to the number of query spectra, hence allowing
different rt boundaries to be defined for each query spectrum.
While these pre-filters can considerably improve performance, it should be
noted that no matches will be found between query and target spectra with
missing values in the considered variable (precursor m/z or retention
time). For target spectra without retention times (such as for Spectra
from a public reference database such as MassBank) the default
toleranceRt = Inf should thus be used.
Finally, parameter THRESHFUN allows to define a function to be applied to
the similarity scores to define which matches to report. See below for more
details.
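The retention time pre-filter and the role of THRESHFUN described above can be illustrated with plain R, independent of the package. The sketch below is not the package's implementation: the helper `rt_window()` and all numeric values are hypothetical and only restate the rule "query retention time +/- (toleranceRt + percentRt% of the query's retention time)" together with two threshold functions of the kind used on this help page.

```r
## Hypothetical helper (not part of MetaboAnnotation): lower and upper
## retention time bounds for each query spectrum.
rt_window <- function(rt, toleranceRt = Inf, percentRt = 0) {
    delta <- toleranceRt + rt * percentRt / 100
    cbind(lower = rt - delta, upper = rt + delta)
}

query_rt <- c(100, 361)
win <- rt_window(query_rt, toleranceRt = 2, percentRt = 1)
win

## Candidate target retention times for the second query spectrum. A
## missing retention time never falls into a finite window.
target_rt <- c(98.5, 365, NA)
in_window <- !is.na(target_rt) &
    target_rt >= win[2, "lower"] & target_rt <= win[2, "upper"]
in_window

## THRESHFUN is applied to the similarity scores of one query spectrum
## against its candidate target spectra and selects the matches to keep.
scores <- c(0.31, 0.82, 0.70, 0.95)
keep_default <- (function(x) which(x >= 0.7))(scores)
keep_best <- (function(x) which.max(x))(scores)
keep_default
keep_best
```

With toleranceRt = Inf the window is unbounded, which is why the text recommends the default for reference libraries without retention times.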
MatchForwardReverseParam: performs spectra matching as with
CompareSpectraParam but reports, similar to MS-DIAL, also the reverse
similarity score and the presence ratio. Please refer to the
documentation of CompareSpectraParam for explanation of the parameters.
With MatchForwardReverseParam, the matching of query
spectra to target spectra is performed by considering all peaks from the
query and all peaks from the target (reference) spectrum (i.e. forward
matching using an outer join-based peak matching strategy). For matching
spectra also the reverse similarity is calculated considering only peaks
present in the target (reference) spectrum (i.e. using a right join-based
peak matching). This is reported as spectra variable "reverse_score".
In addition, the ratio between the number of matched peaks and the total
number of peaks in the target (reference) spectra is reported as the
presence ratio (spectra variable "presence_ratio") and the total
number of matched peaks as "matched_peaks_count". See examples below
for details. Parameter THRESHFUN_REVERSE allows to define an additional
threshold function to filter matches. If THRESHFUN_REVERSE is defined
only matches with a spectra similarity fulfilling both THRESHFUN and
THRESHFUN_REVERSE are returned. With the default
THRESHFUN_REVERSE = NULL all matches passing THRESHFUN are reported.
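The presence ratio and the combined effect of the two threshold functions can be sketched in base R. This is an illustration under simplifying assumptions, not the package's code: peaks are matched with a fixed absolute tolerance instead of the configured MAPFUN, the spectra and scores are made up, and the example THRESHFUN_REVERSE cut-off is hypothetical.

```r
## Made-up peak m/z values of one query and one target (reference) spectrum.
query_mz  <- c(55.02, 83.05, 111.04, 129.05)
target_mz <- c(83.05, 111.04, 147.06)

## Simplified peak matching: a target peak counts as matched if any query
## peak lies within `tolerance`.
tolerance <- 0.01
matched <- vapply(target_mz,
                  function(z) any(abs(query_mz - z) <= tolerance),
                  logical(1))

## Number of matched peaks and its ratio to all peaks of the target.
matched_peaks_count <- sum(matched)
presence_ratio <- matched_peaks_count / length(target_mz)

## Hypothetical forward and reverse scores for three candidate matches.
## Only candidates passing both threshold functions are kept.
score <- c(0.92, 0.75, 0.40)
reverse_score <- c(0.95, 0.55, 0.90)
THRESHFUN <- function(x) which(x >= 0.7)
THRESHFUN_REVERSE <- function(x) which(x > 0.6)
keep <- intersect(THRESHFUN(score), THRESHFUN_REVERSE(reverse_score))
keep
```

In this toy case the second candidate passes the forward threshold but fails the reverse one, so only the first candidate would be reported.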
## S4 method for signature 'Spectra,CompDbSource,Param'
matchSpectra(
  query,
  target,
  param,
  BPPARAM = BiocParallel::SerialParam(),
  addOriginalQueryIndex = TRUE
)

CompareSpectraParam(
  MAPFUN = joinPeaks,
  tolerance = 0,
  ppm = 5,
  FUN = MsCoreUtils::ndotproduct,
  requirePrecursor = TRUE,
  requirePrecursorPeak = FALSE,
  THRESHFUN = function(x) which(x >= 0.7),
  toleranceRt = Inf,
  percentRt = 0,
  matchedPeaksCount = FALSE,
  ...
)

MatchForwardReverseParam(
  MAPFUN = joinPeaks,
  tolerance = 0,
  ppm = 5,
  FUN = MsCoreUtils::ndotproduct,
  requirePrecursor = TRUE,
  requirePrecursorPeak = FALSE,
  THRESHFUN = function(x) which(x >= 0.7),
  THRESHFUN_REVERSE = NULL,
  toleranceRt = Inf,
  percentRt = 0,
  ...
)

## S4 method for signature 'Spectra,Spectra,CompareSpectraParam'
matchSpectra(
  query,
  target,
  param,
  rtColname = c("rtime", "rtime"),
  BPPARAM = BiocParallel::SerialParam(),
  addOriginalQueryIndex = TRUE
)

## S4 method for signature 'Spectra,CompDb,Param'
matchSpectra(
  query,
  target,
  param,
  rtColname = c("rtime", "rtime"),
  BPPARAM = BiocParallel::SerialParam(),
  addOriginalQueryIndex = TRUE
)

## S4 method for signature 'Spectra,Spectra,MatchForwardReverseParam'
matchSpectra(
  query,
  target,
  param,
  rtColname = c("rtime", "rtime"),
  BPPARAM = BiocParallel::SerialParam(),
  addOriginalQueryIndex = TRUE
)
query |
for |
target |
for |
param |
for |
BPPARAM |
for |
addOriginalQueryIndex |
for |
MAPFUN |
|
tolerance |
|
ppm |
|
FUN |
|
requirePrecursor |
|
requirePrecursorPeak |
|
THRESHFUN |
|
toleranceRt |
|
percentRt |
|
matchedPeaksCount |
|
... |
for |
THRESHFUN_REVERSE |
for |
rtColname |
|
matchSpectra returns a MatchedSpectra() object with the matching
results. If target is a CompAnnotationSource only matching target
spectra will be reported.
Constructor functions return an instance of the class.
Johannes Rainer, Michael Witting
library(Spectra)
library(MsDataHub)

## Load a test file from *MsDataHub*
fl <- MsDataHub::PestMix1_DDA.mzML()
pest_ms2 <- filterMsLevel(Spectra(fl), 2L)

## subset to selected spectra.
pest_ms2 <- pest_ms2[c(808, 809, 945:955)]

## Load a small example MassBank data set
load(system.file("extdata", "minimb.RData", package = "MetaboAnnotation"))

## Match spectra with the default similarity score (normalized dot product)
csp <- CompareSpectraParam(requirePrecursor = TRUE, ppm = 10)
mtches <- matchSpectra(pest_ms2, minimb, csp)

mtches

## Are there any matching spectra for the first query spectrum?
mtches[1]
## No

## And for the second query spectrum?
mtches[2]
## The second query spectrum matches 4 target spectra. The scores for these
## matches are:
mtches[2]$score

## To access the score for the full data set
mtches$score

## Below we use a THRESHFUN that returns for each query spectrum the (first)
## best matching target spectrum.
csp <- CompareSpectraParam(requirePrecursor = FALSE, ppm = 10,
    THRESHFUN = function(x) which.max(x))
mtches <- matchSpectra(pest_ms2, minimb, csp)
mtches

## Each of the query spectra is matched to one target spectrum
length(mtches)
matches(mtches)

## Match spectra considering also measured retention times. This requires
## that both query and target spectra have non-missing retention times.
rtime(pest_ms2)
rtime(minimb)

## Target spectra don't have retention times. Below we artificially set
## retention times to show how an additional retention time filter would
## work.
rtime(minimb) <- rep(361, length(minimb))

## Matching spectra requiring a matching precursor m/z and the difference
## of retention times between query and target spectra to be <= 2 seconds.
csp <- CompareSpectraParam(requirePrecursor = TRUE, ppm = 10,
    toleranceRt = 2)
mtches <- matchSpectra(pest_ms2, minimb, csp)
mtches
matches(mtches)

## Note that parameter `rtColname` can be used to define different spectra
## variables with retention time information (such as retention indices etc).

## A `CompDb` compound annotation database could also be used with
## parameter `target`. Below we load the test `CompDb` database from the
## `CompoundDb` Bioconductor package.
library(CompoundDb)
fl <- system.file("sql", "CompDb.MassBank.sql", package = "CompoundDb")
cdb <- CompDb(fl)
res <- matchSpectra(pest_ms2, cdb, CompareSpectraParam())

## We do however not find any matches since the used compound annotation
## database contains only a very small subset of the MassBank.
res

## As `target` we have now however the MS2 spectra data from the compound
## annotation database
target(res)

## See the package vignette for details, descriptions and more examples,
## also on how to retrieve e.g. MassBank reference databases from
## Bioconductor's AnnotationHub.
The validateMatchedSpectra() function opens a simple shiny application
that allows browsing the results stored in a MatchedSpectra object and
validating the presented matches. For each query spectrum a table with
matched target spectra is shown (if available) and an interactive mirror
plot is generated. Valid matches can be selected using a check box which is
displayed below the mirror plot. Upon pushing the "Save & Close"
button the app is closed and a filtered MatchedSpectra is returned,
containing only validated matches.
Note that columns "query_index_" and "target_index_" are temporarily
added to the query and target Spectra object to display them in the
interactive graphics for easier identification of the compared spectra.
validateMatchedSpectra(object)
object |
A non-empty instance of class |
A MatchedSpectra with validated results.
Carolin Huber, Michael Witting, Johannes Rainer
library(Spectra)

## Load test data from *MsDataHub*
fl <- MsDataHub::PestMix1_DDA.mzML()
pest_ms2 <- filterMsLevel(Spectra(fl), 2L)
pest_ms2 <- pest_ms2[c(808, 809, 945:955)]
load(system.file("extdata", "minimb.RData", package = "MetaboAnnotation"))

## Normalize intensities and match spectra
csp <- CompareSpectraParam(requirePrecursor = TRUE,
                           THRESHFUN = function(x) x >= 0.7)
norm_int <- function(x) {
    x[, "intensity"] <- x[, "intensity"] / max(x[, "intensity"]) * 100
    x
}
ms <- matchSpectra(addProcessing(pest_ms2, norm_int),
                   addProcessing(minimb, norm_int), csp)

## validate matches using the shiny app. Note: the call is only executed
## in interactive mode.
if (interactive()) {
    res <- validateMatchedSpectra(ms)
}
The matchValues method matches elements from query with those in target
using different matching approaches depending on parameter param.
Generally, query is expected to contain MS experimental values
(m/z and possibly retention time) while target contains reference values.
query and target can be numeric, a two dimensional array (such as a
data.frame, matrix or DataFrame), a SummarizedExperiment
or a QFeatures; target can in addition be a Spectra::Spectra() object.
For SummarizedExperiment, the information for the matching is expected
to be in the object's rowData. For QFeatures matching is performed
for values present in the rowData of one of the object's assays (which
needs to be specified with the queryAssay parameter; if a QFeatures
is used as target the name of the assay needs to be specified with
parameter targetAssay). If target is a Spectra matching is performed
against spectra variables of this object and the respective variable names
need to be specified e.g. with mzColname and/or rtColname.
matchMz is an alias for matchValues to allow backward compatibility.
Available param objects and corresponding matching approaches are:
ValueParam: generic matching between values in query and target given
acceptable differences expressed in ppm and tolerance. If query or
target are not numeric, parameter valueColname has to be used to
specify the name of the column that contains the values to be matched.
The function returns a Matched() object.
MzParam: match query m/z values against reference compounds for which
also m/z are known. Matching is performed similarly to the ValueParam
above. If query or target are not numeric, the column name containing
the values to be compared must be defined with matchValues' parameter
mzColname, which defaults to "mz". MzParam parameters tolerance
and ppm allow to define the maximal acceptable (constant or m/z relative)
difference between query and target m/z values.
MzRtParam: match m/z and retention time values between query and
target. Parameters mzColname and rtColname of the matchValues
function allow to define the columns in query and target containing
these values (defaulting to c("mz", "mz") and c("rt", "rt"),
respectively). MzRtParam parameters tolerance and
ppm have the same meaning as in MzParam; MzRtParam parameter
toleranceRt allows to specify the maximal acceptable difference between
query and target retention time values.
Mass2MzParam: match m/z values against reference compounds for
which only the (exact) mass is known. Before matching, m/z values are
calculated from the compounds' masses in the target table using the
adducts specified via the adducts parameter of Mass2MzParam (defaults to
adducts = "[M+H]+"). After conversion of adduct masses to m/z values,
matching is performed similarly to MzParam (i.e. the same parameters
ppm and tolerance can be used). If query is not numeric,
parameter mzColname of matchValues can be used to specify the column
containing the query's m/z values (defaults to "mz"). If target
is not numeric, parameter massColname can be used to define the
column containing the reference compound's masses (defaults to
"exactmass").
Mass2MzRtParam: match m/z and retention time values against
reference compounds for which the (exact) mass and retention time are
known. Before matching, exact masses in target are converted to m/z
values as for Mass2MzParam. Matching is then performed similarly to
MzRtParam, i.e. m/z and retention times of entities are compared. With
matchValues' parameters mzColname, rtColname and massColname the
columns containing m/z values (in query), retention time values (in
query and target) and exact masses (in target) can be specified.
Mz2MassParam: input values for query and target are expected to be
m/z values but matching is performed on exact masses calculated from these
(based on the provided adduct definitions). In detail, m/z values in
query are first converted to masses with the MetaboCoreUtils::mz2mass()
function based on the adducts defined with queryAdducts (defaults to
"[M+H]+"). The same is done for m/z values in target (adducts can
be defined with targetAdducts which defaults to "[M-H]-"). Matching is
then performed on these converted values similarly to ValueParam. If
query or target are not numeric, the column containing the m/z values can
be specified with matchValues' parameter mzColname (defaults to "mz").
Mz2MassRtParam: same as Mz2MassParam but with additional comparison of
retention times between query and target. Parameters rtColname and
mzColname of matchValues allow to specify which columns contain the
retention times and m/z values, respectively.
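The conversion and tolerance checks described for the parameter objects above can be sketched in base R. This is not the package's code: it hard-codes the [M+H]+ adduct as "mass plus the mass of a proton" (about 1.007276 Da) instead of calling MetaboCoreUtils::mass2mz(), the helper names `mass_to_mz_MH()` and `is_match()` are hypothetical, and the assumption that the ppm part of the window is calculated relative to the query value should be checked against the package sources.

```r
## Assumption: mass of a proton in Da, used for the [M+H]+ adduct only.
proton_mass <- 1.007276

## Hypothetical helper: m/z of the singly charged [M+H]+ ion.
mass_to_mz_MH <- function(mass) mass + proton_mass

## Hypothetical helper: TRUE if query and target differ by no more than a
## constant `tolerance` plus a `ppm` relative part.
is_match <- function(query, target, tolerance = 0, ppm = 5) {
    abs(query - target) <= tolerance + query * ppm / 1e6
}

target_mass <- c(Tryptophan = 204.089878, Leucine = 131.094629)
target_mz <- mass_to_mz_MH(target_mass)

## A measured m/z that is 1 mDa above the theoretical tryptophan ion.
query_mz <- 205.098154

## Matches with a 20 ppm window, no match when an exact match is required.
is_match(query_mz, target_mz, ppm = 20)
is_match(query_mz, target_mz, ppm = 0)
```

The same pattern applies to the Mz2Mass variants, with the difference that both query and target m/z values would first be converted back to masses before the comparison.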
ValueParam(tolerance = 0, ppm = 5)

MzParam(tolerance = 0, ppm = 5)

Mass2MzParam(adducts = c("[M+H]+"), tolerance = 0, ppm = 5)

Mass2MzRtParam(adducts = c("[M+H]+"), tolerance = 0, ppm = 5, toleranceRt = 0)

MzRtParam(tolerance = 0, ppm = 0, toleranceRt = 0)

Mz2MassParam(
  queryAdducts = c("[M+H]+"),
  targetAdducts = c("[M-H]-"),
  tolerance = 0,
  ppm = 5
)

Mz2MassRtParam(
  queryAdducts = c("[M+H]+"),
  targetAdducts = c("[M+H]+"),
  tolerance = 0,
  ppm = 5,
  toleranceRt = 0
)

matchValues(query, target, param, ...)

## S4 method for signature 'numeric,numeric,ValueParam'
matchValues(query, target, param)

## S4 method for signature 'numeric,data.frameOrSimilar,ValueParam'
matchValues(
  query,
  target,
  param,
  valueColname = character(),
  targetAssay = character()
)

## S4 method for signature 'data.frameOrSimilar,numeric,ValueParam'
matchValues(
  query,
  target,
  param,
  valueColname = character(),
  queryAssay = character()
)

## S4 method for signature 'data.frameOrSimilar,data.frameOrSimilar,ValueParam'
matchValues(
  query,
  target,
  param,
  valueColname = character(),
  queryAssay = character(),
  targetAssay = character()
)

## S4 method for signature 'numeric,numeric,Mass2MzParam'
matchValues(query, target, param)

## S4 method for signature 'numeric,data.frameOrSimilar,Mass2MzParam'
matchValues(
  query,
  target,
  param,
  massColname = "exactmass",
  targetAssay = character()
)

## S4 method for signature 'data.frameOrSimilar,numeric,Mass2MzParam'
matchValues(query, target, param, mzColname = "mz", queryAssay = character())

## S4 method for signature
## 'data.frameOrSimilar,data.frameOrSimilar,Mass2MzParam'
matchValues(
  query,
  target,
  param,
  mzColname = "mz",
  massColname = "exactmass",
  queryAssay = character(0),
  targetAssay = character(0)
)

## S4 method for signature 'numeric,data.frameOrSimilar,MzParam'
matchValues(query, target, param, mzColname = "mz", targetAssay = character())

## S4 method for signature 'numeric,Spectra,MzParam'
matchValues(query, target, param, mzColname = "mz", targetAssay = character())

## S4 method for signature 'data.frameOrSimilar,numeric,MzParam'
matchValues(query, target, param, mzColname = "mz", queryAssay = character())

## S4 method for signature 'data.frameOrSimilar,data.frameOrSimilar,MzParam'
matchValues(
  query,
  target,
  param,
  mzColname = c("mz", "mz"),
  queryAssay = character(),
  targetAssay = character()
)

## S4 method for signature 'data.frameOrSimilar,Spectra,MzParam'
matchValues(
  query,
  target,
  param,
  mzColname = c("mz", "mz"),
  queryAssay = character(),
  targetAssay = character()
)

## S4 method for signature
## 'data.frameOrSimilar,data.frameOrSimilar,Mass2MzRtParam'
matchValues(
  query,
  target,
  param,
  massColname = "exactmass",
  mzColname = "mz",
  rtColname = c("rt", "rt"),
  queryAssay = character(),
  targetAssay = character()
)

## S4 method for signature 'data.frameOrSimilar,data.frameOrSimilar,MzRtParam'
matchValues(
  query,
  target,
  param,
  mzColname = c("mz", "mz"),
  rtColname = c("rt", "rt"),
  queryAssay = character(),
  targetAssay = character()
)

## S4 method for signature 'data.frameOrSimilar,Spectra,MzRtParam'
matchValues(
  query,
  target,
  param,
  mzColname = c("mz", "mz"),
  rtColname = c("rt", "rt"),
  queryAssay = character(),
  targetAssay = character()
)

## S4 method for signature 'numeric,numeric,Mz2MassParam'
matchValues(query, target, param)

## S4 method for signature 'numeric,data.frameOrSimilar,Mz2MassParam'
matchValues(query, target, param, mzColname = "mz", targetAssay = character())

## S4 method for signature 'data.frameOrSimilar,numeric,Mz2MassParam'
matchValues(query, target, param, mzColname = "mz", queryAssay = character())

## S4 method for signature
## 'data.frameOrSimilar,data.frameOrSimilar,Mz2MassParam'
matchValues(
  query,
  target,
  param,
  mzColname = c("mz", "mz"),
  queryAssay = character(),
  targetAssay = character()
)

## S4 method for signature
## 'data.frameOrSimilar,data.frameOrSimilar,Mz2MassRtParam'
matchValues(
  query,
  target,
  param,
  mzColname = c("mz", "mz"),
  rtColname = c("rt", "rt"),
  queryAssay = character(),
  targetAssay = character()
)
tolerance |
for any |
ppm |
for any |
adducts |
for |
toleranceRt |
for |
queryAdducts |
for |
targetAdducts |
for |
query |
feature table containing information on MS1 features. Can be
a |
target |
compound table with metabolites to compare against. The
expected types are the same as those for |
param |
parameter object defining the matching approach and containing the settings for that approach. See description above for details. |
... |
currently ignored. |
valueColname |
|
targetAssay |
|
queryAssay |
|
massColname |
|
mzColname |
|
rtColname |
|
Matched object representing the result.
Depending on the param object different scores representing the quality
of the match are provided. This comprises absolute as well as relative
differences (column/variables "score" and "ppm_error" respectively).
If param is an Mz2MassParam, "score" and "ppm_error" represent
differences of the compared masses (calculated from the provided m/z values).
If param is an MzParam, MzRtParam, Mass2MzParam or Mass2MzRtParam,
"score" and "ppm_error" represent absolute and relative differences of
m/z values.
Additionally, if param is either an MzRtParam or Mass2MzRtParam, the
difference between query and target retention times for each matched
element is available in the column/variable "score_rt" in the returned
Matched object.
Negative values of "score" (or "score_rt") indicate that the m/z or mass
(or retention time) of the query element is smaller than that of the target
element.
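The sign convention can be made concrete with a small base R calculation. It assumes, as a reading of the text above and not as a statement about the package internals, that "score" is the query value minus the target value and that "ppm_error" is the absolute difference expressed relative to the target value. All numbers are made up.

```r
## Made-up m/z values of two query features and their matched targets.
query_mz  <- c(205.0970, 132.1021)
target_mz <- c(205.0972, 132.1019)

## Signed difference: negative if the query value is smaller than the
## target value.
score <- query_mz - target_mz

## Relative difference in parts per million (assumed relative to target).
ppm_error <- abs(score) / target_mz * 1e6

data.frame(query_mz, target_mz, score, ppm_error)
```

The first feature has a negative score because its m/z lies below that of the target, the second a positive one.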
Andrea Vicini, Michael Witting
matchSpectra or CompareSpectraParam() for spectra data matching
library(MetaboCoreUtils)

## Create a simple "target/reference" compound table
target_df <- data.frame(
    name = c("Tryptophan", "Leucine", "Isoleucine"),
    formula = c("C11H12N2O2", "C6H13NO2", "C6H13NO2"),
    exactmass = c(204.089878, 131.094629, 131.094629)
)

## Create a "feature" table with m/z of features. We calculate m/z for
## certain adducts of some of the compounds in the reference table.
fts <- data.frame(
    feature_id = c("FT001", "FT002", "FT003"),
    mz = c(mass2mz(204.089878, "[M+H]+"),
           mass2mz(131.094629, "[M+H]+"),
           mass2mz(204.089878, "[M+Na]+") + 1e-6))

## Define the parameters for the matching
parm <- Mass2MzParam(
    adducts = c("[M+H]+", "[M+Na]+"),
    tolerance = 0,
    ppm = 20)
res <- matchValues(fts, target_df, parm)
res

## List the available variables/columns
colnames(res)

## feature_id and mz are from the query data frame, while target_name,
## target_formula and target_exactmass are from the target object (columns
## from the target object have a prefix *target_* added to the original
## column name). Columns adduct, score and ppm_error represent the results
## of the matching: adduct the adduct/ion of the original compound for which
## the m/z matches, score the absolute difference of the query and target
## m/z and ppm_error the relative difference in m/z values.

## Get the full matching result:
matchedData(res)

## We have thus matches of FT002 to two different compounds (but with the
## same mass).

## Individual columns can also be accessed with the $ operator:
res$feature_id
res$target_name
res$ppm_error

## We repeat the matching requiring an exact match
parm <- Mass2MzParam(
    adducts = c("[M+H]+", "[M+Na]+"),
    tolerance = 0,
    ppm = 0)
res <- matchValues(fts, target_df, parm)
res

matchedData(res)

## The last feature could thus not be matched to any compound.

## At last we use also different adduct definitions.
parm <- Mass2MzParam(
    adducts = c("[M+K]+", "[M+Li]+"),
    tolerance = 0,
    ppm = 20)
res <- matchValues(fts, target_df, parm)
res

matchedData(res)

## No matches were found.

## We can also match a "feature" table with a target data.frame taking into
## account both m/z and retention time values.
target_df <- data.frame(
    name = c("Tryptophan", "Leucine", "Isoleucine"),
    formula = c("C11H12N2O2", "C6H13NO2", "C6H13NO2"),
    exactmass = c(204.089878, 131.094629, 131.094629),
    rt = c(150, 140, 140)
)

fts <- data.frame(
    feature_id = c("FT001", "FT002", "FT003"),
    mz = c(mass2mz(204.089878, "[M+H]+"),
           mass2mz(131.094629, "[M+H]+"),
           mass2mz(204.089878, "[M+Na]+") + 1e-6),
    rt = c(150, 140, 150.1)
)

## Define the parameters for the matching
parm <- Mass2MzRtParam(
    adducts = c("[M+H]+", "[M+Na]+"),
    tolerance = 0,
    ppm = 20,
    toleranceRt = 0)

res <- matchValues(fts, target_df, parm)
res

## Get the full matching result:
matchedData(res)

## FT003 could not be matched to any compound, FT002 was matched to two
## different compounds (but with the same mass).

## We repeat the matching allowing a positive tolerance for the matches
## between rt values

## Define the parameters for the matching
parm <- Mass2MzRtParam(
    adducts = c("[M+H]+", "[M+Na]+"),
    tolerance = 0,
    ppm = 20,
    toleranceRt = 0.1)

res <- matchValues(fts, target_df, parm)
res

## Get the full matching result:
matchedData(res)

## Also FT003 was matched in this case

## It is also possible to match directly m/z values
mz1 <- c(12, 343, 23, 231)
mz2 <- mz1 + rnorm(4, sd = 0.001)

res <- matchValues(mz1, mz2, MzParam(tolerance = 0.001))

matchedData(res)

## Matching with a SummarizedExperiment or a QFeatures work analogously,
## only that the matching is performed on the object's `rowData`.

## Below we create a simple SummarizedExperiment with some random assay data.
## Note that results from a data preprocessing with the `xcms` package could
## be extracted as a `SummarizedExperiment` with the `quantify` method from
## the `xcms` package.
library(SummarizedExperiment)
se <- SummarizedExperiment(
    assays = matrix(rnorm(12), nrow = 3, ncol = 4,
    dimnames = list(NULL, c("A", "B", "C", "D"))),
    rowData = fts)

## We can now perform the matching of this SummarizedExperiment against the
## target_df as before.
res <- matchValues(se, target_df,
    param = Mass2MzParam(adducts = c("[M+H]+", "[M+Na]+"),
        tolerance = 0, ppm = 20))
res

## Getting the available columns
colnames(res)

## The query columns represent the columns of the object's `rowData`
rowData(se)

## matchedData also returns the query object's rowData along with the
## matching entries in the target object.
matchedData(res)

## While `query` will return the full SummarizedExperiment.
query(res)

## To illustrate use with a QFeatures object we first create a simple
## QFeatures object with two assays, `"ions"` representing the full feature
## data.frame and `"compounds"` a subset of it.
library(QFeatures)
qf <- QFeatures(list(ions = se, compounds = se[2,]))

## We can perform the same matching as before, but need to specify which of
## the assays in the QFeatures should be used for the matching. Below we
## perform the matching using the "ions" assay.
res <- matchValues(qf, target_df, queryAssay = "ions",
    param = Mass2MzParam(adducts = c("[M+H]+", "[M+Na]+"),
        tolerance = 0, ppm = 20))
res

## colnames returns now the colnames of the `rowData` of the `"ions"` assay.
colnames(res)

matchedData(res)
fts <- data.frame( feature_id = c("FT001", "FT002", "FT003"), mz = c(mass2mz(204.089878, "[M+H]+"), mass2mz(131.094629, "[M+H]+"), mass2mz(204.089878, "[M+Na]+") + 1e-6)) ## Define the parameters for the matching parm <- Mass2MzParam( adducts = c("[M+H]+", "[M+Na]+"), tolerance = 0, ppm = 20) res <- matchValues(fts, target_df, parm) res ## List the available variables/columns colnames(res) ## feature_id and mz are from the query data frame, while target_name, ## target_formula and target_exactmass are from the query object (columns ## from the target object have a prefix *target_* added to the original ## column name. Columns adduct, score and ppm_error represent the results ## of the matching: adduct the adduct/ion of the original compound for which ## the m/z matches, score the absolute difference of the query and target ## m/z and ppm_error the relative difference in m/z values. ## Get the full matching result: matchedData(res) ## We have thus matches of FT002 to two different compounds (but with the ## same mass). ## Individual columns can also be accessed with the $ operator: res$feature_id res$target_name res$ppm_error ## We repeat the matching requiring an exact match parm <- Mass2MzParam( adducts = c("[M+H]+", "[M+Na]+"), tolerance = 0, ppm = 0) res <- matchValues(fts, target_df, parm) res matchedData(res) ## The last feature could thus not be matched to any compound. ## At last we use also different adduct definitions. parm <- Mass2MzParam( adducts = c("[M+K]+", "[M+Li]+"), tolerance = 0, ppm = 20) res <- matchValues(fts, target_df, parm) res matchedData(res) ## No matches were found. ## We can also match a "feature" table with a target data.frame taking into ## account both m/z and retention time values. 
target_df <- data.frame( name = c("Tryptophan", "Leucine", "Isoleucine"), formula = c("C11H12N2O2", "C6H13NO2", "C6H13NO2"), exactmass = c(204.089878, 131.094629, 131.094629), rt = c(150, 140, 140) ) fts <- data.frame( feature_id = c("FT001", "FT002", "FT003"), mz = c(mass2mz(204.089878, "[M+H]+"), mass2mz(131.094629, "[M+H]+"), mass2mz(204.089878, "[M+Na]+") + 1e-6), rt = c(150, 140, 150.1) ) ## Define the parameters for the matching parm <- Mass2MzRtParam( adducts = c("[M+H]+", "[M+Na]+"), tolerance = 0, ppm = 20, toleranceRt = 0) res <- matchValues(fts, target_df, parm) res ## Get the full matching result: matchedData(res) ## FT003 could not be matched to any compound, FT002 was matched to two ## different compounds (but with the same mass). ## We repeat the matching allowing a positive tolerance for the matches ## between rt values ## Define the parameters for the matching parm <- Mass2MzRtParam( adducts = c("[M+H]+", "[M+Na]+"), tolerance = 0, ppm = 20, toleranceRt = 0.1) res <- matchValues(fts, target_df, parm) res ## Get the full matching result: matchedData(res) ## Also FT003 was matched in this case ## It is also possible to match directly m/z values mz1 <- c(12, 343, 23, 231) mz2 <- mz1 + rnorm(4, sd = 0.001) res <- matchValues(mz1, mz2, MzParam(tolerance = 0.001)) matchedData(res) ## Matching with a SummarizedExperiment or a QFeatures work analogously, ## only that the matching is performed on the object's `rowData`. ## Below we create a simple SummarizedExperiment with some random assay data. ## Note that results from a data preprocessing with the `xcms` package could ## be extracted as a `SummarizedExperiment` with the `quantify` method from ## the `xcms` package. library(SummarizedExperiment) se <- SummarizedExperiment( assays = matrix(rnorm(12), nrow = 3, ncol = 4, dimnames = list(NULL, c("A", "B", "C", "D"))), rowData = fts) ## We can now perform the matching of this SummarizedExperiment against the ## target_df as before. 
res <- matchValues(se, target_df, param = Mass2MzParam(adducts = c("[M+H]+", "[M+Na]+"), tolerance = 0, ppm = 20)) res ## Getting the available columns colnames(res) ## The query columns represent the columns of the object's `rowData` rowData(se) ## matchedData also returns the query object's rowData along with the ## matching entries in the target object. matchedData(res) ## While `query` will return the full SummarizedExperiment. query(res) ## To illustrate use with a QFeatures object we first create a simple ## QFeatures object with two assays, `"ions"` representing the full feature ## data.frame and `"compounds"` a subset of it. library(QFeatures) qf <- QFeatures(list(ions = se, compounds = se[2,])) ## We can perform the same matching as before, but need to specify which of ## the assays in the QFeatures should be used for the matching. Below we ## perform the matching using the "ions" assay. res <- matchValues(qf, target_df, queryAssay = "ions", param = Mass2MzParam(adducts = c("[M+H]+", "[M+Na]+"), tolerance = 0, ppm = 20)) res ## colnames returns now the colnames of the `rowData` of the `"ions"` assay. colnames(res) matchedData(res)
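The effect of the `tolerance` and `ppm` settings in the examples above can be pictured with a small base R sketch. This is not the package's implementation: the helper `within_mz_window` is hypothetical, and it assumes that the accepted difference is `tolerance` plus a `ppm` fraction of the query m/z. The numbers are made up for illustration; only the `1e-6` offset mirrors the one added to FT003.

```r
## Hypothetical helper (not part of MetaboAnnotation): TRUE if the query m/z
## lies within `tolerance + ppm * query_mz / 1e6` of the target m/z.
## Assumption: the ppm-dependent part is computed on the query m/z.
within_mz_window <- function(query_mz, target_mz, tolerance = 0, ppm = 0) {
    abs(query_mz - target_mz) <= tolerance + ppm * query_mz / 1e6
}

target_mz <- 200
query_mz <- target_mz + 1e-6   # same small offset as added to FT003

## With ppm = 20 the allowed difference is about 0.004, so the small
## offset is accepted.
within_mz_window(query_mz, target_mz, tolerance = 0, ppm = 20)

## With tolerance = 0 and ppm = 0 only identical values are accepted, so
## the offset prevents a match.
within_mz_window(query_mz, target_mz, tolerance = 0, ppm = 0)

## Signed difference (query minus target): negative when the query m/z is
## smaller than the target m/z.
(target_mz - 1e-6) - target_mz
```

Under these assumptions the first call returns `TRUE` and the second `FALSE`, in line with FT003 being matched with `ppm = 20` but not with `ppm = 0`.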