MsBackendRawFileReader
BackendThe figure below depicts the idea of the Spectra framework. For a detailed description, read (Rainer et al. 2022).
suppressMessages(
stopifnot(require(Spectra),
require(MsBackendRawFileReader),
require(tartare),
require(BiocParallel))
)
## Warning: multiple methods tables found for 'setequal'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'IRanges'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'AnnotationHub'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'AnnotationDbi'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'Biostrings'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'XVector'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'GenomeInfoDb'
## Warning: multiple methods tables found for 'setequal'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'ExperimentHub'
assemblies aka Common Intermediate Language bytecode The download and
install can be done on all platforms using the command:
rawrr::installRawFileReaderDLLs()
## Overwrite sourceUrl to https://fgcz-ms.uzh.ch/~cpanse/rawrr/dotnet//linux-x64/rawrr
## MD5 d842e4ba5fe901cc4dd62e439e7c6b9b /github/home/.cache/R/rawrr/rawrrassembly/linux-x64/rawrr
## [1] 0
## ExperimentHub with 5 records
## # snapshotDate(): 2024-10-24
## # $dataprovider: Functional Genomics Center Zurich (FGCZ)
## # $species: NA
## # $rdataclass: Spectra
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH3219"]]'
##
## title
## EH3219 | Q Exactive HF-X mzXML
## EH3220 | Q Exactive HF-X raw
## EH3221 | Fusion Lumos mzXML
## EH3222 | Fusion Lumos raw
## EH4547 | Q Exactive HF Orbitrap raw
The RawFileReader libraries require a file extension ending with
.raw
.
## [1] "/github/home/.cache/R/ExperimentHub/17602f5447ea_3236.raw"
## [1] TRUE
## [1] "/github/home/.cache/R/ExperimentHub/17606b95c019_3238.raw"
## [1] TRUE
## [1] "/github/home/.cache/R/ExperimentHub/176042880e66_4590.raw"
## [1] TRUE
Call the constructor using Spectra::backendInitialize
,
see also (Rainer et al. 2022).
beRaw <- Spectra::backendInitialize(
MsBackendRawFileReader::MsBackendRawFileReader(),
files = c(rawfileEH3220, rawfileEH3222, rawfileEH4547))
Call the print method
## MsBackendRawFileReader with 32500 spectra
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.00358906 1
## 2 1 0.01189523 2
## 3 1 0.01847028 3
## 4 1 0.02504740 4
## 5 1 0.03161818 5
## ... ... ... ...
## 32496 2 34.9964 21877
## 32497 2 34.9978 21878
## 32498 2 34.9992 21879
## 32499 2 35.0007 21880
## 32500 2 35.0021 21881
## ... 30 more variables/columns.
##
## file(s):
## 17602f5447ea_3236.raw
## 17606b95c019_3238.raw
## 176042880e66_4590.raw
## [1] "msLevel" "rtime"
## [3] "acquisitionNum" "scanIndex"
## [5] "mz" "intensity"
## [7] "dataStorage" "dataOrigin"
## [9] "centroided" "smoothed"
## [11] "polarity" "precScanNum"
## [13] "precursorMz" "precursorIntensity"
## [15] "precursorCharge" "collisionEnergy"
## [17] "isolationWindowLowerMz" "isolationWindowTargetMz"
## [19] "isolationWindowUpperMz" "scanType"
## [21] "charge" "masterScan"
## [23] "dependencyType" "monoisotopicMz"
## [25] "AGC" "injectionTime"
## [27] "resolution" "isolationWidth"
## [29] "isolationOffset" "AGCTarget"
## [31] "collisionEnergyList" "AGCFill"
## [33] "isStepped"
Here we reproduce the Figure 2 of Kockmann and
Panse (2021) rawrr. The
MsBackendRawFileReader
ships with a filterScan
method using functionality provided
by the C# libraries by Thermo Fisher Scientific Shofstahl (2016).
(S <- (beRaw |>
filterScan("FTMS + c NSI Full ms2 [email protected] [100.0000-1015.0000]") )[437]) |>
plotSpectra()
# supposed to be scanIndex 9594
S
## MsBackendRawFileReader with 1 spectra
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 2 15.4204 9594
## ... 30 more variables/columns.
##
## file(s):
## 176042880e66_4590.raw
## [1] 175.1190 276.1666 375.2350 503.2936 632.3362 746.3791 803.4006 860.4221
names(yIonSeries) <- paste0("y", seq(1, length(yIonSeries)))
abline(v = yIonSeries, col='#DDDDDD88', lwd=5)
axis(3, yIonSeries, names(yIonSeries))
For demonstration reasons, we extent the MsBackend
class
by a filter method. The filterIons
function returns spectra
if and only if all fragment ions, given as argument, match. We use
protViz::findNN
binary search method for determining the nearest mZ peak for each ion.
If the mass error between an ion and an mz value is less than the given
mass tolerance, an ion is considered a hit.
## [1] "filterIons"
setMethod("filterIons", "MsBackend",
function(object, mZ=numeric(), tol=numeric(), ...) {
keep <- lapply(peaksData(object, BPPARAM = bpparam()),
FUN=function(x){
NN <- protViz::findNN(mZ, x[, 1])
hit <- (error <- mZ - x[NN, 1]) < tol & x[NN, 2] >= quantile(x[, 2], .9)
if (sum(hit) == length(mZ))
TRUE
else
FALSE
})
object[unlist(keep)]
})
The lines below implement a simple targeted peptide search engine.
The R code snippet takes as input a MsBackendRawFileReader
object containing 32500 spectra and y-fragment-ion mZ values determined
for LGGNEQVTR++
.
start_time <- Sys.time()
X <- beRaw |>
MsBackendRawFileReader::filterScan("FTMS + c NSI Full ms2 [email protected] [100.0000-1015.0000]") |>
filterIons(yIonSeries, tol = 0.005) |>
Spectra::Spectra() |>
Spectra::peaksData()
end_time <- Sys.time()
The defined filterIons
method runs on 995 input spectra
and returns 4 spectra.
The runtime is shown below.
## Time difference of 3.097611 secs
Next, we define and apply a method for graphing
LGGNEQVTR
peptide spectrum matches. Also, the function
returns some statistics of the match.
## A helper plot function to visualize a peptide spectrum match for
## the LGGNEQVTR peptide.
.plot.LGGNEQVTR <- function(x){
yIonSeries <- protViz::fragmentIon("LGGNEQVTR")[[1]]$y[1:8]
names(yIonSeries) <- paste0("y", seq(1, length(yIonSeries)))
plot(x, type = 'h', xlim = range(yIonSeries))
abline(v = yIonSeries, col = '#DDDDDD88', lwd=5)
axis(3, yIonSeries, names(yIonSeries))
# find nearest mZ value
idx <- protViz::findNN(yIonSeries, x[,1])
data.frame(
ion = names(yIonSeries),
mZ.yIon = yIonSeries,
mZ = x[idx, 1],
intensity = x[idx, 2]
)
}
## ion mZ
## 1 y1 175.1190
## 2 y2 276.1665
## 3 y3 375.2349
## 4 y4 503.2936
## 5 y5 632.3362
## 6 y6 746.3791
## 7 y7 803.4003
## 8 y8 860.4216
## ion intensity
## 1 y1 1505214
## 2 y2 2583122
## 3 y3 2364014
## 4 y4 3179124
## 5 y5 2286947
## 6 y6 1236341
## 7 y7 4586484
## 8 y8 12894520
We demonstrate the Spectra::combinePeaks
method and
aggregate the four spectra into a single peak matrix. The statistics
returned by .plot.LGGNEQVTR()
should be identical to the
aggregation code snippet output above.
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'plot': unable to find an inherited method for function 'combinePeaks' for signature 'object = "SimpleList"'
Below we demonstrate the interaction with the MsBackendMgf package while composing a Mascot Generic Format mgf file which is compatible with conducting an MS/MS Ions Search using Mascot Server (>=2.7) Perkins et al. (1999).
if (require(MsBackendMgf)){
## Map Spectra variables to Mascot Server compatible vocabulary.
map <- c(custom = "TITLE",
msLevel = "CHARGE",
scanIndex = "SCANS",
precursorMz = "PEPMASS",
rtime = "RTINSECONDS")
## Compose custom TITLE
beRaw$custom <- paste0("File: ", beRaw$dataOrigin, " ; SpectrumID: ", S$scanIndex)
(mgf <- tempfile(fileext = '.mgf'))
(beRaw |>
filterScan("FTMS + c NSI Full ms2 [email protected] [100.0000-1015.0000]") )[437] |>
Spectra::Spectra() |>
Spectra::selectSpectraVariables(c("rtime", "precursorMz",
"precursorCharge", "msLevel", "scanIndex", "custom")) |>
MsBackendMgf::export(backend = MsBackendMgf::MsBackendMgf(),
file = mgf, map = map)
readLines(mgf) |> head(12)
readLines(mgf) |> tail()
}
## Loading required package: MsBackendMgf
## [1] "862.427612304688 154045.78125" "870.404846191406 159569.8125"
## [3] "871.395141601562 196302.6875" "880.671020507812 65916"
## [5] "END IONS" ""
To extract all tandem spectra, you can use the code snippets below
S <- Spectra::backendInitialize(
MsBackendRawFileReader::MsBackendRawFileReader(),
files = c(rawfileEH4547)) |>
Spectra()
S
## MSn data (Spectra) with 21881 spectra in a MsBackendRawFileReader backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.00258665 1
## 2 2 0.00686231 2
## 3 2 0.00828043 3
## 4 2 0.00971328 4
## 5 2 0.01112509 5
## ... ... ... ...
## 21877 2 34.9964 21877
## 21878 2 34.9978 21878
## 21879 2 34.9992 21879
## 21880 2 35.0007 21880
## 21881 2 35.0021 21881
## ... 30 more variables/columns.
##
## file(s):
## 176042880e66_4590.raw
Next, we generate an mgf file for each scan type. This is helpful, e.g., for optimizing search settings tandem mass spectrometry sequence database search tool as comet Eng, Jahan, and Hoopmann (2012) or mascot server Perkins et al. (1999).
## Define scanType patterns
scanTypePattern <- list(
EThcD.lowres = "ITMS.+sa Full ms2.+@etd.+@hcd.+",
ETciD.lowres = "ITMS.+sa Full ms2.+@etd.+@cid.+",
CID.lowres = "ITMS[^@]+@cid[^@]+$",
HCD.lowres = "ITMS[^@]+@hcd[^@]+$",
EThcD.highres = "FTMS.+sa Full ms2.+@etd.+@hcd.+",
HCD.highres = "FTMS[^@]+@hcd[^@]+$"
)
beRaw <- Spectra::backendInitialize(
MsBackendRawFileReader::MsBackendRawFileReader(),
files = c(rawrr::sampleFilePath()))
beRaw <- Spectra::backendInitialize(
MsBackendRawFileReader::MsBackendRawFileReader(),
files = rawrr::sampleFilePath())
beRaw$custom <- paste0("File: ", gsub("/srv/www/htdocs/Data2San/", "", beRaw$dataOrigin), " ; SpectrumID: ", beRaw$scanIndex)
.generate_mgf <- function(ext, pattern, dir=tempdir(), ...){
mgf <- file.path(dir, paste0(sub("\\.raw", "", unique(basename(beRaw$dataOrigin))),
".", ext, ".mgf"))
idx <- beRaw$scanType |> grepl(patter=pattern)
if (sum(idx) == 0) return (NULL)
message(paste0("Extracting ", sum(idx), " ",
pattern, " scans\n\t to file ", mgf, " ..."))
beRaw[which(idx)] |>
Spectra::Spectra() |>
Spectra::selectSpectraVariables(c("rtime", "precursorMz",
"precursorCharge", "msLevel", "scanIndex", "custom")) |>
MsBackendMgf::export(backend = MsBackendMgf::MsBackendMgf(),
file = mgf,
map = map)
mgf
}
#mapply(ext = names(scanTypePattern),
# scanTypePattern,
# FUN = .generate_mgf) |>
# lapply(FUN = function(f){if (file.exists(f)) {readLines(f) |> head()}})
Given the task, we want to filter an MS2 of peak list recorded on an Orbitrap device to be interested only in the top peak within 100 Da mass windows. The following code snippet will demonstrate a solution.
## Define a function that takes a matrix as input and derives
## the top n most intense peaks within a mass window.
## Of note, here, we require centroided data. (no profile mode!)
MsBackendRawFileReader:::.top_n
## function (x, n = 10, mass_window = 100, ...)
## {
## if (nrow(x) < n) {
## return(x)
## }
## idx <- unlist(lapply(seq(0, 2000, by = mass_window), function(mZ) {
## i <- which((mZ < x[, 1] & x[, 1] <= mZ + mass_window))
## r <- i[order(x[, 2][i], decreasing = TRUE)]
## if (length(x[, 2]) > length(i))
## return(r[1:n])
## return(r)
## }, ...))
## x[sort(idx[!is.na(idx)]), ]
## }
## <bytecode: 0x555a10a2d288>
## <environment: namespace:MsBackendRawFileReader>
We add our custom code to the processing queue of the Spectra object.
Of note, we use n = 1
in praxis n = 10
for a
100 Da mass window, which seems to be a practical choice.
The plot below displays a visual control of the custom filter
function top_n.
On the top is the original spectrum, and
the filtered one is on the bottom. A point indicates peaks that
match.
The following snippet prints the values of the filtered peaklist and the mZ values of the y-ions.
## [1] 171.1129 276.1667 375.2351 486.2656 503.2942 632.3369 746.3797 860.4223
## y1 y2 y3 y4 y5 y6 y7 y8
## 175.1190 276.1666 375.2350 503.2936 632.3362 746.3791 803.4006 860.4221
When reading spectra the
MsBackendRawFileReader:::.RawFileReader_read_peaks
method
is calling the rawrr::readSpectrum
method.
The figure below displays the time performance for reading a single spectrum in dependency from the chunk size (how many spectra are read in one function call) for reading different numbers of overall spectra.
ioBm <- file.path(system.file(package = 'MsBackendRawFileReader'),
'extdata', 'specs.csv') |>
read.csv2(header=TRUE)
# perform and include a local IO benchmark
ioBmLocal <- ioBenchmark(1000, c(32, 64, 128, 256), rawfile = rawfileEH4547)
lattice::xyplot((1 / as.numeric(time)) * workers ~ size | factor(n) ,
group = host,
data = rbind(ioBm, ioBmLocal),
horizontal = FALSE,
scales=list(y = list(log = 10)),
auto.key = TRUE,
layout = c(3, 1),
ylab = 'spectra read in one second',
xlab = 'number of spectra / file')
We compare the output of the Thermo Fischer Scientific raw files
versus their corresponding mzXML files using
Spectra::MsBackendMzR
relying on the mzR
package.
## see ?tartare and browseVignettes('tartare') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## see ?tartare and browseVignettes('tartare') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
if (require(mzR)){
beMzXML <- Spectra::backendInitialize(
Spectra::MsBackendMzR(),
files = c(mzXMLEH3219))
beRaw <- Spectra::backendInitialize(
MsBackendRawFileReader::MsBackendRawFileReader(),
files = c(rawfileEH3220))
intensity.xml <- sapply(intensity(beMzXML[1:100]), sum)
intensity.raw <- sapply(intensity(beRaw[1:100]), sum)
plot(intensity.xml ~ intensity.raw, log = 'xy', asp = 1,
pch = 16, col = rgb(0.5, 0.5, 0.5, alpha=0.5), cex=2)
abline(lm(intensity.xml ~ intensity.raw),
col='red')
}
Are all scans of the raw file in the mzXML file?
##
## FALSE TRUE
## 113 1764
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] mzR_2.41.0 Rcpp_1.0.13-1
## [3] MsBackendMgf_1.15.0 tartare_1.20.0
## [5] ExperimentHub_2.15.0 AnnotationHub_3.15.0
## [7] BiocFileCache_2.15.0 dbplyr_2.5.0
## [9] MsBackendRawFileReader_1.13.1 Spectra_1.17.0
## [11] BiocParallel_1.41.0 S4Vectors_0.45.0
## [13] BiocGenerics_0.53.1 generics_0.1.3
## [15] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4 blob_1.2.4
## [4] filelock_1.0.3 Biostrings_2.75.1 fastmap_1.2.0
## [7] digest_0.6.37 mime_0.12 lifecycle_1.0.4
## [10] cluster_2.1.6 ProtGenerics_1.39.0 KEGGREST_1.47.0
## [13] RSQLite_2.3.7 magrittr_2.0.3 compiler_4.4.2
## [16] rlang_1.1.4 sass_0.4.9 tools_4.4.2
## [19] utf8_1.2.4 yaml_2.3.10 knitr_1.48
## [22] bit_4.5.0 curl_6.0.0 withr_3.0.2
## [25] purrr_1.0.2 sys_3.4.3 grid_4.4.2
## [28] fansi_1.0.6 MASS_7.3-61 cli_3.6.3
## [31] rmarkdown_2.29 crayon_1.5.3 httr_1.4.7
## [34] rawrr_1.15.5 ncdf4_1.23 DBI_1.2.3
## [37] cachem_1.1.0 zlibbioc_1.52.0 parallel_4.4.2
## [40] AnnotationDbi_1.69.0 BiocManager_1.30.25 XVector_0.47.0
## [43] vctrs_0.6.5 jsonlite_1.8.9 IRanges_2.41.0
## [46] bit64_4.5.2 clue_0.3-65 maketools_1.3.1
## [49] protViz_0.7.9 jquerylib_0.1.4 glue_1.8.0
## [52] codetools_0.2-20 BiocVersion_3.21.1 GenomeInfoDb_1.43.0
## [55] UCSC.utils_1.3.0 tibble_3.2.1 pillar_1.9.0
## [58] rappdirs_0.3.3 htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
## [61] R6_2.5.1 evaluate_1.0.1 Biobase_2.67.0
## [64] lattice_0.22-6 highr_0.11 png_0.1-8
## [67] memoise_2.0.1 bslib_0.8.0 MetaboCoreUtils_1.15.0
## [70] xfun_0.49 MsCoreUtils_1.19.0 fs_1.6.5
## [73] buildtools_1.0.0 pkgconfig_2.0.3