Fragment matching using synapter

Foreword

synapter is free and open-source software. If you use it, please support the project by citing it in publications:

Nicholas James Bond, Pavel Vyacheslavovich Shliaha, Kathryn S. Lilley, and Laurent Gatto. Improving qualitative and quantitative performance for MSE-based label free proteomics. J. Proteome Res., 2013, 12 (6), pp 2340–2353

Questions and bugs

For bugs, typos, suggestions or other questions, please file an issue in our tracking system (https://github.com/lgatto/synapter/issues) providing as much information as possible, a reproducible example and the output of sessionInfo().

If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.

Introduction

This document assumes familiarity with standard synapter pipeline described in (Bond et al. 2013) and in the package synapter vignette, available online and with vignette("synapter", package = "synapter").

In this vignette we introduce a new fragment matching feature (see figures 2, 3 and 4) which improves the matching of identification and the quantitation features. After applying the usual synergise1 workflow (see ?synergise1 and ?Synapter for details) a number of multiple matches and possible false unique matches remain that can be deconvoluted by comparing common peaks in the identification fragment peaks and the quantitation spectra.

The example data synobj2 used throughout this document is available in the synapterdata package and can be directly load as follows:

library("synapterdata")
synobj2RData()

In the [next section](:synergise] we describe how synobj2 was generated. The following sections then describe the new fragment matching functionality.

Running synergise1

One has to run the synergise1 workflow before fragment matching can be applied. Please read the general synapter vignette for the general use of synergise1. The additional data needed for the fragment matching procedure are a final_fragment.csv file for the identification run and a Spectrum.xml file for the quantitation run.

## Please find the raw data at:
## http://proteome.sysbiol.cam.ac.uk/lgatto/synapter/data/
library("synapter")

inlist <- list(
  identpeptide = "fermentor_03_sample_01_HDMSE_01_IA_final_peptide.csv.gz",
  identfragments = "fermentor_03_sample_01_HDMSE_01_IA_final_fragment.csv.gz",
  quantpeptide = "fermentor_02_sample_01_HDMSE_01_IA_final_peptide.csv.gz",
  quantpep3d = "fermentor_02_sample_01_HDMSE_01_Pep3DAMRT.csv.gz",
  quantspectra = "fermentor_02_sample_01_HDMSE_01_Pep3D_Spectrum.xml.gz",
  fasta = "S.cerevisiae_Uniprot_reference_canonical_18_03_14.fasta")

synobj2 <- Synapter(inlist, master=FALSE)
synobj2 <- synergise1(object=synobj2,
                      outputdir=tempdir())

Filtering fragments

This step is optional and allows one to remove low abundance fragments in the spectra using filterFragments. Filtering fragments can remove noise in the spectra and reduce undesired fragment matches. Prior to filtering, the plotCumulativeNumberOfFragments function can be use to visualise the intensity of all fragments. Both functions have an argument what to decide what spectra/fragments to filter/plot. Choose fragments.ident for the identification fragments and spectra.quant for the quantitation fragments.

plotCumulativeNumberOfFragments(synobj2,
                                what = "fragments.ident")
Cumulative Number of Fragments

Cumulative Number of Fragments

plotCumulativeNumberOfFragments(synobj2,
                                what = "spectra.quant")
Cumulative Number of Fragments

Cumulative Number of Fragments

filterFragments(synobj2,
                what = "fragments.ident",
                minIntensity = 70)
filterFragments(synobj2,
                what = "spectra.quant",
                minIntensity = 70)

Fragment matching

This method, named fragmentMatching, performs the matching of the identification fragments vs the quantitation spectra and counts the number of identical peaks for each combination.

Because the peaks/fragments in the spectra of one run will never be numerically identical to these in another, a tolerance parameter has to be set using the setFragmentMatchingPpmTolerance function. Peaks/Fragments within this tolerance are treated as identical.

setFragmentMatchingPpmTolerance(synobj2, 25)
fragmentMatching(synobj2)

The plotFragmentMatching function illustrates the details of this fragment matching procedure. If it is called without any additional argument every matched pair (fragment vs spectrum) is plotted. One can use the key argument to select a special value in a column (defined by the column argument) of the MatchedEMRTs data.frame. E.g. if one wants to select the fragment matching results with a high number of common peaks, e.g. 28 common peaks:

plotFragmentMatching(synobj2,
                     key = 28,
                     column = "FragmentMatching")
Fragment matching for cases with 28 common fragments. The identification data are shown on the top (blue) and the quantitation data are on the bottom (red). Common peaks are displayed in darker colours and highlighted by full points.

Fragment matching for cases with 28 common fragments. The identification data are shown on the top (blue) and the quantitation data are on the bottom (red). Common peaks are displayed in darker colours and highlighted by full points.

Or, if one is interested in all results for the peptide with the sequence "TALIDGLAQR".

plotFragmentMatching(synobj2,
                     key = "TALIDGLAQR",
                     column = "peptide.seq")
Fragment matching for peptide *TALIDGLAQ*.

Fragment matching for peptide TALIDGLAQ.

Fragment matching for peptide *TALIDGLAQ*.

Fragment matching for peptide TALIDGLAQ.

Maybe the peptide with a special precursor ID looks interesting.

plotFragmentMatching(synobj2,
                     key = 12589,
                     column = "precursor.leID.ident")
Fragment matching precursor with *leID* identifier 12589.

Fragment matching precursor with leID identifier 12589.

Fragment matching precursor with *leID* identifier 12589.

Fragment matching precursor with leID identifier 12589.

Plot distribution of common peaks

The plotFragmentMatchingPerformance function can be used to assess the performance of the fragment matching and the result of the filtering procedure (see below) based on the number of common peaks. This function invisibly returns a list with matrices containing true positive, false positive, true negative and false negative matches for the unique and non unique matches EMRT matches, as illustrated in tables 1 and 2. Both tables could be also generated by fragmentMatchingPerformance.

m <- plotFragmentMatchingPerformance(synobj2)
Number of true/false match peptides for different peak matching thresholds and difference in number of peaks between the first and second (in terms of number of common peaks) possible matches. The former metric is used to filter out possible false positive unique matches while the second is used to filter multiple matches. Empty circles indicate zero peptides.

Number of true/false match peptides for different peak matching thresholds and difference in number of peaks between the first and second (in terms of number of common peaks) possible matches. The former metric is used to filter out possible false positive unique matches while the second is used to filter multiple matches. Empty circles indicate zero peptides.

Number of true positives, false negatives, false positives, false negatives and false discovery rate for a given number of common peaks.
ncommon tp fp tn fn all fdr
0 2176 26 0 0 4034 0.0118074
1 1940 5 21 236 2765 0.0025707
2 1426 4 22 750 1899 0.0027972
3 1039 2 24 1137 1370 0.0019212
4 759 2 24 1417 1016 0.0026281
5 569 0 26 1607 780 0.0000000
6 414 0 26 1762 600 0.0000000
7 312 0 26 1864 468 0.0000000
8 245 0 26 1931 379 0.0000000
9 197 0 26 1979 316 0.0000000
10 148 0 26 2028 252 0.0000000
11 116 0 26 2060 203 0.0000000
12 94 0 26 2082 171 0.0000000
13 66 0 26 2110 132 0.0000000
14 55 0 26 2121 110 0.0000000
Number of true positives, false negatives, false positives, false negatives and false discovery rate for a given difference in number of common peaks between the higest and second highest multiply matching EMRTs in terms of number of common peptides.
deltacommon tp fp tn fn all fdr
0 334 72 0 0 4034 0.1773399
1 334 8 64 0 889 0.0233918
2 271 4 68 63 698 0.0145455
3 226 2 70 108 566 0.0087719
4 198 1 71 136 501 0.0050251
5 176 1 71 158 453 0.0056497
6 153 1 71 181 397 0.0064935
7 129 1 71 205 337 0.0076923
8 110 1 71 224 294 0.0090090
9 96 1 71 238 261 0.0103093
10 84 1 71 250 232 0.0117647
11 70 1 71 264 201 0.0140845
12 61 1 71 273 180 0.0161290
13 50 1 71 284 150 0.0196078
14 42 1 71 292 130 0.0232558

Filtering unique matches

From the left panel on figure 7 and table 1 displaying counts for unique matches one can define filtering values for the unique (this section) and multiple matches (next section). In the case of uniquely matching EMRTs, one can easily reduce the number of false matches by requiring that true matches must have at least one peak/fragment in common. Clearly this will also remove some true matches. The question is whether you want to rely on matches that have no (or only a few) peaks/fragments in common?

performance(synobj2)
getEMRTtable(synobj2)
## 
##    0    1    2    3    4    5    6    7   12   14 
##  241 2865  400   64   21    6    5    1    1    1
filterUniqueMatches(synobj2, minNumber = 1)
performance(synobj2)
getEMRTtable(synobj2)
## 
##   -1    0    1    2    3    4    5    6    7   12   14 
##  625  241 2240  400   64   21    6    5    1    1    1

Filtering non-unique matches

The largest benefit of fragment matching is for non unique matches. If we assume that true match have a highest number of common peaks/fragments, we can distinguish correct matches among multiple possible matches that could not resolved before (c.f. section [#sec:plotperf]). To do so, we use the difference of common peaks from the highest to the second highest number in the match group. Assuming two cases with multiple matches. In the first case, we have two possible matches: a match with 7 and a match with 2 fragments in common. In the second ambiguous match, there are a matches with 2 and 1 fragments in common respectively. If we decide to accept a difference of at least 2, our first multiple match case be resolved into a unique match as the difference between the best and second matches is 5 and the best match with 7 common fragments will be upgraded to a unique match.

The right panel of figure 7 and table 2 can be used to choose a good threshold for the difference in number of common peaks.

performance(synobj2)
getEMRTtable(synobj2)
## 
##   -1    0    1    2    3    4    5    6    7   12   14 
##  625  241 2240  400   64   21    6    5    1    1    1
filterNonUniqueMatches(synobj2, minDelta = 2)
performance(synobj2)
getEMRTtable(synobj2)
## 
##   -2   -1    0    1 
##  210  625  241 2529

In this example we rescued 289 unique matches out of the non unique ones.

Exporting results

Like in the initial synapter workflow, it is possible to export the MatchedEMRT results using the writeMatchedEMRTs function. The table has some new columns that correspond to the fragment matching procedure, e.g. FragmentMatching, .

writeMatchedEMRTs(synobj2, file = "MatchedEMRTs.csv")

Session information

All software and respective versions used to produce this document are listed below.

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BiocStyle_2.35.0    synapterdata_1.44.0 synapter_2.31.0    
##  [4] MSnbase_2.33.2      ProtGenerics_1.39.1 S4Vectors_0.45.2   
##  [7] mzR_2.41.1          Rcpp_1.0.13-1       Biobase_2.67.0     
## [10] BiocGenerics_0.53.3 generics_0.1.3      rmarkdown_2.29     
## 
## loaded via a namespace (and not attached):
##  [1] rlang_1.1.4                 magrittr_2.0.3             
##  [3] clue_0.3-66                 cleaver_1.45.0             
##  [5] matrixStats_1.4.1           compiler_4.4.2             
##  [7] vctrs_0.6.5                 reshape2_1.4.4             
##  [9] stringr_1.5.1               pkgconfig_2.0.3            
## [11] crayon_1.5.3                fastmap_1.2.0              
## [13] XVector_0.47.0              tzdb_0.4.0                 
## [15] UCSC.utils_1.3.0            preprocessCore_1.69.0      
## [17] purrr_1.0.2                 xfun_0.49                  
## [19] MultiAssayExperiment_1.33.1 zlibbioc_1.52.0            
## [21] cachem_1.1.0                GenomeInfoDb_1.43.2        
## [23] jsonlite_1.8.9              DelayedArray_0.33.3        
## [25] BiocParallel_1.41.0         parallel_4.4.2             
## [27] cluster_2.1.8               R6_2.5.1                   
## [29] RColorBrewer_1.1-3          bslib_0.8.0                
## [31] stringi_1.8.4               limma_3.63.2               
## [33] GenomicRanges_1.59.1        jquerylib_0.1.4            
## [35] SummarizedExperiment_1.37.0 iterators_1.0.14           
## [37] knitr_1.49                  readr_2.1.5                
## [39] IRanges_2.41.2              splines_4.4.2              
## [41] Matrix_1.7-1                igraph_2.1.2               
## [43] tidyselect_1.2.1            qvalue_2.39.0              
## [45] abind_1.4-8                 yaml_2.3.10                
## [47] doParallel_1.0.17           codetools_0.2-20           
## [49] affy_1.85.0                 lattice_0.22-6             
## [51] tibble_3.2.1                plyr_1.8.9                 
## [53] evaluate_1.0.1              survival_3.8-3             
## [55] Biostrings_2.75.3           pillar_1.10.0              
## [57] affyio_1.77.1               BiocManager_1.30.25        
## [59] MatrixGenerics_1.19.0       foreach_1.5.2              
## [61] MALDIquant_1.22.3           ncdf4_1.23                 
## [63] hms_1.1.3                   ggplot2_3.5.1              
## [65] munsell_0.5.1               scales_1.3.0               
## [67] glue_1.8.0                  lazyeval_0.2.2             
## [69] maketools_1.3.1             tools_4.4.2                
## [71] mzID_1.45.0                 sys_3.4.3                  
## [73] QFeatures_1.17.0            vsn_3.75.0                 
## [75] buildtools_1.0.0            XML_3.99-0.17              
## [77] grid_4.4.2                  impute_1.81.0              
## [79] tidyr_1.3.1                 MsCoreUtils_1.19.0         
## [81] colorspace_2.1-1            GenomeInfoDbData_1.2.13    
## [83] PSMatch_1.11.0              cli_3.6.3                  
## [85] S4Arrays_1.7.1              dplyr_1.1.4                
## [87] AnnotationFilter_1.31.0     pcaMethods_1.99.0          
## [89] gtable_0.3.6                sass_0.4.9                 
## [91] digest_0.6.37               SparseArray_1.7.2          
## [93] multtest_2.63.0             htmltools_0.5.8.1          
## [95] lifecycle_1.0.4             httr_1.4.7                 
## [97] statmod_1.5.0               MASS_7.3-61

References

Bond, Nicholas James, Pavel Vyacheslavovich Shliaha, Kathryn S. Lilley, and Laurent Gatto. 2013. “Improving Qualitative and Quantitative Performance for MSE-Based Label Free Proteomics.” Journal of Proteome Research 12. https://doi.org/10.1021/pr300776t.