Title: | Utilities for Exploration and Assessment of Confidence of LC-MSn Proteomics Identifications |
---|---|
Description: | Extracts MS/MS ID data from mzIdentML (leveraging mzID package) or text files. After collating the search results from multiple datasets it assesses their identification quality and optimize filtering criteria to achieve the maximum number of identifications while not exceeding a specified false discovery rate. Also contains a number of utilities to explore the MS/MS results and assess missed and irregular enzymatic cleavages, mass measurement accuracy, etc. |
Authors: | Vlad Petyuk with contributions from Laurent Gatto |
Maintainer: | Vlad Petyuk <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.41.0 |
Built: | 2024-10-31 03:36:19 UTC |
Source: | https://github.com/bioc/MSnID |
Extracts MS/MS ID data from mzIdentML (leveraging mzID package) or text files. After collating the search results from multiple datasets it assesses their identification quality and optimize filtering criteria to achieve the maximal identifications at a user specified false discovery rate. Additional utilities include:
post-experimental recalibration of mass measurement accuracy
assessment of irregular and missed cleavages given the enzyme cleavage pattern
assessment of false discovery rates at peptide-to-spectrum match, unique peptide and protein levels
leverages brute-force and sophisticated optimization routines (Nelder-Mead and simulated annealing) for finding the filtering criteria that provide the maximum spectrum, peptide or protein identifications while not exceeding a corresponding preset threshold of false discovery rate
converts the results into MSnSet class object as spectral counting data
Package: | MSnID |
Type: | Package |
Version: | 0.1.0 |
Date: | 2014-04-02 |
License: | Artistic-2.0 |
Vladislav A. Petyuk ([email protected])
Returns the non-redundant list of accession (protein) identifiers from the
MSnID object. Most of the times accessions and proteins
have the same meaning. However, there are cases, for example use of 6-frame
stop-to-stop translation as FASTA file, where the entries are called with
general term accessions rather then proteins. Currently, accessions and
proteins have the same meaning in MSnID
.
accessions(object, ...) proteins(object, ...)
accessions(object, ...) proteins(object, ...)
object |
An instance of class "MSnID". |
... |
ignored parameters |
Non-redundant list of accession (protein) identifiers.
Vladislav A Petyuk [email protected]
data(c_elegans) head(accessions(msnidObj)) head(proteins(msnidObj))
data(c_elegans) head(accessions(msnidObj)) head(proteins(msnidObj))
Given the provided modification mass, annotates its position within the peptide sequence with provided symbol.
add_mod_symbol(object, mod_mass, symbol)
add_mod_symbol(object, mod_mass, symbol)
object |
An instance of class "MSnID". |
mod_mass |
(string) modification mass. Must match exactly one of the masses in
|
symbol |
(string) character that annotates the position of the modification |
Returns MSnID object with new or modified peptide_mod
column.
The column contains peptide sequence with amino acid
modifications annotated with symbols.
In current implementation the method can not distinguish modifications with the same mass, but different amino acid specificity as different modifications.
Vladislav A Petyuk [email protected]
m <- MSnID(".") mzids <- system.file("extdata","phospho.mzid.gz",package="MSnID") m <- read_mzIDs(m, mzids) # to know the present mod masses report_mods(m) # TMT modification m <- add_mod_symbol(m, mod_mass="229.1629", symbol="#") # alkylation m <- add_mod_symbol(m, mod_mass="57.021463735", symbol="^") # phosphorylation m <- add_mod_symbol(m, mod_mass="79.966330925", symbol="*") # show the mapping head(unique(subset(psms(m), select=c("modification", "peptide_mod")))) # clean-up cache unlink(".Rcache", recursive=TRUE)
m <- MSnID(".") mzids <- system.file("extdata","phospho.mzid.gz",package="MSnID") m <- read_mzIDs(m, mzids) # to know the present mod masses report_mods(m) # TMT modification m <- add_mod_symbol(m, mod_mass="229.1629", symbol="#") # alkylation m <- add_mod_symbol(m, mod_mass="57.021463735", symbol="^") # phosphorylation m <- add_mod_symbol(m, mod_mass="79.966330925", symbol="*") # show the mapping head(unique(subset(psms(m), select=c("modification", "peptide_mod")))) # clean-up cache unlink(".Rcache", recursive=TRUE)
Filter out peptide-to-spectrum MS/MS identifications.
apply_filter(msnidObj, filterObj)
apply_filter(msnidObj, filterObj)
msnidObj |
An instance of class "MSnID". |
filterObj |
Either an instance of MSnIDFilter
class or a |
filterObj
argument
evaluated to a "logical"
for each entry of the
MS/MS results table.
Returns an instance of "MSnID" class with with peptide-to-spectrum
matches that pass criteria defined in filterObj
argument.
Vladislav A Petyuk [email protected]
data(c_elegans) ## Filtering using string: msnidObj <- assess_termini(msnidObj, validCleavagePattern="[KR]\\.[^P]") table(msnidObj$numIrregCleavages) # getting rid of any other peptides except fully tryptic msnidObj <- apply_filter(msnidObj, "numIrregCleavages == 0") show(msnidObj) ## Filtering using filter object: # first adding columns that will be used as filters msnidObj$msmsScore <- -log10(msnidObj$`MS-GF:SpecEValue`) msnidObj$mzError <- abs(msnidObj$experimentalMassToCharge - msnidObj$calculatedMassToCharge) # setting up filter object filtObj <- MSnIDFilter(msnidObj) filtObj$msmsScore <- list(comparison=">", threshold=10.0) filtObj$mzError <- list(comparison="<", threshold=0.1) # 0.1 Thomson show(filtObj) # applying filter and comparing MSnID object before and after show(msnidObj) msnidObj <- apply_filter(msnidObj, filtObj) show(msnidObj)
data(c_elegans) ## Filtering using string: msnidObj <- assess_termini(msnidObj, validCleavagePattern="[KR]\\.[^P]") table(msnidObj$numIrregCleavages) # getting rid of any other peptides except fully tryptic msnidObj <- apply_filter(msnidObj, "numIrregCleavages == 0") show(msnidObj) ## Filtering using filter object: # first adding columns that will be used as filters msnidObj$msmsScore <- -log10(msnidObj$`MS-GF:SpecEValue`) msnidObj$mzError <- abs(msnidObj$experimentalMassToCharge - msnidObj$calculatedMassToCharge) # setting up filter object filtObj <- MSnIDFilter(msnidObj) filtObj$msmsScore <- list(comparison=">", threshold=10.0) filtObj$mzError <- list(comparison="<", threshold=0.1) # 0.1 Thomson show(filtObj) # applying filter and comparing MSnID object before and after show(msnidObj) msnidObj <- apply_filter(msnidObj, filtObj) show(msnidObj)
Bottom-up proteomics approaches utilize endoproteases or chemical agents
to digest proteins into smaller fragments called peptides.
The enzymes recognize short amino acid motifs
and cleave along the peptide bonds. Chemical agents such as
CNBr
also possess amino acid cleavage specificity. In real-world
not every cleavage site get cleaved during the sample processing.
Therefore settings of MS/MS search engines quite often explictly
allow up to a certain number missed clevage sites per peptide sequence.
This function counts the number of missed cleavages in peptide
sequence given the endoprotease cleavage motif in the form of
regular expression. The default value for missedCleavagePattern
is
[KR](?=[^P$])
, which corresponds to trypsin.
assess_missed_cleavages(object, missedCleavagePattern="[KR](?=[^P$])")
assess_missed_cleavages(object, missedCleavagePattern="[KR](?=[^P$])")
object |
An instance of class "MSnID". |
missedCleavagePattern |
Cleavage pattern in the form of regular expression. |
Returns an instance of "MSnID" class with additional column "numMissCleavages"
If the "MSnID" instance does not contain "peptide" column in MS/MS results
table then there will be an error.
E.g. you can check this by "peptide" %in% names(msnid)
where msnid
is your "MSnID" instance.
Vladislav A Petyuk [email protected]
data(c_elegans) # adding column numMissCleavages containing count of missed cleavages msnidObj <- assess_missed_cleavages(msnidObj, missedCleavagePattern="[KR](?=[^P$])") # check the distribution table(msnidObj$numMissCleavages)
data(c_elegans) # adding column numMissCleavages containing count of missed cleavages msnidObj <- assess_missed_cleavages(msnidObj, missedCleavagePattern="[KR](?=[^P$])") # check the distribution table(msnidObj$numMissCleavages)
Bottom-up proteomics approaches utilize endoproteases or chemical agents
to digest proteins into smaller fragments called peptides.
The enzymes recognize short amino acid motifs
and cleave along the peptide bonds. Chemical agents such as CNBr
also
possesses amino acid cleavage specificity.
This function checks if peptide termini are as expected given the
enzymatic/chemical cleavage specificity.
The default value for validCleavagePattern
is
[KR]\.[^P]
, which corresponds to trypsin.
assess_termini(object, validCleavagePattern="[KR]\\.[^P]")
assess_termini(object, validCleavagePattern="[KR]\\.[^P]")
object |
An instance of class "MSnID". |
validCleavagePattern |
Cleavage pattern in the form of regular expression. |
N- or C- protein termini are not considered as irregular clevages sites.
Returns an instance of "MSnID" class with additional
column "numIrregCleavages". If both termini conforms with
cleavage specificity, then value is 0
, if one or two
termini are irregular then the values are 1
and 2
,
correspondingly.
If the "MSnID" instance does not contain "peptide" column in MS/MS results
table then there will be an error.
E.g. you can check this by "peptide" %in% names(msnid)
where msnid
is your "MSnID" instance.
Vladislav A Petyuk [email protected]
data(c_elegans) # adding column numIrregCleavages # containing count of irregularly cleaved termini msnidObj <- assess_termini(msnidObj, validCleavagePattern="[KR]\\.[^P]") # check the distribution table(msnidObj$numIrregCleavages)
data(c_elegans) # adding column numIrregCleavages # containing count of irregularly cleaved termini msnidObj <- assess_termini(msnidObj, validCleavagePattern="[KR]\\.[^P]") # check the distribution table(msnidObj$numIrregCleavages)
In a typical setting instruments select ions for fragmentation primarily based on ion intensity. For low molecular weight peptides the most intense peak usually corresponds to monoisotopic peak (that is only C12 carbon isotopes). With increase of molecular weight, instensity of monoisotopic peak becomes smaller relatively to heavier peptide isotopes (that is containing one or a few C13 isotopes).
The function subtracts or adds the mass difference between C13 and C12 isotopes (1.0033548378 Da) if that reduces the mass error. Such a mass error arises from the fact that instrument may peak non-monoisotopic peak for fragmentation and thus report the mass that is different by ~ 1 Da.
correct_peak_selection(object)
correct_peak_selection(object)
object |
An instance of class "MSnID". |
Returns an instance of "MSnID" class with updated
experimentalMassToCharge
value.
Vladislav A Petyuk [email protected]
MSnID
recalibrate
mass_measurement_error
data(c_elegans) # plot original mass error massErr <- (msnidObj$experimentalMassToCharge - msnidObj$calculatedMassToCharge) * msnidObj$chargeState hist(massErr,xlim=c(-1,+1), breaks=seq(-1.5,+1.5,0.01)) # fixing the problem of picking wrong monoisotopic peak msnidObj <- correct_peak_selection(msnidObj) # plot fixed mass error massErr <- (msnidObj$experimentalMassToCharge - msnidObj$calculatedMassToCharge) * msnidObj$chargeState hist(massErr,xlim=c(-1,+1), breaks=seq(-1.5,+1.5,0.01))
data(c_elegans) # plot original mass error massErr <- (msnidObj$experimentalMassToCharge - msnidObj$calculatedMassToCharge) * msnidObj$chargeState hist(massErr,xlim=c(-1,+1), breaks=seq(-1.5,+1.5,0.01)) # fixing the problem of picking wrong monoisotopic peak msnidObj <- correct_peak_selection(msnidObj) # plot fixed mass error massErr <- (msnidObj$experimentalMassToCharge - msnidObj$calculatedMassToCharge) * msnidObj$chargeState hist(massErr,xlim=c(-1,+1), breaks=seq(-1.5,+1.5,0.01))
Filter out peptide-to-spectrum MS/MS identifications.
evaluate_filter(object, filter, level=c("PSM", "peptide", "accession"))
evaluate_filter(object, filter, level=c("PSM", "peptide", "accession"))
object |
An instance of class "MSnID". |
filter |
Either an instance of MSnIDFilter class or a |
level |
Level at which the filter will be evaluated. Possible values are "PSM", "peptide" and "accession". Multiple are OK. Default is all of them. |
Returns a matrix with with column names "fdr" and "n". Column "n" contains the number of features (spectra, peptides or proteins/accessions) passing the filter. Column "fdr" is the false discovery rate (i.e. identification confidence) for the corresponding features. Row names correspond to the provided levels.
Vladislav A Petyuk [email protected]
data(c_elegans) ## Filtering using string: msnidObj <- assess_termini(msnidObj, validCleavagePattern="[KR]\\.[^P]") table(msnidObj$numIrregCleavages) evaluate_filter(msnidObj, "numIrregCleavages == 0") ## Filtering using filter object: # first adding columns that will be used as filters msnidObj$msmsScore <- -log10(msnidObj$`MS-GF:SpecEValue`) msnidObj$mzError <- abs(msnidObj$experimentalMassToCharge - msnidObj$calculatedMassToCharge) # setting up filter object filtObj <- MSnIDFilter(msnidObj) filtObj$msmsScore <- list(comparison=">", threshold=10.0) filtObj$mzError <- list(comparison="<", threshold=0.1) # 0.1 Thomson show(filtObj) evaluate_filter(msnidObj, filtObj)
data(c_elegans) ## Filtering using string: msnidObj <- assess_termini(msnidObj, validCleavagePattern="[KR]\\.[^P]") table(msnidObj$numIrregCleavages) evaluate_filter(msnidObj, "numIrregCleavages == 0") ## Filtering using filter object: # first adding columns that will be used as filters msnidObj$msmsScore <- -log10(msnidObj$`MS-GF:SpecEValue`) msnidObj$mzError <- abs(msnidObj$experimentalMassToCharge - msnidObj$calculatedMassToCharge) # setting up filter object filtObj <- MSnIDFilter(msnidObj) filtObj$msmsScore <- list(comparison=">", threshold=10.0) filtObj$mzError <- list(comparison="<", threshold=0.1) # 0.1 Thomson show(filtObj) evaluate_filter(msnidObj, filtObj)
A wrapper function over AnnotationHub that helps to convert from one protein identifiers to another.
fetch_conversion_table(organism_name, from, to, backend="AnnotationHub", snapshot_date=NULL)
fetch_conversion_table(organism_name, from, to, backend="AnnotationHub", snapshot_date=NULL)
organism_name |
(string) official organism name. E.g. "Homo sapiens", "Mus musculus" or "Rattus norvegicus". |
from , to
|
(string) identifier names. Recommended names are "SYMBOL", "UNIPROT", "REFSEQ" and "ENSEMBLPROT". Other identifiers are possible, but use them at your own risk. |
backend |
(string) currently only AnnotationHub |
snapshot_date |
(string) snapshot date for AnnotationHub.
Default is |
data.frame with first column name of from identifier, the second name of to identifier
Vladislav A Petyuk [email protected]
conv_tbl <- fetch_conversion_table("Rattus norvegicus", "REFSEQ", "SYMBOL") head(conv_tbl) conv_tab <- fetch_conversion_table("Homo sapiens", "UNIPROT", "SYMBOL") head(conv_tbl)
conv_tbl <- fetch_conversion_table("Rattus norvegicus", "REFSEQ", "SYMBOL") head(conv_tbl) conv_tab <- fetch_conversion_table("Homo sapiens", "UNIPROT", "SYMBOL") head(conv_tbl)
Reports quality for a given level of identification (spectra, peptide or protein).
id_quality(object, filter=NULL, level=c("PSM", "peptide", "accession"))
id_quality(object, filter=NULL, level=c("PSM", "peptide", "accession"))
object |
An instance of class "MSnID". |
filter |
Optional argument.
Either an instance of MSnIDFilter class or a |
level |
Level at which the filter will be evaluated. Possible values are "PSM", "peptide" and "accession". Multiple are OK. Default is all of them. |
Returns a matrix with with column names "fdr" and "n". Column "n" contains the number of features (spectra, peptides or proteins/accessions) passing the filter. Column "fdr" is the false discovery rate (i.e. identification confidence) for the corresponding features. Row names correspond to the provided levels.
Vladislav A Petyuk [email protected]
data(c_elegans) id_quality(msnidObj, level="peptide") id_quality(msnidObj, filter="`MS-GF:PepQValue` < 0.01", level="peptide")
data(c_elegans) id_quality(msnidObj, level="peptide") id_quality(msnidObj, filter="`MS-GF:PepQValue` < 0.01", level="peptide")
Infer parsimonious set of accessions (e.g. proteins) that explains all the peptide sequences. The algorithm is a simple loop that looks for the accession explaining most peptides, records the peptide-to-accession mapping for this accession, removes those peptides, and then looks for next best accession. The loop continues until no peptides left. The method does not accept any arguments at this point (except the MSnID object itself).
infer_parsimonious_accessions(object, unique_only=FALSE, prior=character(0))
infer_parsimonious_accessions(object, unique_only=FALSE, prior=character(0))
object |
An instance of class "MSnID". |
unique_only |
If TRUE, peptides mapping to multiple accessions are dropped and only unique are retained. If FALSE, then shared peptides assigned according to Occam's razor rule. That is a shared peptide is assigned to a protein with larger number of unique peptides. If the number of unique peptides is the same, then to the first accession. Default is FALSE. |
prior |
(character)
character vector with prior justified proteins/accessions.
If |
Although the algorithm is rather simple it is THE algorithm used for inferring maximal matching in bipartate graphs and is used in the IDPicker software.
Returns an instance of "MSnID" with minimal set of proteins necessary to explain all the peptide sequences.
Vladislav A Petyuk [email protected]
data(c_elegans) # explicitely adding parameters that will be used for data filtering msnidObj$msmsScore <- -log10(msnidObj$`MS-GF:SpecEValue`) msnidObj$absParentMassErrorPPM <- abs(mass_measurement_error(msnidObj)) # quick-and-dirty filter. The filter is too strong for the sake of saving time # at the minimal set of proteins inference step. msnidObj <- apply_filter(msnidObj, 'msmsScore > 12 & absParentMassErrorPPM < 2') show(msnidObj) msnidObj2 <- infer_parsimonious_accessions(msnidObj) show(msnidObj2)
data(c_elegans) # explicitely adding parameters that will be used for data filtering msnidObj$msmsScore <- -log10(msnidObj$`MS-GF:SpecEValue`) msnidObj$absParentMassErrorPPM <- abs(mass_measurement_error(msnidObj)) # quick-and-dirty filter. The filter is too strong for the sake of saving time # at the minimal set of proteins inference step. msnidObj <- apply_filter(msnidObj, 'msmsScore > 12 & absParentMassErrorPPM < 2') show(msnidObj) msnidObj2 <- infer_parsimonious_accessions(msnidObj) show(msnidObj2)
Given the peptide sequence with modification X.XXXX*XXXX.X and provided protein sequence FASTA, the method maps the location of the modification resulting in {protein ID}-{aa}{aa position}".
map_mod_sites(object, fasta, accession_col = "accession", peptide_mod_col = "peptide_mod", mod_char = "*", site_delimiter = "lower")
map_mod_sites(object, fasta, accession_col = "accession", peptide_mod_col = "peptide_mod", mod_char = "*", site_delimiter = "lower")
object |
An instance of class MSnID. |
fasta |
(AAStringSet object) Protein sequences read from a FASTA file. Names must match protein/accesison IDs in the accesson column of the MSnID object. |
accession_col |
(string) Name of the column with accession/protein IDs in the MSnID object. Default is "accession". |
peptide_mod_col |
(string) Name of the column with modified peptide sequences in the MSnID object. Default is "peptide_mod". |
mod_char |
(string) character that annotates the position of the modification. Default is "*". |
site_delimiter |
(string) either a single character or "lower" (default) meaning it will be the same amino acid symbol, but in lower case |
MSnID object with extra columns regarting the modification mapping.
Most likely, what you need is SiteID
.
PepLoc |
(list of ints) position of the starting amino acid within protein sequence. It is a list, because there may be multiple occurences of the same sequence matching the peptide's sequence. |
PepLocFirst |
(int) position of the first occurence of the matching sequence |
ProtLength |
(int) protein length |
ModShift |
(vector of ints) positions of modified amino acids within peptide |
ModAAs |
(vector of characters) single-letter amino acid codes of the modified residues |
SiteLoc |
(list of vectors of ints) positions of the modified amino acids within protein for each occurence of the peptide |
Site |
(list of vectors of characters) modified sites encoded as amino acid symbol follwed by position for each occurence of the peptide |
SiteCollapsed |
(list of characters)
same as |
SiteCollapsedFirst |
(character)
first element of the |
SiteID |
(character)
accession ID concatenated with |
Vladislav A Petyuk [email protected]
m <- MSnID(".") mzids <- system.file("extdata","phospho.mzid.gz",package="MSnID") m <- read_mzIDs(m, mzids) # to know the present mod masses report_mods(m) # TMT modification m <- add_mod_symbol(m, mod_mass="229.1629", symbol="#") # alkylation m <- add_mod_symbol(m, mod_mass="57.021463735", symbol="^") # phosphorylation m <- add_mod_symbol(m, mod_mass="79.966330925", symbol="*") # show the mapping head(unique(subset(psms(m), select=c("modification", "peptide_mod")))) # read fasta for mapping modifications fst_path <- system.file("extdata","for_phospho.fasta.gz",package="MSnID") library(Biostrings) fst <- readAAStringSet(fst_path) # to ensure names are the same as in accessions(m) names(fst) <- sub("(^[^ ]*) .*$", "\1", names(fst)) # # mapping phosphosites m <- map_mod_sites(m, fst, "accession", "peptide_mod", "*", "lower") head(unique(subset(psms(m), select=c("accession", "peptide_mod", "SiteID")))) # clean-up cache unlink(".Rcache", recursive=TRUE)
m <- MSnID(".") mzids <- system.file("extdata","phospho.mzid.gz",package="MSnID") m <- read_mzIDs(m, mzids) # to know the present mod masses report_mods(m) # TMT modification m <- add_mod_symbol(m, mod_mass="229.1629", symbol="#") # alkylation m <- add_mod_symbol(m, mod_mass="57.021463735", symbol="^") # phosphorylation m <- add_mod_symbol(m, mod_mass="79.966330925", symbol="*") # show the mapping head(unique(subset(psms(m), select=c("modification", "peptide_mod")))) # read fasta for mapping modifications fst_path <- system.file("extdata","for_phospho.fasta.gz",package="MSnID") library(Biostrings) fst <- readAAStringSet(fst_path) # to ensure names are the same as in accessions(m) names(fst) <- sub("(^[^ ]*) .*$", "\1", names(fst)) # # mapping phosphosites m <- map_mod_sites(m, fst, "accession", "peptide_mod", "*", "lower") head(unique(subset(psms(m), select=c("accession", "peptide_mod", "SiteID")))) # clean-up cache unlink(".Rcache", recursive=TRUE)
Computes error of the parent ion mass to charge measurement from
experimentalMassToCharge
and calculatedMassToCharge
.
The returned value is in points per million (ppm).
mass_measurement_error(object)
mass_measurement_error(object)
object |
An instance of class "MSnID". |
It may be more common to compute "mass measurement error". However, the practical difference in "mass measurement error" and "mass to charge measurement error" is negligible. Moreover, the instruments measure mass/charge ratio, not mass per se.
Returns mass to charge measurement error in "ppm" units.
Vladislav A Petyuk [email protected]
MSnID
recalibrate
correct_peak_selection
data(c_elegans) ppm <- mass_measurement_error(msnidObj) hist(ppm, 100)
data(c_elegans) ppm <- mass_measurement_error(msnidObj) hist(ppm, 100)
The MSnID
is a convenience class for manipulating
the MS/MS search results.
The way to create objects is to call MSnID
constructor function that takes as an input the project working directory
workDir
and the second argument if the cache from previous analysis
should be cleaned cleanCache
.
workDir
:Object of class "character"
.
containing working directory for the project.
The .Rcache
subdirectory stores the cached
resuls form the previous analyses.
The mechanism of caching relies on R.cache package.
psms
:Object of class data.table
that contains all
the MS/MS identification results in the
form of peptide(or protein)-spectrum-matches.
signature(object, mzids)
:
Reads mzIdentML files into psms
data.table
slot
of object
MSnID instance. The functionality leverage
mzID package facility. Note, the calls are memoised
using R.cache facility. So if the call with the same
list of files issues again, the results will be read from cache
instead of re-parsing the mzIdentML files.
See read_mzIDs
psms(object)
, psms(object)<-value
:Gets and sets MS/MS search results as data.frame
.
See psms
signature(x = "MSnID")
:
Returns the dimensions of
the table with MS/MS identification data.
signature(object = "MSnID")
:
Returns unique peptide list. See peptides
signature(object = "MSnID")
:
Returns unique accessions (typically proteins) list.
See accessions
signature(object = "MSnID")
:
Returns unique proteins list.
See proteins
Checks the agreement of peptide termini
with enzymes cleavage specificity. The return value is
theMSnID
object with extra variable numIrregCleavages
.
See assess_termini
Checks if the peptide sequence contains the
sites that were not cleaved by the enzyme.
For details see assess_missed_cleavages
Returns parent ion mass measurement error in
parts per million (ppm) units. Note, it requires
experimentalMassToCharge
and calculatedMassToCharge
variables to be set. See mass_measurement_error
Recalibrates, that is removes systematic
error from experimentalMassToCharge
measurements.
See recalibrate
Subtracts or adds the mass difference between
C13 and C12 isotopes (1.0033548378 Da) if that reduces the mass error.
Such a mass error arises from the fact that instrument may peak
non-monoisotopic peak for fragmentation and thus report the mass
that is different by ~ 1 Da.
See correct_peak_selection
signature(msnidObj="MSnID", filterObj="character")
signature(msnidObj="MSnID", filterObj="MSnIDFilter")
The filterObj
argument is a "character"
or
converted to a "character"
text string that is
evaluated to a "logical"
for each entry of the
MS/MS results table. Return value is a filtered
MSnID
object with entries that pass the applied filter.
See apply_filter
evaluate_filter(object, filter,
level = c("PSM","peptide", "accession")
Returns a list with fdr
and n
elements.
Argument filter
is either "character"
or "MSnIDFilter"
object. Argument level
can
take one of the values c("PSM","peptide","accession")
and controls the level filter is evaluated.
See evaluate_filter
id_quality
signature(object="MSnID", ...)
Other optional ... arguments are filter
is an "MSnIDFilter"
instance and level
. The level
values are one of "PSM",
"peptide", "accession". The method returns FDR for given
level depending of type of identifications.
See id_quality
signature(x = "MSnID")
:
Coerce object from MSnID
to MSnSet
.
signature(x="MSnID")
Returns the column names in the MS/MS results table.
object$name
, object$name<-value
Access and set name
column in MS/MS search results table.
object[[i]]
, object[[i]]<-value
Access and set column i
(character or numeric index) in
MS/MS search results table.
signature(from = "MSnID")
:
Coerce object from MSnID
to MSnSet
.
signature(from = "MSnID")
:
Coerce object from MSnID
to data.table
.
Vladislav A Petyuk [email protected]
## Not run: msnidObj <- MSnID(".") mzids <- system.file("extdata","c_elegans.mzid.gz",package="MSnID") msnidObj <- read_mzIDs(msnidObj, mzids) # clean up the cache directory unlink(".Rcache", recursive=TRUE) ## End(Not run)
## Not run: msnidObj <- MSnID(".") mzids <- system.file("extdata","c_elegans.mzid.gz",package="MSnID") msnidObj <- read_mzIDs(msnidObj, mzids) # clean up the cache directory unlink(".Rcache", recursive=TRUE) ## End(Not run)
The MSnIDFilter
is a convenience class for manipulating the MS/MS
filter for MS/MS results.
The way to create objects is to call MSnIDFilter
constructor function that takes as input the MSnID
class instance and (optionally) filterList
.
MSnIDObj
:An instance of class "MSnID".
filterList
:An optional argument. A list with element names corresponding to
column names avalable in MSnID
instance. Each element
contains sub-elements "comparison" and "threshold". "Comparison" is
one of the relationship operators (e.g. ">")
see Comparison for details. "Threshold" is the corresponding
parameter value the identification has to be more or less (depending
on comparison) to pass the filter.
signature(object="MSnIDFilter")
:
Prints MSnIDFilter
object.
object$name
, object$name<-value
Access and set filterList elements.
signature(x="MSnIDFilter")
Returns the names of the criteria.
as.numeric
signature(x="MSnIDFilter")
Converts filterList
into "numeric" vector.
Vector names are the list element names.
Vector values are threshold values.
Comparison operators are lost.
length
signature(x="MSnIDFilter")
Returns the number of criteria set in the "MSnIDFilter" object.
update
signature(object="MSnIDFilter", ...)
The additional ... argument is numeric
vector of the same
length as the number of criteria in MSnIDFilter
object.
The method update the corresponding thresholds to new provided values.
Vladislav A Petyuk [email protected]
MSnSet
evaluate_filter
apply_filter
optimize_filter
data(c_elegans) ## Filtering using filter object: # first adding columns that will be used as filters msnidObj$msmsScore <- -log10(msnidObj$`MS-GF:SpecEValue`) msnidObj$mzError <- abs(msnidObj$experimentalMassToCharge - msnidObj$calculatedMassToCharge) # setting up filter object filtObj <- MSnIDFilter(msnidObj) filtObj$msmsScore <- list(comparison=">", threshold=10.0) filtObj$mzError <- list(comparison="<", threshold=0.1) # 0.1 Thomson show(filtObj) # applying filter and comparing MSnID object before and after show(msnidObj) msnidObj <- apply_filter(msnidObj, filtObj) show(msnidObj)
data(c_elegans) ## Filtering using filter object: # first adding columns that will be used as filters msnidObj$msmsScore <- -log10(msnidObj$`MS-GF:SpecEValue`) msnidObj$mzError <- abs(msnidObj$experimentalMassToCharge - msnidObj$calculatedMassToCharge) # setting up filter object filtObj <- MSnIDFilter(msnidObj) filtObj$msmsScore <- list(comparison=">", threshold=10.0) filtObj$mzError <- list(comparison="<", threshold=0.1) # 0.1 Thomson show(filtObj) # applying filter and comparing MSnID object before and after show(msnidObj) msnidObj <- apply_filter(msnidObj, filtObj) show(msnidObj)
MSnID object from c_elegans_A_3_1_21Apr10_Draco_10-03-04_msgfplus.mzid.gz dataset from PeptideAltas repository id PASS00308.
data(c_elegans)
data(c_elegans)
data(c_elegans) msnidObj ## Not run: ## How to download the example mzID file from PeptideAltas: try(setInternet2(FALSE),silent=TRUE) ftp.loc <- "ftp://PASS00308:[email protected]/MSGFPlus_Results/MZID_Files/c_elegans_A_3_1_21Apr10_Draco_10-03-04_msgfplus.mzid.gz" download.file(ftp.loc, "c_elegans.mzid.gz") ## End(Not run) ## Not run: ## Script for generation of the C. elegans example data: msnidObj <- MSnID(".") mzids <- system.file("extdata","c_elegans.mzid.gz",package="MSnID") msnidObj <- read_mzIDs(msnidObj, mzids) save(msnidObj, file='c_elegans.RData', compress='xz', compression_level=9) # MD5 sum for the file is: a7c511a6502a6419127f1e46db48ed92 digest::digest(msnidObj) # clean up the cache directory unlink(".Rcache", recursive=TRUE) ## End(Not run)
data(c_elegans) msnidObj ## Not run: ## How to download the example mzID file from PeptideAltas: try(setInternet2(FALSE),silent=TRUE) ftp.loc <- "ftp://PASS00308:[email protected]/MSGFPlus_Results/MZID_Files/c_elegans_A_3_1_21Apr10_Draco_10-03-04_msgfplus.mzid.gz" download.file(ftp.loc, "c_elegans.mzid.gz") ## End(Not run) ## Not run: ## Script for generation of the C. elegans example data: msnidObj <- MSnID(".") mzids <- system.file("extdata","c_elegans.mzid.gz",package="MSnID") msnidObj <- read_mzIDs(msnidObj, mzids) save(msnidObj, file='c_elegans.RData', compress='xz', compression_level=9) # MD5 sum for the file is: a7c511a6502a6419127f1e46db48ed92 digest::digest(msnidObj) # clean up the cache directory unlink(".Rcache", recursive=TRUE) ## End(Not run)
Adjusts parameters in the "MSnIDFilter" instance to achieve the most number of spectra, peptides or proteins/accessions within pre-set FDR upper limit.
optimize_filter(filterObj, msnidObj, fdr.max, method, level, n.iter, mc.cores=NULL)
optimize_filter(filterObj, msnidObj, fdr.max, method, level, n.iter, mc.cores=NULL)
filterObj |
An instance of class "MSnIDFilter". |
msnidObj |
An instance of class "MSnID". |
fdr.max |
Upper limit on acceptable false discovery rate (FDR). |
method |
Optimization method. Possible values are "Grid" or
same values as for the |
level |
Determines at what level to perform optimization. Possible values are "PSM", "peptides" or "accession". |
n.iter |
For method "Grid" is approxomate number of evaluation point. For "Nelder-Mean" and "SANN" methods see optim. |
mc.cores |
Controls the number of enabled cores for parallel processing.
Make sense only for "Grid" optimizaton. Default is
|
The "Grid" method is brute-force optimization through evaluation of approximately n.iter combinations of the parameters set in the "MSnIDFilter" object. The enumeration of the parameter combinations proceeds as follows. The n.iter number is getting split given the dimensionality of the problem (that is the number of filter parameters in the "MSnIDFilter" object. For each parameter the evaluation points are equally spaced according to quantiles of the parameter distribution. This way we enumerate the grid that has more evaluation points in relatively more dense areas.
Note, optimization is computationally expensive. Thus, the
optimize_filter
call is memoised using facilities
from the R.cache package. Once the same
call of optime_filter
function issued second time
the results will be retrieved from cache.
Returns an instance of "MSnIDFilter" that is maximized to provide the most number of identifications while maintaining a pre-set confidence (FDR).
Vladislav A Petyuk [email protected]
MSnID
evaluate_filter
apply_filter
data(c_elegans) # explicitely adding parameters that will be used for data filtering msnidObj$msmsScore <- -log10(msnidObj$`MS-GF:SpecEValue`) msnidObj$absParentMassErrorPPM <- abs(mass_measurement_error(msnidObj)) # Setting up filter object filtObj <- MSnIDFilter(msnidObj) filtObj$absParentMassErrorPPM <- list(comparison="<", threshold=10.0) filtObj$msmsScore <- list(comparison=">", threshold=10.0) system.time({ filtObj.grid <- optimize_filter(filtObj, msnidObj, fdr.max=0.01, method="Grid", level="peptide", n.iter=50)}) show(filtObj.grid) # Fine tuning. Nelder-Mead optimization. system.time({ filtObj.nm <- optimize_filter(filtObj.grid, msnidObj, fdr.max=0.01, method="Nelder-Mead", level="peptide", n.iter=50)}) show(filtObj.nm) # Fine tuning. Simulated Annealing optimization. system.time({ filtObj.sann <- optimize_filter(filtObj.grid, msnidObj, fdr.max=0.01, method="SANN", level="peptide", n.iter=50)}) show(filtObj.sann)
data(c_elegans) # explicitely adding parameters that will be used for data filtering msnidObj$msmsScore <- -log10(msnidObj$`MS-GF:SpecEValue`) msnidObj$absParentMassErrorPPM <- abs(mass_measurement_error(msnidObj)) # Setting up filter object filtObj <- MSnIDFilter(msnidObj) filtObj$absParentMassErrorPPM <- list(comparison="<", threshold=10.0) filtObj$msmsScore <- list(comparison=">", threshold=10.0) system.time({ filtObj.grid <- optimize_filter(filtObj, msnidObj, fdr.max=0.01, method="Grid", level="peptide", n.iter=50)}) show(filtObj.grid) # Fine tuning. Nelder-Mead optimization. system.time({ filtObj.nm <- optimize_filter(filtObj.grid, msnidObj, fdr.max=0.01, method="Nelder-Mead", level="peptide", n.iter=50)}) show(filtObj.nm) # Fine tuning. Simulated Annealing optimization. system.time({ filtObj.sann <- optimize_filter(filtObj.grid, msnidObj, fdr.max=0.01, method="SANN", level="peptide", n.iter=50)}) show(filtObj.sann)
Returns the non-redundant list of peptides from the MSnID object
peptides(object)
peptides(object)
object |
An instance of class "MSnID". |
Non-redundant list of peptides.
Vladislav A Petyuk [email protected]
data(c_elegans) head(peptides(msnidObj))
data(c_elegans) head(peptides(msnidObj))
Returns results of MS/MS search (peptide-to-spectrum)
matches in the form of data.frame
.
psms(object, ...) psms(object) <- value
psms(object, ...) psms(object) <- value
object |
An instance of class "MSnID". |
value |
Value is a |
... |
ignored for now |
data.frame
Vladislav A Petyuk [email protected]
data(c_elegans) msnidDF <- psms(msnidObj) head(msnidDF)
data(c_elegans) msnidDF <- psms(msnidObj) head(msnidDF)
Reads mzIdentML files into psms
data.table
slot
of object
MSnID instance. The functionality leverage
mzID package facility. Note, the calls are memoised
using R.cache facility. So if the call with the same
list of files issues again, the results will be read from cache instead of
re-parsing the mzIdentML files.
read_mzIDs(object, mzids, backend)
read_mzIDs(object, mzids, backend)
object |
An instance of class "MSnID" |
mzids |
paths to mzIdentML (mzid) files |
backend |
Package that is leveraged for parsing.
Either 'mzID' or 'mzR' corresponding to
|
mzIdentML files can be either as is or in gzip compressed form (*.mzid.gz).
Returns an instance of "MSnID" class with @psms
data.table slot
populated with MS/MS identifications.
Vladislav A Petyuk [email protected]
## Not run: msnidObj <- MSnID(".") mzids <- system.file("extdata","c_elegans.mzid.gz",package="MSnID") msnidObj <- read_mzIDs(msnidObj, mzids) # clean up the cache directory unlink(".Rcache", recursive=TRUE) ## End(Not run)
## Not run: msnidObj <- MSnID(".") mzids <- system.file("extdata","c_elegans.mzid.gz",package="MSnID") msnidObj <- read_mzIDs(msnidObj, mzids) # clean up the cache directory unlink(".Rcache", recursive=TRUE) ## End(Not run)
Mass spectrometry measurements like any other real-worls measurements are prone to systematic errors. Typically they are minimized by instrument calibration prior the analysis. Nonetheless, the calibration may drift over time or be affected by some adverse factors (temperature or space charge fluctuations).
This function estimates and removes the systematic error from the datasets.
The side effect is the recalibrated experimentalMassToCharge
values.
recalibrate(object)
recalibrate(object)
object |
An instance of class "MSnID". |
Currently it employs a very simple method of zero-centering the histogram of mass measurement errors. In the future it will contain more sophisticated recalibration routines.
"MSnID" class instance with updated experimentalMassToCharge
.
Vladislav A Petyuk [email protected]
MSnID
mass_measurement_error
correct_peak_selection
data(c_elegans) # first let's fix the error of picking wrong monoisotopic peak # otherwise the mass error range will be very large msnidObj <- correct_peak_selection(msnidObj) # original mass error in ppm ppm <- mass_measurement_error(msnidObj) hist(ppm, 200, xlim=c(-20,+20)) # The dataset is well calibrated. So let's introduce # some mass measurement error. msnidObj$experimentalMassToCharge <- msnidObj$experimentalMassToCharge * (1+0.00001) # mass error (in ppm) after artificial de-calibration ppm <- mass_measurement_error(msnidObj) hist(ppm, 200, xlim=c(-20,+20)) # recalibration msnidObj <- recalibrate(msnidObj) ppm <- mass_measurement_error(msnidObj) hist(ppm, 200, xlim=c(-20,+20))
data(c_elegans) # first let's fix the error of picking wrong monoisotopic peak # otherwise the mass error range will be very large msnidObj <- correct_peak_selection(msnidObj) # original mass error in ppm ppm <- mass_measurement_error(msnidObj) hist(ppm, 200, xlim=c(-20,+20)) # The dataset is well calibrated. So let's introduce # some mass measurement error. msnidObj$experimentalMassToCharge <- msnidObj$experimentalMassToCharge * (1+0.00001) # mass error (in ppm) after artificial de-calibration ppm <- mass_measurement_error(msnidObj) hist(ppm, 200, xlim=c(-20,+20)) # recalibration msnidObj <- recalibrate(msnidObj) ppm <- mass_measurement_error(msnidObj) hist(ppm, 200, xlim=c(-20,+20))
Changes accessions from one protein id to another.
remap_accessions(object, conversion_table, extraction_pttrn=c("\\|([^|-]+)(-\\d+)?\\|", "([A-Z]P_\\d+)", "(ENS[A-Z0-9]+)"), path_to_FASTA=NULL)
remap_accessions(object, conversion_table, extraction_pttrn=c("\\|([^|-]+)(-\\d+)?\\|", "([A-Z]P_\\d+)", "(ENS[A-Z0-9]+)"), path_to_FASTA=NULL)
object |
An instance of class "MSnID". |
conversion_table |
(data.frame) first column in the data frame corresponds to identifiers in the FASTA file. Second column is the new identifier. |
extraction_pttrn |
(string) regex pattern that extract protein identifier from
FASTA entry name as first group (that is |
path_to_FASTA |
(string) path to FASTA file. If provided only accessions present in the given FASTA file will be retained. |
Returns an instance of "MSnID" with updated accessions
.
Vladislav A Petyuk [email protected]
m <- MSnID(".") mzids <- system.file("extdata","phospho.mzid.gz",package="MSnID") m <- read_mzIDs(m, mzids) head(m$accessions) conv_tab <- fetch_conversion_table("Homo sapiens", "UNIPROT", "SYMBOL") m2 <- remap_accessions(m, conv_tab, "\\|([^|-]+)(-\\d+)?\\|") head(m2$accessions) unlink(".Rcache", recursive=TRUE)
m <- MSnID(".") mzids <- system.file("extdata","phospho.mzid.gz",package="MSnID") m <- read_mzIDs(m, mzids) head(m$accessions) conv_tab <- fetch_conversion_table("Homo sapiens", "UNIPROT", "SYMBOL") m2 <- remap_accessions(m, conv_tab, "\\|([^|-]+)(-\\d+)?\\|") head(m2$accessions) unlink(".Rcache", recursive=TRUE)
Remaps entries in the FASTA file from one protein identifier to another according to provided conersion table. Input is a path to FASTA file. Output is also a path to a new FASTA file with updated entry names.
remap_fasta_entry_names(path_to_FASTA, conversion_table, extraction_pttrn=c("\\|([^|-]+)(-\\d+)?\\|", "([A-Z]P_\\d+)", "(ENS[A-Z0-9]+)"))
remap_fasta_entry_names(path_to_FASTA, conversion_table, extraction_pttrn=c("\\|([^|-]+)(-\\d+)?\\|", "([A-Z]P_\\d+)", "(ENS[A-Z0-9]+)"))
path_to_FASTA |
(string) path to FASTA file |
conversion_table |
(data.frame) first column in the data frame corresponds to identifiers in the FASTA file. Second column is the new identifier. |
extraction_pttrn |
(string) regex pattern that extract protein identifier from
FASTA entry name as first group (that is |
path to new FASTA file
Vladislav A Petyuk [email protected]
library(Biostrings) fst_path <- system.file("extdata","for_phospho.fasta.gz",package="MSnID") readAAStringSet(fst_path) conv_tab <- fetch_conversion_table("Homo sapiens", "UNIPROT", "SYMBOL") fst_path_2 <- remap_fasta_entry_names(fst_path, conv_tab, "\\|([^|-]+)(-\\d+)?\\|") readAAStringSet(fst_path_2) file.remove(fst_path_2)
library(Biostrings) fst_path <- system.file("extdata","for_phospho.fasta.gz",package="MSnID") readAAStringSet(fst_path) conv_tab <- fetch_conversion_table("Homo sapiens", "UNIPROT", "SYMBOL") fst_path_2 <- remap_fasta_entry_names(fst_path, conv_tab, "\\|([^|-]+)(-\\d+)?\\|") readAAStringSet(fst_path_2) file.remove(fst_path_2)
Parses out masses from the modification
column and return them as
table
with counts.
report_mods(object, ...)
report_mods(object, ...)
object |
An instance of class "MSnID". |
... |
currently not used |
Counts of each modification mass listed in modification column
.
Vladislav A Petyuk [email protected]
msnidObj <- MSnID(".") mzids <- system.file("extdata","phospho.mzid.gz",package="MSnID") msnidObj <- read_mzIDs(msnidObj, mzids) # reports the masses and number of their occurences report_mods(msnidObj) # clean-up cache unlink(".Rcache", recursive=TRUE)
msnidObj <- MSnID(".") mzids <- system.file("extdata","phospho.mzid.gz",package="MSnID") msnidObj <- read_mzIDs(msnidObj, mzids) # reports the masses and number of their occurences report_mods(msnidObj) # clean-up cache unlink(".Rcache", recursive=TRUE)