Title: | Protein Significance Analysis in shotgun mass spectrometry-based proteomic experiments with tandem mass tag (TMT) labeling |
---|---|
Description: | The package provides statistical tools for detecting differentially abundant proteins in shotgun mass spectrometry-based proteomic experiments with tandem mass tag (TMT) labeling. It provides multiple functionalities, including aata visualization, protein quantification and normalization, and statistical modeling and inference. Furthermore, it is inter-operable with other data processing tools, such as Proteome Discoverer, MaxQuant, OpenMS and SpectroMine. |
Authors: | Devon Kohler [aut, cre], Ting Huang [aut], Meena Choi [aut], Mateusz Staniak [aut], Sicheng Hao [aut], Olga Vitek [aut] |
Maintainer: | Devon Kohler <[email protected]> |
License: | Artistic-2.0 |
Version: | 2.15.1 |
Built: | 2024-12-18 05:57:09 UTC |
Source: | https://github.com/bioc/MSstatsTMT |
Annotation of example data, raw.mine, in this package. It should be prepared by users. The variables are as follows:
annotation.mine
annotation.mine
A data frame with 72 rows and 7 variables.
Run : MS run ID. It should be the same as R.FileName info in raw.mine
Channel : Labeling information (TMT6_126, ..., TMT6_131). The channels should be consistent with the channel columns in raw.mine.
Condition : Condition (ex. Healthy, Cancer, Time0). If the channal doesn't have sample, please add 'Empty' under Condition.
Mixture : Mixture of samples labeled with different TMT reagents, which can be analyzed in a single mass spectrometry experiment.
TechRepMixture : Technical replicate of one mixture. One mixture may have multiple technical replicates. For example, if 'TechRepMixture' = 1, 2 are the two technical replicates of one mixture, then they should match with same 'Mixture' value.
Fraction : Fraction ID. One technical replicate of one mixture may be fractionated into multiple fractions to increase the analytical depth. Then one technical replicate of one mixture should correspond to multuple fractions. For example, if 'Fraction' = 1, 2, 3 are three fractions of the first technical replicate of one TMT mixture of biological subjects, then they should have same 'TechRepMixture' and 'Mixture' value.
BioReplicate : Unique ID for biological subject. If the channal doesn't have sample, please add 'Empty' under BioReplicate
head(annotation.mine)
head(annotation.mine)
Annotation of example data, evidence, in this package. It should be prepared by users. The variables are as follows:
annotation.mq
annotation.mq
A data frame with 150 rows and 7 variables.
Run : MS run ID. It should be the same as Raw.file info in raw.mq
Channel : Labeling information (channel.0, ..., channel.9). The channel index should be consistent with the channel columns in raw.mq.
Condition : Condition (ex. Healthy, Cancer, Time0)
Mixture : Mixture of samples labeled with different TMT reagents, which can be analyzed in a single mass spectrometry experiment. If the channal doesn't have sample, please add ‘Empty’ under Condition.
TechRepMixture : Technical replicate of one mixture. One mixture may have multiple technical replicates. For example, if ‘TechRepMixture’ = 1, 2 are the two technical replicates of one mixture, then they should match with same ‘Mixture’ value.
Fraction : Fraction ID. One technical replicate of one mixture may be fractionated into multiple fractions to increase the analytical depth. Then one technical replicate of one mixture should correspond to multuple fractions. For example, if ‘Fraction’ = 1, 2, 3 are three fractions of the first technical replicate of one TMT mixture of biological subjects, then they should have same ‘TechRepMixture’ and ‘Mixture’ value.
BioReplicate : Unique ID for biological subject. If the channal doesn't have sample, please add ‘Empty’ under BioReplicate.
head(annotation.mq)
head(annotation.mq)
Annotation of example data, raw.pd, in this package. It should be prepared by users. The variables are as follows:
annotation.pd
annotation.pd
A data frame with 150 rows and 7 variables.
Run : MS run ID. It should be the same as Spectrum.File info in raw.pd.
Channel : Labeling information (126, ... 131). It should be consistent with the channel columns in raw.pd.
Condition : Condition (ex. Healthy, Cancer, Time0)
Mixture : Mixture of samples labeled with different TMT reagents, which can be analyzed in a single mass spectrometry experiment. If the channal doesn't have sample, please add ‘Empty’ under Condition.
TechRepMixture : Technical replicate of one mixture. One mixture may have multiple technical replicates. For example, if ‘TechRepMixture’ = 1, 2 are the two technical replicates of one mixture, then they should match with same ‘Mixture’ value.
Fraction : Fraction ID. One technical replicate of one mixture may be fractionated into multiple fractions to increase the analytical depth. Then one technical replicate of one mixture should correspond to multuple fractions. For example, if ‘Fraction’ = 1, 2, 3 are three fractions of the first technical replicate of one TMT mixture of biological subjects, then they should have same ‘TechRepMixture’ and ‘Mixture’ value.
BioReplicate : Unique ID for biological subject. If the channal doesn't have sample, please add ‘Empty’ under BioReplicate.
head(annotation.pd)
head(annotation.pd)
To illustrate the quantitative data and quality control of MS runs, dataProcessPlotsTMT takes the quantitative data and summarized data from function 'proteinSummarization' as input and generate two types of figures in pdf files as output : (1) profile plot (specify "ProfilePlot" in option type), to identify the potential sources of variation for each protein; (2) quality control plot (specify "QCPlot" in option type), to evaluate the systematic bias between MS runs and channels.
dataProcessPlotsTMT( data, type, featureName = "Transition", ylimUp = FALSE, ylimDown = FALSE, x.axis.size = 10, y.axis.size = 10, text.size = 2, text.angle = 90, legend.size = 7, dot.size.profile = 2, ncol.guide = 5, width = 10, height = 10, which.Protein = "all", originalPlot = TRUE, summaryPlot = TRUE, address = "", isPlotly = FALSE )
dataProcessPlotsTMT( data, type, featureName = "Transition", ylimUp = FALSE, ylimDown = FALSE, x.axis.size = 10, y.axis.size = 10, text.size = 2, text.angle = 90, legend.size = 7, dot.size.profile = 2, ncol.guide = 5, width = 10, height = 10, which.Protein = "all", originalPlot = TRUE, summaryPlot = TRUE, address = "", isPlotly = FALSE )
data |
the output of |
type |
choice of visualization. "ProfilePlot" represents profile plot of log intensities across MS runs. "QCPlot" represents box plots of log intensities across channels and MS runs. |
featureName |
for "ProfilePlot" only, "Transition" (default) means printing feature legend in transition-level; "Peptide" means printing feature legend in peptide-level; "NA" means no feature legend printing. FALSE(Default) for Profile Plot and QC Plot uses the upper limit as rounded off maximum of log2(intensities) after normalization + 3.. |
ylimUp |
upper limit for y-axis in the log scale. |
ylimDown |
lower limit for y-axis in the log scale. FALSE(Default) for Profile Plot and QC Plot uses 0.. |
x.axis.size |
size of x-axis labeling for "Run" and "channel in Profile Plot and QC Plot. |
y.axis.size |
size of y-axis labels. Default is 10. |
text.size |
size of labels represented each condition at the top of Profile plot and QC plot. Default is 4. |
text.angle |
angle of labels represented each condition at the top of Profile plot and QC plot. Default is 0. |
legend.size |
size of legend above Profile plot. Default is 7. |
dot.size.profile |
size of dots in Profile plot. Default is 2. |
ncol.guide |
number of columns for legends at the top of plot. Default is 5. |
width |
width of the saved pdf file. Default is 10. |
height |
height of the saved pdf file. Default is 10. |
which.Protein |
Protein list to draw plots. List can be names of Proteins or order numbers of Proteins. Default is "all", which generates all plots for each protein. For QC plot, "allonly" will generate one QC plot with all proteins. |
originalPlot |
TRUE(default) draws original profile plots, without normalization. |
summaryPlot |
TRUE(default) draws profile plots with protein summarization for each channel and MS run. |
address |
the name of folder that will store the results. Default folder is the current working directory. The other assigned folder has to be existed under the current working directory. An output pdf file is automatically created with the default name of "ProfilePlot.pdf" or "QCplot.pdf". The command address can help to specify where to store the file as well as how to modify the beginning of the file name. If address=FALSE, plot will be not saved as pdf file but showed in window. |
isPlotly |
Parameter to use Plotly or ggplot2. If set to TRUE, MSstats will save Plotly plots as HTML files. If set to FALSE MSstats will save ggplot2 plots as PDF files |
plot or pdf
data(input.pd) quant.msstats = proteinSummarization(input.pd, method="msstats", global_norm=TRUE, reference_norm=TRUE) ## Profile plot dataProcessPlotsTMT(data=quant.msstats, type='ProfilePlot', width = 21, height = 7) ## NottoRun: QC plot # dataProcessPlotsTMT(data=quant.msstats, # type='QCPlot', # width = 21, # height = 7)
data(input.pd) quant.msstats = proteinSummarization(input.pd, method="msstats", global_norm=TRUE, reference_norm=TRUE) ## Profile plot dataProcessPlotsTMT(data=quant.msstats, type='ProfilePlot', width = 21, height = 7) ## NottoRun: QC plot # dataProcessPlotsTMT(data=quant.msstats, # type='QCPlot', # width = 21, # height = 7)
Calculate sample size for future experiments of a TMT experiment based on intensity-based linear model. Two options of the calculation: (1) number of biological replicates per condition, (2) power.
designSampleSizeTMT( data, desiredFC, FDR = 0.05, numSample = TRUE, power = 0.9, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL )
designSampleSizeTMT( data, desiredFC, FDR = 0.05, numSample = TRUE, power = 0.9, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL )
data |
'FittedModel' in testing output from function groupComparisonTMT. |
desiredFC |
the range of a desired fold change which includes the lower and upper values of the desired fold change. |
FDR |
a pre-specified false discovery ratio (FDR) to control the overall false positive rate. Default is 0.05 |
numSample |
minimal number of biological replicates per condition. TRUE represents you require to calculate the sample size for this category, else you should input the exact number of biological replicates. |
power |
a pre-specified statistical power which defined as the probability of detecting a true fold change. TRUE represent you require to calculate the power for this category, else you should input the average of power you expect. Default is 0.9 |
use_log_file |
logical. If TRUE, information about data processing will be saved to a file. |
append |
logical. If TRUE, information about data processing will be added to an existing log file. |
verbose |
logical. If TRUE, information about data processing wil be printed to the console. |
log_file_path |
character. Path to a file to which information about data processing will be saved. If not provided, such a file will be created automatically. If 'append = TRUE', has to be a valid path to a file. |
The function fits the model and uses variance components to calculate sample size. The underlying model fitting with intensity-based linear model with technical MS run replication. Estimated sample size is rounded to 0 decimal. The function can only obtain either one of the categories of the sample size calculation (numSample, numPep, numTran, power) at the same time.
data.frame - sample size calculation results including varibles: desiredFC, numSample, FDR, and power.
data(input.pd) # use protein.summarization() to get protein abundance data quant.pd.msstats = proteinSummarization(input.pd, method="msstats", global_norm=TRUE, reference_norm=TRUE) test.pairwise = groupComparisonTMT(quant.pd.msstats, save_fitted_models = TRUE) head(test.pairwise$ComparisonResult) ## Calculate sample size for future experiments: #(1) Minimal number of biological replicates per condition designSampleSizeTMT(data=test.pairwise$FittedModel, numSample=TRUE, desiredFC=c(1.25,1.75), FDR=0.05, power=0.8) #(2) Power calculation designSampleSizeTMT(data=test.pairwise$FittedModel, numSample=2, desiredFC=c(1.25,1.75), FDR=0.05, power=TRUE)
data(input.pd) # use protein.summarization() to get protein abundance data quant.pd.msstats = proteinSummarization(input.pd, method="msstats", global_norm=TRUE, reference_norm=TRUE) test.pairwise = groupComparisonTMT(quant.pd.msstats, save_fitted_models = TRUE) head(test.pairwise$ComparisonResult) ## Calculate sample size for future experiments: #(1) Minimal number of biological replicates per condition designSampleSizeTMT(data=test.pairwise$FittedModel, numSample=TRUE, desiredFC=c(1.25,1.75), FDR=0.05, power=0.8) #(2) Power calculation designSampleSizeTMT(data=test.pairwise$FittedModel, numSample=2, desiredFC=c(1.25,1.75), FDR=0.05, power=TRUE)
Example of evidence.txt from MaxQuant. It is the input for MaxQtoMSstatsTMTFormat function, with proteinGroups.txt and annotation file. Annotation file should be made by users. It includes peak intensities for 10 proteins among 15 MS runs with TMT10. The important variables are as follows:
evidence
evidence
A data frame with 1075 rows and 105 variables.
Proteins
Protein.group.IDs
Modified.sequence
Charge
Raw.file
Score
Potential.contaminant
Reverse
Channels : Reporter.intensity.corrected.0, ..., Reporter.intensity.corrected.9
head(evidence)
head(evidence)
Tests for significant changes in protein abundance across conditions based on a family of linear mixed-effects models in TMT experiment. Experimental design of case-control study (patients are not repeatedly measured) is automatically determined based on proper statistical model.
groupComparisonTMT( data, contrast.matrix = "pairwise", moderated = FALSE, adj.method = "BH", remove_norm_channel = TRUE, remove_empty_channel = TRUE, save_fitted_models = FALSE, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL )
groupComparisonTMT( data, contrast.matrix = "pairwise", moderated = FALSE, adj.method = "BH", remove_norm_channel = TRUE, remove_empty_channel = TRUE, save_fitted_models = FALSE, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL )
data |
the output of |
contrast.matrix |
Comparison between conditions of interests. 1) default is "pairwise", which compare all possible pairs between two conditions. 2) Otherwise, users can specify the comparisons of interest. Based on the levels of conditions, specify 1 or -1 to the conditions of interests and 0 otherwise. The levels of conditions are sorted alphabetically. |
moderated |
TRUE will moderate t statistic; FALSE (default) uses ordinary t statistic. |
adj.method |
adjusted method for multiple comparison. "BH" is default. |
remove_norm_channel |
TRUE(default) removes "Norm" channels from protein level data. |
remove_empty_channel |
TRUE(default) removes "Empty" channels from protein level data. |
save_fitted_models |
logical, if TRUE, fitted models will be added to |
use_log_file |
logical. If TRUE, information about data processing will be saved to a file. |
append |
logical. If TRUE, information about data processing will be added to an existing log file. |
verbose |
logical. If TRUE, information about data processing wil be printed to the console. |
log_file_path |
character. Path to a file to which information about data processing will be saved. If not provided, such a file will be created automatically. If 'append = TRUE', has to be a valid path to a file. |
a list that consists of the following elements: (1) ComparisonResult: statistical testing results; (2) FittedModel: the fitted linear models
data(input.pd) # use protein.summarization() to get protein abundance data quant.pd.msstats = proteinSummarization(input.pd, method="msstats", global_norm=TRUE, reference_norm=TRUE) test.pairwise = groupComparisonTMT(quant.pd.msstats, moderated = TRUE) head(test.pairwise$ComparisonResult) # Only compare condition 0.125 and 1 levels(quant.pd.msstats$ProteinLevelData$Condition) # Compare condition 1 and 0.125 comparison=matrix(c(-1,0,0,1),nrow=1) # Set the nafmes of each row row.names(comparison)="1-0.125" # Set the column names colnames(comparison)= c("0.125", "0.5", "0.667", "1") test.contrast = groupComparisonTMT(data = quant.pd.msstats, contrast.matrix = comparison, moderated = TRUE) head(test.contrast$ComparisonResult)
data(input.pd) # use protein.summarization() to get protein abundance data quant.pd.msstats = proteinSummarization(input.pd, method="msstats", global_norm=TRUE, reference_norm=TRUE) test.pairwise = groupComparisonTMT(quant.pd.msstats, moderated = TRUE) head(test.pairwise$ComparisonResult) # Only compare condition 0.125 and 1 levels(quant.pd.msstats$ProteinLevelData$Condition) # Compare condition 1 and 0.125 comparison=matrix(c(-1,0,0,1),nrow=1) # Set the nafmes of each row row.names(comparison)="1-0.125" # Set the column names colnames(comparison)= c("0.125", "0.5", "0.667", "1") test.contrast = groupComparisonTMT(data = quant.pd.msstats, contrast.matrix = comparison, moderated = TRUE) head(test.contrast$ComparisonResult)
It is made from raw.pd
and annotation.pd
,
which is the output of PDtoMSstatsTMTFormat function.
It should include the required columns as below.
input.pd
input.pd
A data frame with 20110 rows and 11 variables.
ProteinName : Protein ID
PeptideSequence : peptide sequence
Charge : peptide charge
PSM : peptide ion and spectra match
Channel : Labeling information (126, ... 131)
Condition : Condition (ex. Healthy, Cancer, Time0)
BioReplicate : Unique ID for biological subject.
Run : MS run ID
Mixture : Unique ID for TMT mixture.
TechRepMixture : Unique ID for technical replicate of one TMT mixture.
Intensity: Protein Abundance
head(input.pd)
head(input.pd)
Generate MSstatsTMT required input format from MaxQuant output
MaxQtoMSstatsTMTFormat( evidence, proteinGroups, annotation, which.proteinid = "Proteins", rmProt_Only.identified.by.site = FALSE, useUniquePeptide = TRUE, rmPSM_withfewMea_withinRun = TRUE, rmProtein_with1Feature = FALSE, summaryforMultipleRows = sum, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL, ... )
MaxQtoMSstatsTMTFormat( evidence, proteinGroups, annotation, which.proteinid = "Proteins", rmProt_Only.identified.by.site = FALSE, useUniquePeptide = TRUE, rmPSM_withfewMea_withinRun = TRUE, rmProtein_with1Feature = FALSE, summaryforMultipleRows = sum, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL, ... )
evidence |
name of 'evidence.txt' data, which includes feature-level data. |
proteinGroups |
name of 'proteinGroups.txt' data. |
annotation |
data frame which contains column Run, Fraction, TechRepMixture, Mixture, Channel, BioReplicate, Condition. Refer to the example 'annotation.mq' for the meaning of each column. |
which.proteinid |
Use 'Proteins' (default) column for protein name. 'Leading.proteins' or 'Leading.razor.proteins' or 'Gene.names' can be used instead to get the protein ID with single protein. However, those can potentially have the shared peptides. |
rmProt_Only.identified.by.site |
TRUE will remove proteins with '+' in 'Only.identified.by.site' column from proteinGroups.txt, which was identified only by a modification site. FALSE is the default. |
useUniquePeptide |
TRUE(default) removes peptides that are assigned for more than one proteins. We assume to use unique peptide for each protein. |
rmPSM_withfewMea_withinRun |
TRUE (default) will remove the features that have 1 or 2 measurements within each Run. |
rmProtein_with1Feature |
TRUE will remove the proteins which have only 1 peptide and charge. Defaut is FALSE. |
summaryforMultipleRows |
sum(default) or max - when there are multiple measurements for certain feature in certain run, select the feature with the largest summation or maximal value. |
use_log_file |
logical. If TRUE, information about data processing will be saved to a file. |
append |
logical. If TRUE, information about data processing will be added to an existing log file. |
verbose |
logical. If TRUE, information about data processing wil be printed to the console. |
log_file_path |
character. Path to a file to which information about data processing will be saved. If not provided, such a file will be created automatically. If 'append = TRUE', has to be a valid path to a file. |
... |
additional parameters to 'data.table::fread'. |
data.frame of class "MSstatsTMT"
head(evidence) head(proteinGroups) head(annotation.mq) input.mq <- MaxQtoMSstatsTMTFormat(evidence, proteinGroups, annotation.mq) head(input.mq)
head(evidence) head(proteinGroups) head(annotation.mq) input.mq <- MaxQtoMSstatsTMTFormat(evidence, proteinGroups, annotation.mq) head(input.mq)
A set of tools for detecting differentially abundant peptides and proteins in shotgun mass spectrometry-based proteomic experiments with tandem mass tag (TMT) labeling.
PDtoMSstatsTMTFormat
: generates MSstatsTMT required input format for Proteome discoverer output.
MaxQtoMSstatsTMTFormat
: generates MSstatsTMT required input format for MaxQuant output.
SpectroMinetoMSstatsTMTFormat
: generates MSstatsTMT required input format for SpectroMine output.
OpenMStoMSstatsTMTFormat
: generates MSstatsTMT required input format for OpenMS output.
proteinSummarization
: summarizes PSM level quantification to protein level quantification.
dataProcessPlotsTMT
: visualizes for explanatory data analysis.
groupComparisonTMT
: tests for significant changes in protein abundance across conditions.
Maintainer: Devon Kohler [email protected]
Authors:
Ting Huang [email protected]
Meena Choi [email protected]
Mateusz Staniak [email protected]
Sicheng Hao [email protected]
Olga Vitek [email protected]
Useful links:
Generate MSstatsTMT required input format for OpenMS output
OpenMStoMSstatsTMTFormat( input, useUniquePeptide = TRUE, rmPSM_withfewMea_withinRun = TRUE, rmProtein_with1Feature = FALSE, summaryforMultiplePSMs = sum, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL, ... )
OpenMStoMSstatsTMTFormat( input, useUniquePeptide = TRUE, rmPSM_withfewMea_withinRun = TRUE, rmProtein_with1Feature = FALSE, summaryforMultiplePSMs = sum, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL, ... )
input |
MSstatsTMT report from OpenMS |
useUniquePeptide |
TRUE(default) removes peptides that are assigned for more than one proteins. We assume to use unique peptide for each protein. |
rmPSM_withfewMea_withinRun |
TRUE (default) will remove the features that have 1 or 2 measurements within each Run. |
rmProtein_with1Feature |
TRUE will remove the proteins which have only 1 peptide and charge. Defaut is FALSE. |
summaryforMultiplePSMs |
sum(default) or max - when there are multiple measurements for certain feature in certain run, select the feature with the largest summation or maximal value. |
use_log_file |
logical. If TRUE, information about data processing will be saved to a file. |
append |
logical. If TRUE, information about data processing will be added to an existing log file. |
verbose |
logical. If TRUE, information about data processing wil be printed to the console. |
log_file_path |
character. Path to a file to which information about data processing will be saved. If not provided, such a file will be created automatically. If 'append = TRUE', has to be a valid path to a file. |
... |
additional parameters to 'data.table::fread'. |
'data.frame' of class 'MSstatsTMT'.
head(raw.om) input.om <- OpenMStoMSstatsTMTFormat(raw.om) head(input.om)
head(raw.om) input.om <- OpenMStoMSstatsTMTFormat(raw.om) head(input.om)
Convert Proteome Discoverer output to MSstatsTMT format.
PDtoMSstatsTMTFormat( input, annotation, which.proteinid = "Protein.Accessions", useNumProteinsColumn = TRUE, useUniquePeptide = TRUE, rmPSM_withfewMea_withinRun = TRUE, rmProtein_with1Feature = FALSE, summaryforMultipleRows = sum, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL, ... )
PDtoMSstatsTMTFormat( input, annotation, which.proteinid = "Protein.Accessions", useNumProteinsColumn = TRUE, useUniquePeptide = TRUE, rmPSM_withfewMea_withinRun = TRUE, rmProtein_with1Feature = FALSE, summaryforMultipleRows = sum, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL, ... )
input |
PD report or a path to it. |
annotation |
annotation with Run, Fraction, TechRepMixture, Mixture, Channel, BioReplicate, Condition columns or a path to file. Refer to the example 'annotation' for the meaning of each column. |
which.proteinid |
Use 'Protein.Accessions'(default) column for protein name. 'Master.Protein.Accessions' can be used instead to get the protein name with single protein. |
useNumProteinsColumn |
logical, TURE(default) remove shared peptides by information of # Proteins column in PSM sheet. |
useUniquePeptide |
logical, if TRUE (default) removes peptides that are assigned for more than one proteins. We assume to use unique peptide for each protein. |
rmPSM_withfewMea_withinRun |
TRUE (default) will remove the features that have 1 or 2 measurements within each Run. |
rmProtein_with1Feature |
TRUE will remove the proteins which have only 1 peptide and charge. Defaut is FALSE. |
summaryforMultipleRows |
sum (default) or max - when there are multiple measurements for certain feature in certain run, select the feature with the largest summation or maximal value. |
use_log_file |
logical. If TRUE, information about data processing will be saved to a file. |
append |
logical. If TRUE, information about data processing will be added to an existing log file. |
verbose |
logical. If TRUE, information about data processing wil be printed to the console. |
log_file_path |
character. Path to a file to which information about data processing will be saved. If not provided, such a file will be created automatically. If 'append = TRUE', has to be a valid path to a file. |
... |
additional parameters to 'data.table::fread'. |
'data.frame' of class 'MSstatsTMT'
head(raw.pd) head(annotation.pd) input.pd <- PDtoMSstatsTMTFormat(raw.pd, annotation.pd) head(input.pd)
head(raw.pd) head(annotation.pd) input.pd <- PDtoMSstatsTMTFormat(raw.pd, annotation.pd) head(input.pd)
Convert Philosopher (Fragpipe) output to MSstatsTMT format.
PhilosophertoMSstatsTMTFormat( input, annotation, protein_id_col = "Protein", peptide_id_col = "Peptide.Sequence", Purity_cutoff = 0.6, PeptideProphet_prob_cutoff = 0.7, useUniquePeptide = TRUE, rmPSM_withfewMea_withinRun = TRUE, rmPeptide_OxidationM = TRUE, rmProtein_with1Feature = FALSE, summaryforMultipleRows = sum, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL, ... )
PhilosophertoMSstatsTMTFormat( input, annotation, protein_id_col = "Protein", peptide_id_col = "Peptide.Sequence", Purity_cutoff = 0.6, PeptideProphet_prob_cutoff = 0.7, useUniquePeptide = TRUE, rmPSM_withfewMea_withinRun = TRUE, rmPeptide_OxidationM = TRUE, rmProtein_with1Feature = FALSE, summaryforMultipleRows = sum, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL, ... )
input |
data.frame of 'msstats.csv' file produced by Philosopher |
annotation |
annotation with Run, Fraction, TechRepMixture, Mixture, Channel, BioReplicate, Condition columns or a path to file. Refer to the example 'annotation' for the meaning of each column. Channel column should be consistent with the channel columns (Ignore the prefix "Channel ") in msstats.csv file. Run column should be consistent with the Spectrum.File columns in msstats.csv file. |
protein_id_col |
Use 'Protein'(default) column for protein name. 'Master.Protein.Accessions' can be used instead to get the protein ID with single protein. |
peptide_id_col |
Use 'Peptide.Sequence'(default) column for peptide sequence. 'Modified.Peptide.Sequence' can be used instead to get the modified peptide sequence. |
Purity_cutoff |
Cutoff for purity. Default is 0.6 |
PeptideProphet_prob_cutoff |
Cutoff for the peptide identification probability. Default is 0.7. The probability is confidence score determined by PeptideProphet and higher values indicate greater confidence. |
useUniquePeptide |
logical, if TRUE (default) removes peptides that are assigned for more than one proteins. We assume to use unique peptide for each protein. |
rmPSM_withfewMea_withinRun |
TRUE (default) will remove the features that have 1 or 2 measurements within each Run. |
rmPeptide_OxidationM |
TRUE (default) will remove the peptides including oxidation (M) sequence. |
rmProtein_with1Feature |
TRUE will remove the proteins which have only 1 peptide and charge. Defaut is FALSE. |
summaryforMultipleRows |
sum (default) or max - when there are multiple measurements for certain feature in certain run, select the feature with the largest summation or maximal value. |
use_log_file |
logical. If TRUE, information about data processing will be saved to a file. |
append |
logical. If TRUE, information about data processing will be added to an existing log file. |
verbose |
logical. If TRUE, information about data processing wil be printed to the console. |
log_file_path |
character. Path to a file to which information about data processing will be saved. If not provided, such a file will be created automatically. If 'append = TRUE', has to be a valid path to a file. |
... |
additional parameters to 'data.table::fread'. |
'data.frame' of class 'MSstatsTMT'
Example of proteinGroup.txt file from MaxQuant, which is identified protein group information file. It is the input for MaxQtoMSstatsTMTFormat function, with evidence.txt and annotation file. It includes identified protein groups for 10 proteins among 15 MS runs with TMT10. The important variables are as follows:
proteinGroups
proteinGroups
A data frame with 1075 rows and 105 variables.
id
Protein.IDs
Only.identified.by.site
Potential.contaminant
Reverse
head(proteinGroups)
head(proteinGroups)
We assume missing values are censored and then impute the missing values. Protein-level summarization from peptide level quantification are performed. After all, global median normalization on peptide level data and normalization between MS runs using reference channels will be implemented.
proteinSummarization( data, method = "msstats", global_norm = TRUE, reference_norm = TRUE, remove_norm_channel = TRUE, remove_empty_channel = TRUE, MBimpute = TRUE, maxQuantileforCensored = NULL, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL, msstats_log_path = NULL )
proteinSummarization( data, method = "msstats", global_norm = TRUE, reference_norm = TRUE, remove_norm_channel = TRUE, remove_empty_channel = TRUE, MBimpute = TRUE, maxQuantileforCensored = NULL, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL, msstats_log_path = NULL )
data |
Name of the output of PDtoMSstatsTMTFormat function or peptide-level quantified data from other tools. It should have columns ProteinName, PeptideSequence, Charge, PSM, Mixture, TechRepMixture, Run, Channel, Condition, BioReplicate, Intensity |
method |
Four different summarization methods to protein-level can be performed : "msstats"(default), "MedianPolish", "Median", "LogSum". |
global_norm |
Global median normalization on peptide level data (equalizing the medians across all the channels and MS runs). Default is TRUE. It will be performed before protein-level summarization. |
reference_norm |
Reference channel based normalization between MS runs on protein level data. TRUE(default) needs at least one reference channel in each MS run, annotated by 'Norm' in Condtion column. It will be performed after protein-level summarization. FALSE will not perform this normalization step. If data only has one run, then reference_norm=FALSE. |
remove_norm_channel |
TRUE(default) removes 'Norm' channels from protein level data. |
remove_empty_channel |
TRUE(default) removes 'Empty' channels from protein level data. |
MBimpute |
only for method="msstats". TRUE (default) imputes missing values by Accelated failure model. FALSE uses minimum value to impute the missing value for each peptide precursor ion. |
maxQuantileforCensored |
We assume missing values are censored. maxQuantileforCensored is Maximum quantile for deciding censored missing value, for instance, 0.999. Default is Null. |
use_log_file |
logical. If TRUE, information about data processing will be saved to a file. |
append |
logical. If TRUE, information about data processing will be added to an existing log file. |
verbose |
logical. If TRUE, information about data processing wil be printed to the console. |
log_file_path |
character. Path to a file to which information about data processing will be saved. If not provided, such a file will be created automatically. If 'append = TRUE', has to be a valid path to a file. |
msstats_log_path |
path to a MSstats log file |
list that consists of two data.frames with feature-level (FeatureLevelData) and protein-level data (ProteinLevelData)
data(input.pd) quant.pd.msstats <- proteinSummarization(input.pd, method = "msstats", global_norm = TRUE, reference_norm = TRUE) head(quant.pd.msstats$ProteinLevelData)
data(input.pd) quant.pd.msstats <- proteinSummarization(input.pd, method = "msstats", global_norm = TRUE, reference_norm = TRUE) head(quant.pd.msstats$ProteinLevelData)
Example of SpectroMine PSM sheet. It is the output of SpectroMine and the input for SpectroMinetoMSstatsTMTFormat function, with annotation file. Annotation file should be made by users. It includes peak intensities for 10 proteins among 12 MS runs with TMT-6plex. The important variables are as follows:
raw.mine
raw.mine
A data frame with 170 rows and 28 variables.
PG.ProteinAccessions
P.MoleculeID
PP.Charge
R.FileName
PG.QValue
PSM.Qvalue
Channels : PSM.TMT6_126..Raw., ..., PSM.TMT6_131..Raw.
head(raw.mine)
head(raw.mine)
Example of MSstatsTMT PSM sheet from MaxQuant. It is the input for OpenMStoMSstatsTMTFormat function. It includes peak intensities for 10 proteins among 27 MS runs from three TMT10 mixtures. The important variables are as follows:
raw.om
raw.om
A data frame with 860 rows and 13 variables.
RetentionTime
ProteinName
PeptideSequence
Charge
Channel
Condition
BioReplicate
Run
Mixture
TechRepMixture
Fraction
Intensity
Reference
head(raw.om)
head(raw.om)
Example of Proteome discover PSM sheet. It is the input for PDtoMSstatsTMTFormat function, with annotation file. Annotation file should be made by users. It includes peak intensities for 10 proteins among 15 MS runs with TMT-10plex. The variables are as follows:
raw.pd
raw.pd
A data frame with 2858 rows and 50 variables.
Master.Protein.Accessions
Protein.Accessions
Annotated.Sequence
Charge
Ions.Score
Spectrum.File
Quan.Info
Channels : 126, ..., 131
head(raw.pd)
head(raw.pd)
Import data from SpectroMine
SpectroMinetoMSstatsTMTFormat( input, annotation, filter_with_Qvalue = TRUE, qvalue_cutoff = 0.01, useUniquePeptide = TRUE, rmPSM_withfewMea_withinRun = TRUE, rmProtein_with1Feature = FALSE, summaryforMultipleRows = sum, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL, ... )
SpectroMinetoMSstatsTMTFormat( input, annotation, filter_with_Qvalue = TRUE, qvalue_cutoff = 0.01, useUniquePeptide = TRUE, rmPSM_withfewMea_withinRun = TRUE, rmProtein_with1Feature = FALSE, summaryforMultipleRows = sum, use_log_file = TRUE, append = FALSE, verbose = TRUE, log_file_path = NULL, ... )
input |
data name of SpectroMine PSM output. Read PSM sheet. |
annotation |
data frame which contains column Run, Fraction, TechRepMixture, Mixture, Channel, BioReplicate, Condition. Refer to the example 'annotation.mine' for the meaning of each column. |
filter_with_Qvalue |
TRUE(default) will filter out the intensities that have greater than qvalue_cutoff in EG.Qvalue column. Those intensities will be replaced with NA and will be considered as censored missing values for imputation purpose. |
qvalue_cutoff |
Cutoff for EG.Qvalue. default is 0.01. |
useUniquePeptide |
TRUE(default) removes peptides that are assigned for more than one proteins. We assume to use unique peptide for each protein. |
rmPSM_withfewMea_withinRun |
TRUE (default) will remove the features that have 1 or 2 measurements within each Run. |
rmProtein_with1Feature |
TRUE will remove the proteins which have only 1 peptide and charge. Defaut is FALSE. |
summaryforMultipleRows |
sum(default) or max - when there are multiple measurements for certain feature in certain run, select the feature with the largest summation or maximal value. |
use_log_file |
logical. If TRUE, information about data processing will be saved to a file. |
append |
logical. If TRUE, information about data processing will be added to an existing log file. |
verbose |
logical. If TRUE, information about data processing wil be printed to the console. |
log_file_path |
character. Path to a file to which information about data processing will be saved. If not provided, such a file will be created automatically. If 'append = TRUE', has to be a valid path to a file. |
... |
additional parameters to 'data.table::fread'. |
'data.frame' of class 'MSstatsTMT'
head(raw.mine) head(annotation.mine) input.mine <- SpectroMinetoMSstatsTMTFormat(raw.mine, annotation.mine) head(input.mine)
head(raw.mine) head(annotation.mine) input.mine <- SpectroMinetoMSstatsTMTFormat(raw.mine, annotation.mine) head(input.mine)