| Title: | MSstats Preprocessing for Larger than Memory Data |
|---|---|
| Description: | The MSstats package provides tools for preprocessing, summarization and differential analysis of mass spectrometry (MS) proteomics data. Some recent MS protocols produce quantitative data sets that are larger than memory, which MSstats functions cannot process. The MSstatsBig package provides additional converter functions that enable processing of larger-than-memory data sets. |
| Authors: | Anthony Wu [aut, cre], Mateusz Staniak [aut], Devon Kohler [aut] |
| Maintainer: | Anthony Wu <[email protected]> |
| License: | Artistic-2.0 |
| Version: | 1.11.0 |
| Built: | 2026-05-28 14:49:11 UTC |
| Source: | https://github.com/bioc/MSstatsBig |
Convert out-of-memory DIANN files to MSstats format.
bigDIANNtoMSstatsFormat(
  input_file,
  annotation = NULL,
  output_file_name,
  backend,
  MBR = TRUE,
  quantificationColumn = "FragmentQuantCorrected",
  global_qvalue_cutoff = 0.01,
  qvalue_cutoff = 0.01,
  pg_qvalue_cutoff = 0.01,
  max_feature_count = 100,
  filter_unique_peptides = FALSE,
  aggregate_psms = FALSE,
  filter_few_obs = FALSE,
  remove_annotation = FALSE,
  calculateAnomalyScores = FALSE,
  anomalyModelFeatures = c(),
  connection = NULL
)
input_file |
name of the input text file in 10-column MSstats format. |
annotation |
name of 'annotation.txt' data which includes Condition, BioReplicate, Run. |
output_file_name |
name of an output file which will be saved after pre-processing |
backend |
"arrow" or "sparklyr". Option "sparklyr" requires a spark installation and connection to spark instance provided in the 'connection' parameter. |
MBR |
TRUE if the analysis was done with match between runs (MBR). |
quantificationColumn |
Use 'FragmentQuantCorrected'(default) column for quantified intensities for DIANN 1.8.x. Use 'FragmentQuantRaw' for quantified intensities for DIANN 1.9.x. Use 'auto' for quantified intensities for DIANN 2.x where each fragment intensity is a separate column, e.g. Fr0Quantity. |
global_qvalue_cutoff |
The qvalue cutoff for the Q.Value column, i.e. the run-specific precursor q-value. Default is 0.01. |
qvalue_cutoff |
If MBR is false, the qvalue cutoff for the Global.Q.Value column, i.e. global precursor q-value. If MBR is true, the qvalue cutoff for the Lib.Q.Value column, i.e. the q-value for the library created after the first MBR pass. Default is 0.01. |
pg_qvalue_cutoff |
If MBR is false, the qvalue cutoff for the Global.PG.Q.Value column, i.e. the global q-value for the protein group. If MBR is true, the qvalue cutoff for the Lib.PG.Q.Value column, i.e. the protein group q-value for the library created after the first MBR pass. Default is 0.01. |
max_feature_count |
maximum number of features per protein. Features will be selected based on highest average intensity. |
filter_unique_peptides |
If TRUE, shared peptides will be removed. Please refer to the 'Details' section for additional information. |
aggregate_psms |
If TRUE, multiple measurements per PSM in a Run will be aggregated (by taking maximum value). Please refer to the 'Details' section for additional information. |
filter_few_obs |
If TRUE, features with fewer than 3 observations across runs will be removed. Please refer to the 'Details' section for additional information. |
remove_annotation |
If TRUE, columns BioReplicate and Condition will be removed to reduce output file size. These will need to be added manually later before using dataProcess function. Only applicable to sparklyr backend. |
calculateAnomalyScores |
If TRUE, will carry anomaly model features through pipeline |
anomalyModelFeatures |
Character vector of column names to be carried through the pipeline |
connection |
Connection to a spark instance created with the 'spark_connect' function from 'sparklyr' package. |
either arrow object or sparklyr table that can be optionally collected into memory by using dplyr::collect function.
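The version-dependent choice of `quantificationColumn` and the MBR-dependent meaning of the q-value cutoffs can be summarized in a small lookup. The sketch below only restates the argument descriptions above; `diann_quant_column` and `diann_qvalue_columns` are hypothetical helpers, not part of MSstatsBig, and treating versions older than 1.9 like 1.8.x is an assumption.

```r
# Hypothetical helper (not part of MSstatsBig): choose a value for
# quantificationColumn from a DIANN version string.
diann_quant_column <- function(diann_version) {
  v <- numeric_version(diann_version)
  if (v >= "2.0") {
    "auto"                    # DIANN 2.x: one column per fragment, e.g. Fr0Quantity
  } else if (v >= "1.9") {
    "FragmentQuantRaw"        # DIANN 1.9.x
  } else {
    "FragmentQuantCorrected"  # DIANN 1.8.x (assumed for older versions as well)
  }
}

# Hypothetical helper: name the report column each cutoff argument applies to,
# depending on whether match between runs (MBR) was used.
diann_qvalue_columns <- function(MBR = TRUE) {
  c(
    global_qvalue_cutoff = "Q.Value",
    qvalue_cutoff        = if (MBR) "Lib.Q.Value" else "Global.Q.Value",
    pg_qvalue_cutoff     = if (MBR) "Lib.PG.Q.Value" else "Global.PG.Q.Value"
  )
}

diann_quant_column("1.9.2")
diann_qvalue_columns(MBR = FALSE)
```

The value returned by `diann_quant_column` could then be passed as `quantificationColumn` when calling `bigDIANNtoMSstatsFormat`.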
Convert out-of-memory FragPipe files to MSstats format.
bigFragPipetoMSstatsFormat(
  input_file,
  output_file_name,
  backend,
  max_feature_count = 100,
  filter_unique_peptides = FALSE,
  aggregate_psms = FALSE,
  filter_few_obs = FALSE,
  remove_annotation = FALSE,
  connection = NULL
)
input_file |
name of the input text file in 10-column MSstats format. |
output_file_name |
name of an output file which will be saved after pre-processing |
backend |
"arrow" or "sparklyr". Option "sparklyr" requires a spark installation and connection to spark instance provided in the 'connection' parameter. |
max_feature_count |
maximum number of features per protein. Features will be selected based on highest average intensity. |
filter_unique_peptides |
If TRUE, shared peptides will be removed. Please refer to the 'Details' section for additional information. |
aggregate_psms |
If TRUE, multiple measurements per PSM in a Run will be aggregated (by taking maximum value). Please refer to the 'Details' section for additional information. |
filter_few_obs |
If TRUE, features with fewer than 3 observations across runs will be removed. Please refer to the 'Details' section for additional information. |
remove_annotation |
If TRUE, columns BioReplicate and Condition will be removed to reduce output file size. These will need to be added manually later before using dataProcess function. Only applicable to sparklyr backend. |
connection |
Connection to a spark instance created with the 'spark_connect' function from 'sparklyr' package. |
either arrow object or sparklyr table that can be optionally collected into memory by using dplyr::collect function.
converted_data <- bigFragPipetoMSstatsFormat(
  system.file("extdata", "fgexample.csv", package = "MSstatsBig"),
  "output_file.csv",
  backend = "arrow")
converted_data <- dplyr::collect(converted_data)
head(converted_data)
Convert out-of-memory Spectronaut files to MSstats format.
bigSpectronauttoMSstatsFormat(
  input_file,
  output_file_name,
  backend,
  intensity = "F.NormalizedPeakArea",
  filter_by_excluded = FALSE,
  filter_by_identified = FALSE,
  filter_by_qvalue = FALSE,
  qvalue_cutoff = 0.01,
  max_feature_count = 100,
  filter_unique_peptides = FALSE,
  aggregate_psms = FALSE,
  filter_few_obs = FALSE,
  remove_annotation = FALSE,
  calculateAnomalyScores = FALSE,
  anomalyModelFeatures = c(),
  connection = NULL
)
input_file |
name of the input text file in 10-column MSstats format. |
output_file_name |
name of an output file which will be saved after pre-processing |
backend |
"arrow" or "sparklyr". Option "sparklyr" requires a spark installation and connection to spark instance provided in the 'connection' parameter. |
intensity |
Name of the intensity column to be used in Spectronaut |
filter_by_excluded |
if TRUE, will filter by the 'F.ExcludedFromQuantification' column. |
filter_by_identified |
if TRUE, will filter by the 'EG.Identified' column. |
filter_by_qvalue |
if TRUE, will filter by EG.Qvalue and PG.Qvalue columns. |
qvalue_cutoff |
cutoff which will be used for q-value filtering. |
max_feature_count |
maximum number of features per protein. Features will be selected based on highest average intensity. |
filter_unique_peptides |
If TRUE, shared peptides will be removed. Please refer to the 'Details' section for additional information. |
aggregate_psms |
If TRUE, multiple measurements per PSM in a Run will be aggregated (by taking maximum value). Please refer to the 'Details' section for additional information. |
filter_few_obs |
If TRUE, features with fewer than 3 observations across runs will be removed. Please refer to the 'Details' section for additional information. |
remove_annotation |
If TRUE, columns BioReplicate and Condition will be removed to reduce output file size. These will need to be added manually later before using dataProcess function. Only applicable to sparklyr backend. |
calculateAnomalyScores |
If TRUE, will carry anomaly model features through pipeline |
anomalyModelFeatures |
Character vector of column names to be carried through the pipeline |
connection |
Connection to a spark instance created with the 'spark_connect' function from 'sparklyr' package. |
either arrow object or sparklyr table that can be optionally collected into memory by using dplyr::collect function.
converted_data <- bigSpectronauttoMSstatsFormat(
  system.file("extdata", "spectronaut_input.csv", package = "MSstatsBig"),
  "output_file.csv",
  backend = "arrow")
converted_data <- dplyr::collect(converted_data)
head(converted_data)
Merge annotation to output of MSstatsPreprocessBig
MSstatsAddAnnotationBig(input, annotation)
input |
output of MSstatsPreprocessBig |
annotation |
run annotation |
table of 'input' and 'annotation' merged by Run column.
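To illustrate what merging by the Run column means, here is a minimal base R sketch on toy data. It is not the package's implementation: the data frames, their values, and the use of `merge` with a left join are assumptions made for the example.

```r
# Toy stand-in for converter output with annotation columns removed.
input <- data.frame(
  ProteinName = c("P1", "P1", "P2"),
  Run = c("run_a", "run_b", "run_a"),
  Intensity = c(100, 200, 50)
)

# Toy run annotation: one row per run.
annotation <- data.frame(
  Run = c("run_a", "run_b"),
  Condition = c("control", "treated"),
  BioReplicate = c(1, 2)
)

# Check that every run in the data has an annotation row before joining;
# unmatched runs would otherwise end up with missing Condition/BioReplicate.
stopifnot(all(unique(input$Run) %in% annotation$Run))

# Left join on Run: each measurement row gains Condition and BioReplicate.
merged <- merge(input, annotation, by = "Run", all.x = TRUE, sort = FALSE)
merged
```

Checking run coverage first is a cheap safeguard, since mismatched run names are a common cause of missing annotation after a join.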
converted_data <- bigFragPipetoMSstatsFormat(
  system.file("extdata", "fgexample.csv", package = "MSstatsBig"),
  "output_file.csv",
  backend = "arrow")
converted_data <- dplyr::collect(converted_data)
head(converted_data)
# Change annotation as an example:
converted_data$Condition <- NULL
converted_data$BioReplicate <- NULL
annot <- data.frame(Run = unique(converted_data[["Run"]]))
annot$BioReplicate <- rep(1:53, times = 2)
annot$Condition <- rep(1:2, each = 53)
head(MSstatsAddAnnotationBig(converted_data, annot))
General converter for larger-than-memory csv files in 10-column MSstats format.
MSstatsPreprocessBig(
  input_file,
  output_file_name,
  backend,
  max_feature_count = 100,
  filter_unique_peptides = FALSE,
  aggregate_psms = FALSE,
  filter_few_obs = FALSE,
  remove_annotation = FALSE,
  calculateAnomalyScores = FALSE,
  anomalyModelFeatures = c(),
  connection = NULL
)
input_file |
name of the input text file in 10-column MSstats format. |
output_file_name |
name of an output file which will be saved after pre-processing |
backend |
"arrow" or "sparklyr". Option "sparklyr" requires a spark installation and connection to spark instance provided in the 'connection' parameter. |
max_feature_count |
maximum number of features per protein. Features will be selected based on highest average intensity. |
filter_unique_peptides |
If TRUE, shared peptides will be removed. Please refer to the 'Details' section for additional information. |
aggregate_psms |
If TRUE, multiple measurements per PSM in a Run will be aggregated (by taking maximum value). Please refer to the 'Details' section for additional information. |
filter_few_obs |
If TRUE, features with fewer than 3 observations across runs will be removed. Please refer to the 'Details' section for additional information. |
remove_annotation |
If TRUE, columns BioReplicate and Condition will be removed to reduce output file size. These will need to be added manually later before using dataProcess function. Only applicable to sparklyr backend. |
calculateAnomalyScores |
If TRUE, will carry anomaly model features through pipeline |
anomalyModelFeatures |
Character vector of column names to be carried through the pipeline |
connection |
Connection to a spark instance created with the 'spark_connect' function from 'sparklyr' package. |
Filtering and aggregation may be very time consuming, and whether they can be performed in a given R session depends on available memory, settings of external packages, etc. Hence, the related parameters ('filter_unique_peptides', 'aggregate_psms', 'filter_few_obs') are all set to FALSE by default and only feature selection is performed, which saves both computation time and memory. An appropriately configured spark backend provides the most consistent way to perform these operations.
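The following base R sketch shows, on a small in-memory toy table, what feature selection by highest average intensity and the few-observations filter amount to. It is an illustration under stated assumptions, not the package's implementation: the `Feature` id column and both helper functions are hypothetical, and missing intensities are assumed to be ignored when averaging.

```r
# Toy long-format data: one row per feature per run.
# "Feature" is a hypothetical id; in the 10-column format it would combine
# PeptideSequence, PrecursorCharge, FragmentIon and ProductCharge.
toy <- data.frame(
  ProteinName = rep("P1", 9),
  Feature = rep(c("f1", "f2", "f3"), each = 3),
  Run = rep(c("r1", "r2", "r3"), times = 3),
  Intensity = c(10, 12, 11, 100, 90, NA, 50, NA, NA)
)

# Keep the n features per protein with the highest mean intensity
# (cf. max_feature_count). Rows with missing intensity are dropped by
# aggregate() before averaging.
select_top_features <- function(data, max_feature_count) {
  means <- aggregate(Intensity ~ ProteinName + Feature, data = data, FUN = mean)
  means <- means[order(means$ProteinName, -means$Intensity), ]
  rank_in_protein <- ave(means$Intensity, means$ProteinName,
                         FUN = function(x) seq_along(x))
  keep <- means[rank_in_protein <= max_feature_count, c("ProteinName", "Feature")]
  merge(data, keep, by = c("ProteinName", "Feature"))
}

# Remove features with fewer than min_obs non-missing intensities across runs
# (cf. filter_few_obs, which uses a threshold of 3).
filter_few_observations <- function(data, min_obs = 3) {
  observed <- as.numeric(!is.na(data$Intensity))
  n_obs <- ave(observed, data$ProteinName, data$Feature, FUN = sum)
  data[n_obs >= min_obs, ]
}

top_two <- select_top_features(toy, max_feature_count = 2)
well_observed <- filter_few_observations(toy)
```

On data that fits in memory these steps are cheap; the paragraph above concerns the case where the same grouping operations must run out of memory through the arrow or spark backend.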
either arrow object or sparklyr table that can be optionally collected into memory by using dplyr::collect function.
converted_data <- bigFragPipetoMSstatsFormat(
  system.file("extdata", "fgexample.csv", package = "MSstatsBig"),
  "tencol_format.csv",
  backend = "arrow")
procd <- MSstatsPreprocessBig("tencol_format.csv", "proc_out.csv",
                              backend = "arrow")
head(dplyr::collect(procd))