Title: | MSstats Preprocessing for Larger than Memory Data |
---|---|
Description: | The MSstats package provides tools for preprocessing, summarization, and differential analysis of mass spectrometry (MS) proteomics data. Some recent MS protocols produce quantitative data sets that are larger than memory, which MSstats functions cannot process. The MSstatsBig package provides additional converter functions that enable processing of larger-than-memory data sets. |
Authors: | Mateusz Staniak [aut, cre], Devon Kohler [aut] |
Maintainer: | Mateusz Staniak <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.5.0 |
Built: | 2024-11-29 06:25:56 UTC |
Source: | https://github.com/bioc/MSstatsBig |
Convert out-of-memory FragPipe files to MSstats format.
bigFragPipetoMSstatsFormat(
  input_file,
  output_file_name,
  backend,
  max_feature_count = 20,
  filter_unique_peptides = FALSE,
  aggregate_psms = FALSE,
  filter_few_obs = FALSE,
  remove_annotation = FALSE,
  connection = NULL
)
input_file |
name of the input text file in 10-column MSstats format. |
output_file_name |
name of an output file which will be saved after pre-processing |
backend |
"arrow" or "sparklyr". Option "sparklyr" requires a spark installation and connection to spark instance provided in the 'connection' parameter. |
max_feature_count |
maximum number of features per protein. Features will be selected based on highest average intensity. |
filter_unique_peptides |
If TRUE, shared peptides will be removed. Please refer to the 'Details' section for additional information. |
aggregate_psms |
If TRUE, multiple measurements per PSM in a Run will be aggregated (by taking maximum value). Please refer to the 'Details' section for additional information. |
filter_few_obs |
If TRUE, features with fewer than 3 observations across runs will be removed. Please refer to the 'Details' section for additional information. |
remove_annotation |
If TRUE, columns BioReplicate and Condition will be removed to reduce output file size. These will need to be added manually later, before using the dataProcess function. Only applicable to the sparklyr backend. |
connection |
Connection to a spark instance created with the 'spark_connect' function from 'sparklyr' package. |
either arrow object or sparklyr table that can be optionally collected into memory by using dplyr::collect function.
converted_data <- bigFragPipetoMSstatsFormat(
  system.file("extdata", "fgexample.csv", package = "MSstatsBig"),
  "output_file.csv",
  backend = "arrow"
)
converted_data <- dplyr::collect(converted_data)
head(converted_data)
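The 'max_feature_count' argument keeps, for each protein, the features with the highest average intensity. The following is a minimal base R sketch of that idea on invented toy data. The helper `select_top_features` and the single `Feature` column are hypothetical names used only for illustration; this is not the package's implementation, and the way ties and missing values are handled here is an assumption.

```r
# Toy long-format data; the values are invented for illustration.
toy <- data.frame(
  ProteinName = c("P1", "P1", "P1", "P1", "P1", "P1", "P2", "P2"),
  Feature     = c("f1", "f1", "f2", "f2", "f3", "f3", "f4", "f4"),
  Run         = c("r1", "r2", "r1", "r2", "r1", "r2", "r1", "r2"),
  Intensity   = c(10, 20, 100, 200, 50, NA, 5, 7)
)

# Hypothetical helper: keep the `max_feature_count` features with the
# highest mean intensity within each protein.
select_top_features <- function(data, max_feature_count = 20) {
  # Mean intensity per protein and feature. The formula interface of
  # aggregate() drops rows with missing Intensity by default.
  means <- aggregate(Intensity ~ ProteinName + Feature, data = data,
                     FUN = mean)
  # Rank features within each protein, highest mean first
  # (ties are broken by order of appearance, which is an assumption).
  means$rank <- ave(-means$Intensity, means$ProteinName,
                    FUN = function(x) rank(x, ties.method = "first"))
  keep <- means[means$rank <= max_feature_count,
                c("ProteinName", "Feature")]
  # Keep only the rows that belong to the selected features.
  merge(data, keep, by = c("ProteinName", "Feature"))
}

top <- select_top_features(toy, max_feature_count = 2)
print(top)
```

With a limit of 2, the lowest-intensity feature of the three-feature protein is dropped, while the single-feature protein is left unchanged.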
Convert out-of-memory Spectronaut files to MSstats format.
bigSpectronauttoMSstatsFormat(
  input_file,
  output_file_name,
  backend,
  filter_by_excluded = FALSE,
  filter_by_identified = FALSE,
  filter_by_qvalue = TRUE,
  qvalue_cutoff = 0.01,
  max_feature_count = 20,
  filter_unique_peptides = FALSE,
  aggregate_psms = FALSE,
  filter_few_obs = FALSE,
  remove_annotation = FALSE,
  connection = NULL
)
input_file |
name of the input text file in 10-column MSstats format. |
output_file_name |
name of an output file which will be saved after pre-processing |
backend |
"arrow" or "sparklyr". Option "sparklyr" requires a spark installation and connection to spark instance provided in the 'connection' parameter. |
filter_by_excluded |
if TRUE, will filter by the 'F.ExcludedFromQuantification' column. |
filter_by_identified |
if TRUE, will filter by the 'EG.Identified' column. |
filter_by_qvalue |
if TRUE, will filter by EG.Qvalue and PG.Qvalue columns. |
qvalue_cutoff |
cutoff which will be used for q-value filtering. |
max_feature_count |
maximum number of features per protein. Features will be selected based on highest average intensity. |
filter_unique_peptides |
If TRUE, shared peptides will be removed. Please refer to the 'Details' section for additional information. |
aggregate_psms |
If TRUE, multiple measurements per PSM in a Run will be aggregated (by taking maximum value). Please refer to the 'Details' section for additional information. |
filter_few_obs |
If TRUE, features with fewer than 3 observations across runs will be removed. Please refer to the 'Details' section for additional information. |
remove_annotation |
If TRUE, columns BioReplicate and Condition will be removed to reduce output file size. These will need to be added manually later, before using the dataProcess function. Only applicable to the sparklyr backend. |
connection |
Connection to a spark instance created with the 'spark_connect' function from 'sparklyr' package. |
either arrow object or sparklyr table that can be optionally collected into memory by using dplyr::collect function.
converted_data <- bigSpectronauttoMSstatsFormat(
  system.file("extdata", "spectronaut_input.csv", package = "MSstatsBig"),
  "output_file.csv",
  backend = "arrow"
)
converted_data <- dplyr::collect(converted_data)
head(converted_data)
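The 'filter_by_qvalue' and 'qvalue_cutoff' arguments control filtering on the EG.Qvalue and PG.Qvalue columns. The sketch below shows one way such a filter could look in base R on invented data. It assumes a strict "less than" comparison and that failing rows are dropped; the manual does not state whether the package drops rows or instead sets their intensities to missing, so treat both choices as assumptions.

```r
# Toy Spectronaut-like rows; the values are invented for illustration.
toy <- data.frame(
  EG.Qvalue = c(0.001, 0.02, 0.005, 0.009),
  PG.Qvalue = c(0.002, 0.001, 0.05, 0.01),
  Intensity = c(100, 200, 300, 400)
)

qvalue_cutoff <- 0.01

# Assumption: a row passes only if both q-values are strictly below
# the cutoff.
passes <- toy$EG.Qvalue < qvalue_cutoff & toy$PG.Qvalue < qvalue_cutoff

# Assumption: rows that fail the filter are dropped.
filtered <- toy[passes, ]
print(filtered)
```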
Merge annotation to output of MSstatsPreprocessBig
MSstatsAddAnnotationBig(input, annotation)
input |
output of MSstatsPreprocessBig |
annotation |
run annotation |
table of 'input' and 'annotation' merged by Run column.
converted_data <- bigFragPipetoMSstatsFormat(
  system.file("extdata", "fgexample.csv", package = "MSstatsBig"),
  "output_file.csv",
  backend = "arrow"
)
converted_data <- dplyr::collect(converted_data)
head(converted_data)

# Change annotation as an example:
converted_data$Condition <- NULL
converted_data$BioReplicate <- NULL
annot <- data.frame(Run = unique(converted_data[["Run"]]))
annot$BioReplicate <- rep(1:53, times = 2)
annot$Condition <- rep(1:2, each = 53)
head(MSstatsAddAnnotationBig(converted_data, annot))
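Since the result is described as 'input' and 'annotation' merged by the Run column, the operation can be pictured with a plain base R merge on small in-memory tables. This is only a sketch with invented data: the join type is an assumption (base `merge` performs an inner join by default), and the package function may behave differently for runs that are missing from the annotation.

```r
# Toy preprocessed data without annotation columns (invented values).
input <- data.frame(
  ProteinName = c("P1", "P1", "P2", "P2"),
  Run         = c("run_a", "run_b", "run_a", "run_b"),
  Intensity   = c(10, 12, 30, 28)
)

# One row of annotation per run.
annotation <- data.frame(
  Run          = c("run_a", "run_b"),
  BioReplicate = c(1, 2),
  Condition    = c("control", "treated")
)

# Join on Run so that every measurement gains BioReplicate and Condition.
merged <- merge(input, annotation, by = "Run")
print(merged)
```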
General converter for larger-than-memory csv files in 10-column MSstats format.
MSstatsPreprocessBig(
  input_file,
  output_file_name,
  backend,
  max_feature_count = 20,
  filter_unique_peptides = FALSE,
  aggregate_psms = FALSE,
  filter_few_obs = FALSE,
  remove_annotation = FALSE,
  connection = NULL
)
input_file |
name of the input text file in 10-column MSstats format. |
output_file_name |
name of an output file which will be saved after pre-processing |
backend |
"arrow" or "sparklyr". Option "sparklyr" requires a spark installation and connection to spark instance provided in the 'connection' parameter. |
max_feature_count |
maximum number of features per protein. Features will be selected based on highest average intensity. |
filter_unique_peptides |
If TRUE, shared peptides will be removed. Please refer to the 'Details' section for additional information. |
aggregate_psms |
If TRUE, multiple measurements per PSM in a Run will be aggregated (by taking maximum value). Please refer to the 'Details' section for additional information. |
filter_few_obs |
If TRUE, features with fewer than 3 observations across runs will be removed. Please refer to the 'Details' section for additional information. |
remove_annotation |
If TRUE, columns BioReplicate and Condition will be removed to reduce output file size. These will need to be added manually later, before using the dataProcess function. Only applicable to the sparklyr backend. |
connection |
Connection to a spark instance created with the 'spark_connect' function from 'sparklyr' package. |
Filtering and aggregation can be very time-consuming, and whether they can be performed in a given R session depends on available memory, settings of external packages, and so on. Hence, all related parameters ('filter_unique_peptides', 'aggregate_psms', 'filter_few_obs') are set to FALSE by default and only feature selection is performed, which saves both computation time and memory. An appropriately configured spark backend provides the most consistent way to perform these operations.
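To make the optional steps more concrete, here is a small base R sketch of what 'aggregate_psms' (taking the maximum of repeated measurements within a run) and 'filter_few_obs' (removing features with fewer than 3 observations across runs) describe. The data are invented, the single `Feature` column is a hypothetical simplification, and counting one observation per run after aggregation is an assumption; the package itself performs these steps through the arrow or sparklyr backend rather than on an in-memory data frame.

```r
# Toy data: feature f1 is measured twice in run r1, f2 twice in run r2.
toy <- data.frame(
  Feature   = c("f1", "f1", "f1", "f1", "f2", "f2", "f2"),
  Run       = c("r1", "r1", "r2", "r3", "r1", "r2", "r2"),
  Intensity = c(10, 15, 20, 30, 5, 7, 9)
)

# aggregate_psms: one value per feature and run, taken as the maximum.
aggregated <- aggregate(Intensity ~ Feature + Run, data = toy, FUN = max)

# filter_few_obs: count observations per feature across runs and keep
# only features observed at least 3 times.
n_obs <- table(aggregated$Feature)
kept <- aggregated[aggregated$Feature %in% names(n_obs)[n_obs >= 3], ]

print(aggregated)
print(kept)
```

In this toy case, f2 is observed in only two runs after aggregation and is therefore removed, while f1 is kept with one value per run.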
either arrow object or sparklyr table that can be optionally collected into memory by using dplyr::collect function.
converted_data <- bigFragPipetoMSstatsFormat(
  system.file("extdata", "fgexample.csv", package = "MSstatsBig"),
  "tencol_format.csv",
  backend = "arrow"
)
procd <- MSstatsPreprocessBig("tencol_format.csv", "proc_out.csv", backend = "arrow")
head(dplyr::collect(procd))