---
title: "MSstats: End to End Workflow"
date: September 5th, 2024
---


```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=10, fig.height=7, warning=FALSE, message=FALSE)
options(width=110)
```

```{=html}
<!--
%\VignetteIndexEntry{MSstats: End to End Workflow}
%\VignetteEngine{knitr::knitr}
-->
```
# __MSstats: Protein/Peptide significance analysis__

Package: MSstats

Author: Anshuman Raina & Devon Kohler

Date: 5th Semptember 2024

## __Introduction__

`MSstats`, an R package in Bioconductor, supports protein differential analysis 
for statistical relative quantification of proteins and peptides in global, 
targeted and data-independent proteomics. It handles shotgun, label-free and 
label-based (universal synthetic peptide-based) SRM (selected reaction 
monitoring), and DIA (data independent acquisition) experiments. It can be used 
for experiments with complex designs (e.g. comparing more than two experimental 
conditions, or a repeated measure design, such as a time course).

This vignette summarizes the introduction and various options of all 
functionalities in `MSstats`. More details are available in `User Manual`.

For more information about the MSstats workflow, including a detailed 
description of the available processing options and their impact on the 
resulting differential analysis, please see the following publication:

Kohler et al, Nature Protocols 19, 2915–2938 (2024).

## __Installation__

To install this package, start R (version “4.0”) and enter:

``` {r code Installation}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MSstats")
library(MSstats)
library(ggplot2)
```

## __1. Workflow__

### __1.1 Raw Data__ 

To begin with, we will load sample datasets, including both annotated and plain 
data. The dataset you need can be found [here](https://github.com/Vitek-Lab/MSstatsConvert/blob/devel/inst/tinytest/raw_data/PD/pd_input.csv).

We will also load the Annotation Dataset using MSstatsConvert. You can access 
this dataset [here](https://github.com/Vitek-Lab/MSstatsConvert/blob/devel/inst/tinytest/raw_data/PD/annot_pd.csv).


``` {r code Load Dataset}
library(MSstats)

# Load data
pd_raw = system.file("tinytest/raw_data/PD/pd_input.csv", 
                    package = "MSstatsConvert")

annotation_raw = system.file("tinytest/raw_data/PD/annot_pd.csv", 
                   package = "MSstatsConvert")

pd = data.table::fread(pd_raw)
annotation = data.table::fread(annotation_raw)

head(pd, 5)
head(annotation, 5)

```

### __1.2 Loading PD Data to MSstats__

The imported data from Step 1.1. now must be converted through `MSstatsConvert` 
package's `PDtoMSstatsFormat` converter.

This function converts the Proteome Discoverer output into the required input 
format for `MSstats`.

Actual data modification can be seen below:

```{r code PDtoMSstatsFormat}
library(MSstatsConvert)

pd_imported = MSstatsConvert::PDtoMSstatsFormat(pd, annotation, 
                                                use_log_file = FALSE)

head(pd_imported)
```

### __1.3 Converters__

We have the following converters, which allow you to convert various types of 
output reports which include the feature level data to the required input format
of `MSstats`. Further information about the converters can be found in the 
`MSstatsConvert` package.

1. `DIANNtoMSstatsFormat`
2. `DIAUmpiretoMSstatsFormat`
3. `FragPipetoMSstatsFormat`
4. `MaxQtoMSstatsFormat`
5. `OpenMStoMSstatsFormat`
6. `OpenSWATHtoMSstatsFormat`
7. `PDtoMSstatsFormat`
8. `ProgenesistoMSstatsFormat`
9. `SkylinetoMSstatsFormat`
10. `SpectronauttoMSstatsFormat`
11. `MetamorpheusToMSstatsFormat`

We show an example of how to use the above said Converters. For more information 
about using the individual converters please see the coresponding documentation.

```{r Converter Files}

skyline_raw = system.file("tinytest/raw_data/Skyline/skyline_input.csv", 
                    package = "MSstatsConvert")

skyline = data.table::fread(skyline_raw)
head(skyline, 5)

```

```{r SkylinetoMSstatsFormat, results='hide', message=FALSE, warning=FALSE}

msstats_format = MSstatsConvert::SkylinetoMSstatsFormat(skyline_raw,
                                      qvalue_cutoff = 0.01,
                                      useUniquePeptide = TRUE,
                                      removeFewMeasurements = TRUE,
                                      removeOxidationMpeptides = TRUE,
                                      removeProtein_with1Feature = TRUE)


```

```{r SkylinetoMSstatsFormat head}
head(msstats_format)
```


### __1.4 Data Process__

Once we import the dataset correctly with Converter, we need to pre-process the 
data which is done by the `dataProcess` function. This step involves data 
processing and quality control of the measured feature intensities.

This function includes 5 main processing steps (with other additional small 
steps):

* __Log transformation__ - Transform the feature intensities from their original
scale to the log scale. This step helps make the data closer to being normally 
distributed, requiring less replicates for the central limit theorem to kick in.

* __Normalization__ - There are three different normalization options supported. 
'equalizeMedians' (default) represents constant normalization (equalizing the 
medians) based on reference signals is performed. 'quantile' represents 
quantile normalization based on reference signals is performed. 
'globalStandards' represents normalization with global standards proteins. 
FALSE represents no normalization is performed.

* __Feature selection__ - This also has three options i.e. Select All features, 
Top-N features (by mean intensity) or “Best” features.

* __Missing value imputation__ - We impute plausible values in case of missing 
data points. The RunLevelData can be queried to show Number of imputed 
intensities (censored intensities) in a RUN and Protein.

* __Summarization__ - After data processing the individual features are 
summarized up to the protein-level using Tukey's Median Polish. Linear 
summarization is also available as an option.


``` {r code dataProcess}
summarized = dataProcess(
    pd_imported,
    logTrans = 2,
    normalization = "equalizeMedians",
    featureSubset = "all",
    n_top_feature = 3,
    summaryMethod = "TMP",
    equalFeatureVar = TRUE,
    censoredInt = "NA",
    MBimpute = TRUE
    )

head(summarized$FeatureLevelData)

head(summarized$ProteinLevelData)

head(summarized$SummaryMethod)

```

### __1.4.1 Data Process Plots__

After processing the input data, `MSstats` provides multiple plots to analyze the 
results. Here we show the various types of plots we can use. By default, a 
pdf file will be downloaded with corresponding feature level data and the Plot 
generated. Alternatively, the `address` parameter can be set to `FALSE` which 
will output the plots directly.

```{r dataProcessPlots, results='hide', message=FALSE, warning=FALSE}

# Profile plot
dataProcessPlots(data=summarized, type="ProfilePlot", 
                 address = FALSE, which.Protein = "P0ABU9")

# Quality control plot
dataProcessPlots(data=summarized, type="QCPlot", 
                 address = FALSE, which.Protein = "P0ABU9")

# Quantification plot for conditions
dataProcessPlots(data=summarized, type="ConditionPlot", 
                 address = FALSE, which.Protein = "P0ABU9")

```


### __1.5 Modeling __

In this step we test for differential changes in protein abundance across 
conditions using a linear mixed-effects model. The model will be automatically 
adjusted based on your experimental design.

A contrast matrix must be provided to the model. Alternatively, all pairwise 
comparisons can be made by passing `pairwise` to the function. For more 
information on creating contrast matrices, please see the citation linked 
at the beginning of this document.

``` {r code groupComparison}

model = groupComparison("pairwise", summarized)

```

Model Details

``` {r Model }

head(model$ModelQC)

head(model$ComparisonResult)

```

### __1.5.1 groupComparisonPlot__

Visualization for model-based analysis and summarizing differentially abundant 
proteins. To summarize the results of log-fold changes and adjusted p-values 
for differentially abundant proteins, `groupComparisonPlots` takes testing 
results from function `groupComparison` as input and automatically generate 
three types of figures in pdf files as output :

* __Volcano plot__ : For each comparison separately. It illustrates actual 
log-fold changes and adjusted p-values for each comparison separately with all 
proteins. The x-axis is the log fold change. The base of logarithm 
transformation is the same as specified in “logTrans” from `dataProcess`. The 
y-axis is the negative log2 or log10 adjusted p-values. The horizontal dashed 
line represents the FDR cutoff. The points below the FDR cutoff line are 
non-significantly abundant proteins (colored in black). The points above the 
FDR cutoff line are significantly abundant proteins (colored in red/blue for 
up-/down-regulated). If fold change cutoff is specified (FCcutoff = specific 
value), the points above the FDR cutoff line but within the FC cutoff line are 
non-significantly abundant proteins (colored in black).

* __Heatmap__ : For multiple comparisons. It illustrates up-/down-regulated 
proteins for multiple comparisons with all proteins. Each column represents 
each comparison of interest. Each row represents each protein. Color red/blue 
represents proteins in that specific comparison are significantly 
up-regulated/down-regulated proteins with FDR cutoff and/or FC cutoff. The 
color scheme shows the evidences of significance. The darker color it is, the 
stronger evidence of significance it has. Color gold represents proteins are 
not significantly different in abundance.

* __Comparison plot__ : For multiple comparisons per protein. It illustrates 
log-fold change and its variation of multiple comparisons for single protein. 
X-axis is comparison of interest. Y-axis is the log fold change. The red points 
are the estimated log fold change from the model. The error bars are the 
confidence interval with 0.95 significant level for log fold change. This 
interval is only based on the standard error, which is estimated from the model.

``` {r GroupComparisonPlots}

groupComparisonPlots(
  model$ComparisonResult,
  type="Heatmap",
  sig = 0.05,
  FCcutoff = FALSE,
  logBase.pvalue = 10,
  ylimUp = FALSE,
  ylimDown = FALSE,
  xlimUp = FALSE,
  x.axis.size = 10,
  y.axis.size = 10,
  dot.size = 3,
  text.size = 4,
  text.angle = 0,
  legend.size = 13,
  ProteinName = TRUE,
  colorkey = TRUE,
  numProtein = 100,
  clustering = "both",
  width = 800,
  height = 600,
  which.Comparison = "all",
  which.Protein = "all",
  address = FALSE,
  isPlotly = FALSE
)


groupComparisonPlots(
  model$ComparisonResult,
  type="VolcanoPlot",
  sig = 0.05,
  FCcutoff = FALSE,
  logBase.pvalue = 10,
  ylimUp = FALSE,
  ylimDown = FALSE,
  xlimUp = FALSE,
  x.axis.size = 10,
  y.axis.size = 10,
  dot.size = 3,
  text.size = 4,
  text.angle = 0,
  legend.size = 13,
  ProteinName = TRUE,
  colorkey = TRUE,
  numProtein = 100,
  clustering = "both",
  width = 800,
  height = 600,
  which.Comparison = "Condition2 vs Condition4",
  which.Protein = "all",
  address = FALSE,
  isPlotly = FALSE
)

```

### __1.6 GroupComparisonQCPlots__

To check and verify that the resultant data of `groupComparison` offers a linear 
model for whole plot inference, `groupComparisonQC` plots take the fitted data 
and provide two ways of plotting:

1. Normal Q-Q plot : Quantile-Quantile plots represents normal quantile-quantile
plot for each protein after fitting models
2. Residual plot : represents a plot of residuals versus fitted values for each 
protein in the dataset.

Results based on statistical models for whole plot level inference are accurate 
as long as the assumptions of the model are met. The model assumes that the 
measurement errors are normally distributed with mean 0 and constant variance. 
The assumption of a constant variance can be checked by examining the residuals 
from the model.


``` {r GroupComparisonQCplots, results='hide', message=FALSE, warning=FALSE}

source("..//R//groupComparisonQCPlots.R")

groupComparisonQCPlots(data=model, type="QQPlots", address=FALSE, 
                       which.Protein = "P0ABU9")


groupComparisonQCPlots(data=model, type="ResidualPlots", address=FALSE, 
                       which.Protein = "P0ABU9")
```


### __1.7 Sample Size Calculation__

Calculate sample size for future experiments of a Selected Reaction 
Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and 
Data-Independent Acquisition (DIA or SWATH-MS) experiment based on 
intensity-based linear model. The function fits the model and uses variance 
components to calculate sample size. The underlying model fitting with 
intensity-based linear model with technical MS run replication. Estimated 
sample size is rounded to 0 decimal. Two options of the calculation:

* number of biological replicates per condition
* power

```{r Sample Size}

sample_size_calc = designSampleSize(model$FittedModel,
                                    desiredFC=c(1.75,2.5),
                                    power = TRUE,
                                    numSample=5)

```


### __1.7.1 Sample Size Calculation Plot__

To illustrate the relationship of desired fold change and the calculated minimal
number sample size which are

The input is the result from function `designSampleSize`.

```{r Sample Size plot}

designSampleSizePlots(sample_size_calc, isPlotly=FALSE)

```


### __1.8 Quantification from groupComparison Data__

Model-based quantification for each condition or for each biological samples 
per protein in a targeted Selected Reaction Monitoring (SRM), Data-Dependent 
Acquisition (DDA or shotgun), and Data-Independent Acquisition 
(DIA or SWATH-MS) experiment. Quantification takes the processed data set by 
`dataProcess` as input and automatically generate the quantification results 
(data.frame) with long or matrix format. The quantification for endogenous 
samples is based on run summarization from subplot model, with TMP robust 
estimation.

* Sample quantification : individual biological sample quantification for 
each protein. The label of each biological sample is a combination of the 
corresponding group and the sample ID. If there are no technical replicates or 
experimental replicates per sample, sample quantification is the same as run 
summarization from `dataProcess` (`RunlevelData` from `dataProcess`). If there 
are technical replicates or experimental replicates, sample quantification is 
median among run quantification corresponding MS runs.

* Group quantification : quantification for individual group or individual 
condition per protein. It is median among sample quantification.

```{r Quantification}
sample_quant_long = quantification(summarized,
                             type = "Sample",
                             format = "long")
sample_quant_long
sample_quant_wide = quantification(summarized,
                              type = "Sample",
                              format = "matrix")
sample_quant_wide
group_quant_long = quantification(summarized,
                                  type = "Group",
                                  format = "long")
group_quant_long
group_quant_wide = quantification(summarized,
                                  type = "Group",
                                  format = "matrix")
group_quant_wide
```