--- title: Analyzing mass spectrometry data with limpa author: "Mengbo Li and Gordon K. Smyth" date: "First version 30 December 2024, Last revised 25 March 2025" abstract: > Mass spectrometry is a powerful tool in biomedical research. Advancements in label-free methods and MS instruments have enabled high-throughput proteomics profiling of biological samples with the increasingly deeper coverage of the proteome, but missing values are still ubiquitous in MS-based proteomics data. Single-cell MS-based proteomics has also started to become available, with increased frequency of missing values. The limpa package implements a pipeline for quantification and differential expression analysis of mass-spectrometry (MS) proteomics data, with probabilistic information recovery from missing values. The package accepts peptide-level data, usually including missing values, and produces complete protein quantifications without the need for imputation. A key feature is the ability to quantify an expression value for every protein and every sample, regardless of the number of quantified peptides for each protein or the number of missing values at the peptide level. Another key feature is the ability to propagate quantification uncertainty through to the differential expression analysis, using variance modeling and precision weights, increasing statistical power while avoiding false discoveries. limpa produces a linear model object suitable for downstream analysis with the limma package, allowing complex experimental designs and other downstream tasks such as the gene ontology or pathway analysis. The package name ("limpa") is an acronym for "Linear Models for Proteomics Data". limpa package version: `r packageVersion("limpa")`. output: BiocStyle::html_document: toc: FALSE number_sections: FALSE vignette: > %\VignetteIndexEntry{Analyzing mass spectrometry data with limpa} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( prompt = TRUE, comment = NA ) options(digits=4) ``` # Background limpa assumes that data-independent acquisition (DIA) MS proteomics profiling has been performed on a series of biological samples, and that peptide quantifications have been obtained for each sample (Luo et al 2023). limma reads the peptide or peptide-precursor data from popular software tools such as DIA-NN (https://github.com/vdemichev/DiaNN) (Demichev et al 2020), Spectronaut (https://staging.biognosys.com/software/spectronaut/) or MaxQuant (Tyanova et al 2016). In each case, the peptide quantifications are read into a wide-format (peptide by sample) matrix of log-intensities. The only other information required to start a limpa analysis is a vector specifying the protein or protein-group membership of each peptide. Peptide quantifications typically include missing values for some samples, which cause considerable problems for downstream analyses. The missing values can't just be ignored, because they occur more often at low intensities, and missing value imputation also introduces biases and problems. limpa uses statistical models developed by Li & Smyth (2023) to evaluate how much information can be recovered from the missing values, quantifying the overall missing value "mechanism" into a detection probability curve (DPC). limpa uses the DPC, together with a Bayesian model, to summarize the peptide quantifications into protein-level quantifications and to undertake limma-based differential expression analyses (Li 2024, Ritchie et al 2015). # Quick start If `y.peptide` is a matrix of peptide-level log2-intensities (including NAs), `protein.id` is a vector of protein IDs, and `design` is a design matrix, then the following code will quantify complete log2-expression for the proteins without missing values and will conduct a differential expression analysis defined by the design matrix. ``` library(limpa) dpcfit <- dpc(y.peptide) y.protein <- dpcQuant(y.peptide, protein.id, dpc=dpcfit) fit <- dpcDE(y.protein, design) fit <- eBayes(fit) topTable(fit) ``` # Example analysis Real proteomics datasets are too large to analyze in this vignette, but we can demonstrate a complete reproducible analysis using a small simulated data. First, generate the dataset: ```{r sim} library(limpa) set.seed(20241230) y.peptide <- simProteinDataSet() ``` The dataset is stored as a limma EList object, with components `E` (log2-expression), `genes` (feature annotation) and `targets` (sample annotation). The simulation function can generate any number of peptides or proteins but, by default, the dataset has 100 peptides belonging to 25 proteins and the samples are in two groups with $n=5$ replicates in each group. About 40% of the expression values are missing. ```{r meanNA} dim(y.peptide) head(y.peptide$genes) table(y.peptide$targets$Group) mean(is.na(y.peptide$E)) ``` Next we estimate the intercept and slope of the detection probability curve, which relates the probability of detection to the underlying peptide expression level on the logit scale. ```{r dpc} dpcfit <- dpc(y.peptide) dpcfit$dpc plotDPC(dpcfit) ``` Then we use the DPC to quantify the protein log2-expression values, using the DPC to represent the missing values. ```{r dpcQuant} y.protein <- dpcQuant(y.peptide, "Protein", dpc=dpcfit) ``` There are no longer any missing values, and the samples now cluster into groups. The `plotMDSUsingSEs()` function is similar to `plotMDS()` in the limma package, but takes account of the standard errors generated by `dpcQuant()`. ```{r plotMDS} plotMDSUsingSEs(y.protein) Group <- factor(y.peptide$targets$Group) Group.color <- Group levels(Group.color) <- c("blue","red") plotMDSUsingSEs(y.protein, pch=16, col=as.character(Group.color)) ``` Finally, we conduct a differential expression analysis using the limma package. The `dpcDE` function calls limma's `voomaLmFit` function, which was specially developed for use with limpa. `voomaLmFit` computes precision weights, in a similar way to `voom` for RNA-seq, but instead of using count sizes it use the quantification precisions from `dpcQuant`. The plot shows how `dpcDE` predicts the protein-wise variances from the quantification uncertainties and expression levels. ```{r dpcDE} design <- model.matrix(~Group) fit <- dpcDE(y.protein, design, plot=TRUE) fit <- eBayes(fit) topTable(fit, coef=2) ``` This small dataset has five truly DE proteins. Four of the give are top-ranked in the DE results. The other DE protein is ranked 10th in the DE results and does not achieve statistical significant because it had only 17% detected observations and, hence, a high quantification uncertainty. To view the log-expression values for the top DE protein, together with standard errors: ```{r plotProtein} plotProtein(y.protein, "Protein23", col=as.character(Group.color)) ``` # Quantification without protein summaries For some applications, it is desirable to complete and impute the peptide-level intensities without summarizing them at the protein level. This is can done in limpa by ``` y.peptide.complete <- dpcImpute(y.peptide, dpc=dpcfit) ``` # Sesssion information ```{r sessionInfo} sessionInfo() ``` # Data input The limpa pipeline starts with a matrix of peptide precursor intensities (rows for peptides and columns for samples) and a character vector of protein IDs. The input data can be conveniently supplied as a limma EList object, but a plain numeric matrix containing the log-intensities is also acceptable. Non-detected peptides should be entered as `NA`. limpa includes the functions `readDIANN()` and `readSpectronaut()`, which read peptide or peptide-precursor level data output by the popular software tools DIA-NN or Spectronaut respectively. For example, ```{r, eval = FALSE} y.peptide <- readDIANN("Report.tsv") ``` reads the `Report.tsv` file written by DIA-NN from the current working directory. We are aware that DIA-NN no longer exports its main report in the `.tsv` format, but in the `.parquet` format from version 2.0. To process DIA-NN data in the `.parquet` format, use ```{r, eval = FALSE} y.peptide <- readDIANN("Report.parquet", format="parquet") ``` For data searched with match-between-run (MBR), as suggested by Thierry Nordmann (Max Planck Institute of Biochemistry), we recommend to filter the observations by the following ```{r, eval = FALSE} y.peptide <- readDIANN("report.tsv", q.columns = c("Q.Value","Lib.Q.Value","Lib.PG.Q.Value"), q.cutoffs = 0.01) ``` If searched without MBR, we recommend ```{r, eval = FALSE} y.peptide <- readDIANN("report.tsv", q.columns = c("Q.Value","Global.Q.Value","Global.PG.Q.Value"), q.cutoffs = 0.01) ``` After reading in the data, it is common to filter out non-proteotypic peptides by ```{r, eval = FALSE} y.peptide <- filterNonProteotypicPeptides(y.peptide) ``` and to filter out compound protein groups (protein groups mapped to two or more protein IDs) by ```{r, eval = FALSE} y.peptide <- filterCompoundProteins(y.peptide) ``` These filtering steps are not required by limpa but help with downstream interpretation of the results. Protein groups can also be filtered by number of peptides, typically to remove proteins with only one detected precursor: ```{r, eval = FALSE} y.peptide <- filterSingletonPeptides(y.peptide, min.n.peptides = 2) ``` This step is entirely optional. We generally recommend that users keep all proteins in order to retain maximum information. The `dpcQuant()` function will still quantify complete data even for proteins with just one peptide. Note that peptides do not need to be filtered based on the proportion of detected or missing values. limpa can process peptides correctly even if the number of detections is small. # The detection probability curve Missing values for some peptides in some samples has complicated the analysis of MS proteomics data. Peptides with very low expression values are frequently not detected, but peptides at high expression levels may also be undetected for a variety of reasons that are not completely understood or easily predictable, for example ambiguity of their elution profile with that of other peptides. If `y` is the true expression level of a particular peptide in a particular sample (on the log2 scale), then limpa assumes that the probability of detection is given by $$P(D | y) = F(\beta_0 + \beta_1 y)$$ where $D$ indicates detection, $\beta_0$ and $\beta_1$ are the intercept and slope of the DPC and $F$ is the logistic function, given by `plogis` in R. This probability relationship is called the detection probability curve (DPC) in limpa. The slope $\beta_1$ measures how dependent the missing value process is on the underlying expression level. A slope of zero would means completely random missing values, while very large slopes correspond to left censoring. The DPC allows limpa to recover information in a probabilistic manner from the missing values. The larger the slope, the more information there is to recover. We typically find $\beta_1$ values between about 0.7 and 1 to be representative of real MS data. The DPC is difficult to estimate because `y` is only observed for detected peptides, and the detected values are a biased representation of the complete values that in principle might have been observed had the missing value mechanism not operated. limpa uses a mathematical exponential tilting argument to represent the DPC in terms of observed values only, which provides a means to estimate the DPC from real data. The DPC slope $\beta_1$ is nevertheless often under-estimated if the variability of each peptide is large. # Quantification limpa uses the DPC, together with a Bayesian model, to estimate the expression level of each protein in each sample. A multivariate normal prior is estimated empirically from data to describe the variability in log-intensities across the samples and across the peptides. The DPC-Quant uses maximum posterior estimation to quantify the expression of each protein in each sample, and also returns the posterior standard error with which each expression value is estimated. # Differential expression Finally, the protein log2-expression values and associated uncertainties are passed to the `voomaLmFitWithImputation` function, which computes precision weights for each observation and fits proteinwise linear models. `voomaLmFitWithImputation` combines features from the `voomaLmFit` and `voomLmFit` functions in the limma and edgeR packages. It uses both protein expression and the quantification standard errors to predict the protein-wise variances and, hence, to construct precision weights for downstream linear modelling. This allows the uncertainty associated with missing values imputation to be propagated through to the differential expression analysis. `voomaLmFitWithImputation` also gives special consideration to instances where all the expression values for a particular protein are imputed for one or more treatment conditions, ensuring robust differential expression analyses even for proteins with only a small proportion of detected peptides. limpa's `dpcDE` function is a wrapper function, passing the appropriate standard errors from `dpcQuant` to `voomaLmFitWithImputation`. The limpa package is fully compatible with limma analysis pipelines, allowing any complex experimental design and other downstream tasks such as the gene ontology or pathway analysis. The `dpcDE()` function accepts any argument that `voomaLmFitWithImputation` (or `voomaLmFit`) does. For example, `dpcDE(y.protein, design, sample.weights=TRUE)` can be used to downweight outlier samples. Or `dpc(y.protein, design, block=subject)` could be used to model the correlation between repeated observations on the same subject. # How to get help Any questions about limpa can be sent to the [Bioconductor support forum](https://support.bioconductor.org). # How to cite We are working on a publication that will fully describe the limpa theory and functionality. In the meantime, please cite Li & Smyth (2023), which introduced the idea of the detection probability curve (DPC) that is fundamental to the limpa package. # Funding The limpa project was supported by Chan Zuckerberg Initiative EOSS grant 2021-237445, by Melbourne Research and CSL Translational Data Science Scholarships to ML, and by NHMRC Investigator Grant 2025645 to GS. # Cited references Demichev V, Messner CB, Vernardis SI, Lilley KS, Ralser M (2020). DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. *Nature Methods* 17(1), 41-44. Li M, Smyth GK (2023). Neither random nor censored: estimating intensity-dependent probabilities for missing values in label-free proteomics. *Bioinformatics* 39(5), btad200. [doi:10.1093/bioinformatics/btad200](https://doi.org/10.1093/bioinformatics/btad200) Li M (2024). *Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data*. PhD Thesis, University of Melbourne. [http://hdl.handle.net/11343/351600](http://hdl.handle.net/11343/351600) Lou R, Cao Y, Li S, Lang X, Li Y, Zhang Y, Shui W (2023). Benchmarking commonly used software suites and analysis workflows for DIA proteomics and phosphoproteomics. *Nature Communications* 14(1), 94. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. *Nucleic Acids Research* 43, e47. [doi:10.1093/nar/gkv007](https://doi.org/doi:10.1093/nar/gkv007) Tyanova S, Temu T, Cox J (2016). The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. *Nature Protocols* 11, 2301-2319.