limpa assumes that data-independent acquisition (DIA) MS proteomics profiling has been performed on a series of biological samples, and that peptide quantifications have been obtained for each sample (Luo et al 2023). limma reads the peptide or peptide-precursor data from popular software tools such as DIA-NN (https://github.com/vdemichev/DiaNN) (Demichev et al 2020), Spectronaut (https://staging.biognosys.com/software/spectronaut/) or MaxQuant (Tyanova et al 2016). In each case, the peptide quantifications are read into a wide-format (peptide by sample) matrix of log-intensities. The only other information required to start a limpa analysis is a vector specifying the protein or protein-group membership of each peptide.
Peptide quantifications typically include missing values for some samples, which cause considerable problems for downstream analyses. The missing values can’t just be ignored, because they occur more often at low intensities, and missing value imputation also introduces biases and problems. limpa uses statistical models developed by Li & Smyth (2023) to evaluate how much information can be recovered from the missing values, quantifying the overall missing value “mechanism” into a detection probability curve (DPC). limpa uses the DPC, together with a Bayesian model, to summarize the peptide quantifications into protein-level quantifications and to undertake limma-based differential expression analyses (Li 2024, Ritchie et al 2015).
If y.peptide
is a matrix of peptide-level
log2-intensities (including NAs), protein.id
is a vector of
protein IDs, and design
is a design matrix, then the
following code will quantify complete log2-expression for the proteins
without missing values and will conduct a differential expression
analysis defined by the design matrix.
library(limpa)
dpcfit <- dpc(y.peptide)
y.protein <- dpcQuant(y.peptide, protein.id, dpc=dpcfit)
fit <- dpcDE(y.protein, design)
fit <- eBayes(fit)
topTable(fit)
Real proteomics datasets are too large to analyze in this vignette, but we can demonstrate a complete reproducible analysis using a small simulated data. First, generate the dataset:
Loading required package: limma
The dataset is stored as a limma EList object, with components
E
(log2-expression), genes
(feature
annotation) and targets
(sample annotation). The simulation
function can generate any number of peptides or proteins but, by
default, the dataset has 100 peptides belonging to 25 proteins and the
samples are in two groups with n = 5 replicates in each group.
About 40% of the expression values are missing.
[1] 100 10
Protein DEStatus
Peptide001 Protein01 NotDE
Peptide002 Protein01 NotDE
Peptide003 Protein01 NotDE
Peptide004 Protein01 NotDE
Peptide005 Protein02 NotDE
Peptide006 Protein02 NotDE
1 2
5 5
[1] 0.425
Next we estimate the intercept and slope of the detection probability curve, which relates the probability of detection to the underlying peptide expression level on the logit scale.
4 peptides are completely missing in all samples.
beta0 beta1
-4.0521 0.7506
Then we use the DPC to quantify the protein log2-expression values, using the DPC to represent the missing values.
Estimating hyperparameters ...
Quantifying proteins ...
Proteins: 25 Peptides: 100
There are no longer any missing values, and the samples now cluster
into groups. The plotMDSUsingSEs()
function is similar to
plotMDS()
in the limma package, but takes account of the
standard errors generated by dpcQuant()
.
> Group <- factor(y.peptide$targets$Group)
> Group.color <- Group
> levels(Group.color) <- c("blue","red")
> plotMDSUsingSEs(y.protein, pch=16, col=as.character(Group.color))
Finally, we conduct a differential expression analysis using the
limma package. The dpcDE
function calls limma’s
voomaLmFit
function, which was specially developed for use
with limpa. voomaLmFit
computes precision weights, in a
similar way to voom
for RNA-seq, but instead of using count
sizes it use the quantification precisions from dpcQuant
.
The plot shows how dpcDE
predicts the protein-wise
variances from the quantification uncertainties and expression
levels.
Protein DEStatus NPeptides PropObs logFC AveExpr t
Protein23 Protein23 Up 4 0.950 0.9366 9.243 5.1593
Protein24 Protein24 Down 4 0.975 -0.8053 9.330 -4.3522
Protein22 Protein22 Up 4 1.000 0.6810 8.911 4.3089
Protein08 Protein08 Up 4 0.475 0.7300 4.841 2.8457
Protein13 Protein13 NotDE 4 0.625 -0.5222 5.880 -2.3070
Protein11 Protein11 NotDE 4 0.500 0.3772 5.014 1.5973
Protein21 Protein21 NotDE 4 0.800 0.3151 8.462 1.5546
Protein07 Protein07 NotDE 4 0.375 0.3256 4.354 1.3382
Protein06 Protein06 NotDE 4 0.175 -0.2859 3.771 -1.3371
Protein03 Protein03 Down 4 0.175 -0.1870 2.367 -0.9061
P.Value adj.P.Val B
Protein23 8.921e-06 0.0002230 3.4121
Protein24 1.044e-04 0.0009902 1.0686
Protein22 1.188e-04 0.0009902 0.8621
Protein08 7.227e-03 0.0451676 -2.6406
Protein13 2.684e-02 0.1341772 -3.9625
Protein11 1.188e-01 0.4595377 -5.1112
Protein21 1.287e-01 0.4595377 -5.4636
Protein07 1.891e-01 0.5262638 -5.4638
Protein06 1.895e-01 0.5262638 -5.6045
Protein03 3.708e-01 0.8427484 -6.1130
This small dataset has five truly DE proteins. Four of the give are top-ranked in the DE results. The other DE protein is ranked 10th in the DE results and does not achieve statistical significant because it had only 17% detected observations and, hence, a high quantification uncertainty.
To view the log-expression values for the top DE protein, together with standard errors:
For some applications, it is desirable to complete and impute the peptide-level intensities without summarizing them at the protein level. This is can done in limpa by
y.peptide.complete <- dpcImpute(y.peptide, dpc=dpcfit)
R version 4.4.3 (2025-02-28)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] limpa_0.99.10 limma_3.63.10 BiocStyle_2.35.0
loaded via a namespace (and not attached):
[1] digest_0.6.37 R6_2.6.1 fastmap_1.2.0
[4] xfun_0.51 maketools_1.3.2 cachem_1.1.0
[7] knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.29
[10] buildtools_1.0.0 lifecycle_1.0.4 cli_3.6.4
[13] sass_0.4.9 data.table_1.17.0 jquerylib_0.1.4
[16] statmod_1.5.0 compiler_4.4.3 sys_3.4.3
[19] tools_4.4.3 evaluate_1.0.3 bslib_0.9.0
[22] yaml_2.3.10 BiocManager_1.30.25 jsonlite_1.9.1
[25] rlang_1.1.5
The limpa pipeline starts with a matrix of peptide precursor
intensities (rows for peptides and columns for samples) and a character
vector of protein IDs. The input data can be conveniently supplied as a
limma EList object, but a plain numeric matrix containing the
log-intensities is also acceptable. Non-detected peptides should be
entered as NA
.
limpa includes the functions readDIANN()
and
readSpectronaut()
, which read peptide or peptide-precursor
level data output by the popular software tools DIA-NN or Spectronaut
respectively. For example,
reads the Report.tsv
file written by DIA-NN from the
current working directory.
We are aware that DIA-NN no longer exports its main report in the
.tsv
format, but in the .parquet
format from
version 2.0. To process DIA-NN data in the .parquet
format,
use
For data searched with match-between-run (MBR), as suggested by Thierry Nordmann (Max Planck Institute of Biochemistry), we recommend to filter the observations by the following
> y.peptide <- readDIANN("report.tsv",
+ q.columns = c("Q.Value","Lib.Q.Value","Lib.PG.Q.Value"),
+ q.cutoffs = 0.01)
If searched without MBR, we recommend
> y.peptide <- readDIANN("report.tsv",
+ q.columns = c("Q.Value","Global.Q.Value","Global.PG.Q.Value"),
+ q.cutoffs = 0.01)
After reading in the data, it is common to filter out non-proteotypic peptides by
and to filter out compound protein groups (protein groups mapped to two or more protein IDs) by
These filtering steps are not required by limpa but help with downstream interpretation of the results.
Protein groups can also be filtered by number of peptides, typically to remove proteins with only one detected precursor:
This step is entirely optional. We generally recommend that users
keep all proteins in order to retain maximum information. The
dpcQuant()
function will still quantify complete data even
for proteins with just one peptide.
Note that peptides do not need to be filtered based on the proportion of detected or missing values. limpa can process peptides correctly even if the number of detections is small.
Missing values for some peptides in some samples has complicated the
analysis of MS proteomics data. Peptides with very low expression values
are frequently not detected, but peptides at high expression levels may
also be undetected for a variety of reasons that are not completely
understood or easily predictable, for example ambiguity of their elution
profile with that of other peptides. If y
is the true
expression level of a particular peptide in a particular sample (on the
log2 scale), then limpa assumes that the probability of detection is
given by P(D|y) = F(β0 + β1y)
where D indicates detection,
β0 and β1 are the intercept and
slope of the DPC and F is the
logistic function, given by plogis
in R. This probability
relationship is called the detection probability curve (DPC) in limpa.
The slope β1
measures how dependent the missing value process is on the underlying
expression level. A slope of zero would means completely random missing
values, while very large slopes correspond to left censoring. The DPC
allows limpa to recover information in a probabilistic manner from the
missing values. The larger the slope, the more information there is to
recover. We typically find β1 values between about
0.7 and 1 to be representative of real MS data.
The DPC is difficult to estimate because y
is only
observed for detected peptides, and the detected values are a biased
representation of the complete values that in principle might have been
observed had the missing value mechanism not operated. limpa uses a
mathematical exponential tilting argument to represent the DPC in terms
of observed values only, which provides a means to estimate the DPC from
real data. The DPC slope β1 is nevertheless often
under-estimated if the variability of each peptide is large.
limpa uses the DPC, together with a Bayesian model, to estimate the expression level of each protein in each sample. A multivariate normal prior is estimated empirically from data to describe the variability in log-intensities across the samples and across the peptides. The DPC-Quant uses maximum posterior estimation to quantify the expression of each protein in each sample, and also returns the posterior standard error with which each expression value is estimated.
Finally, the protein log2-expression values and associated
uncertainties are passed to the voomaLmFitWithImputation
function, which computes precision weights for each observation and fits
proteinwise linear models. voomaLmFitWithImputation
combines features from the voomaLmFit
and
voomLmFit
functions in the limma and edgeR packages. It
uses both protein expression and the quantification standard errors to
predict the protein-wise variances and, hence, to construct precision
weights for downstream linear modelling. This allows the uncertainty
associated with missing values imputation to be propagated through to
the differential expression analysis.
voomaLmFitWithImputation
also gives special consideration
to instances where all the expression values for a particular protein
are imputed for one or more treatment conditions, ensuring robust
differential expression analyses even for proteins with only a small
proportion of detected peptides.
limpa’s dpcDE
function is a wrapper function, passing
the appropriate standard errors from dpcQuant
to
voomaLmFitWithImputation
. The limpa package is fully
compatible with limma analysis pipelines, allowing any complex
experimental design and other downstream tasks such as the gene ontology
or pathway analysis. The dpcDE()
function accepts any
argument that voomaLmFitWithImputation
(or
voomaLmFit
) does. For example,
dpcDE(y.protein, design, sample.weights=TRUE)
can be used
to downweight outlier samples. Or
dpc(y.protein, design, block=subject)
could be used to
model the correlation between repeated observations on the same
subject.
Any questions about limpa can be sent to the Bioconductor support forum.
We are working on a publication that will fully describe the limpa theory and functionality. In the meantime, please cite Li & Smyth (2023), which introduced the idea of the detection probability curve (DPC) that is fundamental to the limpa package.
The limpa project was supported by Chan Zuckerberg Initiative EOSS grant 2021-237445, by Melbourne Research and CSL Translational Data Science Scholarships to ML, and by NHMRC Investigator Grant 2025645 to GS.
Demichev V, Messner CB, Vernardis SI, Lilley KS, Ralser M (2020). DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nature Methods 17(1), 41-44.
Li M, Smyth GK (2023). Neither random nor censored: estimating intensity-dependent probabilities for missing values in label-free proteomics. Bioinformatics 39(5), btad200. doi:10.1093/bioinformatics/btad200
Li M (2024). Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data. PhD Thesis, University of Melbourne. http://hdl.handle.net/11343/351600
Lou R, Cao Y, Li S, Lang X, Li Y, Zhang Y, Shui W (2023). Benchmarking commonly used software suites and analysis workflows for DIA proteomics and phosphoproteomics. Nature Communications 14(1), 94.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47. doi:10.1093/nar/gkv007
Tyanova S, Temu T, Cox J (2016). The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nature Protocols 11, 2301-2319.