Pirat

Installation

Pirat is an R package available via the Bioconductor repository for packages.

You can install the development version from GitHub :

BiocManager::install("prostarproteomics/Pirat")

When calling one of the imputation methods for the first time after package installation, the underlying python module and conda environment will be automatically downloaded and installed without the need for user intervention. This may take several minutes, but this is a one-time installation. the first time after package installation.

Standard imputation with Pirat

Pirat is a single imputation pipeline including a novel statistical approach dedicated to bottom-up proteomics data. All the technical details and the validation procedure of this method in available on the corresponding preprint Etourneau et al. (2023) . To demonstrate its usage, we first load Pirat package and a subset of Bouyssie2020 dataset in the environment.

library(Pirat)
library(utils)
data(subbouyssie)

Note that subbouyssie is actually a list that contains two elements:

  1. peptides_ab, the matrix of peptide log-2-abundances, with samples in rows and peptides in columns.
  2. adj, the adjacency matrix between peptides and proteins, containing either booleans or 0s and 1s (no preprocessing or simplification is needed for this matrix, Pirat will automatically build the PGs from it).

Slight imputation variations may occur for peptides belonging to very large PGs, because the latter are randomly split into several smaller PGs with fixed size to reduce computational costs. Although this variability is too small to affect imputation quality, we fix the seed in this tutorial such that the user can retrieve exactly the same imputed values when running the notebook again.

set.seed(12345)

One can then impute this dataset with the following line

imp.res <- my_pipeline_llkimpute(subbouyssie)
#> 
#> Call:
#> stats::lm(formula = log(probs) ~ m_ab_sorted[seq(length(probs))])
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.64780 -0.25838  0.06931  0.35373  0.76279 
#> 
#> Coefficients:
#>                                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                      13.1950     1.5687   8.411 1.19e-07 ***
#> m_ab_sorted[seq(length(probs))]  -0.7937     0.0758 -10.471 4.38e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.42 on 18 degrees of freedom
#> Multiple R-squared:  0.859,  Adjusted R-squared:  0.8511 
#> F-statistic: 109.6 on 1 and 18 DF,  p-value: 4.378e-09

The first plot represents the goodness of fit of the inverse-gamma prior, whereas the second one represents the goodness of fit of the missingness mechanism (details on fitting methods are given in Etourneau et al. (2023)). Note that some of these parameters were originally proposed in Chen, Prentice, and Wang (2014), however no methods existed to find then automatically and without relying on heuristics.

The result imp.res is a list that contains:

  1. data.imputed, the imputed log-2 abundance matrix.
  2. params, a list containing parameters Γ and hyperparameters α and β.

You can check the imputed values here…

head(imp.res$data.imputed[ ,seq(5)])
#>                        ELISMAK  EDLLVTK  VVIEPHR  HAGVYIAR 
#> raw_abundance_10amol_1 24.07467 25.08683 24.79494  25.90383
#> raw_abundance_10amol_2 24.05896 25.11780 25.04157  26.20225
#> raw_abundance_10amol_3 23.94210 24.97989 24.67742  25.97934
#> raw_abundance_10amol_4 24.00094 24.93289 24.31135  25.80639
#> raw_abundance_50amol_1 24.05962 25.03968 24.66224  25.86291
#> raw_abundance_50amol_2 24.02732 25.02199 24.65812  25.84511
#>                        DHCIVVGR Carbamidomethyl (C3)
#> raw_abundance_10amol_1                      24.77011
#> raw_abundance_10amol_2                      24.98623
#> raw_abundance_10amol_3                      24.69103
#> raw_abundance_10amol_4                      24.50651
#> raw_abundance_50amol_1                      24.68406
#> raw_abundance_50amol_2                      24.66596

…and the computed parameters here.

imp.res$params
#> $alpha
#>        a 
#> 5.147265 
#> 
#> $beta
#>         b 
#> 0.2335229 
#> 
#> $gamma0
#> (Intercept) 
#>   -13.19496 
#> 
#> $gamma1
#> m_ab_sorted[seq(length(probs))] 
#>                       0.7937527

Note that a positive value for γ1 indicates that a random left-truncation mechanism is present in the dataset.

Intra-PG correlation analysis

Pirat has a diagnosis tool that compares distributions of correlations at random and those from same peptide groups (PGs). We use it here on the complete Ropers2021 dataset.

data(subropers)
plot_pep_correlations(subropers, titlename = "Ropers2021")

We see here that the blue distribution has much more weights on high correlations than the red one, indicating that PGs should clearly help for imputation.

Pirat extensions

To handle singleton PGs, Pirat proposes three extensions, on top of the classical Pirat approach. Note that the -T extensions can be applied to up to an arbitrary PG size. To illustrate our examples, we use a subset of Ropers2021 dataset.

-2, the 2-peptide rule

The -2 extension simply consists in not imputing singleton PGs. It can be used as following.

data(subropers)
imp.res = pipeline_llkimpute(subropers, extension = "2")
#> 
#> Call:
#> stats::lm(formula = log(probs) ~ m_ab_sorted[seq(length(probs))])
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.40672 -0.15008  0.08833  0.28986  0.84125 
#> 
#> Coefficients:
#>                                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                     14.40677    1.19049   12.10 2.48e-16 ***
#> m_ab_sorted[seq(length(probs))] -0.91419    0.06469  -14.13  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4549 on 49 degrees of freedom
#> Multiple R-squared:  0.803,  Adjusted R-squared:  0.799 
#> F-statistic: 199.7 on 1 and 49 DF,  p-value: < 2.2e-16

We can then check that singleton peptides are not imputed (yet some may be already fully observed).

mask.sing.pg = colSums(subropers$adj) == 1
mask.sing.pep = rowSums(subropers$adj[, mask.sing.pg]) >= 1
imp.res$data.imputed[, mask.sing.pep]
#>         NALPEGYAPAK__2 AIGIISNNPEFADVR__2 DLNIDPATLR__2
#> W1_1_MS       20.17349                 NA      19.18247
#> W1_2_MS       19.79753           17.22690      18.50307
#> W1_3_MS       20.07592           17.55627      18.83505
#> W2_1_MS       19.82019           17.42375      18.47821
#> W2_2_MS       20.17537           17.13305      17.89070
#> W2_3_MS       20.36344           17.50759      18.18041
#> R1_1_MS       20.69017                 NA      18.12682
#> R1_2_MS       20.70866                 NA      18.03316
#> R1_3_MS       21.04542                 NA      18.52355
#> R2_1_MS       20.95692                 NA      18.49510
#> R2_2_MS       20.90775           15.87217      17.92845
#> R2_3_MS       21.20864                 NA      18.24418
#> R3_1_MS       21.02669                 NA            NA
#> R3_2_MS       21.10948                 NA      17.34890
#> R3_3_MS       21.23109                 NA            NA
#> R4_1_MS       20.83249                 NA      17.28032
#> R4_2_MS       21.00524                 NA      17.67013
#> R4_3_MS       20.57447                 NA      17.33383

-S, samples-wise correlations

Pirat can leverage sample-wise correlations to impute the singleton peptides as following:

imp.res = my_pipeline_llkimpute(subropers, extension = "S")
#> 
#> Call:
#> stats::lm(formula = log(probs) ~ m_ab_sorted[seq(length(probs))])
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.40672 -0.15008  0.08833  0.28986  0.84125 
#> 
#> Coefficients:
#>                                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                     14.40677    1.19049   12.10 2.48e-16 ***
#> m_ab_sorted[seq(length(probs))] -0.91419    0.06469  -14.13  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4549 on 49 degrees of freedom
#> Multiple R-squared:  0.803,  Adjusted R-squared:  0.799 
#> F-statistic: 199.7 on 1 and 49 DF,  p-value: < 2.2e-16

Here singleton peptides are impute after the rest of the dataset, using sample-wise correlations obtained.

mask.sing.pg = colSums(subropers$adj) == 1
mask.sing.pep = rowSums(subropers$adj[, mask.sing.pg]) >= 1
imp.res$data.imputed[, mask.sing.pep]
#>         NALPEGYAPAK__2 AIGIISNNPEFADVR__2 DLNIDPATLR__2
#> W1_1_MS       20.17349           17.77321      19.18247
#> W1_2_MS       19.79753           17.22690      18.50307
#> W1_3_MS       20.07592           17.55627      18.83505
#> W2_1_MS       19.82019           17.42375      18.47821
#> W2_2_MS       20.17537           17.13305      17.89070
#> W2_3_MS       20.36344           17.50759      18.18041
#> R1_1_MS       20.69017           16.01870      18.12682
#> R1_2_MS       20.70866           16.28692      18.03316
#> R1_3_MS       21.04542           16.34702      18.52355
#> R2_1_MS       20.95692           16.10669      18.49510
#> R2_2_MS       20.90775           15.87217      17.92845
#> R2_3_MS       21.20864           16.43660      18.24418
#> R3_1_MS       21.02669           15.87970      17.69229
#> R3_2_MS       21.10948           16.04076      17.34890
#> R3_3_MS       21.23109           16.01596      18.00233
#> R4_1_MS       20.83249           16.22859      17.28032
#> R4_2_MS       21.00524           16.03777      17.67013
#> R4_3_MS       20.57447           15.87691      17.33383

-T, transcriptomic integration

The last extension consists in using correlations between peptides and gene/transcript expression obtained from a related transcriptomic analysis. To use this extension, the list of the dataset must contain:

  • rnas_ab, an log2-normalized-count table of gene or transcript expression, for which samples are either paired or related (i.e., from the same experimental/biological conditions).
  • adj_rna_pg, a adjacency matrix between transcripts or genes and PGs, containing either booleans or 0s and 1s.

ropers proteomic and transcriptomic samples are paired (i.e. the same biological samples were used for each type of analysis). Thus Pirat-T can be used as following:

imp.res = my_pipeline_llkimpute(subropers,
    extension = "T",
    rna.cond.mask = seq(nrow(subropers$peptides_ab)),
    pep.cond.mask = seq(nrow(subropers$peptides_ab)),
    max.pg.size.pirat.t = 1)
#> 
#> Call:
#> stats::lm(formula = log(probs) ~ m_ab_sorted[seq(length(probs))])
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.40672 -0.15008  0.08833  0.28986  0.84125 
#> 
#> Coefficients:
#>                                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                     14.40677    1.19049   12.10 2.48e-16 ***
#> m_ab_sorted[seq(length(probs))] -0.91419    0.06469  -14.13  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4549 on 49 degrees of freedom
#> Multiple R-squared:  0.803,  Adjusted R-squared:  0.799 
#> F-statistic: 199.7 on 1 and 49 DF,  p-value: < 2.2e-16

Only few peptides have been used to fit the prior variance distribution in Σ, as we use a small subset from the original Ropers2021 dataset. Thus the goodness of fit may vary a lot depending on the subset chosen.

It gives the following imputed singletons:

mask.sing.pg = colSums(subropers$adj) == 1
mask.sing.pep = rowSums(subropers$adj[, mask.sing.pg]) >= 1
imp.res$data.imputed[, mask.sing.pep]
#>         NALPEGYAPAK__2 AIGIISNNPEFADVR__2 DLNIDPATLR__2
#> W1_1_MS       20.17349           17.16826      19.18247
#> W1_2_MS       19.79753           17.22690      18.50307
#> W1_3_MS       20.07592           17.55627      18.83505
#> W2_1_MS       19.82019           17.42375      18.47821
#> W2_2_MS       20.17537           17.13305      17.89070
#> W2_3_MS       20.36344           17.50759      18.18041
#> R1_1_MS       20.69017           16.07105      18.12682
#> R1_2_MS       20.70866           15.69253      18.03316
#> R1_3_MS       21.04542           15.95376      18.52355
#> R2_1_MS       20.95692           15.74790      18.49510
#> R2_2_MS       20.90775           15.87217      17.92845
#> R2_3_MS       21.20864           15.54095      18.24418
#> R3_1_MS       21.02669           15.70025      17.84194
#> R3_2_MS       21.10948           15.83525      17.34890
#> R3_3_MS       21.23109           15.81318      17.82150
#> R4_1_MS       20.83249           15.99150      17.28032
#> R4_2_MS       21.00524           16.14081      17.67013
#> R4_3_MS       20.57447           17.69132      17.33383

On the other hand, if proteomic and transcriptomic samples are not paired but are derived from a same biological/experimental condition. Pirat-T can be used by adapting the mask related to samples in each type of analysis (here, both proteomic and transcriptomic datasets have 6 different conditions in the same order with 3 replicates each):

imp.res = my_pipeline_llkimpute(subropers,
                             extension = "T",
                             rna.cond.mask = rep(seq(6), each = 3),
                             pep.cond.mask = rep(seq(6), each = 3),
                             max.pg.size.pirat.t = 1)
#> 
#> Call:
#> stats::lm(formula = log(probs) ~ m_ab_sorted[seq(length(probs))])
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.40672 -0.15008  0.08833  0.28986  0.84125 
#> 
#> Coefficients:
#>                                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                     14.40677    1.19049   12.10 2.48e-16 ***
#> m_ab_sorted[seq(length(probs))] -0.91419    0.06469  -14.13  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4549 on 49 degrees of freedom
#> Multiple R-squared:  0.803,  Adjusted R-squared:  0.799 
#> F-statistic: 199.7 on 1 and 49 DF,  p-value: < 2.2e-16

Also, it is possible to apply transcriptomic integration up to an arbitrary size of PG, simply by changing parameter max.pg.size.pirat.t in my_pipeline_llkimpute() to the desired limit PG size (e.g. +Inf for whole dataset).

5. Session info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] Pirat_1.1.0      BiocStyle_2.35.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.37.0 gtable_0.3.6               
#>  [3] dir.expiry_1.15.0           xfun_0.49                  
#>  [5] bslib_0.8.0                 ggplot2_3.5.1              
#>  [7] Biobase_2.67.0              lattice_0.22-6             
#>  [9] vctrs_0.6.5                 tools_4.4.2                
#> [11] generics_0.1.3              stats4_4.4.2               
#> [13] parallel_4.4.2              tibble_3.2.1               
#> [15] pkgconfig_2.0.3             Matrix_1.7-1               
#> [17] S4Vectors_0.45.2            lifecycle_1.0.4            
#> [19] GenomeInfoDbData_1.2.13     farver_2.1.2               
#> [21] compiler_4.4.2              progress_1.2.3             
#> [23] munsell_0.5.1               GenomeInfoDb_1.43.2        
#> [25] htmltools_0.5.8.1           sys_3.4.3                  
#> [27] buildtools_1.0.0            sass_0.4.9                 
#> [29] yaml_2.3.10                 pillar_1.10.0              
#> [31] crayon_1.5.3                jquerylib_0.1.4            
#> [33] MASS_7.3-61                 DelayedArray_0.33.3        
#> [35] cachem_1.1.0                abind_1.4-8                
#> [37] basilisk_1.19.0             digest_0.6.37              
#> [39] labeling_0.4.3              maketools_1.3.1            
#> [41] fastmap_1.2.0               grid_4.4.2                 
#> [43] colorspace_2.1-1            cli_3.6.3                  
#> [45] invgamma_1.1                SparseArray_1.7.2          
#> [47] magrittr_2.0.3              S4Arrays_1.7.1             
#> [49] withr_3.0.2                 prettyunits_1.2.0          
#> [51] filelock_1.0.3              scales_1.3.0               
#> [53] UCSC.utils_1.3.0            rmarkdown_2.29             
#> [55] XVector_0.47.0              httr_1.4.7                 
#> [57] matrixStats_1.4.1           reticulate_1.40.0          
#> [59] png_0.1-8                   hms_1.1.3                  
#> [61] evaluate_1.0.1              knitr_1.49                 
#> [63] GenomicRanges_1.59.1        IRanges_2.41.2             
#> [65] basilisk.utils_1.19.0       rlang_1.1.4                
#> [67] Rcpp_1.0.13-1               glue_1.8.0                 
#> [69] BiocManager_1.30.25         BiocGenerics_0.53.3        
#> [71] jsonlite_1.8.9              R6_2.5.1                   
#> [73] MatrixGenerics_1.19.0       zlibbioc_1.52.0

References

Chen, Lin S., Ross L. Prentice, and Pei Wang. 2014. A penalized EM algorithm incorporating missing data mechanism for Gaussian parameter estimation.” Biometrics 70 (2): 312–22. https://doi.org/10.1111/BIOM.12149.
Etourneau, Lucas, Laura Fancello, Samuel Wieczorek, Nelle Varoquaux, and Thomas Burger. 2023. A new take on missing value imputation for bottom-up label-free LC-MS/MS proteomics.” bioRxiv, January, 2023.11.09.566355. https://doi.org/10.1101/2023.11.09.566355.