---
title: "Pirat"
author:
- name: Lucas Etourneau
- name: Samuel Wieczorek
package: Pirat
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc_float: true
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Pirat-vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography:
- biblio_vignette.bib
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = TRUE,
  warning = FALSE,
  message = FALSE
)
```

# Installation

`r Biocpkg("Pirat")` is an `R` package available via the [Bioconductor](http://bioconductor.org) repository for packages. You can install the development version from [GitHub](https://github.com/):

```{r install_dev, eval = FALSE}
BiocManager::install("prostarproteomics/Pirat")
```

When one of the imputation methods is called for the first time after package installation, the underlying Python module and conda environment are automatically downloaded and installed without the need for user intervention. This may take several minutes, but it is a one-time installation.

# Standard imputation with Pirat

Pirat is a single imputation pipeline including a novel statistical approach dedicated to bottom-up proteomics data. All the technical details and the validation procedure of this method are available in the corresponding preprint @Etourneau2023.

To demonstrate its usage, we first load the Pirat package and a subset of the Bouyssie2020 dataset in the environment.

```{r load_data, echo = TRUE}
library(Pirat)
library(utils)
data(subbouyssie)
```

Note that `subbouyssie` is actually a list that contains two elements:

1. `peptides_ab`, the matrix of peptide log-2-abundances, with samples in rows and peptides in columns.
2. `adj`, the adjacency matrix between peptides and proteins, containing either booleans or 0s and 1s (no preprocessing or simplification is needed for this matrix; Pirat automatically builds the peptide groups, or PGs, from it).

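
As a minimal sketch of this input layout, the following base R snippet builds a hypothetical toy list with the same two elements. The peptide names, protein names, dimensions and values are all invented for illustration and are not taken from `subbouyssie`. The orientation of `adj` (peptides in rows, proteins in columns) is an assumption, inferred from the way `adj` is indexed in the extension sections of this vignette.

```r
# Hypothetical toy input mimicking the layout described above (not real data).
set.seed(1)
n.samples <- 4
pep.names <- paste0("pep", 1:5)
prot.names <- c("protA", "protB", "protC")

# Log2 abundances: samples in rows, peptides in columns, NA = missing value.
peptides_ab <- matrix(rnorm(n.samples * length(pep.names), mean = 20, sd = 2),
                      nrow = n.samples,
                      dimnames = list(paste0("s", 1:n.samples), pep.names))
peptides_ab[1, 2] <- NA
peptides_ab[3, 5] <- NA

# Adjacency matrix (assumed orientation: peptides in rows, proteins in columns).
adj <- matrix(FALSE, nrow = length(pep.names), ncol = length(prot.names),
              dimnames = list(pep.names, prot.names))
adj[c("pep1", "pep2"), "protA"] <- TRUE
adj[c("pep2", "pep3", "pep4"), "protB"] <- TRUE  # pep2 is shared by two proteins
adj["pep5", "protC"] <- TRUE

toy <- list(peptides_ab = peptides_ab, adj = adj)

# Basic consistency checks one may want to run before imputation.
stopifnot(ncol(toy$peptides_ab) == nrow(toy$adj))  # same peptides in both elements
stopifnot(all(rowSums(toy$adj) >= 1))              # every peptide maps to a protein
sum(is.na(toy$peptides_ab))                        # number of missing values
```

Checks of this kind are cheap and catch the most common mistake, namely a peptide order that differs between the abundance matrix and the adjacency matrix.
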

Slight imputation variations may occur for peptides belonging to very large PGs, because the latter are randomly split into several smaller PGs of fixed size to reduce computational costs. Although this variability is too small to affect imputation quality, we fix the seed in this tutorial so that the user retrieves exactly the same imputed values when running the notebook again.

```{r setseed}
set.seed(12345)
```

One can then impute this dataset with the following line:

```{r impute}
imp.res <- my_pipeline_llkimpute(subbouyssie)
```

The first plot represents the goodness of fit of the inverse-gamma prior, whereas the second one represents the goodness of fit of the missingness mechanism (details on fitting methods are given in @Etourneau2023). Note that some of these parameters were originally proposed in @Chen2014; however, no method existed to find them automatically and without relying on heuristics.

The result `imp.res` is a list that contains:

1. `data.imputed`, the imputed log-2 abundance matrix.
2. `params`, a list containing parameters $\Gamma$ and hyperparameters $\alpha$ and $\beta$.

You can check the imputed values here...

```{r test1}
head(imp.res$data.imputed[, seq(5)])
```

...and the computed parameters here.

```{r params}
imp.res$params
```

Note that a positive value for $\gamma_1$ indicates that a random left-truncation mechanism is present in the dataset.

# Intra-PG correlation analysis

Pirat has a diagnosis tool that compares the distribution of correlations between randomly chosen peptides with that of correlations between peptides from the same peptide group (PG). We use it here on a subset of the Ropers2021 dataset.

```{r correlations}
data(subropers)
plot_pep_correlations(subropers, titlename = "Ropers2021")
```

We see here that the blue distribution puts much more weight on high correlations than the red one, indicating that PGs should clearly help for imputation.

# Pirat extensions

To handle singleton PGs, Pirat proposes three extensions, on top of the classical Pirat approach.
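
As a minimal sketch of what a singleton PG is, the following base R snippet flags PGs that contain a single peptide, together with the peptides they contain. The toy adjacency matrix is hypothetical, and the snippet assumes for simplicity that each column of the matrix is one PG, which is the same simplification as the masks used in the checks of this vignette.

```r
# Hypothetical toy adjacency matrix: peptides in rows, PGs in columns.
# matrix() fills column by column, so each group of four values is one PG.
adj <- matrix(c(1, 1, 0, 0,   # pg1 contains pep1 and pep2
                0, 0, 1, 0,   # pg2 contains pep3 only (singleton)
                0, 1, 0, 1),  # pg3 contains pep2 and pep4
              nrow = 4,
              dimnames = list(paste0("pep", 1:4), paste0("pg", 1:3)))

# A PG is a singleton when exactly one peptide maps to it.
mask.sing.pg <- colSums(adj) == 1

# Peptides belonging to at least one singleton PG.
# drop = FALSE keeps a matrix even when a single PG is selected.
mask.sing.pep <- rowSums(adj[, mask.sing.pg, drop = FALSE]) >= 1

names(which(mask.sing.pg))   # singleton PGs
names(which(mask.sing.pep))  # peptides in singleton PGs
```

Such peptides have no sibling peptide in their PG to borrow information from, which is the situation the extensions address.
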
Note that the -T extension can be applied to PGs up to an arbitrary size. To illustrate our examples, we use a subset of the Ropers2021 dataset.

# -2, the 2-peptide rule

The -2 extension simply consists in not imputing singleton PGs. It can be used as follows.

```{r pipeline_llkimpute}
data(subropers)
imp.res <- pipeline_llkimpute(subropers, extension = "2")
```

We can then check that singleton peptides are not imputed (yet some may be already fully observed).

```{r impute4}
# PGs containing exactly one peptide
mask.sing.pg <- colSums(subropers$adj) == 1
# Peptides belonging to a singleton PG (drop = FALSE keeps a matrix
# even if only one PG is selected)
mask.sing.pep <- rowSums(subropers$adj[, mask.sing.pg, drop = FALSE]) >= 1
imp.res$data.imputed[, mask.sing.pep]
```

# -S, sample-wise correlations

Pirat can leverage sample-wise correlations to impute the singleton peptides as follows:

```{r my_pipeline_llkimpute2}
imp.res <- my_pipeline_llkimpute(subropers, extension = "S")
```

Here singleton peptides are imputed after the rest of the dataset, using the sample-wise correlations thus obtained.

```{r impute2}
mask.sing.pg <- colSums(subropers$adj) == 1
mask.sing.pep <- rowSums(subropers$adj[, mask.sing.pg, drop = FALSE]) >= 1
imp.res$data.imputed[, mask.sing.pep]
```

# -T, transcriptomic integration

The last extension consists in using correlations between peptides and gene/transcript expression obtained from a related transcriptomic analysis. To use this extension, the list of the dataset must contain:

* `rnas_ab`, a log2-normalized count table of gene or transcript expression, for which samples are either paired or related (*i.e.*, from the same experimental/biological conditions).
* `adj_rna_pg`, an adjacency matrix between transcripts or genes and PGs, containing either booleans or 0s and 1s.

In `ropers`, proteomic and transcriptomic samples are paired (*i.e.*, the same biological samples were used for each type of analysis).
Thus Pirat-T can be used as follows:

```{r my_pipeline_llkimpute3}
imp.res <- my_pipeline_llkimpute(subropers,
  extension = "T",
  rna.cond.mask = seq(nrow(subropers$peptides_ab)),
  pep.cond.mask = seq(nrow(subropers$peptides_ab)),
  max.pg.size.pirat.t = 1)
```

Only a few peptides have been used to fit the prior variance distribution in $\Sigma$, as we use a small subset of the original Ropers2021 dataset. Thus the goodness of fit may vary a lot depending on the subset chosen.

It gives the following imputed singletons:

```{r data.imputed3}
mask.sing.pg <- colSums(subropers$adj) == 1
mask.sing.pep <- rowSums(subropers$adj[, mask.sing.pg, drop = FALSE]) >= 1
imp.res$data.imputed[, mask.sing.pep]
```

On the other hand, if proteomic and transcriptomic samples are not paired but are derived from the same biological/experimental conditions, Pirat-T can be used by adapting the mask related to samples in each type of analysis (here, both proteomic and transcriptomic datasets have 6 different conditions in the same order with 3 replicates each):

```{r my_pipeline_llkimpute_T}
imp.res <- my_pipeline_llkimpute(subropers,
  extension = "T",
  rna.cond.mask = rep(seq(6), each = 3),
  pep.cond.mask = rep(seq(6), each = 3),
  max.pg.size.pirat.t = 1)
```

Also, it is possible to apply transcriptomic integration up to an arbitrary PG size, simply by changing the parameter `max.pg.size.pirat.t` in `my_pipeline_llkimpute()` to the desired PG size limit (*e.g.*, `+Inf` for the whole dataset).

# Session info

```{r sessionInfo}
sessionInfo()
```

# References