--- title: "IsoBayes" author: - name: Jordy Bollon affiliation: - Research Center dedicated to Personalized, Preventive and Predictive Medicine (CMP3VDA), Computational Department, Aosta, Italy email: jordy.bollon@iit.it - name: Simone Tiberi affiliation: - Department of Statistical Sciences, University of Bologna, Bologna, Italy email: simone.tiberi@unibo.it package: "`r BiocStyle::pkg_ver('IsoBayes')`" date: "`r format(Sys.Date(), '%m/%d/%Y')`" bibliography: References.bib vignette: > %\VignetteIndexEntry{IsoBayes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} header-includes: - \usepackage{csquotes} output: BiocStyle::html_document --- ```{r setup, echo=FALSE, results="hide"} knitr::opts_chunk$set( tidy = FALSE, cache = TRUE, dev = "png", message = TRUE, error = FALSE, warning = TRUE ) ``` # Introduction *IsoBayes* is a Bayesian method to perform inference on single protein isoforms. Our approach infers the presence/absence of protein isoforms, and also estimates their abundance; additionally, it provides a measure of the uncertainty of these estimates, via: i) the posterior probability that a protein isoform is present in the sample; ii) a posterior credible interval of its abundance. *IsoBayes* inputs liquid cromatography mass spectrometry (MS) data, and can work with both PSM counts, and intensities. When available, trascript isoform abundances (i.e., TPMs) are also incorporated: TPMs are used to formulate an informative prior for the respective protein isoform relative abundance. We further identify isoforms where the relative abundance of proteins and transcripts significantly differ. We use a two-layer latent variable approach to model two sources of uncertainty typical of MS data: i) peptides may be erroneously detected (even when absent); ii) many peptides are compatible with multiple protein isoforms. In the first layer, we sample the presence/absence of each peptide based on its estimated probability of being mistakenly detected, also known as PEP (i.e., posterior error probability). In the second layer, for peptides that were estimated as being present, we allocate their abundance across the protein isoforms they map to. These two steps allow us to recover the presence and abundance of each protein isoform. ## Bioconductor installation *IsoBayes* is available on [Bioconductor](https://bioconductor.org/packages/IsoBayes) and can be installed with the command: ```{r Bioconductor_installation, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("IsoBayes") ``` To access the R code used in the vignettes, type: ```{r vignettes, eval=FALSE} browseVignettes("IsoBayes") ``` # Questions, issues and citation Questions relative to *IsoBayes* should be reported as a new issue at *[BugReports](https://github.com/SimoneTiberi/IsoBayes/issues)*. To cite *IsoBayes*, type: ```{r citation} citation("IsoBayes") ``` # Load the package Load *IsoBayes*: ```{r load, message=FALSE} library(IsoBayes) ``` # Key options ## Input data *IsoBayes* works with the output of both *MetaMorpheus* (MM) [@MetaMorpheus], or *Percolator* [@Percolator] (via the *OpenMS* toolkit [@OpenMS]). Additionally, users can also provide MS data obtained from any bioinformatics tool, as long as the input files follow the structure mentioned in the [Input user-provided data](#input-user-provided-data) Section. In our benchmarks, we tested our model using both *MetaMorpheus* and *Percolator* data, and obtained slightly better results, and a shorter runtime with *MetaMorpheus*. At the bottom of the vignette, in Section [*OpenMS* and *Metamorpheus* pipeline](#openms-and-metamorpheus-pipeline) we provide example scripts for both *OpenMS* and *Metamorpheus*. In this tutorial, we use a small dataset created by *MetaMorpheus*. The code to generate the data can be found [here](https://github.com/SimoneTiberi/IsoBayes/blob/main/inst/script/extdata.Rmd). ## PSM counts vs. intensities We can run *IsoBayes* on either PSM counts or intensities. In our benchmarks, we found that, although results are consistent between the two types of input data, intensities led to a marginal improvement in performance. ## TPMs (optional) If available, transcriptomics data (from short of long reads), can also be integrated in the model in the form of TPMs; this will enhance protein isoform inference. To correctly match proteomics and transcriptomics data, transcript isoform and protein isoform names must be consistent. Here, we also specify the path to TPMs (optional). TPMs must be stored in a `data.frame` object (with columns `isoname` and `tpm`), or in a `.tsv` file (with headers `isoname` and `tpm`); in either case, there should be 1 row per isoform, and 2 columns: * 'isoname': a character string indicating the isoform name; * 'tpm': a numeric variable indicating the TPM counts. Column names must be 'isoname' and 'tpm'. We set the directory of the data (internal in the package): ```{r specify extdata path} data_dir = system.file("extdata", package = "IsoBayes") ``` We set the path (optional) to the TPMs: ```{r set path mRNA} tpm_path = paste0(data_dir, "/jurkat_isoform_kallisto.tsv") ``` ## PEP vs. FDR cutoff Additionally, we suggest to provide the probability that peptides are erroneously detected (usually called PEP): *IsoBayes* will use the PEP to sample the presence/absence of each peptide; this will propagate the peptide identification uncertainty throughout the inference process. Alternatively, unreliable peptides can be discarded by specifying a peptide False Discovery Rate (FDR) threshold (also known as qvalue). We suggest using PEP with a weak FDR threshold of 0.1 (default parameters options). This is because peptides with FDR > 0.1 are usually unreliable, and associated to high error probabilities (e.g., PEP > 0.9). In our benchmarks, we found that using `PEP` (with and FDR threshold of 0.1) provides a (minor) increase in performace compared to the classical FDR thresholding (0.01 threshold), at the cost of a (marginally) higher runtime. If we want to disable the PEP integration, and filter based on the FDR, we can set `PEP = FALSE` and specify a more stringent FDR cutoff, e.g., as `FDR_thd = 0.01`. # Generate `SummarizedExperiment` object Before running the IsoBayes model, we structure all the required input data into a `SummarizedExperiment` object. Such object can be created in multiple ways; ## Input *MetaMorpheus* data We set the path to the AllPeptides.psmtsv file returned by *MetaMorpheus*: ```{r set path MM psm} path_to_peptides_psm = paste0(data_dir, "/AllPeptides.psmtsv") ``` ### Peptide-Spectrum Match (PSM) counts The AllPeptides.psmtsv file contains all the information required to run *IsoBayes* with PSM counts. First, we load the file and we convert it into a `SummarizedExperiment` object via `generate_SE` function. Then, we preprocess the input data using the `input_data` function: ```{r load psm} SE = generate_SE(path_to_peptides_psm = path_to_peptides_psm, abundance_type = "psm", input_type = "metamorpheus") SE ``` ### Intensities If we want to run the algorithm with peptide intensities, in addition to AllPeptides.psmtsv, we also need to load a second file generated by MM: AllQuantifiedPeptides.tsv. Then, we have to set `abundance_type` argument to "intensities"; otherwise, the `generate_SE` function will ignore the AllQuantifiedPeptides.tsv file. ```{r set path MM intensities} path_to_peptides_intensities = paste0(data_dir, "/AllQuantifiedPeptides.tsv") SE = generate_SE(path_to_peptides_psm = path_to_peptides_psm, path_to_peptides_intensities = path_to_peptides_intensities, abundance_type = "intensities", input_type = "metamorpheus") ``` ## Input *Percolator* data *IsoBayes* is also compatible with the PSM file from *Percolator* (from the *OpenMS* toolkit). The [README](https://github.com/SimoneTiberi/IsoBayes/tree/main#readme) shows how to use *OpenMS* tools to generate an idXML file containing PSM data. Once the file has been created, we can pass its path to `input_data` and set `input_type` as "openMS". ```{r set path OpenMS, eval = FALSE} SE = generate_SE(path_to_peptides_psm = "/path/to/file.idXML", abundance_type = "psm", input_type = "openMS") ``` Please note that when using data generated by *Percolator*, the algorithm can only process PSM counts and not intensities. ## Input user provided-data We can also input MS data obtained from any bioinformatics tool. The data can be organized in a `.tsv` file, a `data.frame` or a `SummarizedExperiment`. ### `.tsv` or `data.frame` format In both `.tsv` and `data.frame` files, each row corresponds to a peptide, and columns refer to: * 'Y': a numeric variable indicating the peptide abundance (PSM counts or intensities, as defined by the user); * 'EC': Equivalent Classes, a character string indicating the isoform(s) name the peptide maps to. If the peptide maps to multiple protein isoforms, the names must be separated with "|" , i.e. "name_isoform_1|name_isoform_2"; * 'FDR': (optional) a numeric variable indicating the FDR of each peptide; * 'PEP': (optional) a numeric variable indicating the probability that a peptide is erroneously detected; * 'sequence': (required when using PEP) a character string indicating the peptide name/id/amino acids sequence. Note that, when using user-provided data, argument `path_to_peptides_intensities` is not considered, because users are free to set column `Y` to PSM counts or intensities. To load user-provided data, we just need to pass the file path, the `data.frame` or the `SummarizedExperiment` object and set `input_type` as "other" . ```{r set user data, eval = FALSE} # X can be a path to a .tsv file or a data.frame SE = generate_SE(path_to_peptides_psm = X, abundance_type = "psm", input_type = "other") ``` ### SummarizedExperiment format If the data is already stored in a `SummarizedExperiment`, it can be passed to `input_data` function directly. The `SummarizedExperiment` should contain `assays`, `colData` and `metadata` with the structure specified below. Object `assays`: * `Y`; a numeric vector indicating the peptide abundance (PSM counts or intensities). Object `colData`, a `DataFrame` object with columns: * 'EC': Equivalent Classes, a character string indicating the isoform(s) name the peptide maps to. If the peptide maps to multiple protein isoforms, the names must be separated with "|" , i.e. "name_isoform_1|name_isoform_2"; * 'PEP': (optional) a numeric variable indicating the probability that a peptide is erroneously detected; * 'sequence': (required when using PEP) a character string indicating the peptide name/id/amino acids sequence. Object `metadata`, a list of objects: * 'PEP': logical; if TRUE, the algorithm will account for the probability that peptides are erroneously detected; * 'protein_name': a character vector indicating the protein isoforms name. IMPORTANT: `input_data` does not filter peptides based on their FDR, since it assumes that unreliable peptides have been already discarded. Therefore, FDR filtering must be performed by users before the `SE` oject is provided to `input_data`. # Input and pre-process data Once we have generated an `SE` object, we can process it, together with the TPMs (if available), via `input_data`. For this vignette, we use the `SE` object generated above loading the output from *MetaMorpheus*. ```{r input_data} data_loaded = input_data(SE, path_to_tpm = tpm_path) ``` # Inference Once we have loaded the data, we can run the algorithm using the `inference` function. ```{r inference, eval = FALSE} set.seed(169612) results = inference(data_loaded) ``` By default, *IsoBayes* uses one single core. For large datasets, to speed up the execution time, the number of cores can be increased via the `n_cores` argument. ## Gene Normalization (optional) In order to analyse alternative splicing within a specific gene, we may want to normalize the estimated relative abundances for each set of protein isoforms that maps to a gene. To this aim, we need to input a csv file containing two columns denoting the isoform name and the gene name. Alternatively, if isoform-gene ids are already loaded, a data.frame can be used directly. In both cases, the csv file or data.frame must have 2 columns: the 1st column must contain the isoform name/id, while the 2nd column has the gene name/id. ```{r default inference} path_to_map_iso_gene = paste0(data_dir, "/map_iso_gene.csv") set.seed(169612) results_normalized = inference(data_loaded, map_iso_gene = path_to_map_iso_gene, traceplot = TRUE) ``` Specifying the `map_iso_gene` argument, also enables users to plot results via `plot_relative_abundances` function. # Visualizing results Results are stored as a `list` with three `data.frame` objects: 'isoform_results', 'normalized_isoform_results' and 'gene_abundance'. If 'map_iso_gene' was not provided, only 'isoform_results' is returned. ```{r get res} names(results_normalized) ``` Inside the 'isoform_results' `data.frame`, each row corresponds to a protein isoform. The columns report the following results: * 'Prob_present': the probability that the protein isoform is present in the biological sample; * 'Abundance': estimate (posterior mean) of the protein isoform absolute abundance; * 'CI_LB': lower bound of the Highest Posterior Density (HPD) credible interval (0.95 level) for the absolute abundance; * 'CI_UB': upper bound of the HPD credible interval (0.95 level) for the absolute abundance; * 'Pi': estimate (posterior mean) of the protein isoform relative abundance across all isoforms (i.e., the sum of across all isoforms adds to 1); * 'Pi_CI_LB': lower bound of the HPD credible interval (0.95 level) for Pi; * 'Pi_CI_UB': upper bound of the HPD credible interval (0.95 level) for Pi; * 'TPM': transcript isoform abundance, provided by the users as input in `input_data` function; * 'Log2_FC': log2 fold change between protein and transcript isoform relative abundances (i.e., 'Pi' and normalized TPMs); positive (negative) values indicate higher (lower) relative isoform abundance at the protein level, compared to the transcript level; * 'Prob_prot_inc': estimated probability that the relative abundance of the protein isoform is greater than the relative abundance of the corresponding transcript isoform (to be interpreted in conjunction with 'Log2_FC'); values towards 1 (towards 0) suggest that isoform relative abundances are higher (lower) at the protein-level than at the transcript-level. ```{r get iso res} head(results_normalized$isoform_results) ``` In 'normalized_isoform_results' `data.frame`, we report the protein isoform relative abundances, normalized within each gene (i.e., adding to 1 within a gene). ```{r get iso res norm} head(results_normalized$normalized_isoform_results) ``` In 'gene_abundance' `data.frame`, for each gene (row) we return: * 'Abundance': estimate (posterior mean) of the gene absolute abundance; * 'CI_LB': lower bound of the HPD credible interval (0.95 level) for the gene absolute abundance; * 'CI_UB': upper bound of the HPD credible interval (0.95 level) for the gene absolute abundance. ```{r get iso res norm gene} head(results_normalized$gene_abundance) ``` Finally, *IsoBayes* provides the `plot_relative_abundances` function to visualize protein-level and transcript-level relative abundances across the isoforms of a specific gene: ```{r plotting results} plot_relative_abundances(results_normalized, gene_id = "TUBB") ``` By default `plot_relative_abundances` normalizes the relative abundances within genes (again, adding to 1 within a gene). To disable the normalization, set the `normalize_gene` argument to `FALSE`: ```{r plotting results no normalization} plot_relative_abundances(results_normalized, gene_id = "TUBB", normalize_gene = FALSE) ``` Note that `plot_relative_abundances` can be used only when, in the `map_iso_gene` argument of the `inference` function, we provide a csv file that maps the protein isoforms to the corresponding gene (`path_to_map_iso_gene` in this case). # Assessing convergence via traceplots If `inference` function was run with parameter `traceplot = TRUE`, we can visualize the traceplot of the posterior chains of the relative abundances of each protein isoform (i.e., `PI`). We use the `plot_traceplot` function to plot the 3 isoforms of the gene `TUBB`: ```{r traceplot} plot_traceplot(results_normalized, "TUBB-205") plot_traceplot(results_normalized, "TUBB-206") plot_traceplot(results_normalized, "TUBB-208") ``` The vertical grey dashed line indicates the burn-in (the iterations on the left side of the burn-in are discarded in posterior analyses). Note that, although we are using normalized results, gene-level information is ignored in the traceplot. # Session info ```{r sessionInfo} sessionInfo() ``` # *OpenMS* and *Metamorpheus* pipeline *IsoBayes* works directly with the output of *MetaMorpheus*, or *Percolator* (via the *OpenMS* toolkit). Additionally, users can also provide MS data obtained from any bioinformatics tool, as long as the input files follow the structure mentioned in the Input user-provided data Section *IsoBayes* works with the output of both *MetaMorpheus* (MM) [@MetaMorpheus], or *Percolator* [@Percolator] (via the *OpenMS* toolkit [@OpenMS]). Additionally, users can also provide MS data obtained from any bioinformatics tool, as long as the input files follow the structure mentioned in the [Input user-provided data](#input-user-provided-data) Section. Below, we provide example scripts for both *OpenMS* and *Metamorpheus*. ## *MetaMorpheus* pipeline To generate the MM output files required to run *IsoBayes*, we need to execute the following commands: * Install *MetaMorpheus* via [*Conda*](https://docs.conda.io/en/latest/miniconda.html): ```shell conda install -c conda-forge metamorpheus ``` * Inside the folder with the configuration (.toml), spectra (.mzML or .raw) and database (.xml) files run: ```shell metamorpheus -t Task1-SearchTaskconfig.toml Task2-CalibrateTaskconfig.toml Task3-SearchTaskconfig.toml Task4-GPTMDTaskconfig.toml Task5-SearchTaskconfig.toml -s 04-30-13_CAST_Frac4_6uL.raw 04-30-13_CAST_Frac5_4uL.raw -d uniprot-mouse-reviewed-1-24-2018.xml.gz uniprot-cRAP-1-24-2018.xml.gz ``` or ```shell metamorpheus -t Task1-SearchTaskconfig.toml Task2-CalibrateTaskconfig.toml Task3-SearchTaskconfig.toml Task4-GPTMDTaskconfig.toml Task5-SearchTaskconfig.toml -s mzML/04-30-13_CAST_Frac4_6uL.mzML mzML/04-30-13_CAST_Frac5_4uL.mzML -d uniprot-mouse-reviewed-1-24-2018.xml.gz uniprot-cRAP-1-24-2018.xml.gz ``` There are several ways to install and run MM. For more details see the MM [tutorial](https://github.com/smith-chem-wisc/MetaMorpheus/wiki/Getting-Started#test-installation-via-net-core-dll---linux-macos-windows), where you can also find the example files used here. ## *Percolator* pipeline We provide a brief pipeline where several *OpenMS* applications are chained together to generate an idXML file required to run *IsoBayes* with *Percolator* output. The pipeline starts from peptide identification results stored in mzID files. First, install *OpenMS* toolkit and *Percolator* tool. For instructions on how to install them on your operating system see [OpenMS Installation](https://openms.readthedocs.io/en/latest/openms-applications-and-tools/installation.html) and [Percolator Installation](https://github.com/percolator/percolator). Next, declare some useful global variable: ``` shell path_to_data=/path/to/mzIDfiles path_out=/path/to/output NTHREADS=4 ENZYME_indexer="Chymotrypsin" ENZYME_percolator="chymotrypsin" DECOY_STRING="mz|DECOY_" fdr=1 ``` Below, we show an example with chymotrypsin enzyme. If the data was generated with another enzyme, please search for the corresponding enzyme in the documentation below, and reset the global variables `ENZYME_indexer` and `ENZYME_percolator` with the correct enzyme. ``` shell PeptideIndexer --help PercolatorAdapter --help ``` This pipeline also assumes that in the `/path/to/mzIDfiles` folder there is a fasta file listing target and decoy protein isoforms. The `DECOY_STRING` allows you to change the string needed to identify a decoy in the fasta file. ``` shell cd $path_out # convert mzID files into idXML files for mz in $path_to_data/*.mzID do IDFileConverter -in $mz -threads $NTHREADS -out $mz.idXML done # merge the files IDMerger -in $path_to_data/*.idXML -threads $NTHREADS -merge_proteins_add_PSMs -out $path_out/merge.idXML rm $path_to_data/*.idXML # index the peptide file with the fasta file PeptideIndexer -in $path_out/merge.idXML -enzyme:name $ENZYME_indexer -threads $NTHREADS -decoy_string_position prefix -decoy_string $DECOY_STRING -fasta $path_to_data/genecodeAndDecoy.fasta -out $path_out/merge_index.idXML rm $path_out/merge.idXML # run percolator PercolatorAdapter -in $path_out/merge_index.idXML -enzyme $ENZYME_percolator -threads $NTHREADS -generic_feature_set -score_type pep -out $path_out/merge_index_percolator_pep.idXML rm $path_out/merge_index.idXML # Estimate the false discovery rate on peptide level using decoy searches and keep the ones with FDR < $fdr FalseDiscoveryRate -in $path_out/merge_index_percolator_pep.idXML -out $path_out/merge_index_percolator_pep_$fdr.idXML -protein false -threads $NTHREADS -FDR:PSM $fdr -algorithm:add_decoy_peptides -algorithm:add_decoy_proteins rm $path_out/merge_index_percolator_pep.idXML # Associate each peptite with Posterior Error Probability score IDScoreSwitcher -in $path_out/merge_index_percolator_pep_$fdr.idXML -out $path_out/merge_index_percolator_pep_switched_$fdr.idXML -new_score 'Posterior Error Probability_score' -new_score_orientation lower_better -new_score_type pep -threads $NTHREADS rm $path_out/merge_index_percolator_pep_$fdr.idXML ``` For more details on *OpenMS* tools see its [Documentation](https://abibuilder.cs.uni-tuebingen.de/archive/openms/Documentation/nightly/html/TOPP_documentation.html). # References