IsoBayes is a Bayesian method to perform inference on single protein isoforms. Our approach infers the presence/absence of protein isoforms, and also estimates their abundance; additionally, it provides a measure of the uncertainty of these estimates, via: i) the posterior probability that a protein isoform is present in the sample; ii) a posterior credible interval of its abundance. IsoBayes inputs liquid cromatography mass spectrometry (MS) data, and can work with both PSM counts, and intensities. When available, trascript isoform abundances (i.e., TPMs) are also incorporated: TPMs are used to formulate an informative prior for the respective protein isoform relative abundance. We further identify isoforms where the relative abundance of proteins and transcripts significantly differ. We use a two-layer latent variable approach to model two sources of uncertainty typical of MS data: i) peptides may be erroneously detected (even when absent); ii) many peptides are compatible with multiple protein isoforms. In the first layer, we sample the presence/absence of each peptide based on its estimated probability of being mistakenly detected, also known as PEP (i.e., posterior error probability). In the second layer, for peptides that were estimated as being present, we allocate their abundance across the protein isoforms they map to. These two steps allow us to recover the presence and abundance of each protein isoform.
IsoBayes is available on Bioconductor and can be installed with the command:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("IsoBayes")
To access the R code used in the vignettes, type:
Questions relative to IsoBayes should be reported as a new issue at BugReports.
To cite IsoBayes, type:
## To cite DifferentialRegulation in publications use:
##
## Jordy Bollon, Michael R. Shortreed, Ben T. Jordan, Rachel Miller,
## Erin Jeffery, Andrea Cavalli, Lloyd M. Smith, Colin Dewey, Gloria M.
## Sheynkman, and Simone Tiberi (2024). IsoBayes: a Bayesian approach
## for single-isoform proteomics inference. bioRxiv. URL
## https://doi.org/10.1101/2024.06.10.598223
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {IsoBayes: a Bayesian approach for single-isoform proteomics inference},
## author = {Jordy Bollon and Michael R. Shortreed and Ben T. Jordan and Rachel Miller and Erin Jeffery and Andrea Cavalli and Lloyd M. Smith and Colin Dewey and Gloria M. Sheynkman and Simone Tiberi},
## eprint = {https://doi.org/10.1101/2024.06.10.598223},
## journal = {bioRxiv},
## year = {2024},
## doi = {10.1101/2024.06.10.598223},
## url = {https://doi.org/10.1101/2024.06.10.598223},
## }
IsoBayes works with the output of both MetaMorpheus (MM) (Solntsev et al. 2018), or Percolator (The et al. 2016) (via the OpenMS toolkit (Röst et al. 2016)). Additionally, users can also provide MS data obtained from any bioinformatics tool, as long as the input files follow the structure mentioned in the Input user-provided data Section. In our benchmarks, we tested our model using both MetaMorpheus and Percolator data, and obtained slightly better results, and a shorter runtime with MetaMorpheus.
At the bottom of the vignette, in Section OpenMS and Metamorpheus pipeline we provide example scripts for both OpenMS and Metamorpheus.
In this tutorial, we use a small dataset created by MetaMorpheus. The code to generate the data can be found here.
We can run IsoBayes on either PSM counts or intensities. In our benchmarks, we found that, although results are consistent between the two types of input data, intensities led to a marginal improvement in performance.
If available, transcriptomics data (from short of long reads), can
also be integrated in the model in the form of TPMs; this will enhance
protein isoform inference. To correctly match proteomics and
transcriptomics data, transcript isoform and protein isoform names must
be consistent. Here, we also specify the path to TPMs (optional). TPMs
must be stored in a data.frame
object (with columns
isoname
and tpm
), or in a .tsv
file (with headers isoname
and tpm
); in either
case, there should be 1 row per isoform, and 2 columns:
We set the directory of the data (internal in the package):
We set the path (optional) to the TPMs:
Additionally, we suggest to provide the probability that peptides are
erroneously detected (usually called PEP): IsoBayes will use
the PEP to sample the presence/absence of each peptide; this will
propagate the peptide identification uncertainty throughout the
inference process. Alternatively, unreliable peptides can be discarded
by specifying a peptide False Discovery Rate (FDR) threshold (also known
as qvalue). We suggest using PEP with a weak FDR threshold of 0.1
(default parameters options). This is because peptides with FDR > 0.1
are usually unreliable, and associated to high error probabilities
(e.g., PEP > 0.9). In our benchmarks, we found that using
PEP
(with and FDR threshold of 0.1) provides a (minor)
increase in performace compared to the classical FDR thresholding (0.01
threshold), at the cost of a (marginally) higher runtime. If we want to
disable the PEP integration, and filter based on the FDR, we can set
PEP = FALSE
and specify a more stringent FDR cutoff, e.g.,
as FDR_thd = 0.01
.
SummarizedExperiment
objectBefore running the IsoBayes model, we structure all the required
input data into a SummarizedExperiment
object. Such object
can be created in multiple ways;
We set the path to the AllPeptides.psmtsv file returned by MetaMorpheus:
The AllPeptides.psmtsv file contains all the information required to
run IsoBayes with PSM counts. First, we load the file and we
convert it into a SummarizedExperiment
object via
generate_SE
function. Then, we preprocess the input data
using the input_data
function:
SE = generate_SE(path_to_peptides_psm = path_to_peptides_psm,
abundance_type = "psm",
input_type = "metamorpheus")
## We found:
## 10647 protein isoforms.
## class: SummarizedExperiment
## dim: 1 8209
## metadata(5): PEP FDR_thd input_type protein_name id_openMS
## assays(1): Y
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(5): EC QValue DCT sequence PEP
If we want to run the algorithm with peptide intensities, in addition
to AllPeptides.psmtsv, we also need to load a second file generated by
MM: AllQuantifiedPeptides.tsv. Then, we have to set
abundance_type
argument to “intensities”; otherwise, the
generate_SE
function will ignore the
AllQuantifiedPeptides.tsv file.
path_to_peptides_intensities = paste0(data_dir, "/AllQuantifiedPeptides.tsv")
SE = generate_SE(path_to_peptides_psm = path_to_peptides_psm,
path_to_peptides_intensities = path_to_peptides_intensities,
abundance_type = "intensities",
input_type = "metamorpheus")
## Peptides with intensity equal to 0:
## 0.568%
## NA's intensity: 30.823%
## We found:
## 10647 protein isoforms.
IsoBayes is also compatible with the PSM file from
Percolator (from the OpenMS toolkit). The README
shows how to use OpenMS tools to generate an idXML file
containing PSM data. Once the file has been created, we can pass its
path to input_data
and set input_type
as
“openMS”.
SE = generate_SE(path_to_peptides_psm = "/path/to/file.idXML",
abundance_type = "psm",
input_type = "openMS")
Please note that when using data generated by Percolator, the algorithm can only process PSM counts and not intensities.
We can also input MS data obtained from any bioinformatics tool. The
data can be organized in a .tsv
file, a
data.frame
or a SummarizedExperiment
.
.tsv
or data.frame
formatIn both .tsv
and data.frame
files, each row
corresponds to a peptide, and columns refer to:
Note that, when using user-provided data, argument
path_to_peptides_intensities
is not considered, because
users are free to set column Y
to PSM counts or
intensities.
To load user-provided data, we just need to pass the file path, the
data.frame
or the SummarizedExperiment
object
and set input_type
as “other” .
If the data is already stored in a SummarizedExperiment
,
it can be passed to input_data
function directly. The
SummarizedExperiment
should contain assays
,
colData
and metadata
with the structure
specified below.
Object assays
:
Y
; a numeric vector indicating the peptide abundance
(PSM counts or intensities).Object colData
, a DataFrame
object with
columns:
Object metadata
, a list of objects:
IMPORTANT: input_data
does not filter peptides based on
their FDR, since it assumes that unreliable peptides have been already
discarded. Therefore, FDR filtering must be performed by users before
the SE
oject is provided to input_data
.
Once we have generated an SE
object, we can process it,
together with the TPMs (if available), via input_data
. For
this vignette, we use the SE
object generated above loading
the output from MetaMorpheus.
## After filtering petides, we will analyze:
## 7106 protein isoforms.
## Percentage of unique peptides:
## 46.49%
Once we have loaded the data, we can run the algorithm using the
inference
function.
By default, IsoBayes uses one single core. For large
datasets, to speed up the execution time, the number of cores can be
increased via the n_cores
argument.
In order to analyse alternative splicing within a specific gene, we may want to normalize the estimated relative abundances for each set of protein isoforms that maps to a gene. To this aim, we need to input a csv file containing two columns denoting the isoform name and the gene name. Alternatively, if isoform-gene ids are already loaded, a data.frame can be used directly. In both cases, the csv file or data.frame must have 2 columns: the 1st column must contain the isoform name/id, while the 2nd column has the gene name/id.
path_to_map_iso_gene = paste0(data_dir, "/map_iso_gene.csv")
set.seed(169612)
results_normalized = inference(data_loaded, map_iso_gene = path_to_map_iso_gene, traceplot = TRUE)
Specifying the map_iso_gene
argument, also enables users
to plot results via plot_relative_abundances
function.
Results are stored as a list
with three
data.frame
objects: ‘isoform_results’,
‘normalized_isoform_results’ and ‘gene_abundance’. If ‘map_iso_gene’ was
not provided, only ‘isoform_results’ is returned.
## [1] "isoform_results" "normalized_isoform_results"
## [3] "gene_abundance" "MCMC"
Inside the ‘isoform_results’ data.frame
, each row
corresponds to a protein isoform. The columns report the following
results:
input_data
function;## Gene Isoform Prob_present Abundance CI_LB CI_UB Pi Pi_CI_LB
## 1 AAGAB AAGAB-201 0.644 1.262 0 3 1.584713e-05 2.383631e-11
## 2 AAGAB AAGAB-203 0.793 1.738 0 3 2.153024e-05 6.702643e-08
## 3 AAK1 AAK1-201 0.736 1.565 0 3 1.489832e-05 3.930209e-26
## 4 AAK1 AAK1-203 0.355 0.701 0 3 6.720064e-06 3.967865e-28
## 5 AAK1 AAK1-211 0.797 3.443 0 6 3.240565e-05 7.724298e-11
## 6 AAK1 AAK1-212 0.624 2.291 0 6 2.185897e-05 3.862298e-15
## Pi_CI_UB TPM Log2_FC Prob_prot_inc
## 1 4.702063e-05 17.284100 -1.55230266 0.050
## 2 5.471843e-05 26.384500 -1.72040513 0.015
## 3 4.220247e-05 1.625010 1.76955056 0.702
## 4 3.228300e-05 0.632271 1.98278141 0.428
## 5 7.634762e-05 11.637600 0.05037318 0.485
## 6 6.822451e-05 7.264770 0.16215810 0.430
In ‘normalized_isoform_results’ data.frame
, we report
the protein isoform relative abundances, normalized within each gene
(i.e., adding to 1 within a gene).
## Gene Isoform Pi_norm Pi_norm_CI_LB Pi_norm_CI_UB Pi_norm_TPM
## 1 AAGAB AAGAB-201 0.42417351 4.935733e-07 0.9559613 0.39580156
## 2 AAGAB AAGAB-203 0.57582649 4.403868e-02 0.9999995 0.60419844
## 3 AAK1 AAK1-201 0.19820016 3.737326e-22 0.5085673 0.07679758
## 4 AAK1 AAK1-203 0.08748612 6.632691e-24 0.4136955 0.02988097
## 5 AAK1 AAK1-211 0.42462808 1.506936e-06 0.8609503 0.54999017
## 6 AAK1 AAK1-212 0.28968564 9.926213e-11 0.8071473 0.34333128
In ‘gene_abundance’ data.frame
, for each gene (row) we
return:
## Gene Abundance CI_LB CI_UB
## 1 AAGAB 3.000 3 3
## 2 AAK1 8.000 8 8
## 3 AAMP 8.991 9 9
## 4 AAR2 8.000 8 8
## 5 AARS1 119.000 119 119
## 6 AARS2 4.000 4 4
Finally, IsoBayes provides the
plot_relative_abundances
function to visualize
protein-level and transcript-level relative abundances across the
isoforms of a specific gene:
By default plot_relative_abundances
normalizes the
relative abundances within genes (again, adding to 1 within a gene). To
disable the normalization, set the normalize_gene
argument
to FALSE
:
Note that plot_relative_abundances
can be used only
when, in the map_iso_gene
argument of the
inference
function, we provide a csv file that maps the
protein isoforms to the corresponding gene
(path_to_map_iso_gene
in this case).
If inference
function was run with parameter
traceplot = TRUE
, we can visualize the traceplot of the
posterior chains of the relative abundances of each protein isoform
(i.e., PI
). We use the plot_traceplot
function
to plot the 3 isoforms of the gene TUBB
:
The vertical grey dashed line indicates the burn-in (the iterations on the left side of the burn-in are discarded in posterior analyses).
Note that, although we are using normalized results, gene-level information is ignored in the traceplot.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] IsoBayes_1.5.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.37.0 gtable_0.3.6
## [3] xfun_0.49 bslib_0.8.0
## [5] ggplot2_3.5.1 Biobase_2.67.0
## [7] lattice_0.22-6 vctrs_0.6.5
## [9] tools_4.4.2 generics_0.1.3
## [11] stats4_4.4.2 parallel_4.4.2
## [13] tibble_3.2.1 fansi_1.0.6
## [15] pkgconfig_2.0.3 Matrix_1.7-1
## [17] data.table_1.16.2 S4Vectors_0.45.2
## [19] rngtools_1.5.2 HDInterval_0.2.4
## [21] lifecycle_1.0.4 GenomeInfoDbData_1.2.13
## [23] farver_2.1.2 compiler_4.4.2
## [25] munsell_0.5.1 codetools_0.2-20
## [27] GenomeInfoDb_1.43.2 htmltools_0.5.8.1
## [29] sys_3.4.3 buildtools_1.0.0
## [31] sass_0.4.9 yaml_2.3.10
## [33] pillar_1.9.0 crayon_1.5.3
## [35] jquerylib_0.1.4 DelayedArray_0.33.2
## [37] cachem_1.1.0 doRNG_1.8.6
## [39] iterators_1.0.14 abind_1.4-8
## [41] foreach_1.5.2 digest_0.6.37
## [43] labeling_0.4.3 maketools_1.3.1
## [45] fastmap_1.2.0 grid_4.4.2
## [47] colorspace_2.1-1 cli_3.6.3
## [49] SparseArray_1.7.2 magrittr_2.0.3
## [51] S4Arrays_1.7.1 utf8_1.2.4
## [53] withr_3.0.2 UCSC.utils_1.3.0
## [55] scales_1.3.0 rmarkdown_2.29
## [57] XVector_0.47.0 httr_1.4.7
## [59] matrixStats_1.4.1 evaluate_1.0.1
## [61] knitr_1.49 GenomicRanges_1.59.1
## [63] IRanges_2.41.1 doParallel_1.0.17
## [65] rlang_1.1.4 Rcpp_1.0.13-1
## [67] glue_1.8.0 BiocManager_1.30.25
## [69] BiocGenerics_0.53.3 jsonlite_1.8.9
## [71] R6_2.5.1 MatrixGenerics_1.19.0
## [73] zlibbioc_1.52.0
IsoBayes works directly with the output of MetaMorpheus, or Percolator (via the OpenMS toolkit). Additionally, users can also provide MS data obtained from any bioinformatics tool, as long as the input files follow the structure mentioned in the Input user-provided data Section IsoBayes works with the output of both MetaMorpheus (MM) (Solntsev et al. 2018), or Percolator (The et al. 2016) (via the OpenMS toolkit (Röst et al. 2016)). Additionally, users can also provide MS data obtained from any bioinformatics tool, as long as the input files follow the structure mentioned in the Input user-provided data Section.
Below, we provide example scripts for both OpenMS and Metamorpheus.
To generate the MM output files required to run IsoBayes, we need to execute the following commands:
conda install -c conda-forge metamorpheus
metamorpheus -t Task1-SearchTaskconfig.toml Task2-CalibrateTaskconfig.toml Task3-SearchTaskconfig.toml Task4-GPTMDTaskconfig.toml Task5-SearchTaskconfig.toml -s 04-30-13_CAST_Frac4_6uL.raw 04-30-13_CAST_Frac5_4uL.raw -d uniprot-mouse-reviewed-1-24-2018.xml.gz uniprot-cRAP-1-24-2018.xml.gz
or
metamorpheus -t Task1-SearchTaskconfig.toml Task2-CalibrateTaskconfig.toml Task3-SearchTaskconfig.toml Task4-GPTMDTaskconfig.toml Task5-SearchTaskconfig.toml -s mzML/04-30-13_CAST_Frac4_6uL.mzML mzML/04-30-13_CAST_Frac5_4uL.mzML -d uniprot-mouse-reviewed-1-24-2018.xml.gz uniprot-cRAP-1-24-2018.xml.gz
There are several ways to install and run MM. For more details see the MM tutorial, where you can also find the example files used here.
We provide a brief pipeline where several OpenMS applications are chained together to generate an idXML file required to run IsoBayes with Percolator output. The pipeline starts from peptide identification results stored in mzID files.
First, install OpenMS toolkit and Percolator tool. For instructions on how to install them on your operating system see OpenMS Installation and Percolator Installation.
Next, declare some useful global variable:
path_to_data=/path/to/mzIDfiles
path_out=/path/to/output
NTHREADS=4
ENZYME_indexer="Chymotrypsin"
ENZYME_percolator="chymotrypsin"
DECOY_STRING="mz|DECOY_"
fdr=1
Below, we show an example with chymotrypsin enzyme. If the data was
generated with another enzyme, please search for the corresponding
enzyme in the documentation below, and reset the global variables
ENZYME_indexer
and ENZYME_percolator
with the
correct enzyme.
PeptideIndexer --help
PercolatorAdapter --help
This pipeline also assumes that in the
/path/to/mzIDfiles
folder there is a fasta file listing
target and decoy protein isoforms. The DECOY_STRING
allows
you to change the string needed to identify a decoy in the fasta
file.
cd $path_out
# convert mzID files into idXML files
for mz in $path_to_data/*.mzID
do
IDFileConverter -in $mz -threads $NTHREADS -out $mz.idXML
done
# merge the files
IDMerger -in $path_to_data/*.idXML -threads $NTHREADS -merge_proteins_add_PSMs -out $path_out/merge.idXML
rm $path_to_data/*.idXML
# index the peptide file with the fasta file
PeptideIndexer -in $path_out/merge.idXML -enzyme:name $ENZYME_indexer -threads $NTHREADS -decoy_string_position prefix -decoy_string $DECOY_STRING -fasta $path_to_data/genecodeAndDecoy.fasta -out $path_out/merge_index.idXML
rm $path_out/merge.idXML
# run percolator
PercolatorAdapter -in $path_out/merge_index.idXML -enzyme $ENZYME_percolator -threads $NTHREADS -generic_feature_set -score_type pep -out $path_out/merge_index_percolator_pep.idXML
rm $path_out/merge_index.idXML
# Estimate the false discovery rate on peptide level using decoy searches and keep the ones with FDR < $fdr
FalseDiscoveryRate -in $path_out/merge_index_percolator_pep.idXML -out $path_out/merge_index_percolator_pep_$fdr.idXML -protein false -threads $NTHREADS -FDR:PSM $fdr -algorithm:add_decoy_peptides -algorithm:add_decoy_proteins
rm $path_out/merge_index_percolator_pep.idXML
# Associate each peptite with Posterior Error Probability score
IDScoreSwitcher -in $path_out/merge_index_percolator_pep_$fdr.idXML -out $path_out/merge_index_percolator_pep_switched_$fdr.idXML -new_score 'Posterior Error Probability_score' -new_score_orientation lower_better -new_score_type pep -threads $NTHREADS
rm $path_out/merge_index_percolator_pep_$fdr.idXML
For more details on OpenMS tools see its Documentation.