| Title: | Attribute-Weighted Aggregation |
|---|---|
| Description: | This package implements an attribute-weighted aggregation algorithm which leverages peptide-spectrum match (PSM) attributes to provide a more accurate estimate of protein abundance compared to conventional aggregation methods. This algorithm employs pre-trained random forest models to predict the quantitative inaccuracy of PSMs based on their attributes. PSMs are then aggregated to the protein level using a weighted average, taking the predicted inaccuracy into account. Additionally, the package allows users to construct their own training sets that are more relevant to their specific experimental conditions if desired. |
| Authors: | Jiahua Tan [aut, cre] (ORCID: <https://orcid.org/0000-0001-5839-1049>), Gian L. Negri [aut] (ORCID: <https://orcid.org/0000-0001-7722-8888>), Gregg B. Morin [aut] (ORCID: <https://orcid.org/0000-0001-8949-4374>), David D. Y. Chen [aut] (ORCID: <https://orcid.org/0000-0002-3669-6041>) |
| Maintainer: | Jiahua Tan <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.3.0 |
| Built: | 2026-05-29 08:38:55 UTC |
| Source: | https://github.com/bioc/AWAggregator |

Aggregates PSMs to the protein level using a random forest model.

    aggregateByAttributes(
        PSM,
        colOfReporterIonInt,
        ranger = NULL,
        predError = NULL,
        ratioCalc = FALSE
    )

| Argument | Description |
|---|---|
| PSM | A data frame containing all PSMs to be aggregated. |
| colOfReporterIonInt | A vector of column names representing reporter ion intensities across different channels. |
| ranger | The random forest model to be applied for aggregation. |
| predError | The predicted level of inaccuracy for the PSMs, obtained from external sources. Either the … |
| ratioCalc | A logical value indicating whether relative reporter intensities are calculated using the total reporter intensities across all channels. |

A data frame containing protein abundance estimates.

    library(AWAggregator)
    library(AWAggregatorData)
    data(sample.PSM.FP)
    regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)
    # Load sample names (Sample 1 ~ Sample 9)
    samples <- colnames(sample.PSM.FP)[grep('Sample', colnames(sample.PSM.FP))]
    groups <- samples
    df <- getPSMAttributes(
        PSM=sample.PSM.FP,
        fixedPTMs=c('229.1629', '57.0214'),
        colOfReporterIonInt=samples,
        groups=groups,
        setProgressBar=TRUE
    )
    aggregated_results <- aggregateByAttributes(
        PSM=df,
        colOfReporterIonInt=samples,
        ranger=regr,
        ratioCalc=FALSE
    )

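To make the idea of weighting PSMs by predicted inaccuracy concrete, the base R sketch below averages log2 reporter intensities per protein, with weights equal to the inverse of a predicted error. The helper `weightedAggregateSketch`, the toy column names, and the inverse-error weighting are all assumptions made for illustration; the weighting that `aggregateByAttributes` actually applies may differ.

```r
# Hypothetical helper (not part of AWAggregator): inverse-error weighted
# mean of log2 reporter intensities, computed per protein and channel.
weightedAggregateSketch <- function(psm, channels, errorCol = "predError",
                                    proteinCol = "Protein") {
  # Assumed weighting: PSMs with a lower predicted error get a larger weight.
  psm$weight <- 1 / psm[[errorCol]]
  rows <- lapply(split(psm, psm[[proteinCol]]), function(g) {
    vapply(channels, function(ch) {
      weighted.mean(log2(g[[ch]]), g$weight)
    }, numeric(1))
  })
  out <- do.call(rbind, rows)
  proteins <- rownames(out)
  rownames(out) <- NULL
  data.frame(Protein = proteins, out, check.names = FALSE)
}

# Toy data: two PSMs for protein P1, one PSM for protein P2.
toy <- data.frame(
  Protein = c("P1", "P1", "P2"),
  S1 = c(2, 8, 4),
  S2 = c(4, 4, 16),
  predError = c(1, 3, 1)
)
res <- weightedAggregateSketch(toy, channels = c("S1", "S2"))
print(res)
```

In the toy data, the second PSM of P1 has three times the predicted error of the first, so it contributes one third of the weight to the protein-level estimate.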
Converts output from Proteome Discoverer into the input format required by AWAggregator.

    convertPDFormat(PSM, protein, colOfReporterIonInt)

| Argument | Description |
|---|---|
| PSM | A data frame containing the PSM table from Proteome Discoverer to be converted. |
| protein | A data frame containing the corresponding protein table from Proteome Discoverer. |
| colOfReporterIonInt | A vector of column names for reporter ion intensities across different channels. |

A data frame in the format required by AWAggregator.

    library(AWAggregator)
    data(sample.PSM.PD)
    data(sample.prot.PD)
    # Load sample names (Sample 1 ~ Sample 9)
    samples <- colnames(sample.PSM.PD)[grep('Sample', colnames(sample.PSM.PD))]
    df <- convertPDFormat(
        PSM=sample.PSM.PD,
        protein=sample.prot.PD,
        colOfReporterIonInt=samples
    )

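The examples in this manual select reporter ion columns by matching a pattern against the column names. The base R sketch below shows that selection on a small made-up table; the table and its column names (including `SampleBatch`) are hypothetical and only illustrate why an anchored pattern can be safer than a loose one.

```r
# Hypothetical PSM table; only the column names matter here.
psm <- data.frame(
  Sequence = c("PEPTIDEK", "ELVISK"),
  `Sample 1` = c(10, 20),
  `Sample 2` = c(30, 40),
  SampleBatch = c("a", "b"),
  check.names = FALSE
)

# Same result as colnames(psm)[grep('Sample', colnames(psm))]
loose <- grep("Sample", colnames(psm), value = TRUE)

# An anchored pattern skips unrelated columns that merely contain 'Sample'.
strict <- grep("^Sample [0-9]+$", colnames(psm), value = TRUE)

print(loose)
print(strict)
```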
Trains a random forest model to predict the level of quantitative inaccuracy of PSMs.

    fitQuantInaccuracyModel(
        PSM,
        numTrees = 500,
        useAvgCV = TRUE,
        importance = FALSE,
        seed,
        appliedAttributes = c(NA)
    )

| Argument | Description |
|---|---|
| PSM | A data frame containing all PSMs used for training. |
| numTrees | The number of trees to include in the random forest model. |
| useAvgCV | A logical value indicating whether to include the average CV attribute in the training. This parameter is ignored if … |
| importance | A logical value indicating whether to assess the importance of attributes. |
| seed | An integer seed for random number generation in the random forest. |
| appliedAttributes | A vector of attribute names to be used in training, replacing the default attributes. |

A trained random forest model.

    library(AWAggregator)
    library(ExperimentHub)
    library(stringr)
    eh <- ExperimentHub()
    benchmarkSet3 <- eh[['EH9639']]
    # Load sample names (Sample 'H1+Y0_1' ~ Sample 'H1+Y10_2')
    samples <- colnames(benchmarkSet3)[
        grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
    ]
    groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]
    PSM <- getPSMAttributes(
        PSM=benchmarkSet3,
        # TMTpro tag (304.2071) and N-ethylmaleimide (125.0476) are applied as
        # fixed PTMs
        fixedPTMs=c('304.2071', '125.0476'),
        colOfReporterIonInt=samples,
        groups=groups
    )
    PSM <- getAvgScaledErrorOfLog2FC(
        PSM=PSM,
        colOfReporterIonInt=samples,
        groups=groups,
        expectedRelativeAbundance=list(
            `H1+Y0`=0, `H1+Y1`=1, `H1+Y5`=5, `H1+Y10`=10
        ),
        speciesAtConstLevel='HUMAN'
    )
    regr <- fitQuantInaccuracyModel(PSM, useAvgCV=TRUE, seed=3979)

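For readers unfamiliar with random forest regression, the sketch below fits a small model with the `ranger` package on simulated data and predicts an error value for a few rows. It assumes, without confirmation from this manual, that `numTrees`, `importance`, and `seed` play roles similar to `num.trees`, `importance`, and `seed` in `ranger::ranger()`. The attribute names `attrA` and `attrB` are hypothetical, and the block is skipped if `ranger` is not installed.

```r
# Simulated training data: the target depends on attrA only.
set.seed(1)
train <- data.frame(
  attrA = runif(200),
  attrB = runif(200)
)
train$error <- 2 * train$attrA + rnorm(200, sd = 0.1)

if (requireNamespace("ranger", quietly = TRUE)) {
  # Regression forest; permutation importance is one way to rank attributes.
  fit <- ranger::ranger(
    error ~ .,
    data = train,
    num.trees = 500,
    importance = "permutation",
    seed = 3979
  )
  # Predicted error for the first five rows.
  predicted <- predict(fit, data = train[1:5, ])$predictions
  print(predicted)
  print(fit$variable.importance)
}
```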
Calculates the Average Scaled Error of log2FC values required for training sets.

    getAvgScaledErrorOfLog2FC(
        PSM,
        colOfReporterIonInt,
        groups,
        expectedRelativeAbundance,
        speciesAtConstLevel
    )

| Argument | Description |
|---|---|
| PSM | A data frame containing all PSMs used for training. |
| colOfReporterIonInt | A vector of column names for reporter ion intensities across different channels. |
| groups | A vector specifying sample groups. |
| expectedRelativeAbundance | A named list where group names are keys and the corresponding expected relative abundance values for species at varying concentrations are provided as values. Unknown ratios can be designated as NA. |
| speciesAtConstLevel | A string specifying the species that are spiked in at a constant level. |

A data frame containing PSMs with Average Scaled Error of log2FC values required for the random forest model.

    library(AWAggregator)
    library(ExperimentHub)
    library(stringr)
    eh <- ExperimentHub()
    benchmarkSet1 <- eh[['EH9637']]
    # Load sample names (Sample 'H1+E1_1' ~ Sample 'H1+E6_3')
    samples <- colnames(benchmarkSet1)[
        grep('H1[+]E[0-9]+_[1-4]', colnames(benchmarkSet1))
    ]
    groups <- str_match(samples, 'H1[+]E[0-9]+')[, 1]
    PSM <- getAvgScaledErrorOfLog2FC(
        PSM=benchmarkSet1,
        colOfReporterIonInt=samples,
        groups=groups,
        expectedRelativeAbundance=list(`H1+E1`=1, `H1+E2`=2, `H1+E6`=NA),
        speciesAtConstLevel='HUMAN'
    )

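The sketch below illustrates the general idea of comparing observed and expected log2 fold changes for a single PSM. It is a simplification under stated assumptions: the helper `avgAbsLog2FCErrorSketch` is hypothetical, it averages plain absolute errors over group pairs, and it omits whatever scaling the package applies. Groups with an expected abundance of NA or 0 are skipped here because their log2 ratio is undefined.

```r
# Hypothetical helper (not the package's definition): mean absolute
# difference between observed and expected log2 fold changes for one PSM.
avgAbsLog2FCErrorSketch <- function(intensities, groups, expected) {
  expected <- unlist(expected)
  groupMeans <- vapply(split(intensities, groups), mean, numeric(1))
  # Keep groups with a known, non-zero expected abundance that are present.
  usable <- names(expected)[!is.na(expected) & expected > 0]
  usable <- intersect(usable, names(groupMeans))
  stopifnot(length(usable) >= 2)
  pairs <- combn(usable, 2)
  errs <- apply(pairs, 2, function(p) {
    observed <- log2(groupMeans[[p[2]]] / groupMeans[[p[1]]])
    wanted <- log2(expected[[p[2]]] / expected[[p[1]]])
    abs(observed - wanted)
  })
  mean(errs)
}

# Toy PSM: group B is expected at 5x group A but is observed at 2.5x.
intensities <- c(100, 100, 250, 250)
groups <- c("A", "A", "B", "B")
err <- avgAbsLog2FCErrorSketch(intensities, groups, list(A = 1, B = 5, C = NA))
print(err)
```

A ratio compressed from 5x to 2.5x is off by exactly one log2 unit, which is the value the toy example yields.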
Calculates the distance metric for PSMs. The distance metric reflects whether the quantified ratio for each pair of samples in a PSM diverges from that of other PSMs in the same redundant or unique group. The redundant group, unique group, and distance metric were originally defined in the iPQF method. Please refer to "iPQF: a new peptide-to-protein summarization method using peptide spectra characteristics to improve protein quantification" for more details.

    getDistMetric(PSM, channel, setProgressBar = TRUE)

| Argument | Description |
|---|---|
| PSM | A data frame containing the PSMs for which distance metrics are to be calculated. |
| channel | A vector specifying the channels used for calculating the distance metric. |
| setProgressBar | A logical value indicating whether to display a progress bar. |

A vector of distance metrics for the specified PSMs.
Martina Fischer, Bernhard Y. Renard (2016). iPQF: A New Peptide-to-Protein Summarization Method Using Peptide Spectra Characteristics to Improve Protein Quantification. Bioinformatics, 32(7), 1040-1047.

    library(AWAggregator)
    library(ExperimentHub)
    eh <- ExperimentHub()
    benchmarkSet3 <- eh[['EH9639']]
    # Load sample names (Sample 'H1+Y0_1' ~ Sample 'H1+Y10_2')
    samples <- colnames(benchmarkSet3)[
        grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
    ]
    df <- getDistMetric(
        PSM=benchmarkSet3,
        channel=samples,
        setProgressBar=TRUE
    )

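To give a feel for what "diverging from other PSMs in the same group" can mean, the sketch below computes, for each PSM, the mean absolute deviation of its pairwise log2 channel ratios from the group median. This is a deliberately simplified stand-in and not the iPQF formula or the computation in `getDistMetric`; the helper name `groupDivergenceSketch` and the toy intensities are hypothetical.

```r
# Hypothetical helper: rows are PSMs of one group, columns are channels.
groupDivergenceSketch <- function(intMat) {
  logInt <- log2(intMat)
  # All channel pairs, one column per pair.
  pairs <- combn(ncol(intMat), 2)
  # log2 ratio of every channel pair for every PSM.
  logRatios <- logInt[, pairs[1, ], drop = FALSE] -
    logInt[, pairs[2, ], drop = FALSE]
  # Deviation of each PSM's ratios from the group median ratio.
  groupMedian <- apply(logRatios, 2, median)
  deviation <- sweep(logRatios, 2, groupMedian)
  rowMeans(abs(deviation))
}

# Toy group: the third PSM has an inflated second channel.
intMat <- matrix(
  c(100, 200, 400,
    100, 200, 400,
    100, 800, 400),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("S1", "S2", "S3"))
)
divergence <- groupDivergenceSketch(intMat)
print(divergence)
```

The first two PSMs agree with the group median and score zero, while the third stands out.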
Retrieves attributes required for training or test sets.

    getPSMAttributes(
        PSM,
        fixedPTMs,
        colOfReporterIonInt,
        groups,
        groupsExcludedFromCV = NA,
        setProgressBar = TRUE
    )

| Argument | Description |
|---|---|
| PSM | A data frame containing all PSMs used for training. |
| fixedPTMs | A numeric vector with the masses of fixed post-translational modifications (PTMs) in Da. Other PTMs will be treated as variable PTMs. |
| colOfReporterIonInt | A vector of column names for reporter ion intensities across different channels. |
| groups | A vector specifying sample groups. |
| groupsExcludedFromCV | A vector of sample groups excluded from average CV calculations, which may occur due to a zero spike-in concentration for a species. |
| setProgressBar | A logical value indicating whether to display a progress bar. |

A data frame containing the PSM table with attributes required for the random forest model.

    library(AWAggregator)
    library(ExperimentHub)
    library(stringr)
    eh <- ExperimentHub()
    benchmarkSet3 <- eh[['EH9639']]
    # Load sample names (Sample 'H1+Y0_1' ~ Sample 'H1+Y10_2')
    samples <- colnames(benchmarkSet3)[
        grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
    ]
    groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]
    PSM <- getPSMAttributes(
        PSM=benchmarkSet3,
        fixedPTMs=c('304.2071', '125.0476'),
        colOfReporterIonInt=samples,
        groups=groups,
        groupsExcludedFromCV='H1+Y0'
    )

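The role of `groups` and `groupsExcludedFromCV` can be illustrated with a small base R sketch that derives group labels from sample names and averages the per-group coefficient of variation (CV) for one PSM. The helper `avgCVSketch` and the toy intensities are hypothetical, and the package may compute its average CV attribute differently.

```r
# Hypothetical helper: mean of per-group CVs (sd / mean) for one PSM,
# skipping any groups listed in 'excluded'.
avgCVSketch <- function(intensities, groups, excluded = NA) {
  keep <- !(groups %in% excluded)
  cvs <- vapply(
    split(intensities[keep], groups[keep]),
    function(x) sd(x) / mean(x),
    numeric(1)
  )
  mean(cvs)
}

# Derive group labels by dropping the replicate suffix from sample names.
samples <- c("H1+Y0_1", "H1+Y0_2", "H1+Y1_1", "H1+Y1_2", "H1+Y5_1", "H1+Y5_2")
groups <- sub("_[0-9]+$", "", samples)

# Toy reporter intensities for a single PSM. The zero spike-in group has
# near-zero signal, so its CV would be unstable.
intensities <- c(0, 5, 10, 10, 90, 110)
cv <- avgCVSketch(intensities, groups, excluded = "H1+Y0")
print(groups)
print(cv)
```

Excluding `H1+Y0` keeps the noisy near-zero channels from dominating the average.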
Extracts a similar number of PSMs from each input dataset and merges them into a single training set.

    mergeTrainingSets(PSMList, numPSMs, setProgressBar = TRUE)

| Argument | Description |
|---|---|
| PSMList | A named list where dataset names are keys and the corresponding data frames of PSMs used for training are values. |
| numPSMs | The minimum number of PSMs to extract from each dataset for merging. |
| setProgressBar | A logical value indicating whether to display a progress bar. |

A data frame containing the merged PSM table from a subset of each input dataset.

    library(AWAggregator)
    library(ExperimentHub)
    eh <- ExperimentHub()
    benchmarkSet1 <- eh[['EH9637']]
    benchmarkSet2 <- eh[['EH9638']]
    benchmarkSet3 <- eh[['EH9639']]
    PSM <- mergeTrainingSets(
        PSMList=list(
            `Benchmark Set 1`=benchmarkSet1,
            `Benchmark Set 2`=benchmarkSet2,
            `Benchmark Set 3`=benchmarkSet3
        ),
        numPSMs=min(
            nrow(benchmarkSet1), nrow(benchmarkSet2), nrow(benchmarkSet3)
        )
    )

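A balanced merge of several datasets can be sketched in base R by drawing the same number of rows from each data frame and stacking the results. The helper `mergeBySamplingSketch` and the `Dataset` column are hypothetical; simple row-level random sampling is an assumption, and `mergeTrainingSets` may select PSMs in a different way (the manual only states that it extracts a similar number from each dataset).

```r
# Hypothetical helper: sample up to n rows from each data frame in a named
# list, tag each row with its dataset name, and stack the results.
mergeBySamplingSketch <- function(dfList, n, seed = 1) {
  set.seed(seed)
  parts <- lapply(names(dfList), function(nm) {
    df <- dfList[[nm]]
    picked <- df[sample(nrow(df), min(n, nrow(df))), , drop = FALSE]
    picked$Dataset <- nm
    picked
  })
  merged <- do.call(rbind, parts)
  rownames(merged) <- NULL
  merged
}

# Toy datasets of different sizes with identical columns.
setA <- data.frame(id = 1:5, intensity = c(10, 20, 30, 40, 50))
setB <- data.frame(id = 1:3, intensity = c(5, 15, 25))
merged <- mergeBySamplingSketch(
  list(A = setA, B = setB),
  n = min(nrow(setA), nrow(setB))
)
print(table(merged$Dataset))
```

Using the size of the smallest dataset as `n`, as the example above does with `numPSMs`, keeps any single dataset from dominating the training set.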
This data frame represents sample proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the search results of Proteome Discoverer. Columns unnecessary for AWAggregator have been removed from the sample data.

    data(sample.prot.PD)

A data frame with 5 rows and 2 variables:
The unique identifier assigned to the protein
The name of the protein exclusive of the identifier that appears in the Accession column
This data frame represents sample peptide-spectrum matches (PSMs) mapped to the proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the search results of FragPipe. Columns unnecessary for AWAggregator have been removed from the sample data.

    data(sample.PSM.FP)

A data frame with 118 rows and 20 variables:
Peptide amino acid sequence
The charge state of the identified peptide
Mass of the identified peptide after m/z calibration (in Da)
Mass-to-charge ratio of the peptide ion after m/z calibration
Difference between calibrated observed peptide mass and theoretical peptide mass (in Da)
The similarity score between observed and theoretical spectra
Number of potential enzymatic cleavage sites within the identified sequence
Raw integrated precursor abundance for each PSM
Post-translational modifications within the identified sequence
The proportion of total ion abundance in the inclusion window from the precursor
Protein sequence header corresponding to the identified peptide sequence
Processed reporter ion intensities from sample 1 to 9
This data frame represents sample peptide-spectrum matches (PSMs) mapped to the proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the search results of Proteome Discoverer. Columns unnecessary for AWAggregator have been removed from the sample data.

    data(sample.PSM.PD)

A data frame with 128 rows and 21 variables:
The names of the flanking residues of a peptide in a protein
The static and dynamic modifications identified in the peptide
The number of mapped proteins
A description of the master proteins
The number of potential enzymatic cleavage sites within the identified sequence
The charge state of the peptide
The mass-to-charge ratio of the precursor ion, in daltons
The measured protonated monoisotopic mass of the peptides, in daltons
The difference between the measured charged mass (m/z in Da) and the theoretical mass of the same charge (z)
The percentage of interference by co-isolation within the precursor isolation window
The average reporter S/N values
Scores the number of fragment ions that are common to two different peptides with the same precursor mass and calculates the cross-correlation score for all candidate peptides queried from the database
Processed reporter ion intensities from sample 1 to 9