Package 'AWAggregator' reference manual

Title:	Attribute-Weighted Aggregation
Description:	This package implements an attribute-weighted aggregation algorithm which leverages peptide-spectrum match (PSM) attributes to provide a more accurate estimate of protein abundance compared to conventional aggregation methods. This algorithm employs pre-trained random forest models to predict the quantitative inaccuracy of PSMs based on their attributes. PSMs are then aggregated to the protein level using a weighted average, taking the predicted inaccuracy into account. Additionally, the package allows users to construct their own training sets that are more relevant to their specific experimental conditions if desired.
Authors:	Jiahua Tan [aut, cre] (ORCID: <https://orcid.org/0000-0001-5839-1049>), Gian L. Negri [aut] (ORCID: <https://orcid.org/0000-0001-7722-8888>), Gregg B. Morin [aut] (ORCID: <https://orcid.org/0000-0001-8949-4374>), David D. Y. Chen [aut] (ORCID: <https://orcid.org/0000-0002-3669-6041>)
Maintainer:	Jiahua Tan <[email protected]>
License:	MIT + file LICENSE
Version:	1.3.0
Built:	2026-07-05 00:06:54 UTC
Source:	https://github.com/bioc/AWAggregator

Aggregate PSMs to Protein Level

Description

Aggregates PSMs using a random forest model.

Usage

aggregateByAttributes(
  PSM,
  colOfReporterIonInt,
  ranger = NULL,
  predError = NULL,
  ratioCalc = FALSE
)
aggregateByAttributes(
  PSM,
  colOfReporterIonInt,
  ranger = NULL,
  predError = NULL,
  ratioCalc = FALSE
)

Arguments

PSM

A data frame containing all PSMs to be aggregated.

colOfReporterIonInt

A vector of column names representing reporter ion intensities across different channels.

ranger

The random forest model to be applied for aggregation.

predError

The predicted level of inaccuracy for the PSMs, obtained from external sources. Either the ranger model or predError must be specified.

ratioCalc

A logical value indicating whether relative reporter intensities are calculated using the total reporter intensities across all channels.

Value

A data frame containing protein abundance estimates.

Examples

library(AWAggregatorData)
data(sample.PSM.FP)
regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)
# Load sample names (Sample 1 ~ Sample 9)
samples <- colnames(sample.PSM.FP)[grep('Sample', colnames(sample.PSM.FP))]
groups <- samples
df <- getPSMAttributes(
    PSM=sample.PSM.FP,
    fixedPTMs=c('229.1629', '57.0214'),
    colOfReporterIonInt=samples,
    groups=groups,
    setProgressBar=TRUE
)
aggregated_results <- aggregateByAttributes(
    PSM=df,
    colOfReporterIonInt=samples,
    ranger=regr,
    ratioCalc=FALSE
)
library(AWAggregatorData)
data(sample.PSM.FP)
regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)
# Load sample names (Sample 1 ~ Sample 9)
samples <- colnames(sample.PSM.FP)[grep('Sample', colnames(sample.PSM.FP))]
groups <- samples
df <- getPSMAttributes(
    PSM=sample.PSM.FP,
    fixedPTMs=c('229.1629', '57.0214'),
    colOfReporterIonInt=samples,
    groups=groups,
    setProgressBar=TRUE
)
aggregated_results <- aggregateByAttributes(
    PSM=df,
    colOfReporterIonInt=samples,
    ranger=regr,
    ratioCalc=FALSE
)

Convert Proteome Discoverer Output to Required Input Format

Description

Converts output from Proteome Discoverer into the input format required by AWAggregator.

Usage

convertPDFormat(PSM, protein, colOfReporterIonInt)
convertPDFormat(PSM, protein, colOfReporterIonInt)

Arguments

PSM

A data frame containing the PSM table from Proteome Discoverer to be converted.

protein

A data frame containing the corresponding protein table from Proteome Discoverer.

colOfReporterIonInt

A vector of column names for reporter ion intensities across different channels.

Value

A data frame in the format required by AWAggregator.

Examples

data(sample.PSM.PD)
data(sample.prot.PD)
# Load sample names (Sample 1 ~ Sample 9)
samples <- colnames(sample.PSM.PD)[grep('Sample', colnames(sample.PSM.PD))]
df <- convertPDFormat(
    PSM=sample.PSM.PD,
    protein=sample.prot.PD,
    colOfReporterIonInt=samples
)
data(sample.PSM.PD)
data(sample.prot.PD)
# Load sample names (Sample 1 ~ Sample 9)
samples <- colnames(sample.PSM.PD)[grep('Sample', colnames(sample.PSM.PD))]
df <- convertPDFormat(
    PSM=sample.PSM.PD,
    protein=sample.prot.PD,
    colOfReporterIonInt=samples
)

Fit a Random Forest Model

Description

Trains a random forest model to predict the level of quantitative inaccuracy of PSMs.

Usage

fitQuantInaccuracyModel(
  PSM,
  numTrees = 500,
  useAvgCV = TRUE,
  importance = FALSE,
  seed,
  appliedAttributes = c(NA)
)
fitQuantInaccuracyModel(
  PSM,
  numTrees = 500,
  useAvgCV = TRUE,
  importance = FALSE,
  seed,
  appliedAttributes = c(NA)
)

Arguments

PSM

A data frame containing all PSMs used for training.

numTrees

The number of trees to include in the random forest model.

useAvgCV

A logical value indicating whether to include the average CV attribute in the training. This parameter is ignored if appliedAttributes is specified.

importance

A logical value indicating whether to assess the importance of attributes.

seed

An integer seed for random number generation in the random forest.

appliedAttributes

A vector of attribute names to be used in training, replacing the default attributes.

Value

A trained random forest model.

Examples

library(ExperimentHub)
library(stringr)
eh <- ExperimentHub()
benchmarkSet3 <- eh[['EH9639']]
# Load sample names (Sample 'H1+Y0_1' ~ Sample 'H1+Y10_2')
samples <- colnames(benchmarkSet3)[
    grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]
PSM <- getPSMAttributes(
    PSM=benchmarkSet3,
    # TMTpro tag (304.2071) and N-ethylmaleimide (125.0476) are applied as
    # fixed PTMs
    fixedPTM=c('304.2071', '125.0476'),
    colOfReporterIonInt=samples,
    groups=groups
)
PSM <- getAvgScaledErrorOfLog2FC(
    PSM=PSM,
    colOfReporterIonInt=samples,
    groups=groups,
    expectedRelativeAbundance=list(
        `H1+Y0`=0, `H1+Y1`=1, `H1+Y5`=5, `H1+Y10`=10
    ),
    speciesAtConstLevel='HUMAN'
)
regr <- fitQuantInaccuracyModel(PSM, useAvgCV=TRUE, seed=3979)
library(ExperimentHub)
library(stringr)
eh <- ExperimentHub()
benchmarkSet3 <- eh[['EH9639']]
# Load sample names (Sample 'H1+Y0_1' ~ Sample 'H1+Y10_2')
samples <- colnames(benchmarkSet3)[
    grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]
PSM <- getPSMAttributes(
    PSM=benchmarkSet3,
    # TMTpro tag (304.2071) and N-ethylmaleimide (125.0476) are applied as
    # fixed PTMs
    fixedPTM=c('304.2071', '125.0476'),
    colOfReporterIonInt=samples,
    groups=groups
)
PSM <- getAvgScaledErrorOfLog2FC(
    PSM=PSM,
    colOfReporterIonInt=samples,
    groups=groups,
    expectedRelativeAbundance=list(
        `H1+Y0`=0, `H1+Y1`=1, `H1+Y5`=5, `H1+Y10`=10
    ),
    speciesAtConstLevel='HUMAN'
)
regr <- fitQuantInaccuracyModel(PSM, useAvgCV=TRUE, seed=3979)

Get Average Scaled Error of log2FC for PSMs in a Spike-in Dataset

Description

Calculates the Average Scaled Error of log2FC values required for training sets.

Usage

getAvgScaledErrorOfLog2FC(
  PSM,
  colOfReporterIonInt,
  groups,
  expectedRelativeAbundance,
  speciesAtConstLevel
)
getAvgScaledErrorOfLog2FC(
  PSM,
  colOfReporterIonInt,
  groups,
  expectedRelativeAbundance,
  speciesAtConstLevel
)

Arguments

PSM

A data frame containing all PSMs used for training.

colOfReporterIonInt

A vector of column names for reporter ion intensities across different channels.

groups

A vector specifying sample groups.

expectedRelativeAbundance

A named list where group names are keys and the corresponding expected relative abundance values for species at varying concentrations are provided as values. Unknown ratios can be designated as NA.

speciesAtConstLevel

A string specifying the species that are spiked in at a constant level.

Value

A data frame containing PSMs with Average Scaled Error of log2FC values required for the random forest model.

Examples

library(ExperimentHub)
library(stringr)
eh <- ExperimentHub()
benchmarkSet1 <- eh[['EH9637']]
# Load sample names (Sample 'H1+E1_1' ~ Sample 'H1+E6_3')
samples <- colnames(benchmarkSet1)[
    grep('H1[+]E[0-9]+_[1-4]', colnames(benchmarkSet1))
]
groups <- str_match(samples, 'H1[+]E[0-9]+')[, 1]
PSM <- getAvgScaledErrorOfLog2FC(
    PSM=benchmarkSet1,
    colOfReporterIonInt=samples,
    groups=groups,
    expectedRelativeAbundance=list(`H1+E1`=1, `H1+E2`=2, `H1+E6`=NA),
    speciesAtConstLevel='HUMAN'
)
library(ExperimentHub)
library(stringr)
eh <- ExperimentHub()
benchmarkSet1 <- eh[['EH9637']]
# Load sample names (Sample 'H1+E1_1' ~ Sample 'H1+E6_3')
samples <- colnames(benchmarkSet1)[
    grep('H1[+]E[0-9]+_[1-4]', colnames(benchmarkSet1))
]
groups <- str_match(samples, 'H1[+]E[0-9]+')[, 1]
PSM <- getAvgScaledErrorOfLog2FC(
    PSM=benchmarkSet1,
    colOfReporterIonInt=samples,
    groups=groups,
    expectedRelativeAbundance=list(`H1+E1`=1, `H1+E2`=2, `H1+E6`=NA),
    speciesAtConstLevel='HUMAN'
)

Calculates the distance metric for PSMs. Distance metric reflects on whether the quantified ratio of each pair of samples of a PSM diverges from other PSMs in the same redundant/unique group. Redundant group, unique group and distance metric were originally defined in the iPQF method. Please refer to "iPQF: a new peptide-to-protein summarization method using peptide spectra characteristics to improve protein quantification" for more details.

Usage

getDistMetric(PSM, channel, setProgressBar = TRUE)
getDistMetric(PSM, channel, setProgressBar = TRUE)

Arguments

PSM

A data frame containing the PSMs for which distance metrics are to be calculated.

channel

A vector specifying the channels used for calculating the distance metric.

setProgressBar

A logical value indicating whether to display a progress bar.

Value

A vector of distance metrics for the specified PSMs.

References

Martina Fischer, Bernhard Y. Renard (2016). iPQF: A New Peptide-to-Protein Summarization Method Using Peptide Spectra Characteristics to Improve Protein Quantification. Bioinformatics, 32(7), 1040-1047.

Examples

library(ExperimentHub)
eh <- ExperimentHub()
benchmarkSet3 <- eh[['EH9639']]
# Load sample names (Sample 'H1+Y0_1' ~ Sample 'H1+Y10_2')
samples <- colnames(benchmarkSet3)[
    grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
]
df <- getDistMetric(
    PSM=benchmarkSet3,
    channel=samples,
    setProgressBar=TRUE
)
library(ExperimentHub)
eh <- ExperimentHub()
benchmarkSet3 <- eh[['EH9639']]
# Load sample names (Sample 'H1+Y0_1' ~ Sample 'H1+Y10_2')
samples <- colnames(benchmarkSet3)[
    grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
]
df <- getDistMetric(
    PSM=benchmarkSet3,
    channel=samples,
    setProgressBar=TRUE
)

Get attributes for PSMs

Description

Retrieves attributes required for training or test sets.

Usage

getPSMAttributes(
  PSM,
  fixedPTMs,
  colOfReporterIonInt,
  groups,
  groupsExcludedFromCV = NA,
  setProgressBar = TRUE
)
getPSMAttributes(
  PSM,
  fixedPTMs,
  colOfReporterIonInt,
  groups,
  groupsExcludedFromCV = NA,
  setProgressBar = TRUE
)

Arguments

PSM

A data frame containing all PSMs used for training.

fixedPTMs

A numeric vector with the masses of fixed post-translational modifications (PTMs) in Da. Other PTMs will be treated as variable PTMs.

colOfReporterIonInt

A vector of column names for reporter ion intensities across different channels.

groups

A vector specifying sample groups.

groupsExcludedFromCV

A vector of sample groups excluded from average CV calculations, which may occur due to a zero spike-in concentration for a species.

setProgressBar

A logical value indicating whether to display a progress bar.

Value

A data frame containing the PSM table with attributes required for the random forest model.

Examples

library(ExperimentHub)
library(stringr)
eh <- ExperimentHub()
benchmarkSet3 <- eh[['EH9639']]
# Load sample names (Sample 'H1+Y0_1' ~ Sample 'H1+Y10_2')
samples <- colnames(benchmarkSet3)[
    grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]
PSM <- getPSMAttributes(
    PSM=benchmarkSet3,
    fixedPTM=c('304.2071', '125.0476'),
    colOfReporterIonInt=samples,
    groups=groups,
    groupsExcludedFromCV='H1+Y0'
)
library(ExperimentHub)
library(stringr)
eh <- ExperimentHub()
benchmarkSet3 <- eh[['EH9639']]
# Load sample names (Sample 'H1+Y0_1' ~ Sample 'H1+Y10_2')
samples <- colnames(benchmarkSet3)[
    grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]
PSM <- getPSMAttributes(
    PSM=benchmarkSet3,
    fixedPTM=c('304.2071', '125.0476'),
    colOfReporterIonInt=samples,
    groups=groups,
    groupsExcludedFromCV='H1+Y0'
)

Merge Training Sets

Description

Extracts a similar number of PSMs from each input dataset and merges them into a single training set.

Usage

mergeTrainingSets(PSMList, numPSMs, setProgressBar = TRUE)
mergeTrainingSets(PSMList, numPSMs, setProgressBar = TRUE)

Arguments

PSMList

A named list where dataset names are keys and the corresponding data frames of PSMs used for training are values.

numPSMs

The minimum number of PSMs to extract from each dataset for merging.

setProgressBar

A logical value indicating whether to display a progress bar.

Value

A data frame containing the merged PSM table from a subset of each input dataset.

Examples

library(ExperimentHub)
eh <- ExperimentHub()
benchmarkSet1 <- eh[['EH9637']]
benchmarkSet2 <- eh[['EH9638']]
benchmarkSet3 <- eh[['EH9639']]
PSM <- mergeTrainingSets(
    PSMList=list(
        `Benchmark Set 1`=benchmarkSet1,
        `Benchmark Set 2`=benchmarkSet2,
        `Benchmark Set 3`=benchmarkSet3
    ),
    numPSMs=min(
        nrow(benchmarkSet1), nrow(benchmarkSet2), nrow(benchmarkSet3)
    ),
)
library(ExperimentHub)
eh <- ExperimentHub()
benchmarkSet1 <- eh[['EH9637']]
benchmarkSet2 <- eh[['EH9638']]
benchmarkSet3 <- eh[['EH9639']]
PSM <- mergeTrainingSets(
    PSMList=list(
        `Benchmark Set 1`=benchmarkSet1,
        `Benchmark Set 2`=benchmarkSet2,
        `Benchmark Set 3`=benchmarkSet3
    ),
    numPSMs=min(
        nrow(benchmarkSet1), nrow(benchmarkSet2), nrow(benchmarkSet3)
    ),
)

Sample Protein Data from Protein Discoverer

Description

This data frame represents sample proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the search results of Proteome Discoverer. Columns unnecessary for the AWAggregator have been removed from the sample data.

Usage

data(sample.prot.PD)
data(sample.prot.PD)

Format

A data frame with 5 rows and 2 variables:

Accession: The unique identifier assigned to the protein
Description: The name of the protein exclusive of the identifier that appears in the Accession column

Sample PSM Data from FragPipe

Description

This data frame represents sample peptide spectrum matches (PSMs) mapped to the proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the search results of FragPipe. Columns unnecessary for the AWAggregator have been removed from the sample data.

Usage

data(sample.PSM.FP)
data(sample.PSM.FP)

Format

A data frame with 118 rows and 20 variables:

Peptide: Peptide amino acid sequence
Charge: The charge state of the identified peptide
Calibrated Observed Mass: Mass of the identified peptide after m/z calibration (in Da)
Calibrated Observed M/Z: Mass-to-charge ratio of the peptide ion after m/z calibration
Delta Mass: Difference between calibrated observed peptide mass and theoretical peptide mass (in Da)
Hyperscore: The similarity score between observed and theoretical spectra
Number of Missed Cleavages: Number of potential enzymatic cleavage sites within the identified sequence
Intensity: Raw integrated precursor abundance for each PSM
Assigned Modifications: Post-translational modifications within the identified sequence
Purity: The proportion of total ion abundance in the inclusion window from the precursor
Protein: Protein sequence header corresponding to the identified peptide sequence
Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, Sample 6, Sample 7, Sample 8, Sample 9: Processed reporter ion intensities from sample 1 to 9

Sample PSM Data from Proteome Discoverer

Description

This data frame represents sample peptide spectrum matches (PSMs) mapped to the proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the search results of Proteome Discoverer. Columns unnecessary for the AWAggregator have been removed from the sample data.

Usage

data(sample.PSM.PD)
data(sample.PSM.PD)

Format

A data frame with 128 rows and 21 variables:

Annotated Sequence: The names of the flanking residues of a peptide in a protein
Modifications: The static and dynamic modifications identified in the peptide
Number of Proteins: The number of mapped proteins
Master Protein Accessions: A description of the master proteins
Number of Missed Cleavages: The number of potential enzymatic cleavage sites within the identified sequence
Charge: The charge state of the peptide
mz in Da: The mass-to-charge ratio of the precursor ion, in daltons
MHplus in Da: The measured protonated monoisotopic mass of the peptides, in daltons
Delta mz in Da: The difference between the measured charged mass (m/z in Da) and the theoretical mass of the same charge (z)
Isolation Interference in Percent: The percentage of interference by co-isolation within the precursor isolation window
Average Reporter SN: The average reporter S/N values
XCorr: Scores the number of fragment ions that are common to two different peptides with the same precursor mass and calculates the cross-correlation score for all candidate peptides queried from the database
Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, Sample 6, Sample 7, Sample 8, Sample 9: Processed reporter ion intensities from sample 1 to 9

Package 'AWAggregator'

Help Index

Aggregate PSMs to Protein Level

Description

Usage

Arguments

Value

Examples

Convert Proteome Discoverer Output to Required Input Format

Description

Usage

Arguments

Value

Examples

Fit a Random Forest Model

Description

Usage

Arguments

Value

Examples

Get Average Scaled Error of log2FC for PSMs in a Spike-in Dataset

Description

Usage

Arguments

Value

Examples

Get Distance Metric

Description

Usage

Arguments

Value

References

Examples

Get attributes for PSMs

Description

Usage

Arguments

Value

Examples

Merge Training Sets

Description

Usage

Arguments

Value

Examples

Sample Protein Data from Protein Discoverer

Description

Usage

Format

Sample PSM Data from FragPipe

Description

Usage

Format

Sample PSM Data from Proteome Discoverer

Description

Usage

Format