Package 'PrInCE' reference manual

Title:	Predicting Interactomes from Co-Elution
Description:	PrInCE (Predicting Interactomes from Co-Elution) uses a naive Bayes classifier trained on dataset-derived features to recover protein-protein interactions from co-elution chromatogram profiles. This package contains the R implementation of PrInCE.
Authors:	Michael Skinnider [aut, trl, cre], R. Greg Stacey [ctb], Nichollas Scott [ctb], Anders Kristensen [ctb], Leonard Foster [aut, led]
Maintainer:	Michael Skinnider <[email protected]>
License:	GPL-3 + file LICENSE
Version:	1.23.0
Built:	2025-03-04 06:07:15 UTC
Source:	https://github.com/bioc/PrInCE

Create an adjacency matrix from a data frame

Description

Convert a data frame containing pairwise interactions into an adjacency matrix. The resulting square adjacency matrix contains ones for proteins that are found in interactions and zeroes otherwise.

Usage

adjacency_matrix_from_data_frame(dat, symmetric = TRUE, node_columns = c(1, 2))
adjacency_matrix_from_data_frame(dat, symmetric = TRUE, node_columns = c(1, 2))

Arguments

`dat`	a data frame containing pairwise interactions
`symmetric`	if true, interactions in both directions will be added to the adjacency matrix
`node_columns`	a vector of length two, denoting either the indices (integer vector) or column names (character vector) of the columns within the data frame containing the nodes participating in pairwise interactions; defaults to the first two columns of the data frame (`c(1, 2)`)

Value

an adjacency matrix between all interacting proteins

Examples

ppi <- data.frame(protein_A = paste0("protein", seq_len(10)),
                  protein_B = paste0("protein", c(rep(3, 2), rep(5, 5), 
                                     rep(7, 3))))
adj <- adjacency_matrix_from_data_frame(ppi)

ppi <- data.frame(protein_A = paste0("protein", seq_len(10)),
                  protein_B = paste0("protein", c(rep(3, 2), rep(5, 5), 
                                     rep(7, 3))))
adj <- adjacency_matrix_from_data_frame(ppi)

Create an adjacency matrix from a list of complexes

Description

Convert a list of complexes into a pairwise adjacency matrix. The resulting square adjacency matrix contains ones for proteins that are found in the same complex and zeroes otherwise.

Usage

adjacency_matrix_from_list(complexes)
adjacency_matrix_from_list(complexes)

Arguments

complexes

a list of complexes, with each entry containing complex subunits as a character vector

Value

an adjacency matrix between all complex subunits

Examples

data(gold_standard)
adj <- adjacency_matrix_from_list(gold_standard)

data(gold_standard)
adj <- adjacency_matrix_from_list(gold_standard)

Model selection for Gaussian mixture models

Description

Calculate the AIC, corrected AIC, or BIC for a curve fit with a Gaussian mixture model by nonlinear least squares optimization. This function permits the calculation of the AIC/AICc/BIC after rejecting some Gaussians in the model, for example because their centres are outside the bounds of the profile.

Usage

gaussian_aic(coefs, chromatogram)

gaussian_aicc(coefs, chromatogram)

gaussian_bic(coefs, chromatogram)
gaussian_aic(coefs, chromatogram)

gaussian_aicc(coefs, chromatogram)

gaussian_bic(coefs, chromatogram)

Arguments

`coefs`	the coefficients of the Gaussian mixture model, output by `fit_gaussians`
`chromatogram`	the raw elution profile

Value

the AIC, corrected AIC, or BIC of the fit model

Deconvolve profiles into Gaussian mixture models

Description

Identify peaks in co-fractionation profiles by deconvolving peaks in Gaussian mixture models. Models are mixtures of between 1 and 5 Gaussians. Profiles are pre-processed prior to building Gaussians by filtering and cleaning. By default, profiles with fewer than 5 non-missing points, or fewer than 5 consecutive points after imputation of single missing values, are removed. Profiles are cleaned by replacing missing values with near-zero noise, imputing single missing values as the mean of neighboring points, and smoothing with a moving average filter.

Usage

build_gaussians(
  profile_matrix,
  min_points = 1,
  min_consecutive = 5,
  impute_NA = TRUE,
  smooth = TRUE,
  smooth_width = 4,
  max_gaussians = 5,
  criterion = c("AICc", "AIC", "BIC"),
  max_iterations = 50,
  min_R_squared = 0.5,
  method = c("guess", "random"),
  filter_gaussians_center = TRUE,
  filter_gaussians_height = 0.15,
  filter_gaussians_variance_min = 0.5,
  filter_gaussians_variance_max = 50
)
build_gaussians(
  profile_matrix,
  min_points = 1,
  min_consecutive = 5,
  impute_NA = TRUE,
  smooth = TRUE,
  smooth_width = 4,
  max_gaussians = 5,
  criterion = c("AICc", "AIC", "BIC"),
  max_iterations = 50,
  min_R_squared = 0.5,
  method = c("guess", "random"),
  filter_gaussians_center = TRUE,
  filter_gaussians_height = 0.15,
  filter_gaussians_variance_min = 0.5,
  filter_gaussians_variance_max = 50
)

Arguments

`profile_matrix`	a numeric matrix of co-elution profiles, with proteins in rows, or a `MSnSet` object
`min_points`	filter profiles without at least this many total, non-missing points; passed to `filter_profiles`
`min_consecutive`	filter profiles without at least this many consecutive, non-missing points; passed to `filter_profiles`
`impute_NA`	if true, impute single missing values with the average of neighboring values; passed to `clean_profiles`
`smooth`	if true, smooth the chromatogram with a moving average filter; passed to `clean_profiles`
`smooth_width`	width of the moving average filter, in fractions; passed to `clean_profiles`
`max_gaussians`	the maximum number of Gaussians to fit; defaults to 5. Note that Gaussian mixtures with more parameters than observed (i.e., non-zero or NA) points will not be fit. Passed to `choose_gaussians`
`criterion`	the criterion to use for model selection; one of "AICc" (corrected AIC, and default), "AIC", or "BIC". Passed to `choose_gaussians`
`max_iterations`	the number of times to try fitting the curve with different initial conditions; defaults to 50. Passed to `fit_gaussians`
`min_R_squared`	the minimum R-squared value to accept when fitting the curve with different initial conditions; defaults to 0.5. Passed to `fit_gaussians`
`method`	the method used to select the initial conditions for nonlinear least squares optimization (one of "guess" or "random"); see `make_initial_conditions` for details. Passed to `fit_gaussians`
`filter_gaussians_center`	true or false: filter Gaussians whose centres fall outside the bounds of the chromatogram. Passed to `fit_gaussians`
`filter_gaussians_height`	Gaussians whose heights are below this fraction of the chromatogram height will be filtered. Setting this value to zero disables height-based filtering of fit Gaussians. Passed to `fit_gaussians`
`filter_gaussians_variance_min`	Gaussians whose variance falls below this number of fractions will be filtered. Setting this value to zero disables filtering. Passed to `fit_gaussians`
`filter_gaussians_variance_max`	Gaussians whose variance is above this number of fractions will be filtered. Setting this value to zero disables filtering. Passed to `fit_gaussians`

Value

a list of fit Gaussian mixture models, where each item in the list contains the following five fields: the number of Gaussians used to fit the curve; the R^2 of the fit; the number of iterations used to fit the curve with different initial conditions; the coefficients of the fit model; and the curve predicted by the fit model. Profiles that could not be fit by a Gaussian mixture model above the minimum R-squared cutoff will be absent from the returned list.

Examples

data(scott)
mat <- clean_profiles(scott[seq_len(5), ])
gauss <- build_gaussians(mat, max_gaussians = 3)

data(scott)
mat <- clean_profiles(scott[seq_len(5), ])
gauss <- build_gaussians(mat, max_gaussians = 3)

Calculate the autocorrelation for each protein between a pair of co-elution experiments.

Description

For a given protein, the correlation coefficient to all other proteins in the first condition is calculated, yielding a vector of correlation coefficients. The same procedure is repeated for the second condition, and the two vectors of correlation coefficients are themselves correlated, yielding a metric whereby higher values reflect proteins with unchanging interaction profiles between conditions, while lower values reflect proteins with substantially changing interaction profiles.

Usage

calculate_autocorrelation(
  profile1,
  profile2,
  cor_method = c("pearson", "spearman", "kendall"),
  min_replicates = 1,
  min_fractions = 1,
  min_pairs = 0
)
calculate_autocorrelation(
  profile1,
  profile2,
  cor_method = c("pearson", "spearman", "kendall"),
  min_replicates = 1,
  min_fractions = 1,
  min_pairs = 0
)

Arguments

`profile1`	a numeric matrix or data frame with proteins in rows and fractions in columns, or a `MSnSet` object, representing the first co-elution condition
`profile2`	a numeric matrix or data frame with proteins in rows and fractions in columns, or a `MSnSet` object, representing the second co-elution condition
`cor_method`	the correlation method to use; one of `"pearson"`, `"spearman"`, or `"kendall"`).
`min_fractions`	filter proteins not quantified in at least this many fractions
`min_pairs`	remove correlations between protein pairs not co-occuring in at least this many fractions from the autocorrelation calculation

Details

Note that all of zero, NA, NaN, and infinite values are all treated equivalently as missing values when applying the min_fractions and min_pairs filters, but different handling of missing values will produce different autocorrelation scores.

Value

a named vector of autocorrelation scores for all proteins found in both matrices.

Calculate the default features used to predict interactions in PrInCE

Description

Calculate the six features that are used to discriminate interacting and non-interacting protein pairs based on co-elution profiles in PrInCE, namely: raw Pearson R value, cleaned Pearson R value, raw Pearson P-value, Euclidean distance, co-peak, and co-apex. Optionally, one or more of these can be disabled.

Usage

calculate_features(
  profile_matrix,
  gaussians,
  min_pairs = 0,
  pearson_R_raw = TRUE,
  pearson_R_cleaned = TRUE,
  pearson_P = TRUE,
  euclidean_distance = TRUE,
  co_peak = TRUE,
  co_apex = TRUE,
  n_pairs = FALSE,
  max_euclidean_quantile = 0.9
)
calculate_features(
  profile_matrix,
  gaussians,
  min_pairs = 0,
  pearson_R_raw = TRUE,
  pearson_R_cleaned = TRUE,
  pearson_P = TRUE,
  euclidean_distance = TRUE,
  co_peak = TRUE,
  co_apex = TRUE,
  n_pairs = FALSE,
  max_euclidean_quantile = 0.9
)

Arguments

`profile_matrix`	a numeric matrix of co-elution profiles, with proteins in rows, or a `MSnSet` object
`gaussians`	a list of Gaussian mixture models fit to the profile matrix by `link{build_gaussians}`
`min_pairs`	minimum number of overlapping fractions between any given protein pair to consider a potential interaction
`pearson_R_raw`	if true, include the Pearson correlation (R) between raw profiles as a feature
`pearson_R_cleaned`	if true, include the Pearson correlation (R) between cleaned profiles as a feature
`pearson_P`	if true, include the P-value of the Pearson correlation between raw profiles as a feature
`euclidean_distance`	if true, include the Euclidean distance between cleaned profiles as a feature
`co_peak`	if true, include the 'co-peak score' (that is, the distance, in fractions, between the single highest value of each profile) as a feature
`co_apex`	if true, include the 'co-apex score' (that is, the minimum Euclidean distance between any pair of fit Gaussians) as a feature
`max_euclidean_quantile`	very high Euclidean distance values are trimmed to avoid numerical precision issues; values above this quantile will be replaced with the value at this quantile (default: `0.9`)

Value

a data frame containing the calculated features for all possible protein pairs

Calculate precision at each point in a sequence

Description

Calculate the precision of a list of interactions at each point in the list, given a set of labels.

Usage

calculate_precision(labels)
calculate_precision(labels)

Arguments

labels

a vector of zeroes (FPs) and ones (TPs)

Value

a vector of the same length giving the precision at each point in the input vector

Examples

## calculate features
data(scott)
data(scott_gaussians)
subset <- scott[seq_len(500), ] ## limit to first 500 proteins
gauss <- scott_gaussians[names(scott_gaussians) %in% rownames(subset)]
features <- calculate_features(subset, gauss)
## make training labels
data(gold_standard)
ref <- adjacency_matrix_from_list(gold_standard)
labels <- make_labels(ref, features)
## predict interactions with naive Bayes classifier
ppi <- predict_ensemble(features, labels, classifier = "NB", cv_folds = 3, 
                        models = 1)
## tag precision of each interaction
ppi$precision <- calculate_precision(ppi$label)

## calculate features
data(scott)
data(scott_gaussians)
subset <- scott[seq_len(500), ] ## limit to first 500 proteins
gauss <- scott_gaussians[names(scott_gaussians) %in% rownames(subset)]
features <- calculate_features(subset, gauss)
## make training labels
data(gold_standard)
ref <- adjacency_matrix_from_list(gold_standard)
labels <- make_labels(ref, features)
## predict interactions with naive Bayes classifier
ppi <- predict_ensemble(features, labels, classifier = "NB", cv_folds = 3, 
                        models = 1)
## tag precision of each interaction
ppi$precision <- calculate_precision(ppi$label)

Check the format of a list of Gaussians

Description

Test whether an input list of Gaussians conforms to the format expected by PrInCE: that is, a named list with five fields for each entry, i.e., the number of Gaussians in the mixture model, the $r^2$ value, the number of iterations used by nls, the coefficients of each model, and the fitted curve.

Usage

check_gaussians(
  gaussians,
  proteins = NULL,
  replicate_idx = NULL,
  n_error = 3,
  pct_warning = 0.1
)
check_gaussians(
  gaussians,
  proteins = NULL,
  replicate_idx = NULL,
  n_error = 3,
  pct_warning = 0.1
)

Arguments

`gaussians`	the list of Gaussians
`proteins`	the complete set of input proteins
`replicate_idx`	the replicate being analyzed, if input proteins are provided; used to throw more informative error messages
`n_error`	minimum number of proteins that can have fitted Gaussians without throwing an error
`pct_warning`	minimum fraction of proteins that can have fitted Gaussians without giving a warning

Details

Optionally, some extra checks will be done on the fraction of proteins in the complete dataset for which a Gaussian mixture model could be fit, if provided. In particular, the function will throw an error if fewer than n_error proteins have a fitted Gaussian, and emit a warning if fewer than pct_warning do.

Value

TRUE if all conditions are met, but throws an error if any is not

Examples

data(scott_gaussians)
check_gaussians(scott_gaussians)

data(scott_gaussians)
check_gaussians(scott_gaussians)

Fit a Gaussian mixture model to a co-elution profile

Description

Fit mixtures of one or more Gaussians to the curve formed by a chromatogram profile, and choose the best fitting model using an information criterion of choice.

Usage

choose_gaussians(
  chromatogram,
  points = NULL,
  max_gaussians = 5,
  criterion = c("AICc", "AIC", "BIC"),
  max_iterations = 10,
  min_R_squared = 0.5,
  method = c("guess", "random"),
  filter_gaussians_center = TRUE,
  filter_gaussians_height = 0.15,
  filter_gaussians_variance_min = 0.1,
  filter_gaussians_variance_max = 50
)
choose_gaussians(
  chromatogram,
  points = NULL,
  max_gaussians = 5,
  criterion = c("AICc", "AIC", "BIC"),
  max_iterations = 10,
  min_R_squared = 0.5,
  method = c("guess", "random"),
  filter_gaussians_center = TRUE,
  filter_gaussians_height = 0.15,
  filter_gaussians_variance_min = 0.1,
  filter_gaussians_variance_max = 50
)

Arguments

`chromatogram`	a numeric vector corresponding to the chromatogram trace
`points`	optional, the number of non-NA points in the raw data
`max_gaussians`	the maximum number of Gaussians to fit; defaults to 5. Note that Gaussian mixtures with more parameters than observed (i.e., non-zero or NA) points will not be fit.
`criterion`	the criterion to use for model selection; one of "AICc" (corrected AIC, and default), "AIC", or "BIC"
`max_iterations`	the number of times to try fitting the curve with different initial conditions; defaults to 10
`min_R_squared`	the minimum R-squared value to accept when fitting the curve with different initial conditions; defaults to 0.5
`method`	the method used to select the initial conditions for nonlinear least squares optimization (one of "guess" or "random"); see `make_initial_conditions` for details
`filter_gaussians_center`	true or false: filter Gaussians whose centres fall outside the bounds of the chromatogram
`filter_gaussians_height`	Gaussians whose heights are below this fraction of the chromatogram height will be filtered. Setting this value to zero disables height-based filtering of fit Gaussians
`filter_gaussians_variance_min`	Gaussians whose variance is below this threshold will be filtered. Setting this value to zero disables filtering.
`filter_gaussians_variance_max`	Gaussians whose variance is above this threshold will be filtered. Setting this value to zero disables filtering.

Value

a list with five entries: the number of Gaussians used to fit the curve; the R^2 of the fit; the number of iterations used to fit the curve with different initial conditions; the coefficients of the fit model; and the curve predicted by the fit model.

Examples

data(scott)
chrom <- clean_profile(scott[1, ])
gauss <- choose_gaussians(chrom, max_gaussians = 3)

data(scott)
chrom <- clean_profile(scott[1, ])
gauss <- choose_gaussians(chrom, max_gaussians = 3)

Preprocess a co-elution profile

Description

Clean a co-elution/co-fractionation profile by (1) imputing single missing values with the average of neighboring values, (2) replacing missing values with random, near-zero noise, and (3) smoothing with a moving average filter.

Usage

clean_profile(
  chromatogram,
  impute_NA = TRUE,
  smooth = TRUE,
  smooth_width = 4,
  noise_floor = 0.001
)
clean_profile(
  chromatogram,
  impute_NA = TRUE,
  smooth = TRUE,
  smooth_width = 4,
  noise_floor = 0.001
)

Arguments

`chromatogram`	a numeric vector corresponding to the chromatogram trace
`impute_NA`	if true, impute single missing values with the average of neighboring values
`smooth`	if true, smooth the chromatogram with a moving average filter
`smooth_width`	width of the moving average filter, in fractions
`noise_floor`	mean value of the near-zero noise to add

Value

a cleaned profile

Examples

data(scott)
chrom <- scott[16, ]
cleaned <- clean_profile(chrom)

data(scott)
chrom <- scott[16, ]
cleaned <- clean_profile(chrom)

Preprocess a co-elution profile matrix

Description

Clean a matrix of co-elution/co-fractionation profiles by (1) imputing single missing values with the average of neighboring values, (2) replacing missing values with random, near-zero noise, and (3) smoothing with a moving average filter.

Usage

clean_profiles(
  profile_matrix,
  impute_NA = TRUE,
  smooth = TRUE,
  smooth_width = 4,
  noise_floor = 0.001
)
clean_profiles(
  profile_matrix,
  impute_NA = TRUE,
  smooth = TRUE,
  smooth_width = 4,
  noise_floor = 0.001
)

Arguments

`profile_matrix`	a numeric matrix of co-elution profiles, with proteins in rows, or a `MSnSet` object
`impute_NA`	if true, impute single missing values with the average of neighboring values
`smooth`	if true, smooth the chromatogram with a moving average filter
`smooth_width`	width of the moving average filter, in fractions
`noise_floor`	mean value of the near-zero noise to add

Value

a cleaned matrix

Examples

data(scott)
mat <- scott[c(1, 16), ]
mat_clean <- clean_profiles(mat)

data(scott)
mat <- scott[c(1, 16), ]
mat_clean <- clean_profiles(mat)

Calculate the co-apex score for every protein pair

Description

Calculate the co-apex score for every pair of proteins. This is defined as the minimum Euclidean distance between any two Gaussians fit to each profile.

Usage

co_apex(gaussians, proteins = NULL)
co_apex(gaussians, proteins = NULL)

Arguments

`gaussians`	a list of Gaussian mixture models fit to the profile matrix by `link{build_gaussians}`
`proteins`	all proteins being scored, optionally including those without Gaussian fits

Value

a matrix of co-apex scores

Examples

data(scott_gaussians)
gauss <- scott_gaussians[seq_len(25)]
CA <- co_apex(gauss)

data(scott_gaussians)
gauss <- scott_gaussians[seq_len(25)]
CA <- co_apex(gauss)

Combine features across multiple replicates

Description

Concatenate features extracted from multiple replicates to a single data frame that will be used as input to a classifier. Doing so allows the classifier to naturally weight evidence for an interaction between each protein pair from each feature in each replicate in proportion to its discriminatory power on known examples.

Usage

concatenate_features(feature_list)
concatenate_features(feature_list)

Arguments

feature_list

a list of feature data frames, as produced by calculate_features, with proteins in the first two columns

Value

a data frame containing features for all protein pairs across all replicates

Detect significantly interacting complexes in a chromatogram matrix

Description

Use a permutation testing approach to identify complexes that show a significant tendency to interact, relative to random sets of complexes of equivalent size. The function begins by calculating the Pearson correlation or Euclidean distance between all proteins in the matrix, and

Usage

detect_complexes(
  profile_matrix,
  complexes,
  method = c("pearson", "euclidean"),
  min_pairs = 10,
  bootstraps = 100,
  progress = TRUE
)
detect_complexes(
  profile_matrix,
  complexes,
  method = c("pearson", "euclidean"),
  min_pairs = 10,
  bootstraps = 100,
  progress = TRUE
)

Arguments

`profile_matrix`	a matrix of chromatograms, with proteins in the rows and fractions in the columns, or a `MSnSet` object
`complexes`	a named list of protein complexes, where the name is the complex name and the entries are proteins within that complex
`method`	method to use to calculate edge weights; one of `pearson` or `euclidean`
`min_pairs`	the minimum number of pairwise observations to count a correlation or distance towards the z score
`bootstraps`	number of bootstraps to execute to estimate z scores
`progress`	whether to show the progress of the function

Value

a named vector of z scores for each complex in the input list

Examples

data(scott)
data(gold_standard)
complexes <- gold_standard[lengths(gold_standard) >= 3]
z_scores <- detect_complexes(t(scott), complexes)
length(na.omit(z_scores)) ## number of complexes that could be tested
z_scores[which.max(z_scores)] ## most significant complex

data(scott)
data(gold_standard)
complexes <- gold_standard[lengths(gold_standard) >= 3]
z_scores <- detect_complexes(t(scott), complexes)
length(na.omit(z_scores)) ## number of complexes that could be tested
z_scores[which.max(z_scores)] ## most significant complex

Filter a co-elution profile matrix

Description

Filter a matrix of co-elution/co-fractionation profiles by removing profiles without a certain number of non-mising or consecutive points.

Usage

filter_profiles(profile_matrix, min_points = 1, min_consecutive = 5)
filter_profiles(profile_matrix, min_points = 1, min_consecutive = 5)

Arguments

`profile_matrix`	a numeric matrix of co-elution profiles, with proteins in rows, or a `MSnSet` object
`min_points`	filter profiles without at least this many total, non-missing points
`min_consecutive`	filter profiles without at least this many consecutive, non-missing points

Value

the filtered profile matrix

Examples

data(scott)
nrow(scott)
filtered <- filter_profiles(scott)
nrow(scott)

data(scott)
nrow(scott)
filtered <- filter_profiles(scott)
nrow(scott)

Output the fit curve for a given mixture of Gaussians

Description

For a Gaussian mixture model fit to a curve by fit_gaussians, output the fit curve using the coefficients rather than the nls object. This allows individual Gaussians to be removed from the fit model: for example, if their height is below a certain threshold, or their centres are outside the bounds of the chromatogram.

Usage

fit_curve(coef, indices)
fit_curve(coef, indices)

Arguments

`coef`	numeric vector of coefficients for a Gaussian mixture model fit by `fit_gaussians`. This function assumes that the heights of the Gaussians are specified by coefficients beginning with "A" ("A1", "A2", "A3", etc.), centres are specified by coefficients beginning with "mu", and standard deviations are specified by coefficients beginning with "sigma".
`indices`	the indices, or x-values, to predict a fitted curve for (for example, the fractions in a given chromatogram)

Value

the fitted curve

Examples

data(scott)
chrom <- clean_profile(scott[1, ])
fit <- fit_gaussians(chrom, n_gaussians = 1)
curve <- fit_curve(fit$coefs, seq_along(chrom))

data(scott)
chrom <- clean_profile(scott[1, ])
fit <- fit_gaussians(chrom, n_gaussians = 1)
curve <- fit_curve(fit$coefs, seq_along(chrom))

Fit a mixture of Gaussians to a chromatogram curve

Description

Fit mixtures of one or more Gaussians to the curve formed by a chromatogram profile, using nonlinear least-squares.

Usage

fit_gaussians(
  chromatogram,
  n_gaussians,
  max_iterations = 10,
  min_R_squared = 0.5,
  method = c("guess", "random"),
  filter_gaussians_center = TRUE,
  filter_gaussians_height = 0.15,
  filter_gaussians_variance_min = 0.1,
  filter_gaussians_variance_max = 50
)
fit_gaussians(
  chromatogram,
  n_gaussians,
  max_iterations = 10,
  min_R_squared = 0.5,
  method = c("guess", "random"),
  filter_gaussians_center = TRUE,
  filter_gaussians_height = 0.15,
  filter_gaussians_variance_min = 0.1,
  filter_gaussians_variance_max = 50
)

Arguments

`chromatogram`	a numeric vector corresponding to the chromatogram trace
`n_gaussians`	the number of Gaussians to fit
`max_iterations`	the number of times to try fitting the curve with different initial conditions; defaults to 10
`min_R_squared`	the minimum R-squared value to accept when fitting the curve with different initial conditions; defaults to 0.5
`method`	the method used to select the initial conditions for nonlinear least squares optimization (one of "guess" or "random"); see `make_initial_conditions` for details
`filter_gaussians_center`	true or false: filter Gaussians whose centres fall outside the bounds of the chromatogram
`filter_gaussians_height`	Gaussians whose heights are below this fraction of the chromatogram height will be filtered. Setting this value to zero disables height-based filtering of fit Gaussians
`filter_gaussians_variance_min`	Gaussians whose variance falls below this number of fractions will be filtered. Setting this value to zero disables filtering.
`filter_gaussians_variance_max`	Gaussians whose variance is above this number of fractions will be filtered. Setting this value to zero disables filtering.

Value

a list with six entries: the number of Gaussians used to fit the curve; the R^2 of the fit; the number of iterations used to fit the curve with different initial conditions; the coefficients of the fit model; and the fit curve predicted by the fit model.

Examples

data(scott)
chrom <- clean_profile(scott[1, ])
fit <- fit_gaussians(chrom, n_gaussians = 1)

data(scott)
chrom <- clean_profile(scott[1, ])
fit <- fit_gaussians(chrom, n_gaussians = 1)

Reference set of human protein complexes

Description

A reference set of 467 experimentally confirmed human protein complexes, derived from the EBI Complex Portal database.

Usage

data(gold_standard)
data(gold_standard)

Format

a list containing 467 entries (character vectors)

Details

467 protein complexes, ranging in size from 2 to 44 proteins and involving 877 proteins in total, to provide a reference set of true positive and true negative interactions (intra- and inter-complex interactions, respectively) for demonstration in PrInCE analysis of a co-elution dataset. Other "gold standards" are possible in practice, most notably the CORUM database; however, the Complex Portal reference set is included in this package due to its CC-BY license.

Source

https://www.ebi.ac.uk/complexportal/complex/organisms

Impute single missing values

Description

Impute single missing values within a chromatogram profile as the average of their neighbors.

Usage

impute_neighbors(chromatogram)
impute_neighbors(chromatogram)

Arguments

chromatogram

a numeric vector corresponding to the chromatogram trace

Value

the imputed chromatogram

Examples

data(scott)
chrom <- scott[16, ]
imputed <- impute_neighbors(chrom)

data(scott)
chrom <- scott[16, ]
imputed <- impute_neighbors(chrom)

Test whether a network is unweighted

Description

Test whether a network is unweighted

Usage

is_unweighted(network)
is_unweighted(network)

Arguments

network

the network to analyze

Value

true if the input network is a square logical or numeric matrix

Examples

data(gold_standard)
adj <- adjacency_matrix_from_list(gold_standard)
is_unweighted(adj) ## returns TRUE

data(gold_standard)
adj <- adjacency_matrix_from_list(gold_standard)
is_unweighted(adj) ## returns TRUE

Test whether a network is weighted

Description

Test whether a network is weighted

Usage

is_weighted(network)
is_weighted(network)

Arguments

network

the network to analyze

Value

true if the input network is a square numeric matrix with more than two values

Examples

data(gold_standard)
adj <- adjacency_matrix_from_list(gold_standard)
is_weighted(adj) ## returns FALSE

data(gold_standard)
adj <- adjacency_matrix_from_list(gold_standard)
is_weighted(adj) ## returns FALSE

Interactome of HeLa cells

Description

Co-elution profiles derived from size exclusion chromatography (SEC) of HeLa cell lysates.

Usage

data(kristensen)
data(kristensen)

Format

a data frame with 1875 rows and 48 columns, with proteins in rows and SEC fractions in columns

Details

Protein quantitation was accomplished by SILAC (stable isotopic labelling by amino acids in cell culture), and is ratiometric, i.e., it reflects the ratio between the intensity of the heavy isotope and the light isotope ("H/L"). The dataset was initially described in Kristensen et al., Nat. Methods 2012. The medium isotope channel from replicate 1 (Supplementary Table 1a in the online supplementary information) is included in the PrInCE package. The R script used to generate this matrix from the supplementary materials of the paper is provided in the data-raw directory of the package source code.

Source

https://www.nature.com/articles/nmeth.2131

Fitted Gaussian mixture models for the `kristensen` dataset

Description

The kristensen dataset consists of protein co-migration profiles derived from size exclusion chromatography (SEC) of unstimulated HeLa cell lysates. The kristensen_gaussians object contains Gaussian mixture models fit by the function build_gaussians; this is bundled with the R package in order to expedite the demonstration code, as the process of Gaussian fitting is one of the more time-consuming aspects of the package.

Usage

data(kristensen_gaussians)
data(kristensen_gaussians)

Format

a named list with 1117 entries; names are proteins, and list items conain information about fitted Gaussians in the format that PrInCE expects

Details

As with the kristensen dataset, the code used to generate this data object is provided in the data-raw directory of the package source.

Create a feature vector for a classifier from a data frame

Description

Convert a data frame containing pairwise interactions, and a score or other data associated with each interaction, into a feature vector that matches the dimensions of a data frame used as input to a classifier, such as a naive Bayes, random forests, or support vector machine classifier.

Usage

make_feature_from_data_frame(
  dat,
  target,
  dat_node_cols = c(1, 2),
  target_node_cols = c(1, 2),
  feature_col = 3,
  default_value = NA
)
make_feature_from_data_frame(
  dat,
  target,
  dat_node_cols = c(1, 2),
  target_node_cols = c(1, 2),
  feature_col = 3,
  default_value = NA
)

Arguments

`dat`	a data frame containing pairwise interactions and a feature to be converted to a vector in a third column
`target`	the data frame of features that will be provided as input to a classifier
`dat_node_cols`	a vector of length two, denoting either the indices (integer vector) or column names (character vector) of the columns within the feature data frame; defaults to the first two columns of the data frame (`c(1, 2)`)
`target_node_cols`	a vector of length two, denoting either the indices (integer vector) or column names (character vector) of the columns within the target data frame; defaults to the first two columns of the data frame (`c(1, 2)`)
`feature_col`	the name or index of the column in the first data frame that contains a feature for each interaction
`default_value`	the default value for protein pairs that are not in the first data frame (set, by default, to `NA`)

Value

a vector matching the dimensions and order of the feature data frame, to use as input for a classifier in interaction prediction

Create a feature vector from expression data

Description

Convert a gene or protein expression matrix into a feature vector that matches the dimensions of a data frame used as input to a classifier, such as a naive Bayes, random forests, or support vector machine classifier, by calculating the correlation between each pair of genes or proteins.

Usage

make_feature_from_expression(expr, dat, node_columns = c(1, 2), ...)
make_feature_from_expression(expr, dat, node_columns = c(1, 2), ...)

Arguments

`expr`	a matrix containing gene or protein expression data, with genes/proteins in columns and samples in rows
`dat`	the data frame of features to be used by the classifier, with protein pairs in the columns specified by the `node_columns` argument
`node_columns`	a vector of length two, denoting either the indices (integer vector) or column names (character vector) of the columns within the data frame containing the nodes participating in pairwise interactions; defaults to the first two columns of the data frame (`c(1, 2)`)
`...`	arguments passed to `cor`

Value

a vector matching the dimensions and order of the feature data frame, to use as input for a classifier in interaction prediction

Make initial conditions for curve fitting with a mixture of Gaussians

Description

Construct a set of initial conditions for curve fitting using nonlinear least squares using a mixture of Gaussians. The "guess" method ports code from the Matlab release of PrInCE. This method finds local maxima within the chromatogram, orders them by their separation (in number of fractions) from the previous local maxima, and uses the positions and heights of these local maxima (+/- some random noise) as initial conditions for Gaussian curve-fitting. The "random" method simply picks random values within the fraction and intensity intervals as starting points for Gaussian curve-fitting. The initial value of sigma is set by default to a random number within +/- 0.5 of two for both modes; this is based on our manual inspection of a large number of chromatograms.

Usage

make_initial_conditions(
  chromatogram,
  n_gaussians,
  method = c("guess", "random"),
  sigma_default = 2,
  sigma_noise = 0.5,
  mu_noise = 1.5,
  A_noise = 0.5
)
make_initial_conditions(
  chromatogram,
  n_gaussians,
  method = c("guess", "random"),
  sigma_default = 2,
  sigma_noise = 0.5,
  mu_noise = 1.5,
  A_noise = 0.5
)

Arguments

`chromatogram`	a numeric vector corresponding to the chromatogram trace
`n_gaussians`	the number of Gaussians being fit
`method`	one of "guess" or "random", discussed above
`sigma_default`	the default mean initial value of sigma
`sigma_noise`	the amount of random noise to add or subtract from the default mean initial value of sigma
`mu_noise`	the amount of random noise to add or subtract from the Gaussian centers in "guess" mode
`A_noise`	the amount of random noise to add or subtract from the Gaussian heights in "guess" mode

Value

a list of three numeric vectors (A, mu, and sigma), each having a length equal to the maximum number of Gaussians to fit

Examples

data(scott)
chrom <- clean_profile(scott[16, ])
set.seed(0)
start <- make_initial_conditions(chrom, n_gaussians = 2, method = "guess")

data(scott)
chrom <- clean_profile(scott[16, ])
set.seed(0)
start <- make_initial_conditions(chrom, n_gaussians = 2, method = "guess")

Make labels for a classifier based on a gold standard

Description

Create labels for a classifier for protein pairs in the same order as in a dataset that will be used as input to a classifier, in a memory-friendly way.

Usage

make_labels(gold_standard, dat, node_columns = c(1, 2), protein_groups = NULL)
make_labels(gold_standard, dat, node_columns = c(1, 2), protein_groups = NULL)

Arguments

`gold_standard`	an adjacency matrix of gold-standard interactions
`dat`	a data frame with interacting proteins in the first two columns
`node_columns`	a vector of length two, denoting either the indices (integer vector) or column names (character vector) of the columns within the data frame containing the nodes participating in pairwise interactions; defaults to the first two columns of the data frame (`c(1, 2)`)
`protein_groups`	optionally, specify a list linking each protein in the first two columns of the input data frame to a protein group

Value

a vector of the same length as the input dataset, containing NAs for protein pairs not in the gold standard and ones or zeroes based on the content of the adjacency matrix

Examples

data(gold_standard)
adj <- adjacency_matrix_from_list(gold_standard)
proteins <- unique(unlist(gold_standard))
dat <- data.frame(protein_A = sample(proteins, 10), 
                  protein_B = sample(proteins, 10))
labels <- make_labels(adj, dat)

data(gold_standard)
adj <- adjacency_matrix_from_list(gold_standard)
proteins <- unique(unlist(gold_standard))
dat <- data.frame(protein_A = sample(proteins, 10), 
                  protein_B = sample(proteins, 10))
labels <- make_labels(adj, dat)

Match the dimensions of a query matrix to a profile matrix

Description

Match the row and column names of a square feature matrix to the row names of a profile matrix, adding rows/columns containing NAs when proteins in the profile matrix are missing from the feature matrix.

Usage

match_matrix_dimensions(query, profile_matrix)
match_matrix_dimensions(query, profile_matrix)

Arguments

`query`	a square matrix containing features for pairs of proteins
`profile_matrix`	the profile matrix for which interactions are being predicted

Value

a square matrix with the same row and column names as the input profile matrix, for use in interaction prediction

Examples

data(gold_standard)
subset <- adjacency_matrix_from_list(gold_standard[seq(1, 200)])
target <- adjacency_matrix_from_list(gold_standard)
matched <- match_matrix_dimensions(subset, target)
dim(subset)
dim(target)
dim(matched)

data(gold_standard)
subset <- adjacency_matrix_from_list(gold_standard[seq(1, 200)])
target <- adjacency_matrix_from_list(gold_standard)
matched <- match_matrix_dimensions(subset, target)
dim(subset)
dim(target)
dim(matched)

Predict interactions using an ensemble of classifiers

Description

Use an ensemble of classifiers to predict interactions from co-elution dataset features. The ensemble approach ensures that results are robust to the partitioning of the dataset into folds. For each model, the median of classifier scores across all folds is calculated. Then, the median of all such medians across all models is calculated.

Usage

predict_ensemble(
  dat,
  labels,
  classifier = c("NB", "SVM", "RF", "LR"),
  models = 1,
  cv_folds = 10,
  trees = 500,
  node_columns = c(1, 2)
)
predict_ensemble(
  dat,
  labels,
  classifier = c("NB", "SVM", "RF", "LR"),
  models = 1,
  cv_folds = 10,
  trees = 500,
  node_columns = c(1, 2)
)

Arguments

`dat`	a data frame containing interacting gene/protein pairs in the first two columns, and the features to use for classification in the remaining columns
`labels`	labels for each interaction in `dat`: 0 for negatives, 1 for positives, and NA for interactions outside the reference set
`classifier`	the type of classifier to use; one of `"NB"` (naive Bayes), `"SVM"` (support vector machine), `"RF"` (random forest), or `"LR"` (logistic regression)
`models`	the number of classifiers to train
`cv_folds`	the number of folds to split the reference dataset into when training each classifier. By default, each classifier uses ten-fold cross-validation, i.e., the classifier is trained on 90% of the dataset and used to classify the remaining 10%
`trees`	for random forest classifiers only, the number of trees to grow for each fold
`node_columns`	a vector of length two, denoting either the indices (integer vector) or column names (character vector) of the columns within the input data frame containing the nodes participating in pairwise interactions; defaults to the first two columns of the data frame (`c(1, 2)`)

Value

the input data frame of pairwise interactions, ranked by the median of classifier scores across all ensembled models

Examples

## calculate features
data(scott)
data(scott_gaussians)
subset <- scott[seq_len(500), ] ## limit to first 500 proteins
gauss <- scott_gaussians[names(scott_gaussians) %in% rownames(subset)]
features <- calculate_features(subset, gauss)
## make training labels
data(gold_standard)
ref <- adjacency_matrix_from_list(gold_standard)
labels <- make_labels(ref, features)
## predict interactions with naive Bayes classifier
ppi <- predict_ensemble(features, labels, classifier = "NB", 
                        cv_folds = 3, models = 1)

## calculate features
data(scott)
data(scott_gaussians)
subset <- scott[seq_len(500), ] ## limit to first 500 proteins
gauss <- scott_gaussians[names(scott_gaussians) %in% rownames(subset)]
features <- calculate_features(subset, gauss)
## make training labels
data(gold_standard)
ref <- adjacency_matrix_from_list(gold_standard)
labels <- make_labels(ref, features)
## predict interactions with naive Bayes classifier
ppi <- predict_ensemble(features, labels, classifier = "NB", 
                        cv_folds = 3, models = 1)

Predict interactions given a set of features and examples

Description

Discriminate interacting from non-interacting protein pairs by training a machine learning model on a set of labelled examples, given a set of features derived from a co-elution profile matrix (see calculate_features.

Usage

predict_interactions(
  features,
  gold_standard,
  classifier = c("NB", "SVM", "RF", "LR", "ensemble"),
  verbose = FALSE,
  models = 10,
  cv_folds = 10,
  trees = 500
)
predict_interactions(
  features,
  gold_standard,
  classifier = c("NB", "SVM", "RF", "LR", "ensemble"),
  verbose = FALSE,
  models = 10,
  cv_folds = 10,
  trees = 500
)

Arguments

`features`	a data frame with proteins in the first two columns, and features to be passed to the classifier in the remaining columns
`gold_standard`	an adjacency matrix of "gold standard" interactions used to train the classifier
`classifier`	the type of classifier to use: one of `"NB"` (naive Bayes), `"SVM"` (support vector machine), `"RF"` (random forest), `"LR"` (logistic regression), or `"ensemble"` (an ensemble of all four)
`verbose`	if `TRUE`, print a series of messages about the stage of the analysis
`models`	the number of classifiers to train and average across, each with a different k-fold cross-validation split
`cv_folds`	the number of folds to use for k-fold cross-validation
`trees`	for random forests only, the number of trees in the forest

Details

PrInCE implements four different classifiers (naive Bayes, support vector machine, random forest, and logistic regression). Naive Bayes is used as a default. The classifiers are trained on the gold standards using a ten-fold cross-validation procedure, training on 90 that are part of the training data, the held-out split is used to assign a classifier score, whereas for the remaining protein pairs, the median of all ten folds is used. Furthermore, to ensure the results are not sensitive to the precise classifier split used, an ensemble of multiple classifiers (ten, by default) is trained, and the classifier score is subsequently averaged across classifiers.

PrInCE can also ensemble across multiple different types of classifiers, by supplying the "ensemble" option to the classifier argument.

Value

a ranked data frame of pairwise interactions, with the classifier score, label, and cumulative precision for each interaction

Examples

## calculate features
data(scott)
data(scott_gaussians)
subset <- scott[seq_len(500), ] ## limit to first 500 proteins
gauss <- scott_gaussians[names(scott_gaussians) %in% rownames(subset)]
features <- calculate_features(subset, gauss)
## load training data
data(gold_standard)
ref <- adjacency_matrix_from_list(gold_standard)
## predict interactions
ppi <- predict_interactions(features, ref, cv_folds = 3, models = 1)

## calculate features
data(scott)
data(scott_gaussians)
subset <- scott[seq_len(500), ] ## limit to first 500 proteins
gauss <- scott_gaussians[names(scott_gaussians) %in% rownames(subset)]
features <- calculate_features(subset, gauss)
## load training data
data(gold_standard)
ref <- adjacency_matrix_from_list(gold_standard)
## predict interactions
ppi <- predict_interactions(features, ref, cv_folds = 3, models = 1)

PrInCE: Prediction of Interactomes from Co-Elution

Description

PrInCE is a computational approach to infer protein-protein interaction networks from co-elution proteomics data, also called co-migration, co-fractionation, or protein correlation profiling. This family of methods separates interacting protein complexes on the basis of their diameter or biochemical properties. Protein-protein interactions can then be inferred for pairs of proteins with similar elution profiles. PrInCE implements a machine-learning approach to identify protein-protein interactions given a set of labelled examples, using features derived exclusively from the data. This allows PrInCE to infer high-quality protein interaction networks from raw proteomics data, without bias towards known interactions or functionally associated proteins, making PrInCE a unique resource for discovery.

Usage

PrInCE(
  profiles,
  gold_standard,
  gaussians = NULL,
  precision = NULL,
  verbose = FALSE,
  min_points = 1,
  min_consecutive = 5,
  min_pairs = 3,
  impute_NA = TRUE,
  smooth = TRUE,
  smooth_width = 4,
  max_gaussians = 5,
  max_iterations = 50,
  min_R_squared = 0.5,
  method = c("guess", "random"),
  criterion = c("AICc", "AIC", "BIC"),
  pearson_R_raw = TRUE,
  pearson_R_cleaned = TRUE,
  pearson_P = TRUE,
  euclidean_distance = TRUE,
  co_peak = TRUE,
  co_apex = TRUE,
  n_pairs = FALSE,
  classifier = c("NB", "SVM", "RF", "LR", "ensemble"),
  models = 1,
  cv_folds = 10,
  trees = 500
)
PrInCE(
  profiles,
  gold_standard,
  gaussians = NULL,
  precision = NULL,
  verbose = FALSE,
  min_points = 1,
  min_consecutive = 5,
  min_pairs = 3,
  impute_NA = TRUE,
  smooth = TRUE,
  smooth_width = 4,
  max_gaussians = 5,
  max_iterations = 50,
  min_R_squared = 0.5,
  method = c("guess", "random"),
  criterion = c("AICc", "AIC", "BIC"),
  pearson_R_raw = TRUE,
  pearson_R_cleaned = TRUE,
  pearson_P = TRUE,
  euclidean_distance = TRUE,
  co_peak = TRUE,
  co_apex = TRUE,
  n_pairs = FALSE,
  classifier = c("NB", "SVM", "RF", "LR", "ensemble"),
  models = 1,
  cv_folds = 10,
  trees = 500
)

Arguments

`profiles`	the co-elution profile matrix, or a list of profile matrices if replicate experiments were performed. Can be a single numeric matrix, with proteins in rows and fractions in columns, or a list of matrices. Alternatively, can be provided as a single `MSnSet` object or a list of objects.
`gold_standard`	a set of 'gold standard' interactions, used to train the classifier. Can be provided either as an adjacency matrix, in which both rows and columns correspond to protein IDs in the co-elution matrix or matrices, or as a list of proteins in the same complex, which will be converted to an adjacency matrix by PrInCE. Zeroes in the adjacency matrix are interpreted by PrInCE as "true negatives" when calculating precision.
`gaussians`	optionally, provide Gaussian mixture models fit by the `build_gaussians` function. If `profiles` is a numeric matrix, this should be the named list output by `build_gaussians` for that matrix; if `profiles` is a list of numeric matrices, this should be a list of named lists
`precision`	optionally, return only interactions above the given precision; by default, all interactions are returned and the user can subsequently threshold the list using the `threshold_precision` function
`verbose`	if `TRUE`, print a series of messages about the stage of the analysis
`min_points`	filter profiles without at least this many total, non-missing points; passed to `filter_profiles`
`min_consecutive`	filter profiles without at least this many consecutive, non-missing points; passed to `filter_profiles`
`min_pairs`	minimum number of overlapping fractions between any given protein pair to consider a potential interaction
`impute_NA`	if true, impute single missing values with the average of neighboring values; passed to `clean_profiles`
`smooth`	if true, smooth the chromatogram with a moving average filter; passed to `clean_profiles`
`smooth_width`	width of the moving average filter, in fractions; passed to `clean_profiles`
`max_gaussians`	the maximum number of Gaussians to fit; defaults to 5. Note that Gaussian mixtures with more parameters than observed (i.e., non-zero or NA) points will not be fit. Passed to `choose_gaussians`
`max_iterations`	the number of times to try fitting the curve with different initial conditions; defaults to 50. Passed to `fit_gaussians`
`min_R_squared`	the minimum R-squared value to accept when fitting the curve with different initial conditions; defaults to 0.5. Passed to `fit_gaussians`
`method`	the method used to select the initial conditions for nonlinear least squares optimization (one of "guess" or "random"); see `make_initial_conditions` for details. Passed to `fit_gaussians`
`criterion`	the criterion to use for model selection; one of "AICc" (corrected AIC, and default), "AIC", or "BIC". Passed to `choose_gaussians`
`pearson_R_raw`	if true, include the Pearson correlation (R) between raw profiles as a feature
`pearson_R_cleaned`	if true, include the Pearson correlation (R) between cleaned profiles as a feature
`pearson_P`	if true, include the P-value of the Pearson correlation between raw profiles as a feature
`euclidean_distance`	if true, include the Euclidean distance between cleaned profiles as a feature
`co_peak`	if true, include the 'co-peak score' (that is, the distance, in fractions, between the single highest value of each profile) as a feature
`co_apex`	if true, include the 'co-apex score' (that is, the minimum Euclidean distance between any pair of fit Gaussians) as a feature
`n_pairs`	if `TRUE`, include the number of fractions in which both of a given pair of proteins were detected as a feature
`classifier`	the type of classifier to use: one of `"NB"` (naive Bayes), `"SVM"` (support vector machine), `"RF"` (random forest), `"LR"` (logistic regression), or `"ensemble"` (an ensemble of all four)
`models`	the number of classifiers to train and average across, each with a different k-fold cross-validation split
`cv_folds`	the number of folds to use for k-fold cross-validation
`trees`	for random forests only, the number of trees in the forest

Details

PrInCE takes as input a co-elution matrix, with detected proteins in rows and fractions as columns, and a set of 'gold standard' true positives and true negatives. If replicate experiments were performed, a list of co-elution matrices can be provided as input. PrInCE will construct features for each replicate separately and use features from all replicates as input to the classifier. The 'gold standard' can be either a data frame or adjacency matrix of known interactions (and non-interactions), or a list of protein complexes. For computational convenience, Gaussian mixture models can be pre-fit to every profile and provided separately to the PrInCE function. The matrix, or matrices, can be provided to PrInCE either as numeric matrices or as MSnSet objects.

PrInCE implements three different types of classifiers to predict protein-protein interaction networks, including naive Bayes (the default), random forests, and support vector machines. The classifiers are trained on the gold standards using a ten-fold cross-validation procedure, training on 90 that are part of the training data, the held-out split is used to assign a classifier score, whereas for the remaining protein pairs, the median of all ten folds is used. Furthermore, to ensure the results are not sensitive to the precise classifier split used, an ensemble of multiple classifiers (ten, by default) is trained, and the classifier score is subsequently averaged across classifiers. PrInCE can also ensemble across a set of classifiers.

By default, PrInCE calculates six features from each pair of co-elution profiles as input to the classifier, including conventional similarity metrics but also several features specifically adapted to co-elution proteomics. For example, one such feature is derived from fitting a Gaussian mixture model to each elution profile, then calculating the smallest Euclidean distance between any pair of fitted Gaussians. The complete set of features includes:

the Pearson correlation between raw co-elution profiles;
the p-value of the Pearson correlation between raw co-elution profiles;
the Pearson correlation between cleaned profiles, which are generated by imputing single missing values with the mean of their neighbors, replacing remaining missing values with random near-zero noise, and smoothing the profiles using a moving average filter (see clean_profile);
the Euclidean distance between cleaned profiles;
the 'co-peak' score, defined as the distance, in fractions, between the maximum values of each profile; and
the 'co-apex' score, defined as the minimum Euclidean distance between any pair of fit Gaussians

The output of PrInCE is a ranked data frame, containing the classifier score for every possible protein pair. PrInCE also calculates the precision at every point in this ranked list, using the 'gold standard' set of protein complexes or binary interactions. Our recommendation is to select a threshold for the precision and use this to construct an unweighted protein interaction network.

Value

a ranked data frame of interacting proteins, with the precision at each point in the list

References

Stacey RG, Skinnider MA, Scott NE, Foster LJ (2017). “A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE).” BMC Bioinformatics, 18(1), 457.

Scott NE, Brown LM, Kristensen AR, Foster LJ (2015). “Development of a computational framework for the analysis of protein correlation profiling and spatial proteomics experiments.” Journal of Proteomics, 118, 112–129.

Kristensen AR, Gsponer J, Foster LJ (2012). “A high-throughput approach for measuring temporal changes in the interactome.” Nature Methods, 9(9), 907–909.

Skinnider MA, Stacey RG, Foster LJ (2018). “Genomic data integration systematically biases interactome mapping.” PLoS Computational Biology, 14(10), e1006474.

Examples

data(scott)
data(scott_gaussians)
data(gold_standard)
# analyze only the first 100 profiles
subset <- scott[seq_len(500), ]
gauss <- scott_gaussians[names(scott_gaussians) %in% rownames(subset)]
ppi <- PrInCE(subset, gold_standard,
  gaussians = gauss, models = 1,
  cv_folds = 3
)
data(scott)
data(scott_gaussians)
data(gold_standard)
# analyze only the first 100 profiles
subset <- scott[seq_len(500), ]
gauss <- scott_gaussians[names(scott_gaussians) %in% rownames(subset)]
ppi <- PrInCE(subset, gold_standard,
  gaussians = gauss, models = 1,
  cv_folds = 3
)

Replace missing data with median ± random noise

Description

Replace missing data within each numeric column of a data frame with the column median, plus or minus some random noise, in order to train classifiers that do not easily ignore missing data (e.g. random forests or support vector machines).

Usage

replace_missing_data(dat, noise_pct = 0.05)
replace_missing_data(dat, noise_pct = 0.05)

Arguments

`dat`	the data frame to replace missing data in
`noise_pct`	the standard deviation of the random normal distribution from which to draw added noise, expressed as a percentage of the standard deviation of the non-missing values in each column

Value

a data frame with missing values in each numeric column replaced by the column median, plus or minus some random noise

Cytoplasmic interactome of Jurkat T cells during apoptosis

Description

Co-elution profiles derived from size exclusion chromatography (SEC) of cytoplasmic fractions from Jurkat T cells, 4 hours following Fas stimulation.

Usage

data(scott)
data(scott)

Format

a data frame with 1560 rows and 55 columns, with proteins in rows and SEC fractions in columns

Details

Protein quantitation was accomplished by SILAC (stable isotopic labelling by amino acids in cell culture), and is ratiometric, i.e., it reflects the ratio between the intensity of the heavy isotope and the light isotope ("H/L"). The dataset was initially described in Scott et al., Mol. Syst. Biol. 2017. The heavy isotope channel from replicate 1 is included in the PrInCE package. The R script used to generate this matrix from the supplementary materials of the paper is provided in the data-raw directory of the package source code.

Source

http://msb.embopress.org/content/13/1/906

Fitted Gaussian mixture models for the `scott` dataset

Description

The scott dataset consists of protein co-migration profiles derived from size exclusion chromatography (SEC) of cytoplasmic fractions from Jurkat T cells, 4 hours following Fas stimulation. The scott_gaussians object contains Gaussian mixture models fit by the function build_gaussians; this is bundled with the R package in order to expedite the demonstration code, as the process of Gaussian fitting is one of the more time-consuming aspects of the package.

Usage

data(scott_gaussians)
data(scott_gaussians)

Format

a named list with 970 entries; names are proteins, and list items conain information about fitted Gaussians in the format that PrInCE expects

Details

As with the scott dataset, the code used to generate this data object is provided in the data-raw directory of the package source.

Threshold interactions at a given precision cutoff

Description

Threshold interactions at a given precision cutoff

Usage

threshold_precision(interactions, threshold)
threshold_precision(interactions, threshold)

Arguments

`interactions`	the ranked list of interactions output by `predict_interactions`, including a `precision` column
`threshold`	the minimum precision of the unweighted interaction network to return

Value

the subset of the original ranked list at the given precision

Examples

data(scott)
data(scott_gaussians)
data(gold_standard)
# analyze only the first 100 profiles
subset <- scott[seq_len(500), ]
gauss <- scott_gaussians[names(scott_gaussians) %in% rownames(subset)]
ppi <- PrInCE(subset, gold_standard,
  gaussians = gauss, models = 1,
  cv_folds = 3
)
network <- threshold_precision(ppi, threshold = 0.5)
nrow(network)
data(scott)
data(scott_gaussians)
data(gold_standard)
# analyze only the first 100 profiles
subset <- scott[seq_len(500), ]
gauss <- scott_gaussians[names(scott_gaussians) %in% rownames(subset)]
ppi <- PrInCE(subset, gold_standard,
  gaussians = gauss, models = 1,
  cv_folds = 3
)
network <- threshold_precision(ppi, threshold = 0.5)
nrow(network)

Package 'PrInCE'

Help Index

Create an adjacency matrix from a data frame

Description

Usage

Arguments

Value

Examples

Create an adjacency matrix from a list of complexes

Description

Usage

Arguments

Value

Examples

Model selection for Gaussian mixture models

Description

Usage

Arguments

Value

Deconvolve profiles into Gaussian mixture models

Description

Usage

Arguments

Value

Examples

Calculate the autocorrelation for each protein between a pair of co-elution experiments.

Description

Usage

Arguments

Details

Value

Calculate the default features used to predict interactions in PrInCE

Description

Usage

Arguments

Value

Calculate precision at each point in a sequence

Description

Usage

Arguments

Value

Examples

Check the format of a list of Gaussians

Description

Usage

Arguments

Details

Value

Examples

Fit a Gaussian mixture model to a co-elution profile

Description

Usage

Arguments

Value

Examples

Preprocess a co-elution profile

Description

Usage

Arguments

Value

Examples

Preprocess a co-elution profile matrix

Description

Usage

Arguments

Value

Examples

Calculate the co-apex score for every protein pair

Description

Usage

Arguments

Value

Examples

Combine features across multiple replicates

Description

Usage

Arguments

Value

Detect significantly interacting complexes in a chromatogram matrix

Description

Fitted Gaussian mixture models for the `kristensen` dataset