Title: | A Hybrid Method Integrating Empirical Bayes and Reproducibility-Optimized Statistics for Robust Analysis of Proteomics and Metabolomics Data |
---|---|
Description: | Differential expression analysis is a prevalent method utilised in the examination of diverse biological data. The reproducibility-optimized test statistic (ROTS) modifies a t-statistic based on the data's intrinsic characteristics and ranks features according to their statistical significance for differential expression between two or more groups (f-statistic). Focussing on proteomics and metabolomics, the current ROTS implementation cannot account for technical or biological covariates such as MS batches or gender differences among the samples. Consequently, we developed LimROTS, which employs a reproducibility-optimized test statistic utilising the limma methodology to simulate complex experimental designs. LimROTS is a hybrid method integrating empirical bayes and reproducibility-optimized statistics for robust analysis of proteomics and metabolomics data. |
Authors: | Ali Mostafa Anwar [aut, cre] , Leo Lahti [aut, ths] , Akewak Jeba [aut, ctb] , Eleanor Coffey [aut, ths] |
Maintainer: | Ali Mostafa Anwar <[email protected]> |
License: | Artistic-2.0 |
Version: | 0.99.12 |
Built: | 2024-12-20 03:33:31 UTC |
Source: | https://github.com/bioc/LimROTS |
Parallel processing handling function
Boot_parallel( BPPARAM = NULL, samples, data, formula.str, group.name, groups, meta.info, a1, a2, pSamples )
Boot_parallel( BPPARAM = NULL, samples, data, formula.str, group.name, groups, meta.info, a1, a2, pSamples )
BPPARAM |
A parallel BPPARAM object for distributed computation. |
samples |
bootstrapped samples matrix |
data |
A |
formula.str |
A formula string used when covariates are present in meta. info for modeling. It should include "~ 0 + ..." to exclude the intercept from the model. |
group.name |
A string specifying the column in |
groups |
groups information from |
meta.info |
A data frame containing sample-level metadata, where each
row corresponds to a sample. It should include the grouping variable
specified in |
a1 |
Optional numeric value used in the optimization process. If defined by the user, no optimization occurs. |
a2 |
Optional numeric value used in the optimization process. If defined by the user, no optimization occurs. |
pSamples |
a permutated list of samples |
A list containing: D, S, pD, pS
for bootstrapped data and
for permuted data.
This function generates bootstrap samples from the input metadata. It samples with replacement within each group defined in the metadata, and optionally adjusts for paired groups.
bootstrapS(niter, meta.info, group.name)
bootstrapS(niter, meta.info, group.name)
niter |
Integer. The number of bootstrap samples to generate. |
meta.info |
Data frame. Metadata containing sample information, where each row corresponds to a sample. |
group.name |
Character. The name of the column in |
The function works by resampling the row names of the metadata for each group separately.
A matrix of dimension niter
x n
, where n
is the
number of samples. Each row corresponds to a bootstrap sample, and each
entry is a resampled row name from the metadata.
This function generates stratified bootstrap samples based on the groupings and additional factors in the metadata. The function ensures that samples are drawn proportionally based on strata defined by the interaction of factor columns in the metadata.
bootstrapSamples_limRots(niter, meta.info, group.name)
bootstrapSamples_limRots(niter, meta.info, group.name)
niter |
Integer. The number of bootstrap samples to generate. |
meta.info |
Data frame. Metadata containing sample information,
where each row corresponds to a sample. Factor columns in |
group.name |
Character. The name of the column in |
The function works by first identifying the factors in the meta.info
data
frame that are used to create strata for sampling. Within each group defined
by group.name
, the function samples according to the strata proportions,
ensuring that samples are drawn from the correct groups and strata in a
proportional manner.
A matrix of dimension niter
x n
, where n
is the
number of samples. Each row corresponds to a bootstrap sample, and each
entry is a resampled row name from the metadata, stratified by group and
additional factors.
This function calculates the false discovery rate (FDR) by comparing observed values to permuted values.The function sorts observed values, compares them against permuted data, and computes FDR using the median of permutation results.
calculateFalseDiscoveryRate(observedValues, permutedValues)
calculateFalseDiscoveryRate(observedValues, permutedValues)
observedValues |
Numeric vector. The observed test statistics or values to be evaluated for significance. |
permutedValues |
Numeric matrix. The permuted test statistics or values,
with rows corresponding to the same values as in |
A numeric vector of the same length as observedValues
, containing
the estimated FDR for each observed value.
This function calculates the overlap between observed and permuted data for two sets of comparisons. It computes the ratio of overlap between pairs of vectors (res1/res2 and pres1/pres2) after sorting the values.
calOverlaps(D, S, pD, pS, nrow, N, N_len, ssq, niter, overlaps, overlaps_P)
calOverlaps(D, S, pD, pS, nrow, N, N_len, ssq, niter, overlaps, overlaps_P)
D |
Numeric vector. Observed data values (e.g., differences). |
S |
Numeric vector. Standard errors or related values associated with the observed data. |
pD |
Numeric vector. Permuted data values (e.g., differences). |
pS |
Numeric vector. Standard errors or related values associated with the permuted data. |
nrow |
Integer. Number of rows in each block of data. |
N |
Integer vector. Number of top values to consider for overlap calculation. |
N_len |
Integer. Length of the |
ssq |
Numeric. A small constant added to standard errors for stability. |
niter |
Integer. Number of bootstrap samples or resampling iterations. |
overlaps |
Numeric matrix. Matrix to store overlap results for observed data. |
overlaps_P |
Numeric matrix. Matrix to store overlap results for permuted data. |
The function calculates overlaps for two sets of comparisons: one for
observed data (res1/res2) and one for permuted data (pres1/pres2).For each
bootstrap sample, the function orders the two vectors being compared, then
calculates the proportion of overlap for the top N
values.
A list containing two matrices: overlaps
for observed data and
overlaps_P
for permuted data.
This function computes the overlap between two sets of observed and permuted 'values for single-label replicates (SLR). It calculates the proportion of overlap between pairs of vectors (res1/res2 and pres1/pres2) after sorting them.
calOverlaps_slr(D, pD, nrow, N, N_len, niter, overlaps, overlaps_P)
calOverlaps_slr(D, pD, nrow, N, N_len, niter, overlaps, overlaps_P)
D |
Numeric vector. Observed data values (e.g., differences). |
pD |
Numeric vector. Permuted data values. |
nrow |
Integer. Number of rows in each block of data. |
N |
Integer vector. Number of top values to consider for overlap calculation. |
N_len |
Integer. Length of the |
niter |
Integer. Number of bootstrap samples or resampling iterations. |
overlaps |
Numeric matrix. Matrix to store overlap results for observed data. |
overlaps_P |
Numeric matrix. Matrix to store overlap results for permuted data. |
The function calculates the overlap for two sets of comparisons: one for
observed data (res1
/res2
) and one for permuted data (pres1
/pres2
).
For each bootstrap sample, the function orders the two vectors being
compared, then computes the proportion of overlap for the top N
values.
A list containing two matrices: overlaps
for observed data and
overlaps_P
for permuted data.
Check if meta info is correct
Check_meta_info(meta.info, data, log)
Check_meta_info(meta.info, data, log)
meta.info |
Data frame. Metadata associated with the samples
(columns of |
data |
A matrix-like object or a |
log |
Logical, indicating whether the data is already log-transformed. Default is TRUE. |
Logical
Check if SummarizedExperiment or data is correct
Check_SummarizedExperiment(data.exp, meta.info, group.name)
Check_SummarizedExperiment(data.exp, meta.info, group.name)
data.exp |
A matrix-like object or a |
meta.info |
Data frame. Metadata associated with the samples
(columns of |
group.name |
Character. Column name in |
a list of data
, groups
and meta.info
This helper function compares observed values against permuted values and counts the number of permuted values that are greater than or equal to each observed value.
countLargerThan(observedVec, permutedVec)
countLargerThan(observedVec, permutedVec)
observedVec |
Numeric vector. The observed values. |
permutedVec |
Numeric vector. The permuted values to compare against the observed values. |
A numeric vector containing the counts of permuted values greater than or equal to the corresponding observed values.
This function performs linear modeling using the Limma package while
accounting for covariates specified
in the meta.info
. It supports two-group comparisons and multi-group
analysis, incorporating covariates
through a design matrix.
Limma_bootstrap(x, group.name, meta.info, formula.str)
Limma_bootstrap(x, group.name, meta.info, formula.str)
x |
A list containing two or more data matrices where rows represent features (e.g., genes, proteins) and columns represent samples. The list should contain at least two matrices for pairwise group comparison. |
group.name |
A character string indicating the name of the group
variable in |
meta.info |
A data frame containing the metadata for the samples. This includes sample grouping and any covariates to be included in the model. |
formula.str |
A string specifying the formula to be used in model
fitting. It should follow the standard R formula syntax
(e.g., |
This function first combines the data matrices from different groups and
prepares a design matrix based on the covariates specified in meta.info
using the provided formula. It fits a linear model using Limma,
computes contrasts between groups, and applies empirical Bayes moderation.
For two-group comparisons, the function returns log-fold changes and
associated statistics. In multi-group settings with a single covariate,
it calculates pairwise contrasts and moderated F-statistics.
A list containing the following elements:
d |
A vector of the test statistics (log-fold changes or F-statistics) for each feature. |
s |
A vector of the standard deviations for each feature, adjusted by the empirical Bayes procedure. |
lmFit
, eBayes
,
topTable
,
makeContrasts
This function performs linear modeling using the Limma package, incorporating covariates in the model fitting process. It is designed to handle both two-group comparisons and multi-group settings with covariates.
Limma_fit(x, group.name, meta.info, formula.str, trend, robust)
Limma_fit(x, group.name, meta.info, formula.str, trend, robust)
x |
A list containing two or more data matrices where rows represent features (e.g., genes, proteins) and columns represent samples. The list should contain at least two matrices for pairwise group comparison. |
group.name |
A character string indicating the name of the group
variable in |
meta.info |
A data frame containing the metadata for the samples. This includes sample grouping and any covariates to be included in the model. |
formula.str |
A string specifying the formula to be used in model
fitting. It should follow the standard R formula syntax
(e.g., |
trend |
A logical value indicating whether to allow for an intensity-dependent trend in the prior variance. |
robust |
A logical value indicating whether to use a robust fitting procedure to protect against outliers. |
This function combines the data matrices from different groups and fits a
linear model using covariates provided in the meta.info
. For two-group
comparisons, the function computes contrasts between the two groups and
applies empirical Bayes moderation. For multi-group analysis with a single
covariate, pairwise contrasts are computed, and the moderated F-statistic is
calculated for each feature.
A list containing the following elements:
d |
A vector of the test statistics (log-fold changes or F-statistics) for each feature. |
s |
A vector of the standard deviations for each feature, adjusted by t he empirical Bayes procedure. |
corrected.logfc |
The log-fold changes for each feature after fitting the model. |
lmFit
,
eBayes
,
topTable
,
makeContrasts
This function performs linear modeling using the Limma package with permutation of the covariates to evaluate the test statistics under random assignments. It handles two-group comparisons and multi-group settings.
Limma_permutating(x, group.name, meta.info, formula.str)
Limma_permutating(x, group.name, meta.info, formula.str)
x |
A data matrices where rows represent features (e.g., genes, proteins) and columns represent samples. The list should contain at least two matrices for pairwise group comparison. |
group.name |
A character string indicating the name of the group
variable in |
meta.info |
A data frame containing the metadata for the samples. This includes sample grouping and any covariates to be included in the model. |
formula.str |
A string specifying the formula to be used in model
fitting. It should follow the standard R formula syntax
(e.g., |
This function combines the data matrices from different groups and permutes
the covariates from meta.info
before fitting a linear model using Limma.
Permutation helps assess how the covariates behave under random conditions,
providing a null distribution of the test statistics. For two-group
comparisons, the function computes contrasts between the two groups and
applies empirical Bayes moderation. For multi-group analysis with a single
covariate, pairwise contrasts are computed, and the moderated F-statistic is
calculated for each feature.
A list containing the following elements:
d |
A vector of the test statistics (log-fold changes or F-statistics) for each feature. |
s |
A vector of the standard deviations for each feature, adjusted by the empirical Bayes procedure. |
lmFit
, eBayes
,
topTable
,
makeContrasts
LimROTS
: A Hybrid Method Integrating Empirical Bayes and
Reproducibility-Optimized Statistics for Robust Analysis of Proteomics and
Metabolomics DataLimROTS
: A Hybrid Method Integrating Empirical Bayes and
Reproducibility-Optimized Statistics for Robust Analysis of Proteomics and
Metabolomics Data
LimROTS( x, niter = 1000, K = NULL, a1 = NULL, a2 = NULL, log = TRUE, verbose = TRUE, meta.info, BPPARAM = NULL, group.name, formula.str, robust = TRUE, trend = TRUE, permutating.group = FALSE )
LimROTS( x, niter = 1000, K = NULL, a1 = NULL, a2 = NULL, log = TRUE, verbose = TRUE, meta.info, BPPARAM = NULL, group.name, formula.str, robust = TRUE, trend = TRUE, permutating.group = FALSE )
x |
A |
niter |
An integer representing the amount of bootstrap iterations. Default is 1000. |
K |
An optional integer representing the top list size for ranking. If not specified, it is set to one-fourth of the number of features. |
a1 |
Optional numeric value used in the optimization process. If defined by the user, no optimization occurs. |
a2 |
Optional numeric value used in the optimization process. If defined by the user, no optimization occurs. |
log |
Logical, indicating whether the data is already log-transformed.
Default is |
verbose |
Logical, indicating whether to display messages during the
function's execution. Default is |
meta.info |
a character vector of the metadata needed for the
model to run and can be retrieved using |
BPPARAM |
A |
group.name |
A string specifying the column in |
formula.str |
A formula string for modeling.
It should include "~ 0 + ..." to exclude the intercept from the model.
All the model parameters must be present in |
robust |
indicating whether robust fitting should be used. Default is TRUE, see eBayes. |
trend |
indicating whether to include trend fitting in the differential expression analysis. Default is TRUE. see eBayes. |
permutating.group |
Logical, If |
The LimROTS approach initially uses
limma package functionality to simulate the intensity data of
proteins and
metabolites. A linear model is subsequently fitted using the design matrix.
Empirical Bayes variance shrinking is then implemented. To obtain the
moderated t-statistics, the adjusted standard error
for each feature is computed, along with the regression
coefficient for each feature (indicating the impact of variations in the
experimental settings). Then, by adapting a reproducibility-optimized
technique known as ROTS to establish an optimality
based on the largest overlap of top-ranked features within group-preserving
bootstrap datasets, Finally based on the optimized parameters
and
this equation used to calculates the final statistics:
where
is the final statistics for each feature,
is the coefficient, and
is the the adjusted
standard error. LimROTS generates p-values from permutation samples
using the implementation available in
qvalue package, along with internal implementation of FDR
adapted from ROTS package. Additionally, the qvalue package is used
to calculate q-values, were the proportion of true null p-values is
set to the bootstrap method pi0est. We recommend using
permutation-derived p-values and qvalues.
This function processes a dataset using parallel computation. It leverages the BiocParallel framework to distribute tasks across multiple workers, which can significantly reduce runtime for large datasets.
An object of class "SummarizedExperiment"
with the
following elements:
data |
The original data matrix. |
niter |
The number of bootstrap samples used. |
statistics |
The optimized statistics for each feature. |
logfc |
Log-fold change values between groups. |
pvalue |
P-values computed based on the permutation samples. |
FDR |
False discovery rate estimates. |
a1 |
Optimized parameter used in differential expression ranking. |
a2 |
Optimized parameter used in differential expression ranking. |
k |
Top list size used for ranking. |
corrected.logfc |
estimate of the log2-fold-change corresponding to the effect corrected by the s model see topTable. |
q_values |
Estimated q-values using the |
BH.pvalue |
Benjamini-Hochberg adjusted p-values. |
null.statistics |
The optimized null statistics for each feature. |
Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47
Suomi T, Seyednasrollah F, Jaakkola M, Faux T, Elo L (2017). “ROTS: An R package for reproducibility-optimized statistical testing. ” PLoS computational biology, 13(5), e1005562. doi:10.1371/journal.pcbi.1005562 https://doi.org/10.1371/journal.pcbi.1005562, http://www.ncbi.nlm.nih.gov/pubmed/28542205
Elo LL, Filen S, Lahesmaa R, Aittokallio T. Reproducibility-optimized test statistic for ranking genes in microarray studies. IEEE/ACM Trans Comput Biol Bioinform. 2008;5(3):423-431. doi:10.1109/tcbb.2007.1078
# Example usage: data <- data.frame(matrix(rnorm(500), nrow = 100, ncol = 10)) # Simulated data meta.info <- data.frame( group = factor(rep(1:2, each = 5)), row.names = colnames(data) ) formula.str <- "~ 0 + group" result <- LimROTS(data, meta.info = meta.info, group.name = "group", formula.str = formula.str, niter = 10 )
# Example usage: data <- data.frame(matrix(rnorm(500), nrow = 100, ncol = 10)) # Simulated data meta.info <- data.frame( group = factor(rep(1:2, each = 5)), row.names = colnames(data) ) formula.str <- "~ 0 + group" result <- LimROTS(data, meta.info = meta.info, group.name = "group", formula.str = formula.str, niter = 10 )
This function optimizes parameters by calculating overlaps between observed
and permuted data for multiple values of a smoothing constant (ssq
) and a
single-label replicate (SLR) comparison.
Optimizing(niter, ssq, N, D, S, pD, pS, verbose)
Optimizing(niter, ssq, N, D, S, pD, pS, verbose)
niter |
Integer. Number of bootstrap samples or resampling iterations. |
ssq |
Numeric vector. Smoothing constants to be evaluated. |
N |
Integer vector. Number of top values to consider for overlap calculation. |
D |
Numeric matrix. Observed data values. |
S |
Numeric matrix. Standard errors or related values for observed data. |
pD |
Numeric matrix. Permuted data values. |
pS |
Numeric matrix. Standard errors or related values for permuted data. |
verbose |
Logical. If |
The function calculates overlaps for a range of smoothing constants and
identifies the optimal set of parameters by maximizing a z-score-based
metric, which compares the overlap of observed data to permuted data.
It computes overlap matrices for both observed (D
and S
) and permuted
(pD
and pS
) data and returns the optimal parameters based on the
highest z-score.
A list containing the optimal parameters:
a1
: Optimal smoothing constant or 1 for SLR.
a2
: SLR flag (1 if smoothing constant is optimal,
0 if SLR is optimal).
k
: Optimal number of top values to consider for overlap.
R
: Optimal overlap value.
Z
: Optimal z-score.
ztable
: Matrix of z-scores for all evaluated parameters.
This function performs a series of checks and initial setups for input data, metadata, and parameters, ensuring everything is correctly formatted for downstream analysis.
SanityChecK( x, niter = 1000, K = NULL, meta.info, group.name, formula.str, verbose = TRUE, log = TRUE )
SanityChecK( x, niter = 1000, K = NULL, meta.info, group.name, formula.str, verbose = TRUE, log = TRUE )
x |
A matrix-like object or a |
niter |
Integer. Number of bootstrap samples or resampling iterations. Default is 1000. |
K |
Integer. Top list size. If NULL, it will be set to a quarter of the number of rows in the data matrix. Default is NULL. |
meta.info |
Data frame. Metadata associated with the samples
(columns of |
group.name |
Character. Column name in |
formula.str |
Optional character string representing the formula for the model. |
verbose |
Logical, indicating whether to display messages during the function's execution. Default is TRUE. |
log |
Logical, indicating whether the data is already log-transformed. Default is TRUE. |
This function checks whether the input data and metadata are in the correct
format, processes metadata from a SummarizedExperiment
object if provided,
and ensures that group information is correctly specified. If no top list
size (K
) is provided, it defaults to a quarter of the number of rows in
the data.
A list containing:
meta.info
: Processed metadata.
data
: Processed data matrix.
groups
: Numeric or factor vector indicating group assignments.
K
: Top list size to be used in the analysis.
A
SummarizedExperiment
object containing DIA proteomics data from a UPS1-spiked E. coli proteins,
processed using both Spectronaut and ScaffoldDIA separately, then merged.
An instance of the
SummarizedExperiment
class with the following assays:
This assay includes log2 protein intensities calculated by averaging the peptides derived from the same protein
The object also contains colData and rowData:
A DataFrame with metadata for samples.
A DataFrame with metadata for proteins.
The colData
contains the following columns:
Unique identifier for each sample.
Experimental condition or group for each sample , representing different conc. of UPS1-spiked proteins.
software tool used, Spectronaut or ScaffoldDIA.
A fake digestion batch.
Generated with both spectronaut and ScaffoldDIA separately. using a mixed mode acquisition method and FASTA mode for demonstration purposes.
Gotti, C., Roux-Dalvai, F., Joly-Beauparlant, C., Mangnier, L., Leclercq, M., & Droit, A. (2022). DIA proteomics data from a UPS1-spiked E.coli protein mixture processed with six software tools. In Data in Brief (Vol. 41, p. 107829). Elsevier BV. https://doi.org/10.1016/j.dib.2022.107829