Package 'RolDE' reference manual

Title:	RolDE: Robust longitudinal Differential Expression
Description:	RolDE detects longitudinal differential expression between two conditions in noisy high-throughput data. It is suitable even for data with a moderate amount of missing values. RolDE is a composite method, consisting of three independent modules with different approaches to detecting longitudinal differential expression. The combination of these diverse modules allows RolDE to robustly detect varying differences in longitudinal trends and expression levels in diverse data types and experimental settings.
Authors:	Tommi Valikangas [aut], Medical Bioinformatics Centre [cre]
Maintainer:	Medical Bioinformatics Centre <[email protected]>
License:	GPL-3
Version:	1.11.0
Built:	2025-03-30 06:28:31 UTC
Source:	https://github.com/bioc/RolDE

A proteomics dataset with random protein expression values - no missing values.

Description

A longitudinal proteomics dataset with five timepoints, two conditions and three replicates for each sample at each timepoint in each condition. The expression values for the random data have been generated using the rnorm function. The values for each protein (row) have been drawn from a normal distribution with a mean of 22.5 and a standard deviation of 1.5 corresponding to random protein expression values.

Usage

data1

Format

A matrix with 1045 rows and 30 variables.
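
A matrix of the same shape can be sketched in base R as shown below. This is only an illustration under stated assumptions, not the script that produced data1: the seed and the row and column names are hypothetical, and the sample layout (2 conditions x 5 timepoints x 3 replicates) is inferred from the description above.

```r
# Sketch: a random expression matrix shaped like data1.
# Assumed layout: 2 conditions x 5 timepoints x 3 replicates = 30 samples.
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

n_proteins <- 1045
n_samples  <- 30

# Each row (protein) is drawn from a normal distribution with
# mean 22.5 and standard deviation 1.5.
sim_data <- t(replicate(n_proteins, rnorm(n_samples, mean = 22.5, sd = 1.5)))

# Hypothetical identifiers; RolDE requires unique row and column names.
rownames(sim_data) <- paste0("Protein", seq_len(n_proteins))
colnames(sim_data) <- paste0("Sample", seq_len(n_samples))

dim(sim_data)
```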

A small dataset with 50 proteins and 30 samples for the example usage of RolDE

Description

Based on the included semi-simulated UPS1 spike-in dataset (data3) with 50 randomly chosen proteins from the data.

Usage

data2

Format

A matrix with 50 rows and 30 variables.

Source

https://www.ebi.ac.uk/pride/archive/projects/PXD002099

A semi-simulated UPS1 spike-in dataset with differences in longitudinal expression for the spike-in proteins - no missing values.

Description

A longitudinal proteomics dataset with five timepoints, two conditions and three replicates for each sample at each timepoint in each condition. The expression values for each protein in each sample have been generated by using the rnorm function and the means and standard deviations of the corresponding proteins and samples in the original experimental UPS1 spike-in data. In this manner, the expression for each protein in each sample could be replicated with some random variation multiple times when necessary. The pattern of missing values was directly copied from the used samples in the original UPS1 spike-in dataset. All proteins with missing values have been filtered out. Linear-like trends for the spike-in proteins for Condition 1 were generated using the 4, 4, 10, 25 and 50 fmol samples of the original UPS1 spike-in data. Linear-like trends for the spike-in proteins for Condition 2 were generated using the 2, 4, 4, 10 and 25 fmol samples of the original UPS1 spike-in data. For more information about the generation of the semi-simulated spike-in datasets, see the original RolDE publication. In the spike-in datasets, the UPS1 spike-in proteins are expected to differ between conditions, while the expression of the rest of the proteins (the background proteins) is expected to remain stable between the conditions, excluding experimental noise.

Usage

data3

Format

A matrix with 1033 rows and 30 variables.

Source

https://www.ebi.ac.uk/pride/archive/projects/PXD002099

A RolDE design matrix for data1

Description

A design matrix to be used with data1 for the RolDE-function. Column 1 contains sample names, column 2 the condition information for each sample, column 3 indicates the timepoint for each sample, and column 4 gives the Replicate or Individual each sample is coming from.

Usage

des_matrix1

Format

A matrix with 30 rows and 4 variables.
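
The four-column layout described above can be sketched in base R as follows. The sample names, condition labels, column names and replicate numbering here are hypothetical and are not taken from des_matrix1; the sketch only assumes the documented structure of 30 rows and four character columns, with a distinct number for every replicate (individual).

```r
# Sketch: a 30 x 4 character design matrix in the documented layout.
# All labels below are hypothetical placeholders.
conditions <- rep(c("Condition1", "Condition2"), each = 15)
timepoints <- rep(rep(1:5, each = 3), times = 2)

# Replicates 1-3 belong to condition 1 and 4-6 to condition 2, so that
# every replicate (individual) has a distinct number.
replicates <- c(rep(1:3, times = 5), rep(4:6, times = 5))

sim_design <- cbind(
  paste0("Sample", 1:30),     # column 1: sample names (column names of the data)
  conditions,                 # column 2: condition of each sample
  as.character(timepoints),   # column 3: timepoint of each sample
  as.character(replicates)    # column 4: replicate / individual of each sample
)
colnames(sim_design) <- c("Sample Names", "Condition", "Timepoint", "Replicate")

head(sim_design)
```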

A RolDE design matrix for data2

Description

A design matrix to be used with data2 for the RolDE-function. Column 1 contains sample names, column 2 the condition information for each sample, column 3 indicates the timepoint for each sample, and column 4 gives the Replicate or Individual each sample is coming from.

Usage

des_matrix2

Format

A matrix with 30 rows and 4 variables.

A RolDE design matrix for data3

Description

A design matrix to be used with data3 for the RolDE-function. Column 1 contains sample names, column 2 the condition information for each sample, column 3 indicates the timepoint for each sample, and column 4 gives the Replicate or Individual each sample is coming from.

Usage

des_matrix3

Format

A matrix with 30 rows and 4 variables.

Plot RolDE results

Description

Plot the findings from longitudinal differential expression analysis with RolDE.

Usage

plotFindings(file_name = NULL, RolDE_res, top_n, col1 = "blue", col2 = "red")

Arguments

`file_name`	a string indicating the file name in which the results should be plotted. Should have a ".pdf" extension. Default is NULL, no file is created.
`RolDE_res`	the RolDE result object.
`top_n`	an integer or a vector of integers indicating which top differentially expressed features should be plotted. If `top_n` is a single number, the feature ranked `top_n` by differential expression will be plotted (e.g. `top_n`=1 will plot the most differentially expressed feature). If `top_n` is a vector of numbers, the differentially expressed features corresponding to top detections within the given range will be plotted (e.g. `top_n`=seq(1:50) will plot the top 50 differentially expressed features). If more than one feature will be plotted, it is advisable to define a suitable file name in `file_name`.
`col1`	a string indicating which color should be used for Individuals / Replicates in condition 1. The default is blue.
`col2`	a string indicating which color should be used for Individuals / Replicates in condition 2. The default is red.

Details

The function plots the longitudinal expression of the top RolDE findings. The function can plot either the expression of a single finding or multiple top findings, as indicated by top_n. The findings can be plotted into a pdf file, as indicated by file_name. The given file_name should have a ".pdf" extension. If the plotted feature has missing values, a mean value over the feature values will be imputed for visualization purposes. The missing / imputed value will be indicated with an empty circle symbol.

Value

plotFindings Plots the results from the RolDE object.

Examples

data("res3")
#Plotting the most DE finding. DE results are in the res3 object.
plotFindings(file_name = NULL, RolDE_res = res3, top_n = 1)

RolDE results for data1

Description

RolDE results of data1 to be used for generating documentation only.

Usage

res1

Format

An object of class list of length 10.

RolDE results for data3

Description

RolDE results of data3 to be used for generating documentation only.

Usage

res3

Format

An object of class list of length 10.

Robust longitudinal Differential Expression

Description

Detects longitudinal differential expression between two conditions (or groups) in time point aligned data or in data with non-aligned time points. A rank product from the results of three independent modules, RegROTS, DiffROTS and PolyReg, is determined to indicate the strength of differential expression of features between the conditions / groups. RolDE tolerates a fair amount of missing values and is especially suitable for noisy proteomics data.

Usage

RolDE(
        data,
        des_matrix = NULL,
        aligned = TRUE,
        n_cores = 1,
        model_type = "auto",
        sigValSampN = 5e+05,
        sig_adj_meth = "fdr"
)

Arguments

`data`	the preprocessed normalized data as a numerical matrix or as a SummarizedExperiment instance. Features (rows) and variables (columns) of the data must have unique identifiers. If `data` is a SummarizedExperiment object, the design matrix must be included in the `colData` argument of the `data` object.
`des_matrix`	the design matrix for the `data`. Rows correspond to columns of the `data`. Must contain four character columns (see included example design matrices `des_matrix1` or `des_matrix3`). First column should contain sample (column) names of the `data`. Second column should indicate condition status (for each sample), for which longitudinal differential expression is to be examined. Third column should indicate time point (time point aligned data) or time value (non-aligned time point data) for each sample. Fourth column should provide the replicate (individual) information for each sample as a numerical value. Each replicate or individual should have a distinct number. If `data` is a SummarizedExperiment object, the design matrix must be included in the `colData` argument of the `data` object.
`aligned`	logical; are the time points in different conditions and replicates (individuals) in the `data` aligned (fixed)? In aligned time point data, the time points should be the same for each replicate (individual).
`n_cores`	a positive integer. The number of threads used for parallel computing. If set to 1 (the default), no parallel computing is used.
`model_type`	a string indicating the type of regression to be used for the PolyReg module and the maximum level for which random effects should be allowed in the case of mixed models. Default "auto" for automatic selection.
`sigValSampN`	a positive integer indicating the number of permutations for significance value calculations. The overall used number will be `sigValSampN` * the number of rows in the `data`. Or set `sigValSampN` to 0 to turn significance calculations off. Should be > 100000.
`sig_adj_meth`	the multiple hypothesis testing correction method for the estimated significance values. Only relevant if `sigValSampN` is not 0.

Details

RolDE is a composite method, consisting of three independent modules with different approaches to detecting longitudinal differential expression. The combination of these diverse modules allows RolDE to robustly detect varying differences in longitudinal trends and expression levels in diverse data types and experimental settings.

The *RegROTS* module merges the power of regression modelling with the power of the established differential expression method Reproducibility Optimized Test Statistic (ROTS) (Elo et al., Suomi et al.). A polynomial regression model of protein expression over time is fitted separately for each replicate (individual) in each condition. Differential expression between two replicates (individuals) in different conditions is examined by comparing the coefficients of the replicate-specific regression models. If all coefficient differences are zero, no longitudinal differential expression between the two replicates (individuals) in different conditions exists. For a thorough exploration of differential expression between the conditions, all possible combinations of replicates (individuals) in different conditions are examined.
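
The coefficient comparison described for RegROTS can be illustrated for a single feature and a single pair of replicates with base R. This is a sketch of the idea only and not RolDE's internal code: the expression values are made up, and the polynomial degree and the raw polynomial basis are assumptions.

```r
# Sketch: compare replicate-specific polynomial regression coefficients
# for ONE feature and ONE pair of replicates from different conditions.
time <- 1:5
expr_rep_cond1 <- c(20.1, 20.9, 22.2, 23.0, 24.1)  # made-up values, rising trend
expr_rep_cond2 <- c(20.0, 20.2, 19.9, 20.1, 20.0)  # made-up values, flat trend

degree <- 2  # hypothetical polynomial degree

# One regression model of expression over time per replicate.
fit1 <- lm(expr_rep_cond1 ~ poly(time, degree, raw = TRUE))
fit2 <- lm(expr_rep_cond2 ~ poly(time, degree, raw = TRUE))

# Differences of the intercept and the polynomial coefficients. Differences
# close to zero suggest no longitudinal differential expression for this pair.
coef_diff <- unname(coef(fit1) - coef(fit2))
coef_diff
```

In this toy pair the difference in the linear coefficient is clearly non-zero, reflecting the rising trend in one replicate and the flat trend in the other.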

In the *DiffROTS* module, the expression of replicates (individuals) in different conditions is directly compared at all time points. Again, if the expression level differences at all time points are zero, no differential expression between the examined replicates (individuals) in different conditions exists. Similarly to the RegROTS module, differential expression is examined between all possible combinations of replicates (individuals) in the different conditions. In non-aligned time point data, the expression level differences between the conditions are examined when accounting for time-associated trends of varying complexity in the data. More specifically, the expression level differences between the conditions are examined when adjusting for increasingly complex time-related expression trends of polynomial degrees 0, 1, ..., d, where d is the maximum degree for the polynomial and the same degree as is used for the PolyReg module.

In the *PolyReg* module, polynomial regression modelling is used to detect longitudinal differential expression. Condition is included as a categorical factor within the models. By investigating the condition-related intercept and the polynomial terms at different levels of the condition factor, average differences in expression levels as well as differences in longitudinal expression patterns between the conditions can be examined.

Finally, to conclusively detect any differential expression, the detections from the different modules are combined using the rank product. For more details about the method, see the original RolDE publication (Valikangas et al.).
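
A rank product over three modules can be sketched in base R as below. The module scores are made up, and computing the rank product as the geometric mean of the per-module ranks is an assumption made for illustration; the exact computation used by RolDE is described in the original publication.

```r
# Sketch: combine per-module evidence with a rank product.
# Made-up scores for four features; smaller means stronger evidence.
module_scores <- cbind(
  RegROTS  = c(0.001, 0.20, 0.50, 0.03),
  DiffROTS = c(0.010, 0.30, 0.40, 0.002),
  PolyReg  = c(0.002, 0.10, 0.90, 0.05)
)
rownames(module_scores) <- paste0("Feature", 1:4)

# Rank the features within each module (rank 1 = strongest evidence).
module_ranks <- apply(module_scores, 2, rank)

# Assumed definition: geometric mean of the ranks over the modules.
rank_product <- apply(module_ranks, 1, function(r) prod(r)^(1 / length(r)))

# Features ordered from strongest to weakest combined evidence.
sort(rank_product)
```

A feature that ranks consistently high in all modules obtains a small rank product, while a feature that ranks high in only one module is pulled down by its weaker ranks elsewhere.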

At a bare minimum, the user should provide RolDE the data in a normalized numerical matrix, adjusted for confounding effects if needed, together with a suitable design matrix for the data. If the time points in the data are non-aligned, the user should set the parameter aligned to FALSE. RolDE determines the other parameter values automatically by default. The default values should be suitable for a typical longitudinal differential expression analysis, but the user is given control of many of the parameters for RolDE.

By default, RolDE assumes aligned time points in the data. If the time points in the data are non-aligned, the user should set the parameter aligned to FALSE.

Parallel processing can be enabled by setting the parameter n_cores to a value larger than the default 1 (highly recommended). With parallel processing using multiple threads, the run time for RolDE can be significantly decreased. The parameter n_cores controls the number of threads available for parallel processing.

By default (model_type set to "auto"), RolDE uses fixed effects only regression with a common intercept and slope for the replicates (individuals) when the time points in the data are aligned, and mixed effects models with a random effect for the individual baseline (intercept) when the time points are non-aligned. This applies to the PolyReg module and, in data with non-aligned time points only, to the DiffROTS module. The user can instead choose mixed effects regression modelling when appropriate by setting model_type to "mixed0" for a random effect for the individual baseline, or to "mixed1" for random effects for the individual baseline and slope. Fixed effects only models can be chosen by setting model_type to "fixed". Valid inputs for model_type are "auto" (the default), "mixed0", "mixed1" and "fixed".

If the interest is only in ordering the features based on the strength of longitudinal differential expression between the conditions, sigValSampN can be set to 0 to disable significance value estimation and to reduce the computational time used by RolDE. Otherwise, the parameter sigValSampN indicates how many permutations should be performed when estimating the significance values. A larger value will lead to more accurate estimates but increases the required computational time. The total number of permutations for the significance value estimation will be approximately sigValSampN. The default value used by RolDE is 500 000. The realized number of permutations might be slightly different, depending on the number of features in the data. Using parallel processing greatly decreases the time needed for the significance value calculations. The estimated significance values can be adjusted by any method supported by the p.adjust function in the stats package. Alternatively, q-values as defined by Storey et al. in the Bioconductor package qvalue can be used. Valid values for sig_adj_meth are then: "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none", "qvalue". The default value is "fdr". For more details about RolDE, see the original RolDE publication (Valikangas et al.).
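
The effect of the adjustment methods named above can be previewed with p.adjust from the stats package. The significance values below are made up for illustration and do not come from a RolDE run; the "qvalue" option is not shown because it needs the separate qvalue package.

```r
# Sketch: adjusting a vector of estimated significance values.
p_est <- c(0.001, 0.012, 0.030, 0.200, 0.650)  # made-up significance values

# "fdr" is an alias of "BH" (Benjamini-Hochberg) in p.adjust.
adj_fdr  <- p.adjust(p_est, method = "fdr")

# Bonferroni multiplies each value by the number of tests, capped at 1.
adj_bonf <- p.adjust(p_est, method = "bonferroni")

cbind(p_est, adj_fdr, adj_bonf)
```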

Please use set.seed for reproducibility.

Value

RolDE returns a list with the following components:

RolDE_Results a data frame with the RolDE main results. Contains the RolDE rank product, the estimated significance values (if sigValSampN is not set to 0) and the multiple hypothesis adjusted estimated significance values.

RegROTS_Results a data frame of results for the RegROTS module. RegROTS internal rank products.

RegROTS_P_Values a data frame of significance values for all the RegROTS runs.

DiffROTS_Results a data frame of results for the DiffROTS module. DiffROTS internal rank products.

DiffROTS_P_Values a data frame of the significance values for all the DiffROTS runs.

PolyReg_Results a data frame of results for the PolyReg module. The representative (minimum) condition - related significance values.

PolyReg_P_Values a data frame of all the condition - related significance values for the PolyReg module.

ROTS_Runs a list containing the samples in the different ROTS runs for the RegROTS and DiffROTS (time point aligned data) modules.

Method_Degrees a list containing the used degrees for the RegROTS and the PolyReg (and DiffROTS in non-aligned time point data) modules.

Input a list of all the used inputs for RolDE.

References

Elo LL, Filen S, Lahesmaa R, et al. Reproducibility-optimized test statistic for ranking genes in microarray studies. IEEE/ACM Trans. Comput. Biol. Bioinform. 2008; 5:423-31.

Suomi T, Seyednasrollah F, Jaakkola MK, et al. ROTS: An R package for reproducibility-optimized statistical testing. PLoS Comput. Biol. 2017; 13:5.

Storey JD, Bass AJ, Dabney A, et al. qvalue: Q-value estimation for false discovery rate control. 2019.

Välikangas T, Suomi T, Elo LL, et al. Enhanced longitudinal differential expression detection in proteomics with robust reproducibility optimization regression. bioRxiv 2021.

Examples

#Usage of RolDE in time point aligned data without significance value estimation and 1 core
data("data2")
data("des_matrix2")
set.seed(1) #For reproducibility.
data2.res<-RolDE(data=data2, des_matrix=des_matrix2, n_cores=1, sigValSampN = 0)

Robust longitudinal Differential Expression

Usage

RolDE_Main(
        data,
        des_matrix = NULL,
        aligned = TRUE,
        min_comm_diff = "auto",
        min_feat_obs = 3,
        degree_RegROTS = "auto",
        degree_PolyReg = "auto",
        n_cores = 1,
        model_type = "auto",
        sigValSampN = 5e+05,
        sig_adj_meth = "fdr"
)

Arguments

`data`	the preprocessed normalized data as a numerical matrix or as a SummarizedExperiment instance. Features (rows) and variables (columns) of the data must have unique identifiers. If `data` is a SummarizedExperiment object, the design matrix must be included in the `colData` argument of the `data` object.
`des_matrix`	the design matrix for the `data`. Rows correspond to columns of the `data`. Must contain four character columns (see included example design matrices `des_matrix1` or `des_matrix3`). First column should contain sample (column) names of the `data`. Second column should indicate condition status (for each sample), for which longitudinal differential expression is to be examined. Third column should indicate time point (time point aligned data) or time value (non-aligned time point data) for each sample. Fourth column should provide the replicate (individual) information for each sample as a numerical value. Each replicate or individual should have a distinct number. If `data` is a SummarizedExperiment object, the design matrix must be included in the `colData` argument of the `data` object.
`aligned`	logical; are the time points in different conditions and replicates (individuals) in the `data` aligned? In aligned time point data, the time points should be the same for each replicate (individual).
`min_comm_diff`	a vector of two positive integers or string ("auto"). The minimum number of common time points for the replicates (individuals) in different conditions to be compared (aligned time points) or the number of time points in the common time interval for the replicates (individuals) in the different conditions to be compared (non-aligned time points). The first integer refers to the minimum number of common time points for the RegROTS module (aligned and non-aligned time points) and the second to DiffROTS (aligned time points). Second value needed but not used for DiffROTS when `aligned` is set to `FALSE`.
`min_feat_obs`	a positive integer. The minimum number of non-missing observations a feature must have for a replicate (individual) in a condition to be included in the comparisons for the RegROTS module and the DiffROTS module (aligned time points).
`degree_RegROTS`	a positive integer or string ("auto"). The degree of the polynomials used for the RegROTS module.
`degree_PolyReg`	a positive integer or string ("auto"). The degree of the polynomials used for the PolyReg module.
`n_cores`	a positive integer. The number of threads used for parallel computing. If set to 1 (the default), no parallel computing is used.
`model_type`	a string indicating the type of regression to be used for the PolyReg module and the maximum level for which random effects should be allowed in the case of mixed models. Default "auto" for automatic selection.
`sigValSampN`	a positive integer indicating the number of permutations for significance value calculations. The overall used number will be `sigValSampN` * the number of rows in the `data`. Or set `sigValSampN` to 0 to turn significance calculations off. Should be > 100000.
`sig_adj_meth`	the multiple hypothesis testing correction method for the estimated significance values. Only relevant if `sigValSampN` is not 0.

Details

The *RegROTS* module merges the power of regression modelling with the power of the established differential expression method Reproducibility Optimized Test Statistic (ROTS) (Elo et al., Suomi et al.). A polynomial regression model of protein expression over time is fitted separately for each replicate (individual) in each condition. Differential expression between two replicates (individuals) in different conditions is examined by comparing the coefficients of the replicate-specific regression models. If all coefficient differences are zero, no longitudinal differential expression between the two replicates (individuals) in different conditions exists. For a thorough exploration of differential expression between the conditions, all possible combinations of replicates (individuals) in different conditions are examined.

By default, RolDE assumes aligned time points in the data. If the time points in the data are non-aligned, the user should set the parameter aligned to FALSE.

Parameter min_comm_diff controls how many common time points two replicates (individuals) in different conditions must have to be compared. The first value controls the number of common time points for the RegROTS module, while the second one controls the number of common time points for the DiffROTS module. If min_comm_diff is set to "auto", RolDE will use a value of 3 for the RegROTS module and a value of 1 for the DiffROTS module. Minimum values for the RegROTS and DiffROTS modules are 2 and 1, respectively. In the case of data with non-aligned time points (aligned is set to FALSE), the first value of min_comm_diff controls how many time values (or similar, e.g. age, temperature) both replicates (individuals) in different conditions must have in the common time interval to be compared. The common time interval for two replicates (individuals) r1 and r2 with time values t1 and t2 is defined as [max(min(t1), min(t2)), min(max(t1), max(t2))]. In data with non-aligned time points, a value of >= 1 for DiffROTS (the second value of min_comm_diff) is required but not used. When aligned is FALSE, an overall group comparison over all the replicates (individuals) is performed by the DiffROTS module.
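
The common time interval can be sketched in base R, reading it as the overlap of the two replicates' time ranges. The helper names common_interval and count_in_interval are hypothetical and not part of RolDE, the time values are made up, and treating the interval endpoints as inclusive is an assumption.

```r
# Sketch: common time interval of two replicates with non-aligned time values.
# Both helper functions are hypothetical illustrations, not RolDE functions.
common_interval <- function(t1, t2) {
  c(lower = max(min(t1), min(t2)), upper = min(max(t1), max(t2)))
}

# Number of time values falling inside the interval (endpoints included).
count_in_interval <- function(t, interval) {
  sum(t >= interval[["lower"]] & t <= interval[["upper"]])
}

t1 <- c(1.0, 2.5, 4.0, 6.0, 8.0)  # made-up time values, replicate r1
t2 <- c(2.0, 3.0, 5.5, 7.0, 9.5)  # made-up time values, replicate r2

ci <- common_interval(t1, t2)
n1 <- count_in_interval(t1, ci)
n2 <- count_in_interval(t2, ci)

# With the "auto" value of 3 for RegROTS, both replicates need at least
# three time values inside the common interval to be compared.
comparable <- min(n1, n2) >= 3
```

Here the common interval runs from 2 to 8, each replicate has four time values inside it, and the pair therefore qualifies for comparison under this reading.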

min_feat_obs controls the number of non-missing values a feature must have for a replicate (an individual) in a condition to be compared in the RegROTS module and the DiffROTS module (in data with aligned time points). A feature is required to have at least min_feat_obs non-missing values for both replicates (individuals) in the different conditions to be compared. The default value used by RolDE is 3. If lowered, more missing values are allowed but the analysis may become less accurate. In data with non-aligned time points, a common comparison over all the replicates (individuals) between the conditions is performed in the DiffROTS module and the number of allowed missing values for a feature is controlled internally through other means.
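
The min_feat_obs requirement can be illustrated for one feature and one pair of replicates in base R. The helper enough_obs is hypothetical and the values are made up; the sketch only mirrors the rule stated above and is not RolDE's internal filtering code.

```r
# Sketch: check the min_feat_obs requirement for one feature in a pair of
# replicates from different conditions, each measured at five timepoints.
min_feat_obs <- 3

rep_a <- c(21.3, NA, 22.0, 22.4, NA)  # made-up values, 3 observed
rep_b <- c(20.8, 21.1, NA, NA, NA)    # made-up values, 2 observed

# Hypothetical helper: does a replicate have enough non-missing values?
enough_obs <- function(x, min_obs) sum(!is.na(x)) >= min_obs

# The feature is compared for this pair only if BOTH replicates qualify.
pair_ok <- enough_obs(rep_a, min_feat_obs) && enough_obs(rep_b, min_feat_obs)
pair_ok
```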

The user can control the degree of the polynomials used by the RegROTS and the PolyReg modules via the degree_RegROTS and the degree_PolyReg parameters. If left to "auto", RolDE will by default use degree_RegROTS = max(1, min(floor(median(t)/2), 4)) and degree_PolyReg = max(2, min(median(t) - 1, 5)), where t is a vector of the number of time points/values for all the replicates (individuals).
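
The "auto" degree rules stated above can be evaluated directly in base R. The function name auto_degrees is hypothetical; it only re-expresses the two formulas from this section and is not a RolDE function.

```r
# Sketch: the documented "auto" degree formulas as a small helper.
# t is the number of time points/values of each replicate (individual).
auto_degrees <- function(t) {
  m <- median(t)
  c(
    degree_RegROTS = max(1, min(floor(m / 2), 4)),
    degree_PolyReg = max(2, min(m - 1, 5))
  )
}

# Six replicates with five time points each, as in the example datasets.
auto_degrees(rep(5, 6))

# Fewer or more time points hit the lower and upper bounds of the rules.
auto_degrees(rep(3, 6))
auto_degrees(rep(12, 6))
```

For five time points per replicate this gives a degree of 2 for RegROTS and 4 for PolyReg; with three time points the lower bounds 1 and 2 apply, and with twelve the upper bounds 4 and 5 apply.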

Please use set.seed for reproducibility.

Examples

#Usage of RolDE in time point aligned data without significance value estimation and 1 core
data("data2")
data("des_matrix2")
set.seed(1) #For reproducibility.
data2.res<-RolDE_Main(data=data2, des_matrix=des_matrix2, n_cores=1, sigValSampN = 0)
