Title: | RolDE: Robust longitudinal Differential Expression |
---|---|
Description: | RolDE detects longitudinal differential expression between two conditions in noisy high-throughput data. It is suitable even for data with a moderate amount of missing values. RolDE is a composite method, consisting of three independent modules with different approaches to detecting longitudinal differential expression. The combination of these diverse modules allows RolDE to robustly detect varying differences in longitudinal trends and expression levels in diverse data types and experimental settings. |
Authors: | Tommi Valikangas [aut], Medical Bioinformatics Centre [cre] |
Maintainer: | Medical Bioinformatics Centre <[email protected]> |
License: | GPL-3 |
Version: | 1.11.0 |
Built: | 2024-10-31 04:28:38 UTC |
Source: | https://github.com/bioc/RolDE |
A longitudinal proteomics dataset with five timepoints, two conditions and three replicates for each sample at each timepoint in each condition. The expression values for the random data have been generated using the rnorm function. The values for each protein (row) have been drawn from a normal distribution with a mean of 22.5 and a standard deviation of 1.5, corresponding to random protein expression values.
data1
A matrix with 1045 rows and 30 variables.
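The generation scheme described for data1 can be mimicked with a short sketch. The object name `sim_data`, the seed and the row and column names are assumptions made for illustration; this is not the code that produced data1.

```r
# Sketch: simulate a random expression matrix shaped like data1.
set.seed(1)                      # arbitrary seed, used only for reproducibility
n_proteins <- 1045               # number of rows stated for data1
n_samples  <- 30                 # 2 conditions x 5 timepoints x 3 replicates
sim_data <- matrix(
  rnorm(n_proteins * n_samples, mean = 22.5, sd = 1.5),
  nrow = n_proteins, ncol = n_samples,
  dimnames = list(paste0("protein", seq_len(n_proteins)),   # hypothetical names
                  paste0("sample", seq_len(n_samples)))
)
dim(sim_data)
```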
Based on the included semi-simulated UPS1 spike-in dataset (data3) with 50 randomly chosen proteins from the data.
data2
A matrix with 50 rows and 30 variables.
https://www.ebi.ac.uk/pride/archive/projects/PXD002099
A longitudinal proteomics dataset with five timepoints, two conditions and
three replicates for each sample at each timepoint in each condition. The expression
values for each protein in each sample have been generated by using the rnorm
function and the means and standard deviations of the corresponding proteins and samples
in the original experimental UPS1 spike-in data. In this manner, the expression for each
protein in each sample could be replicated with some random variation multiple times when
necessary. The pattern of missing values was directly copied from the used samples
in the original UPS1 spike-in dataset. All proteins with missing values have been filtered out.
Linear-like trends for the spike-in proteins for Condition 1 were generated using the 4, 4, 10, 25 and 50 fmol samples of the original UPS1 spike-in data. Linear-like trends for the spike-in proteins for Condition 2 were generated using the 2, 4, 4, 10 and 25 fmol samples of the original UPS1 spike-in data. For more information about the generation of the semi-simulated spike-in datasets, see the original RolDE publication. In the spike-in datasets, the UPS1 spike-in proteins are expected to differ between conditions, while the expression of the rest of the proteins (the background proteins) is expected to remain stable between the conditions, excluding experimental noise.
data3
A matrix with 1033 rows and 30 variables.
https://www.ebi.ac.uk/pride/archive/projects/PXD002099
A design matrix to be used with data1 for the RolDE-function. Column 1 contains sample names, column 2 the condition information for each sample, column 3 indicates the timepoint for each sample, and column 4 gives the Replicate or Individual each sample is coming from.
des_matrix1
A matrix with 30 rows and 4 variables.
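A design matrix with the four columns described above could be assembled along the following lines. This is a minimal sketch: the sample names, condition labels, column names and replicate labelling are all assumptions for illustration and need not match the contents of des_matrix1.

```r
# Sketch: a 30 x 4 character design matrix (all labels are illustrative).
conditions <- c("Condition1", "Condition2")
grid <- expand.grid(replicate = 1:3, timepoint = 1:5,
                    condition = conditions, stringsAsFactors = FALSE)
design <- cbind(
  sample    = paste0("C", match(grid$condition, conditions),
                     "_T", grid$timepoint, "_R", grid$replicate),  # column 1: sample names
  condition = grid$condition,                                      # column 2: condition
  timepoint = as.character(grid$timepoint),                        # column 3: timepoint
  replicate = as.character(grid$replicate)                         # column 4: replicate / individual
)
dim(design)
head(design, 3)
```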
A design matrix to be used with data2 for the RolDE-function. Column 1 contains sample names, column 2 the condition information for each sample, column 3 indicates the timepoint for each sample, and column 4 gives the Replicate or Individual each sample is coming from.
des_matrix2
A matrix with 30 rows and 4 variables.
A design matrix to be used with data3 for the RolDE-function. Column 1 contains sample names, column 2 the condition information for each sample, column 3 indicates the timepoint for each sample, and column 4 gives the Replicate or Individual each sample is coming from.
des_matrix3
A matrix with 30 rows and 4 variables.
Plot the findings from longitudinal differential expression analysis with RolDE.
plotFindings(file_name = NULL, RolDE_res, top_n, col1 = "blue", col2 = "red")
file_name |
a string indicating the file name in which the results should be plotted. Should have a ".pdf" extension. Default is NULL, no file is created. |
RolDE_res |
the RolDE result object. |
top_n |
an integer or a vector of integers indicating what top differentially expressed features should be plotted. If |
col1 |
a string indicating which color should be used for Individuals / Replicates in condition 1. The default is blue. |
col2 |
a string indicating which color should be used for Individuals / Replicates in condition 2. The default is red. |
The function plots the longitudinal expression of the top RolDE findings. It can plot either a single finding or multiple top findings, as indicated by top_n. The findings can be plotted into a PDF file, as indicated by file_name; the given file_name should have a ".pdf" extension. If the feature to be plotted has missing values, the mean over the feature values is imputed for visualization purposes. Missing / imputed values are indicated with an empty circle symbol.
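The mean imputation used for visualization can be illustrated with a small sketch. The helper `impute_for_plot` and the example values are hypothetical and only mirror the behaviour described above; they are not taken from the plotFindings source.

```r
# Sketch: fill missing values with the feature mean, for plotting only.
impute_for_plot <- function(x) {
  was_missing <- is.na(x)
  x[was_missing] <- mean(x, na.rm = TRUE)   # mean over the observed values
  list(values = x, imputed = was_missing)
}

feature <- c(21.8, NA, 23.1, 22.4, NA)      # made-up expression values
res <- impute_for_plot(feature)
# Plotting symbols: 16 = filled circle (observed), 1 = empty circle (imputed)
pch_codes <- ifelse(res$imputed, 1, 16)
res$values
pch_codes
```

The `pch_codes` vector could then be passed as the `pch` argument of `plot()` to distinguish observed from imputed points.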
plotFindings
Plots the results from the RolDE object.
data("res3")
#Plotting the most DE finding. DE results are in the res3 object.
plotFindings(file_name = NULL, RolDE_res = res3, top_n = 1)
RolDE results of data1 to be used for generating documentation only.
res1
An object of class list of length 10.
RolDE results of data3 to be used for generating documentation only.
res3
An object of class list of length 10.
Detects longitudinal differential expression between two conditions (or groups) in time point aligned data or in data with non-aligned time points. A rank product from the results of three independent modules, RegROTS, DiffROTS and PolyReg, is determined to indicate the strength of differential expression of features between the conditions / groups. RolDE tolerates a fair amount of missing values and is especially suitable for noisy proteomics data.
RolDE( data, des_matrix = NULL, aligned = TRUE, n_cores = 1, model_type = "auto", sigValSampN = 5e+05, sig_adj_meth = "fdr" )
data |
the preprocessed normalized data as a numerical matrix or as a SummarizedExperiment instance. Features (rows) and variables (columns)
of the data must have unique identifiers. If |
des_matrix |
the design matrix for the |
aligned |
logical; are the time points in different conditions and replicates (individuals) in the |
n_cores |
a positive integer. The number of threads used for parallel computing. If set to 1 (the default), no parallel computing is used. |
model_type |
a string indicating the type of regression to be used for the PolyReg module and the maximum level for which random effects should be allowed in the case of mixed models. Default "auto" for automatic selection. |
sigValSampN |
a positive integer indicating the number of permutations for significance value
calculations. The overall used number will be |
sig_adj_meth |
The multiple test hypothesis correction method for the estimated significance values. Only
relevant if |
RolDE is a composite method, consisting of three independent modules with different approaches to detecting longitudinal differential expression. The combination of these diverse modules allows RolDE to robustly detect varying differences in longitudinal trends and expression levels in diverse data types and experimental settings.
The *RegROTS* module merges the power of regression modelling with the power of the established differential expression method Reproducibility Optimized Test Statistic (ROTS) (Elo et al., Suomi et al.). A polynomial regression model of protein expression over time is fitted separately for each replicate (individual) in each condition. Differential expression between two replicates (individuals) in different conditions is examined by comparing the coefficients of the replicate-specific regression models. If all coefficient differences are zero, no longitudinal differential expression exists between the two replicates (individuals) in different conditions. For a thorough exploration of differential expression between the conditions, all possible combinations of replicates (individuals) in different conditions are examined.
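The coefficient comparison at the heart of RegROTS can be sketched for a single pair of replicates. This is an illustration under assumptions: the helper `fit_coefs`, the polynomial degree and the expression values are made up, and the subsequent ROTS testing of the differences is not shown.

```r
# Sketch: per-replicate polynomial fits and their coefficient differences.
fit_coefs <- function(time, expr, degree = 1) {
  # raw = TRUE keeps ordinary polynomial coefficients (intercept, t, t^2, ...)
  coef(lm(expr ~ poly(time, degree, raw = TRUE)))
}

time <- 1:5
rep_cond1 <- c(20.1, 21.0, 22.2, 22.9, 24.1)  # made-up replicate, condition 1 (rising)
rep_cond2 <- c(20.0, 20.2, 19.9, 20.1, 20.0)  # made-up replicate, condition 2 (flat)

coef_diff <- unname(fit_coefs(time, rep_cond1) - fit_coefs(time, rep_cond2))
coef_diff  # differences in intercept and slope; all zeros would mean no difference
```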
In the *DiffROTS* module, the expression of replicates (individuals) in different conditions is directly compared at all time points. Again, if the expression level differences at all time points are zero, no differential expression exists between the examined replicates (individuals) in different conditions. Similarly to the RegROTS module, differential expression is examined between all possible combinations of replicates (individuals) in the different conditions. In non-aligned time point data, the expression level differences between the conditions are examined when accounting for time-associated trends of varying complexity in the data. More specifically, the expression level differences between the conditions are examined when adjusting for increasingly complex time-related expression trends of polynomial degrees 0, 1, ..., d, where d is the maximum degree for the polynomial and the same degree as is used for the PolyReg module.
In the *PolyReg* module, polynomial regression modelling is used to detect longitudinal differential expression. Condition is included as a categorical factor within the models. By investigating the condition-related intercept and the polynomial terms at different levels of the condition factor, average differences in expression levels as well as differences in longitudinal expression patterns between the conditions can be examined.
Finally, to conclusively detect any differential expression,
the detections from the different modules are combined using the rank product.
For more details about the method, see the original RolDE
publication (Valikangas et al.).
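How a rank product combines the modules can be sketched with made-up ranks. The geometric-mean form used here is a common definition of the rank product and is an assumption in this context; the exact computation inside RolDE (for example, how missing ranks are handled) may differ.

```r
# Sketch: combine per-module ranks with a rank product (geometric mean of ranks).
module_ranks <- cbind(               # made-up ranks for four features
  RegROTS  = c(1, 3, 2, 4),
  DiffROTS = c(2, 4, 1, 3),
  PolyReg  = c(1, 4, 3, 2)
)
rownames(module_ranks) <- paste0("feature", 1:4)

rank_product <- apply(module_ranks, 1, function(r) exp(mean(log(r))))
sort(rank_product)  # smaller values indicate stronger combined evidence
```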
At a bare minimum, the user should provide RolDE with the data as a normalized numerical matrix, adjusted for confounding effects if needed, together with a suitable design matrix for the data. If the time points in the data are non-aligned, the user should set the parameter aligned to FALSE.
RolDE determines the other parameter values automatically by default. The default values should be suitable for a typical longitudinal differential expression analysis, but the user is given control of many of the parameters.
By default, RolDE assumes aligned time points in the data. If the time points in the data are non-aligned, the user should set the parameter aligned to FALSE.
Parallel processing can be enabled by setting the parameter n_cores to a value larger than the default 1 (highly recommended). With parallel processing using multiple threads, the run time of RolDE can be significantly decreased. The parameter n_cores controls the number of threads available for parallel processing.
By default, RolDE uses fixed effects only regression with a common intercept and slope for the replicates (individuals) when the time points in the data are aligned, and mixed effects models with a random effect for the individual baseline (intercept) when the time points are non-aligned. This applies to the PolyReg module and, only in data with non-aligned time points, to the DiffROTS module. This behaviour is controlled with the parameter model_type, and the default behaviour is used when model_type is left as "auto". However, the user can choose mixed effects regression modelling when appropriate by setting model_type to "mixed0" for a random effect for the individual baseline, or to "mixed1" for random effects for the individual baseline and slope. Fixed effects only models can be chosen by setting model_type to "fixed". Valid inputs for model_type are "auto" (the default), "mixed0", "mixed1" and "fixed".
If the interest is only in ordering the features based on the strength of longitudinal differential expression between the conditions, sigValSampN can be set to 0 to disable significance value estimation and to reduce the computational time used by RolDE. Otherwise, the parameter sigValSampN indicates how many permutations should be performed when estimating the significance values. A larger value leads to more accurate estimates but increases the required computational time. The total number of permutations for the significance value estimation will be approximately sigValSampN. The default value used by RolDE is 500 000. The realized number of permutations might be slightly different, depending on the number of features in the data. Using parallel processing greatly decreases the time needed for the significance value calculations.
The estimated significance values can be adjusted by any method available in the p.adjust function of the stats package. Alternatively, q-values as defined by Storey et al. in the Bioconductor package qvalue can be used. Valid values for sig_adj_meth are therefore: "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none" and "qvalue". The default value is "fdr".
For more details about RolDE, see the original RolDE publication (Valikangas et al.).
Please use set.seed for reproducibility.
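The effect of the adjustment methods accepted by sig_adj_meth can be explored directly with p.adjust from the stats package. The p-values below are made up for illustration, and the "qvalue" option is not shown because it requires the separate qvalue package.

```r
# Sketch: adjust made-up significance values with stats::p.adjust.
p_values <- c(0.001, 0.012, 0.03, 0.2, 0.8)
adjusted <- sapply(c("holm", "bonferroni", "BH", "fdr"),
                   function(m) p.adjust(p_values, method = m))
adjusted  # one column per adjustment method
```

In p.adjust, "fdr" is an alias for "BH", so those two columns are identical.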
RolDE returns a list with the following components:
RolDE_Results |
a data frame with the RolDE main results. Contains the RolDE rank product, the estimated significance values (if sigValSampN is not set to 0) and the multiple hypothesis adjusted estimated significance values. |
RegROTS_Results |
a data frame of results for the RegROTS module. RegROTS internal rank products. |
RegROTS_P_Values |
a data frame of significance values for all the RegROTS runs. |
DiffROTS_Results |
a data frame of results for the DiffROTS module. DiffROTS internal rank products. |
DiffROTS_P_Values |
a data frame of the significance values for all the DiffROTS runs. |
PolyReg_Results |
a data frame of results for the PolyReg module. The representative (minimum) condition-related significance values. |
PolyReg_P_Values |
a data frame of all the condition-related significance values for the PolyReg module. |
ROTS_Runs |
a list containing the samples in the different ROTS runs for the RegROTS and DiffROTS (time point aligned data) modules. |
Method_Degrees |
a list containing the used degrees for the RegROTS and the PolyReg (and DiffROTS in non-aligned time point data) modules. |
Input |
a list of all the used inputs for RolDE. |
Elo LL, Filen S, Lahesmaa R, et al. Reproducibility-optimized test statistic for ranking genes in microarray studies. IEEE/ACM Trans. Comput. Biol. Bioinform. 2008; 5:423-31.
Suomi T, Seyednasrollah F, Jaakkola MK, et al. ROTS: An R package for reproducibility-optimized statistical testing. PLoS Comput. Biol. 2017; 13:5.
Storey JD, Bass AJ, Dabney A, et al. qvalue: Q-value estimation for false discovery rate control. 2019.
Välikangas T, Suomi T, Elo LL, et al. Enhanced longitudinal differential expression detection in proteomics with robust reproducibility optimization regression. bioRxiv 2021.
#Usage of RolDE in time point aligned data without significance value estimation and 1 core
data("data2")
data("des_matrix2")
set.seed(1) #For reproducibility.
data2.res<-RolDE(data=data2, des_matrix=des_matrix2, n_cores=1, sigValSampN = 0)
Detects longitudinal differential expression between two conditions (or groups) in time point aligned data or in data with non-aligned time points. A rank product from the results of three independent modules, RegROTS, DiffROTS and PolyReg, is determined to indicate the strength of differential expression of features between the conditions / groups. RolDE tolerates a fair amount of missing values and is especially suitable for noisy proteomics data.
RolDE_Main( data, des_matrix = NULL, aligned = TRUE, min_comm_diff = "auto", min_feat_obs = 3, degree_RegROTS = "auto", degree_PolyReg = "auto", n_cores = 1, model_type = "auto", sigValSampN = 5e+05, sig_adj_meth = "fdr" )
data |
the preprocessed normalized data as a numerical matrix or as a SummarizedExperiment instance. Features (rows) and variables (columns)
of the data must have unique identifiers. If |
des_matrix |
the design matrix for the |
aligned |
logical; are the time points in different conditions and replicates (individuals) in the |
min_comm_diff |
a vector of two positive integers or string ("auto"). The minimum number of common time points for the replicates (individuals)
in different conditions to be compared (aligned time points) or the number of time points in the common time interval for the replicates (individuals)
in the different conditions
to be compared (non-aligned time points). The first integer refers to the minimum number of common time points for the RegROTS module
(aligned and non-aligned time points) and the second to DiffROTS (aligned time points). Second value needed but not used for DiffROTS when |
min_feat_obs |
a positive integer. The minimum number of non-missing observations a feature must have for a replicate (individual) in a condition to be included in the comparisons for the RegROTS module and the DiffROTS module (aligned time points). |
degree_RegROTS |
a positive integer or string ("auto"). The degree of the polynomials used for the RegROTS module. |
degree_PolyReg |
a positive integer or string ("auto"). The degree of the polynomials used for the PolyReg module. |
n_cores |
a positive integer. The number of threads used for parallel computing. If set to 1 (the default), no parallel computing is used. |
model_type |
a string indicating the type of regression to be used for the PolyReg module and the maximum level for which random effects should be allowed in the case of mixed models. Default "auto" for automatic selection. |
sigValSampN |
a positive integer indicating the number of permutations for significance value
calculations. The overall used number will be |
sig_adj_meth |
The multiple test hypothesis correction method for the estimated significance values. Only
relevant if |
RolDE is a composite method, consisting of three independent modules with different approaches to detecting longitudinal differential expression. The combination of these diverse modules allows RolDE to robustly detect varying differences in longitudinal trends and expression levels in diverse data types and experimental settings.
The *RegROTS* module merges the power of regression modelling with the power of the established differential expression method Reproducibility Optimized Test Statistic (ROTS) (Elo et al., Suomi et al.). A polynomial regression model of protein expression over time is fitted separately for each replicate (individual) in each condition. Differential expression between two replicates (individuals) in different conditions is examined by comparing the coefficients of the replicate-specific regression models. If all coefficient differences are zero, no longitudinal differential expression exists between the two replicates (individuals) in different conditions. For a thorough exploration of differential expression between the conditions, all possible combinations of replicates (individuals) in different conditions are examined.
In the *DiffROTS* module, the expression of replicates (individuals) in different conditions is directly compared at all time points. Again, if the expression level differences at all time points are zero, no differential expression exists between the examined replicates (individuals) in different conditions. Similarly to the RegROTS module, differential expression is examined between all possible combinations of replicates (individuals) in the different conditions. In non-aligned time point data, the expression level differences between the conditions are examined when accounting for time-associated trends of varying complexity in the data. More specifically, the expression level differences between the conditions are examined when adjusting for increasingly complex time-related expression trends of polynomial degrees 0, 1, ..., d, where d is the maximum degree for the polynomial and the same degree as is used for the PolyReg module.
In the *PolyReg* module, polynomial regression modelling is used to detect longitudinal differential expression. Condition is included as a categorical factor within the models. By investigating the condition-related intercept and the polynomial terms at different levels of the condition factor, average differences in expression levels as well as differences in longitudinal expression patterns between the conditions can be examined.
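The idea behind PolyReg can be sketched for one feature with an ordinary linear model. This is a simplified illustration under assumptions: the simulated data, the polynomial degree of 2 and the use of lm with orthogonal polynomials are choices made here, not a reproduction of the models RolDE fits.

```r
# Sketch: one feature, two conditions, condition-by-time polynomial model.
set.seed(1)
d <- expand.grid(time = 1:5, replicate = 1:3, condition = c("A", "B"))
d$condition <- factor(d$condition)
# Made-up data: condition B rises over time, condition A stays flat
d$expr <- 22 + ifelse(d$condition == "B", 0.8 * d$time, 0) +
  rnorm(nrow(d), sd = 0.3)

fit <- lm(expr ~ condition * poly(time, 2), data = d)
coefs <- summary(fit)$coefficients
# Keep the condition main effect and the condition-by-time interaction terms
cond_terms <- coefs[grepl("condition", rownames(coefs)), , drop = FALSE]
cond_terms[, "Pr(>|t|)"]
min(cond_terms[, "Pr(>|t|)"])  # a representative (minimum) condition-related value
```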
Finally, to conclusively detect any differential expression,
the detections from the different modules are combined using the rank product.
For more details about the method, see the original RolDE
publication (Valikangas et al.).
At a bare minimum, the user should provide RolDE with the data as a normalized numerical matrix, adjusted for confounding effects if needed, together with a suitable design matrix for the data. If the time points in the data are non-aligned, the user should set the parameter aligned to FALSE.
RolDE determines the other parameter values automatically by default. The default values should be suitable for a typical longitudinal differential expression analysis, but the user is given control of many of the parameters.
By default, RolDE assumes aligned time points in the data. If the time points in the data are non-aligned, the user should set the parameter aligned to FALSE.
The parameter min_comm_diff controls how many common time points two replicates (individuals) in different conditions must have to be compared. The first value controls the number of common time points for the RegROTS module, while the second one controls the number of common time points for the DiffROTS module. If min_comm_diff is set to "auto", RolDE will use a value of 3 for the RegROTS module and a value of 1 for the DiffROTS module. Minimum values for the RegROTS and DiffROTS modules are 2 and 1, respectively.
In the case of data with non-aligned time points (aligned is set to FALSE), the first value of min_comm_diff controls how many time values (or similar, e.g. age, temperature) both replicates (individuals) in different conditions must have in the common time interval to be compared. The common time interval for two replicates (individuals) r1 and r2 with time values t1 and t2 is defined as [max(min(t1), min(t2)), min(max(t1), max(t2))]. In data with non-aligned time points, a value of >= 1 for DiffROTS (the second value of min_comm_diff) is required but not used. When aligned is FALSE, an overall group comparison over all the replicates (individuals) is performed by the DiffROTS module.
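One reading of the common time interval is the overlap of the two replicates' time ranges: from the later of the two start times to the earlier of the two end times. The sketch below follows that reading. The helper names, the inclusive endpoints and the time values are assumptions for illustration.

```r
# Sketch: common time interval of two replicates and the points falling in it.
common_interval <- function(t1, t2) {
  c(lower = max(min(t1), min(t2)), upper = min(max(t1), max(t2)))
}
n_in_interval <- function(t, interval) {
  sum(t >= interval[["lower"]] & t <= interval[["upper"]])
}

t1 <- c(1.0, 2.5, 4.0, 6.0, 8.5)   # made-up time values, replicate r1
t2 <- c(2.0, 3.0, 5.5, 7.0, 9.0)   # made-up time values, replicate r2
ci <- common_interval(t1, t2)
ci
c(r1 = n_in_interval(t1, ci), r2 = n_in_interval(t2, ci))
```

Under this reading, the pair would be compared by RegROTS only if both counts reach the first value of min_comm_diff.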
The parameter min_feat_obs controls the number of non-missing values a feature must have for a replicate (an individual) in a condition to be compared in the RegROTS module and the DiffROTS module (in data with aligned time points). A feature is required to have at least min_feat_obs non-missing values for both replicates (individuals) in the different conditions to be compared. The default value used by RolDE is 3. If lowered, more missing values are allowed but the analysis may become less accurate. In data with non-aligned time points, a common comparison over all the replicates (individuals) between the conditions is performed in the DiffROTS module and the number of allowed missing values for a feature is controlled internally through other means.
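The min_feat_obs requirement amounts to a simple count of observed values per replicate. The helper `enough_obs` and the example vectors below are hypothetical and only illustrate the rule as described, not RolDE's internal filtering code.

```r
# Sketch: does a feature have enough observed values in both replicates?
enough_obs <- function(x1, x2, min_feat_obs = 3) {
  sum(!is.na(x1)) >= min_feat_obs && sum(!is.na(x2)) >= min_feat_obs
}

rep1 <- c(22.1, NA, 23.0, 22.7, NA)   # 3 observed values
rep2 <- c(NA, NA, 21.9, 22.3, NA)     # 2 observed values
enough_obs(rep1, rep2)                    # FALSE with the default of 3
enough_obs(rep1, rep2, min_feat_obs = 2)  # TRUE when the threshold is lowered
```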
The user can control the degree of polynomials used by the RegROTS and the PolyReg modules via the degree_RegROTS and degree_PolyReg parameters. If left to "auto", RolDE will by default use degree_RegROTS = max(1, min(floor(median(t)/2), 4)) and degree_PolyReg = max(2, min(median(t) - 1, 5)), where t is a vector of the number of time points/values for all the replicates (individuals).
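The "auto" degree rules can be evaluated directly from the formulas above. The helper name `auto_degrees` is hypothetical; the arithmetic follows the stated expressions.

```r
# Sketch: automatic polynomial degrees from the number of time points per replicate.
auto_degrees <- function(t) {
  m <- median(t)
  c(degree_RegROTS = max(1, min(floor(m / 2), 4)),
    degree_PolyReg = max(2, min(m - 1, 5)))
}

auto_degrees(rep(5, 6))    # six replicates with five time points each
auto_degrees(rep(3, 6))    # short series fall back to the lower bounds
auto_degrees(rep(12, 4))   # long series are capped at 4 and 5
```

For five time points per replicate, as in the bundled example datasets, this gives a degree of 2 for RegROTS and 4 for PolyReg.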
Parallel processing can be enabled by setting the parameter n_cores to a value larger than the default 1 (highly recommended). With parallel processing using multiple threads, the run time of RolDE can be significantly decreased. The parameter n_cores controls the number of threads available for parallel processing.
By default, RolDE uses fixed effects only regression with a common intercept and slope for the replicates (individuals) when the time points in the data are aligned, and mixed effects models with a random effect for the individual baseline (intercept) when the time points are non-aligned. This applies to the PolyReg module and, only in data with non-aligned time points, to the DiffROTS module. This behaviour is controlled with the parameter model_type, and the default behaviour is used when model_type is left as "auto". However, the user can choose mixed effects regression modelling when appropriate by setting model_type to "mixed0" for a random effect for the individual baseline, or to "mixed1" for random effects for the individual baseline and slope. Fixed effects only models can be chosen by setting model_type to "fixed". Valid inputs for model_type are "auto" (the default), "mixed0", "mixed1" and "fixed".
If the interest is only in ordering the features based on the strength of longitudinal differential expression between the conditions, sigValSampN can be set to 0 to disable significance value estimation and to reduce the computational time used by RolDE. Otherwise, the parameter sigValSampN indicates how many permutations should be performed when estimating the significance values. A larger value leads to more accurate estimates but increases the required computational time. The total number of permutations for the significance value estimation will be approximately sigValSampN. The default value used by RolDE is 500 000. The realized number of permutations might be slightly different, depending on the number of features in the data. Using parallel processing greatly decreases the time needed for the significance value calculations.
The estimated significance values can be adjusted by any method available in the p.adjust function of the stats package. Alternatively, q-values as defined by Storey et al. in the Bioconductor package qvalue can be used. Valid values for sig_adj_meth are therefore: "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none" and "qvalue". The default value is "fdr".
For more details about RolDE, see the original RolDE publication (Valikangas et al.).
Please use set.seed for reproducibility.
RolDE returns a list with the following components:
RolDE_Results |
a data frame with the RolDE main results. Contains the RolDE rank product, the estimated significance values (if sigValSampN is not set to 0) and the multiple hypothesis adjusted estimated significance values. |
RegROTS_Results |
a data frame of results for the RegROTS module. RegROTS internal rank products. |
RegROTS_P_Values |
a data frame of significance values for all the RegROTS runs. |
DiffROTS_Results |
a data frame of results for the DiffROTS module. DiffROTS internal rank products. |
DiffROTS_P_Values |
a data frame of the significance values for all the DiffROTS runs. |
PolyReg_Results |
a data frame of results for the PolyReg module. The representative (minimum) condition-related significance values. |
PolyReg_P_Values |
a data frame of all the condition-related significance values for the PolyReg module. |
ROTS_Runs |
a list containing the samples in the different ROTS runs for the RegROTS and DiffROTS (time point aligned data) modules. |
Method_Degrees |
a list containing the used degrees for the RegROTS and the PolyReg (and DiffROTS in non-aligned time point data) modules. |
Input |
a list of all the used inputs for RolDE. |
Elo LL, Filen S, Lahesmaa R, et al. Reproducibility-optimized test statistic for ranking genes in microarray studies. IEEE/ACM Trans. Comput. Biol. Bioinform. 2008; 5:423-31.
Suomi T, Seyednasrollah F, Jaakkola MK, et al. ROTS: An R package for reproducibility-optimized statistical testing. PLoS Comput. Biol. 2017; 13:5.
Storey JD, Bass AJ, Dabney A, et al. qvalue: Q-value estimation for false discovery rate control. 2019.
Välikangas T, Suomi T, Elo LL, et al. Enhanced longitudinal differential expression detection in proteomics with robust reproducibility optimization regression. bioRxiv 2021.
#Usage of RolDE in time point aligned data without significance value estimation and 1 core
data("data2")
data("des_matrix2")
set.seed(1) #For reproducibility.
data2.res<-RolDE_Main(data=data2, des_matrix=des_matrix2, n_cores=1, sigValSampN = 0)