Title: | Data Mining for RNA-seq data: normalization, feature selection and classification |
---|---|
Description: | The DaMiRseq package offers a tidy pipeline of data mining procedures to identify transcriptional biomarkers and exploit them for both binary and multi-class classification purposes. The package accepts any kind of data presented as a table of raw counts and allows the inclusion of both continuous and factorial variables associated with the experimental setting. A series of functions enables the user to clean up the data by filtering genomic features and samples, to adjust the data by identifying and removing unwanted sources of variation (i.e. batches and confounding factors) and to select the best predictors for modeling. Finally, a "stacking" ensemble learning technique is applied to build a robust classification model. Every step includes a checkpoint that the user may exploit to assess the effects of data management by looking at diagnostic plots, such as clustering and heatmaps, RLE boxplots, MDS and correlation plots. |
Authors: | Mattia Chiesa <[email protected]>, Luca Piacentini <[email protected]> |
Maintainer: | Mattia Chiesa <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.19.0 |
Built: | 2024-10-30 05:31:39 UTC |
Source: | https://github.com/bioc/DaMiRseq |
This is a helper function to easily draw: (1) a clustering dendrogram and heatmap of a sample-per-sample correlation matrix, (2) multidimensional scaling (MDS) plots, (3) relative log expression (RLE) boxplots of expression data, (4) a sample-by-sample expression value distribution, and (5) a class-average expression value distribution.
DaMiR.Allplot( data, df, type = c("spearman", "pearson"), what = c("all", "all_w_PCA", "MDS", "PCA", "heatmap", "RLEbox", "distr", "avg_distr") )
data | A SummarizedExperiment object or a matrix or a data.frame where rows and cols should be, respectively, observations and features |
df | A data frame with class and known variables (or a subset of them); at least one column with 'class' label must be included |
type | A character string specifying the metric to be applied to correlation analysis. Either "spearman" or "pearson" is allowed; default is "spearman" |
what | A character string specifying the plots to be shown. 'all', 'all_w_PCA', 'MDS', 'PCA', 'heatmap', 'RLEbox', 'distr' and 'avg_distr' are allowed; default is "all" |
Please be sure that NAs are not present in the columns of df: plots will not be drawn in the presence of NAs.
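As a quick pre-check, the columns of df can be screened for NAs before calling DaMiR.Allplot. The sketch below is a minimal illustration using the package's example data, not part of the function itself:
library(DaMiRseq)
data(df)
if (anyNA(df)) {
  # keep only the columns without missing values, so that all plots can be drawn
  df <- df[, colSums(is.na(df)) == 0, drop = FALSE]
}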
A dendrogram and heatmap, MDS plot(s), an RLE boxplot, a sample-by-sample expression value distribution, and a class-average expression value distribution.
Mattia Chiesa, Luca Piacentini
# use example data:
data(data_norm)
data(df)
# Draw clustering dendrogram and heatmap, MDS, RLE boxplot:
DaMiR.Allplot(data=data_norm, df=df[,5,drop=FALSE])
The function helps to draw a clustering dendrogram and a heatmap of expression data.
DaMiR.Clustplot( data, df, type_row = c("euclidean", "correlation"), type_col = c("euclidean", "correlation") )
data | A SummarizedExperiment object or a matrix or a data.frame where rows and cols should be, respectively, observations and features |
df | A data frame with class and (optionally) known variables; at least one column with 'class' label must be included |
type_row | The metric to be used to cluster rows. Either "euclidean" or "correlation" is allowed; default is "euclidean" |
type_col | The metric to be used to cluster cols. Either "euclidean" or "correlation" is allowed; default is "euclidean" |
A clustering dendrogram and heatmap.
Mattia Chiesa, Luca Piacentini
# use example data:
data(data_norm)
data(df)
# use the first 100 genes:
data_norm_red<-data_norm[1:100,]
# Draw heatmap: samples (cols) per genes (rows)
# and use variable annotation:
DaMiR.Clustplot(data=data_norm_red, df=df, type_row="correlation", type_col="correlation")
This function easily draws the correlation plot between surrogate variables (sv) and known variables.
DaMiR.corrplot(sv, df, type = c("pearson", "spearman"), sig.level = 0.01)
sv | The matrix of sv identified by the DaMiR.SV function |
df | A data frame with class and known variables; at least one column with 'class' label must be included |
type | Type of correlation metric to be applied; default is "pearson" |
sig.level | The significance level of the correlation; default is 0.01 |
Factorial variables are allowed. They will be transformed to numeric before applying the rcorr function of Hmisc. The corrplot function, which draws the plot, marks with a cross all the correlations that do not reach the significance threshold defined in the sig.level argument. This plot allows the user to identify those sv that present significant correlations with either technical or biological known variables. Notably, none of the sv should present a significant correlation with the "class" variable.
A correlation plot between sv and known variables.
Mattia Chiesa, Luca Piacentini
# use example data:
data(df)
data(sv)
# Draw correlation plot:
#DaMiR.corrplot(sv=sv, df=df, type = "pearson", sig.level=0.01)
This function implements a 'Stacking' ensemble learning strategy. Users can provide heterogeneous features (other than genomic features) which will be taken into account during classification model building.
DaMiR.EnsembleLearning( data, classes, variables, fSample.tr = 0.7, fSample.tr.w = 0.7, iter = 100, cl_type = c("RF", "kNN", "SVM", "LDA", "LR", "NB", "NN", "PLS") )
data | A transposed data frame of normalized expression data. Rows and Cols should be, respectively, observations and features |
classes | A class vector with the class label of each observation |
variables | An optional data frame containing other variables (but without the 'class' column). Each column represents a different covariate to be considered in the model |
fSample.tr | Fraction of samples to be used as training set; default is 0.7 |
fSample.tr.w | Fraction of samples of the training set to be used during weight estimation; default is 0.7 |
iter | Number of iterations to assess classification accuracy; default is 100 |
cl_type | List of weak classifiers that will compose the meta-learners. Only "RF", "kNN", "SVM", "LDA", "LR", "NB", "NN", "PLS" are allowed. Default is c("RF", "LR", "kNN", "LDA", "NB", "SVM") |
To assess the robustness of a set of predictors, a specific 'Stacking' strategy has been implemented. First, a training set (TR1) and a test set (TS1) are generated by 'bootstrap' sampling. Then, sampling again from the TR1 subset, another pair of training (TR2) and test (TS2) sets is obtained. TR2 is used to train Random Forest (RF), Naive Bayes (NB), Support Vector Machines (SVM), k-Nearest Neighbour (kNN), Linear Discriminant Analysis (LDA) and Logistic Regression (LR) classifiers, whereas TS2 is used to test their accuracy and to calculate the weights. The decision rule of the 'Stacking' classifier is a linear combination of the products between the weights (w) and the predictions (Pr) of each classifier; for each sample k, the ensemble prediction is computed as Pr(k, Ensemble) = Σ w_i · Pr(k, classifier_i), where the sum runs over the N weak classifiers and the weights sum to 1.
Performance of the 'Stacking' classifier is evaluated using TS1. This process is repeated several times (default 100 times).
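The following toy sketch illustrates the decision rule for a single sample k with hypothetical weights and class probabilities (illustration only, not the package's internal code):
# weights estimated on TS2 (they sum to 1) and per-classifier probabilities
# of class "A" for one test sample k:
w  <- c(RF = 0.40, kNN = 0.25, SVM = 0.35)
Pr <- c(RF = 0.90, kNN = 0.60, SVM = 0.80)
ens_score <- sum(w * Pr)                         # weighted linear combination
predicted <- ifelse(ens_score >= 0.5, "A", "B")  # ensemble decision for sample k
predicted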
A list containing:
A matrix of accuracies of each classifier in each iteration.
A matrix of weights used for each classifier in each iteration.
A list of all models generated in each iteration.
A violin plot of model accuracy obtained for each iteration.
Mattia Chiesa, Luca Piacentini
# use example data:
data(selected_features)
data(df)
set.seed(1)
# only for the example:
# speed up the process setting a low 'iter' argument value;
# for real data set use default 'iter' value (i.e. 100) or higher:
# Classification_res <- DaMiR.EnsembleLearning(selected_features,
#   classes=df$class, fSample.tr=0.6, fSample.tr.w=0.6, iter=3,
#   cl_type=c("RF","kNN"))
This function implements a 'Stacking' ensemble learning strategy. Users can provide heterogeneous features (other than genomic features) which will be taken into account during classification model building. A two-class classification task is addressed.
DaMiR.EnsembleLearning2cl( data, classes, variables, fSample.tr = 0.7, fSample.tr.w = 0.7, iter = 100, cl_type = c("RF", "kNN", "SVM", "LDA", "LR", "NB", "NN", "PLS") )
data | A transposed data frame of normalized expression data. Rows and Cols should be, respectively, observations and features |
classes | A class vector with the class label of each observation |
variables | An optional data frame containing other variables (but without the 'class' column). Each column represents a different covariate to be considered in the model |
fSample.tr | Fraction of samples to be used as training set; default is 0.7 |
fSample.tr.w | Fraction of samples of the training set to be used during weight estimation; default is 0.7 |
iter | Number of iterations to assess classification accuracy; default is 100 |
cl_type | List of weak classifiers that will compose the meta-learners. "RF", "kNN", "SVM", "LDA", "LR", "NB", "NN", "PLS" are allowed. Default is c("RF", "LR", "kNN", "LDA", "NB", "SVM") |
To assess the robustness of a set of predictors, a specific 'Stacking' strategy has been implemented. First, a training set (TR1) and a test set (TS1) are generated by 'bootstrap' sampling. Then, sampling again from the TR1 subset, another pair of training (TR2) and test (TS2) sets is obtained. TR2 is used to train Random Forest (RF), Naive Bayes (NB), Support Vector Machines (SVM), k-Nearest Neighbour (kNN), Linear Discriminant Analysis (LDA) and Logistic Regression (LR) classifiers, whereas TS2 is used to test their accuracy and to calculate the weights. The decision rule of the 'Stacking' classifier is a linear combination of the products between the weights (w) and the predictions (Pr) of each classifier; for each sample k, the ensemble prediction is computed as Pr(k, Ensemble) = Σ w_i · Pr(k, classifier_i), where the sum runs over the N weak classifiers and the weights sum to 1.
Performance of the 'Stacking' classifier is evaluated using TS1. This process is repeated several times (default 100 times).
A list containing:
A matrix of accuracies of each classifier in each iteration.
A matrix of weights used for each classifier in each iteration.
A list of all models generated in each iteration.
A violin plot of model accuracy obtained for each iteration.
Mattia Chiesa, Luca Piacentini
# use example data:
data(selected_features)
data(df)
set.seed(1)
# only for the example:
# speed up the process setting a low 'iter' argument value;
# for real data set use default 'iter' value (i.e. 100) or higher:
# Classification_res <- DaMiR.EnsembleLearning(selected_features,
#   classes=df$class, fSample.tr=0.6, fSample.tr.w=0.6, iter=3,
#   cl_type=c("RF","kNN"))
This function implements a 'Stacking' ensemble learning strategy. Users can provide heterogeneous features (other than genomic features) which will be taken into account during classification model building. A multi-class classification task is addressed.
DaMiR.EnsembleLearningNcl( data, classes, variables, fSample.tr = 0.7, fSample.tr.w = 0.7, iter = 100, cl_type = c("RF", "kNN", "SVM", "LDA", "LR", "NB", "NN", "PLS") )
data | A transposed data frame of normalized expression data. Rows and Cols should be, respectively, observations and features |
classes | A class vector with the class label of each observation |
variables | An optional data frame containing other variables (but without the 'class' column). Each column represents a different covariate to be considered in the model |
fSample.tr | Fraction of samples to be used as training set; default is 0.7 |
fSample.tr.w | Fraction of samples of the training set to be used during weight estimation; default is 0.7 |
iter | Number of iterations to assess classification accuracy; default is 100 |
cl_type | List of weak classifiers that will compose the meta-learners. "RF", "kNN", "SVM", "LDA", "LR", "NB", "NN", "PLS" are allowed. Default is c("RF", "LR", "kNN", "LDA", "NB", "SVM") |
To assess the robustness of a set of predictors, a specific 'Stacking' strategy has been implemented. First, a training set (TR1) and a test set (TS1) are generated by 'bootstrap' sampling. Then, sampling again from the TR1 subset, another pair of training (TR2) and test (TS2) sets is obtained. TR2 is used to train Random Forest (RF), Naive Bayes (NB), Support Vector Machines (SVM), k-Nearest Neighbour (kNN), Linear Discriminant Analysis (LDA) and Logistic Regression (LR) classifiers, whereas TS2 is used to test their accuracy and to calculate the weights. The decision rule of the 'Stacking' classifier is a linear combination of the products between the weights (w) and the predictions (Pr) of each classifier; for each sample k, the ensemble prediction is computed as Pr(k, Ensemble) = Σ w_i · Pr(k, classifier_i), where the sum runs over the N weak classifiers and the weights sum to 1.
Performance of the 'Stacking' classifier is evaluated using TS1. This process is repeated several times (default 100 times).
A matrix of accuracies of each classifier in each iteration.
Mattia Chiesa, Luca Piacentini
# use example data:
data(selected_features)
data(df)
set.seed(1)
# only for the example:
# speed up the process setting a low 'iter' argument value;
# for real data set use default 'iter' value (i.e. 100) or higher:
# Classification_res <- DaMiR.EnsembleLearning(selected_features,
#   classes=df$class, fSample.tr=0.6, fSample.tr.w=0.6, iter=3,
#   cl_type=c("RF","kNN"))
The best model learned by the DaMiR.EnsL_Train function is tested on a new dataset, in order to predict the class of the samples.
DaMiR.EnsL_Predict(data, bestModel)
data | A SummarizedExperiment object or a data frame/matrix of normalized expression data. Rows and Cols should be observations and features, respectively. |
bestModel | The best model, selected among those trained by the DaMiR.EnsL_Train function. |
This function implements the prediction step on new data, given a model learned by DaMiR.EnsL_Train
A matrix containing the predictions
Mattia Chiesa, Luca Piacentini
# use example data:
data(selected_features)
data(df)
This function tests the models learned by the DaMiR.EnsL_Train function, on a test set
DaMiR.EnsL_Test(data, classes, EnsL_model)
data | A SummarizedExperiment object or a data frame/matrix of normalized expression data. Rows and Cols should be observations and features, respectively. |
classes | A class vector with the class label of each observation |
EnsL_model | A list with the models trained by the DaMiR.EnsL_Train function. |
This function implements the test step of DaMiR.EnsembleLearning2cl function
A data frame containing the predictions on the test set.
Mattia Chiesa, Luca Piacentini
# use example data:
data(selected_features)
data(df)
set.seed(1)
# only for the example:
# speed up the process setting a low 'iter' argument value;
# for real data set use default 'iter' value (i.e. 100) or higher:
# Tr_res <- DaMiR.EnsL_Train(
#   selected_features, classes=df$class, fSample.tr.w=0.6, iter=3,
#   cl_type=c("RF","LR"))
# DaMiR.EnsembleLearning2cl_Test(selected_features,
#   classes=df$class, Tr_res)
This function learns a meta-learner via a 'Stacking' strategy. Users can provide heterogeneous features (other than genomic features) which will be taken into account during classification model building. A two-class classification task is addressed.
DaMiR.EnsL_Train( data, classes, variables, fSample.tr.w = 0.7, cl_type = c("RF", "SVM", "LDA", "LR", "NB", "NN", "PLS") )
data | A SummarizedExperiment object or a data frame/matrix of normalized expression data. Rows and Cols should be observations and features, respectively. |
classes | A class vector with the class label of each observation |
variables | An optional data frame containing other variables (but without the 'class' column). Each column represents a different covariate to be considered in the model |
fSample.tr.w | Fraction of samples of the training set to be used during weight estimation; default is 0.7 |
cl_type | List of weak classifiers that will compose the meta-learners. "RF", "SVM", "LDA", "LR", "NB", "NN", "PLS" are allowed. Default is c("RF", "LR", "LDA", "NB", "SVM") |
This function implements the training step of DaMiR.EnsembleLearning2cl function
A list containing:
The models of each weak classifier used to build the Ensemble meta-learner, taken from the iteration with the median or the best accuracy of the Ensemble classifier;
the weights associated with each weak classifier;
Mattia Chiesa, Luca Piacentini
# use example data:
data(selected_features)
data(df)
set.seed(1)
# For the example:
# speed up the process setting a low 'iter' argument value;
# for real data set use default 'iter' value (i.e. 100) or higher:
# Classification_res <- DaMiR.EnsL_Train(
#   selected_features, classes=df$class, fSample.tr.w=0.6, iter=3,
#   cl_type=c("RF","LR"))
This function allows the user to select a subset of predictors; the number of predictors can be defined by the user or selected automatically.
DaMiR.FBest( data, ranking, autoselect = c("no", "yes"), n.pred = 10, th.zscore = 2 )
data | A transposed data frame of expression data. Rows and Cols should be, respectively, observations and features |
ranking | A data frame with the importance score for each feature, generated by DaMiR.FSort |
autoselect | A flag to specify how to select predictors: 'no' (default) - the user sets the number of predictors via the n.pred argument; 'yes' - predictors are selected automatically, as those whose scaled importance score exceeds th.zscore |
n.pred | If autoselect = 'no', the number of top-ranked predictors to be selected; default is 10 |
th.zscore | Threshold of scaled importance score (Z-score); default value is 2 |
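The autoselect = "yes" behaviour can be sketched as follows, under the assumption that selection is based on Z-scores of the importance scores compared against th.zscore (toy values, not the package's internal code):
importance <- c(gene1 = 3.2, gene2 = 0.1, gene3 = 0.3, gene4 = 0.2, gene5 = 0.4)
z <- scale(importance)                 # scaled importance score (Z-score)
names(importance)[z > 1.5]             # predictors selected with th.zscore = 1.5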
A list containing:
A data frame of normalized expression data of the most important selected predictors.
A vector with predictors name.
Mattia Chiesa, Luca Piacentini
# use example data:
data(data_reduced)
data(data_relief)
# select the first 8 predictors ranked by importance:
selected_features <- DaMiR.FBest(data_reduced, data_relief, n.pred = 8)
# select predictors by importance, automatically:
selected_features <- DaMiR.FBest(data_reduced, data_relief, autoselect = "yes", th.zscore = 1.5)
This function allows the user to remove highly correlated features.
DaMiR.FReduct(data, th.corr = 0.85, type = c("spearman", "pearson"))
data | A transposed data frame or matrix of normalized expression data. Rows and Cols should be, respectively, observations and features |
th.corr | Feature-per-feature correlation threshold; default is 0.85 |
type | Type of correlation metric to be applied; default is "spearman" |
This function produces an absolute correlation matrix that is then used to reduce pair-wise correlations. When two features show a correlation higher than the threshold set in the th.corr argument, the function first calculates the mean absolute correlation of each feature and then removes the feature with the largest mean absolute correlation.
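The same 'largest mean absolute correlation' rule is implemented by caret::findCorrelation, so the reduction step can be sketched as follows on simulated data (a conceptual illustration, not necessarily the internal code of DaMiR.FReduct):
library(caret)
set.seed(1)
expr <- matrix(rnorm(40 * 20), nrow = 40,
               dimnames = list(paste0("s", 1:40), paste0("g", 1:20)))
expr[, 2] <- expr[, 1] + rnorm(40, sd = 0.1)     # make two features highly correlated
cormat  <- abs(cor(expr, method = "spearman"))   # absolute correlation matrix
to_drop <- findCorrelation(cormat, cutoff = 0.85)
expr_red <- if (length(to_drop) > 0) expr[, -to_drop, drop = FALSE] else expr
dim(expr_red)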
An expression matrix without highly correlated features.
Mattia Chiesa, Luca Piacentini
# use example data:
data(data_reduced)
# reduce the number of features:
data_Reduced <- DaMiR.FReduct(data_reduced, th.corr = 0.75, type = "pearson")
This function identifies the class-correlated principal components (PCs) which are then used to implement a backward variable elimination procedure for the removal of non-informative features.
DaMiR.FSelect( data, df, th.corr = 0.6, type = c("spearman", "pearson"), th.VIP = 3, nPlsIter = 1 )
data | A transposed data frame or a matrix of normalized expression data. Rows and Cols should be, respectively, observations and features |
df | A data frame with known variables; at least one column with 'class' label must be included |
th.corr | Minimum threshold of correlation between class and PCs; default is 0.6. Note: if df$class has more than two levels, this option is disabled and the number of PCs is set to 3. |
type | Type of correlation metric; default is "spearman" |
th.VIP | Threshold of the importance score used by bve_pls to discard non-informative variables; default is 3 |
nPlsIter | Number of times that bve_pls has to run. Each iteration produces a set of selected features, usually similar to each other but not exactly the same. When nPlsIter is > 1, the intersection of the sets of selected features is taken, so that only the most robust features are retained. Default is 1 |
The function aims to reduce the number of features in order to obtain the most informative variables for classification purposes. First, the PCs obtained by principal component analysis (PCA) are correlated with "class". The correlation threshold is defined by the user through the th.corr argument: the higher the correlation, the lower the number of PCs returned. Importantly, if df$class has more than two levels, the number of PCs is automatically set to 3. In a binary experimental setting, users should pay attention to setting the th.corr argument appropriately, because it also affects the total number of selected features, which ultimately depends on the number of PCs. The bve_pls function of the plsVarSel package is then applied. This function exploits a backward variable elimination procedure coupled with a partial least squares approach to remove those variables which are less informative with respect to class. The returned vector of variables can be further reduced by the following DaMiR.FReduct function, in order to obtain a subset of uncorrelated putative predictors.
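A conceptual sketch of the first step (retaining class-correlated PCs) is shown below on simulated data; coding the class as numeric for the correlation is an assumption used only for illustration:
set.seed(1)
expr  <- matrix(rnorm(40 * 100), nrow = 40)     # 40 samples x 100 features
class <- factor(rep(c("A", "B"), each = 20))
pcs <- prcomp(expr, scale. = TRUE)$x            # principal components (samples x PCs)
r <- apply(pcs, 2, function(pc) abs(cor(pc, as.numeric(class), method = "spearman")))
names(r)[r > 0.6]                               # PCs passing th.corr = 0.6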
A list containing:
An expression matrix with only informative features.
A data frame with class and optional variables information.
Mattia Chiesa, Luca Piacentini
Tahir Mehmood, Kristian Hovde Liland, Lars Snipen and Solve Saebo (2011). A review of variable selection methods in Partial Least Squares Regression. Chemometrics and Intelligent Laboratory Systems 118, pp. 62-69.
# use example data:
data(data_norm)
data(df)
# extract expression data from SummarizedExperiment object
# and transpose the matrix:
t_data<-t(assay(data_norm))
t_data <- t_data[,seq_len(100)]
# select class-related features
data_reduced <- DaMiR.FSelect(t_data, df, th.corr = 0.7, type = "spearman", th.VIP = 1)
This function implements a procedure to rank features by their importance, evaluated by the RReliefF score.
DaMiR.FSort(data, df, fSample = 1)
data | A transposed data frame of expression data, i.e. transformed counts by vst or rlog. A log2 transformed expression matrix is also accepted. Rows and Cols should be, respectively, observations and features |
df | A data frame with class and known variables; at least one column with 'class' label must be included |
fSample | Fraction of samples to be used for the implementation of the RReliefF algorithm; default is 1 |
This function is very time-consuming when the number of features is high. We observed a quadratic relationship between execution time and the number of features; thus, we also provide a quadratic formula, T = f(N), where T is the estimated execution time and N the number of genes, to help users estimate the time required for this step. We strongly suggest filtering out non-informative features before performing this step.
A data frame with two columns, in which features are sorted by their importance score:
RReliefF score - calculated by the relief function, implemented in the FSelector package;
scaled.RReliefF score - Z-score value, computed for each RReliefF score.
A plot with the first 50 features, ordered by their importance.
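A sketch of the underlying ranking via FSelector::relief is given below on simulated data; the argument values are assumptions, and DaMiR.FSort may call it with different settings:
library(FSelector)
set.seed(1)
d <- data.frame(matrix(rnorm(40 * 10), nrow = 40))
d$class <- factor(rep(c("A", "B"), each = 20))
scores <- relief(class ~ ., d, neighbours.count = 5, sample.size = 20)
scores$scaled <- as.numeric(scale(scores$attr_importance))  # Z-score, as in scaled.RReliefF
scores[order(-scores$attr_importance), ]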
Mattia Chiesa, Luca Piacentini
Marko Robnik-Sikonja, Igor Kononenko: An adaptation of Relief for attribute estimation in regression. In: Fourteenth International Conference on Machine Learning, 296-304, 1997
relief, DaMiR.FSelect, DaMiR.FReduct
# use example data:
data(data_reduced)
data(df)
# rank features by importance:
df.importance <- DaMiR.FSort(data_reduced[,1:10], df, fSample = 0.75)
This function generates an integer number from a formula based on the current date and time.
DaMiR.goldenDice()
The number is generated by combining the current seconds (S), minutes (Mi), hours (H), days (D), months (Mo), years (Y) and the golden ratio (φ).
An integer number.
Mattia Chiesa, Luca Piacentini
gen_numb <- DaMiR.goldenDice()
set.seed(gen_numb)
This function aims to perform a batch correction on a normalized independent test set, exploiting the ComBat function of the sva package.
DaMiR.iTSadjust(adj_Learning_set, norm_Ind_Test_set, iTS_batch)
adj_Learning_set | A SummarizedExperiment object or a data frame/matrix of adjusted and normalized data, obtained by the DaMiR.SVadjust function. |
norm_Ind_Test_set | A data frame or a matrix of normalized data. The independent test set is supposed to be already normalized by the DaMiR.iTSnorm function |
iTS_batch | (Optional) A factor or a data.frame containing information regarding the experimental batches of the independent test set. Users can ignore this argument if the independent test set is deemed a single experimental batch. |
The function applies a batch correction procedure to the independent test set, previously normalized by DaMiR.iTSnorm.
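The sketch below gives one possible shape of the underlying sva::ComBat call, on simulated log-scale data; the way the learning and test sets are combined here is an assumption for illustration, not the package's exact procedure:
library(sva)
set.seed(1)
expr_learning <- matrix(rnorm(100 * 10), nrow = 100)      # genes x samples, adjusted learning set
expr_test     <- matrix(rnorm(100 * 4), nrow = 100) + 1   # genes x samples, normalized test set
expr_all  <- cbind(expr_learning, expr_test)
batch_all <- factor(c(rep("learning", 10), rep("test", 4)))
expr_adj  <- ComBat(dat = expr_all, batch = batch_all)    # batch-corrected expression (log2 scale)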
A matrix containing a normalized and adjusted expression matrix (log2 scale).
Mattia Chiesa, Luca Piacentini
Jeffrey T. Leek, W. Evan Johnson, Hilary S. Parker, Elana J. Fertig, Andrew E. Jaffe and John D. Storey (2016). sva: Surrogate Variable Analysis. R package version 3.22.0.
# use example data:
data(SE)
This function properly normalizes an independent test set, taking information from the Learning set that will be used to transform the new sample(s).
DaMiR.iTSnorm( Learning_set, Ind_Test_set, normtype = c("vst", "rlog", "logcpm"), method = c("precise", "quick") )
Learning_set | A SummarizedExperiment object or a data frame/matrix of raw count data. The learning set is supposed to be a raw counts dataset of the expressed features (not all features). Rows and Cols should be features and samples, respectively. |
Ind_Test_set | A SummarizedExperiment object or a data frame/matrix of raw count data. The independent test set is supposed to be a raw counts dataset with the same features as 'Learning_set'. Rows and Cols should be features and samples, respectively. |
normtype | Type of normalization to be applied: 'vst', 'rlog' or 'logcpm' are allowed; default is 'vst' |
method | Type of method to estimate the dispersion, applied to the independent test set to normalize the data. Only 'precise' and 'quick' are allowed. In the first case, the dispersion is estimated from the Learning set and applied to the independent test set; in the second case, it is estimated from the independent test set itself. Default is "precise". See details in dispersionFunction |
The Learning_set is supposed to be a raw counts dataset of the expressed features. Moreover, the independent test set is supposed to be a raw counts dataset with the same features of 'Learning_set'. The independent test set is normalized, taking into account the dispersion parameter, estimated by the Learning set ('precise' method) or by the independent test set itself ('quick' method).
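A sketch of the 'precise' idea, with the dispersion function frozen on the Learning set and re-used for the test set, is shown below using standard DESeq2 calls on simulated counts (an illustration, not the package's internal code):
library(DESeq2)
set.seed(1)
cnt_learn <- matrix(rnbinom(1000 * 10, mu = 100, size = 1), nrow = 1000)
cnt_test  <- matrix(rnbinom(1000 * 4,  mu = 100, size = 1), nrow = 1000)
dds_learn <- DESeqDataSetFromMatrix(cnt_learn, data.frame(cond = factor(rep("A", 10))), design = ~ 1)
dds_learn <- estimateDispersions(estimateSizeFactors(dds_learn))
dds_test  <- DESeqDataSetFromMatrix(cnt_test, data.frame(cond = factor(rep("A", 4))), design = ~ 1)
dds_test  <- estimateSizeFactors(dds_test)
dispersionFunction(dds_test) <- dispersionFunction(dds_learn)  # 'precise': dispersion from the Learning set
vst_test  <- varianceStabilizingTransformation(dds_test, blind = FALSE)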
A matrix containing a normalized expression matrix (log2 scale)
Mattia Chiesa, Luca Piacentini
Michael I Love, Wolfgang Huber and Simon Anders (2014): Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology
varianceStabilizingTransformation, rlog cpm
# use example data:
data(SE)
This is a helper function that allows the user to simultaneously import counts, class (mandatory) and variables (optional) data, and create a SummarizedExperiment object.
DaMiR.makeSE(x, y)
x | A tab-delimited file which contains RNA-Seq count data. Each row is a feature (i.e. gene, transcript, exon etc.) and each column is a sample |
y | A tab-delimited file which contains experiment information. Each row is a sample and each column is a variable. This file must contain at least one column which represents 'class' information for data adjustment and classification; the class column must be labeled as 'class' |
Before creating a SummarizedExperiment object, the function performs some checks on the input data to ensure that only a matrix of raw counts is loaded. Other checks allow the identification of missing data (NA) in the data frame of the variables of interest.
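The kind of checks described above can be sketched as follows, with simplified, assumed logic on toy objects (not the package's exact code):
# toy inputs: features on rows, samples on columns; a 'class' column is mandatory in y
count_data <- data.frame(sample1 = c(10L, 3L), sample2 = c(0L, 7L),
                         row.names = c("geneA", "geneB"))
variables_data <- data.frame(class = factor(c("A", "B")), age = c(34, 51),
                             row.names = c("sample1", "sample2"))
stopifnot(all(count_data >= 0),
          all(count_data == round(count_data)),   # raw counts, not normalized values
          !anyNA(variables_data),                 # no missing data among the variables
          "class" %in% colnames(variables_data))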
A SummarizedExperiment object containing raw counts, class and (optionally) variables of interest.
Mattia Chiesa, Luca Piacentini
Morgan M, Obenchain V, Hester J and Pagès H (2016). SummarizedExperiment: SummarizedExperiment container. R package version 1.4.0.
rawdata.path <- system.file(package = "DaMiRseq","extdata")
# import tab-delimited files:
# sample data are a small subset of Genotype-Tissue Expression (GTEx)
# RNA-Seq database (dbGap Study Accession: phs000424.v6.p1):
count_data <- read.delim(file.path(rawdata.path, "counts_import.txt"))
variables_data <- read.delim(file.path(rawdata.path, "annotation_import.txt"))
# create a SummarizedExperiment object:
SE <- DaMiR.makeSE(count_data, variables_data)
print(SE)
An MDS plot is drawn in order to visualize class clustering.
DaMiR.MDSplot(data, df, type = c("spearman", "pearson"))
data | A SummarizedExperiment object or a matrix or a data.frame where rows and cols should be, respectively, observations and features |
df | A data frame with class; it can be directly subset from data |
type | A character string specifying the metric to be applied to correlation analysis. Either "spearman" or "pearson" is allowed; default is "spearman" |
The MDS plot is drawn taking as input a dissimilarity matrix produced by either a sample-per-sample Pearson's or Spearman's correlation of normalized expression data.
An MDS plot, using only 'class' information.
Mattia Chiesa, Luca Piacentini
# use example data:
data(data_reduced)
data(df)
# Draw MDS:
DaMiR.MDSplot(data=data_reduced, df=df, type="pearson")
This function selects the best model among those learned by DaMiR.EnsL_Train and tested by DaMiR.EnsL_Test.
DaMiR.ModelSelect( df, type.sel = c("mode", "median", "greater"), th.sel = 0.85, npred.sel = c("min", "rnd"), metric.idx = 1, npred.idx = 2 )
df | A data frame of performance metrics. At least two columns, representing a specific classification metric (e.g., Accuracy) and the number of predictors, must be provided. Additionally, other classification metrics (e.g., MCC, Sensitivity, Specificity, PPV, NPV, AUC, ...) can be appended (from the third column onwards) and used for the evaluation, by correctly setting the 'metric.idx' parameter. |
type.sel | The method to select the best models. Only "mode", "median" and "greater" values are allowed. For a specific classification metric, "mode" selects all models whose score is the mode of all scores; "median" selects all models whose score is the median of all scores; and "greater" selects all models whose score is greater than the value specified in "th.sel". Default: "mode". |
th.sel | Threshold for the evaluation of the performance when "type.sel" is equal to "greater". Default: 0.85 |
npred.sel | The method to select the best model. Only "min" and "rnd" values are allowed. Taking into account the subset of models found by 'type.sel', this parameter selects one single model with the minimum number of predictors ("min") or randomly ("rnd"). Default: "min". |
metric.idx | The index of the 'df' column (i.e., classification metric) to be considered for the model evaluation. Default: 1. |
npred.idx | The index of the 'df' column representing the number of predictors. Default: 2. |
This function finds the best model, taking into account specific classification metrics.
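For example, with the default settings (type.sel = "mode" followed by npred.sel = "min"), the selection can be sketched on a hypothetical performance table as follows:
perf <- data.frame(Accuracy = c(0.90, 0.92, 0.90, 0.88, 0.90),
                   n.pred   = c(8, 10, 5, 4, 7))
acc_mode   <- as.numeric(names(which.max(table(perf$Accuracy))))  # mode of the metric
candidates <- which(perf$Accuracy == acc_mode)                    # models with that score
best <- candidates[which.min(perf$n.pred[candidates])]            # fewest predictors among them
best                                                              # row index of the selected model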
The index (row) of df corresponding to the selected model, and a bubble chart.
Mattia Chiesa, Luca Piacentini
# use example data:
set.seed(1)
Features are first filtered based on their expression value and/or their variability across samples; the remaining features are then normalized.
DaMiR.normalization( data, minCounts = 10, fSample = 0.5, hyper = c("yes", "no"), th.cv = 3, type = c("vst", "rlog", "logcpm"), nFitType = c("parametric", "local", "mean") )
data | A SummarizedExperiment object |
minCounts | Minimum read counts; default is 10 |
fSample | Fraction of samples with at least minCounts reads; default is 0.5 |
hyper | Flag to enable gene filtering by Coefficient of Variation (CV); default is "yes" |
th.cv | Threshold of minimum CV to consider a feature 'hypervariant' across samples; default is 3 |
type | Type of normalization to be applied: 'vst', 'rlog' or 'logcpm' are allowed; default is 'vst' |
nFitType | Type of method to estimate the dispersion by vst or rlog. Default is "parametric". |
Before the normalization step, this function allows the user to filter features by:
Expression - features are filtered out if their read counts do not reach minCounts in at least fSample of samples;
CV - the CV of each feature is calculated separately for each sample class; features whose CV is greater than th.cv in both classes are discarded. Computing a class-restricted CV may prevent the removal of hypervariant features that are specifically associated with a certain class. This could be important, for example, for immune genes whose expression under definite conditions may unveil peculiar class-gene associations.
Finally, expressed features are normalized by varianceStabilizingTransformation (default) or rlog, both implemented in the DESeq2 package. We suggest using varianceStabilizingTransformation to speed up the normalization process, because rlog is very time-consuming, although the two methods produce quite similar results.
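A minimal sketch of the expression filter (minCounts reads in at least fSample of samples), applied to simulated raw counts, is shown below; it mirrors the rule described above, not the package's internal code:
set.seed(1)
counts <- matrix(rnbinom(1000 * 20, mu = 20, size = 1), nrow = 1000,
                 dimnames = list(paste0("g", 1:1000), paste0("s", 1:20)))
minCounts <- 10
fSample   <- 0.5
keep <- rowSums(counts >= minCounts) >= fSample * ncol(counts)   # expressed in enough samples
counts_filt <- counts[keep, , drop = FALSE]
dim(counts_filt)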
A SummarizedExperiment object which contains a normalized expression matrix (log2 scale) and the data frame with 'class' and (optionally) variables.
Mattia Chiesa, Luca Piacentini
Michael I Love, Wolfgang Huber and Simon Anders (2014): Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology
varianceStabilizingTransformation, rlog
# use example data:
data(SE)
# perform normalization on a subset of data:
SE_sub<-SE[1:1000, c(1:3, 21:23)]
data_norm <- DaMiR.normalization(SE_sub, minCounts=10, fSample=0.8, hyper="yes", th.cv = 2.5)
This function implements a sample-per-sample correlation check. Samples with a mean correlation lower than a user-defined threshold will be filtered out.
DaMiR.sampleFilt(data, th.corr = 0.9, type = c("spearman", "pearson"))
data | A SummarizedExperiment object |
th.corr | Threshold of mean correlation; default is 0.9 |
type | Type of correlation metric; default is "spearman" |
This step introduces a sample quality checkpoint. Global gene expression should, in fact, exhibit high correlation among biological replicates; conversely, lowly correlated samples may be suspected to bear some technical artifact (e.g. poor RNA or library preparation quality), even though they may have passed the sequencing quality checks. If not assessed, these samples may thus negatively affect all downstream analyses. This function looks at the mean absolute correlation of each sample and removes those samples with a mean correlation lower than the value set in the th.corr argument. This threshold may be specific to the experimental setting but should be as high as possible. For sequencing data we suggest setting th.corr greater than 0.85.
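The mean-correlation rule can be sketched as follows on simulated data (an illustration of the logic, not the package's code):
set.seed(1)
base <- rnorm(500)
expr <- cbind(sapply(1:19, function(i) base + rnorm(500, sd = 0.1)),  # 19 well-correlated samples
              rnorm(500))                                             # 1 poorly correlated sample
colnames(expr) <- paste0("s", 1:20)
cormat   <- abs(cor(expr, method = "spearman"))
mean_cor <- (rowSums(cormat) - 1) / (ncol(cormat) - 1)   # mean correlation, self excluded
expr_filt <- expr[, mean_cor >= 0.9, drop = FALSE]       # keep samples above th.corr = 0.9
ncol(expr_filt)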
A SummarizedExperiment object which contains a normalized and filtered expression matrix (log2 scale) and a filtered data frame with 'class' and (optionally) variables.
Mattia Chiesa, Luca Piacentini
# use example data:
data(data_norm)
# filter out samples with Pearson's correlation <0.92:
data_filt<- DaMiR.sampleFilt(data_norm, th.corr=0.92, type ="pearson")
This function returns a matrix of surrogate variables (sv) using the implementation by Chiesa-Piacentini or the sva method by Leek et al.
DaMiR.SV( data, method = c("fve", "leek", "be"), th.fve = 0.95, second.var = NULL )
data | A SummarizedExperiment object |
method | The method used to identify sv. If missing, the "fve" method will be selected; otherwise the method "leek" or "be" should be chosen |
th.fve | This argument sets the threshold of maximum fraction of variance explained (fve) to be used in conjunction with the "fve" method; default is 0.95 |
second.var | A factor or a numeric vector corresponding to an additional variable to take into account during the sv identification. This variable, together with 'class' in the data object, will be used to design the model matrix (~ class + second.var) |
This function helps the user to identify the appropriate number of sv: it is possible to select a different strategy by changing the option in the method argument. Three methods are available:
"be" - this option uses the num.sv function of the sva package with default parameters;
"leek" - the same as before, but with the asymptotic approach proposed by Leek;
"fve" - this method is introduced in the DaMiRseq package and integrates part of the sva function with custom code. Briefly, the eigenvalues of the data are computed using code already implemented in the sva function; then, the square of each eigenvalue is calculated and the ratio between each "squared eigenvalue" and their sum is computed. These values represent a surrogate measure of the "Percentage of Explained Variance" (pve) obtained by principal component analysis (PCA), and their cumulative sum can be used to select the sv.
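The "fve" selection rule can be sketched with a plain PCA analogue of the described logic (the actual implementation relies on sva code):
set.seed(1)
expr <- matrix(rnorm(40 * 200), nrow = 40)      # samples x features
ev   <- prcomp(expr, scale. = TRUE)$sdev^2      # eigenvalues
fve  <- cumsum(ev^2 / sum(ev^2))                # surrogate "percentage of explained variance"
n_sv <- which(fve >= 0.95)[1]                   # number of components reaching th.fve = 0.95
n_sv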
A matrix of sv. A plot with the sv identified by "fve" method is also returned. A red dot shows the maximum number of variables to be included for a specific "fve".
Mattia Chiesa, Luca Piacentini
Jeffrey T. Leek, W. Evan Johnson, Hilary S. Parker, Elana J. Fertig, Andrew E. Jaffe and John D. Storey (2016). sva: Surrogate Variable Analysis. R package version 3.22.0.
# use example data:
data(data_norm)
sv <- DaMiR.SV(data_norm, method = "fve", th.fve=0.95)
This function removes surrogate or other confounding variable effects from normalized expression data by using the removeBatchEffect function of the limma package.
DaMiR.SVadjust(data, sv, n.sv)
data | A SummarizedExperiment object |
sv | The matrix of surrogate variables identified by the DaMiR.SV function |
n.sv | The number of surrogate variables to be used to adjust the data |
A SummarizedExperiment object containing a matrix of log-expression values with sv effects removed, and the data frame of the variables.
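Under the hood, the adjustment corresponds roughly to a limma::removeBatchEffect call of the shape sketched below, with the 'class' effect protected via the design matrix; the objects are simulated and the exact call is an assumption:
library(limma)
set.seed(1)
expr  <- matrix(rnorm(200 * 40), nrow = 200)    # genes x samples, log2 scale
class <- factor(rep(c("ACC", "FC"), each = 20))
sv    <- matrix(rnorm(40 * 3), nrow = 40)       # surrogate variables (samples x sv)
expr_adj <- removeBatchEffect(expr,
                              covariates = sv[, 1:3],
                              design = model.matrix(~ class))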
Mattia Chiesa, Luca Piacentini
# use example data:
data(data_norm)
data(sv)
data_adjust <- DaMiR.SVadjust(data_norm, sv = sv, n.sv = 3)
This function transposes the expression matrix and replaces the '.' and '-' special characters.
DaMiR.transpose(data)
data | Matrix of normalized expression data, i.e. transformed counts by vst or rlog. A log2 transformed expression matrix is also accepted |
Normalized matrix in which each row is a sample and each column is a feature
Mattia Chiesa, Luca Piacentini
data(data_norm)
data.transposed <- DaMiR.transpose(assay(data_norm))
A small dataset of normalized expression data, provided in the DaMiRseq package
data_min
A data frame with 40 samples (rows) and 87 genes (columns)
An example dataset for the DaMiRseq package
A dataset with a normalized matrix to test several DaMiRseq functions: sample data are a subset of Genotype-Tissue Expression (GTEx) RNA-Seq database (dbGap Study Accession: phs000424.v6.p1)
data_norm
A SummarizedExperiment object containing an assay of 4897 genes (rows) and 40 samples (columns) and a colData with 5 variables
An example dataset for the DaMiRseq package
A small dataset of normalized expression data, provided in the DaMiRseq package
data_reduced
A list with:
reduced expression matrix
a data frame with variables
An example dataset for the DaMiRseq package
A data frame with relieF and scaled reliefF scores for each gene
data_relief
A dataframe with 87 genes (rows) and 2 variables (columns):
reliefF score for each gene
scaled reliefF score for each gene, by z-score
An example dataset for the DaMiRseq package
A data frame with class and covariates information
df
A dataframe with 40 samples (rows) and 5 variables (columns):
center where sample has been collected
sample's gender
sample's age
kind of sample's death, based on Hardy scale
sample's class
An example dataset for the DaMiRseq package
A dataset with a count matrix to test several DaMiRseq functions. To show the package functionality in a reasonable execution time, sample data are a subset of the Genotype-Tissue Expression (GTEx) RNA-Seq database (dbGap Study Accession: phs000424.v6.p1). Samples include 20 Anterior Cingulate Cortex (ACC) tissues and 20 Frontal Cortex (FC) tissues. 21363 genes have been previously selected to have 5 read counts in at least 60% of samples.
SE
A SummarizedExperiment object containing an assay of 21363 randomly selected genes (rows) and 40 samples (columns) and a colData with 5 variables
An example dataset for the DaMiRseq package
A dataset with normalized expression data to build classification models in DaMiRseq package
selected_features
A data frame with 40 samples (rows) and 7 variables (genes)
An example dataset for the DaMiRseq package
A sample dataset with a normalized count matrix for "testthat" functions.
SEtest_norm
A SummarizedExperiment object containing an assay of 100 genes (rows) and 11 samples (columns) and a colData with 5 variables
An example dataset for the DaMiRseq package
A dataset with surrogate variables to test DaMiRseq functions
sv
A matrix with 40 samples (rows) and 4 surrogate variables (columns):
An example dataset for the DaMiRseq package