| Title | Evaluation of Bioinformatics Metrics |
|---|---|
| Description | Evaluating the reliability of your own metrics and the measurements done on your own datasets by analysing the stability and goodness of the classifications of such metrics. |
| Authors | José Antonio Bernabé-Díaz [aut, cre], Manuel Franco [aut], Juana-María Vivo [aut], Manuel Quesada-Martínez [aut], Astrid Duque-Ramos [aut], Jesualdo Tomás Fernández-Breis [aut] |
| Maintainer | José Antonio Bernabé-Díaz <[email protected]> |
| License | GPL-3 |
| Version | 1.23.0 |
| Built | 2024-10-30 07:46:06 UTC |
| Source | https://github.com/bioc/evaluomeR |
Returns a named list in which each metric name is linked to a data frame containing the evaluated individuals, their score for that metric, and the id of the cluster to which each individual is assigned. The cluster assignment is performed after computing the optimal k value with evaluomeR.
annotateClustersByMetric(df, k.range, bs, seed)
df | Input data frame. The first column denotes the identifier of the evaluated individuals. The remaining columns contain the metrics used to evaluate the individuals. Rows with NA values will be ignored. |
k.range | Range of k values in which the optimal k will be searched. |
bs | Bootstrap re-sampling parameter (number of bootstrap replicates). |
seed | Random seed to be used. |
A named list resulting from computing the optimal cluster for each metric. Each metric is a name in the list, and its content is a data frame that includes the individuals, the value of the corresponding metric, and the id of the cluster to which each individual has been assigned according to the optimal clustering.
data("ontMetrics") annotated_clusters=annotateClustersByMetric(ontMetrics, k.range=c(2,3), bs=20, seed=100) annotated_clusters[['ANOnto']]
data("ontMetrics") annotated_clusters=annotateClustersByMetric(ontMetrics, k.range=c(2,3), bs=20, seed=100) annotated_clusters[['ANOnto']]
Metrics for biological pathways: two metrics that provide quantitative characterizations of the importance of regulation in biochemical pathway systems, including systems designed for applications in synthetic biology or metabolic engineering. The metrics are reachability and efficiency.
data("bioMetrics")
data("bioMetrics")
An object of class SummarizedExperiment with 15 rows and 3 columns.
Davis JD, Voit EO (2018). “Metrics for regulated biochemical pathway systems.” Bioinformatics. doi:10.1093/bioinformatics/bty942.
A vector of supported CBIs available in evaluomeR.
evaluomeRSupportedCBI()
A character vector of CBI names.
supportedCBIs <- evaluomeRSupportedCBI()
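The returned vector lists the values accepted by the cbi parameter of the stability and quality analyses documented below. A minimal sketch, assuming the ontMetrics example data set and "clara" being among the supported CBIs:

# Minimal sketch: list the supported CBIs and use one with quality()
supportedCBIs <- evaluomeRSupportedCBI()
print(supportedCBIs)
data("ontMetrics")
result <- quality(ontMetrics, k=4, cbi="clara", getImages=FALSE)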
This method is a wrapper to retrieve a specific SummarizedExperiment, given a k value, from the object returned by the qualityRange function.
getDataQualityRange(data, k)
data | The object returned by the qualityRange function. |
k | The desired k value whose SummarizedExperiment will be retrieved. |
The SummarizedExperiment that contains information about the selected k cluster.
# Using example data from our package
data("ontMetrics")
qualityRangeData <- qualityRange(ontMetrics, k.range=c(3,5), getImages = FALSE)
# Getting dataframe that contains information about k=5
k5Data = getDataQualityRange(qualityRangeData, 5)
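The measurements can then be extracted from k5Data with the standard SummarizedExperiment accessor; a minimal sketch, assuming the values are stored in the first assay:

# Minimal sketch, assuming the measurements are stored in the first assay
library(SummarizedExperiment)
head(assay(k5Data))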
Obtains the ranges of the metrics obtained by each optimal cluster.
getMetricRangeByCluster(df, k.range, bs, seed)
df | Input data frame. The first column denotes the identifier of the evaluated individuals. The remaining columns contain the metrics used to evaluate the individuals. Rows with NA values will be ignored. |
k.range | Range of k values in which the optimal k will be searched. |
bs | Bootstrap re-sampling parameter (number of bootstrap replicates). |
seed | Random seed to be used. |
A dataframe including the min and the max value for each pair (metric, cluster).
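This entry ships without an example; the following is a hedged sketch built from the signature above and the package's ontMetrics example data:

# Hedged sketch: range of each metric per optimal cluster
data("ontMetrics")
metricRanges <- getMetricRangeByCluster(ontMetrics, k.range=c(2,3), bs=20, seed=100)
head(metricRanges)   # min and max value for each (metric, cluster) pair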
Computes the relevancy of each metric for the clustering of the individuals into k groups, by means of robust sparse k-means (RSKC).
getMetricsRelevancy(df, k, alpha = NULL, L1 = NULL, seed = NULL)
df | Input data frame. The first column denotes the identifier of the evaluated individuals. The remaining columns contain the metrics used to evaluate the individuals. Rows with NA values will be ignored. |
k | K value (number of clusters). |
alpha | 0 <= alpha <= 1, the proportion of the cases to be trimmed in robust sparse K-means (see the RSKC package). |
L1 | A single L1 bound on the feature weights (see the RSKC package). |
seed | Random seed to be used. |
A list containing the RSKC output object, the row indexes of the trimmed cases, and a table with the relevancy of each metric.
data("ontMetrics") metricsRelevancy = getMetricsRelevancy(ontMetrics, k=3, alpha=0.1, seed=100) metricsRelevancy$rskc # RSKC output object metricsRelevancy$trimmed_cases # Trimmed cases from input (row indexes) metricsRelevancy$relevancy # Metrics relevancy table
data("ontMetrics") metricsRelevancy = getMetricsRelevancy(ontMetrics, k=3, alpha=0.1, seed=100) metricsRelevancy$rskc # RSKC output object metricsRelevancy$trimmed_cases # Trimmed cases from input (row indexes) metricsRelevancy$relevancy # Metrics relevancy table
This method finds the optimal value of k for each metric.
getOptimalKValue(stabData, qualData, k.range = NULL)
stabData | The output of a stability analysis (e.g., stabilityRange). |
qualData | The output of a quality analysis (e.g., qualityRange). |
k.range | A range of k values to limit the scope of the analysis. |
It returns a dataframe following the schema: metric, optimal_k.
# Using example data from our package
data("rnaMetrics")
stabilityData <- stabilityRange(data=rnaMetrics, k.range=c(2,4), bs=20, getImages = FALSE)
qualityData <- qualityRange(data=rnaMetrics, k.range=c(2,4), getImages = FALSE)
kOptTable = getOptimalKValue(stabilityData, qualityData)
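The resulting table can be inspected directly; a short sketch (the columns follow the metric/optimal_k schema described above):

# Inspect the optimal k chosen per metric
head(kOptTable)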
This analysis calculates a global metric score based upon a prediction model computed with the flexmix package.
globalMetric(data, k.range = c(2, 15), nrep = 10, criterion = c("BIC", "AIC"), PCA = FALSE, seed = NULL)
data | A SummarizedExperiment object or data frame containing the metrics. |
k.range | Concatenation of two positive integers. The first value is the minimum k of the range and the second is the maximum k. |
nrep | Positive integer. Number of random initializations used in adjusting the model. |
criterion | String. Criterion applied in order to select the best model. Possible values: "BIC" or "AIC". |
PCA | Boolean. If true, a PCA is performed on the input dataframe before computing the predictions. |
seed | Positive integer. A seed for internal bootstrap. |
A dataframe containing the global metric score for each metric.
# Using example data from our package
data("rnaMetrics")
globalMetric(rnaMetrics, k.range = c(2,3), nrep=10, criterion="AIC", PCA=TRUE)
Calculation of the Pearson correlation coefficient between every pair of available metrics in order to quantify their degree of interrelationship. The score is in the range [-1,1]. Perfect correlations: -1 (inverse) and 1 (direct).
metricsCorrelations(data, margins = c(0, 10, 9, 11), getImages = TRUE)
data | A SummarizedExperiment object or data frame containing the metrics. |
margins | Numeric vector of four values defining the plot margins. |
getImages | Boolean. If true, a plot is displayed. |
The Pearson correlation matrix as an assay in a SummarizedExperiment object.
# Using example data from our package
data("ontMetrics")
cor = metricsCorrelations(ontMetrics, getImages = TRUE, margins = c(1,0,5,11))
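The correlation matrix itself can be recovered from the returned object with the standard accessor; a minimal sketch, assuming the Pearson matrix is stored in the first assay:

# Minimal sketch, assuming the Pearson matrix is stored in the first assay
library(SummarizedExperiment)
corMatrix <- assay(cor)
corMatrix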
Structural ontology metrics: 19 metrics measuring structural aspects of bio-ontologies, analysed on two different corpora of ontologies: OBO Foundry and AgroPortal.
data("ontMetrics")
data("ontMetrics")
An object of class SummarizedExperiment with 80 rows and 20 columns.
Franco M, Vivo JM, Quesada-Martínez M, Duque-Ramos A, Fernández-Breis JT (2019). “Evaluation of ontology structural metrics based on public repository data.” Bioinformatics. doi:10.1093/bib/bbz009, https://dx.doi.org/10.1093/bib/bbz009.
It plots the values of the metrics in a SummarizedExperiment object as a boxplot.
plotMetricsBoxplot(data)
data | A SummarizedExperiment object or data frame containing the metrics. |
Nothing.
# Using example data from our package
data("ontMetrics")
plotMetricsBoxplot(ontMetrics)
It clusters the values of the metrics in a SummarizedExperiment object as an hclust dendrogram from stats. By default, the distance measure is 'euclidean' and the hclust method is 'ward.D2'.
plotMetricsCluster(data, scale = FALSE, k = NULL)
data | A SummarizedExperiment object or data frame containing the metrics. |
scale | Boolean. If true, the input data is scaled. Default: FALSE. |
k | Integer. If not NULL, a 'cutree' cut of the dendrogram into k groups is performed. Default: NULL. |
An hclust object.
# Using example data from our package
data("ontMetrics")
plotMetricsCluster(ontMetrics, scale=TRUE)
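Since the return value is an hclust object, it can also be cut into groups with cutree from stats; a minimal sketch, assuming three groups of metrics:

# Minimal sketch: cut the metric dendrogram into 3 groups
data("ontMetrics")
hc <- plotMetricsCluster(ontMetrics, scale=TRUE, k=3)
cutree(hc, k=3)   # group membership per metric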
It plots a clustering comparison between two different k-cluster vectors for a set of metrics.
plotMetricsClusterComparison(data, k.vector1, k.vector2 = NULL, seed = NULL)
data | A SummarizedExperiment object or data frame containing the metrics. |
k.vector1 | Vector of positive integers representing the k clusters of the first clustering. |
k.vector2 | Optional. Vector of positive integers representing the k clusters of the second clustering. |
seed | Positive integer. A seed for internal bootstrap. |
Nothing.
# Using example data from our package
data("rnaMetrics")
stabilityData <- stabilityRange(data=rnaMetrics, k.range=c(2,4), bs=20, getImages = FALSE)
qualityData <- qualityRange(data=rnaMetrics, k.range=c(2,4), getImages = FALSE)
kOptTable = getOptimalKValue(stabilityData, qualityData)
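The example above stops at the optimal-k table; the comparison plot itself would be produced by a call such as the following hedged sketch, which assumes that single k values are accepted as length-one vectors:

# Hedged sketch: compare clusterings with k = 2 and k = 3 for the rnaMetrics data
plotMetricsClusterComparison(rnaMetrics, k.vector1=2, k.vector2=3, seed=100)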
It plots the minimum, maximum and standard deviation values of the metrics in a SummarizedExperiment object.
plotMetricsMinMax(data)
data | A SummarizedExperiment object or data frame containing the metrics. |
Nothing.
# Using example data from our package
data("ontMetrics")
plotMetricsMinMax(ontMetrics)
It plots the values of the metrics in a SummarizedExperiment object as a violin plot.
plotMetricsViolin(data, nplots = 20)
data | A SummarizedExperiment object or data frame containing the metrics. |
nplots | Positive integer. Number of metrics per violin plot. Default: 20. |
Nothing.
# Using example data from our package
data("ontMetrics")
plotMetricsViolin(ontMetrics)
The goodness of the classifications is assessed by validating the clusters generated. For this purpose, we use the Silhouette width as validity index. This index computes and compares the quality of the clustering outputs found by the different metrics, thus making it possible to measure the goodness of the classification for both instances and metrics. More precisely, this goodness measurement provides an assessment of how similar an instance is to the other instances of its own cluster and how dissimilar it is to all the other clusters. The average over all instances quantifies how appropriately the instances are clustered. Kaufman and Rousseeuw suggested interpreting the global Silhouette width score as the effectiveness of the clustering structure. The values are in the range [-1,1], with the following meaning:
There is no substantial clustering structure: [-1, 0.25].
The clustering structure is weak and could be artificial: ]0.25, 0.50].
There is a reasonable clustering structure: ]0.50, 0.70].
A strong clustering structure has been found: ]0.70, 1].
quality(data, k = 5, cbi = "kmeans", getImages = FALSE, all_metrics = FALSE, seed = NULL, ...)
data | A SummarizedExperiment object or data frame containing the metrics. |
k | Positive integer. Number of clusters, within the [2,15] range. |
cbi | Clusterboot interface name (default: "kmeans"): "kmeans", "clara", "clara_pam", "hclust", "pamk", "pamk_pam", "pamk". Any CBI appended with '_pam' makes use of PAM (partition around medoids). |
getImages | Boolean. If true, a plot is displayed. |
all_metrics | Boolean. If true, clustering is performed upon all the dataset. |
seed | Positive integer. A seed for internal bootstrap. |
A SummarizedExperiment containing the silhouette width measurements and cluster sizes for cluster k.
Kaufman L, Rousseeuw PJ (2009). Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons.
# Using example data from our package
data("ontMetrics")
result = quality(ontMetrics, k=4)
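The silhouette widths and cluster sizes can be pulled out of the result with the standard accessor; a minimal sketch, assuming they are stored in the first assay:

# Minimal sketch, assuming the measurements are stored in the first assay
library(SummarizedExperiment)
assay(result)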
The goodness of the classifications is assessed by validating the clusters generated for a range of k values. For this purpose, we use the Silhouette width as validity index. This index computes and compares the quality of the clustering outputs found by the different metrics, thus making it possible to measure the goodness of the classification for both instances and metrics. More precisely, this measurement provides an assessment of how similar an instance is to the other instances of its own cluster and how dissimilar it is to the rest of the clusters. The average over all instances quantifies how appropriately the instances are clustered. Kaufman and Rousseeuw suggested interpreting the global Silhouette width score as the effectiveness of the clustering structure. The values are in the range [-1,1], with the following meaning:
There is no substantial clustering structure: [-1, 0.25].
The clustering structure is weak and could be artificial: ]0.25, 0.50].
There is a reasonable clustering structure: ]0.50, 0.70].
A strong clustering structure has been found: ]0.70, 1].
qualityRange(data, k.range = c(3, 5), cbi = "kmeans", getImages = FALSE, all_metrics = FALSE, seed = NULL, ...)
data | A SummarizedExperiment object or data frame containing the metrics. |
k.range | Concatenation of two positive integers. The first value is the minimum k of the range and the second is the maximum k. |
cbi | Clusterboot interface name (default: "kmeans"): "kmeans", "clara", "clara_pam", "hclust", "pamk", "pamk_pam", "pamk". Any CBI appended with '_pam' makes use of PAM (partition around medoids). |
getImages | Boolean. If true, a plot is displayed. |
all_metrics | Boolean. If true, clustering is performed upon all the dataset. |
seed | Positive integer. A seed for internal bootstrap. |
A list of SummarizedExperiment objects containing the silhouette width measurements and cluster sizes from k.range[1] to k.range[2]. The position in the list matches the k value used in that dataframe; for instance, position 5 represents the dataframe with k = 5.
Kaufman L, Rousseeuw PJ (2009). Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons.
# Using example data from our package
data("ontMetrics")
# Without plotting
dataFrameList = qualityRange(ontMetrics, k.range=c(2,3), getImages = FALSE)
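Individual k values can then be retrieved from the returned list with getDataQualityRange, documented above; a short sketch:

# Retrieve the SummarizedExperiment corresponding to k = 3
k3Data <- getDataQualityRange(dataFrameList, 3)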
The goodness of the classifications is assessed by validating the clusters generated for a set of k values. For this purpose, we use the Silhouette width as validity index. This index computes and compares the quality of the clustering outputs found by the different metrics, thus making it possible to measure the goodness of the classification for both instances and metrics. More precisely, this measurement provides an assessment of how similar an instance is to the other instances of its own cluster and how dissimilar it is to the rest of the clusters. The average over all instances quantifies how appropriately the instances are clustered. Kaufman and Rousseeuw suggested interpreting the global Silhouette width score as the effectiveness of the clustering structure. The values are in the range [-1,1], with the following meaning:
There is no substantial clustering structure: [-1, 0.25].
The clustering structure is weak and could be artificial: ]0.25, 0.50].
There is a reasonable clustering structure: ]0.50, 0.70].
A strong clustering structure has been found: ]0.70, 1].
qualitySet(data, k.set = c(2, 4), cbi = "kmeans", all_metrics = FALSE, getImages = FALSE, seed = NULL, ...)
data | A SummarizedExperiment object or data frame containing the metrics. |
k.set | A list of integer values of k to be evaluated. |
cbi | Clusterboot interface name (default: "kmeans"): "kmeans", "clara", "clara_pam", "hclust", "pamk", "pamk_pam", "pamk". Any CBI appended with '_pam' makes use of PAM (partition around medoids). |
all_metrics | Boolean. If true, clustering is performed upon all the dataset. |
getImages | Boolean. If true, a plot is displayed. |
seed | Positive integer. A seed for internal bootstrap. |
A list of SummarizedExperiment objects containing the silhouette width measurements and cluster sizes from k.set.
Kaufman L, Rousseeuw PJ (2009). Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons.
# Using example data from our package
data("rnaMetrics")
# Without plotting
dataFrameList = qualitySet(rnaMetrics, k.set=c(2,3), getImages = FALSE)
RNA quality metrics for the assessment of gene expression differences: two quality metrics measured on 16 aliquots of a single batch of RNA samples. The metrics are the Degradation Factor (DegFact) and the RNA Integrity Number (RIN).
data("rnaMetrics")
data("rnaMetrics")
An object of class SummarizedExperiment with 16 rows and 3 columns.
Imbeaud S, Graudens E, Boulanger V, Barlet X, Zaborski P, Eveno E, Mueller O, Schroeder A, Auffray C (2005). “Towards standardization of RNA quality assessment using user-independent classifiers of microcapillary electrophoresis traces.” Nucleic acids research, 33(6), e56–e56.
This analysis permits estimating whether the clustering is meaningfully affected by small variations in the sample. First, a clustering using the k-means algorithm is carried out; the value of k can be provided by the user. Then, the stability index is the mean of the Jaccard coefficient values of a number of bs bootstrap replicates. The values are in the range [0,1], with the following meaning:
Unstable: [0, 0.60[.
Doubtful: [0.60, 0.75].
Stable: ]0.75, 0.85].
Highly Stable: ]0.85, 1].
stability(data, k = 5, bs = 100, cbi = "kmeans", getImages = FALSE, all_metrics = FALSE, seed = NULL, ...)
data | A SummarizedExperiment object or data frame containing the metrics. |
k | Positive integer. Number of clusters, within the [2,15] range. |
bs | Positive integer. Bootstrap value to perform the resampling. |
cbi | Clusterboot interface name (default: "kmeans"): "kmeans", "clara", "clara_pam", "hclust", "pamk", "pamk_pam", "pamk". Any CBI appended with '_pam' makes use of PAM (partition around medoids). |
getImages | Boolean. If true, a plot is displayed. |
all_metrics | Boolean. If true, clustering is performed upon all the dataset. |
seed | Positive integer. A seed for internal bootstrap. |
An ExperimentList containing the stability and cluster measurements for k clusters.
Milligan GW, Cheng R (1996). “Measuring the influence of individual data points in a cluster analysis.” Journal of classification, 13(2), 315–335.
Jaccard P (1901). “Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines.” Bull Soc Vaudoise Sci Nat, 37, 241–272.
# Using example data from our package
data("ontMetrics")
result <- stability(ontMetrics, k=6, getImages=TRUE)
This analysis permits estimating whether the clustering is meaningfully affected by small variations in the sample. For a range of k values (k.range), a clustering using the k-means algorithm is carried out. Then, the stability index is the mean of the Jaccard coefficient values of a number of bs bootstrap replicates. The values are in the range [0,1], with the following meaning:
Unstable: [0, 0.60[.
Doubtful: [0.60, 0.75].
Stable: ]0.75, 0.85].
Highly Stable: ]0.85, 1].
stabilityRange(data, k.range = c(2, 15), bs = 100, cbi = "kmeans", getImages = FALSE, all_metrics = FALSE, seed = NULL, ...)
data | A SummarizedExperiment object or data frame containing the metrics. |
k.range | Concatenation of two positive integers. The first value is the minimum k of the range and the second is the maximum k. |
bs | Positive integer. Bootstrap value to perform the resampling. |
cbi | Clusterboot interface name (default: "kmeans"): "kmeans", "clara", "clara_pam", "hclust", "pamk", "pamk_pam", "pamk". Any CBI appended with '_pam' makes use of PAM (partition around medoids). |
getImages | Boolean. If true, a plot is displayed. |
all_metrics | Boolean. If true, clustering is performed upon all the dataset. |
seed | Positive integer. A seed for internal bootstrap. |
An ExperimentList containing the stability and cluster measurements for each k value within k.range.
Milligan GW, Cheng R (1996). “Measuring the influence of individual data points in a cluster analysis.” Journal of classification, 13(2), 315–335.
Jaccard P (1901). “Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines.” Bull Soc Vaudoise Sci Nat, 37, 241–272.
# Using example data from our package
data("ontMetrics")
result <- stabilityRange(ontMetrics, k.range=c(2,3))
This analysis permits estimating whether the clustering is meaningfully affected by small variations in the sample. For a set of k values (k.set), a clustering using the k-means algorithm is carried out. Then, the stability index is the mean of the Jaccard coefficient values of a number of bs bootstrap replicates. The values are in the range [0,1], with the following meaning:
Unstable: [0, 0.60[.
Doubtful: [0.60, 0.75].
Stable: ]0.75, 0.85].
Highly Stable: ]0.85, 1].
stabilitySet(data, k.set = c(2, 3), bs = 100, cbi = "kmeans", getImages = FALSE, all_metrics = FALSE, seed = NULL, ...)
data | A SummarizedExperiment object or data frame containing the metrics. |
k.set | A list of integer values of k to be evaluated. |
bs | Positive integer. Bootstrap value to perform the resampling. |
cbi | Clusterboot interface name (default: "kmeans"): "kmeans", "clara", "clara_pam", "hclust", "pamk", "pamk_pam", "pamk". Any CBI appended with '_pam' makes use of PAM (partition around medoids). |
getImages | Boolean. If true, a plot is displayed. |
all_metrics | Boolean. If true, clustering is performed upon all the dataset. |
seed | Positive integer. A seed for internal bootstrap. |
An ExperimentList containing the stability and cluster measurements for each k value in k.set.
Milligan GW, Cheng R (1996). “Measuring the influence of individual data points in a cluster analysis.” Journal of classification, 13(2), 315–335.
Jaccard P (1901). “Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines.” Bull Soc Vaudoise Sci Nat, 37, 241–272.
# Using example data from our package
data("rnaMetrics")
result <- stabilitySet(rnaMetrics, k.set=c(2,3))