Title: | A package for nonlinear dimension reduction with Isomap and LLE. |
---|---|
Description: | A package for nonlinear dimension reduction using the Isomap and LLE algorithms. It also includes a routine for computing the Davies-Bouldin index for cluster validation, a plotting tool, and a data generator for microarray gene expression data and for the Swiss Roll dataset. |
Authors: | Christoph Bartenhagen |
Maintainer: | Christoph Bartenhagen <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.57.0 |
Built: | 2024-11-30 03:47:47 UTC |
Source: | https://github.com/bioc/RDRToolbox |
Computes the Davies-Bouldin index for cluster validation purposes.
DBIndex(data, labels)
data | N x D matrix (N samples, D features) |
labels | a vector of class labels |
To measure a cluster's compactness, this version uses the mean Euclidean distance between the cluster's samples and its center. The distance between two clusters is given by the distance between their centers.
'DBIndex' returns the Davies-Bouldin cluster index, a numeric value.
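The definition above can be sketched in base R. This is a minimal illustration of the standard Davies-Bouldin formula built from the two distances described here; `db_index_sketch` is a hypothetical helper, not the package's source code, and it assumes at least two classes.

```r
# Hypothetical helper: Davies-Bouldin index from the description above.
# Not the RDRToolbox implementation.
db_index_sketch <- function(data, labels) {
  data <- as.matrix(data)
  classes <- unique(labels)
  stopifnot(length(classes) >= 2, length(labels) == nrow(data))

  # Cluster centers, one row per class
  centers <- do.call(rbind, lapply(classes, function(cl) {
    colMeans(data[labels == cl, , drop = FALSE])
  }))

  # Compactness: mean Euclidean distance of each sample to its own center
  scatter <- sapply(seq_along(classes), function(i) {
    members <- data[labels == classes[i], , drop = FALSE]
    mean(sqrt(rowSums(sweep(members, 2, centers[i, ])^2)))
  })

  # Separation: Euclidean distance between cluster centers
  separation <- as.matrix(dist(centers))

  # For each cluster, keep the worst (largest) ratio against any other cluster
  worst <- sapply(seq_along(classes), function(i) {
    others <- setdiff(seq_along(classes), i)
    max((scatter[i] + scatter[others]) / separation[i, others])
  })
  mean(worst)
}

# Two tight clusters, 10 units apart: scatter 1 each, so (1 + 1) / 10 = 0.2
toy <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2))
toy_labels <- c(1, 1, -1, -1)
db_toy <- db_index_sketch(toy, toy_labels)
```

Smaller values indicate compact, well separated clusters.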
Christoph Bartenhagen
## DB-Index of a 50 dimensional dataset with 20 samples separated into two classes
d = generateData(samples=20, genes=50, diffgenes=10, blocksize=5)
DBIndex(data=d[[1]], labels=d[[2]])
A simulator for gene expression data whose values are normally
distributed with zero mean. The covariances are given by a
configurable block-diagonal matrix.
By default, half of the samples contain differential gene expression values (see parameter diffsamples).
generateData(samples=50, genes=10000, diffgenes=200, blocksize=50, cov1=0.2, cov2=0, diff=0.6, diffsamples)
samples | number of samples |
genes | number of gene expression values per sample |
diffgenes | number of differential genes for class 1 |
blocksize | size of each block in the blockdiagonal correlation matrix |
cov1 | covariance within the blocks in the correlation matrix |
cov2 | covariance between the blocks in the correlation matrix |
diff | difference between the random gene expression values and the differential gene expression values |
diffsamples | number of samples containing differential gene expression values compared to the rest (if missing, this parameter is set to half of the total number of samples) |
The simulator generates two labeled classes:
label 1: samples with differentially expressed genes.
label -1: samples without differentially expressed genes.
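The simulation described above can be sketched in base R as follows. This is an illustration only: `simulate_expression_sketch` is a hypothetical helper, and the unit variances, the Cholesky-based sampling, and the choice to shift the first `diffgenes` genes of the first `diffsamples` samples are assumptions, not the package's documented behaviour.

```r
# Hypothetical helper, not the RDRToolbox source.
simulate_expression_sketch <- function(samples = 20, genes = 100, diffgenes = 10,
                                       blocksize = 5, cov1 = 0.2, cov2 = 0,
                                       diff = 0.6,
                                       diffsamples = samples %/% 2) {
  stopifnot(genes %% blocksize == 0, diffgenes <= genes, diffsamples <= samples)

  # Block membership of every gene
  block <- rep(seq_len(genes / blocksize), each = blocksize)

  # Covariance: cov1 inside a block, cov2 between blocks, variance 1 (assumed)
  sigma <- ifelse(outer(block, block, "=="), cov1, cov2)
  diag(sigma) <- 1

  # Zero-mean normal samples with covariance sigma (rows are samples)
  z <- matrix(rnorm(samples * genes), nrow = samples)
  data <- z %*% chol(sigma)

  # Assumed placement: shift the first diffgenes genes of the first diffsamples samples
  rows <- seq_len(diffsamples)
  cols <- seq_len(diffgenes)
  data[rows, cols] <- data[rows, cols] + diff

  labels <- c(rep(1, diffsamples), rep(-1, samples - diffsamples))
  list(data = data, labels = labels)
}

set.seed(1)
sim <- simulate_expression_sketch(samples = 20, genes = 100, diffgenes = 10, blocksize = 5)
```

The Cholesky factor requires a positive definite covariance matrix, which holds for the default `cov1 = 0.2` and `cov2 = 0`.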
'generateData' returns a list containing:
data | a (samples x features)-matrix with the simulated gene expression values |
labels | a vector with labels (1,-1) for the two classes |
Christoph Bartenhagen
## generate a dataset with 20 samples and 1000 gene expression values
d = generateData(samples=20, genes=1000, diffgenes=100, blocksize=10)
data = d[[1]]
labels = d[[2]]
Computes the Isomap embedding as introduced in 2000 by Tenenbaum, de Silva and Langford.
Isomap(data, dims = 2, k, mod = FALSE, plotResiduals = FALSE, verbose = TRUE)
data | N x D matrix (N samples, D features) |
dims | vector containing the target space dimension(s) |
k | number of neighbours |
mod | use modified Isomap algorithm |
plotResiduals | show a plot with the residuals between the high and the low dimensional data |
verbose | show a summary of the embedding procedure at the end |
Isomap is a nonlinear dimension reduction technique that preserves
global properties of the data: the geodesic distances between all
samples are preserved as well as possible in the low dimensional
embedding.
This R version is based on the Matlab implementation by Tenenbaum and
uses Floyd's algorithm to compute the shortest paths in the
neighbourhood graph when calculating the geodesic distances.
A modified version of the original Isomap algorithm is included. It
respects nearest and farthest neighbours.
To estimate the intrinsic dimension of the data, the function can plot
the residuals between the high and the low dimensional data for a
given range of dimensions.
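The basic (unmodified) procedure described above can be sketched in base R: build a k-nearest-neighbour graph, run Floyd's algorithm for the geodesic distances, then apply classical multidimensional scaling. `isomap_sketch` is a hypothetical helper written for illustration; it handles a single target dimension, requires a connected neighbourhood graph, and is not the package's implementation.

```r
# Hypothetical helper illustrating the Isomap steps; not the RDRToolbox source.
isomap_sketch <- function(data, dims = 2, k = 5) {
  data <- as.matrix(data)
  n <- nrow(data)
  d <- as.matrix(dist(data))            # pairwise Euclidean distances

  # k-nearest-neighbour graph; missing edges have infinite length
  graph <- matrix(Inf, n, n)
  for (i in seq_len(n)) {
    nn <- order(d[i, ])[2:(k + 1)]      # position 1 is the sample itself
    graph[i, nn] <- d[i, nn]
    graph[nn, i] <- d[nn, i]            # keep the graph symmetric
  }
  diag(graph) <- 0

  # Floyd's algorithm: allow paths through each intermediate sample m in turn
  for (m in seq_len(n)) {
    graph <- pmin(graph, outer(graph[, m], graph[m, ], "+"))
  }
  stopifnot(all(is.finite(graph)))      # fails if the graph is disconnected

  # Classical MDS on the geodesic distances
  cmdscale(graph, k = dims)
}

# A flat 8 x 5 grid bent into a half cylinder (40 samples in 3 dimensions)
grid <- expand.grid(u = seq(0, pi, length.out = 8), v = seq(0, 2, length.out = 5))
sheet <- cbind(cos(grid$u), sin(grid$u), grid$v)
embedding <- isomap_sketch(sheet, dims = 2, k = 8)
```

On this toy sheet the first embedding coordinate follows the arc parameter `u`, which is the unrolling behaviour the geodesic distances are meant to capture.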
It returns an N x dim matrix (N samples, dim features) with the reduced input data (a list of several matrices if more than one dimension was specified).
Christoph Bartenhagen
Tenenbaum, J. B., de Silva, V. and Langford, J. C., "A global geometric framework for nonlinear dimensionality reduction", 2000; Matlab code is available at http://waldron.stanford.edu/~isomap/
## two dimensional Isomap embedding of a 1000 dimensional dataset using k=5 neighbours
d = generateData(samples=20, genes=1000, diffgenes=100, blocksize=10)
d_low = Isomap(data=d[[1]], dims=2, k=5)

## Isomap residuals for target dimensions 1-10
d_low = Isomap(data=d[[1]], dims=1:10, k=5, plotResiduals=TRUE)

## three dimensional Isomap embedding of a 1000 dimensional dataset using k=10 (nearest and farthest) neighbours
d = generateData(samples=20, genes=1000, diffgenes=100, blocksize=10)
d_low = Isomap(data=d[[1]], dims=3, mod=TRUE, k=10)
Computes the Locally Linear Embedding as introduced in 2000 by Roweis and Saul.
LLE(data, dim=2, k)
data | N x D matrix (N samples, D features) |
dim | dimension of the target space |
k | number of neighbours |
Locally Linear Embedding (LLE) preserves local properties of the data by
representing each sample in the data by a linear combination of
its k nearest neighbours with each neighbour weighted
independently. LLE finally chooses the low-dimensional
representation that best preserves the weights in the target
space.
This R version is based on the Matlab implementation by Sam Roweis.
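The two LLE steps described above (reconstruction weights, then a weight-preserving embedding) can be sketched in base R. `lle_sketch` is a hypothetical helper for illustration, not the package's code; the regularisation constant `reg` is an assumption added to keep the local systems solvable when k exceeds the input dimension.

```r
# Hypothetical helper illustrating the LLE steps; not the RDRToolbox source.
lle_sketch <- function(data, dim = 2, k = 5, reg = 1e-3) {
  data <- as.matrix(data)
  n <- nrow(data)
  d <- as.matrix(dist(data))

  # Step 1: weights that best reconstruct each sample from its k neighbours
  W <- matrix(0, n, n)
  for (i in seq_len(n)) {
    nn <- order(d[i, ])[2:(k + 1)]                  # skip the sample itself
    Z <- sweep(data[nn, , drop = FALSE], 2, data[i, ])  # neighbours centred on sample i
    C <- Z %*% t(Z)                                 # local k x k Gram matrix
    C <- C + diag(k) * reg * sum(diag(C))           # assumed regularisation
    w <- solve(C, rep(1, k))
    W[i, nn] <- w / sum(w)                          # weights sum to one
  }

  # Step 2: low dimensional coordinates that preserve the weights
  M <- t(diag(n) - W) %*% (diag(n) - W)
  eig <- eigen(M, symmetric = TRUE)                 # eigenvalues in decreasing order

  # Drop the last (constant) eigenvector and keep the next dim smallest ones
  eig$vectors[, (n - 1):(n - dim), drop = FALSE]
}

# A flat 8 x 5 grid bent into a half cylinder (40 samples in 3 dimensions)
grid <- expand.grid(u = seq(0, pi, length.out = 8), v = seq(0, 2, length.out = 5))
sheet <- cbind(cos(grid$u), sin(grid$u), grid$v)
lle_embedding <- lle_sketch(sheet, dim = 2, k = 8)
```

The returned columns are orthonormal eigenvectors, so the embedding is only defined up to sign and scale.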
It returns an N x dim matrix (N samples, dim features) with the reduced input data.
Christoph Bartenhagen
Roweis, Sam T. and Saul, Lawrence K., "Nonlinear Dimensionality Reduction by Locally Linear Embedding", 2000
## two dimensional LLE embedding of a 1000 dimensional dataset using k=5 neighbours
d = generateData(samples=20, genes=1000, diffgenes=100, blocksize=10)
d_low = LLE(data=d[[1]], dim=2, k=5)
Creates two and three dimensional plots of (labeled) data. It uses the library "rgl" for rotatable 3D scatterplots.
plotDR(data, labels, axesLabels=c("x","y","z"), legend=FALSE, text, col, pch, ...)
data | matrix with values to be plotted (rows correspond to samples, columns to features) |
labels | vector containing labels of the classes within the data (optional) |
axesLabels | vector containing labels for the axes of the plot |
legend | logical value whether to automatically insert a legend into the plot |
text | vector with (short) labels for each point (optional) |
col | character vector of colours for each class (optional); see |
pch | character or integer value specifying the symbol when plotting points (see |
... | other common R plot parameters like for example |
It colours the data points according to given class labels (max. six classes when using default colours). A legend will be printed in the R console by default (for three dimensional plots, a legend is not supported).
Christoph Bartenhagen
## plot a two dimensional LLE embedding of a 1000 dimensional dataset
d = generateData(samples=20, genes=1000, diffgenes=100, blocksize=10)
d_low = LLE(data=d[[1]], dim=2, k=5)
plotDR(data=d_low, labels=d[[2]])

## plot a two dimensional LLE embedding of a 1000 dimensional dataset
## add axis labels, a legend and plot a text for each sample
d = generateData(samples=20, genes=1000, diffgenes=100, blocksize=10)
d_low = LLE(data=d[[1]], dim=2, k=5)
text = letters[1:20]
plotDR(data=d_low, labels=d[[2]], axesLabels=c("first component", "second component"), text=text, legend=TRUE)

## manually add a legend to the plot
plotDR(data=d_low, labels=d[[2]], axesLabels=c("first component", "second component"), text=text)
legend("topright", legend=c("class 1","class 2"), col=c("black", "red"), pch=1)
Computes and plots the Swiss Roll dataset of a given size and height. It uses the library "rgl" for rotatable 3D scatterplots.
SwissRoll(N = 2000, Height = 30, Plot=FALSE)
N | number of samples |
Height | controls the spreading of the samples in the second dimension |
Plot | a boolean specifying whether to plot the Swiss Roll dataset or not |
'SwissRoll' returns all N samples as an N x 3 matrix.
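For orientation, a Swiss Roll can be generated in base R along the following lines. This uses a common textbook parameterisation (a spiral in the first and third coordinates, uniform spread in the second); `swiss_roll_sketch` is a hypothetical helper, and the exact constants are assumptions that may differ from what 'SwissRoll' uses internally.

```r
# Hypothetical helper using a common Swiss Roll parameterisation;
# the constants are assumptions, not taken from the RDRToolbox source.
swiss_roll_sketch <- function(N = 2000, Height = 30) {
  angle <- 1.5 * pi * (1 + 2 * runif(N))   # position along the spiral
  spread <- Height * runif(N)              # uniform spread in the second dimension
  cbind(x = angle * cos(angle), y = spread, z = angle * sin(angle))
}

set.seed(42)
roll <- swiss_roll_sketch(N = 1000)
```

The result has one row per sample and three columns, matching the shape described above.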
Christoph Bartenhagen
## compute and plot a Swiss Roll dataset with 1000 samples
data = SwissRoll(N = 1000, Plot=TRUE)