The best advice on using the clustering functions in
clusterExperiment
for large datasets is to avoid
calculating any NxN distance or
similarity matrix. They take a long time to calculate, and a large
amount of memory to store.
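To get a sense of scale, the storage cost of a dense NxN matrix can be estimated with simple arithmetic. The sketch below assumes 8 bytes per entry (a double-precision matrix) and ignores any working copies made during computation; the function name nxnMemoryGiB is hypothetical and not part of clusterExperiment.

```r
# Approximate memory (in GiB) needed to hold one dense N x N matrix of doubles.
# Assumption: 8 bytes per entry; R's small per-object overhead is ignored.
nxnMemoryGiB <- function(n) {
  n^2 * 8 / 2^30
}

sizes <- c(1e3, 1e4, 1e5, 1e6)
data.frame(N = sizes, GiB = round(nxnMemoryGiB(sizes), 2))
```

Under these assumptions, N = 100,000 samples already needs roughly 75 GiB for a single copy of the matrix.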
The most likely reason to calculate such a matrix is because of the clustering routine used. Methods like PAM or hierarchical clustering use a distance matrix and are not good choices for large datasets.
Prior to version 2.5.5
, our functions would internally
calculate a distance matrix if the clustering algorithm needed it, and
it would be hard for the user to realize that they selected a clustering
routine that needed such a matrix. Now we have added the argument
makeMissingDiss
, which, if FALSE
, will not
calculate any needed distance matrices and instead return an error. We
recommend setting this argument to FALSE
with large
datasets as a caution. If you discover that you are hitting an error,
select a different clustering algorithm that does not need an NxN distance
matrix.
Note that this may not work with PAM, because PAM takes as input a
matrix x (see ?pam). If an x matrix is given as input, the pam function
simply calculates the distance matrix internally! The option
makeMissingDiss=FALSE may not catch this, since the actual clustering
function allows for an input matrix x. (This is unfortunate for large
datasets, and in the future we may change how we classify the possible
input to PAM, so that it is treated as a method that only accepts
distance matrices and can be caught by makeMissingDiss=FALSE.)
Similarly, using any options regarding silhouette distance will create
an NxN matrix as part of the silhouette computation in the cluster
package. This includes the findBestK=TRUE argument. These options should
only be considered for moderately sized datasets where the calculation
(and storage) of the NxN matrix is not a problem.
Unfortunately, subsampling and consensus clustering (with makeConsensus)
operate by clustering based on the proportion of shared clusterings per
pair of samples, which in past versions of clusterExperiment was stored
in an NxN matrix (see the main tutorial vignette for an explanation of
these methods). While we are working on methods to avoid calculating
this matrix, they do not yet fully avoid the NxN matrix.
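To make the quantity concrete, the proportion of shared clusterings per pair of samples can be computed from a matrix of cluster labels with one column per clustering. This is a minimal base-R sketch, not the package's internal code: the helper name coClusterProportion is hypothetical, and the use of NA for a sample that was not assigned in a given clustering is an assumption made for illustration.

```r
# Hypothetical helper: proportion of clusterings in which each pair of samples
# shares a cluster, computed from an N x B matrix of cluster labels.
# Assumption: NA marks a sample that was not assigned in a given clustering.
coClusterProportion <- function(clMat) {
  n <- nrow(clMat)
  together <- matrix(0, n, n)  # times a pair shared a cluster
  bothSeen <- matrix(0, n, n)  # times both members of a pair were assigned
  for (b in seq_len(ncol(clMat))) {
    cl <- clMat[, b]
    seen <- !is.na(cl)
    same <- outer(cl, cl, "==")
    same[is.na(same)] <- FALSE
    together <- together + same
    bothSeen <- bothSeen + outer(seen, seen, "&")
  }
  together / bothSeen  # NaN where a pair was never assigned together
}

# Toy input: 4 samples (rows), 3 clusterings (columns).
clMat <- cbind(c(1, 1, 2, 2), c(1, 2, 2, NA), c(1, 1, 1, 2))
coClusterProportion(clMat)
```

The result is exactly the kind of NxN object the text warns about: its size grows with the square of the number of samples, regardless of how few clusterings went into it.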
We have, however, in version 2.5.5
made some
infrastructure changes to allow for avoidance of the NxN matrix for
subsampling and consensus clustering if the user has defined a
clustering function to do this (see details below).
We have also in version 2.5.5
changed the clustering
functions to allow the user to request clustering of only unique
representations of the combinations of clusterings from subsampling or
in makeConsensus
, significantly reducing the size of the
NxN matrix
used in the actual clustering step (see below).
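The idea of clustering only unique combinations can be illustrated in base R. The sketch below is not the package's implementation; the variable names are hypothetical and the toy matrix is invented for illustration. It collapses an NxB matrix of cluster labels to its M distinct rows and keeps an index that maps each sample back to its representative.

```r
# Toy N x B matrix of cluster labels (rows are samples), for illustration only.
clMat <- cbind(c(1, 1, 2, 2, 1, 2), c(1, 1, 2, 2, 1, 3))

# Collapse each row to a key and keep one representative per distinct key.
key <- apply(clMat, 1, paste, collapse = "_")
firstOfEach <- !duplicated(key)
uniqueRows <- clMat[firstOfEach, , drop = FALSE]  # M x B matrix

# Map every original sample back to its row in uniqueRows.
backIndex <- match(key, key[firstOfEach])

# Number of samples represented by each unique combination.
counts <- tabulate(backIndex)
```

Any dissimilarity matrix computed on uniqueRows is MxM rather than NxN. The counts vector holds the information that is lost when only the unique rows are clustered, which is why methods that average across samples can give different results.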
Here we document some infrastructure changes made to allow for avoidance of the NxN matrix for subsampling and consensus clustering. These do not, as of yet, actually provide the ability to avoid the NxN calculation for the clustering, but do set up an infrastructure where the user can now provide the appropriate clustering routine to avoid it.
Prior to version 2.5.5, the results of subsampling would be saved as an
NxN matrix, corresponding to the proportion of times two samples were
clustered together across the B subsamples. As of 2.5.5, the results are
simply saved as an NxB matrix, giving the (integer-valued) cluster
assignments of each sample in each subsample. This NxB matrix will need
to be clustered to get anything interesting, and whether the clustering
of that matrix will require calculating an NxN matrix depends on the
clustering routine set in the mainClusterArgs (see below).
As of 2.5.5, the makeConsensus command now expects clustering techniques
that will work directly on the NxB matrices of clusterings, rather than
directly calculating the NxN matrix. Again, this requires a clustering
routine that works on an NxB matrix of clusterings, and whether the
clustering of that matrix will require calculating an NxN matrix depends
on the clustering routine (see
below).
Although the existing clustering functions can take an NxB matrix of
clusterings as input (inputType="cat", see ?ClusterFunctions), they do
this by simply calculating the NxN matrix internally (and this is NOT
controlled by the makeMissingDiss argument, because it is the actual
clustering function that calculates the matrix, not the
clusterExperiment infrastructure, similarly to PAM above). We are
working on creating a clustering routine that avoids this step; if the
user has such a clustering routine, they can provide it to the functions
(see the main vignette and ?ClusterFunction for how to integrate a
user-defined function).
In makeConsensus, only the M unique combinations of clusters are
clustered; this can affect the results, since it ignores the number of
samples represented by each of the M combinations (important for methods
like kmeans that take averages across the samples). However, it can
dramatically reduce the size, no longer requiring calculation or storage
of all the dissimilarities between identically clustered samples. To
choose this option, set clusterArgs=list(removeDup=TRUE) in the list of
arguments passed to either mainClusterArgs or subsampleArgs. This can
also be done for the clustering function of subsampling, but is likely
to lead to much less of a reduction in size.
The NxN matrix is no longer necessarily stored in a slot of the
ClusterExperiment object. Instead we allow for storage of either the NxN
matrix or the NxB matrix, or even just the indices of the clusterings
that make up the NxB matrix. This slot was primarily used for the
plotCoClustering command (basically a heatmap of the NxN matrix), which
is unlikely to be of practical use for extremely large datasets.
However, the plotCoClustering command will calculate that NxN matrix on
the fly from the NxB matrix that is stored, so again it should be
avoided for large datasets.
The package is compatible with HDF5 matrices, meaning that the package will run if the data given is a reference to an HDF5 file. However, the code may achieve this compatibility by bringing the full matrix into memory. In particular, the default clustering routines are not compatible with the HDF5 implementation, meaning that they must bring the full dataset into memory for calculations.
The only exception to this is the method “mbkmeans”, which calls the
clustering routine from the package of the same name. This package
implements a version of kmeans (“Mini-Batch kmeans”) that truly works
with the structure of the HDF5 datasets to avoid bringing the full
dataset into memory. “Mini-batch kmeans” refers to only using a
proportion of the data (a “batch”) at each iteration of the clustering.
The mbkmeans package integrates this with HDF5 files, among other
formats, meaning that mbkmeans is actually written (in C code) so as to
bring into memory only the subset (or batch) needed for any particular
calculation, not the entire dataset.
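The batch idea can be sketched in a few lines of base R. This is a simplified illustration of mini-batch k-means on an in-memory matrix, not the mbkmeans implementation: the function miniBatchKmeans and its arguments are hypothetical, samples are assumed to be in rows, and nothing here reads from an HDF5 file.

```r
# Simplified mini-batch k-means sketch (NOT the mbkmeans implementation).
# x: numeric matrix with samples in rows (an assumption of this sketch).
miniBatchKmeans <- function(x, k, batchSize = 50, nIter = 100) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  counts <- numeric(k)  # number of points absorbed by each center so far
  for (iter in seq_len(nIter)) {
    # Only this batch of rows is used in the current iteration.
    batch <- x[sample(nrow(x), batchSize), , drop = FALSE]
    # Assign every batch point to its nearest current center...
    nearest <- apply(batch, 1, function(pt) {
      which.min(colSums((t(centers) - pt)^2))
    })
    # ...then move each center toward its assigned points.
    for (i in seq_len(nrow(batch))) {
      j <- nearest[i]
      counts[j] <- counts[j] + 1
      rate <- 1 / counts[j]  # per-center learning rate
      centers[j, ] <- (1 - rate) * centers[j, ] + rate * batch[i, ]
    }
  }
  centers
}

set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2))
centers <- miniBatchKmeans(x, k = 2)
```

Because each iteration only reads batchSize rows, an implementation built this way can fetch just those rows from disk, which is the property the text attributes to mbkmeans.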
Unlike the mbkmeans package, however, the integration in
clusterExperiment has not been tested to ensure that the full dataset is
not inadvertently brought into memory by other components of the
clusterExperiment infrastructure. This is an ongoing area for
improvement. (So far, the integration of mbkmeans as a built-in option
in clusterExperiment has only been tested to confirm that it
successfully runs the clustering routine.)
Further comments:
mbkmeans
).
mbkmeans: if using mbkmeans with subsample=TRUE, then the ‘classify’
function (i.e. the assignment of samples that were not part of the
subsample to a clustering) is not part of mbkmeans, and may bring the
entire matrix into memory (when classify is All or OutOfSample).