To install the package, please use the following.
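The standard Bioconductor installation pattern applies (BiocManager is the recommended installer for Bioconductor packages):

```r
## Install BiocManager once, then use it to install mbkmeans
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("mbkmeans")
```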
This vignette provides an introductory example on how to work with the mbkmeans package, which contains an implementation of the mini-batch k-means algorithm proposed in (Sculley 2010) for large single-cell sequencing data. This algorithm runs the k-means iterations on small, random subsamples of the data ("mini-batches") rather than the entire dataset. This both speeds up computation and reduces the amount of data that must be held in memory.
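As an aside, the core update rule of Sculley's algorithm can be sketched in a few lines of base R. The toy function below is purely illustrative and entirely ours (the package itself uses an optimized C++ implementation): it repeatedly samples a mini-batch, assigns each batch point to its nearest center, and nudges each center toward its assigned points with a per-center learning rate that decays as the center accumulates updates.

```r
## Illustrative toy version of mini-batch k-means (Sculley 2010).
## x: observations in rows; k: number of clusters; b: batch size.
mini_batch_sketch <- function(x, k, b, iters = 100) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  counts  <- rep(0, k)                       # per-center update counts
  for (i in seq_len(iters)) {
    batch <- x[sample(nrow(x), b), , drop = FALSE]
    ## distances from each batch point to each center
    d <- as.matrix(dist(rbind(centers, batch)))[-(1:k), 1:k, drop = FALSE]
    lab <- max.col(-d)                       # nearest center per point
    for (j in seq_len(nrow(batch))) {
      ci <- lab[j]
      counts[ci] <- counts[ci] + 1
      eta <- 1 / counts[ci]                  # decaying learning rate
      centers[ci, ] <- (1 - eta) * centers[ci, ] + eta * batch[j, ]
    }
  }
  centers
}
```

On a small simulated matrix, `mini_batch_sketch(x, k = 3, b = 50)` returns a 3 x ncol(x) matrix of centers; note that only one mini-batch is "active" at a time, which is what makes the on-disk version of the algorithm memory-frugal.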
The main function to be used by users is mbkmeans(). This is implemented as an S4 generic, and methods are implemented for matrix, Matrix, HDF5Matrix, DelayedMatrix, SummarizedExperiment, and SingleCellExperiment.
Most of this work was inspired by the MiniBatchKmeans() function implemented in the ClusterR R package, and we re-use many of the C++ functions implemented there. Our main contribution here is to provide an interface to the DelayedArray and HDF5Array packages, so that the mbkmeans algorithm brings into memory, at any one time, only the "mini-batches" needed for the current iteration and never requires loading the entire dataset into memory. This allows the user to run the mini-batch k-means algorithm on very large datasets that do not fit entirely in memory. mbkmeans also runs seamlessly on in-memory matrices provided by the user, and doing so significantly increases the speed of the algorithm compared to the standard k-means algorithm (Hicks et al. 2020). A complete comparison of mbkmeans with other implementations can be found in (Hicks et al. 2020).
The motivation for this work is the clustering of large single-cell RNA-sequencing (scRNA-seq) datasets, and hence the main focus is on Bioconductor's SingleCellExperiment and SummarizedExperiment data containers. For this reason, mbkmeans assumes a data representation typical of genomic data, in which genes (variables) are in the rows and cells (observations) are in the columns. This is contrary to most other statistical applications, and notably to the stats::kmeans() and ClusterR::MiniBatchKmeans() functions, which assume observations in rows.
We provide a lower-level mini_batch() function that expects observations in rows and is intended as a direct replacement for ClusterR::MiniBatchKmeans() for on-disk data representations such as HDF5 files. The rest of this document shows the typical use case through the mbkmeans() interface; users interested in the mini_batch() function should refer to its manual page.
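As a rough sketch of such a call on a plain in-memory matrix (argument names follow the manual page; we assume the first four arguments are data, clusters, batch_size, and max_iters, so check ?mini_batch before relying on this):

```r
library(mbkmeans)

set.seed(1)
x <- matrix(rnorm(500 * 10), nrow = 500)  # 500 observations in rows

## cluster with mini-batches of 50 observations
fit <- mini_batch(x, clusters = 3, batch_size = 50, max_iters = 20)
str(fit)  # inspect the returned list (centroids, cluster labels, ...)
```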
To illustrate a typical use case, we use the pbmc4k dataset from the TENxPBMCData package. This dataset contains about 4,000 cells from the peripheral blood of a healthy donor and is expected to contain many types, or clusters, of cells.
We note that mbkmeans is designed for very large datasets, in particular datasets so large that either the full data matrix cannot be held in memory or the necessary computations cannot be carried out on it. This vignette works with a small dataset to allow the user to quickly follow along and understand the commands. (Hicks et al. 2020) shows an example of using mbkmeans with a dataset of 1.3 million cells, a more appropriate size for observing improved memory usage and improved speed.
First, we load the needed packages.
library(TENxPBMCData)
library(scater)
library(SingleCellExperiment)
library(mbkmeans)
library(DelayedMatrixStats)
Note that in this vignette we do not aim to identify biologically meaningful clusters (that would entail a more sophisticated normalization of the data and dimensionality reduction); instead, we only aim to show how to run mini-batch k-means on an HDF5-backed matrix.
We normalize the data simply by scaling for the total number of counts using scater, and we select the 1,000 most variable genes and a random set of 100 cells to speed up computations.
tenx_pbmc4k <- TENxPBMCData(dataset = "pbmc4k")
set.seed(1034)
idx <- sample(seq_len(NCOL(tenx_pbmc4k)), 100)
sce <- tenx_pbmc4k[, idx]
#normalization
sce <- logNormCounts(sce)
vars <- rowVars(logcounts(sce))
names(vars) <- rownames(sce)
vars <- sort(vars, decreasing = TRUE)
sce1000 <- sce[names(vars)[1:1000],]
sce1000
## class: SingleCellExperiment
## dim: 1000 100
## metadata(0):
## assays(2): counts logcounts
## rownames(1000): ENSG00000090382 ENSG00000204287 ... ENSG00000117748
## ENSG00000135968
## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
## colnames: NULL
## colData names(12): Sample Barcode ... Date_published sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
mbkmeans
The main function, mbkmeans(), returns a list object including centroids, WCSS_per_cluster (where WCSS stands for within-cluster sum of squares), best_initialization, iters_per_initialization, and Clusters. It takes any matrix-like object as input, such as SummarizedExperiment, SingleCellExperiment, matrix, DelayedMatrix, and HDF5Matrix.
In this example, the input is a SingleCellExperiment object. The number of clusters (i.e., k in the k-means algorithm) is set through the clusters argument. In this case, we set clusters = 5 for no particular reason. For SingleCellExperiment objects, the function provides the reduceMethod and whichAssay arguments. The reduceMethod argument specifies the dimensionality reduction slot to use for the clustering; the default is "PCA". Note that this does not perform PCA but only looks for a slot called "PCA" already stored in the object. Alternatively, one can specify whichAssay as the assay to use as input to mini-batch k-means. This is used only when the reduceMethod option is NA. See ?mbkmeans for more details.
There are additional arguments in mbkmeans() that make the function more flexible and suitable for more situations.
The size of the mini-batches is set through the batch_size argument. This argument gives the size of the samples of data brought into memory at any one time, and thus constrains the memory usage of the algorithm.
The blocksize() function can be used to set the batch size to the maximum allowed by the available memory. It considers both the number of columns in the dataset and the amount of RAM on the current machine to calculate as large a batch size as is reasonable for the RAM available to the session. The calculation uses the get_ram() function from the benchmarkme package; see the benchmarkme vignette for more details.
blocksize(sce1000)
## [1] 100
In this simple example, because the whole data fits in memory, the default batch size would be a single batch of size 100.
(Hicks et al. 2020) provides a comparison of the effect of batch size for large datasets (i.e., where the full data cannot be stored in memory). Their results show the algorithm to be robust to a large range of batch sizes (1,000-10,000 cells): accuracy was equivalent as long as the batch size was above 500-1,000 cells, with no noticeable advantage in memory usage or speed as long as the batches were smaller than roughly 10,000 cells (corresponding to a subsample of the full data matrix whose memory footprint is negligible on most modern computers). For this reason, we would err on the larger end of the range suggested in (Hicks et al. 2020), namely 10,000 cells. Furthermore, since the initialization by default uses the same number of cells as the batch size, this gives a robust sample for determining the initial start point.
We note that it is better to choose the batch size as an absolute number of cells rather than as a percentage of the overall sample size. If the batch size is chosen as a percentage of cells, very large datasets will use very large batches, negating the memory and speed gains. Moreover, such large batches are not needed for good performance of the algorithm. The continual resampling of the batches works like stochastic gradient descent and allows for (local) minimization of the objective function (Sculley 2010) with relatively small batch sizes, as demonstrated on single-cell data by (Hicks et al. 2020). If anything, the size of the batch should be governed more by the complexity of the problem (e.g., the number of underlying true clusters or subtypes) than by the number of cells in the dataset.
The performance of mini-batch k-means greatly depends on the initialization process. We implemented two different initialization methods: random and kmeans++, the latter proposed in (Arthur and Vassilvitskii 2007). The default is "kmeans++". The percentage of data to use for the initialization centroids is set through the init_fraction argument, which should be larger than 0 and at most 1. The default value is chosen so that the proportion matches the value of batch_size, converted to a proportion of the data.
res <- mbkmeans(sce1000, clusters = 5,
                reduceMethod = NA,
                whichAssay = "logcounts")  # default initializer = "kmeans++"
res_random <- mbkmeans(sce1000, clusters = 5,
                       reduceMethod = NA,
                       whichAssay = "logcounts",
                       initializer = "random")
table(res$Clusters, res_random$Clusters)
##
## 1 2 3 4 5
## 1 0 0 0 0 1
## 2 0 0 1 0 0
## 3 4 0 23 0 0
## 4 0 0 0 0 4
## 5 0 5 0 14 48
Note that large values of init_fraction will result in that proportion of the data being held in memory and used in the initialization of the kmeans++ algorithm. Therefore, large values of init_fraction will require a great deal of memory, potentially more than that used by the actual clustering algorithm. Indeed, keeping the fraction constant for larger and larger datasets will result in commensurately larger numbers of cells being used in the initialization, increasing the memory usage (similar to the issues discussed above for batch size). Therefore, we recommend that users keep the default and make changes to the batch_size parameter as desired, ensuring consistent memory usage across both the initialization stage and the actual clustering stage of the algorithm. As mentioned in the discussion above on batch_size, for this reason we would recommend a batch_size on the larger end of the range suggested as reasonable in (Hicks et al. 2020), so as to improve the initialization.
mbkmeans with multiple values of k
The main parameter to set in k-means and its variants is the number of clusters k.
mbkmeans is quick enough to rerun the clustering algorithm with different numbers of clusters, even on very large datasets.
Here, we apply mbkmeans with k from 5 to 15 and select the number of clusters using the elbow method (i.e., the value that corresponds to the point of inflection of the curve). We note that this is just a rule of thumb for selecting the number of clusters, and many different methods exist to decide on an appropriate value of k. To speed up the computations, we will cluster on the top 20 PCs.
sce1000 <- runPCA(sce1000, ncomponents=20)
ks <- seq(5, 15)
res <- lapply(ks, function(k) {
mbkmeans(sce1000, clusters = k,
reduceMethod = "PCA",
calc_wcss = TRUE, num_init=10)
})
wcss <- sapply(res, function(x) sum(x$WCSS_per_cluster))
plot(ks, wcss, type = "b")
From the plot, it seems that k = 12 could be a reasonable value.
Note that if we set init_fraction = 1, initializer = "random", and batch_size = ncol(x), we recover the classic k-means algorithm.
res_full <- mbkmeans(sce1000, clusters = 5,
reduceMethod = NA,
whichAssay = "logcounts",
initializer = "random",
batch_size = ncol(sce1000))
res_classic <- kmeans(t(logcounts(sce1000)),
centers = 5)
table(res_full$Clusters, res_classic$cluster)
##
## 1 2 3 4 5
## 1 0 7 0 7 0
## 2 0 0 14 43 0
## 3 2 0 0 0 0
## 4 16 0 0 0 10
## 5 0 0 0 1 0
Note, however, that since the two algorithms start from different random initializations, they will not always converge to the same solution. Furthermore, if the algorithm uses batches smaller than the full dataset (batch_size < ncol(x)), then mbkmeans will not converge to the same result as kmeans even with the same initialization. This is due to the fact that kmeans and mbkmeans have different search paths for an optimum and the problem is non-convex.
For a comparison of memory usage, speed, and accuracy of mbkmeans with kmeans for the non-trivial case where the algorithm uses batches, see (Hicks et al. 2020), where it is demonstrated that mbkmeans provides significant speed improvements and uses less memory than kmeans, without loss of accuracy.
The bluster package provides a flexible and extensible framework for clustering in Bioconductor packages/workflows. The clusterRows() function controls dispatch to different clustering algorithms. In this case, we will use the mini-batch k-means algorithm to cluster cells into cell populations based on their principal component analysis (PCA) coordinates.
## [1] 100 20
In the following three scenarios, we pass the number of cluster centers using the centers = 5 argument. In addition, we can also pass other mbkmeans arguments (e.g., batch_size = 10). In the first two cases, we only return the cluster labels; in the third scenario, we ask for the full mbkmeans output.
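The code chunks producing the outputs below are not shown in this extract; they likely resembled the following sketch, which uses the MbkmeansParam class that mbkmeans provides for dispatch via bluster (our reconstruction; consult ?MbkmeansParam and ?clusterRows for the exact interfaces):

```r
library(bluster)

mat <- reducedDim(sce1000, "PCA")   # cells in rows, PCs in columns

## scenario 1: cluster labels only
clusterRows(mat, MbkmeansParam(centers = 5))

## scenario 2: labels only, passing an extra mbkmeans argument
clusterRows(mat, MbkmeansParam(centers = 5, batch_size = 10))

## scenario 3: also return the full mbkmeans output
clusterRows(mat, MbkmeansParam(centers = 5, batch_size = 10), full = TRUE)
```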
## [1] 2 5 5 3 5 4 3 1 1 3 5 2 5 1 1 1 3 5 1 3 2 1 5 5 1 3 1 1 5 1 3 5 4 1 2 1 4
## [38] 5 5 4 5 4 5 4 5 3 1 3 5 5 5 2 1 3 2 5 4 1 4 3 5 4 1 1 3 5 2 4 5 4 1 5 4 1
## [75] 5 3 1 5 3 1 2 1 4 1 2 5 4 5 1 1 5 1 4 2 4 5 2 5 1 2
## Levels: 1 2 3 4 5
## [1] 3 3 3 3 3 3 3 5 5 3 3 1 3 5 5 5 3 3 5 3 4 5 3 3 5 3 5 5 3 5 3 3 4 5 1 5 4
## [38] 3 3 3 3 3 3 3 3 3 5 3 3 3 3 1 5 3 1 3 3 5 4 3 3 4 5 5 3 3 3 3 3 3 2 3 3 5
## [75] 3 3 5 3 3 5 4 5 3 5 1 3 4 3 5 5 3 5 3 3 3 3 3 3 5 4
## Levels: 1 2 3 4 5
## $clusters
## [1] 5 5 5 5 5 5 5 1 1 5 5 2 5 1 1 1 5 5 3 5 5 1 5 5 1 5 1 1 5 1 5 5 5 1 2 1 5
## [38] 5 5 5 5 5 5 5 5 5 1 5 5 5 5 2 1 5 2 5 5 1 5 5 5 5 1 1 5 5 5 5 5 5 1 5 5 1
## [75] 5 5 1 5 5 1 5 1 5 1 2 5 5 5 1 1 5 1 5 5 5 5 4 5 1 5
## Levels: 1 2 3 4 5
##
## $objects
## $objects$mbkmeans
## $objects$mbkmeans$centroids
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] -15.010417 -0.7962079 -4.968180 -2.7679840 -0.605855 -0.2194566 -1.4958129
## [2,] 1.398788 15.2268393 4.914318 -8.7031508 -0.624902 -2.6562225 -0.8620272
## [3,] -17.543300 -0.3406707 -6.138011 -4.3145411 1.079512 0.6031462 3.6588221
## [4,] 3.505305 9.0542211 4.008002 -0.8464335 -1.494348 -5.7764435 -3.4572454
## [5,] 4.953126 -5.2662492 1.885661 0.1000841 1.303848 -0.6517419 -0.3327930
## [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] -0.6686381 0.7806568 -1.1359341 1.1528491 -1.407013 1.5457278
## [2,] -3.2219319 -3.0456175 -0.2216553 -1.8928053 5.403923 -2.9907297
## [3,] 8.2619290 -4.0679099 -3.2909212 5.7171778 8.625866 4.8241775
## [4,] -3.3455593 -2.0061982 -1.3108509 -9.4799142 1.971986 9.3333830
## [5,] -0.4513868 -1.7769361 -1.2359026 0.4449523 0.151892 -0.9781383
## [,14] [,15] [,16] [,17] [,18] [,19]
## [1,] 1.6981873 3.2334236 0.7828917 0.11174810 -0.7921813 2.095090
## [2,] -1.0326219 -0.3752647 -2.2624371 2.48013343 -2.6617969 -2.687306
## [3,] 0.2712364 -1.0289170 5.9773702 -2.34843263 -1.5557200 -3.698316
## [4,] 0.8668148 1.5455810 1.8839526 -4.15320464 -2.9828210 2.413375
## [5,] 2.0870510 -1.4041611 0.3987755 -0.06157393 -0.6216171 -0.435778
## [,20]
## [1,] -0.12527754
## [2,] -2.21443762
## [3,] 2.67774700
## [4,] 0.06607878
## [5,] -0.77234827
##
## $objects$mbkmeans$WCSS_per_cluster
## numeric(0)
##
## $objects$mbkmeans$best_initialization
## [1] 1
##
## $objects$mbkmeans$iters_per_initialization
## [,1]
## [1,] 17
##
## $objects$mbkmeans$Clusters
## [1] 5 5 5 5 5 5 5 1 1 5 5 2 5 1 1 1 5 5 3 5 5 1 5 5 1 5 1 1 5 1 5 5 5 1 2 1 5
## [38] 5 5 5 5 5 5 5 5 5 1 5 5 5 5 2 1 5 2 5 5 1 5 5 5 5 1 1 5 5 5 5 5 5 1 5 5 1
## [75] 5 5 1 5 5 1 5 1 5 1 2 5 5 5 1 1 5 1 5 5 5 5 4 5 1 5
While clustering will often be performed on in-memory data (for example, after PCA), the clusterRows() function also accepts any matrix-like object (in-memory or on-disk data representation) that mbkmeans accepts.
For example, we can cluster using mbkmeans on the logcounts assay of the sce1000 object (similar to above), which is a DelayedMatrix object from the DelayedArray package.
Note: We transpose the matrix, as the clusterRows function expects observations along the rows and variables along the columns.
## <1000 x 100> sparse DelayedMatrix object of type "double":
## [,1] [,2] [,3] ... [,99] [,100]
## ENSG00000090382 0.000000 1.482992 0.000000 . 5.4330604 0.0000000
## ENSG00000204287 2.198815 0.000000 0.000000 . 5.3387971 0.0000000
## ENSG00000163220 1.762993 1.482992 0.000000 . 2.1176311 2.0642508
## ENSG00000143546 0.000000 1.135438 1.113099 . 0.3825221 0.0000000
## ENSG00000019582 3.402730 1.762861 1.733874 . 6.2009459 2.0642508
## ... . . . . . .
## ENSG00000145779 0.0000000 0.6766523 0.0000000 . 0.9342195 0.0000000
## ENSG00000106948 0.0000000 1.1354382 0.0000000 . 0.9342195 1.3735556
## ENSG00000155660 0.0000000 0.6766523 0.0000000 . 0.3825221 1.3735556
## ENSG00000117748 0.0000000 0.0000000 0.0000000 . 0.0000000 0.0000000
## ENSG00000135968 0.0000000 0.0000000 0.0000000 . 0.3825221 0.0000000
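The hidden chunk that produced the labels below is not included in this extract; a call along the following lines would be typical (our reconstruction, with centers = 4 chosen to match the four levels in the output):

```r
## t() returns a transposed DelayedMatrix; the data stay on disk
clusterRows(t(logcounts(sce1000)), MbkmeansParam(centers = 4))
```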
## [1] 2 2 2 1 2 2 1 3 3 1 2 2 2 3 3 3 1 2 3 1 2 3 2 2 3 1 3 3 2 3 1 2 2 3 2 3 2
## [38] 2 2 2 2 2 2 2 2 1 3 1 2 2 2 2 3 1 2 4 2 3 2 1 2 2 3 3 1 2 2 2 2 2 3 2 2 3
## [75] 2 1 3 2 1 3 2 3 2 3 2 2 2 2 3 3 2 3 2 2 2 2 2 2 3 2
## Levels: 1 2 3 4
This vignette focused on the use of the mbkmeans package. For users interested in learning more about working with HDF5 files and on-disk data in Bioconductor, we recommend the excellent DelayedArray workshop presented by Peter Hickey at Bioc2020.