HGC
(short for Hierarchical Graph-based Clustering) is
an R package for conducting hierarchical clustering on large-scale
single-cell RNA-seq (scRNA-seq) data. The key idea is to construct a
dendrogram of cells on their shared nearest neighbor (SNN) graph.
HGC
provides functions for building cell graphs and for
conducting hierarchical clustering on the graph. Experiments on
benchmark datasets showed that HGC
can reveal the
hierarchical structure underlying the data, achieve state-of-the-art
clustering accuracy and has better scalability to large single-cell
data. For more information, please refer to the preprint of
HGC
on bioRxiv.
HGC
could be installed from Bioconductor.
The users could also get the newest version from Github.
HGC
takes a matrix as input where row represents cells
and column represents features. Preprocessing steps like normalization
and dimension reduction are necessary so that the constructed graph can
capture the manifold underlying the single-cell data. We recommend users
to follow the standard preprocessing steps in Seurat
.
As a demo input, we stored the top 25 principal components of the Pollen
dataset (Pollen et
al.) in HGC
. The dataset contains 301 cells with two
known labels: labels at the tissue level and the cell line level.
library(HGC)
data(Pollen)
Pollen.PCs <- Pollen[["PCs"]]
Pollen.Label.Tissue <- Pollen[["Tissue"]]
Pollen.Label.CellLine <- Pollen[["CellLine"]]
dim(Pollen.PCs)
## [1] 301 25
## Pollen.Label.Tissue
## blood dermal neural pluripotent
## 113 99 65 24
## Pollen.Label.CellLine
## 2338 2339 BJ GW16 GW21 GW21+3 hiPSC HL60 K562 Kera NPC
## 22 17 37 26 7 17 24 54 42 40 15
There are two major steps for conducting the hierarchical clustering
with HGC
: the graph construction step and the dendrogram
construction step. HGC
provides functions for building a
group of graphs, including the k-nearest neighbor graph (KNN), the
shared nearest neighbor graph (SNN), the continuous k-nearest neighbor
graph (CKNN), etc. These graphs are saved as dgCMatrix
supported by R package Matrix
. Then HGC
can
directly build a hierarchical tree on the graph. A self-built graph or
graphs from other pipelines stored as dgCMatrix
are also
supported.
Pollen.SNN <- SNN.Construction(mat = Pollen.PCs, k = 25, threshold = 0.15)
Pollen.ClusteringTree <- HGC.dendrogram(G = Pollen.SNN)
The output of HGC
is a standard tree following the data
structure hclust()
in R package stats
. The
tree can be cut into specific number of clusters with the function
cutree
.
HGC
provides user-friendly functions to run hierarchical
clustering in the existing pipelines, like Seurat
,
scran
, etc. The section will provide the corresponding
guides.
The functions FindClusteringTree
and
HGC.dendrogram
could read the graphs calculated in the
pipelines. Then they build the dendrograms and output/save the trees. We
will try our best to support the applications of HGC
in
more pipelines.
The Seurat
package is one popular scRNA-seq data processing workflow. It is
designed for QC, analysis and exploration of scRNA-seq data.
Seurat
contains the graph-based clustering methods
Louvain, SLM and Leiden to find the cell clusters. They all run on the
graph built by the function FindNeighbors
in
Seurat
.
Here we provide a guide to run FindClusteringTree
in
Seurat
pipeline using the SNN/KNN graph calculated by
Seurat
. The data comes from the “pbmc3k_tutorial”
of Seurat
. We follow the tutorial to run QC, preprocessing,
dimension reduction and SNN graph construction. Then we run HGC in the
calculated graph with one order.
library(dplyr)
library(Seurat)
library(patchwork)
library(HGC)
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir =
"../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k",
min.cells = 3, min.features = 200)
# QC and selecting cells for further analysis
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 &
nFeature_RNA < 2500 & percent.mt < 5)
# Normalizing the data
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize",
scale.factor = 10000)
# Identification of highly variable features (feature selection)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst",
nfeatures = 2000)
# Scaling the data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
# Perform linear dimensional reduction
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# Determine the ‘dimensionality’ of the dataset
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
# Construct the graph and cluster the cells with HGC
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusteringTree(pbmc, graph.type = "SNN")
# Output the tree
pbmc.tree <- pbmc@graphs$ClusteringTree
The input of FindClusteringTree
is the
Seurat
object and the function outputs a
Seurat
object containing the dendrogram.
scran
is
a wildly used step-by-step workflow for low-level analysis of scRNA-seq
data. It builds SNN graph with the function buildSNNGraph
and saves the graph as igraph
object. The function
HGC.dendrogram
could run hierarchical clustering with the
igraph
object.
The igraph
package is a toolbox to do graph-related
calculations in R. It has the specific data structure to save graphs and
contains several graph-based clustering functions. Another pipeline OSCA
uses igraph
to cluster the cells, and
HGC.dendrogram
could also help.
Here we follow the tutorial of scran
and show how to use
the HGC.dendrogram
to build clustering tree on the
processed scRNA-seq data.
# Setting up the data
library(scRNAseq)
sce <- GrunPancreasData()
library(scuttle)
qcstats <- perCellQCMetrics(sce)
qcfilter <- quickPerCellQC(qcstats,
percent_subsets="altexps_ERCC_percent")
sce <- sce[,!qcfilter$discard]
library(scran)
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, clusters=clusters)
sce <- logNormCounts(sce)
# Variance modelling
dec <- modelGeneVar(sce)
plot(dec$mean, dec$total, xlab="Mean log-expression",
ylab="Variance")
curve(metadata(dec)$trend(x), col="blue", add=TRUE)
# Get the top 10% of genes.
top.hvgs <- getTopHVGs(dec, prop=0.1)
sce <- fixedPCA(sce, subset.row=top.hvgs)
reducedDimNames(sce)
# Automated PC choice
output <- getClusteredPCs(reducedDim(sce))
npcs <- metadata(output)$chosen
reducedDim(sce, "PCAsub") <-
reducedDim(sce, "PCA")[,1:npcs,drop=FALSE]
library(HGC)
# Graph construction
g <- buildSNNGraph(sce, use.dimred="PCAsub")
# Graph-based clustering
cluster.tree <- HGC.dendrogram(G = g)
cluster.k12 <- cutree(cluster.tree, k = 12)
colLabels(sce) <- factor(cluster.k12)
library(scater)
sce <- runTSNE(sce, dimred="PCAsub")
plotTSNE(sce, colour_by="label", text_by="label")
The input of HGC.dendrogram
is the graph saved as
igraph
object, and the output is the tree saved as
hclust
object. The document of HGC.dendrogram
contains more details.
With various published methods in R, results of HGC
can
be visualized easily. Here we use the R package dendextend
as an example to visualize the results on the Pollen dataset. The tree
has been cut into five clusters. And for a better visualization, the
height of the tree has been log-transformed.
Pollen.ClusteringTree$height = log(Pollen.ClusteringTree$height + 1)
Pollen.ClusteringTree$height = log(Pollen.ClusteringTree$height + 1)
HGC.PlotDendrogram(tree = Pollen.ClusteringTree,
k = 5, plot.label = FALSE)
## [1] 1
We can also add a colour bar of the known label under the dendrogram as a comparison of the achieved clustering results.
Pollen.labels <- data.frame(Tissue = Pollen.Label.Tissue,
CellLine = Pollen.Label.CellLine)
HGC.PlotDendrogram(tree = Pollen.ClusteringTree,
k = 5, plot.label = TRUE,
labels = Pollen.labels)
## [1] 1
For datasets with known labels, the clustering results of
HGC
can be evaluated by comparing the consistence between
the known labels and the achieved clusters. Adjusted Rand Index (ARI) is
a wildly used statistics for this purpose. Here we calculate the ARIs of
the clustering results at different levels of the dendrogram with the
two known labels.
Our work shows that the dendrogram construction in HGC
has a linear time complexity. For advanced users, HGC
provides functions to conduct time complexity analysis on their own
data. The construction of the dendrogram is a recursive procedure of two
steps: 1. find the nearest neighbour pair, 2. merge the node pair and
update the graph. For different data structures of graph, there’s a
trade-off between the time consumptions of the two steps. Generally
speaking, storing more information about the graph makes it faster to
find the nearest neighbour pair (step 1) but slower to update the graph
(step 2). We have experimented several datasets and chosen the best data
structure for the overall efficiency.
The key parameters related to the time consumptions of the two steps
are the length of the nearest neighbor chains and the number of nodes
needed to be updated in each iteration, respectively (for more details,
please refer to our preprint).HGC
provides functions to record and visualize these parameters.
Pollen.ParameterRecord <- HGC.parameter(G = Pollen.SNN)
HGC.PlotParameter(Pollen.ParameterRecord, parameter = "CL")
## [1] 1
## [1] 1