Title: | A fast hierarchical graph-based clustering method |
---|---|
Description: | HGC (short for Hierarchical Graph-based Clustering) is an R package for conducting hierarchical clustering on large-scale single-cell RNA-seq (scRNA-seq) data. The key idea is to construct a dendrogram of cells on their shared nearest neighbor (SNN) graph. HGC provides functions for building graphs and for conducting hierarchical clustering on the graph. The users with old R version could visit https://github.com/XuegongLab/HGC/tree/HGC4oldRVersion to get HGC package built for R 3.6. |
Authors: | Zou Ziheng [aut], Hua Kui [aut], XGlab [cre, cph] |
Maintainer: | XGlab <[email protected]> |
License: | GPL-3 |
Version: | 1.15.0 |
Built: | 2024-11-29 05:55:36 UTC |
Source: | https://github.com/bioc/HGC |
This function builds a Continuous K Nearest Neighbor (CKNN) graph in the input feature space using Euclidean distance metric.
CKNN.Construction(mat, k, delta)
CKNN.Construction(mat, k, delta)
mat |
the input data saved as a numerical matrix. The columns are the features and the rows are the samples. |
k |
the number of nearest neighbors for building the CKNN graph. |
delta |
the parameter related with the distance threshold. |
This function fist built a KNN graph from the input data. Then the CKNN
graph is built from the KNN graph. For node i and node j in
the KNN graph, CKNN will link them if the distance d(i,j) between node
i and node j is less than times of the geometric mean of
d_k(i) and d_k(j). Here
is the parameter, d_k(i) and d_k(j)
are the distances from node i or node j to their k nearest neighbor.
An n by n binary dgCMatrix object C, where n is the number of input samples. The matrix C is the adjacency matrix of the built CKNN graph. C[i,j] = 1 means that there is an edge between sample i and sample j.
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- CKNN.Construction(Pollen.PCs)
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- CKNN.Construction(Pollen.PCs)
The function runs hierarchical clustering with HGC.dendrogram
on the SNN or KNN calculated by the Seurat
pipeline. The output clustering tree is also packaged in the
Seurat
object.
FindClusteringTree(object, graph.type)
FindClusteringTree(object, graph.type)
object |
The Seurat object containing the graphs built with scRNA-seq data. |
graph.type |
The type of graphs used for the hierarchical clustering, could be "SNN" or "KNN". The default value is "SNN". |
For the KNN graph, we symmetrize it by adding its transposition on
the graph. And for the details of data preprocessing and graph construction
by Seurat
, please check the Seurat
vignettes.
An Seurat
object. The clustering tree is saved under the item
graphs
, i.e. object@graphs$ClusteringTree
.
The function needs the R package Seurat
. We recommend that the
version of Seurat
is higher than version 3.0.
## Do not run # require(Seurat) ## DemoData is a input gene expression matrix. # DemoData.seuratobj <- CreateSeuratObject(counts = DemoData, # min.cells = 20) # DemoData.seuratobj <- NormalizeData(object = DemoData.seuratobj, # verbose = F) # DemoData.seuratobj <- ScaleData(object = DemoData.seuratobj, # features = row.names(DemoData.seuratobj), # verbose = F) # DemoData.seuratobj <- FindVariableFeatures(object = DemoData.seuratobj, # nfeatures = 2000, verbose = F) # DemoData.seuratobj <- RunPCA(object = DemoData.seuratobj, # npcs = 100, verbose = F) # DemoData.seuratobj <- FindNeighbors(object = DemoData.seuratobj, # nn.eps = 0.5, k.param = 30, # dims = 1:25, verbose = F) # DemoData.seuratobj <- FindClusteringTree(object = DemoData.seuratobj, # graph.type = "SNN")
## Do not run # require(Seurat) ## DemoData is a input gene expression matrix. # DemoData.seuratobj <- CreateSeuratObject(counts = DemoData, # min.cells = 20) # DemoData.seuratobj <- NormalizeData(object = DemoData.seuratobj, # verbose = F) # DemoData.seuratobj <- ScaleData(object = DemoData.seuratobj, # features = row.names(DemoData.seuratobj), # verbose = F) # DemoData.seuratobj <- FindVariableFeatures(object = DemoData.seuratobj, # nfeatures = 2000, verbose = F) # DemoData.seuratobj <- RunPCA(object = DemoData.seuratobj, # npcs = 100, verbose = F) # DemoData.seuratobj <- FindNeighbors(object = DemoData.seuratobj, # nn.eps = 0.5, k.param = 30, # dims = 1:25, verbose = F) # DemoData.seuratobj <- FindClusteringTree(object = DemoData.seuratobj, # graph.type = "SNN")
Hierarchical clustering on a given undirected graph.
HGC.dendrogram(G)
HGC.dendrogram(G)
G |
an object which represents the adjacency matrix of the graph, where G[i,j] is the weight of the edge between node i and node j, and zero means no link. The supported data structures include |
The function runs a hierarchical clustering on the given graph. It is a recursive procedure of two steps, first, the node pair sampling ratio is used as the distance metric to search the nearest neighbor pairs. Then the neighbor pair are merged and the graph is updated. The whole procedure is accelerated using the nearest neighbor chain algorithm. The algorithm stops when there's only one node left in the updated graph.
An object of class hclust
defined by the hclust
function
in the stats
package. It is a list containing the clustering tree
information with the components:
merge |
an n-1 by 2 matrix. It records the two nodes in each merging step. |
height |
a set of n-1 real values. It is the height of the non-leaf nodes in the tree. |
order |
a vector giving the permutation of the original observations suitable for plotting. |
labels |
labels for the objects being clustered. Same as the rownames of G in default. |
call |
the call which produced the result. |
method |
the cluster method that has been used. |
dist.method |
the distance used here. |
More details about the components are in the hclust.
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- SNN.Construction(Pollen.PCs, 25, 0.15) tree = HGC.dendrogram(G)
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- SNN.Construction(Pollen.PCs, 25, 0.15) tree = HGC.dendrogram(G)
This function records and outputs the length of the nearest neighbor
chain and the average neighbor
number in each iteration of hierarchical clustering. These values can
be used for the time complexity analysis of HGC.dendrogram
.
HGC.parameter(G)
HGC.parameter(G)
G |
an undirected graph saved as a dgCMatrix. The matrix G is the adjacency matrix of the graph, and element G[i,j] is the weight of the edge between node i and node j. Zeros in the matrix mean no link between nodes here. |
This function contains the whole function of HGC.dendrogram
, but will
record the key parameters during the whole process. The function is provided
for advanced users to conduct time complexity analysis on their own data.
The construction of the dendrogram is a recursive procedure of two steps:
1.finding the nearest neighbour pair, 2. merge the node pair and update the
graph. For different data structures of graph, there's a trade-off between
the time consumptions of the two steps. Generally speaking, storing more
information about the graph makes it faster to find the nearest neighbour
pair (step 1) but slower to update the graph (step 2). We have experimented
serval datasets and chosen the best data structure in HGC.dendrogram
for the overall efficiency.
A 2 by m matrix. The two rows of the matrix are the nearest neighbor chain length and the average neighbor number. m is equal to n-s, where s is the number of unconnected parts in the graph.
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- SNN.Construction(Pollen.PCs, 25, 0.15) record = HGC.parameter(G)
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- SNN.Construction(Pollen.PCs, 25, 0.15) record = HGC.parameter(G)
The function cut the dendrogram into specific clusters at different levels and compared the clusterings with given labels using Adjusted Rand Index (ARI)
HGC.PlotARIs(tree, k.min, k.max, labels, return.ARI)
HGC.PlotARIs(tree, k.min, k.max, labels, return.ARI)
tree |
the input clustering tree saved as |
k.min |
the minimum number to cut the tree. |
k.max |
the maximum number to cut the tree. |
labels |
a data frame or a matrix to store the label information. Different labels should be in different columns and the users should name the columns correspondingly. |
return.ARI |
a bool variable to choose whether output the ARI matrix. |
ARI is a widely used index to evaluate the consistence between two partitions
of the same samples. This function will first cut a given tree into specific
number of clusters using the function cutree
. Then it calculates the
ARIs between the clustering result and the given labels
with the help of R package mclust
. The function does such cutting and
calculation for different ks between k.min and k.max. Finally it visualize
these results using a line chart. ARIs with different labels are shown as
different lines with different colors in the figure.
A line chart will be drawn and a matrix of the ARIs will be returned.
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] Pollen.Label.Tissue <- Pollen[["Tissue"]] Pollen.Label.CellLine <- Pollen[["CellLine"]] Pollen.SNN <- SNN.Construction(Pollen.PCs) Pollen.ClusteringTree <- HGC.dendrogram(G = Pollen.SNN) Pollen.labels <- data.frame(Tissue = Pollen.Label.Tissue, CellLine = Pollen.Label.CellLine) HGC.PlotARIs(tree = Pollen.ClusteringTree, k.min = 2, k.max = 15, labels = Pollen.labels)
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] Pollen.Label.Tissue <- Pollen[["Tissue"]] Pollen.Label.CellLine <- Pollen[["CellLine"]] Pollen.SNN <- SNN.Construction(Pollen.PCs) Pollen.ClusteringTree <- HGC.dendrogram(G = Pollen.SNN) Pollen.labels <- data.frame(Tissue = Pollen.Label.Tissue, CellLine = Pollen.Label.CellLine) HGC.PlotARIs(tree = Pollen.ClusteringTree, k.min = 2, k.max = 15, labels = Pollen.labels)
The function will plot the dendrogram, with different colors for different clusters.
HGC.PlotDendrogram(tree, k, plot.label, labels)
HGC.PlotDendrogram(tree, k, plot.label, labels)
tree |
the input clustering tree saved as |
k |
the number of clusters to cut the tree into. |
plot.label |
a bool variable. It decides whether the function will add color bars. |
labels |
a data frame or a matrix to store the label information. Different labels should be in different columns and the users should name the columns correspondingly. The label information will show in the figure as color bars below the dendrogram, and each label takes one color bar. |
The function plots the clustering tree, with alternative colors showing
the clustering results and the label information. It is based on the R
package dendextend
which contains many parameters for the
visualization. For users' convenience, most of the parameters are set to
be the default value. Advanced users could
visit the vignette of dendextend
for more flexible visualization.
The function will return 1 if the dendrogram is successfully drawn.
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] Pollen.Label.Tissue <- Pollen[["Tissue"]] Pollen.Label.CellLine <- Pollen[["CellLine"]] Pollen.SNN <- SNN.Construction(Pollen.PCs) Pollen.ClusteringTree <- HGC.dendrogram(G = Pollen.SNN) Pollen.labels <- data.frame(Tissue = Pollen.Label.Tissue, CellLine = Pollen.Label.CellLine) HGC.PlotDendrogram(tree = Pollen.ClusteringTree, k = 5, plot.label = TRUE, labels = Pollen.labels)
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] Pollen.Label.Tissue <- Pollen[["Tissue"]] Pollen.Label.CellLine <- Pollen[["CellLine"]] Pollen.SNN <- SNN.Construction(Pollen.PCs) Pollen.ClusteringTree <- HGC.dendrogram(G = Pollen.SNN) Pollen.labels <- data.frame(Tissue = Pollen.Label.Tissue, CellLine = Pollen.Label.CellLine) HGC.PlotDendrogram(tree = Pollen.ClusteringTree, k = 5, plot.label = TRUE, labels = Pollen.labels)
The function visualizes the parameter output from HGC.parameter
.
HGC.PlotParameter(record, parameter)
HGC.PlotParameter(record, parameter)
record |
the input record matrix of parameters from |
parameter |
a string with alternatives "CL" or "ANN". Choose "CL" to plot the chain lengths and "ANN" to plot the average neighbor number. |
The chain length(CL) and average neighbor number(ANN) are key factors
related
with the time complexity of clustering by HGC
. The function
provides the
visualization of them.
The function will return 1 if the dendrogram is successfully drawn.
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] Pollen.SNN <- SNN.Construction(Pollen.PCs) Pollen.ParameterRecord <- HGC.parameter(G = Pollen.SNN) HGC.PlotParameter(Pollen.ParameterRecord, parameter = "CL") HGC.PlotParameter(Pollen.ParameterRecord, parameter = "ANN")
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] Pollen.SNN <- SNN.Construction(Pollen.PCs) Pollen.ParameterRecord <- HGC.parameter(G = Pollen.SNN) HGC.PlotParameter(Pollen.ParameterRecord, parameter = "CL") HGC.PlotParameter(Pollen.ParameterRecord, parameter = "ANN")
This function builds an Unweighted K Nearest Neighbor (KNN) graph in the input feature space using Euclidean distance metric.
KNN.Construction(mat, k)
KNN.Construction(mat, k)
mat |
the input data saved as a numerical matrix. The columns are the features and the rows are the samples. |
k |
the number of nearest neighbors for building the KNN graph. |
This function builds a KNN graph of the input data. The main function
comes from the R package RANN
.
An n by n binary dgCMatrix object C, where n is the number of input samples. The matrix C is the adjacency matrix of the built KNN graph. C[i,j] = 1 means that there is an edge between sample i and sample j.
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- KNN.Construction(Pollen.PCs)
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- KNN.Construction(Pollen.PCs)
This function builds an Unweighted Minimum Spanning Tree (MST) graph in the input feature space using Euclidean distance metric.
MST.Construction(mat)
MST.Construction(mat)
mat |
the input data saved as a numerical matrix. The columns are the features and the rows are the samples. |
This function builds a MST graph of the input data. The main
function come from the R package ape
.
An n by n binary dgCMatrix object C, where n is the number of input samples. The matrix C is the adjacency matrix of the built MST graph. C[i,j] = 1 means that there is an edge between sample i and sample j.
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- MST.Construction(Pollen.PCs)
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- MST.Construction(Pollen.PCs)
This function builds an Unweighted Perturbed Minimum Spanning Tree (PMST) graph in the input feature space using Euclidean distance metric.
PMST.Construction(mat, iter, r)
PMST.Construction(mat, iter, r)
mat |
the input data saved as a numerical matrix. The columns are the features and the rows are the samples. |
iter |
the number of perturbation. |
r |
the parameter about the strength of the perturbation. |
The function builds a PMST graph of the input data. PMST is the combination of a number of MSTs, which are built in the perturbed data spaces.
An n by n binary dgCMatrix object C, where n is the number of input samples. The matrix C is the adjacency matrix of the built PMST graph. C[i,j] = 1 means that there is an edge between sample i and sample j.
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- PMST.Construction(Pollen.PCs)
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- PMST.Construction(Pollen.PCs)
The dataset is the low dimensional principal components and labels of cells from the Pollen dataset. The 301 cells are from 11 different cell lines and are classified into 4 tissues.
An object of class list
.
The list
contains three elements: First, a matrix with
301 rows and 25 columns, saved as Pollen[["PCs"]]
. The
rows are cells and columns are principal components. Second,
the label in tissue level, saved as a vector in
Pollen[["Tissue"]]
.
Third, the label in cell line level, saved as a vector in
Pollen[["CellLine"]]
.
https://www.nature.com/articles/nbt.2967
Nearest Neighbor Graph
This function builds an Nearest Neighbor graph in the
input feature space using Euclidean distance metric.
RNN.Construction(mat, max_dist)
RNN.Construction(mat, max_dist)
mat |
the input data saved as a numerical matrix. The columns are the features and the rows are the samples. |
max_dist |
the threshold distance. The edges whose lengths are less than
|
The function builds an Nearest Neighbor graph which
saved as a sparse matrix.
An n by n binary dgCMatrix object C, where n is the number of input samples. The matrix C is the adjacency matrix of the built RNN graph. C[i,j] = 1 means that there is an edge between sample i and sample j.
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- RNN.Construction(Pollen.PCs, 20)
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- RNN.Construction(Pollen.PCs, 20)
This function builds a Shared Nearest Neighbor (SNN) graph in the input feature space using Euclidean distance metric.
SNN.Construction(mat, k, threshold)
SNN.Construction(mat, k, threshold)
mat |
the input data saved as a numerical matrix. The columns are the features and the rows are the samples. |
k |
the number of nearest neighbor number to build the original KNN. |
threshold |
the threshold parameter for the Jaccard index. The edges in KNN whose Jaccard indices are lower than it will be removed in building the SNN. |
The function builds an SNN which saved as a sparse matrix.
An n by n binary dgCMatrix object C, where n is the number of input samples. The matrix C is the adjacency matrix of the built SNN graph. C[i,j] = 1 means that there is an edge between sample i and sample j.
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- SNN.Construction(Pollen.PCs)
data(Pollen) Pollen.PCs <- Pollen[["PCs"]] G <- SNN.Construction(Pollen.PCs)