ILoReg
is a tool for cell population identification from
single-cell RNA-seq (scRNA-seq) data. In our paper [1], we showed that
ILoReg
was able to identify, by both unsupervised
clustering and visually, rare cell populations that other scRNA-seq data
analysis pipelines were unable to identify.
The figure below illustrates the workflows of ILoReg
and
a typical pipeline that applies feature selection prior to
dimensionality reduction by principal component analysis (PCA).
In contrast to most scRNA-seq data analysis pipelines,
ILoReg
does not reduce the dimensionality of the gene
expression matrix by feature selection. Instead, it performs
probabilistic feature extraction using iterative clustering
projection (ICP), generating an N × k -dimensional
probability matrix, which contains probabilities of each of the N cells belonging to the k clusters. ICP is a novel
self-supervised learning algorithm that iteratively seeks a clustering
with k clusters that maximizes
the adjusted Rand index (ARI) between the clustering C and its projection C′ by L1-regularized logistic
regression. In the ILoReg consensus approach, ICP is run L times and the L probability matrices are merged
into a joint probability matrix and subsequently transformed by
principal component analysis (PCA) into a lower dimensional (N × p) matrix (consensus
matrix). The final clustering step is performed using hierarhical
clustering by the Ward’s method, after which the user can extract a
clustering with K consensus
clusters. However, the user can also use any other clustering method at
this point. Two-dimensional visualization is supported using two popular
nonlinear dimensionality reduction methods: t-distributed
stochastic neighbor embedding (t-SNE) and uniform manifold approximation
and projection (UMAP). Additionally, ILoReg provides user-friendly
functions that enable identification of differentially expressed (DE)
genes and visualization of gene expression.
ILoReg
can be downloaded from Bioconductor and installed
by executing the following command in the R console.
In the following, we go through the different steps of
ILoReg
’s workflow and demonstrate it using a peripheral
blood mononuclear cell (PBMC) dataset. The toy dataset included in the
ILoReg
R package (pbmc3k_500
) contains 500
cells that have been downsampled from the pbmc3k dataset [2]. The
preprocessing was rerun with a newer reference genome (GRCh38.p12) and
Cell Ranger v2.2.0 [3] to identify different immunoglobulin
subpopulations in B-cells.
The only required input for ILoReg
is a log-transformed,
normalized gene expression matrix that has been, with genes/features in
rows and cells/samples in columns. The input can be of
matrix
, data.frame
or dgCMatrix
class, which is then transformed into a sparse object of
dgCMatrix
class. Please note that the method has been
designed to work with sparse data, i.e. with a high
proportion of zero values. If, for example, the features of your dataset
have been standardized, the run time and the memory usage of
ILoReg
will likely be much higher.
## Warning: replacing previous import 'Matrix::det' by 'SparseM::det' when loading
## 'ILoReg'
suppressMessages(library(SingleCellExperiment))
suppressMessages(library(cowplot))
# The dataset was normalized using the LogNormalize method from the Seurat R package.
sce <- SingleCellExperiment(assays = list(logcounts = pbmc3k_500))
sce <- PrepareILoReg(sce)
## Data in `logcounts` slot already of `dgCMatrix` class...
## 13865/13865 genes remain after filtering genes with only zero values.
Running ICP L times in parallel is the most computationally demanding part of the workflow.
In the following, we give a brief summary of the parameters.
As general guidelines on how to adjust the parameters, we recommend leaving r and L to their defaults (r = 5 and L = 200). However, increasing k from 15 to e.g. 30 can reveal new cell subsets that are of interest. Regarding d, increasing it to somewhere between 0.4-0.6 helps if the user wants lower resolution (less distinguishable populations). Increasing C from 0.3 to 1 reduces the number of distinguishable populations, as the logistic regression model filters out fewer genes.
# ICP is stochastic. To obtain reproducible results, use set.seed().
set.seed(1)
# Run ICP L times. This is the slowest step of the workflow,
# and parallel processing can be used to greatly speed it up.
sce <- RunParallelICP(object = sce, k = 15,
d = 0.3, L = 30,
r = 5, C = 0.3,
reg.type = "L1", threads = 0)
## | | | 0% | |== | 3% | |===== | 7% | |======= | 10% | |========== | 14% | |============ | 17% | |============== | 21% | |================= | 24% | |=================== | 28% | |====================== | 31% | |======================== | 34% | |=========================== | 38% | |============================= | 41% | |=============================== | 45% | |================================== | 48% | |==================================== | 52% | |======================================= | 55% | |========================================= | 59% | |=========================================== | 62% | |============================================== | 66% | |================================================ | 69% | |=================================================== | 72% | |===================================================== | 76% | |======================================================== | 79% | |========================================================== | 83% | |============================================================ | 86% | |=============================================================== | 90% | |================================================================= | 93% | |==================================================================== | 97% | |======================================================================| 100%
The L probability matrices are merged into a joint probability matrix, which is then transformed into a lower dimensionality by PCA. Before applying PCA, the user can optionally scale the cluster probabilities to unit-variance.
Optional: PCA requires the user to specify the number of principal components, for which we selected the default value p = 50. To aid in decision making, the elbow plot is commonly used to seek an elbow point, of which proximity the user selects p. In this case the point would be close to p = 10. Trying both a p that is close to the elbow point and the default p = 50 is recommended.
To visualize the data in two-dimensional space, nonlinear dimensionality reduction is performed using t-SNE or UMAP. The input data for this step is the N × p -dimensional consensus matrix.
Visualize the t-SNE and UMAP transformations using the
GeneScatterPlot
function, highlighting expression levels of
CD3D (T cells), CD79A (B cells), CST3
(monocytes, dendritic cells, platelets), FCER1A (myeloid
dendritic cells).
The N × p -dimensional consensus matrix is hierarchically clustered using the Ward’s method.
After the hierarchical clustering, the user needs to define how many
consensus clusters (K) to
extract from the tree dendrogram. The SelectKClusters
function enables extracting a consensus clustering with K clusters. Please note that the
clustering is overwritten every time the function is
called.
Next, we use the ClusteringScatterPlot
function to draw
the t-SNE and UMAP transformations and color each cell according to the
cluster labels.
# Use plot_grid function from the cowplot R package to combine the two plots into one.
plot_grid(ClusteringScatterPlot(sce,
dim.reduction.type = "umap",
return.plot = TRUE,
title = "UMAP",
show.legend=FALSE),
ClusteringScatterPlot(sce,
dim.reduction.type = "tsne",
return.plot = TRUE
,title="t-SNE",
show.legend=FALSE),
ncol = 1
)
TheILoReg
R package provides functions for the
identification of gene markers of clusters. This is accomplished by DE
analysis, where gene expression levels of the cells from each cluster
are compared against the rest of the cells. Currently, the only
supported statistical test is the the Wilcoxon rank-sum test (aka
Mann-Whitney U test). The p-values are corrected for multiple
comparisons using the Bonferroni method. To accelerate the analysis,
genes that are less likely to be DE can be filtered out prior to the
statistical testing using multiple criteria.
gene_markers <- FindAllGeneMarkers(sce,
clustering.type = "manual",
test = "wilcox",
log2fc.threshold = 0.25,
min.pct = 0.25,
min.diff.pct = NULL,
min.cells.group = 3,
return.thresh = 0.01,
only.pos = TRUE,
max.cells.per.cluster = NULL)
## -----------------------------------
## testing cluster 1
## 885 genes left after min.pct filtering
## 885 genes left after min.diff.pct filtering
## 230 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 2
## 873 genes left after min.pct filtering
## 873 genes left after min.diff.pct filtering
## 350 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 3
## 858 genes left after min.pct filtering
## 858 genes left after min.diff.pct filtering
## 313 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 4
## 973 genes left after min.pct filtering
## 973 genes left after min.diff.pct filtering
## 543 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 5
## 1366 genes left after min.pct filtering
## 1366 genes left after min.diff.pct filtering
## 782 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 6
## 1005 genes left after min.pct filtering
## 1005 genes left after min.diff.pct filtering
## 485 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 7
## 993 genes left after min.pct filtering
## 993 genes left after min.diff.pct filtering
## 519 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 8
## 918 genes left after min.pct filtering
## 918 genes left after min.diff.pct filtering
## 452 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 9
## 1180 genes left after min.pct filtering
## 1180 genes left after min.diff.pct filtering
## 622 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 10
## 1069 genes left after min.pct filtering
## 1069 genes left after min.diff.pct filtering
## 481 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 11
## 948 genes left after min.pct filtering
## 948 genes left after min.diff.pct filtering
## 223 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 12
## 879 genes left after min.pct filtering
## 879 genes left after min.diff.pct filtering
## 336 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 13
## 1654 genes left after min.pct filtering
## 1654 genes left after min.diff.pct filtering
## 931 genes left after log2fc.threshold filtering
## -----------------------------------
Select top 10 and 1 gene markers based on the log2 fold-change and the Bonferroni adjusted p-value.
top10_log2FC <- SelectTopGenes(gene_markers,
top.N = 10,
criterion.type = "log2FC",
inverse = FALSE)
top1_log2FC <- SelectTopGenes(gene_markers,
top.N = 1,
criterion.type = "log2FC",
inverse = FALSE)
top10_adj.p.value <- SelectTopGenes(gene_markers,
top.N = 10,
criterion.type = "adj.p.value",
inverse = TRUE)
top1_adj.p.value <- SelectTopGenes(gene_markers,
top.N = 1,
criterion.type = "adj.p.value",
inverse = TRUE)
Draw the t-SNE and UMAP transformations, highlighting expression levels of the top 1 gene markers based on the log2 fold-change.
GeneHeatmap
function enables visualizing gene markers in
a heatmap, where cells and genes are grouped by the clustering.
RenameAllClusters
enables renaming all clusters at
once.
sce <- RenameAllClusters(sce,
new.cluster.names = c("GZMK+/CD8+ T cells",
"IGKC+ B cells",
"Naive CD4+ T cells",
"NK cells",
"CD16+ monocytes",
"CD8+ T cells",
"CD14+ monocytes",
"IGLC+ B cells",
"Intermediate monocytes",
"IGKC+/IGLC+ B cells",
"Memory CD4+ T cells",
"Naive CD8+ T cells",
"Dendritic cells"))
Draw the gene heatmap again, but with the clusters renamed.
ILoReg
provides additional functionality for performing
tasks, which are sometimes required in scRNA-seq data analysis.
The optimal number of clusters can be estimated by calculating the
average silhouette value across the cells for a set of clusterings
within a range of different K
values (e.g. [2, 50]). The clustering
mathing to the highest average silhouette is saved to
clustering.optimal
slot. Therefore, the clustering acquired
using the SelectKClusters
function is not overwritten.
## optimal K: 5, average silhouette score: 0.418375297485223
# Select a clustering with K=5 clusters
sce <- SelectKClusters(sce,K=5)
# Generate a custom annotation with K=5 clusters and change the names to the five first alphabets.
custom_annotation <- plyr::mapvalues(metadata(sce)$iloreg$clustering.manual,c(1,2,3,4,5),LETTERS[1:5])
# Visualize the annotation
AnnotationScatterPlot(sce,
annotation = custom_annotation,
return.plot = FALSE,
dim.reduction.type = "tsne",
show.legend = FALSE)
sce <- SelectKClusters(sce,K=13)
sce <- RenameAllClusters(sce,
new.cluster.names = c("GZMK+/CD8+ T cells",
"IGKC+ B cells",
"Naive CD4+ T cells",
"NK cells",
"CD16+ monocytes",
"CD8+ T cells",
"CD14+ monocytes",
"IGLC+ B cells",
"Intermediate monocytes",
"IGKC+/IGLC+ B cells",
"Memory CD4+ T cells",
"Naive CD8+ T cells",
"Dendritic cells"))
# Identify DE genes between naive and memory CD4+ T cells
GM_naive_memory_CD4 <- FindGeneMarkers(sce,
clusters.1 = "Naive CD4+ T cells",
clusters.2 = "Memory CD4+ T cells",
logfc.threshold = 0.25,
min.pct = 0.25,
return.thresh = 0.01,
only.pos = TRUE)
## clustering.type='manual'testing cluster group.1
## 918 genes left after min.pct filtering
## 918 genes left after min.diff.pct filtering
## 242 genes left after logfc.threshold filtering
# Find gene markers for dendritic cells
GM_dendritic <- FindGeneMarkers(sce,
clusters.1 = "Dendritic cells",
logfc.threshold = 0.25,
min.pct = 0.25,
return.thresh = 0.01,
only.pos = TRUE)
## clustering.type='manual'testing cluster group.1
## 1654 genes left after min.pct filtering
## 1654 genes left after min.diff.pct filtering
## 931 genes left after logfc.threshold filtering
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] cowplot_1.1.3 SingleCellExperiment_1.29.1
## [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [5] GenomicRanges_1.59.1 GenomeInfoDb_1.43.1
## [7] IRanges_2.41.1 S4Vectors_0.45.2
## [9] BiocGenerics_0.53.3 generics_0.1.3
## [11] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [13] ILoReg_1.17.0 knitr_1.49
## [15] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] gridExtra_2.3 gld_2.6.6 readxl_1.4.3
## [4] rlang_1.1.4 magrittr_2.0.3 e1071_1.7-16
## [7] compiler_4.4.2 parallelDist_0.2.6 png_0.1-8
## [10] reshape2_1.4.4 vctrs_0.6.5 stringr_1.5.1
## [13] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
## [16] XVector_0.47.0 labeling_0.4.3 utf8_1.2.4
## [19] rmarkdown_2.29 haven_2.5.4 UCSC.utils_1.3.0
## [22] xfun_0.49 zlibbioc_1.52.0 cachem_1.1.0
## [25] jsonlite_1.8.9 DelayedArray_0.33.2 parallel_4.4.2
## [28] aricode_1.0.3 cluster_2.1.6 DescTools_0.99.58
## [31] R6_2.5.1 stringi_1.8.4 bslib_0.8.0
## [34] RColorBrewer_1.1-3 reticulate_1.40.0 boot_1.3-31
## [37] jquerylib_0.1.4 cellranger_1.1.0 Rcpp_1.0.13-1
## [40] iterators_1.0.14 snow_0.4-4 Matrix_1.7-1
## [43] tidyselect_1.2.1 rstudioapi_0.17.1 abind_1.4-8
## [46] yaml_2.3.10 viridis_0.6.5 codetools_0.2-20
## [49] doRNG_1.8.6 plyr_1.8.9 lattice_0.22-6
## [52] tibble_3.2.1 withr_3.0.2 askpass_1.2.1
## [55] evaluate_1.0.1 Rtsne_0.17 RcppParallel_5.1.9
## [58] proxy_0.4-27 pillar_1.9.0 BiocManager_1.30.25
## [61] rngtools_1.5.2 foreach_1.5.2 hms_1.1.3
## [64] ggplot2_3.5.1 munsell_0.5.1 scales_1.3.0
## [67] rootSolve_1.8.2.4 class_7.3-22 glue_1.8.0
## [70] pheatmap_1.0.12 lmom_3.2 maketools_1.3.1
## [73] LiblineaR_2.10-24 tools_4.4.2 dendextend_1.19.0
## [76] sys_3.4.3 data.table_1.16.2 SparseM_1.84-2
## [79] RSpectra_0.16-2 forcats_1.0.0 Exact_3.3
## [82] fastcluster_1.2.6 buildtools_1.0.0 mvtnorm_1.3-2
## [85] grid_4.4.2 umap_0.2.10.0 colorspace_2.1-1
## [88] GenomeInfoDbData_1.2.13 cli_3.6.3 fansi_1.0.6
## [91] expm_1.0-0 S4Arrays_1.7.1 viridisLite_0.4.2
## [94] dplyr_1.1.4 doSNOW_1.0.20 gtable_0.3.6
## [97] sass_0.4.9 digest_0.6.37 SparseArray_1.7.2
## [100] farver_2.1.2 htmltools_0.5.8.1 lifecycle_1.0.4
## [103] httr_1.4.7 openssl_2.2.2 MASS_7.3-61
If you have questions related to ILoReg
, please contact
us here.