CHETAH (CHaracterization of cEll Types Aided by Hierarchical
classification) is a package for cell type identification of single-cell
RNA-sequencing (scRNA-seq) data.
A pre-print of the article describing CHETAH is available at bioRxiv.
Summary: Cell types are assigned by correlating the input data to a reference in a hierarchical manner. This creates the possibility of assignment to intermediate types if the data do not allow a full classification to one of the types in the reference. CHETAH is built to work with scRNA-seq references, but will also work (with limited capabilities) with RNA-seq or micro-array reference datasets. So, to run CHETAH, you only need your input data and a reference dataset annotated with cell types, both stored as a SingleCellExperiment.
To run CHETAH on an input count matrix input_counts with t-SNE¹ coordinates in input_tsne, and a reference count matrix ref_counts with celltypes vector ref_ct, run:
## Make SingleCellExperiments
reference <- SingleCellExperiment(assays = list(counts = ref_counts),
                                  colData = DataFrame(celltypes = ref_ct))
input <- SingleCellExperiment(assays = list(counts = input_counts),
                              reducedDims = SimpleList(TSNE = input_tsne))
## Run CHETAH
input <- CHETAHclassifier(input = input, ref_cells = reference)
## Plot the classification
PlotCHETAH(input)
## Extract celltypes:
celltypes <- input$celltype_CHETAH
A tumor micro-environment reference dataset containing all major cell types for tumor data can be downloaded: here. This reference can be used for all (tumor) input datasets.
CHETAH constructs a classification tree by hierarchically clustering the reference data. The classification is guided by this tree. In each node of the tree, input cells are either assigned to the right, or the left branch. A confidence score is calculated for each of these assignments. When the confidence score for an assignment is lower than the threshold (default = 0.1), the classification for that cell stops in that node.
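This results in two types of classifications: final types, for cells that reach a leaf of the tree (a cell type of the reference), and intermediate types, for cells whose classification stopped in a node higher up in the tree.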
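The stopping rule can be illustrated with a toy tree in base R. This is only a sketch of the idea, not CHETAH's implementation: the tree layout, the node names, the chosen branches and the confidence scores are all made up, and `classify_cell` is a hypothetical helper.

```r
# Toy illustration only: node names, scores and the tree layout are made up.
# Each node lists its two children; names that are not nodes are leaves,
# i.e. cell types of the reference.
tree <- list(
  Node1 = c(left = "Node2", right = "Fibroblast"),
  Node2 = c(left = "T cell", right = "B cell")
)

# Walk one cell down the tree. `branch` gives the chosen branch per node and
# `confidence` the confidence of that choice (hypothetical inputs).
classify_cell <- function(branch, confidence, tree, threshold = 0.1) {
  node <- "Node1"
  while (node %in% names(tree)) {
    if (confidence[[node]] < threshold) {
      return(node)  # classification stops here: intermediate type
    }
    node <- tree[[node]][[branch[[node]]]]
  }
  node  # a leaf was reached: final type
}

confident_cell <- classify_cell(
  branch = c(Node1 = "left", Node2 = "right"),
  confidence = c(Node1 = 0.9, Node2 = 0.6),
  tree = tree
)
uncertain_cell <- classify_cell(
  branch = c(Node1 = "left", Node2 = "left"),
  confidence = c(Node1 = 0.9, Node2 = 0.05),
  tree = tree
)
print(c(confident_cell, uncertain_cell))
```

The first cell passes the threshold in every node and ends in a leaf, so it gets a final type. The second cell falls below the threshold in Node2 and keeps that node as its intermediate type.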
CHETAH will be a part of Bioconductor starting at release 3.9 (30th of April), and can be installed by:
## Install BiocManager if necessary
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("CHETAH")
# Load the package
library(CHETAH)
The development version can be downloaded from the development version of Bioconductor (in R v3.6).
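If you have your data stored as SingleCellExperiments, continue to the next step. Otherwise, you need the following data before you begin: the input counts (a matrix with genes in the rows and cells in the columns), the reduced dimensions of the input cells (a 2-column matrix), the reference counts (a matrix) and the cell types of the reference (a named character vector).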
To show how to prepare your data, we will use melanoma input data from Tirosh et al. and head-neck tumor reference data from Puram et al.
## load CHETAH's datasets
data('headneck_ref')
data('input_mel')
## To prepare the data from the package's internal data, run:
celltypes_hn <- headneck_ref$celltypes
counts_hn <- assay(headneck_ref)
counts_melanoma <- assay(input_mel)
tsne_melanoma <- reducedDim(input_mel)
## The input data: a Matrix
class(counts_melanoma)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
counts_melanoma[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#> mel_cell1 mel_cell2 mel_cell3 mel_cell4 mel_cell5
#> ELMO2 . . . 4.5633 .
#> PNMA1 . 4.3553 . . .
#> MMP2 . . . . .
#> TMEM216 . . . . 5.5624
#> TRAF3IP2-AS1 2.1299 4.0542 2.4209 1.6531 1.3144
## The reduced dimensions of the input cells: 2 column matrix
tsne_melanoma[1:5, ]
#> tSNE_1 tSNE_2
#> mel_cell1 4.5034553 13.596680
#> mel_cell2 -4.0025667 -7.075722
#> mel_cell3 0.4734054 9.277648
#> mel_cell4 3.2201815 11.445236
#> mel_cell5 -0.3354758 5.092415
all.equal(rownames(tsne_melanoma), colnames(counts_melanoma))
#> [1] TRUE
## The reference data: a Matrix
class(counts_hn)
#> [1] "matrix" "array"
counts_hn[1:5, 1:5]
#> hn_cell1 hn_cell2 hn_cell3 hn_cell4 hn_cell5
#> ELMO2 0.00000 0 0.00000 1.55430 4.2926
#> PNMA1 0.00000 0 0.00000 4.55360 0.0000
#> MMP2 0.00000 0 7.02880 4.50910 6.3006
#> TMEM216 0.00000 0 0.00000 0.00000 0.0000
#> TRAF3IP2-AS1 0.14796 0 0.65352 0.28924 3.6365
## The cell types of the reference: a named character vector
str(celltypes_hn)
#> Named chr [1:180] "Fibroblast" "Fibroblast" "Fibroblast" "Fibroblast" ...
#> - attr(*, "names")= chr [1:180] "hn_cell1" "hn_cell2" "hn_cell3" "hn_cell4" ...
## The names of the cell types correspond with the colnames of the reference counts:
all.equal(names(celltypes_hn), colnames(counts_hn))
#> [1] TRUE
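The consistency checks shown above can be collected in a small helper. The sketch below uses base R only; `check_chetah_inputs` is a hypothetical function that is not part of CHETAH, and the toy matrices merely stand in for real data.

```r
# Hypothetical helper (not part of CHETAH) that checks the four inputs.
check_chetah_inputs <- function(input_counts, input_redD, ref_counts, ref_ct) {
  stopifnot(
    !is.null(colnames(input_counts)),
    !is.null(colnames(ref_counts)),
    ncol(input_redD) == 2,
    identical(rownames(input_redD), colnames(input_counts)),
    is.character(ref_ct),
    identical(names(ref_ct), colnames(ref_counts))
  )
  TRUE
}

# Toy data standing in for real count matrices
genes <- c("ELMO2", "PNMA1", "MMP2")
toy_input <- matrix(1:6, nrow = 3, dimnames = list(genes, c("in1", "in2")))
toy_tsne <- matrix(c(0.1, 0.2, 0.3, 0.4), ncol = 2,
                   dimnames = list(c("in1", "in2"), c("tSNE_1", "tSNE_2")))
toy_ref <- matrix(1:9, nrow = 3,
                  dimnames = list(genes, c("ref1", "ref2", "ref3")))
toy_ct <- c(ref1 = "Fibroblast", ref2 = "Fibroblast", ref3 = "T cell")

inputs_ok <- check_chetah_inputs(toy_input, toy_tsne, toy_ref, toy_ct)
print(inputs_ok)
```

The helper stops with an error as soon as one of the conditions fails, for example when the cell type vector has no names.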
SingleCellExperiments
CHETAH expects data to be in the format of a SingleCellExperiment, which is an easy way to store different kinds of data together. Comprehensive information on this data type can be found here.
A SingleCellExperiment holds three things: assays (the counts, as a list of Matrices), colData (the metadata, as DataFrames) and reducedDims (the reduced dimensions, as a SimpleList of 2-column data.frames or matrices).
CHETAH needs a reference SingleCellExperiment with an assay and a colData column holding the cell types, and an input SingleCellExperiment with an assay and a reducedDim.
For the example data, we would make the two objects by running:
## For the reference we define a "counts" assay and "celltypes" metadata
headneck_ref <- SingleCellExperiment(assays = list(counts = counts_hn),
colData = DataFrame(celltypes = celltypes_hn))
## For the input we define a "counts" assay and "TSNE" reduced dimensions
input_mel <- SingleCellExperiment(assays = list(counts = counts_melanoma),
reducedDims = SimpleList(TSNE = tsne_melanoma))
By default, CHETAH uses the first assay/reducedDim in an object and “celltypes” for the reference’s colData. See ?CHETAHclassifier and ?PlotCHETAH on how to change this behaviour.
Now that the data is prepared, running CHETAH is easy:
input_mel <- CHETAHclassifier(input = input_mel,
ref_cells = headneck_ref)
#> Preparing data....
#> Running analysis...
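CHETAH returns the input object, with the following added: a celltype_CHETAH column holding the assigned cell types, and the int_colData and int_metadata, which are not meant for direct interaction but are used by PlotCHETAH and CHETAHshiny.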
CHETAH’s classification can be visualized using PlotCHETAH. This function plots both the classification tree and the t-SNE (or other provided reduced dimension) map. Either the final types or the intermediate types are colored in these plots. The non-colored types are represented in grayscale.
To plot the final types:
Conversely, to color the intermediate types:
If you would like to use the classification, and thus the colors, in another package (e.g. Seurat²), you can extract the colors using:
CHETAHshiny
The classification of CHETAH and other outputs like profile and confidence scores can be visualized in a shiny application that allows for easy and interactive analysis of the classification.
Here you can view:
The following command will open the shiny application in an R window. The page can also be opened in your default web-browser by clicking “Open in Browser” at the very top.
CHETAH calculates a confidence score for each assignment of an input cell to one of the branches of a node.
The confidence score:
The default confidence threshold of CHETAH is 0.1. This means that whenever a cell is assigned to a branch and the confidence of that assignment is lower than 0.1, the classification will stop in that node.
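How the threshold changes the number of cells that reach a final type can be sketched with made-up numbers. The matrix below holds hypothetical confidence scores of four cells along their path through a two-level tree; `reaches_final_type` is an illustrative helper, not a CHETAH function.

```r
# Hypothetical per-node confidence scores for four cells along their path
# through a toy tree (rows = cells, columns = nodes visited in order).
path_confidence <- rbind(
  cell1 = c(0.95, 0.90),
  cell2 = c(0.80, 0.30),
  cell3 = c(0.60, 0.05),
  cell4 = c(0.02, 0.70)
)

# A cell reaches a final type only if every assignment on its path has a
# confidence at or above the threshold.
reaches_final_type <- function(path_confidence, threshold) {
  apply(path_confidence, 1, function(scores) all(scores >= threshold))
}

thresholds <- c(0, 0.1, 0.8)
n_final <- sapply(thresholds, function(threshold) {
  sum(reaches_final_type(path_confidence, threshold))
})
names(n_final) <- thresholds
print(n_final)
```

In this toy example a threshold of 0 sends all four cells to a final type, the default of 0.1 leaves two cells in an intermediate node, and a threshold of 0.8 keeps only the most confidently assigned cell.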
The confidence threshold can be adjusted in order to classify more or fewer cells to a final type:
For example, to only classify cells with very high confidence:
Conversely, to classify all cells:
For renaming types in the tree, CHETAH comes with the RenameBelowNode function. This can be useful when you are more interested in the general types than in the different intermediate and final types.
For the example data, let’s say that we are not interested in all the different subtypes of T cells (under Node6 and Node7). We can name all these cells “T cell” by running:
input_mel <- RenameBelowNode(input_mel, whichnode = 6, replacement = "T cell")
PlotCHETAH(input = input_mel, tree = FALSE)
To reset the classification to its default, just run Classify again:
CHETAH can use any scRNA-seq reference, but the used reference greatly influences the classification.
The following general rules apply on choosing and creating a reference:
To reduce computation time with very big references, first try to subsample each cell type to 100-200 cells. CHETAH should have very similar performance to using all cells. For a SingleCellExperiment “ref” with cell type metadata “celltypes”, this could be done by:
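A minimal sketch of such a subsampling in base R, using a made-up named vector `celltypes` in place of ref$celltypes. The limit of 200 cells per type and all variable names are chosen for illustration.

```r
set.seed(1)  # make the random subsample reproducible

# Toy reference: 500 cells of type A, 50 of type B (made-up data)
celltypes <- c(rep("A", 500), rep("B", 50))
names(celltypes) <- paste0("cell", seq_along(celltypes))
max_cells <- 200  # upper limit per cell type

# For each cell type, keep all cells if there are at most `max_cells`,
# otherwise draw `max_cells` of them at random.
keep <- unlist(lapply(unique(celltypes), function(type) {
  type_cells <- which(celltypes == type)
  if (length(type_cells) > max_cells) {
    type_cells[sample.int(length(type_cells), max_cells)]
  } else {
    type_cells
  }
}), use.names = FALSE)

print(table(celltypes[keep]))
```

For a SingleCellExperiment, the same column indices can be used to subset the object, e.g. ref[, keep].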
CHETAH does not require normalized input data, but the reference data has to be normalized beforehand. The reference data that is used in this vignette is already normalized. Only for the sake of the example, we use this dataset anyway to perform normalization:
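A minimal sketch of one common normalization, assuming raw counts with genes in the rows and cells in the columns. Each cell is scaled to the same total and then log2-transformed. The scaling factor of 1e5 and the toy matrix are assumptions for illustration and not necessarily what was used for the reference in this vignette.

```r
# Toy raw count matrix: genes in rows, cells in columns (made-up values)
raw_counts <- matrix(c(10, 0, 90,
                       5, 5, 40),
                     nrow = 3,
                     dimnames = list(c("geneA", "geneB", "geneC"),
                                     c("cell1", "cell2")))

# Scale each cell to the same total (here 1e5, an arbitrary choice),
# then log2-transform with a pseudocount of 1.
scale_factor <- 1e5
normalized <- apply(raw_counts, 2, function(column) {
  log2(column / sum(column) * scale_factor + 1)
})
print(round(normalized, 2))
```

Zero counts stay zero after the transformation, and cells with different sequencing depths become comparable because each column is divided by its own total.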
Especially with references that have relatively high drop-out rates, CHETAH can be influenced by highly expressed, and thus highly variable, genes. In our experience, mainly ribosomal protein genes can cause such an effect. We therefore delete these genes, using the “ribosomal” list from here.
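Removing such genes amounts to dropping rows from the count matrix. The sketch below shows two ways on a toy matrix: filtering with a gene list (the vector `ribo_genes` is a placeholder, not the actual “ribosomal” list) and filtering on gene-name patterns.

```r
# Toy expression matrix (made-up values); RPL13 and RPS6 are ribosomal
# protein genes, the others are not.
counts <- matrix(1:10, nrow = 5,
                 dimnames = list(c("RPL13", "RPS6", "MMP2", "ELMO2", "RPA1"),
                                 c("cell1", "cell2")))

# Option 1: remove genes named in a curated list (placeholder vector here)
ribo_genes <- c("RPL13", "RPS6")
filtered_by_list <- counts[!rownames(counts) %in% ribo_genes, , drop = FALSE]

# Option 2: quick pattern-based filter. "^RP[LS]" is narrower than "^RP",
# which would also drop unrelated genes such as RPA1.
filtered_by_pattern <- counts[!grepl("^RP[LS]", rownames(counts)), , drop = FALSE]

print(rownames(filtered_by_list))
print(rownames(filtered_by_pattern))
```

A curated list is the safer choice. The pattern is quicker, but it depends on the gene naming convention of the dataset.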
The performance of CHETAH is heavily dependent on the quality of the reference.
The quality of the reference is affected by:
To see how well CHETAH can distinguish between the cell types in a reference, CorrelateReference and, more importantly, ClassifyReference can be run.
CorrelateReference is a function that, for every combination of two cell types, finds the genes with the highest fold-change between the two and uses these to correlate them to each other. If the reference is good, all types will correlate poorly or, even better, anti-correlate.
CorrelateReference(ref_cells = headneck_ref)
#> Running... in case of 1000s of cells, this may take a couple of minutes
In this case, most cell types will be distinguishable: many types don’t correlate, or anti-correlate. However, some types are quite similar. Regulatory and CD4 T cells, or CD4 and CD8 T cells, might be hard to distinguish in the input data.
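The idea behind such a pairwise comparison can be sketched in base R on simulated data. This is an illustration under simplifying assumptions and not the code of CorrelateReference: the data are random, the difference in mean expression stands in for the fold-change, and `pairwise_cor` and `n_top` are hypothetical names.

```r
set.seed(42)

# Toy reference: 50 genes x 30 cells, three made-up cell types
n_genes <- 50
celltypes <- rep(c("A", "B", "C"), each = 10)
expr <- matrix(rnorm(n_genes * 30), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# Give every type its own marker genes so the types are distinguishable
expr[1:10, celltypes == "A"] <- expr[1:10, celltypes == "A"] + 3
expr[11:20, celltypes == "B"] <- expr[11:20, celltypes == "B"] + 3
expr[21:30, celltypes == "C"] <- expr[21:30, celltypes == "C"] + 3

# Mean expression profile per cell type (genes x types)
profiles <- sapply(unique(celltypes), function(type) {
  rowMeans(expr[, celltypes == type, drop = FALSE])
})

# For each pair of types: pick the genes that differ most between the two
# profiles, then correlate the profiles over those genes only.
pairwise_cor <- function(profiles, n_top = 20) {
  types <- colnames(profiles)
  result <- diag(length(types))
  dimnames(result) <- list(types, types)
  for (pair in combn(types, 2, simplify = FALSE)) {
    difference <- abs(profiles[, pair[1]] - profiles[, pair[2]])
    top_genes <- order(difference, decreasing = TRUE)[seq_len(n_top)]
    value <- cor(profiles[top_genes, pair[1]], profiles[top_genes, pair[2]])
    result[pair[1], pair[2]] <- result[pair[2], pair[1]] <- value
  }
  result
}

reference_cor <- pairwise_cor(profiles)
print(round(reference_cor, 2))
```

Because each simulated type has its own marker genes, the selected genes are high in one type and low in the other, so the pairwise correlations come out negative. Two very similar types would instead show a correlation close to or above zero.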
Another check to see whether CHETAH can distinguish between the cell types in the reference is ClassifyReference. This function uses the reference to classify the reference itself. If CHETAH works well with the reference, there should be almost no mix-ups in the classification, i.e. all cells of type A should be classified to type A.
In this reference, there is never more than 10% mix-up between two cell types. In addition, a low percentage of cells is classified as an intermediate type. Most mix-ups occur between subtypes of T cells. In this case the user should be aware that these cell type labels have the highest chance to interchange.
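A mix-up percentage of this kind can be computed from true and assigned labels with a cross-table. The sketch below uses made-up labels and a made-up intermediate node name. It shows the arithmetic only and is not the output of ClassifyReference.

```r
# Made-up true and predicted labels for a small reference
true_type <- c(rep("CD4 T cell", 10), rep("CD8 T cell", 10), rep("B cell", 10))
predicted <- true_type
predicted[1] <- "CD8 T cell"   # one CD4 T cell mixed up with CD8
predicted[11] <- "Node7"       # one CD8 T cell stops at an intermediate node

# Rows: true type; columns: assigned type; values: percentage of each true type
confusion <- prop.table(table(true = true_type, assigned = predicted),
                        margin = 1) * 100
print(round(confusion, 1))

# Largest percentage of any true type assigned to a different label
off_diagonal <- confusion
for (type in rownames(confusion)) off_diagonal[type, type] <- 0
max_mixup <- max(off_diagonal)
print(max_mixup)
```

Each row sums to 100, so the off-diagonal entries directly give the percentage of a cell type that ended up with another label or in an intermediate node.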
CHETAH is optimized to give good results in most analyses, but it can happen that a classification is imperfect. When CHETAH does not give the desired output (too few cells are classified, visually random classification, etc.), these are the steps to take (in this order):
Remove ribosomal protein genes; input[!(grepl("^RP", rownames(input))), ] is an imperfect, but very quick way to do this.
Try a different value for the n_genes parameter.
1 Van Der Maaten and Hinton (2008). Visualizing high-dimensional data using t-sne. J Mach Learn Res. 9: 2579-2605. doi: 10.1007/s10479-011-0841-3.
2 Satija et al. (2015). Spatial reconstruction of single-cell gene expression data. Nat Biotechnol. 33(5): 495-502. doi: 10.1038/nbt.3192. More information at: https://satijalab.org/seurat/
3 Picelli et al. (2013). Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat Methods. 10(11): 1096-1100. doi: 10.1038/nmeth.2639.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] CHETAH_1.23.0 SingleCellExperiment_1.29.1
#> [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [5] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
#> [7] IRanges_2.41.1 S4Vectors_0.45.2
#> [9] BiocGenerics_0.53.3 generics_0.1.3
#> [11] MatrixGenerics_1.19.0 matrixStats_1.4.1
#> [13] ggplot2_3.5.1 Matrix_1.7-1
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 viridisLite_0.4.2 dplyr_1.1.4
#> [4] farver_2.1.2 viridis_0.6.5 fastmap_1.2.0
#> [7] lazyeval_0.2.2 promises_1.3.2 digest_0.6.37
#> [10] mime_0.12 lifecycle_1.0.4 magrittr_2.0.3
#> [13] compiler_4.4.2 rlang_1.1.4 sass_0.4.9
#> [16] tools_4.4.2 utf8_1.2.4 yaml_2.3.10
#> [19] corrplot_0.95 data.table_1.16.2 knitr_1.49
#> [22] S4Arrays_1.7.1 labeling_0.4.3 htmlwidgets_1.6.4
#> [25] DelayedArray_0.33.2 plyr_1.8.9 RColorBrewer_1.1-3
#> [28] abind_1.4-8 withr_3.0.2 purrr_1.0.2
#> [31] sys_3.4.3 bioDist_1.79.0 grid_4.4.2
#> [34] fansi_1.0.6 xtable_1.8-4 colorspace_2.1-1
#> [37] scales_1.3.0 cli_3.6.3 rmarkdown_2.29
#> [40] crayon_1.5.3 httr_1.4.7 reshape2_1.4.4
#> [43] cachem_1.1.0 stringr_1.5.1 zlibbioc_1.52.0
#> [46] XVector_0.47.0 vctrs_0.6.5 jsonlite_1.8.9
#> [49] maketools_1.3.1 dendextend_1.19.0 plotly_4.10.4
#> [52] jquerylib_0.1.4 tidyr_1.3.1 glue_1.8.0
#> [55] cowplot_1.1.3 stringi_1.8.4 gtable_0.3.6
#> [58] later_1.4.1 UCSC.utils_1.3.0 munsell_0.5.1
#> [61] tibble_3.2.1 pillar_1.9.0 htmltools_0.5.8.1
#> [64] GenomeInfoDbData_1.2.13 R6_2.5.1 evaluate_1.0.1
#> [67] shiny_1.9.1 lattice_0.22-6 pheatmap_1.0.12
#> [70] httpuv_1.6.15 bslib_0.8.0 Rcpp_1.0.13-1
#> [73] gridExtra_2.3 SparseArray_1.7.2 xfun_0.49
#> [76] buildtools_1.0.0 pkgconfig_2.0.3