Title: | Fast and accurate scRNA-seq cell type identification |
---|---|
Description: | CHETAH (CHaracterization of cEll Types Aided by Hierarchical classification) is an accurate, selective and fast scRNA-seq classifier. Classification is guided by a reference dataset, preferentially also a scRNA-seq dataset. By hierarchical clustering of the reference data, CHETAH creates a classification tree that enables a step-wise, top-to-bottom classification. Using a novel stopping rule, CHETAH classifies the input cells to the cell types of the references and to "intermediate types": more general classifications that ended in an intermediate node of the tree. |
Authors: | Jurrian de Kanter [aut, cre], Philip Lijnzaad [aut] |
Maintainer: | Jurrian de Kanter <[email protected]> |
License: | file LICENSE |
Version: | 1.23.0 |
Built: | 2024-12-29 04:06:48 UTC |
Source: | https://github.com/bioc/CHETAH |
CHETAH classifies an input dataset by comparing it to a reference dataset in a stepwise, top-to-bottom fashion. See 'details' for a full explanation. NOTE: We recommend to use all the default parameters
CHETAHclassifier( input, ref_cells = NULL, ref_profiles = NULL, ref_ct = "celltypes", input_c = NA, ref_c = NA, thresh = 0.1, gs_method = c("fc", "wilcox"), cor_method = c("spearman", "kendall", "pearson", "cosine"), clust_method = c("average", "single", "complete", "ward.D2", "ward.D", "mcquitty", "median", "centroid"), clust_dist = bioDist::spearman.dist, n_genes = 200, pc_thresh = 0.2, p_thresh = 0.05, fc_thresh = 1.5, subsample = FALSE, fix_ngenes = TRUE, plot.tree = FALSE, only_pos = FALSE, print_steps = FALSE )
CHETAHclassifier( input, ref_cells = NULL, ref_profiles = NULL, ref_ct = "celltypes", input_c = NA, ref_c = NA, thresh = 0.1, gs_method = c("fc", "wilcox"), cor_method = c("spearman", "kendall", "pearson", "cosine"), clust_method = c("average", "single", "complete", "ward.D2", "ward.D", "mcquitty", "median", "centroid"), clust_dist = bioDist::spearman.dist, n_genes = 200, pc_thresh = 0.2, p_thresh = 0.05, fc_thresh = 1.5, subsample = FALSE, fix_ngenes = TRUE, plot.tree = FALSE, only_pos = FALSE, print_steps = FALSE )
input |
required: an input SingleCellExperiment.
(see: Bioconductor,
and the vignette |
ref_cells |
required: A reference SingleCellExperiment, with
the cell types in the "celltypes" colData (or otherwise defined in |
ref_profiles |
optional In case of bulk-RNA seq or micro-arrays, an expression matrix with one (average) reference expression profile per cell type in the columns. ('ref_cells' must be left empty) |
ref_ct |
the colData of |
input_c |
the name of the assay of the input to use.
|
ref_c |
same as |
thresh |
the initial confidence threshold, which can be changed after running
by |
gs_method |
method for gene selection. In every node of the tree:
"fc" = quick method: either a fixed number ( |
cor_method |
the correlation measure: one of: "spearman" (default), "kendall", "pearson", "cosine" |
clust_method |
the method used for clustering the reference profiles.
One of the methods from |
clust_dist |
a distance measure, default: |
n_genes |
The number of genes used in every step. Only used if
|
pc_thresh |
when: gs_method = "wilcox", only genes are selected
for which more than a |
p_thresh |
when: gs_method = "wilcox" , only genes are selected
that have a p-value < |
fc_thresh |
when: gs_method = "wilcox" or gs_method = "fc"
AND fix_ngenes = FALSE,
only genes are selected that have a log2 fld-change > |
subsample |
to prevent reference types with a lot of cells to influence
the gene selection, subsample types with more that |
fix_ngenes |
when: gs_method = "fc" use a fixed number of genes
for all correlations. when: gs_method = "wilcox"
use a maximum of genes per step.
When |
plot.tree |
Plot the classification tree. |
only_pos |
not recommended: only use genes for a reference type that are higher expressed in that type, than the others in that node. |
print_steps |
whether the number of genes (postive and negative) per step per ref_cell_type should be printed |
CHETAH will hierarchically cluster reference data
to produce a classification tree (ct).
In each node of the ct, CHETAH will
assign each input cell to on of the two branches, based on gene selections,
correlations and calculation of profile and confidence scores.
The assignement will only performed if the confidence score for
such an assignment is higher than the Confidence Threshold.
If this is not the case, classification for the cell will stop in the current node.
Some input cells will reach the leaf nodes of the ct (the pre-defined cell types),
these classifications are called final types
For other cells, assignment will stop in a node. These classifications
are called intermediate types.
A SingleCellExperiment with added: - input$celltype_CHETAH a named character vector that can directly be used in any other workflow/method. - "hidden" 'int_colData' and 'int_metadata', not meant for direct interaction, but which can all be viewed and interacted with using: 'PlotCHETAH' and 'CHETAHshiny' A list containing the following objects is added to input$int_metadata$CHETAH
classification a named vector: the classified types with the corresponding names of the input cells
tree the hclust object of the classification tree
nodetypes A list with the cell types under each node
nodecoor the coordinates of the nodes of the classification tree
genes A list per node, containing a list per reference type with the genes used for the profile scores of that type
parameters The parameters used
A nested DataFrame is added to input$int_colData$CHETAH. It holds 3 top-levels DataFrames
prof_scores A list with the profile scores
conf_scores A list with the confidence scores
correlations A list with the correlations of the input cells to the reference profiles
data('input_mel') data('headneck_ref') ## Melanoma data from Tirosh et al. (2016) Science input_mel ## Head-Neck data from Puram et al. (2017) Cancer Cell headneck_ref input_mel <- CHETAHclassifier(input = input_mel, ref_cells = headneck_ref)
data('input_mel') data('headneck_ref') ## Melanoma data from Tirosh et al. (2016) Science input_mel ## Head-Neck data from Puram et al. (2017) Cancer Cell headneck_ref input_mel <- CHETAHclassifier(input = input_mel, ref_cells = headneck_ref)
Launch a web page to interactively go trough the classification
CHETAHshiny(input, redD = NA, input_c = NA)
CHETAHshiny(input, redD = NA, input_c = NA)
input |
a SingleCellExperiment on which |
redD |
the name of the reducedDim of the input to use for plotting |
input_c |
the name of the assay of the input to use.
|
Opens a web page in your default browser
CHETAHclassifier
using a confidence threshold (Re)classify after running CHETAHclassifier
using a confidence threshold
NOTE: In case of bulk reference profiles: only the correlations will be used,
as the data does not allow for profile or confidence scores to be calculated.
Classify(input, thresh = 0.1, return_clas = FALSE)
Classify(input, thresh = 0.1, return_clas = FALSE)
input |
a SingleCellExperiment on which |
thresh |
a confidence threshold between -0 and 2. |
return_clas |
Instead of returning the SingleCellExperiment, only return the classification vector |
a charachter vector of the cell types with the names of the cells
data('input_mel') data('headneck_ref') ## Classify all cells input_mel <- Classify(input_mel, 0) ## Classify only cells with a very high confidence input_mel <- Classify(input_mel, 1) ## Back to the default input_mel <- Classify(input_mel) ## Return only the classification vector celltypes <- Classify(input_mel, 1, return_clas = TRUE)
data('input_mel') data('headneck_ref') ## Classify all cells input_mel <- Classify(input_mel, 0) ## Classify only cells with a very high confidence input_mel <- Classify(input_mel, 1) ## Back to the default input_mel <- Classify(input_mel) ## Return only the classification vector celltypes <- Classify(input_mel, 1, return_clas = TRUE)
Use a reference dataset to classify itself. A good reference should have almost no mixture between reference cells.
ClassifyReference( ref_cells, ref_ct = "celltypes", ref_c = "counts", return = FALSE, ... )
ClassifyReference( ref_cells, ref_ct = "celltypes", ref_c = "counts", return = FALSE, ... )
ref_cells |
the reference, similar to
|
ref_ct |
the colData of |
ref_c |
same as |
return |
return the matrix that was used to produce the plot |
... |
Other variables to pass to
|
A square plot. The rows are the original cell types, the columns the classifion labels. The colors and sizes of the squares indicate which part of the cells of the rowname type are classified to the type of the column name. On the left of the plot, the percentage of cells that is classified to an intermediate type is plotted. A good reference would classify nearly 100
data('headneck_ref') ClassifyReference(ref_cells = headneck_ref)
data('headneck_ref') ClassifyReference(ref_cells = headneck_ref)
Correlate all reference profiles to each other using differentially expressed genes.
CorrelateReference( ref_cells = NULL, ref_profiles = NULL, ref_ct = "celltypes", ref_c = NA, return = FALSE, n_genes = 200, fix_ngenes = TRUE, print_steps = FALSE, only_pos = FALSE )
CorrelateReference( ref_cells = NULL, ref_profiles = NULL, ref_ct = "celltypes", ref_c = NA, return = FALSE, n_genes = 200, fix_ngenes = TRUE, print_steps = FALSE, only_pos = FALSE )
ref_cells |
the reference, similar to
|
ref_profiles |
similar to
|
ref_ct |
the colData of |
ref_c |
the assay of |
return |
return the matrix that was used to produce the plot |
n_genes |
as in |
fix_ngenes |
as in |
print_steps |
as in |
only_pos |
as in |
A square plot. The values show how much two reference profiles correlate, when using the genes with the highest fold-change.
data('headneck_ref') CorrelateReference(ref_cells = headneck_ref)
data('headneck_ref') CorrelateReference(ref_cells = headneck_ref)
A SingleCellExperiment with celltypes in the "celltypes" colData. A subset of the Head-Neck data from Puram et al. (2017) Cancer Cell.
data('headneck_ref')
data('headneck_ref')
A list of expression matrices. Each object is named as the cell type of the cells in that matrix. Each matrix has the cell (names) in the colums and the genes in the rows.
for the original data: GEO
Puram et al. (2017) Cancer Cell 171:1611-1624
headneck_ref
It holds subset of the Melanoma data, from Tirosh et al. (2016), Science.A SingleCellExperiment on which CHEATHclassifier is run using the headneck_ref
It holds subset of the Melanoma data, from Tirosh et al. (2016), Science.
data('input_mel')
data('input_mel')
This is a SingleCellExperiment
for the original data: GEO
Tirosh et al. (2016) Science 6282:189-196
Plot the CHETAH classification on 2D visulization like t-SNE + the corresponding classification tree, colored with the same colors
PlotCHETAH( input, redD = NA, interm = FALSE, return = FALSE, tree = TRUE, pt.size = 1, return_col = FALSE, col = NULL )
PlotCHETAH( input, redD = NA, interm = FALSE, return = FALSE, tree = TRUE, pt.size = 1, return_col = FALSE, col = NULL )
input |
a SingleCellExperiment on which |
redD |
the name of the reducedDim of the input to use for plotting |
interm |
color the intermediate instead of the final types |
return |
return the plot instead of printing it |
tree |
plot the tree, along with the classification |
pt.size |
the point-size of the classication plot |
return_col |
whether the colors that are used for the classification plot should be returned |
col |
custom colors for the cell types. the colors should be named with the corresponding cell types |
a ggplot object
data('input_mel') #' ## Standard plot (final types colored) PlotCHETAH(input = input_mel) ## Intermediate types colored PlotCHETAH(input = input_mel, interm = TRUE) ## Plot only the t-SNE plot PlotCHETAH(input = input_mel, tree = FALSE)
data('input_mel') #' ## Standard plot (final types colored) PlotCHETAH(input = input_mel) ## Intermediate types colored PlotCHETAH(input = input_mel, interm = TRUE) ## Plot only the t-SNE plot PlotCHETAH(input = input_mel, tree = FALSE)
Plots the chetah classification tree with nodes numbered
PlotTree( input, col = NULL, col_nodes = NULL, return = FALSE, no_bgc = FALSE, plot_limits = c(-0.4, 0.1), labelsize = 6 )
PlotTree( input, col = NULL, col_nodes = NULL, return = FALSE, no_bgc = FALSE, plot_limits = c(-0.4, 0.1), labelsize = 6 )
input |
a SingleCellExperiment on which |
col |
a vector of colors, with the names of the reference cell types |
col_nodes |
a vector of colors, ordered for node 1 till the last node |
return |
instead of printing, return the ggplot object |
no_bgc |
remove the background color from the node numbers |
plot_limits |
define the Decreasing the former further is usefull when the labels are cut of the plot (default = c(-0,25, 01)). |
labelsize |
the size of the intermediate and leaf node labels (default = 6) |
A ggplot object of the classification tree
data('input_mel') PlotTree(input = input_mel)
data('input_mel') PlotTree(input = input_mel)
Plots a variable on a t-SNE
PlotTSNE( toplot, input, redD = NA, col = NULL, return = FALSE, limits = NULL, pt.size = 1, shiny = NULL, y_limits = NULL, x_limits = NULL, legend_label = "" )
PlotTSNE( toplot, input, redD = NA, col = NULL, return = FALSE, limits = NULL, pt.size = 1, shiny = NULL, y_limits = NULL, x_limits = NULL, legend_label = "" )
toplot |
the variable that should be plotted. Either a character vector
or a factor, or a (continuous) numeric.
If toplot is not named with the rownames of |
input |
a SingleCellExperiment on which |
redD |
the name of the reducedDim of the input to use for plotting |
col |
a vector of colors. If |
return |
instead of printing, return the ggplot object |
limits |
the limits of the continuous variable to plot. When not provided the minimal and maximal value will be used |
pt.size |
the point-size |
shiny |
Needed for the shiny application: should always be NULL |
y_limits |
the y-axis limits |
x_limits |
the x-axis limits, if NULL |
legend_label |
the label of the legend |
A ggplot object
data('input_mel') CD8 <- assay(input_mel)['CD8A', ] PlotTSNE(toplot = CD8, input = input_mel)
data('input_mel') CD8 <- assay(input_mel)['CD8A', ] PlotTSNE(toplot = CD8, input = input_mel)
In the CHETAH classification, replace the name of a Node and all the names of the final and intermediate types under that Node.
RenameBelowNode( input, whichnode, replacement, nodes_exclude = NULL, types_exclude = NULL, node_only = FALSE, return_clas = FALSE )
RenameBelowNode( input, whichnode, replacement, nodes_exclude = NULL, types_exclude = NULL, node_only = FALSE, return_clas = FALSE )
input |
a SingleCellExperiment on which |
whichnode |
the number of the Node |
replacement |
a character vector that replaces the names under the selected Node |
nodes_exclude |
optional the names of the types that should NOT be replaced |
types_exclude |
optional numbers of the Nodes under the selected Node, that should NOT be replaced |
node_only |
only rename the Node itself, without affecting the types under that Node |
return_clas |
Instead of returning the SingleCellExperiment, only return the classification vector |
The SingleCellExperiment with the new classification or if 'return_clas = TRUE' the classification vector.
## In the example data replace all T-cell subtypes by "T cell" data('input_mel') #' input_mel <- RenameBelowNode(input = input_mel, whichnode = 7, replacement = "T cell")
## In the example data replace all T-cell subtypes by "T cell" data('input_mel') #' input_mel <- RenameBelowNode(input = input_mel, whichnode = 7, replacement = "T cell")