| Title: | Conformal Inference for Cell Type Annotation |
|---|---|
| Description: | Builds prediction intervals for cell type annotation using conformal inference and conformal risk control. It provides two main methods. The first gives prediction intervals with coverage guarantees based on standard conformal inference. The second gives hierarchical prediction intervals that are consistent with the cell ontology. |
| Authors: | Daniela Corbetta [aut, cre] (ORCID: <https://orcid.org/0009-0008-5026-8271>), Tram Nguyen [ctb], Nitesh Turaga [ctb], Ludwig Geistlinger [ctb] |
| Maintainer: | Daniela Corbetta <[email protected]> |
| License: | Artistic-2.0 |
| Version: | 1.1.0 |
| Built: | 2026-05-30 09:52:32 UTC |
| Source: | https://github.com/bioc/scConform |
Given a prediction set and an ontology represented as a directed graph, this function returns the most specific common ancestor of the labels in the prediction set. It is mainly intended for hierarchical conformal prediction, where a set of predicted labels can be summarized by a single ontology term representing their common ancestor.
getCommonAncestor(pred_set, onto)
pred_set |
character vector of labels included in the prediction set.
These labels should correspond to node names in |
onto |
an |
A character string corresponding to the most specific common ancestor
of the labels in pred_set according to the ontology onto.
    library(igraph)

    # Let's build a random ontology
    onto <- graph_from_literal(
        animal -+dog:cat,
        cat -+british:persian,
        dog -+cocker:retriever,
        retriever -+golden:labrador
    )

    # Let's consider this prediction set
    pred_set <- c("golden", "labrador", "cocker")
    com_anc <- getCommonAncestor(pred_set, onto)
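As a rough illustration of what "most specific common ancestor" means, the following base-R sketch walks from each label up to the root and returns the first node shared by all paths. It is not the package's implementation: the `parent` lookup vector and the helpers `path_to_root()` and `common_ancestor()` are hypothetical names introduced here, and the sketch assumes a tree-shaped ontology in which every node has at most one parent.

```r
# Toy ontology stored as a child -> parent lookup (hypothetical representation,
# mirroring the graph built in the example above)
parent <- c(
  dog = "animal", cat = "animal",
  british = "cat", persian = "cat",
  cocker = "dog", retriever = "dog",
  golden = "retriever", labrador = "retriever"
)

# Path from a node up to the root, starting with the node itself
path_to_root <- function(v) {
  path <- v
  while (v %in% names(parent)) {
    v <- parent[[v]]
    path <- c(path, v)
  }
  path
}

# Nodes shared by all paths, ordered from most specific to most general;
# the first one is the most specific common ancestor
common_ancestor <- function(pred_set) {
  paths <- lapply(pred_set, path_to_root)
  shared <- Reduce(intersect, paths)
  shared[1]
}

common_ancestor(c("golden", "labrador", "cocker"))
# "dog"
```

For an ontology where a node can have several parents, a single path per label is not enough, and the ancestor sets would have to be collected with a graph traversal instead.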
This function returns conformal prediction sets for the cell
types of the cells in a query dataset. It implements two methods: standard split
conformal inference and a hierarchical conformal risk-control approach that
incorporates the cell ontology structure. Depending on the input and on the
value of return_sc, the output is either a list of prediction sets or a
SingleCellExperiment/SpatialExperiment object with prediction sets stored
in the colData.
    getPredictionSets(
        x_query,
        x_cal,
        y_cal,
        onto = NULL,
        alpha = 0.1,
        lambdas = seq(0.001, 0.999, length.out = 100),
        follow_ontology = TRUE,
        resample = FALSE,
        labels = NULL,
        return_sc = NULL,
        pr_name = "pred.set",
        simplify = FALSE,
        method = "full",
        BPPARAM = SerialParam()
    )
x_query |
query data for which we want to build prediction sets. This can
be either a |
x_cal |
calibration data. This can be either a
|
y_cal |
a vector of length |
onto |
An |
alpha |
Numeric value between 0 and 1 that indicates the allowed miscoverage |
lambdas |
a numeric vector of possible lambda values to be considered.
Necessary only when |
follow_ontology |
Logical. If |
resample |
Logical. If |
labels |
Character vector with the labels of the cell types considered.
Necessary if
|
return_sc |
Logical. Parameter that controls the output type. If
|
pr_name |
Character string giving the name of the |
simplify |
Logical. If |
method |
character string or function specifying how hierarchical
prediction sets are constructed when
Alternatively, |
BPPARAM |
BiocParallel instance for parallel computing. Default is
|
Conformal inference is a statistical framework for building prediction sets around any probabilistic or machine learning model. Suppose we have a classification task with $K$ classes. We fit a classification model $\hat{f}$ that outputs estimated probabilities for each class: $\hat{f}(x) \in [0,1]^K$. Split conformal inference requires reserving a portion of the labelled training data, $(X_1, Y_1), \dots, (X_n, Y_n)$, to be used as calibration data. Given $\hat{f}$ and the calibration data, the objective of conformal inference is to build, for a new observation $X_{n+1}$, a prediction set $C(X_{n+1}) \subseteq \{1, \dots, K\}$ that satisfies

$$P(Y_{n+1} \in C(X_{n+1})) \geq 1 - \alpha$$

for a user-chosen error rate $\alpha$. Note that conformal inference is distribution-free and the sets it provides have finite-sample validity. The only assumption is that the test data and the calibration data are exchangeable. The algorithm of split conformal inference is the following:

1. For the data in the calibration set, $(X_1, Y_1), \dots, (X_n, Y_n)$, obtain the conformal scores $s_i = 1 - \hat{f}(X_i)_{Y_i}$, $i = 1, \dots, n$. These scores will be high when the model assigns a small probability to the true class, and low otherwise.
2. Obtain $\hat{q}$, the $\lceil (1-\alpha)(n+1) \rceil / n$ empirical quantile of the conformal scores.
3. Finally, for a new observation $X_{n+1}$, construct a prediction set by including all the classes for which the estimated probability is at least $1 - \hat{q}$:

$$C(X_{n+1}) = \{y : \hat{f}(X_{n+1})_y \geq 1 - \hat{q}\}$$
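The three steps of split conformal inference can be sketched in base R as follows. This is a minimal illustration on simulated data, not the code used by `getPredictionSets()`: the helpers `simulate_probs()` and `draw_labels()` are hypothetical, the labels are drawn from the simulated probabilities so that calibration and test data are exchangeable, and the quantile is taken as an order statistic of the scores.

```r
set.seed(123)
labels <- c("T (CD4+)", "T (CD8+)", "B", "NK")
K <- length(labels)
n_cal <- 500
n_test <- 1000
alpha <- 0.1

# Hypothetical helper: class probabilities as a row-wise softmax of noise
simulate_probs <- function(n) {
  z <- matrix(rnorm(n * K, sd = 2), ncol = K)
  p <- exp(z - apply(z, 1, max))
  p <- p / rowSums(p)
  colnames(p) <- labels
  p
}

# Hypothetical helper: draw one label per row from its probabilities
draw_labels <- function(p) {
  labels[apply(p, 1, function(pr) sample.int(K, size = 1, prob = pr))]
}

p_cal <- simulate_probs(n_cal)
y_cal <- draw_labels(p_cal)
p_test <- simulate_probs(n_test)
y_test <- draw_labels(p_test)

# Step 1: conformal scores, one minus the probability of the true class
scores <- 1 - p_cal[cbind(seq_len(n_cal), match(y_cal, labels))]

# Step 2: the ceiling((1 - alpha)(n + 1)) / n empirical quantile,
# computed as the k-th smallest score (capped at n)
k <- min(ceiling((1 - alpha) * (n_cal + 1)), n_cal)
q_hat <- sort(scores)[k]

# Step 3: keep every class whose probability is at least 1 - q_hat
pred_sets <- lapply(seq_len(n_test), function(i) {
  labels[p_test[i, ] >= 1 - q_hat]
})

# Empirical coverage on the test data, expected to be close to 1 - alpha
coverage <- mean(mapply(function(s, y) y %in% s, pred_sets, y_test))
```

Using an order statistic avoids the interpolation that some `quantile()` types apply, which would otherwise shift the threshold slightly.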
Let $\hat{y}(x)$ be the class with maximum estimated probability. Moreover, given a directed graph, let $\mathcal{P}(v)$ and $\mathcal{A}(v)$ be the sets of children nodes and ancestor nodes of $v$, respectively. Finally, for each node $v$ define a score $g(v, x)$ as the sum of the predicted probabilities of the leaf nodes that are children of $v$.
To build the sets we propose the following algorithm:

$$C_\lambda(x) = \mathcal{P}(v) \cup \{\mathcal{P}(a) : a \in \mathcal{A}(\hat{y}(x)),\ g(a, x) < \lambda\}$$

where $v = \arg\min_{u \in \mathcal{A}(\hat{y}(x)) :\, g(u, x) \geq \lambda} g(u, x)$.
In words, we start from the predicted class and go up in the graph until we find an ancestor of $\hat{y}(x)$ whose score is at least $\lambda$, and we include all its children in the prediction set. For theoretical reasons, we also add all the other subgraphs that contain $\hat{y}(x)$ and whose score is less than $\lambda$. To choose $\lambda$, we follow eq. (4) in Angelopoulos et al. (2023), considering the miscoverage as loss function. In this way, it is still guaranteed that

$$P(Y_{n+1} \notin C_\lambda(X_{n+1})) \leq \alpha$$
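The "go up until the score is large enough" step can be sketched in base R for a single cell and a fixed threshold (called `lambda` in the code). Everything here is an assumption made for illustration: the ontology is a small tree stored in a hypothetical `parent` lookup vector, the predicted leaf itself is treated as the first candidate node, and the helpers `path_to_root()`, `leaves_under()`, `score()` and `hier_set()` are not package functions. In a tree the scores cannot decrease along the path to the root, so the smaller subgraphs containing the predicted class are already included in the returned set.

```r
# Toy ontology as a child -> parent lookup (hypothetical representation)
parent <- c(
  dog = "animal", cat = "animal",
  british = "cat", persian = "cat",
  cocker = "dog", retriever = "dog",
  golden = "retriever", labrador = "retriever"
)
leaves <- setdiff(names(parent), parent)

# Path from a node up to the root, starting with the node itself
path_to_root <- function(v) {
  path <- v
  while (v %in% names(parent)) {
    v <- parent[[v]]
    path <- c(path, v)
  }
  path
}

# Leaf labels that sit below (or coincide with) node v
leaves_under <- function(v) {
  leaves[vapply(leaves, function(l) v %in% path_to_root(l), logical(1))]
}

# Score of a node: sum of the predicted probabilities of its leaves
score <- function(v, probs) sum(probs[leaves_under(v)])

# Climb from the predicted leaf until the score reaches lambda,
# then return all leaves below that node
hier_set <- function(probs, lambda) {
  y_hat <- names(probs)[which.max(probs)]
  for (v in path_to_root(y_hat)) {
    if (score(v, probs) >= lambda) {
      return(leaves_under(v))
    }
  }
  leaves # fallback: no node reaches lambda, return every leaf
}

probs <- c(golden = 0.5, labrador = 0.3, cocker = 0.1,
           british = 0.05, persian = 0.05)
hier_set(probs, lambda = 0.7)  # leaves under "retriever"
hier_set(probs, lambda = 0.85) # leaves under "dog"
```

Larger values of `lambda` can only move the stopping node closer to the root, so the resulting sets are nested in `lambda`, which is the property the risk-control calibration relies on.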
The construction described above corresponds to the default choice
method = "full". Other values of method implement alternative
nested prediction-set constructions that incorporate the ontology structure
in different ways. All methods are calibrated using the same conformal
risk-control procedure to select the threshold parameter $\lambda$.
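The calibration step can be sketched as follows. Conformal risk control picks the smallest threshold on the grid whose adjusted empirical risk, $\frac{n}{n+1}\hat{R}_n(\lambda) + \frac{B}{n+1}$, is at most $\alpha$, with $B = 1$ for the miscoverage loss. To keep the example short and self-contained, it uses a simple family of nested sets, $\{y : \hat{f}(x)_y \geq 1 - \lambda\}$, in place of the ontology-based sets; the simulated data, the variable names, and the fallback to the largest grid value are assumptions of this sketch, not package behavior.

```r
set.seed(42)
labels <- c("A", "B", "C", "D")
K <- length(labels)
n <- 500
alpha <- 0.1
lambdas <- seq(0.001, 0.999, length.out = 100)

# Simulated calibration probabilities (row-wise softmax) and labels
z <- matrix(rnorm(n * K, sd = 2), ncol = K)
p_cal <- exp(z - apply(z, 1, max))
p_cal <- p_cal / rowSums(p_cal)
y_idx <- apply(p_cal, 1, function(pr) sample.int(K, size = 1, prob = pr))

# Estimated probability assigned to the true class of each calibration cell
p_true <- p_cal[cbind(seq_len(n), y_idx)]

# Empirical miscoverage of the nested sets {y : p_y >= 1 - lambda}:
# the true class is missed exactly when its probability is below 1 - lambda
emp_risk <- vapply(lambdas, function(l) mean(p_true < 1 - l), numeric(1))

# Adjusted risk with loss bounded by B = 1
bound <- n / (n + 1) * emp_risk + 1 / (n + 1)

# Smallest lambda on the grid that satisfies the bound
ok <- which(bound <= alpha)
lambda_hat <- if (length(ok) > 0) lambdas[min(ok)] else max(lambdas)
```

Because the sets grow with the threshold, the empirical risk is non-increasing along the grid, so taking the first grid value that satisfies the bound is enough.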
return_sc = TRUE: A SingleCellExperiment or SpatialExperiment object with the prediction sets stored in the colData. The name of the corresponding variable is given by pr_name.
return_sc = FALSE: A list of length equal to the number of cells in the query data. Each element contains the prediction set for one cell.
Corbetta, D. et al. Conformal inference for cell type annotation with graph-structured constraints. arXiv preprint arXiv:2410.23786.
Angelopoulos, A. N. and Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511.
Angelopoulos, A. N. et al. Conformal risk control. arXiv preprint arXiv:2208.02814.
    # random p matrix
    set.seed(1040)
    p <- matrix(rnorm(2000 * 4), ncol = 4)
    # Normalize the matrix p to have all numbers between 0 and 1 that sum to 1
    # by row
    p <- exp(p - apply(p, 1, max))
    p <- p / rowSums(p)
    cell_types <- c("T (CD4+)", "T (CD8+)", "B", "NK")
    colnames(p) <- cell_types

    # Take 1000 rows as calibration and 1000 as test
    p_cal <- p[1:1000, ]
    p_test <- p[1001:2000, ]

    # Randomly create the vector of real cell types for p_cal and p_test
    y_cal <- sample(cell_types, 1000, replace = TRUE)
    y_test <- sample(cell_types, 1000, replace = TRUE)

    # Obtain conformal prediction sets
    conf_sets <- getPredictionSets(
        x_query = p_test,
        x_cal = p_cal,
        y_cal = y_cal,
        onto = NULL,
        alpha = 0.1,
        follow_ontology = FALSE,
        resample = FALSE,
        labels = cell_types,
        return_sc = FALSE
    )
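With `return_sc = FALSE` the result is a list with one prediction set per cell, so it can be summarized with base R. The helper below is a hypothetical sketch (not part of the package) that computes the empirical coverage and the average set size; it is shown on a small hand-made list, so `toy_sets` and `toy_truth` are illustrative values only.

```r
# Hypothetical helper: coverage and mean size of a list of prediction sets
summarize_sets <- function(sets, truth) {
  stopifnot(length(sets) == length(truth))
  covered <- mapply(function(s, y) y %in% s, sets, truth)
  c(coverage = mean(covered), mean_size = mean(lengths(sets)))
}

# Hand-made prediction sets and true labels for four cells
toy_sets <- list(c("B", "NK"), "B", c("T (CD4+)", "T (CD8+)"), "NK")
toy_truth <- c("B", "NK", "T (CD8+)", "NK")

res <- summarize_sets(toy_sets, toy_truth)
res
# coverage 0.75, mean_size 1.5
```

The same call could be applied to a list such as `conf_sets` together with the corresponding vector of true labels, when those are available.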
This function takes as input a prediction set and an ontology and plots the ontology, highlighting the labels included in the set.
    plotResult(
        pred_set,
        onto,
        probs = NULL,
        col_grad = c("lemonchiffon", "orange", "darkred"),
        attrs = NULL,
        k = 4,
        title = NULL,
        add_scores = TRUE,
        ...
    )
pred_set |
character vector containing the labels in the prediction set |
onto |
an |
probs |
numeric vector of estimated probabilities for the classes. The
names of |
col_grad |
character vector of colors used to highlight the classes. If
|
attrs |
list of additional graphical attributes passed to
|
k |
integer number of decimal digits to consider in |
title |
title of the plot |
add_scores |
Logical. If |
... |
additional graphical parameters passed to |
A plot of the ontology with the labels in the prediction set highlighted.
    library(igraph)

    # Let's build a random ontology
    onto <- graph_from_literal(
        animal -+dog:cat,
        cat -+british:persian,
        dog -+cocker:retriever,
        retriever -+golden:labrador
    )

    # Let's consider this prediction set
    pred_set <- c("golden", "labrador", "cocker")
    plotResult(pred_set, onto,
        col_grad = "pink", add_scores = FALSE,
        title = "Prediction set"
    )