library(knitr)
library(ClustIRR)
library(igraph)
library(ggplot2)
library(ggrepel)
library(patchwork)
Adaptive immunity relies on diverse immune receptor repertoires (IRRs: B- and T-cell receptor repertoires) to protect the host against genetically diverse and rapidly evolving pathogens, such as viruses, bacteria, and cancers. The sequence diversity of B- and T-cell receptors (BCRs and TCRs) originates, in part, from V(D)J recombination, a process in which different germline-encoded genes are joined to form unique immune receptors. As a result, practically every newly formed naive mature T and B cell is equipped with a distinct immune receptor (IR), enabling them to recognize unique sets of antigens.
B cells bind antigens directly through the complementarity-determining regions (CDRs) of their BCRs, while T cells recognize antigenic peptides presented by major histocompatibility complex (MHC) molecules via the CDRs of their TCRs. Antigen recognition can lead to B/T cell activation, causing these cells to rapidly proliferate and form antigen-specific clones capable of mounting an effective immune response.
Recent studies have shown that sequence similarity between TCRs often indicates shared antigen specificity. Therefore, by clustering TCR sequences from high-throughput sequencing (HT-seq) data, we can identify communities of TCRs with shared antigen specificity. By tracking the dynamics of these communities over time or across biological conditions, we may learn how our immune system responds to e.g. cancer immunotherapies, vaccines, and antiviral drugs, which can help us improve these treatments.
This vignette introduces ClustIRR, a computational method that detects communities of immune receptors with similar specificity and employs Bayesian models to quantify differential community occupancy between IRRs from different biological conditions (e.g. before and after cancer treatment).
ClustIRR is freely available as part of Bioconductor, filling the gap that currently exists in terms of software for quantitative analysis of IRRs.
To install ClustIRR please start R and enter:
The main input of ClustIRR
is an IRR (s
), which should be provided as data.frame. The
rows in the data.frame correspond to clones (clone =
group of cells derived from a common parent cell by clonal expansion).
We use the following data from each clone:
In a typical scenario, the user will have more than one IRR (see workflow). For instance, the user will analyze longitudinal IRR data, i.e., two or three IRRs taken at different time points; or across different biological conditions.
Let’s have a look at an example IRR: two TCRαβ repertoires a and b.
set.seed(127)
n <- 300
# get 300 CDR3a and CDR3b pairs from the data -> IRR a
a <- data.frame(CDR3a = CDR3ab$CDR3a[1:n],
CDR3b = CDR3ab$CDR3b[1:n],
clone_size=rpois(n = n, lambda = 3)+1,
sample = "a")
# get 200 CDR3a and CDR3b pairs from the data -> IRR b
b <- data.frame(CDR3a = CDR3ab$CDR3a[101:(n+100)],
CDR3b = CDR3ab$CDR3b[101:(n+100)],
clone_size=rpois(n = n, lambda = 3)+1,
sample = "b")
'data.frame': 300 obs. of 4 variables:
$ CDR3a : chr "CASSLRGAHNEQFF" "CASGLRQGYGYTF" "CSAGGFRETQYF" "CGSSLSQGSEQYF" ...
$ CDR3b : chr "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
$ clone_size: num 3 2 3 2 3 5 4 3 5 5 ...
$ sample : chr "a" "a" "a" "a" ...
'data.frame': 300 obs. of 4 variables:
$ CDR3a : chr "CASSRLEGSDTQYF" "CASTEYGNTIYF" "CAWSVLSDTQYF" "CSVERDRADGYTF" ...
$ CDR3b : chr "CSARGRTSVNEQFF" "CSVAEGLGTEAFF" "CVSSSTGNNQPQHF" "CATSRPGGEWETQYF" ...
$ clone_size: num 2 4 4 5 3 4 9 3 5 4 ...
$ sample : chr "b" "b" "b" "b" ...
cluster_irr
IRRs, such as T-cell receptor repertoires, are made up of T-cells which are distributed over T-cell clones. TCR clones with identical pairs of CDR3α and CDR3β sequences most likely recognize the same sets of antigens. Meanwhile, TCR clones with similar pairs of CDR3α and CDR3β sequences may also share common specificity. ClustIRR aims to quantify the similarity between pairs of TCR clones based on the similarities of their CDR3s sequences.
How to compute a similarity score between a pair of CDR3 sequences?
Pair of sequences, a and b, are aligned with the Needleman-Wunsch algorithm. The output is an alignment score (ω). Identical or similar CDR3 sequence pairs get a large positive ω, and dissimilar CDR3 sequence pairs get a low (or even negative) ω.
To make sure that ω is comparable across pairs of CDR3s with different lengths, ClustIRR divides (normalizes) ω by the length of the longest CDR3 sequences in each pair: where |a| and |b| are the lengths of CDR3 sequences a and b; and ω̄ is the normalized alignment score.
The CDR3 cores, which represent the central parts of the
CDR3 loop and tend to have high probability of making a contact with the
antigen, are also compared. ClustIRR
constructs the CDR3 cores by trimming few residues (defined by
control$trim_flanks
) from either end of each CDR3
sequences. These are then aligned and scored based on the same
algorithm, yielding for each pair of CDR3 cores a normalized alignment
scores ω̄c.
This strategy is computationally very expensive!
For large IRRs with n > 106 this algorithm
requires significant computational resources. To mitigate this
challenge, we employ a screening step in which dissimilar sequences
pairs are flagged. In short, each CDR3 is used as a query in a
fast protein-BLAST search as implemented in the
R-package blaster, while the remaining CDR3s are considered as a
database of amino acid sequences against which the query is compared.
CDR3 sequences which share at least 70% sequence identity (user
parameter control$gmi
) with the query are selected, and
only these are aligned with query CDR3. For the remaining CDR3 pairs we
assume ω̄ = 0.
Step 1. involves calling the function clust_irr
which
returns an S4 object of class clust_irr
.
Next, we show the chain-specific similarity scores between CDR3s sequences. Each row is a pair of CDR3 sequences from the repertoire. For each pair we have the following metrics:
max_len
: length of the longer CDR3 sequence in the
pairmax_clen
: length of the longer CDR3 core sequence in
the pairweight
: ω =
BLOSUM62 score of the complete CDR3 alignmentcweight
= ωc: BLOSUM62
score of CDR3 coresnweight
= ω̄:
normalized weight
by max_len
ncweight
= ω̄c: normalized
cweight
by max_clen
The results for CDR3a:
from_cdr3 | to_cdr3 | weight | cweight | nweight | ncweight | max_len | max_clen |
---|---|---|---|---|---|---|---|
CASSLEADHEKLFF | CASSVGEATNEKLFF | 99 | 42 | 6.60 | 4.67 | 15 | 9 |
CASSPRREHDTEAFF | CASSPGRSTEAFF | 91 | 34 | 6.07 | 3.78 | 15 | 9 |
CASSPTSYRAGASYEQYF | CASSPTSGGLTYEQYF | 119 | 60 | 6.61 | 5.00 | 18 | 12 |
CASHPDNTDTQYF | CASTGTNPTDTQYF | 90 | 31 | 6.43 | 3.88 | 14 | 8 |
CASSLYVRAARYNEQFF | CASSLGLADYNEQFF | 102 | 44 | 6.00 | 4.00 | 17 | 11 |
CASRRDTDNSPLHF | CASSEDGGDNSPLHF | 103 | 44 | 6.87 | 4.89 | 15 | 9 |
… the same table as CDR3b sequence pairs:
from_cdr3 | to_cdr3 | weight | cweight | nweight | ncweight | max_len | max_clen |
---|---|---|---|---|---|---|---|
CASSLEADHEKLFF | CASSVGEATNEKLFF | 99 | 42 | 6.60 | 4.67 | 15 | 9 |
CASSPRREHDTEAFF | CASSPGRSTEAFF | 91 | 34 | 6.07 | 3.78 | 15 | 9 |
CASSPTSYRAGASYEQYF | CASSPTSGGLTYEQYF | 119 | 60 | 6.61 | 5.00 | 18 | 12 |
CASHPDNTDTQYF | CASTGTNPTDTQYF | 90 | 31 | 6.43 | 3.88 | 14 | 8 |
CASSLYVRAARYNEQFF | CASSLGLADYNEQFF | 102 | 44 | 6.00 | 4.00 | 17 | 11 |
CASRRDTDNSPLHF | CASSEDGGDNSPLHF | 103 | 44 | 6.87 | 4.89 | 15 | 9 |
The function clust_irr
performs automatic annotation of
TCR clones based on databases (DBs) including: VDJdb, TCR3d, McPAS-TCR.
The control parameter control$db_edit=0
(default) controls
an edit distance threshold. If the edit distance between an input CDR3
and a DB CDR3 sequence is smaller then or equal to
control$db_edit
, then the input CDR3s inherits the antigen
specificity data of the DB CDR3s.
To access these annotations see:
# control = list(gmi = 0.7, trim_flank_aa = 3, db_dist = 0, db_custom = NULL)
kable(head(get_clustirr_inputs(c_a)$s), digits = 2)
CDR3a | CDR3b | clone_size | sample | id | db_vdjdb_CDR3b | info_vdjdb_CDR3b | db_vdjdb_CDR3a | info_vdjdb_CDR3a | db_mcpas_CDR3b | info_mcpas_CDR3b | db_mcpas_CDR3a | info_mcpas_CDR3a | db_tcr3d_CDR3b | info_tcr3d_CDR3b | db_tcr3d_CDR3a | info_tcr3d_CDR3a | Ag_species | Ag_gene |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CASSLRGAHNEQFF | CASTVTSGSNEKLFF | 3 | a | 1 | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | 0 | ||||||
CASGLRQGYGYTF | CASSLTGTGYTF | 2 | a | 2 | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | 0 | ||||||
CSAGGFRETQYF | CSALTPGLIYNEQFF | 3 | a | 3 | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | 0 | ||||||
CGSSLSQGSEQYF | CSARASWGTNEKLFF | 2 | a | 4 | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | 0 | ||||||
CSPRGPYGYTF | CASSVREAENQPQHF | 3 | a | 5 | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | 0 | ||||||
CSLGPSKSQYF | CASCTVGNQPQHF | 5 | a | 6 | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | 0 |
get_graph
or get_joint_graph
Next, ClustIRR
builds a graph. If we analyze one IRR, then we may employ the function
get_graph
, which converts the clust_irr
object
into an igraph
object. Meanwhile, if we are analyzing two
ore more IRRs, then we can use the function get_joint_graph
to generate a joint igraph
object. In this case, edges
between TCR clones from different IRRs are computed using the same
procedure outlined in step 1.
The graphs have nodes and weighted edges:
The graph is an igraph
object. We can use the
igraph
functions to inspect different properties of the
graph, such as, the distribution of edge weights (shown below). Notice,
that the edge weights vary drastically between the edges.
from | to | weight | cweight | nweight | ncweight | max_len | max_clen | type | chain | clustering |
---|---|---|---|---|---|---|---|---|---|---|
a|39 | a|69 | 105 | 48 | 8.08 | 6.86 | 13 | 7 | within-repertoire | CDR3b | global |
a|117 | a|211 | 108 | 49 | 7.71 | 6.12 | 14 | 8 | within-repertoire | CDR3b | global |
a|133 | a|196 | 114 | 55 | 6.71 | 5.00 | 17 | 11 | within-repertoire | CDR3b | global |
a|35 | a|212 | 117 | 58 | 7.80 | 6.44 | 15 | 9 | within-repertoire | CDR3b | global |
a|196 | a|212 | 110 | 51 | 6.88 | 5.10 | 16 | 10 | within-repertoire | CDR3b | global |
a|76 | a|187 | 114 | 57 | 7.60 | 6.33 | 15 | 9 | within-repertoire | CDR3b | global |
Below we show the distributions of the edge attributes
ncweight
and nweight
between CDR3α and CDR3β sequence pairs in the IRR.
ggplot(data = e)+
geom_density(aes(ncweight, col = chain))+
geom_density(aes(nweight, col = chain), linetype = "dashed")+
theme_bw()+
xlab(label = "edge weight (solid = ncweight, dashed = nweight)")+
theme(legend.position = "top")
Here we have two IRRs. We can use the function by
get_joint_graph
to create a joint graph. This function
computes edges between the TCR clones from the different IRRs (as
described in step 1.). We do this in the following code blocks.
ClustIRR employs graph-based community detection (GCD) algorithms, such as Louvain or Leiden, to identify densely connected communities.
But first, we must decide how to compute a similarity between two nodes, i and j, (e.g. TCR clones) based on the similarity scores between their CDR3 sequences (compute in step 1.). We will refer to this metric as ω(i, j).
If the data contains CDR3 sequences from only one chain, such as
CDR3β, then ω(i, j) is defined
as The user can decide among the two definitions by specifying
weight
= “ncweight” (default; $\omega(i,j)=\bar{\omega_c}$) or
weight
= “nweight” (default; ω(i, j) = ω̄).
To compute the similarity score between TCR clones, i and j, we compute the average alignment score from their CDR3α and CDR3β alignment scores (in the next, I will use TCRαβ as an example, however, this approach can also be used to compare TCRγδ or BCRIgH-IgL clones): where ω̄α and ω̄β are the alignment scores for the CDR3α and CDR3β sequences, respectively; and ω̄cα and ω̄cβ are the alignment scores for the CDR3α and CDR3β cores, respectively. Based on this metric, the contributions of CDR3α and CDR3β towards the overall similarity of the TCR clones are assigned equal weights.
ClustIRR provides two additional metrics for computing similarity scores between TCR clones, including a strict metric, which assigns high similarity score to a pair of TCR clones only if both of their CDR3α and CDR3β sequence pairs are similar and a loose metric, which assigns high similarity score to a pair of TCR clones if either of their CDR3α and CDR3β sequence pairs are similar
The user has the following options:
algorithm
: “leiden” (default) or “louvain”resolution
: GCD resolution = 1 (default)weight
: “ncweight” (default) or “nweight”metric
: “average” (default), “strict” or “loose”chains
: “CDR3a” or “CDR3b” or c(“CDR3a”, “CDR3b”)gcd <- detect_communities(graph = g$graph,
algorithm = "leiden",
resolution = 1,
weight = "ncweight",
metric = "average",
chains = c("CDR3a", "CDR3b"))
The function detect_communities
generates a complex
output. Lets investigate its elements:
[1] "community_occupancy_matrix" "community_summary"
[3] "node_summary" "graph"
[5] "input_config"
The main element is community_occupancy_matrix
, which
contains the number of T-cells in each community (row) and IRR (column).
Here we have two IRRs (two columns) and about 300 communities. This
matrix is the main input of the function dco
(step 4.), to
detect differences in the community occupancy between IRRs.
[1] 219 2
a b
1 11 0
2 2 0
3 3 0
4 2 0
5 22 17
6 5 0
plot(x = gcd$community_occupancy_matrix[, "a"],
y = gcd$community_occupancy_matrix[, "b"],
xlab = "community occupancy [cells in IRR a]",
ylab = "community occupancy [cells in IRR b]")
Also see community_summary
. Here we provide
community-specific summaries, as rows, including:
clones_a
, clone_b
, clones_n
:
the frequency of clones in the community coming from IRR a, b and in
total (n)cells_a
, cells_b
, cells_n
:
the frequency of cell in the community coming from IRR a, b and in total
(n)w
: the mean inter-clone similarity (ω(i, j))w_CDR3a
, w_CDR3b
: the contributions of
CDR3α and CDR3β to w
n_CDR3a
, n_CDR3b
: number of edges between
CDR3α and CDR3β sequencescommunity | clones_a | clones_b | clones_n | cells_a | cells_b | cells_n | w | w_CDR3a | w_CDR3b | n_CDR3a | n_CDR3b |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 0 | 3 | 11 | 0 | 11 | 3.12 | 6.25 | 0.00 | 2 | 0 |
2 | 1 | 0 | 1 | 2 | 0 | 2 | 0.00 | 0.00 | 0.00 | 0 | 0 |
3 | 1 | 0 | 1 | 3 | 0 | 3 | 0.00 | 0.00 | 0.00 | 0 | 0 |
4 | 1 | 0 | 1 | 2 | 0 | 2 | 0.00 | 0.00 | 0.00 | 0 | 0 |
5 | 4 | 4 | 8 | 22 | 17 | 39 | 4.15 | 1.41 | 6.88 | 3 | 20 |
6 | 1 | 0 | 1 | 5 | 0 | 5 | 0.00 | 0.00 | 0.00 | 0 | 0 |
What is the contribution of CDR3a and CDR3b weights to the formation of communities?
Notice the big dot at coordinatex 0,0. Communities made up from a single node have no within-community edges →
w_CDR3a
= 0w_CDR3b
= 0w
= 0ggplot(data = gcd$community_summary)+
geom_point(aes(x = w_CDR3a, y = w_CDR3b, size = cells_n), shape=21)+
xlab(label = "CDR3a ncweight")+
ylab(label = "CDR3b ncweight")+
scale_size_continuous(range = c(0.5, 5))+
theme_bw(base_size = 10)
Node-specific (TCR clone-specific) summaries are provided in
node_summary
name | clone_id | Ag_gene | Ag_species | info_tcr3d_CDR3a | db_tcr3d_CDR3a | info_tcr3d_CDR3b | db_tcr3d_CDR3b | info_mcpas_CDR3a | db_mcpas_CDR3a | info_mcpas_CDR3b | db_mcpas_CDR3b | info_vdjdb_CDR3a | db_vdjdb_CDR3a | info_vdjdb_CDR3b | db_vdjdb_CDR3b | sample | clone_size | CDR3b | CDR3a | community | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a|1 | a|1 | 1 | 0 | 0 | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | a | 3 | CASTVTSGSNEKLFF | CASSLRGAHNEQFF | 1 | ||||||
a|2 | a|2 | 2 | 0 | 0 | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | a | 2 | CASSLTGTGYTF | CASGLRQGYGYTF | 2 | ||||||
a|3 | a|3 | 3 | 0 | 0 | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | a | 3 | CSALTPGLIYNEQFF | CSAGGFRETQYF | 3 | ||||||
a|4 | a|4 | 4 | 0 | 0 | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | a | 2 | CSARASWGTNEKLFF | CGSSLSQGSEQYF | 4 | ||||||
a|5 | a|5 | 5 | 0 | 0 | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | a | 3 | CASSVREAENQPQHF | CSPRGPYGYTF | 5 | ||||||
a|6 | a|6 | 6 | 0 | 0 | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | a | 5 | CASCTVGNQPQHF | CSLGPSKSQYF | 6 |
Do we see growing or shrinking communities in a given IRRs?
We employ a Bayesian model to quantify the relative abundance (occupancy) of individual communities in each IRR (minimum number of IRRs = 2).
For DCO analysis of two IRRs The model output is the parameter δ = δ1, δ2, …, δk, where k is the number of communities. Growing community i between IRR a vs. IRR b, results in δi > 0, shrinking community i results in δi < 0.
For DCO analysis of more than two IRRs The model output for IRR a is the parameter vector βa = β1a, β2a, …, βka, which describes the effect of IRR a on the relative occupancy in each community.
Given two IRRs, a and b, we can quantify the differential community occupancy (DCO):
d <- dco(community_occupancy_matrix = gcd$community_occupancy_matrix,
mcmc_control = list(mcmc_warmup = 500,
mcmc_iter = 1500,
mcmc_chains = 3,
mcmc_cores = 1,
mcmc_algorithm = "NUTS",
adapt_delta = 0.9,
max_treedepth = 10))
SAMPLING FOR MODEL 'dm' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.00019 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.9 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 1500 [ 0%] (Warmup)
Chain 1: Iteration: 150 / 1500 [ 10%] (Warmup)
Chain 1: Iteration: 300 / 1500 [ 20%] (Warmup)
Chain 1: Iteration: 450 / 1500 [ 30%] (Warmup)
Chain 1: Iteration: 501 / 1500 [ 33%] (Sampling)
Chain 1: Iteration: 650 / 1500 [ 43%] (Sampling)
Chain 1: Iteration: 800 / 1500 [ 53%] (Sampling)
Chain 1: Iteration: 950 / 1500 [ 63%] (Sampling)
Chain 1: Iteration: 1100 / 1500 [ 73%] (Sampling)
Chain 1: Iteration: 1250 / 1500 [ 83%] (Sampling)
Chain 1: Iteration: 1400 / 1500 [ 93%] (Sampling)
Chain 1: Iteration: 1500 / 1500 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 4.051 seconds (Warm-up)
Chain 1: 6.569 seconds (Sampling)
Chain 1: 10.62 seconds (Total)
Chain 1:
SAMPLING FOR MODEL 'dm' NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 9.7e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.97 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration: 1 / 1500 [ 0%] (Warmup)
Chain 2: Iteration: 150 / 1500 [ 10%] (Warmup)
Chain 2: Iteration: 300 / 1500 [ 20%] (Warmup)
Chain 2: Iteration: 450 / 1500 [ 30%] (Warmup)
Chain 2: Iteration: 501 / 1500 [ 33%] (Sampling)
Chain 2: Iteration: 650 / 1500 [ 43%] (Sampling)
Chain 2: Iteration: 800 / 1500 [ 53%] (Sampling)
Chain 2: Iteration: 950 / 1500 [ 63%] (Sampling)
Chain 2: Iteration: 1100 / 1500 [ 73%] (Sampling)
Chain 2: Iteration: 1250 / 1500 [ 83%] (Sampling)
Chain 2: Iteration: 1400 / 1500 [ 93%] (Sampling)
Chain 2: Iteration: 1500 / 1500 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 3.714 seconds (Warm-up)
Chain 2: 6.487 seconds (Sampling)
Chain 2: 10.201 seconds (Total)
Chain 2:
SAMPLING FOR MODEL 'dm' NOW (CHAIN 3).
Chain 3:
Chain 3: Gradient evaluation took 9.7e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.97 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3:
Chain 3:
Chain 3: Iteration: 1 / 1500 [ 0%] (Warmup)
Chain 3: Iteration: 150 / 1500 [ 10%] (Warmup)
Chain 3: Iteration: 300 / 1500 [ 20%] (Warmup)
Chain 3: Iteration: 450 / 1500 [ 30%] (Warmup)
Chain 3: Iteration: 501 / 1500 [ 33%] (Sampling)
Chain 3: Iteration: 650 / 1500 [ 43%] (Sampling)
Chain 3: Iteration: 800 / 1500 [ 53%] (Sampling)
Chain 3: Iteration: 950 / 1500 [ 63%] (Sampling)
Chain 3: Iteration: 1100 / 1500 [ 73%] (Sampling)
Chain 3: Iteration: 1250 / 1500 [ 83%] (Sampling)
Chain 3: Iteration: 1400 / 1500 [ 93%] (Sampling)
Chain 3: Iteration: 1500 / 1500 [100%] (Sampling)
Chain 3:
Chain 3: Elapsed Time: 3.652 seconds (Warm-up)
Chain 3: 6.468 seconds (Sampling)
Chain 3: 10.12 seconds (Total)
Chain 3:
Before we can start interpreting the model results, we have to make sure that the model is valid. One standard approach is to check whether our model can retrodict the observed data (community occupancy matrix) which was used to fit model parameters.
General idea of posterior predictive checks:
ClustIRR provides y and ŷ of each IRR, which we can visualize with ggplot:
ggplot(data = d$posterior_summary$y_hat)+
facet_wrap(facets = ~sample, nrow = 1, scales = "free")+
geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "gray")+
geom_errorbar(aes(x = y_obs, y = mean, ymin = L95, ymax = H95),
col = "darkgray", width=0)+
geom_point(aes(x = y_obs, y = mean), size = 0.8)+
xlab(label = "observed y")+
ylab(label = "predicted y (and 95% HDI)")+
theme_bw(base_size = 10)
We can compare DAC in two directions: a vs. b or b vs. a.
Two different parameters δ and ϵ are available. δ is the difference between the different samples β parameter for each community. It can be interpreted as the effect of each community, regardless of it’s cell count. This can be useful to detect changes in rare clonotypes with low cell counts. δ has a range of −∞ to +∞.
ggplot(data = d$posterior_summary$delta)+
facet_wrap(facets = ~contrast, ncol = 1)+
geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95),
col = "darkgray", width =0)+
geom_point(aes(x = community, y = mean), size = 0.5)+
theme_bw(base_size = 10)+
theme(legend.position = "none")+
ylab(label = expression(delta))+
scale_x_continuous(expand = c(0,0))
ϵ is the difference between the different samples regenerated multinomial probability p for each community. It can be interpreted as the effect of each community, relative to the different sample and community sizes. This can be useful to detect medium to big effects in a concise way. ϵ has a range of −1 to +1.
ggplot(data = d$posterior_summary$epsilon)+
facet_wrap(facets = ~contrast, ncol = 1)+
geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95),
col = "darkgray", width =0)+
geom_point(aes(x = community, y = mean), size = 0.5)+
theme_bw(base_size = 10)+
theme(legend.position = "none")+
ylab(label = expression(epsilon))+
scale_x_continuous(expand = c(0,0))
Distribution of mean δs
ggplot(data = d$posterior_summary$delta)+
facet_wrap(facets = ~contrast, ncol = 1)+
geom_histogram(aes(mean), bins = 100)+
xlab(label = expression(delta))+
theme_bw(base_size = 10)
Distribution of mean ϵs
The function dco
takes as its main input a community
occupancy matrix. This enables users who are accustomed to using
complementary algorithm for detecting specificity groups, such as,
GLIPH, TCRdist3, GIANA, and iSMART, to skip steps 1-3 of the ClustIRR
workflow, and to proceed with analysis for DCO.
Imagine that we have three IRRs, a, b and c, obtained from one patient at three timepoints.
# repertoire size
n <- 200
# a
a <- data.frame(CDR3a = CDR3ab$CDR3a[1:n],
CDR3b = CDR3ab$CDR3b[1:n],
sample = "a")
# b
b <- data.frame(CDR3a = CDR3ab$CDR3a[1:n],
CDR3b = CDR3ab$CDR3b[1:n],
sample = "b")
# c
c <- data.frame(CDR3a = CDR3ab$CDR3a[1:n],
CDR3b = CDR3ab$CDR3b[1:n],
sample = "c")
get_clonal_expansion <- function(n, p_expanded) {
s <- sample(x = c(0, 1), size = n, prob = c(1-p_expanded,
p_expanded), replace = T)
y <- vapply(X = s, FUN.VALUE = numeric(1), FUN = function(x) {
if(x == 0) {
return(rpois(n = 1, lambda = 0.5))
}
return(rpois(n = 1, lambda = 50))
})
return(y)
}
# simulate expansion of specific communities
set.seed(1243)
clone_size <- rpois(n = n, lambda = 3)+1
expansion_factor <- get_clonal_expansion(n = n, p_expanded = 0.02)
a$clone_size <- clone_size
b$clone_size <- clone_size+expansion_factor*1
c$clone_size <- clone_size+expansion_factor*2
cluster_irr
analyzed each TCRs
repertoireget_graph
and
plot_graph
visualize specificity structuresWe can also plot a graph of the global specificity structure in TCR repertoire a, b and c.
detect_communities
identifies
communities in the graphAre there densely connected sets of nodes (=communities) in this graph?
To answer this question we can use graph-based community detection
(GCD) algorithms, such as Leiden or Louvain. As input for GCD we can use
nweight
or ncweight
(default) between CDR3a,
CDR3b or both CDR3a and CDR3b.
gcd <- detect_communities(graph = g$graph,
weight = "ncweight",
algorithm = "leiden",
resolution = 1,
chains = c("CDR3a", "CDR3b"))
How many cells in each community from the three IRRs?
(ggplot(data = gcd$community_summary)+
geom_point(aes(x = cells_a, y = cells_b), shape = 21)+
theme_bw(base_size = 10))|
(ggplot(data = gcd$community_summary)+
geom_point(aes(x = cells_a, y = cells_c), shape = 21)+
theme_bw(base_size = 10))|
(ggplot(data = gcd$community_summary)+
geom_point(aes(x = cells_b, y = cells_c), shape = 21)+
theme_bw(base_size = 10))+
plot_annotation(tag_level = 'A')
The number of cells in each IRR and community are stored as cells in
the matrix community_occupancy_matrix
. Rows are
communities, and columns are IRRs
a b c
1 10 12 14
2 2 3 4
3 2 2 2
4 5 5 5
5 13 14 15
6 5 5 5
dco
performs differential
community occupancy (DCO) analysisDo we see expanding or shrinking communities in a given IRRs?
We employ a Bayesian model to quantify the relative abundance (occupancy) of individual communities, and the differential community occupancy between IRRs.
d <- dco(community_occupancy_matrix = community_occupancy_matrix,
mcmc_control = list(mcmc_warmup = 300,
mcmc_iter = 900,
mcmc_chains = 3,
mcmc_cores = 1,
mcmc_algorithm = "NUTS",
adapt_delta = 0.95,
max_treedepth = 11))
SAMPLING FOR MODEL 'dm' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 8.5e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.85 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 900 [ 0%] (Warmup)
Chain 1: Iteration: 90 / 900 [ 10%] (Warmup)
Chain 1: Iteration: 180 / 900 [ 20%] (Warmup)
Chain 1: Iteration: 270 / 900 [ 30%] (Warmup)
Chain 1: Iteration: 301 / 900 [ 33%] (Sampling)
Chain 1: Iteration: 390 / 900 [ 43%] (Sampling)
Chain 1: Iteration: 480 / 900 [ 53%] (Sampling)
Chain 1: Iteration: 570 / 900 [ 63%] (Sampling)
Chain 1: Iteration: 660 / 900 [ 73%] (Sampling)
Chain 1: Iteration: 750 / 900 [ 83%] (Sampling)
Chain 1: Iteration: 840 / 900 [ 93%] (Sampling)
Chain 1: Iteration: 900 / 900 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 3.972 seconds (Warm-up)
Chain 1: 5.872 seconds (Sampling)
Chain 1: 9.844 seconds (Total)
Chain 1:
SAMPLING FOR MODEL 'dm' NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 7.5e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.75 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration: 1 / 900 [ 0%] (Warmup)
Chain 2: Iteration: 90 / 900 [ 10%] (Warmup)
Chain 2: Iteration: 180 / 900 [ 20%] (Warmup)
Chain 2: Iteration: 270 / 900 [ 30%] (Warmup)
Chain 2: Iteration: 301 / 900 [ 33%] (Sampling)
Chain 2: Iteration: 390 / 900 [ 43%] (Sampling)
Chain 2: Iteration: 480 / 900 [ 53%] (Sampling)
Chain 2: Iteration: 570 / 900 [ 63%] (Sampling)
Chain 2: Iteration: 660 / 900 [ 73%] (Sampling)
Chain 2: Iteration: 750 / 900 [ 83%] (Sampling)
Chain 2: Iteration: 840 / 900 [ 93%] (Sampling)
Chain 2: Iteration: 900 / 900 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 3.396 seconds (Warm-up)
Chain 2: 5.834 seconds (Sampling)
Chain 2: 9.23 seconds (Total)
Chain 2:
SAMPLING FOR MODEL 'dm' NOW (CHAIN 3).
Chain 3:
Chain 3: Gradient evaluation took 7.4e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.74 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3:
Chain 3:
Chain 3: Iteration: 1 / 900 [ 0%] (Warmup)
Chain 3: Iteration: 90 / 900 [ 10%] (Warmup)
Chain 3: Iteration: 180 / 900 [ 20%] (Warmup)
Chain 3: Iteration: 270 / 900 [ 30%] (Warmup)
Chain 3: Iteration: 301 / 900 [ 33%] (Sampling)
Chain 3: Iteration: 390 / 900 [ 43%] (Sampling)
Chain 3: Iteration: 480 / 900 [ 53%] (Sampling)
Chain 3: Iteration: 570 / 900 [ 63%] (Sampling)
Chain 3: Iteration: 660 / 900 [ 73%] (Sampling)
Chain 3: Iteration: 750 / 900 [ 83%] (Sampling)
Chain 3: Iteration: 840 / 900 [ 93%] (Sampling)
Chain 3: Iteration: 900 / 900 [100%] (Sampling)
Chain 3:
Chain 3: Elapsed Time: 3.979 seconds (Warm-up)
Chain 3: 5.804 seconds (Sampling)
Chain 3: 9.783 seconds (Total)
Chain 3:
Before we can start interpreting the model results, we have to make sure that the model is valid. One standard approach is to check whether our model can retrodict the observed data (community occupancy matrix) which was used to fit model parameters.
General idea of posterior predictive checks:
ClustIRR provides y and ŷ of each IRR, which we can visualize with ggplot:
ggplot(data = d$posterior_summary$y_hat)+
facet_wrap(facets = ~sample, nrow = 1, scales = "free")+
geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "gray")+
geom_errorbar(aes(x = y_obs, y = mean, ymin = L95, ymax = H95),
col = "darkgray", width=0)+
geom_point(aes(x = y_obs, y = mean), size = 0.8)+
xlab(label = "observed y")+
ylab(label = "predicted y (and 95% HDI)")+
theme_bw(base_size = 10)
Notice that some (about 2%) β coefficients are far from β = 0.
Remember: we simulated clonal expansion in 2% of the communities!
beta <- d$posterior_summary$beta
ggplot(data = beta)+
facet_wrap(facets = ~sample, ncol = 1)+
geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95,
col = sample), width = 0)+
geom_point(aes(x = community, y = mean, col = sample), size = 0.5)+
theme_bw(base_size = 10)+
theme(legend.position = "top")+
ylab(label = expression(beta))+
scale_x_continuous(expand = c(0,0))
If a given community i is differentially expressed between two AIRRs, a and b, then we may expect to see a difference in the credible values of βia and βib. We define this as δia − b.
δia − b = βia − βib
Lets look at δa − b,
δa − c
and δb − c
for different communities. This information is stored in
posterior_summary$delta
of the output. We see clear
differences (δ! = 0) for at
least 3 communities.
Remember: we simulated clonal expansion in about 2% of the communities!
delta <- d$posterior_summary$delta
delta <- delta[delta$contrast %in% c("a-b", "a-c", "b-c"), ]
ggplot(data = delta)+
facet_wrap(facets = ~contrast, ncol = 1)+
geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95), width=0)+
geom_point(aes(x = community, y = mean), size = 0.5)+
theme_bw(base_size = 10)+
theme(legend.position = "top")+
ylab(label = expression(delta))+
scale_x_continuous(expand = c(0,0))
We can also expect to see a difference in the credible values of pia and pib, which incorporates the differences between community sizes. We define this as ϵia − b.
ϵia − b = pia − pib
We see clear differences (ϵ! = 0), but only for the bigger 2
communities. This information is stored in
posterior_summary$epsilon
of the output.
epsilon <- d$posterior_summary$epsilon
epsilon <- epsilon[epsilon$contrast %in% c("a-b", "a-c", "b-c"), ]
ggplot(data = epsilon)+
facet_wrap(facets = ~contrast, ncol = 1)+
geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95), width=0)+
geom_point(aes(x = community, y = mean), size = 0.5)+
theme_bw(base_size = 10)+
theme(legend.position = "top")+
ylab(label = expression(epsilon))+
scale_x_continuous(expand = c(0,0))
We can look at the histograms of the effect size means.
ggplot(data = d$posterior_summary$delta)+
facet_wrap(facets = ~contrast, ncol = 1)+
geom_histogram(aes(mean), bins = 100)+
xlab(label = expression(delta))+
theme_bw(base_size = 10)
ggplot(data = d$posterior_summary$epsilon)+
facet_wrap(facets = ~contrast, ncol = 1)+
geom_histogram(aes(mean), bins = 100)+
xlab(label = expression(epsilon))+
theme_bw(base_size = 10)
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.3.0 ggrepel_0.9.6 ggplot2_3.5.1 igraph_2.1.2
[5] ClustIRR_1.5.24 knitr_1.49 BiocStyle_2.35.0
loaded via a namespace (and not attached):
[1] stringdist_0.9.14 gtable_0.3.6 tensorA_0.36.2.1
[4] xfun_0.49 bslib_0.8.0 QuickJSR_1.4.0
[7] htmlwidgets_1.6.4 visNetwork_2.1.2 inline_0.3.20
[10] vctrs_0.6.5 tools_4.4.2 generics_0.1.3
[13] stats4_4.4.2 curl_6.0.1 parallel_4.4.2
[16] tibble_3.2.1 pkgconfig_2.0.3 checkmate_2.3.2
[19] distributional_0.5.0 RcppParallel_5.1.9 lifecycle_1.0.4
[22] farver_2.1.2 compiler_4.4.2 stringr_1.5.1
[25] munsell_0.5.1 codetools_0.2-20 htmltools_0.5.8.1
[28] sys_3.4.3 buildtools_1.0.0 sass_0.4.9
[31] yaml_2.3.10 pillar_1.10.0 jquerylib_0.1.4
[34] cachem_1.1.0 StanHeaders_2.32.10 abind_1.4-8
[37] parallelly_1.41.0 posterior_1.6.0 rstan_2.32.6
[40] digest_0.6.37 stringi_1.8.4 future_1.34.0
[43] reshape2_1.4.4 listenv_0.9.1 labeling_0.4.3
[46] maketools_1.3.1 fastmap_1.2.0 grid_4.4.2
[49] colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3
[52] loo_2.8.0 pkgbuild_1.4.5 future.apply_1.11.3
[55] withr_3.0.2 scales_1.3.0 backports_1.5.0
[58] rmarkdown_2.29 matrixStats_1.4.1 globals_0.16.3
[61] gridExtra_2.3 evaluate_1.0.1 V8_6.0.0
[64] rstantools_2.4.0 rlang_1.1.4 Rcpp_1.0.13-1
[67] glue_1.8.0 BiocManager_1.30.25 jsonlite_1.8.9
[70] R6_2.5.1 blaster_1.0.7 plyr_1.8.9
cluster_irr
get_graph
or
get_joint_graph
cluster_irr
analyzed each TCRs
repertoireget_graph
and plot_graph
visualize
specificity structuresdetect_communities
identifies communities in
the graphdco
performs differential community occupancy
(DCO) analysis