library(knitr)
library(ClustIRR)
library(igraph)
library(ggplot2)
theme_set(new = theme_bw(base_size = 10))
library(ggrepel)
library(patchwork)

Introduction

Adaptive immunity employs diverse immune receptor repertoires (IRRs; B- and T-cell receptors) to combat evolving pathogens including viruses, bacteria, and cancers. Receptor diversity arises through V(D)J recombination - combinatorial assembly of germline genes generating unique sequences. Each naive lymphocyte consequently expresses a distinct receptor, enabling broad antigen recognition.

B cells engage antigens directly via BCR complementarity-determining regions (CDRs), while T cells recognize peptide-MHC complexes through TCR CDRs. Antigen recognition triggers clonal expansion, producing effector populations that mount targeted immune responses.

High-throughput sequencing (HT-seq) enables tracking of TCR/BCR community dynamics across biological conditions (e.g., pre-/post-treatment), offering insights into responses to immunotherapy and vaccination. However, two key challenges complicate this approach:

Extreme diversity and privacy: TCRs/BCRs are highly personalized, with incomplete sampling particularly problematic in clinical settings where sample volumes are limited. Even comprehensive sampling reveals minimal repertoire overlap between individuals.
Similar TCRs/BCRs recognize similar antigens

This vignette introduces ClustIRR, a computational method that addresses these challenges by: (1) Identifying specificity-associated receptor communities through sequence clustering, and (2) Applying Bayesian models to quantify differential community abundance across conditions.

Installation

ClustIRR is freely available as part of Bioconductor, filling the gap that currently exists in terms of software for quantitative analysis of IRRs.

To install ClustIRR please start R and enter:

if(!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("ClustIRR")

ClustIRR algorithm

Input

The main input of ClustIRR is a repertoire (s), which should be provided as data.frame. The rows in the data.frame correspond to clones (clone = group of cells derived from a common parent cell by clonal expansion). We use the following data from each clone:

CDR3 amino acid sequences from one/both chains (e.g. CDR3α and CDR3β from TCRαβs).
Clone size, which refers to the frequency of cells that belong to the clone.

In a typical scenario, the user will have more than one repertoire (see workflow). For instance, the user will analyze longitudinal repertoire data, i.e., two or three repertoires taken at different time points; or across different biological conditions.

Let’s look at dataset D1 that is provided within ClustIRR. D1 contains three TCRαβ repertoires: a, b, and c and their metadata: ma, mb and mc.

data("D1", package = "ClustIRR")
str(D1)

List of 6
 $ a :'data.frame': 500 obs. of  4 variables:
  ..$ CDR3a     : chr [1:500] "CASSLRGAHNEQFF" "CASGLRQGYGYTF" "CSAGGFRETQYF" "CGSSLSQGSEQYF" ...
  ..$ CDR3b     : chr [1:500] "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
  ..$ sample    : chr [1:500] "a" "a" "a" "a" ...
  ..$ clone_size: num [1:500] 3 2 2 5 3 5 5 5 3 4 ...
 $ b :'data.frame': 500 obs. of  4 variables:
  ..$ CDR3a     : chr [1:500] "CASSLRGAHNEQFF" "CASGLRQGYGYTF" "CSAGGFRETQYF" "CGSSLSQGSEQYF" ...
  ..$ CDR3b     : chr [1:500] "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
  ..$ sample    : chr [1:500] "b" "b" "b" "b" ...
  ..$ clone_size: num [1:500] 3 2 2 7 4 5 5 7 3 4 ...
 $ c :'data.frame': 500 obs. of  4 variables:
  ..$ CDR3a     : chr [1:500] "CASSLRGAHNEQFF" "CASGLRQGYGYTF" "CSAGGFRETQYF" "CGSSLSQGSEQYF" ...
  ..$ CDR3b     : chr [1:500] "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
  ..$ sample    : chr [1:500] "c" "c" "c" "c" ...
  ..$ clone_size: num [1:500] 3 2 2 9 5 5 5 9 3 4 ...
 $ ma:'data.frame': 500 obs. of  8 variables:
  ..$ clone_id: int [1:500] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ cell    : chr [1:500] "CD8" "CD4" "CD8" "CD8" ...
  ..$ HLA_A   : chr [1:500] "HLA-A∗24" "HLA-A∗24" "HLA-A∗24" "HLA-A∗24" ...
  ..$ age     : num [1:500] 24 24 24 24 24 24 24 24 24 24 ...
  ..$ TRAV    : chr [1:500] "TRAV27" "TRAV27" "TRAV20-1" "TRAV7-7" ...
  ..$ TRAJ    : chr [1:500] "TRAJ2-1" "TRAJ1-2" "TRAJ2-5" "TRAJ2-7" ...
  ..$ TRBV    : chr [1:500] "TRBV6-8" "TRBV7-3" "TRBV20-1" "TRBV20-1" ...
  ..$ TRBJ    : chr [1:500] "TRBJ1-4" "TRBJ1-2" "TRBJ2-1" "TRBJ1-4" ...
 $ mb:'data.frame': 500 obs. of  8 variables:
  ..$ clone_id: int [1:500] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ cell    : chr [1:500] "CD8" "CD4" "CD4" "CD4" ...
  ..$ HLA_A   : chr [1:500] "HLA-A∗02" "HLA-A∗02" "HLA-A∗02" "HLA-A∗02" ...
  ..$ age     : num [1:500] 30 30 30 30 30 30 30 30 30 30 ...
  ..$ TRAV    : chr [1:500] "TRAV27" "TRAV27" "TRAV20-1" "TRAV7-7" ...
  ..$ TRAJ    : chr [1:500] "TRAJ2-1" "TRAJ1-2" "TRAJ2-5" "TRAJ2-7" ...
  ..$ TRBV    : chr [1:500] "TRBV6-8" "TRBV7-3" "TRBV20-1" "TRBV20-1" ...
  ..$ TRBJ    : chr [1:500] "TRBJ1-4" "TRBJ1-2" "TRBJ2-1" "TRBJ1-4" ...
 $ mc:'data.frame': 500 obs. of  8 variables:
  ..$ clone_id: int [1:500] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ cell    : chr [1:500] "CD8" "CD8" "CD8" "CD8" ...
  ..$ HLA_A   : chr [1:500] "HLA-A∗11" "HLA-A∗11" "HLA-A∗11" "HLA-A∗11" ...
  ..$ age     : num [1:500] 40 40 40 40 40 40 40 40 40 40 ...
  ..$ TRAV    : chr [1:500] "TRAV27" "TRAV27" "TRAV20-1" "TRAV7-7" ...
  ..$ TRAJ    : chr [1:500] "TRAJ2-1" "TRAJ1-2" "TRAJ2-5" "TRAJ2-7" ...
  ..$ TRBV    : chr [1:500] "TRBV6-8" "TRBV7-3" "TRBV20-1" "TRBV20-1" ...
  ..$ TRBJ    : chr [1:500] "TRBJ1-4" "TRBJ1-2" "TRBJ2-1" "TRBJ1-4" ...

Extract the data.frames for each TCR repertoire and their metadata:

# Extract TCRab repertoires: a, b and c
a <- D1$a
b <- D1$b
c <- D1$c

# Extract metadata for TCRab repertoires 'a', 'b' and 'c'.
# This data is optional. It may contain and may contain diverse 
# supplementary (clone-specific and repertoire-specific) data for 
# each TCRab repertoire.
meta_a <- D1$ma
meta_b <- D1$mb
meta_c <- D1$mc

Algorithm

ClustIRR performs the following steps.

Compute similarities between T-cell clones within each TCR repertoire
Construct a graph from each TCR repertoire
Construct a joint similarity graph (J)
Detect communities in J
Analyze Differential Community Occupancy (DCO)

a. Between individual TCR repertoires with model M

b. Between groups of TCR repertoires from biological conditions with model M_h
Inspect results

Step 1. Compute TCR clone similarities in a repertoire with `cluster_irr`

ClustIRR aims to quantify the similarity between pairs of TCR clones based on the similarities of their CDR3s sequences. For this it employs Basic Local-Alignment Search Tool (BLAST) via the R-package blaster. Briefly, a protein database is constructed from all CDR3 sequences, and each CDR3 sequence is used as a query. This enables fast sequence similarity searches. Furthermore, only CDR3 sequences matches with ≥ 70% sequence identity to the query are retained. This step reduces the computational and memory requirements, without impacting downstream community analyses, as CDR3 sequences with lower typically yield low similarity scores.

For matched CDR3 pair, an alignment score (ω) is computed using BLOSUM62 substitution scores with gap opening penalty of -10 and gap extension penalty of -4. ω is the sum of substitution scores and gap penalties in the alignment. Identical or highly similar CDR3 sequence pairs receive large positive ω scores, while dissimilar pairs receive low or negative ω. To normalize ω for alignment length, ClustIRR computes ω̄ = ω/l, where l is the alignment length yielding normalized alignment score ω̄. This normalization, also used in iSMART (Zhang, 2020), ensures comparability across CDR3 pairs of varying lengths.

ClustIRR also computes alignment scores for the CDR3 core regions (ω^c and ω̄^c). The CDR3 core, representing the central loop region with high antigen-contact probability (Glanville, 2017), is generated by trimming three residues from each end of the CDR3 sequence. Comparing ω̄^c and ω̄ allows assessment of whether sequence similarity is concentrated in the core or flanking regions.

Run `cluster_irr`

Step 1. involves calling the function clust_irr which returns an S4 object of class clust_irr.

# perform clust_irr analysis of repertoire a, b and c
cl_a <- cluster_irr(s = a, meta = meta_a, control = list(gmi = 0.7))

cl_b <- cluster_irr(s = b, meta = meta_b, control = list(gmi = 0.7))

cl_c <- cluster_irr(s = c, meta = meta_c, control = list(gmi = 0.7))

Next, we show the chain-specific similarity scores between CDR3s sequences. Each row is a pair of CDR3 sequences from the repertoire. For each pair we have the following metrics:

max_len: length of the longer CDR3 sequence in the pair
max_clen: length of the longer CDR3 core sequence in the pair
weight: ω = BLOSUM62 score of the complete CDR3 alignment
cweight= ω_c: BLOSUM62 score of CDR3 cores
nweight = ω̄: normalized weight by max_len
ncweight = ω̄_c: normalized cweight by max_clen

The results for CDR3α sequence pairs from repertoire a:

kable(head(cl_a@clust$CDR3a), digits = 2)

from_cdr3	to_cdr3	weight	cweight	nweight	ncweight	max_len	max_clen
CASSLEADHEKLFF	CASSVGEATNEKLFF	99	42	6.60	4.67	15	9
CASSPRAGGYQETQYF	CASSPMTSGGSQETQYF	121	62	7.12	5.64	17	11
CASSFGTGLEEQYF	CASSFGTGDQPQHF	110	56	7.86	7.00	14	8
CASSFGTGLEEQYF	CASSFGWGGDTQYF	108	49	7.71	6.12	14	8
CASSPRREHDTEAFF	CASSPGRSTEAFF	91	34	6.07	3.78	15	9
CASSPTSYRAGASYEQYF	CASSPTSGGLTYEQYF	119	60	6.61	5.00	18	12

… the same table for CDR3β sequence pairs from repertoire a:

kable(head(cl_a@clust$CDR3b), digits = 2)

from_cdr3	to_cdr3	weight	cweight	nweight	ncweight	max_len	max_clen
CASSPGGDEAFF	CASSQGRGTDEAFF	81	24	5.79	3.00	14	8
CASSPGGDEAFF	CASSPRTTEAFF	92	35	7.67	5.83	12	6
CASKVPSGESSYNEQFF	CASSHPSSSYNEQFF	110	52	6.47	4.73	17	11
CASKVPSGESSYNEQFF	CASSVASGGTYNEQFF	116	58	6.82	5.27	17	11
CASGLNVNTEAFF	CASSLNRVNTEAFF	101	44	7.21	5.50	14	8
CASGLNVNTEAFF	CASGLVGNTEAFF	105	48	8.08	6.86	13	7

Notice that very similar CDR3α or CDR3β pairs have high normalized alignment scores (ω̄). We have similar tables for repertoire b and c.

Annotation of CDR3s

The function clust_irr performs automatic annotation of TCR clones based on multiple databases (DBs), including: VDJdb, TCR3d, McPAS-TCR. Each TCR clone recieves two types of annotations (per chain and per database):

Ag_species: the name of the antigen species recognized by the clone’s CDR3
Ag_gene: the name of the antigen gene recognized by clone’s CDR3

Annotation process: annotation is performed based on an edit distance threshold, controlled by the user through the parameter control$db_edit (default = 0). If the edit distance between an input CDR3 sequence and a database (DB) CDR3 sequence is less than or equal to control$db_edit, the input CDR3 inherits the annotation data of the matched DB CDR3.

This information will be used for various downstream analyses (see the next steps).

Step 2. Construct TCR repertoire graphs with `get_graph`

Next, ClustIRR builds a graph. If we analyze one TCRαβ repertoire, then we may employ the function get_graph, which converts the clust_irr object into an igraph object. Meanwhile, if we are analyzing two or more repertoires, then we can use the function get_joint_graph to generate a joint igraph object. In this case, edges between TCR clones from different TCR repertoires are computed using the same procedure outlined in step 1.

The graphs have nodes and weighted edges:

nodes: clones from each TCR repertoire. Each clone attribute (clone size, CDR3 sequences, etc.) is provided as node attribute
edges: undirected edges connecting pairs of nodes based on CDR3α and CDR3β similarity scores (ω̄ and ω̄^c) in each TCR repertoire (computed in step 1.)

g <- get_graph(clust_irr = cl_a)

Inspect graphs with `plot_graph`

We can use the function plot_graph for interactive visualization of relatively small graphs.

For instance, we can highligh TCR clones that recognize certain antigenic species or genes (see dropdown menu), or use the hoovering function to look at the CDR3 sequences of nodes.

Do this now!

plot_graph(g, select_by = "Ag_species", as_visnet = TRUE)

`get_graph` and `get_joint_graph` generate `igraph` objects

We can use igraph functions to inspect various properties of the graph. For instance, below, we extract the edge attributes and visualize the distributions of the edge attributes ncweight and nweight for all CDR3α and CDR3β sequence pairs.

# data.frame of edges and their attributes
e <- igraph::as_data_frame(x = g$graph, what = "edges")

ggplot(data = e)+
  geom_density(aes(ncweight, col = chain))+
  geom_density(aes(nweight, col = chain), linetype = "dashed")+
  xlab(label = "edge weight (solid = ncweight, dashed = nweight)")+
  theme(legend.position = "top")

Can you guess why we observe bimodal distributions? (Hint: CDR3s have different lengths)

Step 3. Constructing a joint similarity graph with `get_joint_graph`

In dataset D1 we have three TCR repertoires. Hence, we can use the function get_joint_graph to create a joint graph. This function computes similarities between CDR3 sequences of clones from different repertoires, creating new edges between nodes from different graphs with ω̄ and ω̄^c as edge attributes. The joint graph J is structured as an igraph object.

g <- get_joint_graph(clust_irrs = c(cl_a, cl_b, cl_c))

Once again, we will use the function plot_graph for interactive visualization. However, even in this small case study–where each TCR repertoire contains only 500 TCR clones–we observe that the joint graph is complex! This complexity makes qualitative analyses impractical. To address this, ClustIRR focuses on quantitative analyses (see the next step).

plot_graph(g = g, select_by = "Ag_gene", as_visnet = TRUE, node_opacity = 0.8)

Step 4. community detection with `detect_communities`

ClustIRR employs graph-based community detection (GCD) algorithms, such as Louvain, Leiden or InfoMap, to identify communities of nodes that have high density of edges among each other, and low density of edges with nodes outside the community.

First, the similarity score between T-cell clones i and j is defined as the average CDR3α and CDR3β alignment scores:

where ω̄_α and ω̄_β are the alignment scores for the CDR3α and CDR3β, respectively. If a chain is missing, its alignment score is set to 0.

The user has the following options:

algorithm: “leiden” (default) “louvain”, or “infomap”
resolution: GCD resolution = 1 (default)
iterations: number of clustering iterations (default = 100) to ensure robust results
weight: “ncweight” (default) or “nweight”
chains: “CDR3a” or “CDR3b” or c(“CDR3a”, “CDR3b”)

gcd <- detect_communities(graph = g$graph, 
                          algorithm = "leiden",
                          resolution = 1,
                          iterations = 100,
                          weight = "ncweight",
                          chains = c("CDR3a", "CDR3b"))

Inspecting the outputs of `detect_communities`

The function detect_communities generates a complex output. Lets investigate its elements:

names(gcd)

[1] "community_occupancy_matrix" "community_summary"         
[3] "node_summary"               "graph"                     
[5] "graph_structure_quality"    "input_config"

The main element is community_occupancy_matrix, which contains the number of T-cells in each community (row) and repertoire (column). Here we have three repertoires (three columns) and about 260 communities. This matrix is the main input of the function dco (step 5.), to detect differences in the community occupancy between repertoires.

dim(gcd$community_occupancy_matrix)

[1] 261   3

head(gcd$community_occupancy_matrix)

Visualizing community abundance matrices with `get_honeycombs`

honeycomb <- get_honeycombs(com = gcd$community_occupancy_matrix)

In the honeycomb plots shown below, several communities (black dots) appear far from the diagonal. This indicates that these communities contain more cells in repertoires b and c (y-axes in panels A and B) compared to repertoire a (x-axes in panels A and B). Meanwhile, the same points are generally close to the diagonal in panel C but remain slightly more abundant in repertoire c (y-axis) compared to b (x-axis).

In step 5, ClustIRR will provide a quantitative answer to the question: Which communities are differentially abundant between pairs of repertoires?

Importantly, the color of the hexagons encodes the density of communities in the 2D scatterplots: dark hexagons indicate high a frequency of communities, while light hexagons represent sparsely populated regions.

wrap_plots(honeycomb, nrow = 1)+
    plot_annotation(tag_levels = 'A')

Also see community_summary. In the data.frame wide we provide community summaries in each row across all samples, including:

clones_a, clone_b, clone_c, clones_n: the frequency of clones in the community coming from repertoire a, b, c and in total (n)
cells_a, cells_b, cells_c, cells_n: the frequency of cell in the community coming from repertoires a, b, c and in total (n)
w: average inter-clone similarity
ncweight_CDR3a/b, nweight_CDR3a/b: raw and normalized similarity for CDR3α and CDR3β sequences
n_CDR3a, n_CDR3b: number of edges between CDR3α and CDR3β sequences

kable(head(gcd$community_summary$wide), digits = 1)

community	clones_a	clones_b	clones_c	clones_n	cells_a	cells_b	cells_c	cells_n	w	ncweight_CDR3a	ncweight_CDR3b	nweight_CDR3a	nweight_CDR3b	n_CDR3a	n_CDR3b
1	3	3	3	9	10	10	10	30	5.2	3.1	3.1	3.9	3.1	27	9
2	1	1	1	3	2	2	2	6	9.3	NaN	9.3	NaN	9.6	3	3
3	1	1	1	3	2	2	2	6	9.4	NaN	9.4	NaN	9.6	3	3
4	1	1	1	3	5	7	9	21	9.1	NaN	9.1	NaN	9.5	3	3
5	6	6	6	18	28	31	34	93	4.0	0.0	4.0	0.0	4.7	18	108
6	1	1	1	3	5	5	5	15	9.6	NaN	9.6	NaN	9.8	3	3

In the data.frame tall we provide community and sample/repertoire summaries in each row.

kable(head(gcd$community_summary$tall), digits = 2)

	community	sample	clones	cells	w	ncweight_CDR3a	ncweight_CDR3b	nweight_CDR3a	nweight_CDR3b	n_CDR3a	n_CDR3b
1	1	a	3	10	5.17	3.12	3.09	3.86	3.14	27	9
2	1	b	3	10	5.17	3.12	3.09	3.86	3.14	27	9
3	1	c	3	10	5.17	3.12	3.09	3.86	3.14	27	9
334	112	a	5	30	3.33	2.24	1.43	3.37	1.47	96	15
335	112	b	5	31	3.33	2.24	1.43	3.37	1.47	96	15
336	112	c	5	32	3.33	2.24	1.43	3.37	1.47	96	15

Node-specific (TCR clone-specific) summaries are provided in node_summary

kable(head(gcd$node_summary), digits = 2)

	name	clone_id	info_tcr3d_CDR3a	info_tcr3d_CDR3b	info_mcpas_CDR3a	info_mcpas_CDR3b	info_vdjdb_CDR3a	info_vdjdb_CDR3b	clone_size	sample	CDR3b	CDR3a	cell	HLA_A	age	TRAV	TRAJ	TRBV	TRBJ	community
a\|1	a\|1	1	<db:tcr3d\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|Reference:>	<db:tcr3d\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|Reference:>	<db:mcpas\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	3	a	CASTVTSGSNEKLFF	CASSLRGAHNEQFF	CD8	HLA-A∗24	24	TRAV27	TRAJ2-1	TRBV6-8	TRBJ1-4	1
a\|2	a\|2	2	<db:tcr3d\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|Reference:>	<db:tcr3d\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|Reference:>	<db:mcpas\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	2	a	CASSLTGTGYTF	CASGLRQGYGYTF	CD4	HLA-A∗24	24	TRAV27	TRAJ1-2	TRBV7-3	TRBJ1-2	2
a\|3	a\|3	3	<db:tcr3d\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|Reference:>	<db:tcr3d\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|Reference:>	<db:mcpas\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	2	a	CSALTPGLIYNEQFF	CSAGGFRETQYF	CD8	HLA-A∗24	24	TRAV20-1	TRAJ2-5	TRBV20-1	TRBJ2-1	3
a\|4	a\|4	4	<db:tcr3d\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|Reference:>	<db:tcr3d\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|Reference:>	<db:mcpas\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	5	a	CSARASWGTNEKLFF	CGSSLSQGSEQYF	CD8	HLA-A∗24	24	TRAV7-7	TRAJ2-7	TRBV20-1	TRBJ1-4	4
a\|5	a\|5	5	<db:tcr3d\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|Reference:>	<db:tcr3d\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|Reference:>	<db:mcpas\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	3	a	CASSVREAENQPQHF	CSPRGPYGYTF	CD8	HLA-A∗24	24	TRAV20-1	TRAJ1-2	TRBV4-1	TRBJ1-5	5
a\|6	a\|6	6	<db:tcr3d\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|Reference:>	<db:tcr3d\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|Reference:>	<db:mcpas\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:mcpas\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3a\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	<db:VDJdb\|chain:CDR3b\|Antigen_species:\|Antigen_gene:\|CDR3_species:\|Reference:>	5	a	CASCTVGNQPQHF	CSLGPSKSQYF	CD8	HLA-A∗24	24	TRAV29-1	TRAJ2-3	TRBV19	TRBJ1-5	6

Decoding communities with `decode_community`

Community with ID=8 contains 42 TCR clones. These are connected based on their CDR3β sequences.

ns_com_8 <- gcd$node_summary[gcd$node_summary$community == 8, ]

table(ns_com_8$sample)


 a  b  c 
14 14 14

We can extract it and visualize it using igraph:

par(mai = c(0,0,0,0))
plot(subgraph(graph = gcd$graph, vids = which(V(gcd$graph)$community == 8)))

Furthermore, we can “decompose” this graph using decode_community based on the attributes of the edges (edge_filter) and nodes (node_filter).

# we can pick from these edge attributes
edge_attr_names(graph = gcd$graph)

[1] "nweight"  "ncweight" "chain"    "w"

The following edge-filter will instruct ClustIRR to keep edges with with edge attributes: nweight>=3 AND ncweight>=3

edge_filter <- rbind(data.frame(name = "nweight", value = 3, operation = ">="),
                     data.frame(name = "ncweight", value = 3, operation = ">="))

# we can pick from these node attributes
vertex_attr_names(graph = gcd$graph)

 [1] "name"             "clone_id"         "Ag_gene"          "Ag_species"      
 [5] "info_tcr3d_CDR3a" "db_tcr3d_CDR3a"   "info_tcr3d_CDR3b" "db_tcr3d_CDR3b"  
 [9] "info_mcpas_CDR3a" "db_mcpas_CDR3a"   "info_mcpas_CDR3b" "db_mcpas_CDR3b"  
[13] "info_vdjdb_CDR3a" "db_vdjdb_CDR3a"   "info_vdjdb_CDR3b" "db_vdjdb_CDR3b"  
[17] "clone_size"       "sample"           "CDR3b"            "CDR3a"           
[21] "cell"             "HLA_A"            "age"              "TRAV"            
[25] "TRAJ"             "TRBV"             "TRBJ"             "community"

The following node-filter will instruct ClustIRR to retain edges between nodes that have shared node attributed with respect to ALL of the following node attributes:

node_filter <- data.frame(name = c("TRBV", "TRBJ", "HLA_A"))

c8 <- decode_communities(community_id = 8, 
                         graph = gcd$graph,
                         edge_filter = edge_filter,
                         node_filter = node_filter)

par(mar = c(0, 0, 0, 0))
plot(c8, vertex.size = 10)

as_data_frame(x = c8, what = "vertices")[,c("name", "component_id", 
                                            "CDR3b", "TRBV", "TRBJ",
                                            "CDR3a", "TRAV", "TRAJ")]

       name component_id           CDR3b     TRBV    TRBJ              CDR3a
a|8     a|8            1  CASSGRGGDNEQFF  TRBV7-9 TRBJ2-1      CASGTLSYNEQFF
a|271 a|271            1  CASSQAGGRNEQFF  TRBV7-9 TRBJ2-1   CTSSQVGVGMNTEAFF
a|22   a|22            1 CASSQAGYSYNEQFF    TRBV9 TRBJ2-1     CASSPPLAEETQYF
a|23   a|23            1  CASSYSGHYNEQFF  TRBV6-5 TRBJ2-1     CSVVGQLTGYEQYF
a|124 a|124            1  CASAPGTSYNEQFF  TRBV7-6 TRBJ2-1   CASSLYGTERADEQYF
a|166 a|166            1  CASSLGTPYNEQFF   TRBV13 TRBJ2-1      CASGQTDPNGYTF
a|340 a|340            1  CASSGRSSYNEQFF TRBV12-4 TRBJ2-1   CASSLPAERPDYGYTF
a|379 a|379            1  CASSQGRSYNEQFF  TRBV4-1 TRBJ2-1      CSASGTEDNEQFF
a|408 a|408            1  CASSPLTGRNEQFF   TRBV27 TRBJ2-1     CASSLGASEYEQYF
a|429 a|429            1  CASSGAGQFNEQFF   TRBV19 TRBJ2-1       CSVRDLGETQYF
a|456 a|456            1  CASSELQSYNEQFF  TRBV6-1 TRBJ2-1 CSGAGWIGFPDQYNEQFF
a|464 a|464            1  CASSLAGGYNEQFF  TRBV6-8 TRBJ2-1    CASSIYMGKNTEAFF
a|482 a|482            1  CASSLLGSYNEQFF  TRBV6-2 TRBJ2-1  CASMGRQGRDGNEKLFF
a|500 a|500            1  CASSSSPGDNEQFF  TRBV7-3 TRBJ2-1     CASSYLAAQETQYF
b|8     b|8            1  CASSGRGGDNEQFF  TRBV7-9 TRBJ2-1      CASGTLSYNEQFF
b|271 b|271            1  CASSQAGGRNEQFF  TRBV7-9 TRBJ2-1   CTSSQVGVGMNTEAFF
b|22   b|22            1 CASSQAGYSYNEQFF    TRBV9 TRBJ2-1     CASSPPLAEETQYF
b|23   b|23            1  CASSYSGHYNEQFF  TRBV6-5 TRBJ2-1     CSVVGQLTGYEQYF
b|124 b|124            1  CASAPGTSYNEQFF  TRBV7-6 TRBJ2-1   CASSLYGTERADEQYF
b|166 b|166            1  CASSLGTPYNEQFF   TRBV13 TRBJ2-1      CASGQTDPNGYTF
b|340 b|340            1  CASSGRSSYNEQFF TRBV12-4 TRBJ2-1   CASSLPAERPDYGYTF
b|379 b|379            1  CASSQGRSYNEQFF  TRBV4-1 TRBJ2-1      CSASGTEDNEQFF
b|408 b|408            1  CASSPLTGRNEQFF   TRBV27 TRBJ2-1     CASSLGASEYEQYF
b|429 b|429            1  CASSGAGQFNEQFF   TRBV19 TRBJ2-1       CSVRDLGETQYF
b|456 b|456            1  CASSELQSYNEQFF  TRBV6-1 TRBJ2-1 CSGAGWIGFPDQYNEQFF
b|464 b|464            1  CASSLAGGYNEQFF  TRBV6-8 TRBJ2-1    CASSIYMGKNTEAFF
b|482 b|482            1  CASSLLGSYNEQFF  TRBV6-2 TRBJ2-1  CASMGRQGRDGNEKLFF
b|500 b|500            1  CASSSSPGDNEQFF  TRBV7-3 TRBJ2-1     CASSYLAAQETQYF
c|8     c|8            1  CASSGRGGDNEQFF  TRBV7-9 TRBJ2-1      CASGTLSYNEQFF
c|271 c|271            1  CASSQAGGRNEQFF  TRBV7-9 TRBJ2-1   CTSSQVGVGMNTEAFF
c|22   c|22            1 CASSQAGYSYNEQFF    TRBV9 TRBJ2-1     CASSPPLAEETQYF
c|23   c|23            1  CASSYSGHYNEQFF  TRBV6-5 TRBJ2-1     CSVVGQLTGYEQYF
c|124 c|124            1  CASAPGTSYNEQFF  TRBV7-6 TRBJ2-1   CASSLYGTERADEQYF
c|166 c|166            1  CASSLGTPYNEQFF   TRBV13 TRBJ2-1      CASGQTDPNGYTF
c|340 c|340            1  CASSGRSSYNEQFF TRBV12-4 TRBJ2-1   CASSLPAERPDYGYTF
c|379 c|379            1  CASSQGRSYNEQFF  TRBV4-1 TRBJ2-1      CSASGTEDNEQFF
c|408 c|408            1  CASSPLTGRNEQFF   TRBV27 TRBJ2-1     CASSLGASEYEQYF
c|429 c|429            1  CASSGAGQFNEQFF   TRBV19 TRBJ2-1       CSVRDLGETQYF
c|456 c|456            1  CASSELQSYNEQFF  TRBV6-1 TRBJ2-1 CSGAGWIGFPDQYNEQFF
c|464 c|464            1  CASSLAGGYNEQFF  TRBV6-8 TRBJ2-1    CASSIYMGKNTEAFF
c|482 c|482            1  CASSLLGSYNEQFF  TRBV6-2 TRBJ2-1  CASMGRQGRDGNEKLFF
c|500 c|500            1  CASSSSPGDNEQFF  TRBV7-3 TRBJ2-1     CASSYLAAQETQYF
          TRAV    TRAJ
a|8     TRAV19 TRAJ2-1
a|271  TRAV4-2 TRAJ1-1
a|22   TRAV7-9 TRAJ2-5
a|23  TRAV29-1 TRAJ2-7
a|124 TRAV12-4 TRAJ2-7
a|166 TRAV25-1 TRAJ1-2
a|340  TRAV7-2 TRAJ1-2
a|379 TRAV20-1 TRAJ2-1
a|408 TRAV12-4 TRAJ2-7
a|429 TRAV29-1 TRAJ2-5
a|456 TRAV29-1 TRAJ2-1
a|464   TRAV19 TRAJ1-1
a|482   TRAV19 TRAJ1-4
a|500  TRAV6-6 TRAJ2-5
b|8     TRAV19 TRAJ2-1
b|271  TRAV4-2 TRAJ1-1
b|22   TRAV7-9 TRAJ2-5
b|23  TRAV29-1 TRAJ2-7
b|124 TRAV12-4 TRAJ2-7
b|166 TRAV25-1 TRAJ1-2
b|340  TRAV7-2 TRAJ1-2
b|379 TRAV20-1 TRAJ2-1
b|408 TRAV12-4 TRAJ2-7
b|429 TRAV29-1 TRAJ2-5
b|456 TRAV29-1 TRAJ2-1
b|464   TRAV19 TRAJ1-1
b|482   TRAV19 TRAJ1-4
b|500  TRAV6-6 TRAJ2-5
c|8     TRAV19 TRAJ2-1
c|271  TRAV4-2 TRAJ1-1
c|22   TRAV7-9 TRAJ2-5
c|23  TRAV29-1 TRAJ2-7
c|124 TRAV12-4 TRAJ2-7
c|166 TRAV25-1 TRAJ1-2
c|340  TRAV7-2 TRAJ1-2
c|379 TRAV20-1 TRAJ2-1
c|408 TRAV12-4 TRAJ2-7
c|429 TRAV29-1 TRAJ2-5
c|456 TRAV29-1 TRAJ2-1
c|464   TRAV19 TRAJ1-1
c|482   TRAV19 TRAJ1-4
c|500  TRAV6-6 TRAJ2-5

Step 5. differential community occupancy (DCO) with `dco`

Do we see expanding or contracting communities in a given repertoire?

We employ a Bayesian model to quantify the relative abundance (occupancy) of individual communities in each repertoire.

The model output for repertoire a is the parameter vector β^a = β₁^a, β₂^a, …, β_k^a, which describes the effect of repertoire a on the relative occupancy in each community.

Given two repertoires, a and b, we can quantify the differential community occupancy (DCO):

d <- dco(community_occupancy_matrix = gcd$community_occupancy_matrix,
         mcmc_control = list(mcmc_warmup = 300,
                             mcmc_iter = 600,
                             mcmc_chains = 2,
                             mcmc_cores = 1,
                             mcmc_algorithm = "NUTS",
                             adapt_delta = 0.9,
                             max_treedepth = 10))


SAMPLING FOR MODEL 'dm' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000206 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 2.06 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:   1 / 600 [  0%]  (Warmup)
Chain 1: Iteration:  60 / 600 [ 10%]  (Warmup)
Chain 1: Iteration: 120 / 600 [ 20%]  (Warmup)
Chain 1: Iteration: 180 / 600 [ 30%]  (Warmup)
Chain 1: Iteration: 240 / 600 [ 40%]  (Warmup)
Chain 1: Iteration: 300 / 600 [ 50%]  (Warmup)
Chain 1: Iteration: 301 / 600 [ 50%]  (Sampling)
Chain 1: Iteration: 360 / 600 [ 60%]  (Sampling)
Chain 1: Iteration: 420 / 600 [ 70%]  (Sampling)
Chain 1: Iteration: 480 / 600 [ 80%]  (Sampling)
Chain 1: Iteration: 540 / 600 [ 90%]  (Sampling)
Chain 1: Iteration: 600 / 600 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 7.243 seconds (Warm-up)
Chain 1:                6.372 seconds (Sampling)
Chain 1:                13.615 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'dm' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 0.000166 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 1.66 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:   1 / 600 [  0%]  (Warmup)
Chain 2: Iteration:  60 / 600 [ 10%]  (Warmup)
Chain 2: Iteration: 120 / 600 [ 20%]  (Warmup)
Chain 2: Iteration: 180 / 600 [ 30%]  (Warmup)
Chain 2: Iteration: 240 / 600 [ 40%]  (Warmup)
Chain 2: Iteration: 300 / 600 [ 50%]  (Warmup)
Chain 2: Iteration: 301 / 600 [ 50%]  (Sampling)
Chain 2: Iteration: 360 / 600 [ 60%]  (Sampling)
Chain 2: Iteration: 420 / 600 [ 70%]  (Sampling)
Chain 2: Iteration: 480 / 600 [ 80%]  (Sampling)
Chain 2: Iteration: 540 / 600 [ 90%]  (Sampling)
Chain 2: Iteration: 600 / 600 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 7.306 seconds (Warm-up)
Chain 2:                6.298 seconds (Sampling)
Chain 2:                13.604 seconds (Total)
Chain 2:

Step 6. Inspect result

Visualizing the distribution of β with `get_beta_violins`

beta_violins <- get_beta_violins(node_summary = gcd$node_summary,
                                 beta = d$posterior_summary$beta,
                                 ag_species = NULL,
                                 db = "vdjdb",
                                 db_dist = 0,
                                 chain = "both")

beta_violins$violins

[[1]]

Compare βs of clones specific for CMV, EBV, flu or MLANA?

beta_violins <- get_beta_violins(node_summary = gcd$node_summary,
                                 beta = d$posterior_summary$beta,
                                 ag_species = c("CMV", "EBV", "Influenza"),
                                 ag_genes = c("MLANA"),
                                 db = "vdjdb",
                                 db_dist = 0,
                                 chain = "both")

patchwork::wrap_plots(beta_violins$violins, ncol = 2)

Compare β between repertoires with `get_beta_scatterplot`

beta_scatterplots <- get_beta_scatterplot(node_summary = gcd$node_summary,
                                          beta = d$posterior_summary$beta,
                                          ag_species = c("CMV"),
                                          db = "vdjdb",
                                          db_dist = 0,
                                          chain = "both")

patchwork::wrap_plots(beta_scatterplots$scatterplots[[1]], ncol = 3)

Posterior predictive checks

Before we can start interpreting the model results, we have to make sure that the model is valid. One standard approach is to check whether our model can retrodict the observed data (community occupancy matrix) which was used to fit model parameters.

General idea of posterior predictive checks:

fit model based on data y
simulate new data ŷ
compare y and ŷ

ClustIRR provides y and ŷ of each repertoire, which we can visualize with ggplot:

ggplot(data = d$posterior_summary$y_hat)+
  facet_wrap(facets = ~sample, nrow = 1, scales = "free")+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "gray")+
  geom_errorbar(aes(x = y_obs, y = mean, ymin = L95, ymax = H95),
                col = "darkgray", width=0)+
  geom_point(aes(x = y_obs, y = mean), size = 0.8)+
  xlab(label = "observed y")+
  ylab(label = "predicted y (and 95% HDI)")

Differential community abundance results → par. δ

Given two repertoires, a and b, we can quantify the differential community occupancy (DCO):

Importantly, ClustIRR always computes both contrasts (a vs. b and b vs. a): δ_i^a − b and δ_i^b − a.

ggplot(data = d$posterior_summary$delta)+
    facet_wrap(facets = ~contrast, ncol = 2)+
    geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95), 
                  col = "lightgray", width = 0)+
    geom_point(aes(x = community, y = mean), shape = 21, fill = "black", 
               stroke = 0.4, col = "white", size = 1.25)+
    theme(legend.position = "top")+
    ylab(label = expression(delta))+
    scale_x_continuous(expand = c(0,0))

Given two repertoires, a and b, ClustIRR also quantifies absolute differences in community probabilities:

Again, both contrasts are computed: ϵ_i^a − b and ϵ_i^b − a.

ggplot(data = d$posterior_summary$epsilon)+
    facet_wrap(facets = ~contrast, ncol = 2)+
    geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95), 
                  col = "lightgray", width = 0)+
    geom_point(aes(x = community, y = mean), shape = 21, fill = "black", 
               stroke = 0.4, col = "white", size = 1.25)+
    theme(legend.position = "top")+
    ylab(label = expression(epsilon))+
    scale_x_continuous(expand = c(0,0))

Conclusion: you can also use custom community occupancy matrix for DCO!

The function dco takes as its main input a community occupancy matrix. This enables users who are accustomed to using complementary algorithm for detecting specificity groups, such as, GLIPH, TCRdist3, GIANA, and iSMART, to skip steps 1-3 of the ClustIRR workflow, and to proceed with analysis for DCO.

community	clones_a	clones_b	clones_c	clones_n	cells_a	cells_b	cells_c	cells_n	w	ncweight_CDR3a	ncweight_CDR3b	nweight_CDR3a	nweight_CDR3b	n_CDR3a	n_CDR3b
1	3	3	3	9	10	10	10	30	5.2	3.1	3.1	3.9	3.1	27	9
2	1	1	1	3	2	2	2	6	9.3	NaN	9.3	NaN	9.6	3	3
3	1	1	1	3	2	2	2	6	9.4	NaN	9.4	NaN	9.6	3	3
4	1	1	1	3	5	7	9	21	9.1	NaN	9.1	NaN	9.5	3	3
5	6	6	6	18	28	31	34	93	4.0	0.0	4.0	0.0	4.7	18	108
6	1	1	1	3	5	5	5	15	9.6	NaN	9.6	NaN	9.8	3	3

community	clones_a	clones_b	clones_c	clones_n	cells_a	cells_b	cells_c	cells_n	w	ncweight_CDR3a	ncweight_CDR3b	nweight_CDR3a	nweight_CDR3b	n_CDR3a	n_CDR3b
1	3	3	3	9	10	10	10	30	5.2	3.1	3.1	3.9	3.1	27	9
2	1	1	1	3	2	2	2	6	9.3	NaN	9.3	NaN	9.6	3	3
3	1	1	1	3	2	2	2	6	9.4	NaN	9.4	NaN	9.6	3	3
4	1	1	1	3	5	7	9	21	9.1	NaN	9.1	NaN	9.5	3	3
5	6	6	6	18	28	31	34	93	4.0	0.0	4.0	0.0	4.7	18	108
6	1	1	1	3	5	5	5	15	9.6	NaN	9.6	NaN	9.8	3	3

Decoding T- and B-cell receptor repertoires with ClustIRR

Introduction

Installation

ClustIRR algorithm

Input

Algorithm

Step 1. Compute TCR clone similarities in a repertoire with cluster_irr

Run cluster_irr

Annotation of CDR3s

Step 2. Construct TCR repertoire graphs with get_graph

Inspect graphs with plot_graph

get_graph and get_joint_graph generate igraph objects

Step 3. Constructing a joint similarity graph with get_joint_graph

Step 4. community detection with detect_communities

Inspecting the outputs of detect_communities

Visualizing community abundance matrices with get_honeycombs

Decoding communities with decode_community

Step 5. differential community occupancy (DCO) with dco