Title: | Clustering of immune receptor repertoires |
---|---|
Description: | ClustIRR analyzes repertoires of B- and T-cell receptors. It starts by identifying communities of immune receptors with similar specificities, based on the sequences of their complementarity-determining regions (CDRs). Next, it employs a Bayesian probabilistic models to quantify differential community occupancy (DCO) between repertoires, allowing the identification of expanding or contracting communities in response to e.g. infection or cancer treatment. |
Authors: | Simo Kitanovski [aut, cre] , Kai Wollek [aut] |
Maintainer: | Simo Kitanovski <[email protected]> |
License: | GPL-3 + file LICENSE |
Version: | 1.5.24 |
Built: | 2024-12-22 03:38:30 UTC |
Source: | https://github.com/bioc/ClustIRR |
Predefined scoring matrix for amino acid or nucleoitide alignments.
data("BLOSUM62")
data("BLOSUM62")
BLOSUM62 is a square symmetric matrix. Rows and columns are identical single letters, representing nucleotide or amino acid. Elements are integer coefficients (substitution scores).
BLOSUM62 was obtained from NCBI (the same matrix used by the stand- alone BLAST software).
See https://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62
See https://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62
data(BLOSUM62, package = "ClustIRR") BLOSUM62
data(BLOSUM62, package = "ClustIRR") BLOSUM62
and
chains of 10,000 T cell
receptorsMock data set containing amino acid sequences of paired CDR3s from the
and
chains of 10,000 T cell receptors. All CDR3
sequences were drawn from a larger set of CDR3
sequences from
human naive CD8+ T cells.
data(CDR3ab)
data(CDR3ab)
data.frame
with 10,000 rows and 2 columns CDR3a
and CDR3b
.
data(CDR3ab) loads the object CDR3ab, which is a data.frame with two columns and 10,000 rows.
data("CDR3ab")
data("CDR3ab")
Objects of the class clust_irr
are generated by the function
cluster_irr
. These objects are used to store the clustering results
in a structured way, such that they may be used as inputs of other ClustIRR
functions (e.g. get_graph, plot_graph, etc.).
The output is an S4
object of class clust_irr
. This object
contains two sublists:
clust
, list, contains clustering results for each IR chain.
The results are stored as data.frame in separate sub-list named appropriately
(e.g. CDR3a, CDR3b, CDR3g, etc.). Each row in the data.frames contains a pair
of CDR3s.
The remaining columns contain similarity scores for the complete CDR3 sequences
(column weight
) or their cores (column cweight
). The columns
max_len
and max_clen
store the length of the longer CDR3 and
CDR3 core sequence in the pair, and these used to normalize the scores
weight
and cweight
: the normalized scores are shown in the
columns nweight
and ncweight
inputs
, list, contains all user provided inputs (see Arguments)
clust |
list, contains clustering results for each TCR/BCR chain. The results are stored in separate sub-list named appropriately (e.g. CDR3a, CDR3b, CDR3g, etc.) |
inputs |
list, contains all user provided inputs |
The output is an S4 object of class clust_irr
To access the slots of clust_irr
object we have two accessor
functions. In the description below, x
is a clust_irr
object.
get_clustirr_clust(x)
:
Extract the clustering results (slot clust)
get_clustirr_inputs(x)
:
Extract the processed inputs (slot inputs)
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run analysis c <- cluster_irr(s = s) # output class class(c) # output structure str(c) # inspect which CDR3bs are globally similar knitr::kable(head(slot(c, "clust")$CDR3b)) # clust_irr S4 object generated 'manually' from the individual results new_clust_irr <- new("clust_irr", clust = slot(object = c, name = "clust"), inputs = slot(object = c, name = "inputs")) # we should get identical outputs identical(x = new_clust_irr, y = c)
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run analysis c <- cluster_irr(s = s) # output class class(c) # output structure str(c) # inspect which CDR3bs are globally similar knitr::kable(head(slot(c, "clust")$CDR3b)) # clust_irr S4 object generated 'manually' from the individual results new_clust_irr <- new("clust_irr", clust = slot(object = c, name = "clust"), inputs = slot(object = c, name = "inputs")) # we should get identical outputs identical(x = new_clust_irr, y = c)
cluster_irr
computes similarities between immune receptors (IRs =
T-cell and B-cell receptors) based on their CDR3 sequences.
cluster_irr(s, control = list(gmi = 0.7, trim_flank_aa = 3, db_dist = 0, db_custom = NULL))
cluster_irr(s, control = list(gmi = 0.7, trim_flank_aa = 3, db_dist = 0, db_custom = NULL))
s |
a data.frame with complementarity determining region 3 (CDR3) amino acid sequences observed in IRR clones (data.frame rows). The data.frame has the following columns (IR clone features):
|
control |
auxiliary parameters to control the algorithm's behavior. See the details below:
|
IRRs, such as T-cell receptor repertoires, are made up of T-cells which
are distributed over T-cell clones. TCR clones with identical
pairs of CDR3 and CDR3
sequences most likely
recognize the same sets of antigens. Meanwhile, TCR clones with
similar pairs of CDR3
and CDR3
sequences
may also share common specificity. ClustIRR aims to quantify the
similarity between pairs of TCR clones based on the similarities of their
CDR3s sequences.
How to compute a similarity score between a pair of CDR3 sequences?
Pair of sequences, and
, are aligned with the Needleman-Wunsch
algorithm (BLOSUM62 substitution matrix is used for scoring). The output is an
alignment score (
). Identical or similar CDR3 sequence pairs get a
large positive
, and dissimilar CDR3 sequence pairs get a low (or
even negative)
.
To make sure that is comparable across pairs of CDR3s with
different lengths, ClustIRR divides (normalizes)
by the length of the longest CDR3 sequences in each pair:
where and
are the lengths of CDR3 sequences
and
; and
is the normalized alignment score.
The CDR3 cores, which represent the central parts of the CDR3 loop
and tend to have high probability of making a contact with the antigen,
are compared with the same procedure. ClustIRR constructs the CDR3 cores
by trimming few residues (defined by control$trim_flanks
) from either
end of each CDR3 sequences. These are then aligned and scored based on the
same algorithm, yielding for each pair of CDR3 cores a normalized alignment
scores .
This strategy is computationally very expensive!
For large IRRs with this algorithm requires significant
computational resources. To mitigate this challenge, we employ a
screening step in which dissimilar sequences pairs are flagged. In
short, each CDR3 is used as a query in a fast protein-BLAST search
as implemented in the R-package blaster, while the remaining CDR3s are
considered as a database of amino acid sequences against which the query
is compared. CDR3 sequences which share at least 70% sequence identity
(user parameter
control$gmi
) with the query are selected, and only
these are aligned with query CDR3. For the remaining CDR3 pairs we
assume .
The output is an S4
object of class clust_irr
. This object
contains two sublists:
clust
, list, contains clustering results for each IR chain.
The results are stored as data.frame in separate sub-list named appropriately
(e.g. CDR3a, CDR3b, CDR3g, etc.). Each row in the data.frames contains a pair
of CDR3s.
The remaining columns contain similarity scores for the complete CDR3 sequences
(column weight
) or their cores (column cweight
). The columns
max_len
and max_clen
store the length of the longer CDR3 sequence
and core in the pair, and these used to normalize the scores weight
and
cweight
: the normalized scores are shown in the columns nweight
and ncweight
inputs
, list, contains all user provided inputs (see Arguments)
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run analysis c <- cluster_irr(s = s) # output class class(c) # output structure str(c) # inspect which CDR3bs are similar knitr::kable(head(slot(c, "clust")$CDR3b))
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run analysis c <- cluster_irr(s = s) # output class class(c) # output structure str(c) # inspect which CDR3bs are similar knitr::kable(head(slot(c, "clust")$CDR3b))
This algorithm takes as input a community matrix, and quantifies the relative enrichment/depletion of individual communities in each sample using a Bayesian hierarchical model.
dco(community_occupancy_matrix, mcmc_control, compute_delta=TRUE, groups = NA)
dco(community_occupancy_matrix, mcmc_control, compute_delta=TRUE, groups = NA)
community_occupancy_matrix |
matrix, rows are communities, columns are repertoires, matrix entries are numbers of cells in each community and repertoire. |
mcmc_control |
list, configurations for the Markov Chain Monte Carlo (MCMC) simulation. |
mcmc_warmup = 750; number of MCMC warmups
mcmc_iter = 1500; number of MCMC iterations
mcmc_chains = 4; number of MCMC chains
mcmc_cores = 1; number of computer cores
mcmc_algorithm = "NUTS"; which MCMC algorithm to use
adapt_delta = 0.95; MCMC step size
max_treedepth = 12; the max value, in exponents of 2, of what the binary tree size in NUTS should have.
compute_delta |
should delta be computed by the Stan model? This may be take up extra memory. |
groups |
vector with integers |
The output is a list with the folling elements:
fit |
model fit (stan object) |
posterior_summary |
nested list with data.frames, summary of model parameters, including their means, medians, 95% credible intervals, etc. Predicted observations (y_hat), which are useful for posterior predictive checks are also provided. |
community_occupancy_matrix |
matrix, rows are communities, columns are repertoires, matrix entries are numbers of cells in each community and repertoire. |
mcmc_control |
mcmc configuration inputs provided as list. |
compute_delta |
the input compute_delta. |
groups |
the input groups. |
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3a = CDR3ab[1:500, "CDR3a"], CDR3b = CDR3ab[1:500, "CDR3b"], clone_size = 1, sample = "a") b <- data.frame(CDR3a = CDR3ab[401:900, "CDR3a"], CDR3b = CDR3ab[401:900, "CDR3b"], clone_size = 1, sample = "b") b$clone_size[1] <- 20 # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get joint graph jg <- get_joint_graph(clust_irrs = c) # detect communities gcd <- detect_communities(graph = jg$graph, algorithm = "leiden", resolution = 1, weight = "ncweight", metric = "average", chains = c("CDR3a", "CDR3b")) # look at outputs names(gcd) # look at the community matrix head(gcd$community_occupancy_matrix) # look at the community summary head(gcd$community_summary) # look at the node summary head(gcd$node_summary) # differential community occupancy analysis dco <- dco(community_occupancy_matrix = gcd$community_occupancy_matrix) names(dco)
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3a = CDR3ab[1:500, "CDR3a"], CDR3b = CDR3ab[1:500, "CDR3b"], clone_size = 1, sample = "a") b <- data.frame(CDR3a = CDR3ab[401:900, "CDR3a"], CDR3b = CDR3ab[401:900, "CDR3b"], clone_size = 1, sample = "b") b$clone_size[1] <- 20 # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get joint graph jg <- get_joint_graph(clust_irrs = c) # detect communities gcd <- detect_communities(graph = jg$graph, algorithm = "leiden", resolution = 1, weight = "ncweight", metric = "average", chains = c("CDR3a", "CDR3b")) # look at outputs names(gcd) # look at the community matrix head(gcd$community_occupancy_matrix) # look at the community summary head(gcd$community_summary) # look at the node summary head(gcd$node_summary) # differential community occupancy analysis dco <- dco(community_occupancy_matrix = gcd$community_occupancy_matrix) names(dco)
Performs graph-based community detection to find densely connected groups of
nodes in graph constructed by get_graph
or get_joint_graph
.
detect_communities(graph, algorithm = "leiden", resolution = 1, weight = "ncweight", metric = "average", chains)
detect_communities(graph, algorithm = "leiden", resolution = 1, weight = "ncweight", metric = "average", chains)
graph |
|
algorithm |
graph-based community detection (GCD) method: leiden (default) or louvain. |
resolution |
clustering resolution (default = 1) for the GCD. |
weight |
which edge weight metric (default = ncweight) should be used for GCD |
metric |
possible metrics: "average" (default), "strict" or "loose". |
chains |
which chains should be used for clustering? For instance: chains = "CDR3a"; or chains = CDR3b; or chains = c("CDR3a", "CDR3b"). |
ClustIRR employs graph-based community detection (GCD) algorithms, such as
Louvain or Leiden, to identify densely connected nodes. But first, we must
decide how to compute a similarity between two nodes, and
,
(e.g. TCR clones) based on the similarity scores between their CDR3 sequences
(computed in
clust_irr
) and use this metric as edge weight
.
Scenario 1
If our IRR data data contains CDR3 sequences from only one chain, such as
CDR3, then
is defined as
The user can decide among the two definitions by specifying
weight
= "ncweight"
(default)
weight
= "nweight"
Scenario 2
If our IRR data contains CDR3 sequences from both chains (paired data)
To compute the similarity score between TCR clones, and
,
we compute the average alignment score (
metric
=average)
from their CDR3 and CDR3
alignment scores (in the
next, I will use TCR
as an example, however, this approach
can also be used to compare TCR
or BCRIgH-IgL
clones):
,
where and
are the
alignment scores for the CDR3
and CDR3
sequences,
respectively; and
and
are the alignment scores for the CDR3
and CDR3
cores,
respectively. Based on this metric, CDR3
and CDR3
contribute towards the overall similarity of the TCR clones with equal weights.
ClustIRR provides two additional metrics for computing similarity scores
between TCR clones, including a metric
=strict, which assigns
high similarity score to a pair of TCR clones only if both of their
CDR3 and CDR3
sequence pairs are similar
,
and a metric
=loose, which assigns high similarity score to
a pair of TCR clones if either of their CDR3 and CDR3
sequence pairs are similar
,
The output is a list with the folling elements:
community_occupancy_matrix |
matrix, rows are communities, columns are repertoires, matrix entries are numbers of cells in each community and repertoire. |
community_summary |
data.frame, rows are communities and their properties are provided as columns. |
node_summary |
data.frame, rows are nodes (clones) and their properties are provided as columnscontains all user provided. |
graph |
igraph object, processed graph object |
input_config |
list, inputs provided as list. |
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3a = CDR3ab[1:300, "CDR3a"], CDR3b = CDR3ab[1:300, "CDR3b"], clone_size = 1, sample = "a") b <- data.frame(CDR3a = CDR3ab[201:400, "CDR3a"], CDR3b = CDR3ab[201:400, "CDR3b"], clone_size = 1, sample = "b") b$clone_size[1] <- 20 # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get joint graph jg <- get_joint_graph(clust_irrs = c) # detect communities gcd <- detect_communities(graph = jg$graph, algorithm = "leiden", resolution = 1, weight = "ncweight", metric = "average", chains = c("CDR3a", "CDR3b")) # look at outputs names(gcd) # look at the community occupancymatrix head(gcd$community_occupancy_matrix) # look at the community summary head(gcd$community_summary) # look at the node summary head(gcd$node_summary)
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3a = CDR3ab[1:300, "CDR3a"], CDR3b = CDR3ab[1:300, "CDR3b"], clone_size = 1, sample = "a") b <- data.frame(CDR3a = CDR3ab[201:400, "CDR3a"], CDR3b = CDR3ab[201:400, "CDR3b"], clone_size = 1, sample = "b") b$clone_size[1] <- 20 # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get joint graph jg <- get_joint_graph(clust_irrs = c) # detect communities gcd <- detect_communities(graph = jg$graph, algorithm = "leiden", resolution = 1, weight = "ncweight", metric = "average", chains = c("CDR3a", "CDR3b")) # look at outputs names(gcd) # look at the community occupancymatrix head(gcd$community_occupancy_matrix) # look at the community summary head(gcd$community_summary) # look at the node summary head(gcd$node_summary)
igraph
object from clust_irr
objectGiven a clust_irr
object generated by the function cluster_irr
,
the function get_graph
constructs an igraph
object.
The graph nodes represent IR clones. Undirected edges are drawn between
pairs of nodes, and the attributes of these edges are assigned based
on the clust_irr
outputs: ,
, etc.
get_graph(clust_irr)
get_graph(clust_irr)
clust_irr |
S4 object generated by the function |
The output is a list with the following elements. First, the list contains
an igraph
object. The graph nodes and edges contain attributes encoded
in the clust_irr
objects. Second, it contains a data.frame in which
rows are clones (nodes) in the graph. Third, the list contains the logical
variable joint_graph
, which is set to TRUE
if the graph is a
joint graph generated by the function get_joint_graph
and FALSE
if the graph is not a joint graph generated by get_graph
.
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run ClustIRR analysis out <- cluster_irr(s = s) # get graph g <- get_graph(clust_irr = out) names(g)
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run ClustIRR analysis out <- cluster_irr(s = s) # get graph g <- get_graph(clust_irr = out) names(g)
igraph
object from multiple clust_irr
objectsGiven a vector of clust_irr
objects, generated by the function
cluster_irr
, the function get_joint_graph
performs the
following steps:
1. runs the function get_graph
on each clust_irr
object
2. merges the nodes: if graph a
and b
have |a|
and |b|
nodes, then the joint graph has |a|+|b|
nodes,
regardless of whether exactly the same clone (vertex) is found in both
graphs.
3. draws edges between nodes from the different graphs using the same
algorithm for drawing edges between nodes within an IRR (see function
clust_irr
).
4. return a joint graph as igraph
object
5. return the input clustirr object list
6. return a logical joint_graph
=TRUE
get_joint_graph(clust_irrs, cores = 1)
get_joint_graph(clust_irrs, cores = 1)
clust_irrs |
A list of at least two S4 objects generated with the
function |
cores |
number of computer cores to use (default = 1) |
The main output is an igraph
object.
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "a", clone_size = 1) b <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "b", clone_size = 1) # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get graph g <- get_joint_graph(clust_irrs = c) names(g)
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "a", clone_size = 1) b <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "b", clone_size = 1) # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get graph g <- get_joint_graph(clust_irrs = c) names(g)
data.frame with CDR3a
and/or CDR3b
sequences and their
matching antigenic epitopes obtained from McPAS-TCR. The remaining CDR3
columns are set to NA. For data processing details see the script
inst/script/get_mcpastcr.R
data(mcpas)
data(mcpas)
data.frame
with columns:
CDR3a
: CDR3a amino acid sequence
CDR3b
: CDR3b amino acid sequence
CDR3g
: CDR3g amino acid sequence -> NA
CDR3d
: CDR3d amino acid sequence -> NA
CDR3h
: CDR3h amino acid sequence -> NA
CDR3l
: CDR3l amino acid sequence -> NA
CDR3_species
: CDR3 species (e.g. human, mouse, ...)
Antigen_species
: antigen species
Antigen_gene
: antigen gene
Reference
: Reference (Pubmed ID)
data(mcpas) loads the object McPAS-TCR
data(mcpas)
data(mcpas)
This function visualizes a graph. The main input is g
object
created by the function get_graph
.
plot_graph(g, select_by = "Ag_species", as_visnet = FALSE, show_singletons = TRUE, node_opacity = 1)
plot_graph(g, select_by = "Ag_species", as_visnet = FALSE, show_singletons = TRUE, node_opacity = 1)
g |
Object returned by the functions |
as_visnet |
logical, if |
select_by |
character string, two values are possible: "Ag_species" or
"Ag_gene". This only has an effect if |
show_singletons |
logical, if |
node_opacity |
probability, controls the opacity of node colors. Lower values corresponding to more transparent colors. |
The output is an igraph
or visNetwork
plot.
The size of the vertices increases linearly as the logarithm of the degree of the clonal expansion (number of cells per clone) in the corresponding clones.
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run ClustIRR analysis out <- cluster_irr(s = s) # get graph g <- get_graph(clust_irr = out) # plot graph with vertices as clones plot_graph(g, as_visnet=FALSE, show_singletons=TRUE, node_opacity = 0.8)
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run ClustIRR analysis out <- cluster_irr(s = s) # get graph g <- get_graph(clust_irr = out) # plot graph with vertices as clones plot_graph(g, as_visnet=FALSE, show_singletons=TRUE, node_opacity = 0.8)
data.frame with paired CDR3a
and CDR3b
CDR3 sequences and their
matching epitopes obtained from TCR3d. The remaining CDR3 columns are set to
NA. The antigenic epitopes come from cancer antigens and from viral antigens.
For data processing details see the script inst/script/get_tcr3d.R
data(tcr3d)
data(tcr3d)
data.frame
with columns:
CDR3a
: CDR3a amino acid sequence
CDR3b
: CDR3b amino acid sequence
CDR3g
: CDR3g amino acid sequence -> NA
CDR3d
: CDR3d amino acid sequence -> NA
CDR3h
: CDR3h amino acid sequence -> NA
CDR3l
: CDR3l amino acid sequence -> NA
CDR3_species
: CDR3 species (e.g. human, mouse, ...)
Antigen_species
: antigen species
Antigen_gene
: antigen gene
Reference
: Reference ID
data(tcr3d) loads the object tcr3d
data("tcr3d")
data("tcr3d")
data.frame with unpaired CDR3a
or CDR3b
sequences and their
matching epitopes obtained from VDJdb. The remaining CDR3 columns are set
to NA. For data processing details see the script inst/script/get_vdjdb.R
data(vdjdb)
data(vdjdb)
data.frame
with columns:
CDR3a
: CDR3a amino acid sequence
CDR3b
: CDR3b amino acid sequence
CDR3g
: CDR3g amino acid sequence -> NA
CDR3d
: CDR3d amino acid sequence -> NA
CDR3h
: CDR3h amino acid sequence -> NA
CDR3l
: CDR3l amino acid sequence -> NA
CDR3_species
: CDR3 species (e.g. human, mouse, ...)
Antigen_species
: antigen species
Antigen_gene
: antigen gene
Reference
: Reference (Pubmed ID)
data(vdjdb) loads the object vdjdb
data("vdjdb")
data("vdjdb")