Title: | Clustering of immune receptor repertoires |
---|---|
Description: | ClustIRR is a quantitative method for clustering of immune receptor repertoires (IRRs). The algorithm identifies groups of T or B cell receptors (TCRs or BCRs) with possibly similar specificity directly from the sequences of their complementarity determining regions. ClustIRR uses graphs to visualize the specificity structures of IRRs. |
Authors: | Simo Kitanovski [aut, cre] |
Maintainer: | Simo Kitanovski <[email protected]> |
License: | GPL-3 + file LICENSE |
Version: | 1.3.20 |
Built: | 2024-07-17 19:32:56 UTC |
Source: | https://github.com/bioc/ClustIRR |
and
chains of 10,000 T cell
receptorsMock data set containing amino acid sequences of paired CDR3s from the
and
chains of 10,000 T cell receptors. All CDR3
sequences were drawn from a larger set of CDR3
sequences from
human naive CD8+ T cells.
data(CDR3ab)
data(CDR3ab)
data.frame
with 10,000 rows and 2 columns CDR3a
and CDR3b
.
data(CDR3ab) loads the object CDR3ab, which is a data.frame with two columns and 10,000 rows.
data("CDR3ab")
data("CDR3ab")
Objects of the class clust_irr
are generated by the function
cluster_irr
. These objects are used to store the clustering results
in a structured way, such that they may be used as inputs of other ClustIRR
functions (e.g. get_graph, plot_graph, etc.).
Below we provide a detailed description of the slots of clust_irr
.
clust_irr
objects contain two sublists:
clust list, contains clustering results for each TCR/BCR chain. The results are stored in separate sub-list named appropriately (e.g. CDR3a, CDR3b, CDR3g, etc.). In the following we who the typical structure of these lists:
local
- list, local clustering results
m
- data.frame, motif enrichment results with columns:
motif
- motif sequence
f_s
- observed motif counts in s
f_r
- observed motif counts in r
n_s
- number of all observed motifs in s
n_r
- number of all observed motifs in r
k
- motif length
ove
- mean observed/expected relative motif frequency
ove_ci_l95
- 95% confidence intervals of ove (lower
boundary)
ove_ci_h95
- 95% confidence intervals of ove (upper
boundary)
p_value
- p-value from Fisher's exact test
fdr
- false discovery rate, i.e. adjusted p-value by
Benjamini & Hochberg correction
pass
- logical value indicating whether a motifs are
enriched (pass=TRUE
) given the user-defined thresholds in
control
lp
- data.frame, enriched motifs are linked to their
original CDR3 sequences and shown as rows in the data.frame with
the following columns:
cdr3
- CDR3 amino acid sequence
cdr3_core
- core portion of the CDR3 sequence,
obtained by trimming trim_flank_aa
amino acids (user-
defined parameter). If trim_flank_aa
= 0, then
cdr3
= cdr3_core
motif
- enriched motif from cdr3_core
global
- data.frame, global clustering results. Pairs of
globally similar CDR3s are shown in each row (analogous to lp
).
If global_smart
=FALSE in the control, then global clustering is
done based on Hamming distances and the remaining columns of this
data.frame are not important. Else, if global_smart
=FALSE, then
the remainig columns are relevant, i.e. global similarity scores are
shown for the complete CDR3 sequence pairs (column weight
) or
their core (trimmed) CDR3 sequence part (column cweight
). The
column max_len
stores the the maximum length in each pair of
CDR3 sequences, and is used to normalize the scores weight
and
cweight
: the normalized scores are shown in the columns
nweight
and ncweight
.
inputs:list, contains all user provided inputs
clust |
list, contains clustering results for each TCR/BCR chain. The results are stored in separate sub-list named appropriately (e.g. CDR3a, CDR3b, CDR3g, etc.) |
inputs |
list, contains all user provided inputs |
The output is an S4 object of class clust_irr
To access the slots of clust_irr
object we have two accessor
functions. In the description below, x
is a clust_irr
object.
get_clustirr_clust(x)
:
Extract the clustering results (slot clust)
get_clustirr_inputs(x)
:
Extract the processed inputs (slot inputs)
# inputs data("CDR3ab") s <- data.frame(CDR3b = CDR3ab[1:1000, "CDR3b"]) r <- data.frame(CDR3b = CDR3ab[1:5000, "CDR3b"]) # controls: auxiliary inputs control <- list(global_smart = TRUE, global_max_hdist = 1, global_min_identity = 0.7, local_max_fdr = 0.05, local_min_o = 1, trim_flank_aa = 3, global_pairs = NULL, low_mem = FALSE) # clust_irr S4 object generated by function cluster_irr clust_irr_output <- cluster_irr(s = s, r = r,ks = 4,cores = 1,control = control) # clust_irr S4 object generated 'manually' from the individual results new_clust_irr <- new("clust_irr", clust = slot(object = clust_irr_output, name = "clust"), inputs = slot(object = clust_irr_output, name = "inputs")) # we should get identical outputs identical(x = new_clust_irr, y = clust_irr_output)
# inputs data("CDR3ab") s <- data.frame(CDR3b = CDR3ab[1:1000, "CDR3b"]) r <- data.frame(CDR3b = CDR3ab[1:5000, "CDR3b"]) # controls: auxiliary inputs control <- list(global_smart = TRUE, global_max_hdist = 1, global_min_identity = 0.7, local_max_fdr = 0.05, local_min_o = 1, trim_flank_aa = 3, global_pairs = NULL, low_mem = FALSE) # clust_irr S4 object generated by function cluster_irr clust_irr_output <- cluster_irr(s = s, r = r,ks = 4,cores = 1,control = control) # clust_irr S4 object generated 'manually' from the individual results new_clust_irr <- new("clust_irr", clust = slot(object = clust_irr_output, name = "clust"), inputs = slot(object = clust_irr_output, name = "inputs")) # we should get identical outputs identical(x = new_clust_irr, y = clust_irr_output)
This algorithm finds groups of TCRs or BCRs with similar specificity. Two clustering strategies are employed:
Local clustering
Global clustering
Local clustering
The aim of local clustering is to find motifs (contiguous k-mers of the CDR
sequence) that are overrepresented in repertoire s
compared to
repertoire r
. This is an outline of the local clustering procedure:
Trim CDR3 flanks based on control$trim_flank_aa
For each motif found in s
compute the following:
motif frequencies in data set s
() and
r
()
total number of motifs in data set s
() and
r
()
ratio of observed vs. expected motif counts using the following
formula: OvE=)
probability of finding the observed or a larger OvE for
motif
given that the null hypothesis is true is computed with
the Fisher's exact test
if a motif passes the criteria defined in control list, set flag
, else
Global clustering The aim of global clustering is to find similar
CDR3 sequences in repertoire s
. This is an outline of the global
clustering approaches implemented in ClustIRR:
If global_smart
=FALSE
For each pair of equal-length CDR3 sequences and
we compute
the Hamming distance
. If
global_max_hdist
(user-defined input), then and
are globally similar.
If global_smart
=TRUE
We find pairs of CDR3 sequences that satisfy a minimum sequence identity
defined by global_min_identity
. Each pair of sequences are aligned.
Then, we compute the length of the longest of the two CDR3 sequences (column
max_len
). Then, we compute four alignment scores: a) weight
-
BLOSUM62 score of the aligned CDR3 sequences; b) nweight
- BLOSUM62
score of the aligned CDR3 sequences normalized by max_len
; c)
cweight
- BLOSUM62 score of the aligned CDR3 cores (untrimmed part of
the CDR3 sequences); d) ncweight
- BLOSUM62 score of the aligned CDR3
cores (untrimmed part of the CDR3 sequences) normalized by the length of the
longest core sequence.
Alternatively, the user can provide a matrix of globally similar CDR3 sequence pairs, computed by a complementary approachs such as TCRdist.
cluster_irr( s, r, ks = 4, cores = 1, control = list(global_smart = FALSE, global_max_hdist = 1, global_min_identity = 0.7, local_max_fdr = 0.05, local_min_o = 1, trim_flank_aa = 3, global_pairs = NULL, low_mem = FALSE))
cluster_irr( s, r, ks = 4, cores = 1, control = list(global_smart = FALSE, global_max_hdist = 1, global_min_identity = 0.7, local_max_fdr = 0.05, local_min_o = 1, trim_flank_aa = 3, global_pairs = NULL, low_mem = FALSE))
s |
data.frame, complementarity determining region 3 (CDR3) amino acid sequences observed in an immune receptor repertoire (IRR). The data.frame can have either one column or two columns:
|
r |
data.frame, reference (or control) repertoire of CDR3 sequences. Must
have the same structure (number of columns and column names) as |
ks |
integer or integer vector, motif lengths. |
cores |
integer, number of CPU cores, |
control |
list, a named list of auxiliary parameters to control algorithm's behavior. See the details below:
|
The output is an S4 object of class clust_irr
. This object contains
two sublists:
clust |
list, contains clustering results for each TCR/BCR chain. The results are stored in separate sub-list named appropriately (e.g. CDR3a, CDR3b, CDR3g, etc.). In the following we who the typical structure of these lists:
|
inputs |
list, contains all user provided inputs (see Arguments) |
# load package input data data("CDR3ab") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], clone_size = 1) r <- data.frame(CDR3b = CDR3ab[1:500, "CDR3b"], clone_size = 1) # artificially enrich motif 'RQWW' inside sample dataset substr(x = s$CDR3b[1:20], start = 6, stop = 9) <- "RQWW" # add an artificial clonal expansion of two sequences to the sample dataset s <- rbind(s, data.frame(CDR3b = c("CATSRAAKPDGLRALETQYF", "CATSRAAKPDRQWWLSTQYF"), clone_size = 10)) # run analysis out <- cluster_irr(s = s, r = r, ks = 4, cores = 1, control = list( global_smart = TRUE, global_max_hdist = 1, local_max_fdr = 0.05, local_min_o = 1, trim_flank_aa = 3, global_pairs = NULL, low_mem = FALSE)) # output class class(out) # output structure str(out) # inspect motif enrichment results knitr::kable(head(slot(out, "clust")$CDR3b$local$m)) # inspect which CDR3bs are globally similar knitr::kable(head(slot(out, "clust")$CDR3b$global)) # get graph g <- get_graph(out) # plot graph plot_graph(g) # plot graph as visgraph plot_graph(g, as_visnet = TRUE)
# load package input data data("CDR3ab") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], clone_size = 1) r <- data.frame(CDR3b = CDR3ab[1:500, "CDR3b"], clone_size = 1) # artificially enrich motif 'RQWW' inside sample dataset substr(x = s$CDR3b[1:20], start = 6, stop = 9) <- "RQWW" # add an artificial clonal expansion of two sequences to the sample dataset s <- rbind(s, data.frame(CDR3b = c("CATSRAAKPDGLRALETQYF", "CATSRAAKPDRQWWLSTQYF"), clone_size = 10)) # run analysis out <- cluster_irr(s = s, r = r, ks = 4, cores = 1, control = list( global_smart = TRUE, global_max_hdist = 1, local_max_fdr = 0.05, local_min_o = 1, trim_flank_aa = 3, global_pairs = NULL, low_mem = FALSE)) # output class class(out) # output structure str(out) # inspect motif enrichment results knitr::kable(head(slot(out, "clust")$CDR3b$local$m)) # inspect which CDR3bs are globally similar knitr::kable(head(slot(out, "clust")$CDR3b$global)) # get graph g <- get_graph(out) # plot graph plot_graph(g) # plot graph as visgraph plot_graph(g, as_visnet = TRUE)
clust_irr
objectAs input we take a clust_irr
object generated by the function
cluster_irr
.
From the clust_irr
object we generate a graph in which the vertices
represent clones, and undirected edges are drawn between a pair of vertices if
the corresponding clones are locally and/or globally similar (see clustering
in the documentation of cluster_irr
.
The main output is an igraph
object.
Furthermore, we compare CDR3s from the graph against CDR3 sequences with known
epitopes from these databases: VDJdb, McPAS-TCR and TCR3d. The comparison
is done based on the edit distance, where the parameter edit_dist
controls the edit distance threshold (default=0). If the edit distance
between two CDR3s is smaller than edit_dist
, then the database
information related to this CDR3 is transfered as vertex attribute to the
corresponding vertex.
With the parameter custom_db
, the user can provide additional
databases with CDR3 sequences and their cognate antigens.
get_graph(clust_irr, sample_id, custom_db = NULL, edit_dist = 0)
get_graph(clust_irr, sample_id, custom_db = NULL, edit_dist = 0)
clust_irr |
S4 object generated by the function |
sample_id |
character, name of the repertoire. It will be appended to
each node's name. If missing, a random |
custom_db |
data.frame, custom database mapping CDR3 sequences to their
cognate antigens. The structure of |
edit_dist |
integer, edit distance threshold. CDR3 sequences from the
graph will be matched with CDR3 sequences with known antigenic specificities
(from a database). If their edit distance is lower or equal to |
The main output is a list. The list contains an igraph
object. The graph
vertices and edges contain attributes. Furthermore, it contains a data.frame
in which rows are clones (vertices) in the graph. Finally, the list contains
the logical variable joint_graph
, which is set to TRUE
if the
graph is a joint graph generated by the function get_joint_graph
and
FALSE
if the graph is not a joint graph generated by get_graph
.
# load package input data data("CDR3ab") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], clone_size = 1) r <- data.frame(CDR3b = CDR3ab[1:5000, "CDR3b"], clone_size = 1) # artificially enrich motif 'RWGW' inside sample dataset substr(x = s$CDR3b[1:20], start = 6, stop = 9) <- "RWGW" # add an artificial clonal expansion of two sequences to the sample dataset s <- rbind(s, data.frame(CDR3b = c("CATSRADKPDGLDALETQYF", "CATSRAAKPDGLAALSTQYF"), clone_size = 5)) # run ClustIRR analysis out <- cluster_irr(s = s, r = r, ks = 4, cores = 1, control = list(trim_flank_aa = 3)) # get graph g <- get_graph(out) names(g)
# load package input data data("CDR3ab") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], clone_size = 1) r <- data.frame(CDR3b = CDR3ab[1:5000, "CDR3b"], clone_size = 1) # artificially enrich motif 'RWGW' inside sample dataset substr(x = s$CDR3b[1:20], start = 6, stop = 9) <- "RWGW" # add an artificial clonal expansion of two sequences to the sample dataset s <- rbind(s, data.frame(CDR3b = c("CATSRADKPDGLDALETQYF", "CATSRAAKPDGLAALSTQYF"), clone_size = 5)) # run ClustIRR analysis out <- cluster_irr(s = s, r = r, ks = 4, cores = 1, control = list(trim_flank_aa = 3)) # get graph g <- get_graph(out) names(g)
clust_irr
objectsAs input we take at least two clust_irr
objects generated by the
function cluster_irr
.
From each clust_irr
object we generate a graph (with the function
get_graph
) in which the vertices represent clones, and undirected
edges are drawn between a pair of vertices if the corresponding clones are
locally and/or globally similar (see definitions of local/global clustering
in the documentation of cluster_irr
.
Additionally, get_joint_graph
performs the following operation
between each pair of graphs:
First, it merges the vertices. If graphs a
and b
have |a|
and |b|
vertices, then the joint graph has |a|+|b|
vertices,
regardless of whether exactly the same clone (vertex) is found in both graphs.
Second, it performs global clustering between graph pairs. If two clones (graph vertices) have similar CDR3 sequences, then the vertices are connected by an edge. The same global clustering strategy (controls) are applied as the ones used to build the individual clust_irr objects. If different clust_irr objects are generated with different controls, then the graph merging algorithm will exit with an error.
The main output is an igraph
object.
Third, we compare CDR3s from the graph against CDR3 sequences with known
epitopes from these databases: VDJdb, McPAS-TCR and TCR3d. The comparison
is done based on the edit distance, where the parameter edit_dist
controls the edit distance threshold (default=0). If the edit distance
between two CDR3s is smaller than edit_dist
, then the database
information related to this CDR3 is transfered as vertex attribute to the
corresponding vertex.
With the parameter custom_db
, the user can provide additional
databases with CDR3 sequences and their cognate antigens.
get_joint_graph(clust_irrs, cores = 1, custom_db = NULL, edit_dist = 0)
get_joint_graph(clust_irrs, cores = 1, custom_db = NULL, edit_dist = 0)
clust_irrs |
A list of at least two S4 objects generated with the
function |
cores |
integer, number of computer cores to use (default = 1) |
custom_db |
data.frame, custom database with CDR3 sequences and their
matching antigens. The structure of custom_db must be identical to data(vdjdb).
This is an optional input to annotate the clones. If |
edit_dist |
integer, edit distance threshold. CDR3 sequences from the
graph will be matched with CDR3 sequences with known antigenic specificities
(from a database). If their edit distance is lower or equal to |
The main output of this function is an igraph
object. This object
represents a joint graph of the individual graphs contained as elements
in the input clust_irrs. The graph nodes and edges contain many attributes
which are described in the description section.
One additional output is a data.frame in which rows are clones (vertices) in the joint graph.
# load package input data data("CDR3ab") s_1 <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"]) s_2 <- data.frame(CDR3b = CDR3ab[101:200, "CDR3b"]) r <- data.frame(CDR3b = CDR3ab[1:500, "CDR3b"]) # run 1st analysis -> clust_irr object o_1 <- cluster_irr(s = s_1, r = r, ks = 4) # run 2nd analysis -> clust_irr object o_2 <- cluster_irr(s = s_2, r = r, ks = 4) # join clust_irr objects in a list clust_irrs <- c(o_1, o_2) names(clust_irrs) <- c("C1", "C2") # get graph g <- get_joint_graph(clust_irrs = clust_irrs) names(g)
# load package input data data("CDR3ab") s_1 <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"]) s_2 <- data.frame(CDR3b = CDR3ab[101:200, "CDR3b"]) r <- data.frame(CDR3b = CDR3ab[1:500, "CDR3b"]) # run 1st analysis -> clust_irr object o_1 <- cluster_irr(s = s_1, r = r, ks = 4) # run 2nd analysis -> clust_irr object o_2 <- cluster_irr(s = s_2, r = r, ks = 4) # join clust_irr objects in a list clust_irrs <- c(o_1, o_2) names(clust_irrs) <- c("C1", "C2") # get graph g <- get_joint_graph(clust_irrs = clust_irrs) names(g)
data.frame with CDR3a
and/or CDR3b
sequences and their
matching antigenic epitopes obtained from McPAS-TCR. The remaining CDR3
columns are set to NA. For data processing details see the script
inst/script/get_mcpastcr.R
data(mcpas)
data(mcpas)
data.frame
with columns:
CDR3a
: CDR3a amino acid sequence
CDR3b
: CDR3b amino acid sequence
CDR3g
: CDR3g amino acid sequence -> NA
CDR3d
: CDR3d amino acid sequence -> NA
CDR3h
: CDR3h amino acid sequence -> NA
CDR3l
: CDR3l amino acid sequence -> NA
CDR3_species
: CDR3 species (e.g. human, mouse, ...)
Antigen_species
: antigen species
Antigen_gene
: antigen gene
Reference
: Reference (Pubmed ID)
data(mcpas) loads the object McPAS-TCR
data(mcpas)
data(mcpas)
This function visualizes a graph. The main input is g
object
created by the function get_graph
.
plot_graph(g, select_by = "Ag_species", as_visnet = FALSE, show_singletons = TRUE, node_opacity = 1)
plot_graph(g, select_by = "Ag_species", as_visnet = FALSE, show_singletons = TRUE, node_opacity = 1)
g |
Object returned by the functions |
as_visnet |
logical, if |
select_by |
character string, two values are possible: "Ag_species" or
"Ag_gene". This only has an effect if |
show_singletons |
logical, if |
node_opacity |
probability, controls the opacity of node colors. Lower values corresponding to more transparent colors. |
The output is an igraph
or visNetwork
plot.
The size of the vertices increases linearly as the logarithm of the degree of the clonal expansion (number of cells per clone) in the corresponding clones.
# load package input data data("CDR3ab") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], clone_size = 1) r <- data.frame(CDR3b = CDR3ab[1:500, "CDR3b"], clone_size = 1) # artificially enrich motif 'RWGW' inside sample dataset substr(x = s$CDR3b[1:20], start = 6, stop = 9) <- "RWGW" # add an artificial clonal expansion of two sequences to the sample dataset s <- rbind(s, data.frame(CDR3b = c("CATSRADKPDGLDALETQYF", "CATSRAAKPDGLAALSTQYF"), clone_size = 5)) # run analysis ci <- cluster_irr(s = s, r = r, ks = 4, cores = 1, control = list(trim_flank_aa = 3)) g <- get_graph(clust_irr = ci) # plot graph with vertices as clones plot_graph(g, as_visnet=FALSE, show_singletons=TRUE, node_opacity = 0.8)
# load package input data data("CDR3ab") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], clone_size = 1) r <- data.frame(CDR3b = CDR3ab[1:500, "CDR3b"], clone_size = 1) # artificially enrich motif 'RWGW' inside sample dataset substr(x = s$CDR3b[1:20], start = 6, stop = 9) <- "RWGW" # add an artificial clonal expansion of two sequences to the sample dataset s <- rbind(s, data.frame(CDR3b = c("CATSRADKPDGLDALETQYF", "CATSRAAKPDGLAALSTQYF"), clone_size = 5)) # run analysis ci <- cluster_irr(s = s, r = r, ks = 4, cores = 1, control = list(trim_flank_aa = 3)) g <- get_graph(clust_irr = ci) # plot graph with vertices as clones plot_graph(g, as_visnet=FALSE, show_singletons=TRUE, node_opacity = 0.8)
data.frame with paired CDR3a
and CDR3b
CDR3 sequences and their
matching epitopes obtained from TCR3d. The remaining CDR3 columns are set to
NA. The antigenic epitopes come from cancer antigens and from viral antigens.
For data processing details see the script inst/script/get_tcr3d.R
data(tcr3d)
data(tcr3d)
data.frame
with columns:
CDR3a
: CDR3a amino acid sequence
CDR3b
: CDR3b amino acid sequence
CDR3g
: CDR3g amino acid sequence -> NA
CDR3d
: CDR3d amino acid sequence -> NA
CDR3h
: CDR3h amino acid sequence -> NA
CDR3l
: CDR3l amino acid sequence -> NA
CDR3_species
: CDR3 species (e.g. human, mouse, ...)
Antigen_species
: antigen species
Antigen_gene
: antigen gene
Reference
: Reference ID
data(tcr3d) loads the object tcr3d
data("tcr3d")
data("tcr3d")
data.frame with unpaired CDR3a
or CDR3b
sequences and their
matching epitopes obtained from VDJdb. The remaining CDR3 columns are set
to NA. For data processing details see the script inst/script/get_vdjdb.R
data(vdjdb)
data(vdjdb)
data.frame
with columns:
CDR3a
: CDR3a amino acid sequence
CDR3b
: CDR3b amino acid sequence
CDR3g
: CDR3g amino acid sequence -> NA
CDR3d
: CDR3d amino acid sequence -> NA
CDR3h
: CDR3h amino acid sequence -> NA
CDR3l
: CDR3l amino acid sequence -> NA
CDR3_species
: CDR3 species (e.g. human, mouse, ...)
Antigen_species
: antigen species
Antigen_gene
: antigen gene
Reference
: Reference (Pubmed ID)
data(vdjdb) loads the object vdjdb
data("vdjdb")
data("vdjdb")