Title: | Clustering of immune receptor repertoires |
---|---|
Description: | ClustIRR analyzes repertoires of B- and T-cell receptors. It starts by identifying communities of immune receptors with similar specificities, based on the sequences of their complementarity-determining regions (CDRs). Next, it employs a Bayesian probabilistic models to quantify differential community occupancy (DCO) between repertoires, allowing the identification of expanding or contracting communities in response to e.g. infection or cancer treatment. |
Authors: | Simo Kitanovski [aut, cre] , Kai Wollek [aut] |
Maintainer: | Simo Kitanovski <[email protected]> |
License: | GPL-3 + file LICENSE |
Version: | 1.5.39 |
Built: | 2025-01-24 03:33:45 UTC |
Source: | https://github.com/bioc/ClustIRR |
Predefined scoring matrix for amino acid or nucleoitide alignments.
data("BLOSUM62")
data("BLOSUM62")
BLOSUM62 is a square symmetric matrix. Rows and columns are identical single letters, representing nucleotide or amino acid. Elements are integer coefficients (substitution scores).
BLOSUM62 was obtained from NCBI (the same matrix used by the stand- alone BLAST software).
See https://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62
See https://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62
data(BLOSUM62, package = "ClustIRR") BLOSUM62
data(BLOSUM62, package = "ClustIRR") BLOSUM62
and
chains of 10,000 T cell receptors.T-cell receptor (TCR) repertoire with 10,000 T-cells (rows). Each T-cell
has the following features: amino acid sequences of their complementarity
determining region 3 (CDR3); and variable (V) and joining (J) gene names
for TCR chains and
.
Important remark: this is a mock dataset, all CDR3 sequences and the genes were
sampled from a larger set of CDR3 sequences and genes of naive
CD8+ T cells in humans.
data(CDR3ab)
data(CDR3ab)
data.frame
with 10,000 rows and 6 columns
CDR3a
: CDR amino acid sequence
TRAV
: variable (V) gene of TCR
TRAV
: joining (J) gene of TCR
CDR3b
: CDR amino acid sequence
TRBV
: variable (V) gene of TCR
TRBV
: joining (J) gene of TCR
data(CDR3ab) loads the object CDR3ab, which is a data.frame with six columns
(3 for TCR and 3 for TCR
) and 10,000 rows (see details).
data("CDR3ab")
data("CDR3ab")
Objects of the class clust_irr
are generated by the function
cluster_irr
. These objects are used to store the clustering results
in a structured way, such that they may be used as inputs of other ClustIRR
functions (e.g. get_graph, plot_graph, etc.).
The output is an S4
object of class clust_irr
. This object
contains two sublists:
clust
, list, contains clustering results for each IR chain.
The results are stored as data.frame in separate sub-list named appropriately
(e.g. CDR3a, CDR3b, CDR3g, etc.). Each row in the data.frames contains a pair
of CDR3s.
The remaining columns contain similarity scores for the complete CDR3 sequences
(column weight
) or their cores (column cweight
). The columns
max_len
and max_clen
store the length of the longer CDR3 and
CDR3 core sequence in the pair, and these used to normalize the scores
weight
and cweight
: the normalized scores are shown in the
columns nweight
and ncweight
inputs
, list, contains all user provided inputs (see Arguments)
clust |
list, contains clustering results for each TCR/BCR chain. The results are stored in separate sub-list named appropriately (e.g. CDR3a, CDR3b, CDR3g, etc.) |
inputs |
list, contains all user provided inputs |
The output is an S4 object of class clust_irr
To access the slots of clust_irr
object we have two accessor
functions. In the description below, x
is a clust_irr
object.
get_clustirr_clust(x)
:
Extract the clustering results (slot clust)
get_clustirr_inputs(x)
:
Extract the processed inputs (slot inputs)
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run analysis c <- cluster_irr(s = s) # output class class(c) # output structure str(c) # inspect which CDR3bs are globally similar knitr::kable(head(slot(c, "clust")$CDR3b)) # clust_irr S4 object generated 'manually' from the individual results new_clust_irr <- new("clust_irr", clust = slot(object = c, name = "clust"), inputs = slot(object = c, name = "inputs")) # we should get identical outputs identical(x = new_clust_irr, y = c)
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run analysis c <- cluster_irr(s = s) # output class class(c) # output structure str(c) # inspect which CDR3bs are globally similar knitr::kable(head(slot(c, "clust")$CDR3b)) # clust_irr S4 object generated 'manually' from the individual results new_clust_irr <- new("clust_irr", clust = slot(object = c, name = "clust"), inputs = slot(object = c, name = "inputs")) # we should get identical outputs identical(x = new_clust_irr, y = c)
cluster_irr
computes similarities between immune receptors (IRs =
T-cell and B-cell receptors) based on their CDR3 sequences.
cluster_irr(s, meta = NULL, control = list(gmi = 0.7, trim_flank_aa = 3, db_dist = 0, db_custom = NULL))
cluster_irr(s, meta = NULL, control = list(gmi = 0.7, trim_flank_aa = 3, db_dist = 0, db_custom = NULL))
s |
a data.frame with complementarity determining region 3 (CDR3) amino acid sequences observed in IRR clones (data.frame rows). The data.frame has the following columns (IR clone features):
|
meta |
data.frame with meta-data for each IR clone, which may contain data such as, V/J genes, biological condition, age, etc. This data will be used to annotate the graph nodes and help downstream analyses. |
control |
auxiliary parameters to control the algorithm's behavior. See the details below:
|
IRRs, such as T-cell receptor repertoires, are made up of T-cells which
are distributed over T-cell clones. TCR clones with identical
pairs of CDR3 and CDR3
sequences most likely
recognize the same sets of antigens. Meanwhile, TCR clones with
similar pairs of CDR3
and CDR3
sequences
may also share common specificity. ClustIRR aims to quantify the
similarity between pairs of TCR clones based on the similarities of their
CDR3s sequences.
How to compute a similarity score between a pair of CDR3 sequences?
1. Align pairs of sequences
2. Score alignment with BLOSUM62 substitution matrix and gap open/exten costs
3. Normalize alignment scores by alignment length
4. Compute the normalized alignment score of the CDR3 cores.
CDR3 cores are the central parts of the CDR3 loop, which tend to have
high probability of making a contact with the antigen. ClustIRR constructs the
CDR3 cores by trimming few residues (defined by control$trim_flanks
)
from both ends of each CDR3 sequence.
For large IRRs with many clones, step 1 requires significant computational
resources. To mitigate this challenge, we employ a screening step in which
similar sequence pairs selected. In short, each CDR3 is used as a query in
a fast protein-BLAST search as implemented in the R-package blaster,
while the remaining CDR3s are considered as a database of amino acid sequences
against which the query is compared. CDR3 sequences which share at least 70%
sequence identity (user parameter control$gmi
) with the query are
selected, and only these are scored according to steps 2-4.
The output is an S4
object of class clust_irr
. This object
contains two sublists:
clust
, list, contains clustering results for each IR chain.
The results are stored as data.frame in separate sub-list named appropriately
(e.g. CDR3a, CDR3b, CDR3g, etc.). Each row in the data.frames contains a pair
of CDR3s.
The remaining columns contain similarity scores for the complete CDR3 sequences
(column weight
) or their cores (column cweight
). The columns
max_len
and max_clen
store the length of the longer CDR3 sequence
and core in the pair, and these used to normalize the scores weight
and
cweight
: the normalized scores are shown in the columns nweight
and ncweight
inputs
, list, contains all user provided inputs (see Arguments)
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run analysis c <- cluster_irr(s = s) # output class class(c) # output structure str(c) # inspect which CDR3bs are similar knitr::kable(head(slot(c, "clust")$CDR3b))
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run analysis c <- cluster_irr(s = s) # output class class(c) # output structure str(c) # inspect which CDR3bs are similar knitr::kable(head(slot(c, "clust")$CDR3b))
This algorithm takes as input a community matrix, and quantifies the relative enrichment/depletion of individual communities in each sample using a Bayesian hierarchical model.
dco(community_occupancy_matrix, mcmc_control, compute_delta=TRUE, groups = NA)
dco(community_occupancy_matrix, mcmc_control, compute_delta=TRUE, groups = NA)
community_occupancy_matrix |
matrix, rows are communities, columns are repertoires, matrix entries are numbers of cells in each community and repertoire. |
mcmc_control |
list, configurations for the Markov Chain Monte Carlo (MCMC) simulation. |
mcmc_warmup = 750; number of MCMC warmups
mcmc_iter = 1500; number of MCMC iterations
mcmc_chains = 4; number of MCMC chains
mcmc_cores = 1; number of computer cores
mcmc_algorithm = "NUTS"; which MCMC algorithm to use
adapt_delta = 0.95; MCMC step size
max_treedepth = 12; the max value, in exponents of 2, of what the binary tree size in NUTS should have.
compute_delta |
should delta be computed by the Stan model? This may be take up extra memory. |
groups |
vector with integers |
The output is a list with the folling elements:
fit |
model fit (stan object) |
posterior_summary |
nested list with data.frames, summary of model parameters, including their means, medians, 95% credible intervals, etc. Predicted observations (y_hat), which are useful for posterior predictive checks are also provided. |
community_occupancy_matrix |
matrix, rows are communities, columns are repertoires, matrix entries are numbers of cells in each community and repertoire. |
mcmc_control |
mcmc configuration inputs provided as list. |
compute_delta |
the input compute_delta. |
groups |
the input groups. |
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3a = CDR3ab[1:500, "CDR3a"], CDR3b = CDR3ab[1:500, "CDR3b"], clone_size = 1, sample = "a") b <- data.frame(CDR3a = CDR3ab[401:900, "CDR3a"], CDR3b = CDR3ab[401:900, "CDR3b"], clone_size = 1, sample = "b") b$clone_size[1] <- 20 # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get joint graph jg <- get_joint_graph(clust_irrs = c) # detect communities gcd <- detect_communities(graph = jg$graph, algorithm = "leiden", resolution = 1, weight = "ncweight", metric = "average", chains = c("CDR3a", "CDR3b")) # look at outputs names(gcd) # look at the community matrix head(gcd$community_occupancy_matrix) # look at the community summary head(gcd$community_summary$wide) # look at the node summary head(gcd$node_summary) # differential community occupancy analysis dco <- dco(community_occupancy_matrix = gcd$community_occupancy_matrix) names(dco)
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3a = CDR3ab[1:500, "CDR3a"], CDR3b = CDR3ab[1:500, "CDR3b"], clone_size = 1, sample = "a") b <- data.frame(CDR3a = CDR3ab[401:900, "CDR3a"], CDR3b = CDR3ab[401:900, "CDR3b"], clone_size = 1, sample = "b") b$clone_size[1] <- 20 # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get joint graph jg <- get_joint_graph(clust_irrs = c) # detect communities gcd <- detect_communities(graph = jg$graph, algorithm = "leiden", resolution = 1, weight = "ncweight", metric = "average", chains = c("CDR3a", "CDR3b")) # look at outputs names(gcd) # look at the community matrix head(gcd$community_occupancy_matrix) # look at the community summary head(gcd$community_summary$wide) # look at the node summary head(gcd$node_summary) # differential community occupancy analysis dco <- dco(community_occupancy_matrix = gcd$community_occupancy_matrix) names(dco)
Performs graph-based community detection to find densely connected groups of
nodes in graph constructed by get_graph
or get_joint_graph
.
detect_communities(graph, weight, algorithm = "leiden", resolution = 1, iterations = 100, metric = "average", chains)
detect_communities(graph, weight, algorithm = "leiden", resolution = 1, iterations = 100, metric = "average", chains)
graph |
|
algorithm |
graph-based community detection (GCD) method: leiden (default) or louvain. |
resolution |
clustering resolution (default = 1) for the GCD. |
iterations |
clustering iterations (default = 100) for the GCD. |
weight |
which edge weight metric (default = ncweight) should be used for GCD |
metric |
possible metrics: "average" (default), "strict" or "loose". |
chains |
which chains should be used for clustering? For instance: chains = "CDR3a"; or chains = CDR3b; or chains = c("CDR3a", "CDR3b"). |
ClustIRR employs graph-based community detection (GCD) algorithms, such as
Louvain or Leiden, to identify densely connected nodes. But first, we must
decide how to compute a similarity between two nodes, and
,
(e.g. TCR clones) based on the similarity scores between their CDR3 sequences
(computed in
clust_irr
) and use this metric as edge weight
.
Scenario 1
If our IRR data data contains CDR3 sequences from only one chain, such as
CDR3, then
is defined as
The user can decide among the two definitions by specifying
weight
= "ncweight"
(default)
weight
= "nweight"
Scenario 2
If our IRR data contains CDR3 sequences from both chains (paired data)
To compute the similarity score between TCR clones, and
,
we compute the average alignment score (
metric
=average)
from their CDR3 and CDR3
alignment scores (in the
next, I will use TCR
as an example, however, this approach
can also be used to compare TCR
or BCRIgH-IgL
clones):
,
where and
are the
alignment scores for the CDR3
and CDR3
sequences,
respectively; and
and
are the alignment scores for the CDR3
and CDR3
cores,
respectively. Based on this metric, CDR3
and CDR3
contribute towards the overall similarity of the TCR clones with equal weights.
ClustIRR provides two additional metrics for computing similarity scores
between TCR clones, including a metric
=strict, which assigns
high similarity score to a pair of TCR clones only if both of their
CDR3 and CDR3
sequence pairs are similar
,
and a metric
=loose, which assigns high similarity score to
a pair of TCR clones if either of their CDR3 and CDR3
sequence pairs are similar
,
The output is a list with the folling elements:
community_occupancy_matrix |
matrix, rows are communities, columns are repertoires, matrix entries are numbers of cells in each community and repertoire. |
community_summary |
data.frame, rows are communities and their properties are provided as columns. |
node_summary |
data.frame, rows are nodes (clones) and their properties are provided as columnscontains all user provided. |
graph |
igraph object, processed graph object. |
graph_structure_quality |
graph modularity and quality (only for Leiden) measure of the strength of division of the graph into communities. |
input_config |
list, inputs provided as list. |
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3a = CDR3ab[1:300, "CDR3a"], CDR3b = CDR3ab[1:300, "CDR3b"], clone_size = 1, sample = "a") b <- data.frame(CDR3a = CDR3ab[201:400, "CDR3a"], CDR3b = CDR3ab[201:400, "CDR3b"], clone_size = 1, sample = "b") b$clone_size[1] <- 20 # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get joint graph jg <- get_joint_graph(clust_irrs = c) # detect communities gcd <- detect_communities(graph = jg$graph, weight = "nweight", algorithm = "leiden", resolution = 1, iterations = 100, metric = "average", chains = c("CDR3a", "CDR3b")) # look at outputs names(gcd) # look at the community occupancymatrix head(gcd$community_occupancy_matrix) # look at the community summary head(gcd$community_summary$wide) # look at the node summary head(gcd$node_summary)
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3a = CDR3ab[1:300, "CDR3a"], CDR3b = CDR3ab[1:300, "CDR3b"], clone_size = 1, sample = "a") b <- data.frame(CDR3a = CDR3ab[201:400, "CDR3a"], CDR3b = CDR3ab[201:400, "CDR3b"], clone_size = 1, sample = "b") b$clone_size[1] <- 20 # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get joint graph jg <- get_joint_graph(clust_irrs = c) # detect communities gcd <- detect_communities(graph = jg$graph, weight = "nweight", algorithm = "leiden", resolution = 1, iterations = 100, metric = "average", chains = c("CDR3a", "CDR3b")) # look at outputs names(gcd) # look at the community occupancymatrix head(gcd$community_occupancy_matrix) # look at the community summary head(gcd$community_summary$wide) # look at the node summary head(gcd$node_summary)
Use 1. a list of community IDs, 2. node_summary
data.frame generated
by the function detect_communities
; and 3. antigen species/genes to
estimate the number of antigen-specific T-cells in selected communities in
each repertoire.
get_ag_summary(communities, node_summary, ag_species = NULL, ag_genes = NULL, db = "vdjdb", db_dist = 0, chain = "both")
get_ag_summary(communities, node_summary, ag_species = NULL, ag_genes = NULL, db = "vdjdb", db_dist = 0, chain = "both")
communities |
numeric vector with community IDs |
node_summary |
|
ag_species |
antigen species, character vector, e.g. c("EBV", "CMV") |
ag_genes |
antigen genes, character vector, e.g. "MLANA" |
db |
annotation database, character, e.g. "vdjdb" |
db_dist |
maximum edit distance threshold for matching, nummeric |
chain |
immune receptor chain for annotation, "both", "CDR3a" or "CDR3b" |
The user has to provide a vector of antigen species (e.g.
ag_species
= c("EBV", "CMV")) and/or a vector of antigen
genes (e.g. ag_genes
= "MLANA"). Furthermore, the user has
to provide nodes (node_summary
data.frame created by the
function detect_communities
) and a vector with community IDs.
The user can also select an annotation database db
,
such as "vdjdb", "mcpas" or "tcr3d"; and restrict the annotation to
specific IR chains, such as "CDR3a", "CDR3b" or "both". By default,
we will look for perfect matches (db_dist=0
) between CDR3
sequences in the input and in the annotation database for annotation.
Flexible annotation based on edit distances can be performed by
increasing db_dist
.
The output is a data.frame with the number of T-cells specific for the antigenic species/genes (columns) provided as input per repertoire (row), including the total number of T-cells in each repertoire.
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3a = CDR3ab[1:500, "CDR3a"], CDR3b = CDR3ab[1:500, "CDR3b"], clone_size = 1, sample = "a") b <- data.frame(CDR3a = CDR3ab[401:900, "CDR3a"], CDR3b = CDR3ab[401:900, "CDR3b"], clone_size = 1, sample = "b") b$clone_size[1] <- 20 # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get joint graph jg <- get_joint_graph(clust_irrs = c) # detect communities gcd <- detect_communities(graph = jg$graph, algorithm = "leiden", resolution = 1, weight = "ncweight", metric = "average", chains = c("CDR3a", "CDR3b")) # look at outputs names(gcd) # look at the community matrix head(gcd$community_occupancy_matrix) # look at the community summary head(gcd$community_summary$wide) # look at the node summary head(gcd$node_summary) # differential community occupancy analysis dco <- dco(community_occupancy_matrix = gcd$community_occupancy_matrix) names(dco) # generate beta violin plots ag_summary <- get_ag_summary(communities = 1:3, node_summary = gcd$node_summary, ag_species = c("EBV", "CMV"), ag_genes = "MLANA", db = "vdjdb", db_dist = 0, chain = "both")
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3a = CDR3ab[1:500, "CDR3a"], CDR3b = CDR3ab[1:500, "CDR3b"], clone_size = 1, sample = "a") b <- data.frame(CDR3a = CDR3ab[401:900, "CDR3a"], CDR3b = CDR3ab[401:900, "CDR3b"], clone_size = 1, sample = "b") b$clone_size[1] <- 20 # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get joint graph jg <- get_joint_graph(clust_irrs = c) # detect communities gcd <- detect_communities(graph = jg$graph, algorithm = "leiden", resolution = 1, weight = "ncweight", metric = "average", chains = c("CDR3a", "CDR3b")) # look at outputs names(gcd) # look at the community matrix head(gcd$community_occupancy_matrix) # look at the community summary head(gcd$community_summary$wide) # look at the node summary head(gcd$node_summary) # differential community occupancy analysis dco <- dco(community_occupancy_matrix = gcd$community_occupancy_matrix) names(dco) # generate beta violin plots ag_summary <- get_ag_summary(communities = 1:3, node_summary = gcd$node_summary, ag_species = c("EBV", "CMV"), ag_genes = "MLANA", db = "vdjdb", db_dist = 0, chain = "both")
coefficient means for communities in immune receptor repertoiresUse 1. node_summary
data.frame generated by the function
detect_communities
; 2. beta
data.frame which is part
of posterior_summary
generated by the function dco
;
and 3. antigen species/genes to visualize distributions of
coefficient means for communities that contain antigen-specific and
antigen-unspecific IRs.
get_beta_violins(beta, node_summary, ag_species = NULL, ag_genes = NULL, db = "vdjdb", db_dist = 0, chain = "both")
get_beta_violins(beta, node_summary, ag_species = NULL, ag_genes = NULL, db = "vdjdb", db_dist = 0, chain = "both")
beta |
|
node_summary |
|
ag_species |
antigen species, character vector, e.g. c("EBV", "CMV") |
ag_genes |
antigen genes, character vector, e.g. "MLANA" |
db |
annotation database, character, e.g. "vdjdb" |
db_dist |
maximum edit distance threshold for matching, nummeric |
chain |
immune receptor chain for annotation, "both", "CDR3a" or "CDR3b" |
The user has to provide a vector of antigen species (e.g.
ag_species
= c("EBV", "CMV")) and/or a vector of antigen
genes (e.g. ag_genes
= "MLANA"). Furthermore, the user has
to provide nodes (node_summary
data.frame created by the
function detect_communities
) and beta
data.frame
which is part of posterior_summary
generated by the function
dco
.
The user can also select an annotation database db
,
such as "vdjdb", "mcpas" or "tcr3d"; and restrict the annotation to
specific IR chains, such as "CDR3a", "CDR3b" or "both". By default,
we will look for perfect matches (db_dist=0
) between CDR3
sequences in the input and in the annotation database for annotation.
Flexible annotation based on edit distances can be performed by
increasing db_dist
.
The output is a list with 4 elements:
node_annotations
: annotated node_summary
beta_summary
: annotated beta
vars
: annotation variables
violins
: violin plots (one for each antigen species and gene)
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3a = CDR3ab[1:500, "CDR3a"], CDR3b = CDR3ab[1:500, "CDR3b"], clone_size = 1, sample = "a") b <- data.frame(CDR3a = CDR3ab[401:900, "CDR3a"], CDR3b = CDR3ab[401:900, "CDR3b"], clone_size = 1, sample = "b") b$clone_size[1] <- 20 # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get joint graph jg <- get_joint_graph(clust_irrs = c) # detect communities gcd <- detect_communities(graph = jg$graph, algorithm = "leiden", resolution = 1, weight = "ncweight", metric = "average", chains = c("CDR3a", "CDR3b")) # look at outputs names(gcd) # look at the community matrix head(gcd$community_occupancy_matrix) # look at the community summary head(gcd$community_summary$wide) # look at the node summary head(gcd$node_summary) # differential community occupancy analysis dco <- dco(community_occupancy_matrix = gcd$community_occupancy_matrix) names(dco) # generate beta violin plots beta_violins <- get_beta_violins(beta = dco$posterior_summary$beta, node_summary = gcd$node_summary, ag_species = c("EBV", "CMV"), ag_genes = "MLANA", db = "vdjdb", db_dist = 0, chain = "both")
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3a = CDR3ab[1:500, "CDR3a"], CDR3b = CDR3ab[1:500, "CDR3b"], clone_size = 1, sample = "a") b <- data.frame(CDR3a = CDR3ab[401:900, "CDR3a"], CDR3b = CDR3ab[401:900, "CDR3b"], clone_size = 1, sample = "b") b$clone_size[1] <- 20 # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get joint graph jg <- get_joint_graph(clust_irrs = c) # detect communities gcd <- detect_communities(graph = jg$graph, algorithm = "leiden", resolution = 1, weight = "ncweight", metric = "average", chains = c("CDR3a", "CDR3b")) # look at outputs names(gcd) # look at the community matrix head(gcd$community_occupancy_matrix) # look at the community summary head(gcd$community_summary$wide) # look at the node summary head(gcd$node_summary) # differential community occupancy analysis dco <- dco(community_occupancy_matrix = gcd$community_occupancy_matrix) names(dco) # generate beta violin plots beta_violins <- get_beta_violins(beta = dco$posterior_summary$beta, node_summary = gcd$node_summary, ag_species = c("EBV", "CMV"), ag_genes = "MLANA", db = "vdjdb", db_dist = 0, chain = "both")
igraph
object from clust_irr
objectGiven a clust_irr
object generated by the function cluster_irr
,
the function get_graph
constructs an igraph
object.
The graph nodes represent IR clones. Undirected edges are drawn between
pairs of nodes, and the attributes of these edges are assigned based
on the clust_irr
outputs: ,
, etc.
get_graph(clust_irr)
get_graph(clust_irr)
clust_irr |
S4 object generated by the function |
The output is a list with the following elements. First, the list contains
an igraph
object. The graph nodes and edges contain attributes encoded
in the clust_irr
objects. Second, it contains a data.frame in which
rows are clones (nodes) in the graph. Third, the list contains the logical
variable joint_graph
, which is set to TRUE
if the graph is a
joint graph generated by the function get_joint_graph
and FALSE
if the graph is not a joint graph generated by get_graph
.
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run ClustIRR analysis out <- cluster_irr(s = s) # get graph g <- get_graph(clust_irr = out) names(g)
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run ClustIRR analysis out <- cluster_irr(s = s) # get graph g <- get_graph(clust_irr = out) names(g)
Use the community_occupancy_matrix
generated by the function
detect_communities
to generate honeycomb plots for each pair of
repertoires. In each plot, we will show communities (rows in the matric
community_occupancy_matrix
) as dots and their intensities in a
pair of repertoires (x-axis and y-axis). The density of dots is encoded
by the color of the honeycomb-like hexagons.
get_honeycombs(com)
get_honeycombs(com)
com |
|
Use the community_occupancy_matrix
generated by the function
detect_communities
to generate honeycomb plots for each pair of
repertoires. In each plot, we will show communities (rows in the matric
community_occupancy_matrix
) as dots and their intensities in a
pair of repertoires (x-axis and y-axis). The density of dots is encoded
by the color of the honeycomb-like hexagons.
The output is a list with ggplots. Given n repertoires (columns in input
community_occupancy_matrix
), it will generate n*(n-1)/2 plots.
You can arrange the ggplots (or a portion of them) in any shape e.g. with
the R-package patchwork.
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3a = CDR3ab[1:300, "CDR3a"], CDR3b = CDR3ab[1:300, "CDR3b"], clone_size = 1, sample = "a") b <- data.frame(CDR3a = CDR3ab[201:400, "CDR3a"], CDR3b = CDR3ab[201:400, "CDR3b"], clone_size = 1, sample = "b") b$clone_size[1] <- 20 # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get joint graph jg <- get_joint_graph(clust_irrs = c) # detect communities gcd <- detect_communities(graph = jg$graph, algorithm = "leiden", resolution = 1, weight = "ncweight", metric = "average", chains = c("CDR3a", "CDR3b")) # get honeycombs g <- get_honeycombs(com = gcd$community_occupancy_matrix) g
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3a = CDR3ab[1:300, "CDR3a"], CDR3b = CDR3ab[1:300, "CDR3b"], clone_size = 1, sample = "a") b <- data.frame(CDR3a = CDR3ab[201:400, "CDR3a"], CDR3b = CDR3ab[201:400, "CDR3b"], clone_size = 1, sample = "b") b$clone_size[1] <- 20 # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get joint graph jg <- get_joint_graph(clust_irrs = c) # detect communities gcd <- detect_communities(graph = jg$graph, algorithm = "leiden", resolution = 1, weight = "ncweight", metric = "average", chains = c("CDR3a", "CDR3b")) # get honeycombs g <- get_honeycombs(com = gcd$community_occupancy_matrix) g
igraph
object from multiple clust_irr
objectsGiven a vector of clust_irr
objects, generated by the function
cluster_irr
, the function get_joint_graph
performs the
following steps:
1. runs the function get_graph
on each clust_irr
object
2. merges the nodes: if graph a
and b
have |a|
and |b|
nodes, then the joint graph has |a|+|b|
nodes,
regardless of whether exactly the same clone (vertex) is found in both
graphs.
3. draws edges between nodes from the different graphs using the same
algorithm for drawing edges between nodes within an IRR (see function
clust_irr
).
4. return a joint graph as igraph
object
5. return the input clustirr object list
6. return a logical joint_graph
=TRUE
get_joint_graph(clust_irrs, cores = 1)
get_joint_graph(clust_irrs, cores = 1)
clust_irrs |
A list of at least two S4 objects generated with the
function |
cores |
number of computer cores to use (default = 1) |
The main output is an igraph
object.
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "a", clone_size = 1) b <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "b", clone_size = 1) # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get graph g <- get_joint_graph(clust_irrs = c) names(g)
# load package input data data("CDR3ab", package = "ClustIRR") a <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "a", clone_size = 1) b <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "b", clone_size = 1) # run ClustIRR analysis c <- c(cluster_irr(s = a), cluster_irr(s = b)) # get graph g <- get_joint_graph(clust_irrs = c) names(g)
data.frame with CDR3a
and/or CDR3b
sequences and their
matching antigenic epitopes obtained from McPAS-TCR. The remaining CDR3
columns are set to NA. For data processing details see the script
inst/script/get_mcpastcr.R
data(mcpas)
data(mcpas)
data.frame
with columns:
CDR3a
: CDR3a amino acid sequence
CDR3b
: CDR3b amino acid sequence
CDR3g
: CDR3g amino acid sequence -> NA
CDR3d
: CDR3d amino acid sequence -> NA
CDR3h
: CDR3h amino acid sequence -> NA
CDR3l
: CDR3l amino acid sequence -> NA
CDR3_species
: CDR3 species (e.g. human, mouse, ...)
Antigen_species
: antigen species
Antigen_gene
: antigen gene
Reference
: Reference (Pubmed ID)
data(mcpas) loads the object McPAS-TCR
data(mcpas)
data(mcpas)
This function visualizes a graph. The main input is g
object
created by the function get_graph
.
plot_graph(g, select_by = "Ag_species", as_visnet = FALSE, show_singletons = TRUE, node_opacity = 1)
plot_graph(g, select_by = "Ag_species", as_visnet = FALSE, show_singletons = TRUE, node_opacity = 1)
g |
Object returned by the functions |
as_visnet |
logical, if |
select_by |
character string, two values are possible: "Ag_species" or
"Ag_gene". This only has an effect if |
show_singletons |
logical, if |
node_opacity |
probability, controls the opacity of node colors. Lower values corresponding to more transparent colors. |
The output is an igraph
or visNetwork
plot.
The size of the vertices increases linearly as the logarithm of the degree of the clonal expansion (number of cells per clone) in the corresponding clones.
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run ClustIRR analysis out <- cluster_irr(s = s) # get graph g <- get_graph(clust_irr = out) # plot graph with vertices as clones plot_graph(g, as_visnet=FALSE, show_singletons=TRUE, node_opacity = 0.8)
# load package input data data("CDR3ab", package = "ClustIRR") s <- data.frame(CDR3b = CDR3ab[1:100, "CDR3b"], sample = "A", clone_size = 1) # run ClustIRR analysis out <- cluster_irr(s = s) # get graph g <- get_graph(clust_irr = out) # plot graph with vertices as clones plot_graph(g, as_visnet=FALSE, show_singletons=TRUE, node_opacity = 0.8)
data.frame with paired CDR3a
and CDR3b
CDR3 sequences and their
matching epitopes obtained from TCR3d. The remaining CDR3 columns are set to
NA. The antigenic epitopes come from cancer antigens and from viral antigens.
For data processing details see the script inst/script/get_tcr3d.R
data(tcr3d)
data(tcr3d)
data.frame
with columns:
CDR3a
: CDR3a amino acid sequence
CDR3b
: CDR3b amino acid sequence
CDR3g
: CDR3g amino acid sequence -> NA
CDR3d
: CDR3d amino acid sequence -> NA
CDR3h
: CDR3h amino acid sequence -> NA
CDR3l
: CDR3l amino acid sequence -> NA
CDR3_species
: CDR3 species (e.g. human, mouse, ...)
Antigen_species
: antigen species
Antigen_gene
: antigen gene
Reference
: Reference ID
data(tcr3d) loads the object tcr3d
data("tcr3d")
data("tcr3d")
data.frame with unpaired CDR3a
or CDR3b
sequences and their
matching epitopes obtained from VDJdb. The remaining CDR3 columns are set
to NA. For data processing details see the script inst/script/get_vdjdb.R
data(vdjdb)
data(vdjdb)
data.frame
with columns:
CDR3a
: CDR3a amino acid sequence
CDR3b
: CDR3b amino acid sequence
CDR3g
: CDR3g amino acid sequence -> NA
CDR3d
: CDR3d amino acid sequence -> NA
CDR3h
: CDR3h amino acid sequence -> NA
CDR3l
: CDR3l amino acid sequence -> NA
CDR3_species
: CDR3 species (e.g. human, mouse, ...)
Antigen_species
: antigen species
Antigen_gene
: antigen gene
Reference
: Reference (Pubmed ID)
data(vdjdb) loads the object vdjdb
data("vdjdb")
data("vdjdb")