Package 'immLynx' reference manual

Title:	Linking Advanced TCR Python Pipelines and Hugging Face Models in R
Description:	A comprehensive toolkit that brings popular Python-based immune repertoire analysis tools and Hugging Face protein language models into the R environment. Provides unified interfaces for TCR distance calculations (tcrdist3), sequence generation probability (OLGA), selection inference (soNNia), clustering (clusTCR), protein embeddings (ESM-2), and metaclone discovery (metaclonotypist). Fully compatible with the scRepertoire and immApex ecosystem for single-cell immune repertoire analysis.
Authors:	Nick Borcherding [aut, cre] (ORCID: <https://orcid.org/0000-0003-1427-6342>)
Maintainer:	Nick Borcherding <[email protected]>
License:	MIT + file LICENSE
Version:	1.1.0
Built:	2026-07-04 02:23:17 UTC
Source:	https://github.com/bioc/immLynx

Convert TCR Data to tcrdist3 Format

Description

Converts TCR data from immLynx/scRepertoire format to the format required by tcrdist3 for distance calculations.

Usage

convertToTcrdist(
  tcr_data,
  chains = c("beta", "alpha", "both"),
  include_count = TRUE
)

Arguments

tcr_data

A data.frame from extractTCRdata() or similar source

chains

Which chains to include: "alpha", "beta", or "both". Default is "beta".

include_count

Logical. If TRUE, adds a 'count' column (default 1). Default is TRUE.

Value

A data.frame in tcrdist3 format with columns:

count: Clone count (default 1)
v_a_gene: Alpha V gene (if alpha chain included)
j_a_gene: Alpha J gene (if alpha chain included)
cdr3_a_aa: Alpha CDR3 amino acid sequence (if alpha chain included)
v_b_gene: Beta V gene (if beta chain included)
j_b_gene: Beta J gene (if beta chain included)
cdr3_b_aa: Beta CDR3 amino acid sequence (if beta chain included)
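
To make the column mapping concrete, the sketch below builds the beta-chain layout by hand from a long-format table using base R only. It illustrates the output shape described above and is not the package's implementation: the helper name to_tcrdist_beta is hypothetical, the input columns (cdr3_aa, v, j, chain) follow the example data in this entry, and the alpha-chain row is made-up toy data.

```r
# Hypothetical helper: reshape long-format TRB rows into tcrdist3-style columns.
to_tcrdist_beta <- function(tcr_data, include_count = TRUE) {
  beta <- tcr_data[tcr_data$chain == "TRB", , drop = FALSE]
  out <- data.frame(
    v_b_gene  = beta$v,
    j_b_gene  = beta$j,
    cdr3_b_aa = beta$cdr3_aa,
    stringsAsFactors = FALSE
  )
  if (include_count) {
    # Each row counts as one observation, matching the documented default of 1.
    out <- cbind(count = rep(1L, nrow(out)), out)
  }
  out
}

# Toy input: two beta-chain rows and one alpha-chain row.
long <- data.frame(
  barcode = c("cell_1", "cell_2", "cell_3"),
  cdr3_aa = c("CASSLGTGELFF", "CAVRDSNYQLIW", "CASSIRSSYEQYF"),
  v       = c("TRBV7-2", "TRAV3", "TRBV12-3"),
  j       = c("TRBJ2-2", "TRAJ33", "TRBJ1-1"),
  chain   = c("TRB", "TRA", "TRB"),
  stringsAsFactors = FALSE
)
beta_fmt <- to_tcrdist_beta(long)
print(beta_fmt)
```

Only the two TRB rows survive the filter, and the alpha columns are absent because the alpha chain was not requested.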

Examples

# Convert long-format TCR data to tcrdist3 format
tcr_data <- data.frame(
  barcode = paste0("cell_", 1:5),
  cdr3_aa = c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF",
              "CASNQGLNEKLFF", "CASSLDRNEQFF"),
  v = paste0("TRBV", c("7-2", "12-3", "5-1", "28", "7-9")),
  j = paste0("TRBJ", c("2-2", "1-1", "2-7", "1-5", "2-1")),
  chain = rep("TRB", 5),
  stringsAsFactors = FALSE
)
tcrdist_format <- convertToTcrdist(tcr_data, chains = "beta")
head(tcrdist_format)

Extract TCR Data from SingleCellExperiment Object

Description

Extracts T-cell receptor data from a SingleCellExperiment object that has been processed with scRepertoire. This is a convenience wrapper around immApex::getIR() that provides additional formatting options.

Usage

extractTCRdata(
  input,
  chains = c("TRB", "TRA", "TRG", "TRD", "IGH", "IGL", "IGK", "both"),
  format = c("long", "wide"),
  remove_na = TRUE
)

Arguments

input

A SingleCellExperiment object containing scRepertoire TCR data in metadata.

chains

Which chains to extract: "TRA", "TRB", "TRG", "TRD", "IGH", "IGL", "IGK", or "both" (for TRA and TRB). Default is "TRB".

format

Output format: "long" (one row per chain) or "wide" (one row per cell with columns for each chain). Default is "long".

remove_na

Logical. If TRUE, removes rows with NA CDR3 sequences. Default is TRUE.

Value

A data.frame containing:

barcode: Cell barcode
cdr3_aa: CDR3 amino acid sequence
cdr3_nt: CDR3 nucleotide sequence (if available)
v: V gene
d: D gene (if applicable)
j: J gene
c: C gene (if available)
chain: Chain type (TRA, TRB, etc.)
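
The difference between the "long" and "wide" formats can be sketched with base R. The example below is an assumption about what a one-row-per-cell layout looks like. The column suffixes (_TRA, _TRB) are hypothetical and may not match what extractTCRdata() actually emits, and the sequences are toy values.

```r
# Toy long-format table: one row per chain per cell (values are made up).
long <- data.frame(
  barcode = c("cell_1", "cell_1", "cell_2"),
  cdr3_aa = c("CAVRDSNYQLIW", "CASSLGTGELFF", "CASSIRSSYEQYF"),
  v       = c("TRAV3", "TRBV7-2", "TRBV12-3"),
  j       = c("TRAJ33", "TRBJ2-2", "TRBJ1-1"),
  chain   = c("TRA", "TRB", "TRB"),
  stringsAsFactors = FALSE
)

# Split by chain, suffix the value columns with the chain name,
# then join the per-chain tables on barcode.
per_chain <- lapply(split(long, long$chain), function(df) {
  chain <- df$chain[1]
  df$chain <- NULL
  value_cols <- setdiff(names(df), "barcode")
  names(df)[match(value_cols, names(df))] <- paste0(value_cols, "_", chain)
  df
})
wide <- Reduce(function(x, y) merge(x, y, by = "barcode", all = TRUE), per_chain)
print(wide)
```

Cells missing a chain get NA in that chain's columns. This sketch assumes at most one sequence per chain per cell; cells with multiple chains of the same type would need extra handling.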

Examples

# Extract beta chain data
data(immLynx_example)
tcr_data <- extractTCRdata(immLynx_example, chains = "TRB")
head(tcr_data)

Generate Random TCR Sequences using OLGA

Description

Generate random TCR sequences from OLGA's generative model.

Usage

generateOLGA(n = 100, model = "humanTRB")

Arguments

n

Number of sequences to generate.

model

OLGA model to use: "humanTRB", "humanTRA", "humanIGH", "mouseTRB".

Value

A data.frame of generated sequences with columns nt_seq, aa_seq, v_index, and j_index.

Examples

# Available models
models <- c("humanTRB", "humanTRA", "humanIGH", "mouseTRB")

# Generate 100 random human TRB sequences
random_seqs <- generateOLGA(n = 100, model = "humanTRB")
head(random_seqs)

# Generate human TRA sequences
tra_seqs <- generateOLGA(n = 50, model = "humanTRA")

# Generate mouse TRB sequences
mouse_seqs <- generateOLGA(n = 200, model = "mouseTRB")

Initialize a Hugging Face Model and Tokenizer

Description

Downloads or loads a cached pre-trained model and its corresponding tokenizer from the Hugging Face Hub. Uses the basilisk-managed Python environment which includes transformers and torch.

Usage

huggingModel(model_name = "facebook/esm2_t12_35M_UR50D")

Arguments

model_name

A string specifying the model identifier from the Hugging Face Hub. For ESM-2 35M, this is "facebook/esm2_t12_35M_UR50D". Larger checkpoints such as "facebook/esm2_t33_650M_UR50D" can also be used.

Value

A list containing the R-wrapped Python objects for the 'model' and 'tokenizer', along with the basilisk process handle. The process must be stopped when no longer needed via basilisk::basiliskStop().

Examples

# Default model is ESM-2 35M
model_name <- "facebook/esm2_t12_35M_UR50D"

# Load the default ESM-2 35M model
hf_components <- huggingModel(model_name)
names(hf_components)  # "model", "tokenizer", "proc"

# Use with tokenizeSequences and proteinEmbeddings
sequences <- c("CASSLGTGELFF", "CASSIRSSYEQYF")
tokenized <- tokenizeSequences(hf_components$tokenizer,
                               sequences)
embeddings <- proteinEmbeddings(hf_components$model,
                                tokenized,
                                pool = "mean",
                                chunk_size = 32)

# Clean up the basilisk process when done
basilisk::basiliskStop(hf_components$proc)

Example Single-Cell RNA-seq Data with TCR Information

Description

An SCE object containing single-cell RNA-seq data from multiple patients with integrated T-cell receptor (TCR) repertoire data from scRepertoire. This dataset is useful for demonstrating the functionality of immLynx analysis functions.

Usage

data(immLynx_example)

Format

An SCE object with the following components:

Assays: RNA expression data
Metadata: Cell-level metadata including:

orig.ident: Original sample identifier (e.g., "P17B", "P17L")
nCount_RNA: Total RNA counts per cell
nFeature_RNA: Number of detected genes per cell
CTgene: TCR gene information (V/J genes)
CTnt: TCR CDR3 nucleotide sequences
CTaa: TCR CDR3 amino acid sequences
CTstrict: Strict TCR identifier combining gene and sequence
clonalFrequency: Number of cells sharing the same TCR
clonalProportion: Proportion of cells with the same TCR
cloneSize: Categorical clone size (Small, Medium, Large, Hyperexpanded)
Patient: Patient identifier (P17, P18, P19, P20)
Type: Sample type (B = Blood, L = Lymph node)
clusters: Cell cluster assignments
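
As a rough illustration of how clonalFrequency and clonalProportion relate to CTstrict, the sketch below recomputes both from a toy metadata table. It assumes frequencies are counted across the whole table; scRepertoire may compute them per sample or per group, and the CTstrict values here are made-up placeholders.

```r
# Toy metadata: CTstrict values are placeholders, not real clonotype strings.
meta <- data.frame(
  barcode  = paste0("cell_", 1:6),
  CTstrict = c("cloneA", "cloneA", "cloneA", "cloneB", "cloneB", "cloneC"),
  stringsAsFactors = FALSE
)

# Count cells per clonotype, then map the counts back onto each cell.
counts <- table(meta$CTstrict)
meta$clonalFrequency  <- as.integer(counts[meta$CTstrict])
meta$clonalProportion <- meta$clonalFrequency / nrow(meta)
print(meta)
```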

Details

This dataset was created by combining 10X Genomics single-cell gene expression and VDJ sequencing data from 8 samples across 4 patients. Each patient contributed two samples: blood (B) and lymph node (L) tissue. More information on the data can be found in the following manuscript: https://pubmed.ncbi.nlm.nih.gov/33622974/

Examples

data(immLynx_example)
immLynx_example

Get Protein Embeddings from a Model

Description

Applies a pre-trained model to a batch of tokenized sequences to generate embeddings. This is the core embedding function used by runEmbeddings.

Usage

proteinEmbeddings(
  model,
  tokenized.batch,
  pool = c("none", "mean", "cls"),
  chunk_size = NULL,
  prefer_dtype = c("float16", "bfloat16", "float32"),
  prefer_device = c("auto", "cuda", "mps", "cpu")
)

Arguments

model

A Hugging Face model (from AutoModel or similar), typically obtained via huggingModel.

tokenized.batch

A list of tokenized tensors, or a list of such lists (i.e., already split into minibatches). If you pass a single large batch, set chunk_size. Typically obtained via tokenizeSequences.

pool

One of "mean", "cls", or "none". "mean" is recommended for sequence-level embeddings.

chunk_size

If tokenized.batch is a single big batch, split it into chunks of this many sequences. Ignored if you pre-batched upstream.

prefer_dtype

One of "float16", "bfloat16", "float32". Lower precision uses less memory but may reduce accuracy.

prefer_device

One of "auto", "cuda", "mps", "cpu". "auto" will select the best available device.
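
The effect of chunk_size can be pictured as splitting one large batch into consecutive groups of at most that many sequences. The sketch below does this on raw sequence strings in base R; the helper chunk_sequences is hypothetical, and the package itself operates on tokenized tensors rather than strings.

```r
# Hypothetical helper: split a vector into consecutive chunks of at most `chunk_size`.
chunk_sequences <- function(sequences, chunk_size) {
  stopifnot(chunk_size >= 1)
  chunk_id <- ceiling(seq_along(sequences) / chunk_size)
  unname(split(sequences, chunk_id))
}

sequences <- c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF",
               "CASNQGLNEKLFF", "CASSLDRNEQFF")
chunks <- chunk_sequences(sequences, chunk_size = 2)
print(lengths(chunks))
```

Smaller chunks lower peak memory use at the cost of more forward passes, which is the usual trade-off when choosing this value.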

Value

An R matrix [n_sequences x hidden] if pool != "none". If pool == "none", returns a list of arrays per chunk.
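
The pooling options can be illustrated on a toy per-token matrix. The sketch below shows the usual definitions of mean and CLS pooling for one sequence. Whether proteinEmbeddings masks padding or includes special tokens in the mean is not stated here, so the masking step is an assumption, and the numbers are made up.

```r
# Toy per-token hidden states for one sequence: 4 tokens x 3 hidden dimensions.
# Row 1 plays the role of the CLS token and row 4 is padding.
hidden <- matrix(c(1, 2, 3,
                   2, 4, 6,
                   4, 6, 8,
                   9, 9, 9), nrow = 4, byrow = TRUE)
attention_mask <- c(1, 1, 1, 0)  # 0 marks the padded position

# Mean pooling: average only over the positions the mask keeps.
# Multiplying a matrix by a vector of length nrow scales each row.
mean_pooled <- colSums(hidden * attention_mask) / sum(attention_mask)

# CLS pooling: take the first token's vector.
cls_pooled <- hidden[1, ]

print(mean_pooled)
print(cls_pooled)
```

Both pooled results are single vectors of length hidden, so stacking them across sequences gives the [n_sequences x hidden] matrix described above, while pool = "none" keeps the full per-token arrays.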

Examples

sequences <- c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF")

# Full workflow: load model, tokenize, embed
hf_components <- huggingModel()
tokenized <- tokenizeSequences(hf_components$tokenizer,
                               sequences)

# Mean pooling (recommended for sequence-level tasks)
embeddings <- proteinEmbeddings(hf_components$model,
                                tokenized,
                                pool = "mean",
                                chunk_size = 32)
dim(embeddings)  # [n_sequences x hidden_dim]

# CLS token embedding
cls_emb <- proteinEmbeddings(hf_components$model,
                             tokenized,
                             pool = "cls",
                             chunk_size = 32)

# Per-token embeddings (no pooling)
token_emb <- proteinEmbeddings(hf_components$model,
                               tokenized,
                               pool = "none",
                               chunk_size = 32)

# Use GPU with half precision for speed
embeddings_gpu <- proteinEmbeddings(
  hf_components$model, tokenized,
  pool = "mean", chunk_size = 64,
  prefer_device = "cuda",
  prefer_dtype = "float16")

# Clean up the basilisk process when done
basilisk::basiliskStop(hf_components$proc)

Run clusTCR Clustering on scRepertoire Data

Description

This function extracts TCR sequences from a SingleCellExperiment object with scRepertoire data and performs clustering using the clusTCR algorithm.

Usage

runClustTCR(
  input,
  chains = c("TRB", "TRA", "both"),
  method = "mcl",
  combine_chains = FALSE,
  return_object = TRUE,
  column_prefix = "clustcr",
  ...
)

Arguments

input

A SingleCellExperiment object containing scRepertoire TCR data in the metadata.

chains

Character string specifying which chains to use: "TRA", "TRB", or "both". Default is "TRB".

method

Clustering method passed to clusTCR. Default is "mcl" (Markov Clustering), which is accurate for typical repertoire datasets.

combine_chains

Logical. If TRUE and chains="both", concatenates alpha and beta sequences with "_". Default is FALSE (clusters chains separately).

return_object

Logical. If TRUE, adds cluster assignments back to the input object metadata. If FALSE, returns only the clustering results. Default is TRUE.

column_prefix

Prefix for the new metadata column(s). Default is "clustcr".

...

Additional arguments passed to calculate.clustcr (e.g., inflation).

Value

If return_object=TRUE, returns the input object with cluster assignments added to metadata. If return_object=FALSE, returns a data.frame with barcodes and cluster assignments.
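
For combine_chains = TRUE, the documented behaviour is to concatenate alpha and beta sequences with "_". The sketch below shows that join on a toy paired table. Dropping cells that lack one chain is an assumption made for illustration, since the handling of incomplete pairs is not described here, and the alpha sequences are made up.

```r
# Toy paired chains per cell (alpha sequences are made up for illustration).
paired <- data.frame(
  barcode = c("cell_1", "cell_2", "cell_3"),
  TRA = c("CAVRDSNYQLIW", NA, "CAASGGSYIPTF"),
  TRB = c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF"),
  stringsAsFactors = FALSE
)

# Keep only cells with both chains (assumption), then join the sequences with "_".
complete <- paired[!is.na(paired$TRA) & !is.na(paired$TRB), ]
complete$combined <- paste(complete$TRA, complete$TRB, sep = "_")
print(complete[, c("barcode", "combined")])
```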

Examples

data(immLynx_example)

# Cluster TRB chain using MCL algorithm
sce <- runClustTCR(immLynx_example,
                   chains = "TRB")

# Adjust MCL inflation parameter
sce <- runClustTCR(immLynx_example,
                   chains = "TRB",
                   inflation = 3.0)

# Cluster both chains separately
sce <- runClustTCR(immLynx_example,
                   chains = "both")

# Combine alpha and beta chains before clustering
sce <- runClustTCR(immLynx_example,
                   chains = "both",
                   combine_chains = TRUE)

# Get results as data.frame
clusters_df <- runClustTCR(immLynx_example,
                           chains = "TRB",
                           return_object = FALSE)

Generate Protein Language Model Embeddings for TCR Sequences

Description

Extracts TCR CDR3 sequences from a SingleCellExperiment object and generates embeddings using a protein language model (e.g., ESM-2).

Usage

runEmbeddings(
  input,
  chains = c("TRB", "TRA", "both"),
  model_name = "facebook/esm2_t12_35M_UR50D",
  pool = "mean",
  chunk_size = 32,
  reduction_name = "tcr_esm",
  reduction_key = "ESM_",
  return_object = TRUE,
  ...
)

Arguments

input

A SingleCellExperiment object containing scRepertoire TCR data.

chains

Which chain(s) to embed: "TRB", "TRA", or "both". Default is "TRB".

model_name

Hugging Face model name. Default is "facebook/esm2_t12_35M_UR50D". Other options include "facebook/esm2_t33_650M_UR50D" and "facebook/esm2_t36_3B_UR50D".

pool

Pooling method: "mean", "cls", or "none". Default is "mean".

chunk_size

Number of sequences to process at once. Default is 32.

reduction_name

Name for the dimensional reduction. Default is "tcr_esm".

reduction_key

Key prefix for embeddings in reduction. Default is "ESM_".

return_object

Logical. If TRUE, adds embeddings as dimensional reduction. If FALSE, returns list with embeddings and metadata. Default is TRUE.

...

Additional arguments passed to proteinEmbeddings().

Details

This function uses protein language models to generate dense vector representations of TCR CDR3 sequences. These embeddings can be used for:

- Dimensionality reduction and visualization
- Clustering TCRs by sequence similarity
- Downstream machine learning tasks
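
As a hedged illustration of the downstream uses named in this section, the sketch below runs PCA and k-means on a stand-in embedding matrix using base R. The matrix holds random numbers rather than real ESM-2 output, and the choice of two components and three clusters is arbitrary.

```r
set.seed(1)
# Stand-in for an [n_sequences x hidden] embedding matrix (random values, not real embeddings).
emb <- matrix(rnorm(20 * 8), nrow = 20, ncol = 8)
rownames(emb) <- paste0("cell_", seq_len(nrow(emb)))

# Dimensionality reduction: keep the first two principal components.
pca <- prcomp(emb, center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]

# Clustering by similarity: k-means on the reduced coordinates.
km <- kmeans(coords, centers = 3, nstart = 10)
print(table(km$cluster))
```

With return_object = FALSE, the embeddings element of the returned list could take the place of emb in this sketch.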

Value

If return_object=TRUE, returns input object with embeddings added as a dimensional reduction. If FALSE, returns list with embeddings matrix and metadata.

Examples

data(immLynx_example)

# Generate ESM-2 embeddings for TRB chain
sce <- runEmbeddings(immLynx_example,
                     chains = "TRB")

# Use a larger ESM-2 model
sce <- runEmbeddings(immLynx_example,
                     chains = "TRB",
                     model_name = "facebook/esm2_t33_650M_UR50D")

# Embed both chains together
sce <- runEmbeddings(immLynx_example,
                     chains = "both")

# Use CLS pooling instead of mean pooling
sce <- runEmbeddings(immLynx_example,
                     chains = "TRB",
                     pool = "cls")

# Get raw embeddings as a list
emb_list <- runEmbeddings(immLynx_example,
                          chains = "TRB",
                          return_object = FALSE)
dim(emb_list$embeddings)

Perform HLA Association Analysis on Metaclones

Description

Tests associations between metaclone membership and HLA alleles using Fisher's exact test with FDR correction.

Usage

runHLAassociation(
  metaclone_data,
  hla_data,
  by = "barcode",
  fdr_threshold = 0.05
)

Arguments

metaclone_data

A data.frame with metaclone assignments, typically from runMetaclonotypist() with return_input=FALSE.

hla_data

A data.frame with HLA typing information. Must have a 'barcode' or 'sample' column for matching and columns for each HLA allele.

by

Column name to use for matching between datasets. Default is "barcode".

fdr_threshold

FDR threshold for significance. Default is 0.05.

Value

A data.frame with HLA association results:

metaclone: Metaclone identifier
hla_allele: HLA allele tested
odds_ratio: Association odds ratio
pvalue: Raw p-value from Fisher's exact test
fdr: FDR-adjusted p-value
significant: Whether FDR < threshold
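
The statistical procedure can be sketched with base R's fisher.test() and p.adjust(). The example below tests every metaclone against every allele on the toy data from this entry. It is not the package's implementation: the Benjamini-Hochberg method is an assumption (the text only says "FDR correction"), as is building the 2x2 table per cell.

```r
# Toy data in the same shape as the example for this function.
metaclone <- rep(c("MC1", "MC2"), each = 10)
hla_a <- c(rep("A*02:01", 8), rep("A*01:01", 4),
           rep("A*02:01", 3), rep("A*01:01", 5))

# One test per (metaclone, allele) pair.
tests <- expand.grid(metaclone = unique(metaclone),
                     hla_allele = unique(hla_a),
                     stringsAsFactors = FALSE)

res <- do.call(rbind, lapply(seq_len(nrow(tests)), function(i) {
  # 2x2 table: metaclone membership vs. allele carriage.
  in_mc   <- factor(metaclone == tests$metaclone[i], levels = c(TRUE, FALSE))
  has_hla <- factor(hla_a == tests$hla_allele[i], levels = c(TRUE, FALSE))
  ft <- fisher.test(table(in_mc, has_hla))
  data.frame(metaclone = tests$metaclone[i],
             hla_allele = tests$hla_allele[i],
             odds_ratio = unname(ft$estimate),
             pvalue = ft$p.value,
             stringsAsFactors = FALSE)
}))

# Multiple-testing correction (BH assumed) and significance call.
res$fdr <- p.adjust(res$pvalue, method = "BH")
res$significant <- res$fdr < 0.05
print(res)
```

Fixing the factor levels keeps every table 2x2 even when a metaclone contains no carriers of an allele.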

Examples

# Create example metaclone and HLA data
metaclone_data <- data.frame(
  barcode = paste0("cell_", 1:20),
  metaclone = rep(c("MC1", "MC2"), each = 10),
  stringsAsFactors = FALSE
)
hla_data <- data.frame(
  barcode = paste0("cell_", 1:20),
  HLA_A = c(rep("A*02:01", 8), rep("A*01:01", 4),
            rep("A*02:01", 3), rep("A*01:01", 5)),
  stringsAsFactors = FALSE
)
results <- runHLAassociation(metaclone_data, hla_data)

Run Metaclonotypist for TCR Metaclone Discovery

Description

Identifies TCR metaclones (groups of related T cell receptors) using the metaclonotypist pipeline. Metaclonotypist uses a two-stage approach: fast edit-distance-based screening followed by TCRdist or SCEPTR refinement.

Usage

runMetaclonotypist(
  input,
  chains = c("beta", "alpha"),
  method = c("tcrdist", "sceptr"),
  max_edits = 2,
  max_dist = NULL,
  clustering = c("cc", "leiden", "louvain", "mcl"),
  resolution = 1,
  return_input = TRUE,
  column_name = "metaclone"
)

Arguments

input

A SingleCellExperiment object with scRepertoire data, or a data.frame with TCR data.

chains

Which chain to analyze: "alpha" or "beta". Default is "beta".

method

Distance metric for refinement: "tcrdist" (default) or "sceptr".

max_edits

Maximum CDR3 edit distance for initial screening. Default is 2.

max_dist

Maximum distance threshold for clustering. If NULL (the default), 20 is used for tcrdist and 1.5 for sceptr.

clustering

Clustering algorithm: "cc" (connected components, default), "leiden", "louvain", or "mcl".

resolution

Resolution parameter for leiden/louvain clustering. Default is 1.0.

return_input

Logical. If TRUE and input is a SingleCellExperiment object, adds metaclone assignments to metadata. Default is TRUE.

column_name

Name for the metadata column. Default is "metaclone".

Details

Metaclonotypist identifies groups of related TCRs that may recognize similar antigens. The algorithm:

1. Uses the Symdel algorithm for fast edit-distance-based candidate identification
2. Refines candidates using TCRdist or SCEPTR similarity metrics
3. Applies graph-based clustering (connected components by default) to identify metaclones
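
The screening and clustering stages can be sketched in base R on a handful of sequences. This is a simplified stand-in and not the metaclonotypist pipeline: it uses all-pairs Levenshtein distances from utils::adist() instead of Symdel, skips the TCRdist/SCEPTR refinement stage entirely, and finds connected components by simple label propagation.

```r
cdr3 <- c("CASSLGTGELFF", "CASSLGSGELFF", "CASSLGAGELFF",
          "CASSIRSSYEQYF", "CASSIRSTYEQYF", "CASNQGLNEKLFF")
max_edits <- 2

# Screening stand-in: all-pairs edit distance (quadratic, fine for toy data only).
d <- utils::adist(cdr3)
adj <- d <= max_edits  # adjacency matrix; the diagonal is TRUE

# Connected components: each node repeatedly takes the smallest label
# among itself and its neighbours until nothing changes.
labels <- seq_along(cdr3)
repeat {
  new_labels <- vapply(seq_along(cdr3),
                       function(i) min(labels[adj[i, ]]), integer(1))
  if (all(new_labels == labels)) break
  labels <- new_labels
}

# Relabel components as 1, 2, 3, ... in order of first appearance.
metaclone <- match(labels, unique(labels))
print(data.frame(cdr3_aa = cdr3, metaclone = metaclone))
```

The first three sequences differ by a single substitution and fall into one group, the next two form a second group, and the last sequence stays alone.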

Value

If return_input=TRUE and input is a SingleCellExperiment, returns the object with metaclone assignments added to metadata. Otherwise returns a data.frame with:

barcode: Cell barcode
cdr3_aa: CDR3 amino acid sequence
metaclone: Metaclone cluster assignment
metaclone_size: Number of cells in the metaclone

References

Metaclonotypist: https://github.com/qimmuno/metaclonotypist

Examples

data(immLynx_example)

# Run metaclonotypist on beta chain
sce <- runMetaclonotypist(immLynx_example, chains = "beta")

# Get results as data.frame instead of adding to object
metaclones <- runMetaclonotypist(immLynx_example,
                                 return_input = FALSE)

# Adjust the edit-distance and distance thresholds
sce <- runMetaclonotypist(immLynx_example,
                          max_edits = 3,
                          max_dist = 50)

Calculate Generation Probability (Pgen) for TCRs in scRepertoire Data

Description

Extracts TCR sequences from a SingleCellExperiment object and calculates their generation probability using OLGA.

Usage

runOLGA(
  input,
  chains = c("TRB", "TRA"),
  model = NULL,
  organism = "human",
  use_vj_genes = FALSE,
  return_object = TRUE,
  column_name = "olga_pgen"
)

Arguments

input

A SingleCellExperiment object containing scRepertoire TCR data.

chains

Which chain to analyze: "TRA" or "TRB". Default is "TRB".

model

OLGA model to use. Options: "humanTRB", "humanTRA", "humanIGH", "mouseTRB". If NULL, will be inferred from organism and chains parameters.

organism

Organism: "human" or "mouse". Used if model is NULL. Default is "human".

use_vj_genes

Logical. If TRUE, includes V and J gene information in Pgen calculation. Default is FALSE (sequence-only Pgen).

return_object

Logical. If TRUE, adds Pgen values to metadata. Default is TRUE.

column_name

Name for the metadata column. Default is "olga_pgen".

Value

If return_object=TRUE, returns input object with Pgen added to metadata. If FALSE, returns data.frame with barcodes, sequences, and Pgen values.
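
When model is NULL, the model is inferred from organism and chains. One plausible way to do that is sketched below. The helper infer_olga_model is hypothetical, and the paste-based rule is an assumption that happens to reproduce the documented model names; it is not taken from the package source.

```r
# Hypothetical helper mirroring the documented options; not the package's code.
infer_olga_model <- function(organism = "human", chains = "TRB", model = NULL) {
  valid <- c("humanTRB", "humanTRA", "humanIGH", "mouseTRB")
  if (is.null(model)) {
    # Assumed rule: organism and chain concatenated, e.g. "human" + "TRB".
    model <- paste0(organism, chains)
  }
  if (!model %in% valid) {
    stop("Unsupported OLGA model: ", model,
         ". Choose one of: ", paste(valid, collapse = ", "))
  }
  model
}

print(infer_olga_model("human", "TRA"))
print(infer_olga_model(model = "mouseTRB"))
```

Under this rule a combination such as mouse with TRA fails early, because "mouseTRA" is not among the models listed for this function.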

Examples

data(immLynx_example)

# Calculate Pgen for TRB chain
sce <- runOLGA(immLynx_example, chains = "TRB")

# Calculate Pgen for TRA chain
sce <- runOLGA(immLynx_example, chains = "TRA")

# Include V and J gene information
sce <- runOLGA(immLynx_example,
               chains = "TRB",
               use_vj_genes = TRUE)

# Get results as data.frame
pgen_df <- runOLGA(immLynx_example,
                   chains = "TRB",
                   return_object = FALSE)

# Specify model explicitly for mouse data
sce <- runOLGA(immLynx_example,
               model = "mouseTRB")

Run soNNia Selection Analysis

Description

Infer selection pressures on TCRs using soNNia. Requires a background dataset of unselected sequences (generated by OLGA).

Usage

runSoNNia(
  input,
  chains = c("TRB", "TRA"),
  background_file,
  organism = "human",
  save_folder = "sonia_output",
  n_epochs = 100,
  return_object = TRUE
)

Arguments

input

A SingleCellExperiment object containing scRepertoire TCR data.

chains

Which chain to analyze: "TRB" or "TRA". Default is "TRB".

background_file

Path to CSV file with background sequences (from generateOLGA).

organism

Organism: "human" or "mouse". Default is "human".

save_folder

Directory to save soNNia model. Default is "sonia_output".

n_epochs

Number of training epochs. Default is 100.

return_object

If TRUE, adds selection scores to metadata. Default is TRUE.

Details

This function requires a background dataset of unselected TCR sequences, which can be generated using generateOLGA(). The selected sequences are extracted from the input object and compared to this background to infer selection pressures.

Value

If return_object=TRUE, returns input object with selection scores in metadata. Otherwise returns the soNNia results.

Examples

data(immLynx_example)
## Not run: 
# Step 1: Generate background sequences with OLGA
background <- generateOLGA(n = 1000, model = "humanTRB")
bg_file <- tempfile(fileext = ".csv")
utils::write.csv(background, bg_file, row.names = FALSE)

# Step 2: Run soNNia selection analysis
sce <- runSoNNia(immLynx_example,
                 chains = "TRB",
                 background_file = bg_file)

# Get raw results instead of adding to object
sonia_results <- runSoNNia(immLynx_example,
                           chains = "TRB",
                           background_file = bg_file,
                           return_object = FALSE)

## End(Not run)

Calculate TCR Distances on scRepertoire Data

Description

This function extracts TCR sequences from a SingleCellExperiment object with scRepertoire data and calculates pairwise TCR distances using tcrdist3.

Usage

runTCRdist(
  input,
  chains = "beta",
  organism = "human",
  compute_distances = TRUE,
  add_to_object = FALSE
)

Arguments

input

A SingleCellExperiment object containing scRepertoire TCR data.

chains

Character vector specifying chains: "alpha", "beta", or c("alpha", "beta"). Default is "beta".

organism

Organism: "human" or "mouse". Default is "human".

compute_distances

Logical. Whether to compute the full pairwise distance matrix. Default is TRUE.

add_to_object

Logical. If TRUE, attempts to add distance matrix to object (stored in metadata for SCE). Default is FALSE due to large matrix size.

Value

A list containing:

distances

Distance matrices (pw_alpha, pw_beta, pw_cdr3_a_aa, pw_cdr3_b_aa)

barcodes

Cell barcodes corresponding to matrix rows/columns

tcr_data

The formatted TCR data.frame used for analysis

If add_to_object=TRUE, returns the input object with distances stored.
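
A pairwise distance matrix of this kind plugs directly into base R clustering. The sketch below uses a made-up 4 x 4 matrix standing in for one element of distances (for example pw_beta) and assumes its rows and columns follow the order of barcodes, as described above. The radius of 24 is an arbitrary illustration, not a recommended threshold.

```r
barcodes <- paste0("cell_", 1:4)
# Toy symmetric distance matrix (made-up values standing in for a pw_beta matrix).
pw <- matrix(c( 0,  12, 90,  96,
               12,   0, 84, 102,
               90,  84,  0,  18,
               96, 102, 18,   0),
             nrow = 4, byrow = TRUE,
             dimnames = list(barcodes, barcodes))

# Hierarchical clustering on the precomputed distances.
hc <- hclust(as.dist(pw), method = "average")
groups <- cutree(hc, k = 2)

# Neighbours of each cell within a fixed distance radius.
radius <- 24
neighbours <- lapply(barcodes, function(b) setdiff(barcodes[pw[b, ] <= radius], b))
names(neighbours) <- barcodes

print(groups)
print(neighbours)
```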

Examples

data(immLynx_example)

# Calculate TCR distances for beta chain
dist_results <- runTCRdist(immLynx_example,
                           chains = "beta")

# Access the distance matrices and the matching barcodes
dist_matrix <- dist_results$distances
barcodes <- dist_results$barcodes

# Calculate for both chains
dist_both <- runTCRdist(immLynx_example,
                        chains = c("alpha", "beta"))

# Add distances directly to the object
sce <- runTCRdist(immLynx_example,
                  chains = "beta",
                  add_to_object = TRUE)

Summarize TCR Repertoire Statistics

Description

Generates summary statistics for TCR repertoire data including diversity metrics, clonality measures, and sequence characteristics.

Usage

summarizeTCRrepertoire(
  input,
  chains = c("TRB", "TRA", "both"),
  group.by = NULL,
  calculate_diversity = TRUE
)

Arguments

input

A SingleCellExperiment object with scRepertoire data, or a data.frame from extractTCRdata().

chains

Which chains to summarize: "TRB", "TRA", or "both". Default is "TRB".

group.by

Optional metadata column for grouping (SingleCellExperiment objects only).

calculate_diversity

Logical. If TRUE, calculates diversity indices. Default is TRUE.

Value

An object of class TCR_summary containing summary statistics for the TCR repertoire.
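To give a feel for the kind of statistics involved, the base R sketch below computes a few of them by hand for the CDR3 sequences used in this function's Examples. It is not the package's implementation: it assumes a clonotype is defined by its CDR3 amino acid sequence alone, and it uses the natural-log Shannon entropy and the Gini-Simpson index as stand-ins for whatever diversity definitions summarizeTCRrepertoire applies.

```r
# CDR3 sequences taken from the data.frame example for this function
cdr3 <- c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSLGTGELFF",
          "CASSYSTGELFF", "CASSIRSSYEQYF", "CASSLGTGELFF",
          "CASNQGLNEKLFF", "CASSYSTGELFF", "CASSLGTGELFF",
          "CASSIRSSYEQYF")

# Assumption: a clonotype is defined by its CDR3 amino acid sequence alone
clone_counts <- sort(table(cdr3), decreasing = TRUE)
p <- as.numeric(clone_counts) / sum(clone_counts)

stats <- list(
  total_cells       = length(cdr3),
  unique_clonotypes = length(clone_counts),
  clonotype_ratio   = length(clone_counts) / length(cdr3),
  shannon           = -sum(p * log(p)),  # natural-log Shannon entropy
  gini_simpson      = 1 - sum(p^2),      # one common Simpson variant
  cdr3_length_mean  = mean(nchar(cdr3))
)
str(stats)
```

A clonotype ratio near 1 means almost every cell carries a distinct clonotype, while a low ratio points to clonal expansion.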

Examples

# Summarize from a data.frame
tcr_data <- data.frame(
  barcode = paste0("cell_", 1:10),
  cdr3_aa = c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSLGTGELFF",
              "CASSYSTGELFF", "CASSIRSSYEQYF", "CASSLGTGELFF",
              "CASNQGLNEKLFF", "CASSYSTGELFF", "CASSLGTGELFF",
              "CASSIRSSYEQYF"),
  v = rep("TRBV7-2", 10),
  j = rep("TRBJ2-2", 10),
  chain = rep("TRB", 10),
  stringsAsFactors = FALSE
)
# Named tcr_summary to avoid masking base::summary()
tcr_summary <- summarizeTCRrepertoire(tcr_data)
print(tcr_summary)

S4 Class for TCR Repertoire Summary

Description

Formal S4 class to store summary statistics for a TCR repertoire, including diversity metrics, clonality measures, and sequence characteristics.

Usage

## S4 method for signature 'TCR_summary'
show(object)

Arguments

object

A TCR_summary object.

Value

An S4 object of class TCR_summary containing the summary statistics described in the slots. The show method prints a formatted summary to the console and returns invisible(NULL).

Slots

total_cells: Integer. Total number of cells with TCR data.
unique_clonotypes: Integer. Number of unique clonotypes.
clonotype_ratio: Numeric. Ratio of unique clonotypes to total cells.
diversity: List of diversity indices (Shannon, Simpson, etc.), or NULL.
top_clones: Data.frame of the top 10 most frequent clonotypes.
cdr3_length: List with CDR3 length distribution statistics.
gene_usage: List of V and J gene usage data.frames.
chains: Character. Chain(s) summarized.
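For readers less familiar with S4, the following sketch shows how a class with slots and a show method fits together. ToySummary is a hypothetical class holding only a subset of the slots listed above; it is not the TCR_summary definition shipped with immLynx, and the printed layout is invented.

```r
library(methods)

# Hypothetical class with a subset of the slots listed above; this is not
# the TCR_summary definition shipped with immLynx.
setClass("ToySummary",
         slots = c(total_cells       = "integer",
                   unique_clonotypes = "integer",
                   clonotype_ratio   = "numeric",
                   chains            = "character"))

# show() is what R calls when an S4 object is auto-printed at the console
setMethod("show", "ToySummary", function(object) {
  cat("Toy TCR summary (", object@chains, ")\n", sep = "")
  cat("  Cells:      ", object@total_cells, "\n", sep = "")
  cat("  Clonotypes: ", object@unique_clonotypes, "\n", sep = "")
  cat("  Ratio:      ", format(object@clonotype_ratio, digits = 3), "\n",
      sep = "")
  invisible(NULL)
})

toy <- new("ToySummary",
           total_cells = 10L, unique_clonotypes = 4L,
           clonotype_ratio = 0.4, chains = "TRB")
toy                      # auto-printing dispatches to show()
n_cells <- toy@total_cells  # slots are read with @
```

A slot that may be NULL, such as diversity, would need a class union or a list-typed slot; the sketch leaves that out for brevity.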

Tokenize Amino Acid Sequences

Description

Takes a vector of amino acid sequences and uses a Hugging Face tokenizer to convert them into numerical input IDs suitable for model input. The tokenizer should be obtained from huggingModel, which manages the Python environment via basilisk.

Usage

tokenizeSequences(
  tokenizer,
  aa_sequences,
  padding = TRUE,
  truncation = TRUE,
  return_tensors = "pt"
)

Arguments

tokenizer

The tokenizer object returned by huggingModel.

aa_sequences

A character vector of amino acid sequences (e.g., CDR3 sequences).

padding

A logical or string. If TRUE, pads sequences to the length of the longest sequence in the batch. Defaults to TRUE.

truncation

A logical. If TRUE, truncates sequences to the model's maximum input length. Defaults to TRUE.

return_tensors

A string specifying the format for the returned tensors. "pt" for PyTorch tensors, "tf" for TensorFlow. Defaults to "pt".

Value

The tokenized output, typically a dictionary-like object containing 'input_ids' and 'attention_mask'.
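The roles of 'input_ids', 'attention_mask' and the padding argument can be illustrated without Python. The base R sketch below uses an invented vocabulary (the IDs are not those of ESM-2 or any real tokenizer) and a hypothetical toy_tokenize function; real tokenizers typically also add special tokens, which this sketch omits.

```r
# Toy vocabulary: the IDs are invented and are NOT the ESM-2 vocabulary.
amino_acids <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
vocab <- setNames(seq_along(amino_acids), amino_acids)
pad_id <- 0L

toy_tokenize <- function(aa_sequences, padding = TRUE) {
  # One integer ID per residue
  ids <- lapply(strsplit(aa_sequences, ""), function(x) unname(vocab[x]))
  if (!padding) return(list(input_ids = ids))

  # Right-pad every sequence to the length of the longest one
  max_len <- max(lengths(ids))
  input_ids <- t(vapply(ids, function(x) {
    c(x, rep(pad_id, max_len - length(x)))
  }, integer(max_len)))

  # 1 marks real tokens, 0 marks padding
  attention_mask <- (input_ids != pad_id) * 1L
  list(input_ids = input_ids, attention_mask = attention_mask)
}

tok <- toy_tokenize(c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF"))
dim(tok$input_ids)            # sequences x padded length
rowSums(tok$attention_mask)   # number of real tokens per sequence
```

With padding, all sequences share one rectangular matrix and the mask tells the model which positions to ignore; without padding, each sequence keeps its own length and the result is a ragged list.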

Examples

sequences <- c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF")

# Initialize model and tokenizer
hf_components <- huggingModel()

# Tokenize CDR3 sequences
tokenized <- tokenizeSequences(hf_components$tokenizer,
                               sequences)

# Tokenize without padding (variable-length output)
tokenized_nopad <- tokenizeSequences(
  hf_components$tokenizer,
  sequences,
  padding = FALSE
)

# Pass tokenized output to proteinEmbeddings
embeddings <- proteinEmbeddings(hf_components$model,
                                tokenized,
                                pool = "mean",
                                chunk_size = 32)

# Clean up
basilisk::basiliskStop(hf_components$proc)

Validate TCR Data Format

Description

Validates that TCR data is in the correct format for immLynx analysis functions. Checks for required columns, valid sequence formats, and gene nomenclature.

Usage

validateTCRdata(
  tcr_data,
  check_genes = FALSE,
  check_sequences = TRUE,
  strict = FALSE
)

Arguments

tcr_data

A data.frame containing TCR data

check_genes

Logical. If TRUE, validates gene names against IMGT nomenclature. Default is FALSE.

check_sequences

Logical. If TRUE, validates that CDR3 sequences contain only valid amino acids. Default is TRUE.

strict

Logical. If TRUE, stops with an error when validation fails. If FALSE, returns a validation report. Default is FALSE.

Value

If strict=FALSE, returns a list with:

valid: Logical indicating overall validity
errors: Character vector of error messages
warnings: Character vector of warning messages
summary: Summary statistics of the data

If strict=TRUE, returns TRUE invisibly on success or stops with an error.
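The difference between the report and strict modes can be sketched in base R. toy_validate below is a hypothetical stand-in, not the immLynx implementation: the required column names are assumed from the example data, the amino acid check accepts only the 20 canonical residues, and the choice of which problems count as errors versus warnings is an assumption made for illustration.

```r
# Hypothetical validator (not the immLynx implementation): checks required
# columns and that CDR3 sequences use only the 20 canonical amino acids.
toy_validate <- function(tcr_data,
                         required = c("barcode", "cdr3_aa", "v", "j", "chain"),
                         strict = FALSE) {
  errors <- character(0)
  warnings <- character(0)

  missing_cols <- setdiff(required, names(tcr_data))
  if (length(missing_cols) > 0) {
    errors <- c(errors, paste("Missing columns:",
                              paste(missing_cols, collapse = ", ")))
  }

  if ("cdr3_aa" %in% names(tcr_data)) {
    bad <- !grepl("^[ACDEFGHIKLMNPQRSTVWY]+$", tcr_data$cdr3_aa)
    if (any(bad)) {
      warnings <- c(warnings,
                    sprintf("%d CDR3 sequence(s) contain invalid characters",
                            sum(bad)))
    }
  }

  valid <- length(errors) == 0
  if (strict) {
    # Strict mode: fail loudly or return TRUE invisibly
    if (!valid) stop(paste(errors, collapse = "; "), call. = FALSE)
    return(invisible(TRUE))
  }
  list(valid = valid, errors = errors, warnings = warnings,
       summary = list(n_rows = nrow(tcr_data)))
}

tcr_data <- data.frame(
  barcode = c("cell_1", "cell_2"),
  cdr3_aa = c("CASSLGTGELFF", "CASS*RSSYEQYF"),  # second contains "*"
  v = "TRBV7-2", j = "TRBJ2-2", chain = "TRB",
  stringsAsFactors = FALSE
)
report <- toy_validate(tcr_data)
report$valid
report$warnings
```

Report mode suits interactive inspection, since every problem is collected in one list; strict mode suits pipelines, where a malformed input should halt execution immediately.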

Examples

# Create example TCR data
tcr_data <- data.frame(
  barcode = paste0("cell_", 1:5),
  cdr3_aa = c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF",
              "CASNQGLNEKLFF", "CASSLDRNEQFF"),
  v = paste0("TRBV", c("7-2", "12-3", "5-1", "28", "7-9")),
  j = paste0("TRBJ", c("2-2", "1-1", "2-7", "1-5", "2-1")),
  chain = rep("TRB", 5),
  stringsAsFactors = FALSE
)
report <- validateTCRdata(tcr_data, strict = FALSE)
report$valid
report$summary

Package 'immLynx'

Help Index

Convert TCR Data to tcrdist3 Format
Extract TCR Data from SingleCellExperiment Object
Generate Random TCR Sequences using OLGA
Initialize a Hugging Face Model and Tokenizer
Example Single-Cell RNA-seq Data with TCR Information
Get Protein Embeddings from a Model
Run clusTCR Clustering on scRepertoire Data
Generate Protein Language Model Embeddings for TCR Sequences
Perform HLA Association Analysis on Metaclones
Run Metaclonotypist for TCR Metaclone Discovery
Calculate Generation Probability (Pgen) for TCRs in scRepertoire Data
Run soNNia Selection Analysis