| Title: | Linking Advanced TCR Python Pipelines and Hugging Face Models in R |
|---|---|
| Description: | A comprehensive toolkit that bridges popular Python-based immune repertoire analysis tools and Hugging Face protein language models into the R environment. Provides unified interfaces for TCR distance calculations (tcrdist3), sequence generation probability (OLGA), selection inference (soNNia), clustering (clusTCR), protein embeddings (ESM-2), metaclone discovery (metaclonotypist). Fully compatible with the scRepertoire and immApex ecosystem for single-cell immune repertoire analysis. |
| Authors: | Nick Borcherding [aut, cre] (ORCID: <https://orcid.org/0000-0003-1427-6342>) |
| Maintainer: | Nick Borcherding <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.1.0 |
| Built: | 2026-05-30 09:39:57 UTC |
| Source: | https://github.com/bioc/immLynx |
Converts TCR data from immLynx/scRepertoire format to the format required by tcrdist3 for distance calculations.
convertToTcrdist( tcr_data, chains = c("beta", "alpha", "both"), include_count = TRUE )convertToTcrdist( tcr_data, chains = c("beta", "alpha", "both"), include_count = TRUE )
tcr_data |
A data.frame from extractTCRdata() or similar source |
chains |
Which chains to include: "alpha", "beta", or "both". Default is "beta". |
include_count |
Logical. If TRUE, adds a 'count' column (default 1). Default is TRUE. |
A data.frame in tcrdist3 format with columns:
Clone count (default 1)
Alpha V gene (if alpha chain included)
Alpha J gene (if alpha chain included)
Alpha CDR3 amino acid sequence (if alpha chain included)
Beta V gene (if beta chain included)
Beta J gene (if beta chain included)
Beta CDR3 amino acid sequence (if beta chain included)
# Convert long-format TCR data to tcrdist3 format tcr_data <- data.frame( barcode = paste0("cell_", 1:5), cdr3_aa = c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF", "CASNQGLNEKLFF", "CASSLDRNEQFF"), v = paste0("TRBV", c("7-2", "12-3", "5-1", "28", "7-9")), j = paste0("TRBJ", c("2-2", "1-1", "2-7", "1-5", "2-1")), chain = rep("TRB", 5), stringsAsFactors = FALSE ) tcrdist_format <- convertToTcrdist(tcr_data, chains = "beta") head(tcrdist_format)# Convert long-format TCR data to tcrdist3 format tcr_data <- data.frame( barcode = paste0("cell_", 1:5), cdr3_aa = c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF", "CASNQGLNEKLFF", "CASSLDRNEQFF"), v = paste0("TRBV", c("7-2", "12-3", "5-1", "28", "7-9")), j = paste0("TRBJ", c("2-2", "1-1", "2-7", "1-5", "2-1")), chain = rep("TRB", 5), stringsAsFactors = FALSE ) tcrdist_format <- convertToTcrdist(tcr_data, chains = "beta") head(tcrdist_format)
Extracts T-cell receptor data from a SingleCellExperiment object that has been
processed with scRepertoire. This is a convenience wrapper around
immApex::getIR() that provides additional formatting options.
extractTCRdata( input, chains = c("TRB", "TRA", "TRG", "TRD", "IGH", "IGL", "IGK", "both"), format = c("long", "wide"), remove_na = TRUE )extractTCRdata( input, chains = c("TRB", "TRA", "TRG", "TRD", "IGH", "IGL", "IGK", "both"), format = c("long", "wide"), remove_na = TRUE )
input |
A SingleCellExperiment object containing scRepertoire TCR data in metadata. |
chains |
Which chains to extract: "TRA", "TRB", "TRG", "TRD", "IGH", "IGL", "IGK", or "both" (for TRA and TRB). Default is "TRB". |
format |
Output format: "long" (one row per chain) or "wide" (one row per cell with columns for each chain). Default is "long". |
remove_na |
Logical. If TRUE, removes rows with NA CDR3 sequences. Default is TRUE. |
A data.frame containing:
Cell barcode
CDR3 amino acid sequence
CDR3 nucleotide sequence (if available)
V gene
D gene (if applicable)
J gene
C gene (if available)
Chain type (TRA, TRB, etc.)
# Extract beta chain data data(immLynx_example) tcr_data <- extractTCRdata(immLynx_example, chains = "TRB") head(tcr_data)# Extract beta chain data data(immLynx_example) tcr_data <- extractTCRdata(immLynx_example, chains = "TRB") head(tcr_data)
Generate random TCR sequences from OLGA's generative model.
generateOLGA(n = 100, model = "humanTRB")generateOLGA(n = 100, model = "humanTRB")
n |
Number of sequences to generate. |
model |
OLGA model to use: "humanTRB", "humanTRA", "humanIGH", "mouseTRB". |
Data.frame with generated sequences (nt_seq, aa_seq, v_index, j_index).
# Available models models <- c("humanTRB", "humanTRA", "humanIGH", "mouseTRB") # Generate 100 random human TRB sequences random_seqs <- generateOLGA(n = 100, model = "humanTRB") head(random_seqs) # Generate human TRA sequences tra_seqs <- generateOLGA(n = 50, model = "humanTRA") # Generate mouse TRB sequences mouse_seqs <- generateOLGA(n = 200, model = "mouseTRB")# Available models models <- c("humanTRB", "humanTRA", "humanIGH", "mouseTRB") # Generate 100 random human TRB sequences random_seqs <- generateOLGA(n = 100, model = "humanTRB") head(random_seqs) # Generate human TRA sequences tra_seqs <- generateOLGA(n = 50, model = "humanTRA") # Generate mouse TRB sequences mouse_seqs <- generateOLGA(n = 200, model = "mouseTRB")
Downloads or loads a cached pre-trained model and its
corresponding tokenizer from the Hugging Face Hub. Uses the basilisk-managed
Python environment which includes transformers and torch.
huggingModel(model_name = "facebook/esm2_t12_35M_UR50D")huggingModel(model_name = "facebook/esm2_t12_35M_UR50D")
model_name |
A string specifying the model identifier from the Hugging Face Hub. For ESM-2 35M, this is "facebook/esm2_t12_35M_UR50D". Other options include "facebook/esm2_t33_650M_UR50D" for larger models. |
A list containing the R-wrapped Python objects for the 'model'
and 'tokenizer', along with the basilisk process handle. The process
must be stopped when no longer needed via basilisk::basiliskStop().
tokenizeSequences, proteinEmbeddings,
runEmbeddings
# Default model is ESM-2 35M model_name <- "facebook/esm2_t12_35M_UR50D" # Load the default ESM-2 35M model hf_components <- huggingModel(model_name) names(hf_components) # "model", "tokenizer", "proc" # Use with tokenizeSequences and proteinEmbeddings sequences <- c("CASSLGTGELFF", "CASSIRSSYEQYF") tokenized <- tokenizeSequences(hf_components$tokenizer, sequences) embeddings <- proteinEmbeddings(hf_components$model, tokenized, pool = "mean", chunk_size = 32) # Clean up the basilisk process when done basilisk::basiliskStop(hf_components$proc)# Default model is ESM-2 35M model_name <- "facebook/esm2_t12_35M_UR50D" # Load the default ESM-2 35M model hf_components <- huggingModel(model_name) names(hf_components) # "model", "tokenizer", "proc" # Use with tokenizeSequences and proteinEmbeddings sequences <- c("CASSLGTGELFF", "CASSIRSSYEQYF") tokenized <- tokenizeSequences(hf_components$tokenizer, sequences) embeddings <- proteinEmbeddings(hf_components$model, tokenized, pool = "mean", chunk_size = 32) # Clean up the basilisk process when done basilisk::basiliskStop(hf_components$proc)
An SCE object containing single-cell RNA-seq data from multiple patients with integrated T-cell receptor (TCR) repertoire data from scRepertoire. This dataset is useful for demonstrating the functionality of immLynx analysis functions.
data(immLynx_example)data(immLynx_example)
An SCE object with the following components:
RNA expression data
Cell-level metadata including:
orig.ident: Original sample identifier (e.g., "P17B", "P17L")
nCount_RNA: Total RNA counts per cell
nFeature_RNA: Number of detected genes per cell
CTgene: TCR gene information (V/J genes)
CTnt: TCR CDR3 nucleotide sequences
CTaa: TCR CDR3 amino acid sequences
CTstrict: Strict TCR identifier combining gene and sequence
clonalFrequency: Number of cells sharing the same TCR
clonalProportion: Proportion of cells with the same TCR
cloneSize: Categorical clone size (Small, Medium, Large, Hyperexpanded)
Patient: Patient identifier (P17, P18, P19, P20)
Type: Sample type (B = Blood, L = Lymph node)
clusters: Cell cluster assignments
This dataset was created by combining 10X Genomics single-cell gene expression and VDJ sequencing data from 8 samples across 4 patients. Each patient contributed two samples: blood (B) and lymph node (L) tissue. More information on the data can be found in the following [manuscript](https://pubmed.ncbi.nlm.nih.gov/33622974/).
data(immLynx_example) immLynx_exampledata(immLynx_example) immLynx_example
Applies a pre-trained model to a batch of tokenized sequences
to generate embeddings. This is the core embedding function used by
runEmbeddings.
proteinEmbeddings( model, tokenized.batch, pool = c("none", "mean", "cls"), chunk_size = NULL, prefer_dtype = c("float16", "bfloat16", "float32"), prefer_device = c("auto", "cuda", "mps", "cpu") )proteinEmbeddings( model, tokenized.batch, pool = c("none", "mean", "cls"), chunk_size = NULL, prefer_dtype = c("float16", "bfloat16", "float32"), prefer_device = c("auto", "cuda", "mps", "cpu") )
model |
HF model (from AutoModel or similar), typically obtained via
|
tokenized.batch |
A *list* of tokenized tensors OR a list of such lists
(i.e., already minibatched). If you pass a single big batch, set chunk_size.
Typically obtained via |
pool |
One of "mean", "cls", or "none". "mean" is recommended for sequence-level embeddings. |
chunk_size |
If tokenized.batch is a single big batch, split it into chunks of this many sequences. Ignored if you pre-batched upstream. |
prefer_dtype |
One of "float16", "bfloat16", "float32". Lower precision uses less memory but may reduce accuracy. |
prefer_device |
One of "auto", "cuda", "mps", "cpu". "auto" will select the best available device. |
An R matrix [n_sequences x hidden] if pool != "none". If pool == "none", returns a list of arrays per chunk.
runEmbeddings for a higher-level wrapper that works
directly with SingleCellExperiment objects.
sequences <- c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF") # Full workflow: load model, tokenize, embed hf_components <- huggingModel() tokenized <- tokenizeSequences(hf_components$tokenizer, sequences) # Mean pooling (recommended for sequence-level tasks) embeddings <- proteinEmbeddings(hf_components$model, tokenized, pool = "mean", chunk_size = 32) dim(embeddings) # [n_sequences x hidden_dim] # CLS token embedding cls_emb <- proteinEmbeddings(hf_components$model, tokenized, pool = "cls", chunk_size = 32) # Per-token embeddings (no pooling) token_emb <- proteinEmbeddings(hf_components$model, tokenized, pool = "none", chunk_size = 32) # Use GPU with half precision for speed embeddings_gpu <- proteinEmbeddings( hf_components$model, tokenized, pool = "mean", chunk_size = 64, prefer_device = "cuda", prefer_dtype = "float16")sequences <- c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF") # Full workflow: load model, tokenize, embed hf_components <- huggingModel() tokenized <- tokenizeSequences(hf_components$tokenizer, sequences) # Mean pooling (recommended for sequence-level tasks) embeddings <- proteinEmbeddings(hf_components$model, tokenized, pool = "mean", chunk_size = 32) dim(embeddings) # [n_sequences x hidden_dim] # CLS token embedding cls_emb <- proteinEmbeddings(hf_components$model, tokenized, pool = "cls", chunk_size = 32) # Per-token embeddings (no pooling) token_emb <- proteinEmbeddings(hf_components$model, tokenized, pool = "none", chunk_size = 32) # Use GPU with half precision for speed embeddings_gpu <- proteinEmbeddings( hf_components$model, tokenized, pool = "mean", chunk_size = 64, prefer_device = "cuda", prefer_dtype = "float16")
This function extracts TCR sequences from a SingleCellExperiment object with scRepertoire data and performs clustering using the clusTCR algorithm.
runClustTCR( input, chains = c("TRB", "TRA", "both"), method = "mcl", combine_chains = FALSE, return_object = TRUE, column_prefix = "clustcr", ... )runClustTCR( input, chains = c("TRB", "TRA", "both"), method = "mcl", combine_chains = FALSE, return_object = TRUE, column_prefix = "clustcr", ... )
input |
A SingleCellExperiment object containing scRepertoire TCR data in the metadata. |
chains |
Character string specifying which chains to use: "TRA", "TRB", or "both". Default is "TRB". |
method |
Clustering method passed to clusTCR. Default is "mcl" (Markov Clustering), which is accurate for typical repertoire datasets. |
combine_chains |
Logical. If TRUE and chains="both", concatenates alpha and beta sequences with "_". Default is FALSE (clusters chains separately). |
return_object |
Logical. If TRUE, adds cluster assignments back to the input object metadata. If FALSE, returns only the clustering results. Default is TRUE. |
column_prefix |
Prefix for the new metadata column(s). Default is "clustcr". |
... |
Additional arguments passed to calculate.clustcr (e.g., inflation). |
If return_object=TRUE, returns the input object with cluster assignments added to metadata. If return_object=FALSE, returns a data.frame with barcodes and cluster assignments.
data(immLynx_example) # Cluster TRB chain using MCL algorithm sce <- runClustTCR(immLynx_example, chains = "TRB") # Adjust MCL inflation parameter sce <- runClustTCR(immLynx_example, chains = "TRB", inflation = 3.0) # Cluster both chains separately sce <- runClustTCR(immLynx_example, chains = "both") # Combine alpha and beta chains before clustering sce <- runClustTCR(immLynx_example, chains = "both", combine_chains = TRUE) # Get results as data.frame clusters_df <- runClustTCR(immLynx_example, chains = "TRB", return_object = FALSE)data(immLynx_example) # Cluster TRB chain using MCL algorithm sce <- runClustTCR(immLynx_example, chains = "TRB") # Adjust MCL inflation parameter sce <- runClustTCR(immLynx_example, chains = "TRB", inflation = 3.0) # Cluster both chains separately sce <- runClustTCR(immLynx_example, chains = "both") # Combine alpha and beta chains before clustering sce <- runClustTCR(immLynx_example, chains = "both", combine_chains = TRUE) # Get results as data.frame clusters_df <- runClustTCR(immLynx_example, chains = "TRB", return_object = FALSE)
Extracts TCR CDR3 sequences from a SingleCellExperiment object and generates embeddings using a protein language model (e.g., ESM-2).
runEmbeddings( input, chains = c("TRB", "TRA", "both"), model_name = "facebook/esm2_t12_35M_UR50D", pool = "mean", chunk_size = 32, reduction_name = "tcr_esm", reduction_key = "ESM_", return_object = TRUE, ... )runEmbeddings( input, chains = c("TRB", "TRA", "both"), model_name = "facebook/esm2_t12_35M_UR50D", pool = "mean", chunk_size = 32, reduction_name = "tcr_esm", reduction_key = "ESM_", return_object = TRUE, ... )
input |
A SingleCellExperiment object containing scRepertoire TCR data. |
chains |
Which chain(s) to embed: "TRB", "TRA", or "both". Default is "TRB". |
model_name |
Hugging Face model name. Default is "facebook/esm2_t12_35M_UR50D". Other options: "facebook/esm2_t33_650M_UR50D", "facebook/esm2_t36_3B_UR50D" |
pool |
Pooling method: "mean", "cls", or "none". Default is "mean". |
chunk_size |
Number of sequences to process at once. Default is 32. |
reduction_name |
Name for the dimensional reduction. Default is "tcr_esm". |
reduction_key |
Key prefix for embeddings in reduction. Default is "ESM_". |
return_object |
Logical. If TRUE, adds embeddings as dimensional reduction. If FALSE, returns list with embeddings and metadata. Default is TRUE. |
... |
Additional arguments passed to proteinEmbeddings(). |
This function uses protein language models to generate dense vector representations of TCR CDR3 sequences. These embeddings can be used for: - Dimensionality reduction and visualization - Clustering TCRs by sequence similarity - Downstream machine learning tasks
If return_object=TRUE, returns input object with embeddings added as a dimensional reduction. If FALSE, returns list with embeddings matrix and metadata.
data(immLynx_example) # Generate ESM-2 embeddings for TRB chain sce <- runEmbeddings(immLynx_example, chains = "TRB") # Use a larger ESM-2 model sce <- runEmbeddings(immLynx_example, chains = "TRB", model_name = "facebook/esm2_t33_650M_UR50D") # Embed both chains together sce <- runEmbeddings(immLynx_example, chains = "both") # Use CLS pooling instead of mean pooling sce <- runEmbeddings(immLynx_example, chains = "TRB", pool = "cls") # Get raw embeddings as a list emb_list <- runEmbeddings(immLynx_example, chains = "TRB", return_object = FALSE) dim(emb_list$embeddings)data(immLynx_example) # Generate ESM-2 embeddings for TRB chain sce <- runEmbeddings(immLynx_example, chains = "TRB") # Use a larger ESM-2 model sce <- runEmbeddings(immLynx_example, chains = "TRB", model_name = "facebook/esm2_t33_650M_UR50D") # Embed both chains together sce <- runEmbeddings(immLynx_example, chains = "both") # Use CLS pooling instead of mean pooling sce <- runEmbeddings(immLynx_example, chains = "TRB", pool = "cls") # Get raw embeddings as a list emb_list <- runEmbeddings(immLynx_example, chains = "TRB", return_object = FALSE) dim(emb_list$embeddings)
Tests associations between metaclone membership and HLA alleles using Fisher's exact test with FDR correction.
runHLAassociation( metaclone_data, hla_data, by = "barcode", fdr_threshold = 0.05 )runHLAassociation( metaclone_data, hla_data, by = "barcode", fdr_threshold = 0.05 )
metaclone_data |
A data.frame with metaclone assignments, typically from runMetaclonotypist() with return_input=FALSE. |
hla_data |
A data.frame with HLA typing information. Must have a 'barcode' or 'sample' column for matching and columns for each HLA allele. |
by |
Column name to use for matching between datasets. Default is "barcode". |
fdr_threshold |
FDR threshold for significance. Default is 0.05. |
A data.frame with HLA association results:
Metaclone identifier
HLA allele tested
Association odds ratio
Raw p-value from Fisher's exact test
FDR-adjusted p-value
Whether FDR < threshold
# Create example metaclone and HLA data metaclone_data <- data.frame( barcode = paste0("cell_", 1:20), metaclone = rep(c("MC1", "MC2"), each = 10), stringsAsFactors = FALSE ) hla_data <- data.frame( barcode = paste0("cell_", 1:20), HLA_A = c(rep("A*02:01", 8), rep("A*01:01", 4), rep("A*02:01", 3), rep("A*01:01", 5)), stringsAsFactors = FALSE ) results <- runHLAassociation(metaclone_data, hla_data)# Create example metaclone and HLA data metaclone_data <- data.frame( barcode = paste0("cell_", 1:20), metaclone = rep(c("MC1", "MC2"), each = 10), stringsAsFactors = FALSE ) hla_data <- data.frame( barcode = paste0("cell_", 1:20), HLA_A = c(rep("A*02:01", 8), rep("A*01:01", 4), rep("A*02:01", 3), rep("A*01:01", 5)), stringsAsFactors = FALSE ) results <- runHLAassociation(metaclone_data, hla_data)
Identifies TCR metaclones (groups of related T cell receptors) using the metaclonotypist pipeline. Metaclonotypist uses a two-stage approach: fast edit-distance-based screening followed by TCRdist or SCEPTR refinement.
runMetaclonotypist( input, chains = c("beta", "alpha"), method = c("tcrdist", "sceptr"), max_edits = 2, max_dist = NULL, clustering = c("cc", "leiden", "louvain", "mcl"), resolution = 1, return_input = TRUE, column_name = "metaclone" )runMetaclonotypist( input, chains = c("beta", "alpha"), method = c("tcrdist", "sceptr"), max_edits = 2, max_dist = NULL, clustering = c("cc", "leiden", "louvain", "mcl"), resolution = 1, return_input = TRUE, column_name = "metaclone" )
input |
A SingleCellExperiment object with scRepertoire data, or a data.frame with TCR data. |
chains |
Which chain to analyze: "alpha" or "beta". Default is "beta". |
method |
Distance metric for refinement: "tcrdist" (default) or "sceptr". |
max_edits |
Maximum CDR3 edit distance for initial screening. Default is 2. |
max_dist |
Maximum distance threshold for clustering. Default is 20 for tcrdist, 1.5 for sceptr. |
clustering |
Clustering algorithm: "cc" (connected components, default), "leiden", "louvain", or "mcl". |
resolution |
Resolution parameter for leiden/louvain clustering. Default is 1.0. |
return_input |
Logical. If TRUE and input is a SingleCellExperiment object, adds metaclone assignments to metadata. Default is TRUE. |
column_name |
Name for the metadata column. Default is "metaclone". |
Metaclonotypist identifies groups of related TCRs that may recognize similar antigens. The algorithm: 1. Uses the Symdel algorithm for fast edit-distance-based candidate identification 2. Refines candidates using TCRdist or SCEPTR similarity metrics 3. Applies graph-based clustering (Leiden by default) to identify metaclones
If return_input=TRUE and input is a SingleCellExperiment, returns the object with metaclone assignments added to metadata. Otherwise returns a data.frame with:
Cell barcode
CDR3 amino acid sequence
Metaclone cluster assignment
Number of cells in the metaclone
Metaclonotypist: https://github.com/qimmuno/metaclonotypist
data(immLynx_example) # Run metaclonotypist on beta chain sce <- runMetaclonotypist(immLynx_example, chains = "beta") # Get results as data.frame instead of adding to object metaclones <- runMetaclonotypist(immLynx_example, return_input = FALSE) # Adjust edit distance threshold sce <- runMetaclonotypist(immLynx_example, max_edits = 3, max_dist = 50)data(immLynx_example) # Run metaclonotypist on beta chain sce <- runMetaclonotypist(immLynx_example, chains = "beta") # Get results as data.frame instead of adding to object metaclones <- runMetaclonotypist(immLynx_example, return_input = FALSE) # Adjust edit distance threshold sce <- runMetaclonotypist(immLynx_example, max_edits = 3, max_dist = 50)
Extracts TCR sequences from a SingleCellExperiment object and calculates their generation probability using OLGA.
runOLGA( input, chains = c("TRB", "TRA"), model = NULL, organism = "human", use_vj_genes = FALSE, return_object = TRUE, column_name = "olga_pgen" )runOLGA( input, chains = c("TRB", "TRA"), model = NULL, organism = "human", use_vj_genes = FALSE, return_object = TRUE, column_name = "olga_pgen" )
input |
A SingleCellExperiment object containing scRepertoire TCR data. |
chains |
Which chain to analyze: "TRA" or "TRB". Default is "TRB". |
model |
OLGA model to use. Options: "humanTRB", "humanTRA", "humanIGH", "mouseTRB". If NULL, will be inferred from organism and chains parameters. |
organism |
Organism: "human" or "mouse". Used if model is NULL. Default is "human". |
use_vj_genes |
Logical. If TRUE, includes V and J gene information in Pgen calculation. Default is FALSE (sequence-only Pgen). |
return_object |
Logical. If TRUE, adds Pgen values to metadata. Default is TRUE. |
column_name |
Name for the metadata column. Default is "olga_pgen". |
If return_object=TRUE, returns input object with Pgen added to metadata. If FALSE, returns data.frame with barcodes, sequences, and Pgen values.
data(immLynx_example) # Calculate Pgen for TRB chain sce <- runOLGA(immLynx_example, chains = "TRB") # Calculate Pgen for TRA chain sce <- runOLGA(immLynx_example, chains = "TRA") # Include V and J gene information sce <- runOLGA(immLynx_example, chains = "TRB", use_vj_genes = TRUE) # Get results as data.frame pgen_df <- runOLGA(immLynx_example, chains = "TRB", return_object = FALSE) # Specify model explicitly for mouse data sce <- runOLGA(immLynx_example, model = "mouseTRB")data(immLynx_example) # Calculate Pgen for TRB chain sce <- runOLGA(immLynx_example, chains = "TRB") # Calculate Pgen for TRA chain sce <- runOLGA(immLynx_example, chains = "TRA") # Include V and J gene information sce <- runOLGA(immLynx_example, chains = "TRB", use_vj_genes = TRUE) # Get results as data.frame pgen_df <- runOLGA(immLynx_example, chains = "TRB", return_object = FALSE) # Specify model explicitly for mouse data sce <- runOLGA(immLynx_example, model = "mouseTRB")
Infer selection pressures on TCRs using soNNia. Requires a background dataset of unselected sequences (generated by OLGA).
runSoNNia( input, chains = c("TRB", "TRA"), background_file, organism = "human", save_folder = "sonia_output", n_epochs = 100, return_object = TRUE )runSoNNia( input, chains = c("TRB", "TRA"), background_file, organism = "human", save_folder = "sonia_output", n_epochs = 100, return_object = TRUE )
input |
A SingleCellExperiment object containing scRepertoire TCR data. |
chains |
Which chain to analyze: "TRB" or "TRA". Default is "TRB". |
background_file |
Path to CSV file with background sequences (from generateOLGA). |
organism |
Organism: "human" or "mouse". Default is "human". |
save_folder |
Directory to save soNNia model. Default is "sonia_output". |
n_epochs |
Number of training epochs. Default is 100. |
return_object |
If TRUE, adds selection scores to metadata. Default is TRUE. |
This function requires a background dataset of unselected TCR sequences, which can be generated using generateOLGA(). The selected sequences are extracted from the input object and compared to this background to infer selection pressures.
If return_object=TRUE, returns input object with selection scores in metadata. Otherwise returns the soNNia results.
data(immLynx_example) ## Not run: # Step 1: Generate background sequences with OLGA background <- generateOLGA(n = 1000, model = "humanTRB") bg_file <- tempfile(fileext = ".csv") utils::write.csv(background, bg_file, row.names = FALSE) # Step 2: Run soNNia selection analysis sce <- runSoNNia(immLynx_example, chains = "TRB", background_file = bg_file) # Get raw results instead of adding to object sonia_results <- runSoNNia(immLynx_example, chains = "TRB", background_file = bg_file, return_object = FALSE) ## End(Not run)data(immLynx_example) ## Not run: # Step 1: Generate background sequences with OLGA background <- generateOLGA(n = 1000, model = "humanTRB") bg_file <- tempfile(fileext = ".csv") utils::write.csv(background, bg_file, row.names = FALSE) # Step 2: Run soNNia selection analysis sce <- runSoNNia(immLynx_example, chains = "TRB", background_file = bg_file) # Get raw results instead of adding to object sonia_results <- runSoNNia(immLynx_example, chains = "TRB", background_file = bg_file, return_object = FALSE) ## End(Not run)
This function extracts TCR sequences from a SingleCellExperiment object with scRepertoire data and calculates pairwise TCR distances using tcrdist3.
runTCRdist( input, chains = "beta", organism = "human", compute_distances = TRUE, add_to_object = FALSE )runTCRdist( input, chains = "beta", organism = "human", compute_distances = TRUE, add_to_object = FALSE )
input |
A SingleCellExperiment object containing scRepertoire TCR data. |
chains |
Character vector specifying chains: "alpha", "beta", or c("alpha", "beta"). Default is "beta". |
organism |
Organism: "human" or "mouse". Default is "human". |
compute_distances |
Logical. Whether to compute full distance matrix. Default TRUE. |
add_to_object |
Logical. If TRUE, attempts to add distance matrix to object (stored in metadata for SCE). Default is FALSE due to large matrix size. |
A list containing:
distances |
Distance matrices (pw_alpha, pw_beta, pw_cdr3_a_aa, pw_cdr3_b_aa) |
barcodes |
Cell barcodes corresponding to matrix rows/columns |
tcr_data |
The formatted TCR data.frame used for analysis |
If add_to_object=TRUE, returns the input object with distances stored.
data(immLynx_example) # Calculate TCR distances for beta chain dist_results <- runTCRdist(immLynx_example, chains = "beta") # Access the distance matrix dist_matrix <- dist_results$distances barcodes <- dist_results$barcodes # Calculate for both chains dist_both <- runTCRdist(immLynx_example, chains = c("alpha", "beta")) # Add distances directly to the object sce <- runTCRdist(immLynx_example, chains = "beta", add_to_object = TRUE)data(immLynx_example) # Calculate TCR distances for beta chain dist_results <- runTCRdist(immLynx_example, chains = "beta") # Access the distance matrix dist_matrix <- dist_results$distances barcodes <- dist_results$barcodes # Calculate for both chains dist_both <- runTCRdist(immLynx_example, chains = c("alpha", "beta")) # Add distances directly to the object sce <- runTCRdist(immLynx_example, chains = "beta", add_to_object = TRUE)
Generates summary statistics for TCR repertoire data including diversity metrics, clonality measures, and sequence characteristics.
summarizeTCRrepertoire( input, chains = c("TRB", "TRA", "both"), group.by = NULL, calculate_diversity = TRUE )summarizeTCRrepertoire( input, chains = c("TRB", "TRA", "both"), group.by = NULL, calculate_diversity = TRUE )
input |
A SingleCellExperiment object with scRepertoire data, or a data.frame from extractTCRdata(). |
chains |
Which chains to summarize: "TRB", "TRA", or "both". Default is "TRB". |
group.by |
Optional metadata column for grouping (SingleCellExperiment objects only). |
calculate_diversity |
Logical. If TRUE, calculates diversity indices. Default is TRUE. |
An object of class TCR_summary containing
summary statistics for the TCR repertoire.
# Summarize from a data.frame tcr_data <- data.frame( barcode = paste0("cell_", 1:10), cdr3_aa = c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSLGTGELFF", "CASSYSTGELFF", "CASSIRSSYEQYF", "CASSLGTGELFF", "CASNQGLNEKLFF", "CASSYSTGELFF", "CASSLGTGELFF", "CASSIRSSYEQYF"), v = rep("TRBV7-2", 10), j = rep("TRBJ2-2", 10), chain = rep("TRB", 10), stringsAsFactors = FALSE ) summary <- summarizeTCRrepertoire(tcr_data) print(summary)# Summarize from a data.frame tcr_data <- data.frame( barcode = paste0("cell_", 1:10), cdr3_aa = c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSLGTGELFF", "CASSYSTGELFF", "CASSIRSSYEQYF", "CASSLGTGELFF", "CASNQGLNEKLFF", "CASSYSTGELFF", "CASSLGTGELFF", "CASSIRSSYEQYF"), v = rep("TRBV7-2", 10), j = rep("TRBJ2-2", 10), chain = rep("TRB", 10), stringsAsFactors = FALSE ) summary <- summarizeTCRrepertoire(tcr_data) print(summary)
Formal S4 class to store summary statistics for a TCR repertoire, including diversity metrics, clonality measures, and sequence characteristics.
## S4 method for signature 'TCR_summary' show(object)## S4 method for signature 'TCR_summary' show(object)
object |
A |
An S4 object of class TCR_summary containing the summary
statistics described in the slots. The show method prints a
formatted summary to the console and returns invisible(NULL).
total_cellsInteger. Total number of cells with TCR data.
unique_clonotypesInteger. Number of unique clonotypes.
clonotype_ratioNumeric. Ratio of unique clonotypes to total cells.
diversityList of diversity indices (Shannon, Simpson, etc.), or NULL.
top_clonesData.frame of the top 10 most frequent clonotypes.
cdr3_lengthList with CDR3 length distribution statistics.
gene_usageList of V and J gene usage data.frames.
chainsCharacter. Chain(s) summarized.
Takes a vector of amino acid sequences and uses a Hugging Face
tokenizer to convert them into numerical input IDs suitable for model input.
The tokenizer should be obtained from huggingModel, which
manages the Python environment via basilisk.
tokenizeSequences( tokenizer, aa_sequences, padding = TRUE, truncation = TRUE, return_tensors = "pt" )tokenizeSequences( tokenizer, aa_sequences, padding = TRUE, truncation = TRUE, return_tensors = "pt" )
tokenizer |
The tokenizer object returned by |
aa_sequences |
A character vector of amino acid sequences (e.g., CDR3 sequences). |
padding |
A logical or string. If TRUE, pads sequences to the length of the longest sequence in the batch. Defaults to TRUE. |
truncation |
A logical. If TRUE, truncates sequences to the model's maximum input length. Defaults to TRUE. |
return_tensors |
A string specifying the format for the returned tensors. "pt" for PyTorch tensors, "tf" for TensorFlow. Defaults to "pt". |
The tokenized output, typically a dictionary-like object containing 'input_ids' and 'attention_mask'.
huggingModel, proteinEmbeddings,
runEmbeddings
sequences <- c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF") # Initialize model and tokenizer hf_components <- huggingModel() # Tokenize CDR3 sequences tokenized <- tokenizeSequences(hf_components$tokenizer, sequences) # Tokenize without padding (variable-length output) tokenized_nopad <- tokenizeSequences( hf_components$tokenizer, sequences, padding = FALSE) # Pass tokenized output to proteinEmbeddings embeddings <- proteinEmbeddings(hf_components$model, tokenized, pool = "mean", chunk_size = 32) # Clean up basilisk::basiliskStop(hf_components$proc)sequences <- c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF") # Initialize model and tokenizer hf_components <- huggingModel() # Tokenize CDR3 sequences tokenized <- tokenizeSequences(hf_components$tokenizer, sequences) # Tokenize without padding (variable-length output) tokenized_nopad <- tokenizeSequences( hf_components$tokenizer, sequences, padding = FALSE) # Pass tokenized output to proteinEmbeddings embeddings <- proteinEmbeddings(hf_components$model, tokenized, pool = "mean", chunk_size = 32) # Clean up basilisk::basiliskStop(hf_components$proc)
Validates that TCR data is in the correct format for immLynx analysis functions. Checks for required columns, valid sequence formats, and gene nomenclature.
validateTCRdata( tcr_data, check_genes = FALSE, check_sequences = TRUE, strict = FALSE )validateTCRdata( tcr_data, check_genes = FALSE, check_sequences = TRUE, strict = FALSE )
tcr_data |
A data.frame containing TCR data |
check_genes |
Logical. If TRUE, validates gene names against IMGT nomenclature. Default is FALSE. |
check_sequences |
Logical. If TRUE, validates that CDR3 sequences contain only valid amino acids. Default is TRUE. |
strict |
Logical. If TRUE, stops with error on validation failure. If FALSE, returns validation report. Default is FALSE. |
If strict=FALSE, returns a list with:
Logical indicating overall validity
Character vector of error messages
Character vector of warning messages
Summary statistics of the data
If strict=TRUE, returns TRUE invisibly on success or stops with error.
# Create example TCR data tcr_data <- data.frame( barcode = paste0("cell_", 1:5), cdr3_aa = c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF", "CASNQGLNEKLFF", "CASSLDRNEQFF"), v = paste0("TRBV", c("7-2", "12-3", "5-1", "28", "7-9")), j = paste0("TRBJ", c("2-2", "1-1", "2-7", "1-5", "2-1")), chain = rep("TRB", 5), stringsAsFactors = FALSE ) report <- validateTCRdata(tcr_data, strict = FALSE) report$valid report$summary# Create example TCR data tcr_data <- data.frame( barcode = paste0("cell_", 1:5), cdr3_aa = c("CASSLGTGELFF", "CASSIRSSYEQYF", "CASSYSTGELFF", "CASNQGLNEKLFF", "CASSLDRNEQFF"), v = paste0("TRBV", c("7-2", "12-3", "5-1", "28", "7-9")), j = paste0("TRBJ", c("2-2", "1-1", "2-7", "1-5", "2-1")), chain = rep("TRB", 5), stringsAsFactors = FALSE ) report <- validateTCRdata(tcr_data, strict = FALSE) report$valid report$summary