--- title: "Getting Started with immLynx" author: "Nick Borcherding" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Getting Started with immLynx} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE, tidy = FALSE ) library(BiocStyle) python_available <- tryCatch({ proc <- basilisk::basiliskStart(immLynx::immLynxEnv) on.exit(basilisk::basiliskStop(proc)) TRUE }, error = function(e) { FALSE }) ``` # Introduction T-cell receptors (TCRs) are heterodimeric surface proteins that mediate adaptive immune recognition. Each TCR consists of an alpha and beta chain (or gamma and delta in a smaller subset of T cells), with the complementarity-determining region 3 (CDR3) serving as the primary determinant of antigen specificity. Advances in single-cell RNA sequencing (scRNA-seq) now enable simultaneous profiling of gene expression and TCR sequences at cellular resolution, producing rich datasets that benefit from computational analysis of repertoire diversity, clonal structure, and sequence similarity. A growing ecosystem of Python-based tools has emerged for TCR repertoire analysis, including tcrdist3 for pairwise distance calculation, OLGA for generation probability estimation, soNNia for selection inference, clusTCR for sequence clustering, metaclonotypist for metaclone discovery, and the ESM-2 family of protein language models for learned sequence representations. While these tools are powerful individually, using them together requires managing multiple Python environments and translating data between frameworks---a barrier for researchers working primarily in R. **immLynx** addresses this challenge by providing a unified R interface to these Python tools through the `r Biocpkg("basilisk")` package, which manages isolated conda environments transparently. All functions accept and return `SingleCellExperiment` objects, ensuring seamless integration with the Bioconductor single-cell ecosystem, including `r Biocpkg("scRepertoire")` for clonotype assignment and `r Biocpkg("immApex")` for TCR data extraction. The package supports the following analytical capabilities: - **TCR distance calculations** via tcrdist3, enabling pairwise distance matrices based on CDR3 sequence, V gene, and J gene usage - **Generation probability (Pgen)** via OLGA, quantifying how likely a given TCR sequence is to arise from V(D)J recombination - **Selection inference** via soNNia, estimating post-thymic selection pressures on observed TCR repertoires - **Sequence clustering** via clusTCR, grouping TCRs by CDR3 similarity using Markov Cluster (MCL) or other algorithms - **Protein language model embeddings** via ESM-2, generating dense vector representations of CDR3 sequences that capture biochemical and structural properties - **Metaclone discovery** via metaclonotypist, identifying groups of related TCR clonotypes that may share antigen specificity This vignette demonstrates the core functionality of immLynx, walking through data preparation, individual analysis functions, and a combined workflow. For advanced use cases including method comparison and custom embedding strategies, see the companion vignette `vignette("advanced_analysis", package = "immLynx")`. # Installation ```{r installation, eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("immLynx") ``` ## Python Dependencies immLynx uses `r Biocpkg("basilisk")` to manage Python dependencies automatically. The first time you call a function that invokes Python, basilisk will create an isolated conda environment containing all required packages (tcrdist3, olga, soNNia, clusTCR, metaclonotypist, and PyTorch with the ESM-2 model). This initial setup may take several minutes but only needs to occur once per installation. Subsequent calls reuse the existing environment with minimal overhead. # Quick Start ## Loading the Package and Example Data We begin by loading immLynx along with `r Biocpkg("scran")` and `r Biocpkg("scater")`, which provide utilities for single-cell analysis and visualization. The `immLynx_example` dataset is a `SingleCellExperiment` object containing scRNA-seq data from multiple patients, with TCR repertoire information added via `r Biocpkg("scRepertoire")`. ```{r load} library(immLynx) library(scran) library(scater) data("immLynx_example") immLynx_example ``` ## Extracting TCR Data Before running downstream analyses, it is often useful to extract TCR data into a data.frame format for inspection and quality control. The `extractTCRdata()` function wraps `immApex::getIR()` and supports both long format (one row per chain) and wide format (one row per cell with columns for each chain). The `validateTCRdata()` function checks for common issues such as missing CDR3 sequences, non-standard amino acids, and inconsistent chain annotations. ```{r extract} tcr_data <- extractTCRdata(immLynx_example, chains = "TRB") head(tcr_data) tcr_wide <- extractTCRdata(immLynx_example, chains = "both", format = "wide") validation <- validateTCRdata(tcr_data) print(validation) ``` ## Summarizing TCR Repertoire The `summarizeTCRrepertoire()` function computes standard repertoire statistics including clonotype frequencies, diversity indices (Shannon entropy, Simpson index, clonality), CDR3 length distributions, and V/J gene usage. These metrics provide a high-level overview of repertoire structure before detailed analysis. ```{r summary} summary <- summarizeTCRrepertoire(immLynx_example, chains = "TRB") print(summary) ``` # Analysis Functions ## TCR Clustering with clusTCR clusTCR groups TCR sequences by CDR3 similarity using the Markov Cluster (MCL) algorithm. The `inflation` parameter controls cluster granularity: higher values produce more, smaller clusters, while lower values yield fewer, larger clusters. Cluster assignments are stored directly in the `colData` of the `SingleCellExperiment` object, making them immediately available for downstream visualization and statistical testing. ```{r clustcr, eval=python_available} sce <- runClustTCR( immLynx_example, chains = "TRB", method = "mcl", inflation = 2.0 ) table(sce$clustcr_TRB) ``` ## TCR Distance Calculations with tcrdist3 tcrdist3 computes pairwise distances between TCR sequences using a substitution-matrix-based metric that accounts for CDR3 amino acid sequence, V gene, and J gene identity. The resulting distance matrix can be used for hierarchical clustering, nearest-neighbor analysis, or as input to dimensionality reduction methods. Note that tcrdist3 deduplicates sequences internally, so the distance matrix dimensions correspond to the number of unique TCR sequences rather than the total number of cells. ```{r tcrdist, eval=python_available} dist_results <- runTCRdist( immLynx_example, chains = "beta", organism = "human" ) dim(dist_results$distances$pw_beta) ``` ## Generation Probability with OLGA OLGA (Optimized Likelihood estimate of immunoGlobulin Amino-acid sequences) computes the generation probability (Pgen) of a TCR sequence---the probability that V(D)J recombination produces that specific CDR3 amino acid sequence. Low Pgen values indicate sequences that are unlikely to arise by chance, which may suggest antigen-driven selection. The `runOLGA()` function adds log10-transformed Pgen values to the `colData` of the input object. ```{r olga, eval=python_available} sce <- runOLGA( immLynx_example, chains = "TRB", model = "humanTRB" ) hist(log10(sce$olga_pgen_TRB), breaks = 50) ``` The `generateOLGA()` function can also produce synthetic TCR sequences sampled from the recombination model, which is useful for constructing background distributions or benchmarking analyses. ```{r generate, eval=python_available} random_seqs <- generateOLGA(n = 100, model = "humanTRB") head(random_seqs) ``` ## Protein Language Model Embeddings Protein language models such as ESM-2 learn dense vector representations of amino acid sequences from large-scale protein databases. When applied to CDR3 sequences, these embeddings capture biochemical and structural properties that complement sequence-based distance metrics. The `runEmbeddings()` function generates per-sequence embeddings and stores them as a dimensionality reduction in the `SingleCellExperiment` object, enabling visualization with standard tools such as UMAP or PCA. Multiple pooling strategies (mean, cls, max) are available to aggregate per-residue representations into a fixed-length vector. ```{r embeddings, eval=FALSE} sce <- runEmbeddings( immLynx_example, chains = "TRB", model_name = "facebook/esm2_t12_35M_UR50D", pool = "mean" ) sce <- scater::runUMAP(sce, dimred = "tcr_esm") scater::plotReducedDim(sce, dimred = "UMAP") ``` ## Metaclone Discovery with Metaclonotypist Metaclonotypist identifies metaclones---groups of TCR clonotypes that share enough sequence similarity to potentially recognize the same antigen. This approach extends traditional exact-match clonotyping by allowing controlled levels of CDR3 edit distance and tcrdist-based similarity. The `max_edits` and `max_dist` parameters control the stringency of metaclone grouping. ```{r metaclonotypist, eval=python_available} sce <- runMetaclonotypist( immLynx_example, chains = "beta", method = "tcrdist", max_edits = 2, max_dist = 20 ) table(sce$metaclone) ``` ## Selection Inference with soNNia soNNia uses a neural network to model post-thymic selection by comparing observed TCR sequences against a background distribution generated by V(D)J recombination. The model estimates a selection factor for each sequence, quantifying how much more (or less) likely a sequence is to appear in the observed repertoire relative to the recombination baseline. This requires a background sequence file, which can be generated using `generateOLGA()`. ```{r sonnia, eval=FALSE} background <- generateOLGA(n = 10000, model = "humanTRB") write.csv(background, "background.csv", row.names = FALSE) sce <- runSoNNia( immLynx_example, chains = "TRB", background_file = "background.csv" ) ``` # Workflow Example The following example demonstrates a combined workflow that integrates multiple immLynx analyses on a single dataset. By storing all results in the same `SingleCellExperiment` object, different analytical layers (clustering, generation probability, embeddings) can be compared and visualized together. ```{r workflow, eval=FALSE} library(immLynx) library(scran) library(scater) library(ggplot2) data("immLynx_example") # Summarize the repertoire summary <- summarizeTCRrepertoire(immLynx_example) print(summary) # Cluster TCRs by CDR3 similarity immLynx_example <- runClustTCR( immLynx_example, chains = "TRB", method = "mcl" ) # Calculate generation probability immLynx_example <- runOLGA( immLynx_example, chains = "TRB" ) # Generate protein language model embeddings immLynx_example <- runEmbeddings( immLynx_example, chains = "TRB" ) # Visualize embeddings colored by cluster assignment immLynx_example <- scater::runUMAP( immLynx_example, dimred = "tcr_esm", name = "tcr_umap" ) scater::plotReducedDim( immLynx_example, dimred = "tcr_umap", colour_by = "clustcr_TRB" ) # Visualize embeddings colored by generation probability scater::plotReducedDim( immLynx_example, dimred = "tcr_umap", colour_by = "olga_pgen_log10_TRB" ) ``` # Session Info ```{r session, eval=TRUE} sessionInfo() ```