---
title: "Getting Started with immLynx"
author: "Nick Borcherding"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Getting Started with immLynx}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  tidy = FALSE
)
library(BiocStyle)

python_available <- tryCatch({
    proc <- basilisk::basiliskStart(immLynx::immLynxEnv)
    on.exit(basilisk::basiliskStop(proc))
    TRUE
}, error = function(e) {
    FALSE
})
```

# Introduction

T-cell receptors (TCRs) are heterodimeric surface proteins that mediate
adaptive immune recognition. Each TCR consists of an alpha and beta chain
(or gamma and delta in a smaller subset of T cells), with the
complementarity-determining region 3 (CDR3) serving as the primary
determinant of antigen specificity. Advances in single-cell RNA
sequencing (scRNA-seq) now enable simultaneous profiling of gene
expression and TCR sequences at cellular resolution, producing rich
datasets that benefit from computational analysis of repertoire diversity,
clonal structure, and sequence similarity.

A growing ecosystem of Python-based tools has emerged for TCR repertoire
analysis, including tcrdist3 for pairwise distance calculation, OLGA for
generation probability estimation, soNNia for selection inference,
clusTCR for sequence clustering, metaclonotypist for metaclone
discovery, and the ESM-2 family of protein language models for learned
sequence representations. While these tools are powerful individually,
using them together requires managing multiple Python environments and
translating data between frameworks---a barrier for researchers working
primarily in R.

**immLynx** addresses this challenge by providing a unified R interface
to these Python tools through the `r Biocpkg("basilisk")` package, which
manages isolated conda environments transparently. All functions accept
and return `SingleCellExperiment` objects, ensuring seamless integration
with the Bioconductor single-cell ecosystem, including
`r Biocpkg("scRepertoire")` for clonotype assignment and
`r Biocpkg("immApex")` for TCR data extraction. The package supports the
following analytical capabilities:

- **TCR distance calculations** via tcrdist3, enabling pairwise distance
  matrices based on CDR3 sequence, V gene, and J gene usage
- **Generation probability (Pgen)** via OLGA, quantifying how likely a
  given TCR sequence is to arise from V(D)J recombination
- **Selection inference** via soNNia, estimating post-thymic selection
  pressures on observed TCR repertoires
- **Sequence clustering** via clusTCR, grouping TCRs by CDR3 similarity
  using Markov Cluster (MCL) or other algorithms
- **Protein language model embeddings** via ESM-2, generating dense
  vector representations of CDR3 sequences that capture biochemical and
  structural properties
- **Metaclone discovery** via metaclonotypist, identifying groups of
  related TCR clonotypes that may share antigen specificity

This vignette demonstrates the core functionality of immLynx, walking
through data preparation, individual analysis functions, and a combined
workflow. For advanced use cases including method comparison and custom
embedding strategies, see the companion vignette
`vignette("advanced_analysis", package = "immLynx")`.

# Installation

```{r installation, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("immLynx")
```

## Python Dependencies

immLynx uses `r Biocpkg("basilisk")` to manage Python dependencies
automatically. The first time you call a function that invokes Python,
basilisk will create an isolated conda environment containing all
required packages (tcrdist3, olga, soNNia, clusTCR, metaclonotypist,
and PyTorch with the ESM-2 model). This initial setup may take several
minutes but only needs to occur once per installation. Subsequent calls
reuse the existing environment with minimal overhead.

# Quick Start

## Loading the Package and Example Data

We begin by loading immLynx along with `r Biocpkg("scran")` and
`r Biocpkg("scater")`, which provide utilities for single-cell analysis
and visualization. The `immLynx_example` dataset is a
`SingleCellExperiment` object containing scRNA-seq data from multiple
patients, with TCR repertoire information added via
`r Biocpkg("scRepertoire")`.

```{r load}
library(immLynx)
library(scran)
library(scater)

data("immLynx_example")

immLynx_example
```

## Extracting TCR Data

Before running downstream analyses, it is often useful to extract TCR
data into a data.frame format for inspection and quality control. The
`extractTCRdata()` function wraps `immApex::getIR()` and supports both
long format (one row per chain) and wide format (one row per cell with
columns for each chain). The `validateTCRdata()` function checks for
common issues such as missing CDR3 sequences, non-standard amino acids,
and inconsistent chain annotations.

```{r extract}
tcr_data <- extractTCRdata(immLynx_example,
                           chains = "TRB")
head(tcr_data)

tcr_wide <- extractTCRdata(immLynx_example,
                           chains = "both",
                           format = "wide")

validation <- validateTCRdata(tcr_data)
print(validation)
```

## Summarizing TCR Repertoire

The `summarizeTCRrepertoire()` function computes standard repertoire
statistics including clonotype frequencies, diversity indices (Shannon
entropy, Simpson index, clonality), CDR3 length distributions, and V/J
gene usage. These metrics provide a high-level overview of repertoire
structure before detailed analysis.

```{r summary}
summary <- summarizeTCRrepertoire(immLynx_example,
                                  chains = "TRB")
print(summary)
```

# Analysis Functions

## TCR Clustering with clusTCR

clusTCR groups TCR sequences by CDR3 similarity using the Markov Cluster
(MCL) algorithm. The `inflation` parameter controls cluster granularity:
higher values produce more, smaller clusters, while lower values yield
fewer, larger clusters. Cluster assignments are stored directly in the
`colData` of the `SingleCellExperiment` object, making them immediately
available for downstream visualization and statistical testing.

```{r clustcr, eval=python_available}
sce <- runClustTCR(
  immLynx_example,
  chains = "TRB",
  method = "mcl",
  inflation = 2.0
)

table(sce$clustcr_TRB)
```

## TCR Distance Calculations with tcrdist3

tcrdist3 computes pairwise distances between TCR sequences using a
substitution-matrix-based metric that accounts for CDR3 amino acid
sequence, V gene, and J gene identity. The resulting distance matrix can
be used for hierarchical clustering, nearest-neighbor analysis, or as
input to dimensionality reduction methods. Note that tcrdist3
deduplicates sequences internally, so the distance matrix dimensions
correspond to the number of unique TCR sequences rather than the total
number of cells.

```{r tcrdist, eval=python_available}
dist_results <- runTCRdist(
  immLynx_example,
  chains = "beta",
  organism = "human"
)

dim(dist_results$distances$pw_beta)
```

## Generation Probability with OLGA

OLGA (Optimized Likelihood estimate of immunoGlobulin Amino-acid
sequences) computes the generation probability (Pgen) of a TCR
sequence---the probability that V(D)J recombination produces that
specific CDR3 amino acid sequence. Low Pgen values indicate sequences
that are unlikely to arise by chance, which may suggest antigen-driven
selection. The `runOLGA()` function adds log10-transformed Pgen values
to the `colData` of the input object.

```{r olga, eval=python_available}
sce <- runOLGA(
  immLynx_example,
  chains = "TRB",
  model = "humanTRB"
)

hist(log10(sce$olga_pgen_TRB), breaks = 50)
```

The `generateOLGA()` function can also produce synthetic TCR sequences
sampled from the recombination model, which is useful for constructing
background distributions or benchmarking analyses.

```{r generate, eval=python_available}
random_seqs <- generateOLGA(n = 100,
                            model = "humanTRB")
head(random_seqs)
```

## Protein Language Model Embeddings

Protein language models such as ESM-2 learn dense vector representations
of amino acid sequences from large-scale protein databases. When applied
to CDR3 sequences, these embeddings capture biochemical and structural
properties that complement sequence-based distance metrics. The
`runEmbeddings()` function generates per-sequence embeddings and stores
them as a dimensionality reduction in the `SingleCellExperiment` object,
enabling visualization with standard tools such as UMAP or PCA. Multiple
pooling strategies (mean, cls, max) are available to aggregate
per-residue representations into a fixed-length vector.

```{r embeddings, eval=FALSE}
sce <- runEmbeddings(
  immLynx_example,
  chains = "TRB",
  model_name = "facebook/esm2_t12_35M_UR50D",
  pool = "mean"
)

sce <- scater::runUMAP(sce,
                       dimred = "tcr_esm")
scater::plotReducedDim(sce, dimred = "UMAP")
```

## Metaclone Discovery with Metaclonotypist

Metaclonotypist identifies metaclones---groups of TCR clonotypes that
share enough sequence similarity to potentially recognize the same
antigen. This approach extends traditional exact-match clonotyping by
allowing controlled levels of CDR3 edit distance and tcrdist-based
similarity. The `max_edits` and `max_dist` parameters control the
stringency of metaclone grouping.

```{r metaclonotypist, eval=python_available}
sce <- runMetaclonotypist(
  immLynx_example,
  chains = "beta",
  method = "tcrdist",
  max_edits = 2,
  max_dist = 20
)

table(sce$metaclone)
```

## Selection Inference with soNNia

soNNia uses a neural network to model post-thymic selection by comparing
observed TCR sequences against a background distribution generated by
V(D)J recombination. The model estimates a selection factor for each
sequence, quantifying how much more (or less) likely a sequence is to
appear in the observed repertoire relative to the recombination
baseline. This requires a background sequence file, which can be
generated using `generateOLGA()`.

```{r sonnia, eval=FALSE}
background <- generateOLGA(n = 10000, model = "humanTRB")
write.csv(background, "background.csv", row.names = FALSE)

sce <- runSoNNia(
  immLynx_example,
  chains = "TRB",
  background_file = "background.csv"
)
```

# Workflow Example

The following example demonstrates a combined workflow that integrates
multiple immLynx analyses on a single dataset. By storing all results in
the same `SingleCellExperiment` object, different analytical layers
(clustering, generation probability, embeddings) can be compared and
visualized together.

```{r workflow, eval=FALSE}
library(immLynx)
library(scran)
library(scater)
library(ggplot2)

data("immLynx_example")

# Summarize the repertoire
summary <- summarizeTCRrepertoire(immLynx_example)
print(summary)

# Cluster TCRs by CDR3 similarity
immLynx_example <- runClustTCR(
  immLynx_example,
  chains = "TRB",
  method = "mcl"
)

# Calculate generation probability
immLynx_example <- runOLGA(
  immLynx_example,
  chains = "TRB"
)

# Generate protein language model embeddings
immLynx_example <- runEmbeddings(
  immLynx_example,
  chains = "TRB"
)

# Visualize embeddings colored by cluster assignment
immLynx_example <- scater::runUMAP(
  immLynx_example,
  dimred = "tcr_esm",
  name = "tcr_umap"
)

scater::plotReducedDim(
  immLynx_example,
  dimred = "tcr_umap",
  colour_by = "clustcr_TRB"
)

# Visualize embeddings colored by generation probability
scater::plotReducedDim(
  immLynx_example,
  dimred = "tcr_umap",
  colour_by = "olga_pgen_log10_TRB"
)
```

# Session Info

```{r session, eval=TRUE}
sessionInfo()
```