---
title: "Getting Started with immGLIPH"
author:
  - name: Nick Borcherding
    affiliation: Washington University in St. Louis
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 3
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Getting Started with immGLIPH}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
library(BiocStyle)
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  tidy = FALSE,
  warning = FALSE,
  fig.width = 7,
  fig.height = 5
)
```

# Introduction

immGLIPH provides an R implementation of the GLIPH (Grouping of Lymphocyte
Interactions by Paratope Hotspots) and GLIPH2 algorithms for clustering T cell
receptors (TCRs) that are predicted to bind the same HLA-restricted peptide
antigen.

The package identifies TCR specificity groups by detecting statistically
enriched CDR3$\beta$ motifs (local similarity) and structurally similar
CDR3$\beta$ sequences (global similarity), then clusters them into convergence
groups and scores each group for biological significance.

**immGLIPH is an R implementation of existing algorithms. Users should cite
the original publications:**

- **GLIPH**: Glanville, J. et al. *Identifying specificity groups in the T cell
  receptor repertoire.* Nature 547, 94--98 (2017).
  [doi:10.1038/nature22976](https://doi.org/10.1038/nature22976)

- **GLIPH2**: Huang, H. et al. *Analyzing the Mycobacterium tuberculosis immune
  response by T-cell receptor clustering with GLIPH2 and genome-wide antigen
  screening.* Nature Biotechnology 38, 1194--1202 (2020).
  [doi:10.1038/s41587-020-0505-4](https://doi.org/10.1038/s41587-020-0505-4)

## Installation

immGLIPH can be installed from Bioconductor:

```{r eval=FALSE}
BiocManager::install("immGLIPH")
```

The reference repertoire data (~19 MB) is downloaded automatically the first
time you run `runGLIPH()` and cached locally via
[BiocFileCache](https://bioconductor.org/packages/BiocFileCache/). You can
pre-download the data with:

```{r eval=FALSE}
BiocManager::install("BiocFileCache")
ref <- getGLIPHreference()
```

### Loading Package

```{r}
library(immGLIPH)
```

## Integration with the scRepertoire Ecosystem

immGLIPH integrates with the
[scRepertoire](https://bioconductor.org/packages/scRepertoire/) ecosystem
through [immApex](https://bioconductor.org/packages/immApex/). Both
scRepertoire and immApex are Bioconductor packages and can be installed via
`BiocManager::install()`. This means `runGLIPH()` can directly accept:

- **SingleCellExperiment** objects with TCR information
- **combineTCR()** output lists from scRepertoire
- Standard data frames or character vectors

# Quick Start

## Loading the Example Data

immGLIPH includes built-in example data derived from the scRepertoire
example dataset (Yost et al. 2019):

- **`gliph_input_data`**: A data frame of 365 TRB CDR3 sequences with
  V-gene and patient annotations.
- **`gliph_sce`**: A SingleCellExperiment object with TCR clonotype
  information in `colData`, demonstrating the single-cell workflow.

```{r}
data("gliph_input_data")
head(gliph_input_data)
dim(gliph_input_data)
```

## Input Data Format

`runGLIPH()` accepts a data frame with the following columns:

| Column      | Required | Description |
|:------------|:---------|:------------|
| **CDR3b**   | Yes      | CDR3$\beta$ amino acid sequences |
| **TRBV**    | No       | V-gene usage (e.g., "TRBV5-1") |
| **patient** | No       | Donor/sample identifier |
| **HLA**     | No       | HLA alleles, comma-separated |
| **counts**  | No       | Clone frequency |

Alternative column names are automatically recognized (e.g., `cdr3`, `v_gene`,
`sample`, `clone_count`).

# Working with Single-Cell Data

When working with single-cell immune repertoire data, you can use
scRepertoire to prepare your data and pass the output directly to immGLIPH.

```{r}
library(scRepertoire)
data("contig_list")

# After processing with cellranger/etc, combine contigs
combined <- combineTCR(contig_list[seq_len(2)],
                       samples = c("P1", "P2"))

# Take a small slice so the example runs quickly. In real use, pass
# all samples and rely on the bundled reference downloaded by
# getGLIPHreference().
combined_small <- lapply(combined, function(x) x[seq_len(50), ])

# Use a small custom reference built from the bundled example data
small_ref <- gliph_input_data[, c("CDR3b", "TRBV")]

# Pass scRepertoire output directly to runGLIPH
results_sc <- runGLIPH(combined_small,
                       method     = "gliph2",
                       refdb_beta = small_ref,
                       sim_depth  = 100,
                       n_cores    = 1)
```

For **SingleCellExperiment** objects that already contain TCR metadata (e.g.,
added via `scRepertoire::combineExpression()`), immGLIPH extracts the receptor
data automatically using `immApex::getIR()`. Here is an example using the
bundled `gliph_sce` dataset:

```{r}
library(SingleCellExperiment)
data("gliph_sce")

# SingleCellExperiment object with TCR info in colData
results_sce <- runGLIPH(gliph_sce,
                        method     = "gliph2",
                        chains     = "TRB",
                        refdb_beta = small_ref,
                        sim_depth  = 100,
                        n_cores    = 1)
```

# The `runGLIPH()` Function

## Key Arguments

| Argument | Default | Description |
|:---------|:--------|:------------|
| `cdr3_sequences` | -- | Input data (data frame, vector, SCE, or list) |
| `method` | `"gliph2"` | Algorithm preset: `"gliph1"`, `"gliph2"`, or `"custom"` |
| `sim_depth` | 1000 | Simulation depth (higher = more reproducible, slower) |
| `n_cores` | 1 | Number of parallel cores |
| `refdb_beta` | `"human_v2.0_CD48"` | Reference database (built-in name or custom data frame) |
| `local_similarities` | `TRUE` | Search for local (motif-based) similarities |
| `global_similarities` | `TRUE` | Search for global (structural) similarities |
| `structboundaries` | `TRUE` | Trim structural boundaries before comparison |
| `accept_CF` | `TRUE` | Filter to sequences starting with C, ending with F |

## Method Presets

The `method` parameter configures a coordinated set of algorithm choices:

| Setting | `"gliph1"` | `"gliph2"` | `"custom"` |
|:--------|:-----------|:-----------|:-----------|
| Local method | Repeated random sampling | Fisher's exact test | User-defined |
| Global method | Hamming distance cutoff | Struct-based + Fisher | User-defined |
| Clustering | Connected components | Per-motif isolated | User-defined |
| Scoring | GLIPH1 formula | GLIPH2 formula | User-defined |

## Running GLIPH2 (Default)

By default `runGLIPH()` downloads a reference repertoire via BiocFileCache.
For this vignette we supply a custom reference data frame directly using
`refdb_beta` to avoid network access:

```{r}
data("gliph_input_data")

# Use the input data itself as a small reference for demonstration
ref_df <- gliph_input_data[, c("CDR3b", "TRBV")]

res_gliph2 <- runGLIPH(
  cdr3_sequences = gliph_input_data[seq_len(200), ],
  method         = "gliph2",
  refdb_beta     = ref_df,
  sim_depth      = 100,
  n_cores        = 1
)
```

## Running GLIPH1

```{r}
res_gliph1 <- runGLIPH(
  cdr3_sequences = gliph_input_data[seq_len(200), ],
  method         = "gliph1",
  refdb_beta     = ref_df,
  sim_depth      = 100,
  n_cores        = 1
)
```

## Understanding the Output

The output is a list with the following elements:

```{r}
names(res_gliph2)
```

### Cluster Properties

The `cluster_properties` data frame summarizes each convergence group with
enrichment scores:

```{r}
head(res_gliph2$cluster_properties)
```

### Cluster Membership

The `cluster_list` is a named list where each element contains the member
TCRs of a convergence group:

```{r}
# Number of convergence groups
length(res_gliph2$cluster_list)

# Members of the first cluster (if any found)
if (length(res_gliph2$cluster_list) > 0) {
  head(res_gliph2$cluster_list[[1]])
}
```

### Motif Enrichment

The `motif_enrichment` element contains the locally enriched motifs:

```{r}
# Significantly enriched motifs
if (!is.null(res_gliph2$motif_enrichment$selected_motifs)) {
  head(res_gliph2$motif_enrichment$selected_motifs)
}
```

### Network Edges

The `connections` data frame contains the edge list representing the TCR
similarity network:

```{r}
if (!is.null(res_gliph2$connections)) {
  head(res_gliph2$connections)
}
```

# Customizing the Analysis

## Using `method = "custom"`

The `"custom"` method allows independent control over each algorithmic
component:

```{r}
res_custom <- runGLIPH(
  cdr3_sequences  = gliph_input_data[seq_len(200), ],
  method          = "custom",
  refdb_beta      = ref_df,
  local_method    = "fisher",    # or "rrs"
  global_method   = "cutoff",    # or "fisher"
  clustering_method = "GLIPH1.0", # or "GLIPH2.0"
  scoring_method  = "GLIPH2.0",  # or "GLIPH1.0"
  sim_depth       = 100,
  n_cores         = 1
)
```

## Adjusting Significance Thresholds

For the Fisher-based local method (GLIPH2), you can adjust:

- **`lcminp`**: Maximum p-value for a motif to be considered enriched
  (default 0.01)
- **`lcminove`**: Minimum fold-enrichment per motif length
  (default `c(1000, 100, 10)` for 2-mers, 3-mers, 4-mers)
- **`kmer_mindepth`**: Minimum motif observations in the sample (default 3)

```{r}
res_strict <- runGLIPH(
  cdr3_sequences = gliph_input_data[seq_len(200), ],
  method         = "gliph2",
  refdb_beta     = ref_df,
  lcminp         = 0.001,            # Stricter p-value
  lcminove       = c(10000, 1000, 100), # Higher fold-change
  sim_depth      = 100,
  n_cores        = 1
)
```

## Choosing a Reference Database

immGLIPH ships with reference repertoires for human and mouse from
the original GLIPH publications. Each is available as CD4, CD8, or combined
(CD48) subsets:

| Name | Species | Version | Subset | Source |
|:-----|:--------|:--------|:-------|:-------|
| `"human_v1.0_CD4"` | Human | v1.0 | CD4 | Glanville et al. (2017) |
| `"human_v1.0_CD8"` | Human | v1.0 | CD8 | Glanville et al. (2017) |
| `"human_v1.0_CD48"` | Human | v1.0 | CD4+CD8 | Glanville et al. (2017) |
| `"human_v2.0_CD4"` | Human | v2.0 | CD4 | Huang et al. (2020) |
| `"human_v2.0_CD8"` | Human | v2.0 | CD8 | Huang et al. (2020) |
| `"human_v2.0_CD48"` | Human | v2.0 | CD4+CD8 | Huang et al. (2020) |
| `"mouse_v1.0_CD4"` | Mouse | v1.0 | CD4 | Huang et al. (2020) |
| `"mouse_v1.0_CD8"` | Mouse | v1.0 | CD8 | Huang et al. (2020) |
| `"mouse_v1.0_CD48"` | Mouse | v1.0 | CD4+CD8 | Huang et al. (2020) |
| `"gliph_reference"` | Human | v1.0 | CD4+CD8 | Legacy alias for `human_v1.0_CD48` |

Each reference includes pre-computed V-gene usage and CDR3 length frequency
distributions, which are automatically used for cluster scoring.

To analyse mouse data, supply mouse CDR3$\beta$ sequences as
`cdr3_sequences` and set `refdb_beta = "mouse_v1.0_CD48"` (or one of the
CD4/CD8 subsets in the table). The reference is fetched and cached the
first time it is requested.

## Using a Custom Reference Database

You can also supply your own reference as a data frame. A minimal
reference is a two-column table of CDR3$\beta$ amino-acid sequences and
their corresponding V-gene names. Here we build a small one from the
bundled `gliph_input_data` for illustration:

```{r}
custom_ref <- gliph_input_data[, c("CDR3b", "TRBV")]
head(custom_ref, 3)

res <- runGLIPH(
  cdr3_sequences = gliph_input_data[seq_len(100), ],
  refdb_beta     = custom_ref,
  method         = "gliph2",
  sim_depth      = 100,
  n_cores        = 1
)
```

# Motif Discovery with `findMotifs()`

The `findMotifs()` function searches for continuous and discontinuous k-mer
motifs in a set of CDR3 sequences. It is used internally by `runGLIPH()` but
can also be called independently.

## Key Arguments

| Argument | Default | Description |
|:---------|:--------|:------------|
| `seqs` | -- | Character vector of CDR3$\beta$ sequences |
| `q` | `2:4` | Motif lengths to search |
| `kmer_mindepth` | `NULL` | Minimum motif count to return |
| `discontinuous` | `FALSE` | Include discontinuous motifs (with one variable position) |

## Example

```{r}
data("gliph_input_data")
sample_seqs <- as.character(gliph_input_data$CDR3b[seq_len(100)])

# Find all 3-mers appearing at least 5 times
motifs <- findMotifs(seqs = sample_seqs, 
                     q = 3, 
                     kmer_mindepth = 5)
head(motifs[order(motifs$V1, decreasing = TRUE), ])
```

Including discontinuous motifs (e.g., `C.S` where `.` is any amino acid):

```{r}
disc_motifs <- findMotifs(seqs          = sample_seqs,
                          q             = 2,
                          kmer_mindepth = 5,
                          discontinuous = TRUE
)
# Show discontinuous motifs (those containing a dot)
disc_only <- disc_motifs[grep("\\.", disc_motifs$motif), ]
head(disc_only[order(disc_only$V1, decreasing = TRUE), ])
```

# Cluster Scoring with `clusterScoring()`

The `clusterScoring()` function evaluates convergence groups using up to five
metrics. This is called automatically by `runGLIPH()`, but can be re-run with
different parameters on existing results.

## Key Arguments

| Argument | Default | Description |
|:---------|:--------|:------------|
| `cluster_list` | -- | Named list of cluster data frames (from `runGLIPH()$cluster_list`) |
| `cdr3_sequences` | -- | Original input data frame |
| `refdb_beta` | `"human_v2.0_CD48"` | Reference database |
| `gliph_version` | 1 | Scoring formula: 1 (GLIPH) or 2 (GLIPH2) |
| `sim_depth` | 1000 | Resampling depth for score estimation |

## Scoring Components

The total score is derived from up to five components (depending on available
data):

1. **`network.size.score`**: Probability of observing a cluster of this size
   by chance
2. **`cdr3.length.score`**: Enrichment of CDR3 length distribution within the
   cluster
3. **`vgene.score`**: Enrichment of V-gene usage (requires TRBV column)
4. **`clonal.expansion.score`**: Enrichment of expanded clones (requires counts
   column)
5. **`hla.score`**: Enrichment of shared HLA alleles among donors (requires
   patient + HLA columns)

## Example

```{r}
# Re-score with GLIPH2 formula
if (length(res_gliph1$cluster_list) > 0) {
  rescored <- clusterScoring(
    cluster_list   = res_gliph1$cluster_list,
    cdr3_sequences = gliph_input_data[seq_len(200), ],
    refdb_beta     = ref_df,
    gliph_version  = 2,
    sim_depth      = 100,
    n_cores        = 1)
  head(rescored)
}
```

# De Novo TCR Generation with `deNovoTCRs()`

The `deNovoTCRs()` function generates artificial CDR3$\beta$ sequences that
resemble the positional amino acid composition of a given convergence group.
This can be used to predict novel TCR sequences with similar binding
characteristics.

## Key Arguments

| Argument | Default | Description |
|:---------|:--------|:------------|
| `convergence_group_tag` | -- | Tag identifying the cluster (from `cluster_properties$tag`) |
| `clustering_output` | `NULL` | Output list from `runGLIPH()` |
| `result_folder` | `""` | Alternative: load from files |
| `sims` | 100,000 | Number of de novo sequences to generate |
| `num_tops` | 1,000 | Return top N highest-scoring sequences |
| `normalization` | `FALSE` | Normalize scores against the reference database |
| `make_figure` | `FALSE` | Plot score vs. rank |

## Example

```{r}
# Generate de novo TCRs for the first convergence group (if any found)
if (length(res_gliph1$cluster_list) > 0) {
  de_novo <- deNovoTCRs(
    convergence_group_tag = names(res_gliph1$cluster_list)[1],
    clustering_output     = res_gliph1,
    refdb_beta            = ref_df,
    sims                  = 10000,
    num_tops              = 100,
    make_figure           = FALSE,
    n_cores               = 1
  )

  # Top predicted sequences
  head(de_novo$de_novo_sequences)

  # Positional weight matrix used for generation
  head(de_novo$PWM_Scoring)
}
```

# Network Visualization with `plotNetwork()`

The `plotNetwork()` function creates an interactive network visualization of
the convergence groups using the visNetwork package.

## Key Arguments

| Argument | Default | Description |
|:---------|:--------|:------------|
| `clustering_output` | `NULL` | Output list from `runGLIPH()` |
| `color_info` | `"total.score"` | Column name for node coloring |
| `color_palette` | `viridis::viridis` | Color palette function |
| `local_edge_color` | `"orange"` | Color for local similarity edges |
| `global_edge_color` | `"#68bceb"` | Color for global similarity edges |
| `size_info` | `NULL` | Column name for node sizing |
| `cluster_min_size` | 3 | Minimum cluster size to display |

## Example

```{r}
if (!is.null(res_gliph1$cluster_properties) &&
    nrow(res_gliph1$cluster_properties) > 0) {
  plotNetwork(
    clustering_output = res_gliph1,
    color_info        = "total.score",
    cluster_min_size  = 2,
    n_cores           = 1
  )
}
```

# Loading Saved Results with `loadGLIPH()`

If you save results to disk using `result_folder`, you can reload them
later. We use `tempdir()` here so the example does not write to your
working directory:

```{r}
out_dir <- file.path(tempdir(), "gliph_results")

res_saved <- runGLIPH(
  cdr3_sequences = gliph_input_data[seq_len(200), ],
  method         = "gliph2",
  refdb_beta     = ref_df,
  result_folder  = out_dir,
  sim_depth      = 100,
  n_cores        = 1
)

reloaded <- loadGLIPH(result_folder = out_dir)
names(reloaded)
```

# Saving Results to Disk

When `result_folder` is specified, `runGLIPH()` writes several output files:

| File | Description |
|:-----|:------------|
| `local_similarities.txt` | Enriched motifs |
| `all_motifs.txt` | All tested motifs with statistics |
| `clone_network.txt` | Network edge list |
| `convergence_groups.txt` | Cluster properties and scores |
| `cluster_member_details.txt` | Full member information per cluster |
| `parameters.txt` | All parameters used |

# Performance

## Accelerated Computation with immApex

When [immApex](https://bioconductor.org/packages/immApex/) is installed,
immGLIPH automatically uses its C++-accelerated backends for two
computationally intensive steps:

1. **Motif enumeration** (`findMotifs()`): Uses `immApex::calculateMotif()`
   with OpenMP multithreading instead of the pure-R `stringdist::qgrams()`
   approach.

2. **Global Hamming distance network** (GLIPH1 method): Uses
   `immApex::buildNetwork()` to compute pairwise distances in a single C++
   call, replacing the parallel loop over `stringdist::stringdist()`.

If immApex is not installed, immGLIPH falls back to the original pure-R
implementations transparently, no code changes are needed.

```{r eval=FALSE}
# Install immApex for performance acceleration
BiocManager::install("BorchLab/immApex")
```

# Tips and Best Practices

1. **Start with GLIPH2**: The Fisher-based approach is generally more
   statistically rigorous than repeated random sampling.

2. **Sample size matters**: GLIPH works best with >200 unique CDR3$\beta$
   sequences. Very small samples may yield few or no convergence groups.

3. **Include V-gene information**: When available, TRBV data improves both
   global similarity detection and scoring accuracy.

4. **Adjust `sim_depth`**: For publication-quality results, use
   `sim_depth >= 1000`. For exploratory analysis, `sim_depth = 100` is faster.

5. **Parallelization**: For large datasets (>5,000 sequences), set
   `n_cores > 1` to use parallel processing via BiocParallel.

6. **Install immApex**: For best performance, install immApex to enable
   C++-accelerated motif enumeration and network construction (see
   Performance section above).

7. **Choose the right reference**: For mouse data, use `refdb_beta = "mouse_v1.0_CD48"`.
   For specialized repertoires, provide a custom data frame via `refdb_beta`.

# Session Info

```{r}
sessionInfo()
```