--- title: "Getting Started with immGLIPH" author: - name: Nick Borcherding affiliation: Washington University in St. Louis date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_depth: 3 toc_float: true vignette: > %\VignetteIndexEntry{Getting Started with immGLIPH} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} library(BiocStyle) knitr::opts_chunk$set( echo = TRUE, message = FALSE, tidy = FALSE, warning = FALSE, fig.width = 7, fig.height = 5 ) ``` # Introduction immGLIPH provides an R implementation of the GLIPH (Grouping of Lymphocyte Interactions by Paratope Hotspots) and GLIPH2 algorithms for clustering T cell receptors (TCRs) that are predicted to bind the same HLA-restricted peptide antigen. The package identifies TCR specificity groups by detecting statistically enriched CDR3$\beta$ motifs (local similarity) and structurally similar CDR3$\beta$ sequences (global similarity), then clusters them into convergence groups and scores each group for biological significance. **immGLIPH is an R implementation of existing algorithms. Users should cite the original publications:** - **GLIPH**: Glanville, J. et al. *Identifying specificity groups in the T cell receptor repertoire.* Nature 547, 94--98 (2017). [doi:10.1038/nature22976](https://doi.org/10.1038/nature22976) - **GLIPH2**: Huang, H. et al. *Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening.* Nature Biotechnology 38, 1194--1202 (2020). [doi:10.1038/s41587-020-0505-4](https://doi.org/10.1038/s41587-020-0505-4) ## Installation immGLIPH can be installed from Bioconductor: ```{r eval=FALSE} BiocManager::install("immGLIPH") ``` The reference repertoire data (~19 MB) is downloaded automatically the first time you run `runGLIPH()` and cached locally via [BiocFileCache](https://bioconductor.org/packages/BiocFileCache/). You can pre-download the data with: ```{r eval=FALSE} BiocManager::install("BiocFileCache") ref <- getGLIPHreference() ``` ### Loading Package ```{r} library(immGLIPH) ``` ## Integration with the scRepertoire Ecosystem immGLIPH integrates with the [scRepertoire](https://bioconductor.org/packages/scRepertoire/) ecosystem through [immApex](https://bioconductor.org/packages/immApex/). Both scRepertoire and immApex are Bioconductor packages and can be installed via `BiocManager::install()`. This means `runGLIPH()` can directly accept: - **SingleCellExperiment** objects with TCR information - **combineTCR()** output lists from scRepertoire - Standard data frames or character vectors # Quick Start ## Loading the Example Data immGLIPH includes built-in example data derived from the scRepertoire example dataset (Yost et al. 2019): - **`gliph_input_data`**: A data frame of 365 TRB CDR3 sequences with V-gene and patient annotations. - **`gliph_sce`**: A SingleCellExperiment object with TCR clonotype information in `colData`, demonstrating the single-cell workflow. ```{r} data("gliph_input_data") head(gliph_input_data) dim(gliph_input_data) ``` ## Input Data Format `runGLIPH()` accepts a data frame with the following columns: | Column | Required | Description | |:------------|:---------|:------------| | **CDR3b** | Yes | CDR3$\beta$ amino acid sequences | | **TRBV** | No | V-gene usage (e.g., "TRBV5-1") | | **patient** | No | Donor/sample identifier | | **HLA** | No | HLA alleles, comma-separated | | **counts** | No | Clone frequency | Alternative column names are automatically recognized (e.g., `cdr3`, `v_gene`, `sample`, `clone_count`). # Working with Single-Cell Data When working with single-cell immune repertoire data, you can use scRepertoire to prepare your data and pass the output directly to immGLIPH. ```{r} library(scRepertoire) data("contig_list") # After processing with cellranger/etc, combine contigs combined <- combineTCR(contig_list[seq_len(2)], samples = c("P1", "P2")) # Take a small slice so the example runs quickly. In real use, pass # all samples and rely on the bundled reference downloaded by # getGLIPHreference(). combined_small <- lapply(combined, function(x) x[seq_len(50), ]) # Use a small custom reference built from the bundled example data small_ref <- gliph_input_data[, c("CDR3b", "TRBV")] # Pass scRepertoire output directly to runGLIPH results_sc <- runGLIPH(combined_small, method = "gliph2", refdb_beta = small_ref, sim_depth = 100, n_cores = 1) ``` For **SingleCellExperiment** objects that already contain TCR metadata (e.g., added via `scRepertoire::combineExpression()`), immGLIPH extracts the receptor data automatically using `immApex::getIR()`. Here is an example using the bundled `gliph_sce` dataset: ```{r} library(SingleCellExperiment) data("gliph_sce") # SingleCellExperiment object with TCR info in colData results_sce <- runGLIPH(gliph_sce, method = "gliph2", chains = "TRB", refdb_beta = small_ref, sim_depth = 100, n_cores = 1) ``` # The `runGLIPH()` Function ## Key Arguments | Argument | Default | Description | |:---------|:--------|:------------| | `cdr3_sequences` | -- | Input data (data frame, vector, SCE, or list) | | `method` | `"gliph2"` | Algorithm preset: `"gliph1"`, `"gliph2"`, or `"custom"` | | `sim_depth` | 1000 | Simulation depth (higher = more reproducible, slower) | | `n_cores` | 1 | Number of parallel cores | | `refdb_beta` | `"human_v2.0_CD48"` | Reference database (built-in name or custom data frame) | | `local_similarities` | `TRUE` | Search for local (motif-based) similarities | | `global_similarities` | `TRUE` | Search for global (structural) similarities | | `structboundaries` | `TRUE` | Trim structural boundaries before comparison | | `accept_CF` | `TRUE` | Filter to sequences starting with C, ending with F | ## Method Presets The `method` parameter configures a coordinated set of algorithm choices: | Setting | `"gliph1"` | `"gliph2"` | `"custom"` | |:--------|:-----------|:-----------|:-----------| | Local method | Repeated random sampling | Fisher's exact test | User-defined | | Global method | Hamming distance cutoff | Struct-based + Fisher | User-defined | | Clustering | Connected components | Per-motif isolated | User-defined | | Scoring | GLIPH1 formula | GLIPH2 formula | User-defined | ## Running GLIPH2 (Default) By default `runGLIPH()` downloads a reference repertoire via BiocFileCache. For this vignette we supply a custom reference data frame directly using `refdb_beta` to avoid network access: ```{r} data("gliph_input_data") # Use the input data itself as a small reference for demonstration ref_df <- gliph_input_data[, c("CDR3b", "TRBV")] res_gliph2 <- runGLIPH( cdr3_sequences = gliph_input_data[seq_len(200), ], method = "gliph2", refdb_beta = ref_df, sim_depth = 100, n_cores = 1 ) ``` ## Running GLIPH1 ```{r} res_gliph1 <- runGLIPH( cdr3_sequences = gliph_input_data[seq_len(200), ], method = "gliph1", refdb_beta = ref_df, sim_depth = 100, n_cores = 1 ) ``` ## Understanding the Output The output is a list with the following elements: ```{r} names(res_gliph2) ``` ### Cluster Properties The `cluster_properties` data frame summarizes each convergence group with enrichment scores: ```{r} head(res_gliph2$cluster_properties) ``` ### Cluster Membership The `cluster_list` is a named list where each element contains the member TCRs of a convergence group: ```{r} # Number of convergence groups length(res_gliph2$cluster_list) # Members of the first cluster (if any found) if (length(res_gliph2$cluster_list) > 0) { head(res_gliph2$cluster_list[[1]]) } ``` ### Motif Enrichment The `motif_enrichment` element contains the locally enriched motifs: ```{r} # Significantly enriched motifs if (!is.null(res_gliph2$motif_enrichment$selected_motifs)) { head(res_gliph2$motif_enrichment$selected_motifs) } ``` ### Network Edges The `connections` data frame contains the edge list representing the TCR similarity network: ```{r} if (!is.null(res_gliph2$connections)) { head(res_gliph2$connections) } ``` # Customizing the Analysis ## Using `method = "custom"` The `"custom"` method allows independent control over each algorithmic component: ```{r} res_custom <- runGLIPH( cdr3_sequences = gliph_input_data[seq_len(200), ], method = "custom", refdb_beta = ref_df, local_method = "fisher", # or "rrs" global_method = "cutoff", # or "fisher" clustering_method = "GLIPH1.0", # or "GLIPH2.0" scoring_method = "GLIPH2.0", # or "GLIPH1.0" sim_depth = 100, n_cores = 1 ) ``` ## Adjusting Significance Thresholds For the Fisher-based local method (GLIPH2), you can adjust: - **`lcminp`**: Maximum p-value for a motif to be considered enriched (default 0.01) - **`lcminove`**: Minimum fold-enrichment per motif length (default `c(1000, 100, 10)` for 2-mers, 3-mers, 4-mers) - **`kmer_mindepth`**: Minimum motif observations in the sample (default 3) ```{r} res_strict <- runGLIPH( cdr3_sequences = gliph_input_data[seq_len(200), ], method = "gliph2", refdb_beta = ref_df, lcminp = 0.001, # Stricter p-value lcminove = c(10000, 1000, 100), # Higher fold-change sim_depth = 100, n_cores = 1 ) ``` ## Choosing a Reference Database immGLIPH ships with reference repertoires for human and mouse from the original GLIPH publications. Each is available as CD4, CD8, or combined (CD48) subsets: | Name | Species | Version | Subset | Source | |:-----|:--------|:--------|:-------|:-------| | `"human_v1.0_CD4"` | Human | v1.0 | CD4 | Glanville et al. (2017) | | `"human_v1.0_CD8"` | Human | v1.0 | CD8 | Glanville et al. (2017) | | `"human_v1.0_CD48"` | Human | v1.0 | CD4+CD8 | Glanville et al. (2017) | | `"human_v2.0_CD4"` | Human | v2.0 | CD4 | Huang et al. (2020) | | `"human_v2.0_CD8"` | Human | v2.0 | CD8 | Huang et al. (2020) | | `"human_v2.0_CD48"` | Human | v2.0 | CD4+CD8 | Huang et al. (2020) | | `"mouse_v1.0_CD4"` | Mouse | v1.0 | CD4 | Huang et al. (2020) | | `"mouse_v1.0_CD8"` | Mouse | v1.0 | CD8 | Huang et al. (2020) | | `"mouse_v1.0_CD48"` | Mouse | v1.0 | CD4+CD8 | Huang et al. (2020) | | `"gliph_reference"` | Human | v1.0 | CD4+CD8 | Legacy alias for `human_v1.0_CD48` | Each reference includes pre-computed V-gene usage and CDR3 length frequency distributions, which are automatically used for cluster scoring. To analyse mouse data, supply mouse CDR3$\beta$ sequences as `cdr3_sequences` and set `refdb_beta = "mouse_v1.0_CD48"` (or one of the CD4/CD8 subsets in the table). The reference is fetched and cached the first time it is requested. ## Using a Custom Reference Database You can also supply your own reference as a data frame. A minimal reference is a two-column table of CDR3$\beta$ amino-acid sequences and their corresponding V-gene names. Here we build a small one from the bundled `gliph_input_data` for illustration: ```{r} custom_ref <- gliph_input_data[, c("CDR3b", "TRBV")] head(custom_ref, 3) res <- runGLIPH( cdr3_sequences = gliph_input_data[seq_len(100), ], refdb_beta = custom_ref, method = "gliph2", sim_depth = 100, n_cores = 1 ) ``` # Motif Discovery with `findMotifs()` The `findMotifs()` function searches for continuous and discontinuous k-mer motifs in a set of CDR3 sequences. It is used internally by `runGLIPH()` but can also be called independently. ## Key Arguments | Argument | Default | Description | |:---------|:--------|:------------| | `seqs` | -- | Character vector of CDR3$\beta$ sequences | | `q` | `2:4` | Motif lengths to search | | `kmer_mindepth` | `NULL` | Minimum motif count to return | | `discontinuous` | `FALSE` | Include discontinuous motifs (with one variable position) | ## Example ```{r} data("gliph_input_data") sample_seqs <- as.character(gliph_input_data$CDR3b[seq_len(100)]) # Find all 3-mers appearing at least 5 times motifs <- findMotifs(seqs = sample_seqs, q = 3, kmer_mindepth = 5) head(motifs[order(motifs$V1, decreasing = TRUE), ]) ``` Including discontinuous motifs (e.g., `C.S` where `.` is any amino acid): ```{r} disc_motifs <- findMotifs(seqs = sample_seqs, q = 2, kmer_mindepth = 5, discontinuous = TRUE ) # Show discontinuous motifs (those containing a dot) disc_only <- disc_motifs[grep("\\.", disc_motifs$motif), ] head(disc_only[order(disc_only$V1, decreasing = TRUE), ]) ``` # Cluster Scoring with `clusterScoring()` The `clusterScoring()` function evaluates convergence groups using up to five metrics. This is called automatically by `runGLIPH()`, but can be re-run with different parameters on existing results. ## Key Arguments | Argument | Default | Description | |:---------|:--------|:------------| | `cluster_list` | -- | Named list of cluster data frames (from `runGLIPH()$cluster_list`) | | `cdr3_sequences` | -- | Original input data frame | | `refdb_beta` | `"human_v2.0_CD48"` | Reference database | | `gliph_version` | 1 | Scoring formula: 1 (GLIPH) or 2 (GLIPH2) | | `sim_depth` | 1000 | Resampling depth for score estimation | ## Scoring Components The total score is derived from up to five components (depending on available data): 1. **`network.size.score`**: Probability of observing a cluster of this size by chance 2. **`cdr3.length.score`**: Enrichment of CDR3 length distribution within the cluster 3. **`vgene.score`**: Enrichment of V-gene usage (requires TRBV column) 4. **`clonal.expansion.score`**: Enrichment of expanded clones (requires counts column) 5. **`hla.score`**: Enrichment of shared HLA alleles among donors (requires patient + HLA columns) ## Example ```{r} # Re-score with GLIPH2 formula if (length(res_gliph1$cluster_list) > 0) { rescored <- clusterScoring( cluster_list = res_gliph1$cluster_list, cdr3_sequences = gliph_input_data[seq_len(200), ], refdb_beta = ref_df, gliph_version = 2, sim_depth = 100, n_cores = 1) head(rescored) } ``` # De Novo TCR Generation with `deNovoTCRs()` The `deNovoTCRs()` function generates artificial CDR3$\beta$ sequences that resemble the positional amino acid composition of a given convergence group. This can be used to predict novel TCR sequences with similar binding characteristics. ## Key Arguments | Argument | Default | Description | |:---------|:--------|:------------| | `convergence_group_tag` | -- | Tag identifying the cluster (from `cluster_properties$tag`) | | `clustering_output` | `NULL` | Output list from `runGLIPH()` | | `result_folder` | `""` | Alternative: load from files | | `sims` | 100,000 | Number of de novo sequences to generate | | `num_tops` | 1,000 | Return top N highest-scoring sequences | | `normalization` | `FALSE` | Normalize scores against the reference database | | `make_figure` | `FALSE` | Plot score vs. rank | ## Example ```{r} # Generate de novo TCRs for the first convergence group (if any found) if (length(res_gliph1$cluster_list) > 0) { de_novo <- deNovoTCRs( convergence_group_tag = names(res_gliph1$cluster_list)[1], clustering_output = res_gliph1, refdb_beta = ref_df, sims = 10000, num_tops = 100, make_figure = FALSE, n_cores = 1 ) # Top predicted sequences head(de_novo$de_novo_sequences) # Positional weight matrix used for generation head(de_novo$PWM_Scoring) } ``` # Network Visualization with `plotNetwork()` The `plotNetwork()` function creates an interactive network visualization of the convergence groups using the visNetwork package. ## Key Arguments | Argument | Default | Description | |:---------|:--------|:------------| | `clustering_output` | `NULL` | Output list from `runGLIPH()` | | `color_info` | `"total.score"` | Column name for node coloring | | `color_palette` | `viridis::viridis` | Color palette function | | `local_edge_color` | `"orange"` | Color for local similarity edges | | `global_edge_color` | `"#68bceb"` | Color for global similarity edges | | `size_info` | `NULL` | Column name for node sizing | | `cluster_min_size` | 3 | Minimum cluster size to display | ## Example ```{r} if (!is.null(res_gliph1$cluster_properties) && nrow(res_gliph1$cluster_properties) > 0) { plotNetwork( clustering_output = res_gliph1, color_info = "total.score", cluster_min_size = 2, n_cores = 1 ) } ``` # Loading Saved Results with `loadGLIPH()` If you save results to disk using `result_folder`, you can reload them later. We use `tempdir()` here so the example does not write to your working directory: ```{r} out_dir <- file.path(tempdir(), "gliph_results") res_saved <- runGLIPH( cdr3_sequences = gliph_input_data[seq_len(200), ], method = "gliph2", refdb_beta = ref_df, result_folder = out_dir, sim_depth = 100, n_cores = 1 ) reloaded <- loadGLIPH(result_folder = out_dir) names(reloaded) ``` # Saving Results to Disk When `result_folder` is specified, `runGLIPH()` writes several output files: | File | Description | |:-----|:------------| | `local_similarities.txt` | Enriched motifs | | `all_motifs.txt` | All tested motifs with statistics | | `clone_network.txt` | Network edge list | | `convergence_groups.txt` | Cluster properties and scores | | `cluster_member_details.txt` | Full member information per cluster | | `parameters.txt` | All parameters used | # Performance ## Accelerated Computation with immApex When [immApex](https://bioconductor.org/packages/immApex/) is installed, immGLIPH automatically uses its C++-accelerated backends for two computationally intensive steps: 1. **Motif enumeration** (`findMotifs()`): Uses `immApex::calculateMotif()` with OpenMP multithreading instead of the pure-R `stringdist::qgrams()` approach. 2. **Global Hamming distance network** (GLIPH1 method): Uses `immApex::buildNetwork()` to compute pairwise distances in a single C++ call, replacing the parallel loop over `stringdist::stringdist()`. If immApex is not installed, immGLIPH falls back to the original pure-R implementations transparently, no code changes are needed. ```{r eval=FALSE} # Install immApex for performance acceleration BiocManager::install("BorchLab/immApex") ``` # Tips and Best Practices 1. **Start with GLIPH2**: The Fisher-based approach is generally more statistically rigorous than repeated random sampling. 2. **Sample size matters**: GLIPH works best with >200 unique CDR3$\beta$ sequences. Very small samples may yield few or no convergence groups. 3. **Include V-gene information**: When available, TRBV data improves both global similarity detection and scoring accuracy. 4. **Adjust `sim_depth`**: For publication-quality results, use `sim_depth >= 1000`. For exploratory analysis, `sim_depth = 100` is faster. 5. **Parallelization**: For large datasets (>5,000 sequences), set `n_cores > 1` to use parallel processing via BiocParallel. 6. **Install immApex**: For best performance, install immApex to enable C++-accelerated motif enumeration and network construction (see Performance section above). 7. **Choose the right reference**: For mouse data, use `refdb_beta = "mouse_v1.0_CD48"`. For specialized repertoires, provide a custom data frame via `refdb_beta`. # Session Info ```{r} sessionInfo() ```