---
title: "Introduction to CGRphylo2: Chaos Game Representation for Phylogenetic Analysis"
author: "Amarinder Singh Thind"
date: "`r Sys.Date()`"
output: 
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Introduction to CGRphylo2}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
    echo = TRUE, warning = FALSE, message = FALSE,
    fig.width = 7, fig.height = 5,
    fig.crop = FALSE   ## avoids magick dependency during R CMD check
)
```

# Introduction

### About CGRphylo2

CGRphylo2 provides an alignment-free phylogenetic analysis method
for viral genomes using Chaos Game Representation (CGR), a technique based
on statistical physics concepts. Viruses exhibit high mutation rates,
facilitating rapid evolution and emergence of new species, subspecies,
strains, and recombinant forms. Accurate classification is crucial for
understanding viral evolution and therapeutic development. Traditional
phylogenetic methods require sequence alignment, which is computationally
intensive. CGRphylo2 addresses this by implementing CGR-based whole-genome
comparison that is fast, accurate, and computationally efficient. The package
successfully classifies closely related viral lineages (demonstrated on
SARS-CoV-2 lineages A and B), identifies recombinants (such as the XBB
variant), and distinguishes multiple strains simultaneously. It processes
sequences 5-13.7x faster than alignment-based methods (Clustal-Omega) with
linear computational scaling. As a k-mer based approach, it enables
simultaneous comparison of numerous closely-related sequences of different
lengths. The package creates frequency matrices for distance calculations
and phylogenetic tree construction, with outputs compatible with standard
formats (MEGA, PHYLIP, Newick). Earlier version reference is Thind and
Sinha (2023) <doi:10.2174/0113892029264990231013112156>.

## Why CGRphylo?

**CGRphylo2** provides an efficient alignment-free approach for phylogenetic
analysis of viral genomes using Chaos Game Representation (CGR). The package
offers several key advantages:

- **Precision**: Accurately classifies closely related viral strains and
  recombinants
- **Speed**: 5-13.7x faster than traditional alignment methods like
  Clustal-Omega
- **Scalability**: Linear computational cost with dataset size
- **Accessibility**: Designed for both high and low-resource settings

## Computational Efficiency

CGRphylo's efficiency comes from its alignment-free approach:

- For 69 SARS-CoV-2 genomes: **5x faster** than Clustal-Omega
- For 106 genomes: **13.7x faster** than Clustal-Omega
- Adding sequences requires just one frequency matrix calculation
- No quadratic scaling issues like multiple sequence alignment

## Citation

If you use CGRphylo in your research, please cite:

> Thind AS, Sinha S (2023). Using Chaos-Game-Representation for Analysing the
> SARS-CoV-2 Lineages, Newly Emerging Strains and Recombinants.
> *Current Genomics*, 24(3).
> [doi:10.2174/0113892029264990231013112156](https://doi.org/10.2174/0113892029264990231013112156)

# Installation

## Bioconductor Installation (recommended)

```{r eval=FALSE}
if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("CGRphylo2")
```

## Development Version

```{r eval=FALSE}
 
if (!require("devtools", quietly = TRUE)) {
    install.packages("devtools")
}
devtools::install_github("amarinderthind/CGRphylo2")
```

# Quick Start Example

This example demonstrates the complete CGRphylo2 workflow using the SARS-CoV-2
sequences bundled with the package in `inst/extdata`.

## Load Required Packages

```{r load_packages}
library(CGRphylo2)
library(ape)
```

## Prepare Example Sequences

We load the SARS-CoV-2 sequences bundled with the package and filter out any
with excessive ambiguous (N) bases.

```{r load_data}
library(seqinr)
fasta_path <- system.file("extdata", "example_sequences.fasta",
                           package = "CGRphylo2")
fastafile  <- seqinr::read.fasta(fasta_path, seqtype = "DNA",
                                  as.string = TRUE, set.attributes = FALSE)
N_filter       <- 50
fasta_filtered <- filter_N(fastafile, N_filter)

cat("Sequences retained:", length(fasta_filtered), "\n")
for (nm in names(fasta_filtered)) {
    cat(nm, ": length =", nchar(fasta_filtered[[nm]]), "bp\n")
}
```

## Sequence Quality Control

```{r metadata, fig.cap="Sequence lengths (left) and GC content (right) for the bundled SARS-CoV-2 sequences."}
meta <- create_meta(fasta_filtered, N_filter = 50)
print(meta)

par(mfrow = c(1, 2), mar = c(4, 6, 3, 1))

dotchart(meta$length,
    labels = meta$name, xlab = "Sequence length (bp)",
    pch = 21, bg = "green", pt.cex = 1.4, cex = 0.9,
    main = "Sequence Lengths"
)
dotchart(meta$GC_content,
    labels = meta$name, xlab = "GC content (%)",
    pch = 21, bg = "steelblue", pt.cex = 1.4, cex = 0.9,
    main = "GC Content"
)
par(mfrow = c(1, 1))
```

## Calculate CGR Frequency Matrices

The k-mer size determines the resolution of the analysis. Typical values for
real genome data are 4-8; we use k=4 here.

```{r calculate_frequencies}
library(BiocParallel)

k_mer    <- 4   ## use 6 for real genome-scale data (virus) ~30k bp
len_trim <- min(nchar(unlist(fasta_filtered)))

## SerialParam() required in vignettes — R CMD check forbids >2 parallel processes
freq_matrices <- parallelCGR(fasta_filtered,
    k_mer    = k_mer,
    len_trim = len_trim,
    BPPARAM  = SerialParam()
)

cat("Sequences processed:", length(freq_matrices), "\n")
cat("K-mers per matrix  :", nrow(freq_matrices[[1]]),
    paste0("(4^", k_mer, ")"), "\n")
```

## CGR Visualization

```{r cgr_plots, fig.cap="CGR plots for the first two sequences, showing the k-mer density patterns characteristic of each genome."}
par(mfrow = c(1, 2), mar = c(2, 2, 3, 1))

cgr1 <- cgrplot(1)
plot(cgr1[, 1], cgr1[, 2],
    main = paste("CGR:", names(fasta_filtered)[1]),
    xlab = "", ylab = "", cex = 0.4, pch = 4, col = "steelblue"
)

cgr2 <- cgrplot(2)
plot(cgr2[, 1], cgr2[, 2],
    main = paste("CGR:", names(fasta_filtered)[2]),
    xlab = "", ylab = "", cex = 0.4, pch = 4, col = "tomato"
)
par(mfrow = c(1, 1))
```

## Calculate Distance Matrix

```{r calculate_distances}
 
distance_matrix <- calculateDistanceMatrix(freq_matrices,
    distance_type = "Euclidean"
)
print(round(distance_matrix, 4))
```

## Distance Matrix Heatmap

```{r heatmap, fig.cap="Pairwise Euclidean distance matrix. Lighter colours indicate smaller distances (more similar sequences)."}
col_pal <- colorRampPalette(c("white", "steelblue", "darkblue"))(50)
n       <- nrow(distance_matrix)

image(seq_len(n), seq_len(n), distance_matrix,
    col  = col_pal, xaxt = "n", yaxt = "n",
    xlab = "", ylab = "",
    main = "Pairwise Distance Heatmap"
)
axis(1, at = seq_len(n), labels = rownames(distance_matrix),
    las = 2, cex.axis = 0.9)
axis(2, at = seq_len(n), labels = colnames(distance_matrix),
    las = 1, cex.axis = 0.9)
for (i in seq_len(n)) {
    for (j in seq_len(n)) {
        text(i, j, round(distance_matrix[i, j], 3), cex = 0.75)
    }
}
```

## Build Phylogenetic Tree

```{r build_tree, fig.cap="Neighbour-joining tree built from CGR Euclidean distances for the bundled SARS-CoV-2 sequences."}
nj_tree <- nj(as.dist(distance_matrix))

plot(nj_tree,
    main = "Neighbour-Joining Tree (CGR distances)",
    cex  = 0.9
)
```

## Export Results

```{r export_results}
 
mega_file   <- tempfile(fileext = ".meg")
phylip_file <- tempfile(fileext = ".phy")

saveMegaDistance(mega_file, distance_matrix)
savePhylipDistance(phylip_file, distance_matrix, mode = "relaxed")

cat("--- MEGA file (first 6 lines) ---\n")
writeLines(readLines(mega_file, n = 6))

cat("\n--- PHYLIP file (first 6 lines) ---\n")
writeLines(readLines(phylip_file, n = 6))
```

# Detailed Workflow

## 1. Bioconductor Integration

CGRphylo2 integrates with the Bioconductor ecosystem via `from_DNAStringSet()`,
which converts a `DNAStringSet` object to the list format used by all CGRphylo2
functions.

```{r biostrings_integration}
library(Biostrings)
dna <- DNAStringSet(c(
    seq1 = "ATCGATCGATCGATCG",
    seq2 = "GCTAGCTAGCTAGCTA"
))
seqs <- from_DNAStringSet(dna)
cat("Sequences loaded:", length(seqs), "\n")
```

## 2. Data Preparation

We load the same bundled SARS-CoV-2 sequences used in the Quick Start:

```{r detailed_load_data}
library(seqinr)
fasta_path     <- system.file("extdata", "example_sequences.fasta",
                               package = "CGRphylo2")
fastafile      <- seqinr::read.fasta(fasta_path, seqtype = "DNA",
                                      as.string = TRUE, set.attributes = FALSE)
N_filter       <- 50
fasta_filtered <- filter_N(fastafile, N_filter)
meta           <- create_meta(fasta_filtered, N_filter)
len_trim       <- min(meta$length)
k_mer          <- 6
```

## 3. Distance Calculation Methods

CGRphylo2 supports three distance metrics:

```{r distance_methods, fig.cap="NJ trees using three distance metrics."}
## Recompute frequency matrices for this section
freq_matrices <- parallelCGR(fasta_filtered,
    k_mer    = k_mer,
    len_trim = len_trim,
    BPPARAM  = SerialParam()
)

par(mfrow = c(1, 3), mar = c(1, 1, 3, 1))
for (dtype in c("Euclidean", "Manhattan", "S_Euclidean")) {
    dist_d <- calculateDistanceMatrix(freq_matrices, distance_type = dtype)
    tree_d <- nj(as.dist(dist_d))
    plot(tree_d, main = dtype, cex = 0.8)
}
par(mfrow = c(1, 1))
```

## 4. Choosing K-mer Size

```{r kmer_comparison, fig.cap="NJ trees at k = 2, 3, and 4."}
par(mfrow = c(1, 3), mar = c(1, 1, 3, 1))
for (k in c(2, 3, 4)) {
    freq_k <- parallelCGR(fasta_filtered, k_mer = k, len_trim = len_trim,
                          BPPARAM = SerialParam())
    dist_k <- calculateDistanceMatrix(freq_k, distance_type = "Euclidean")
    tree_k <- nj(as.dist(dist_k))
    plot(tree_k, main = paste0("k = ", k), cex = 0.8)
}
par(mfrow = c(1, 1))
```

## 5. Parallel Processing

```{r parallel_processing, eval=TRUE}
freq_matrices <- parallelCGR(fasta_filtered,
    k_mer    = 6,
    len_trim = len_trim,
    BPPARAM  = SerialParam()
)
cat("Matrices computed:", length(freq_matrices), "\n")
```

## 6. Memory Optimisation for Very Large Datasets

```{r memory_optimization, eval=FALSE}
batch_size        <- 100
n_sequences       <- length(fasta_filtered)
n_batches         <- ceiling(n_sequences / batch_size)
all_freq_matrices <- list()

for (i in seq_len(n_batches)) {
    start_idx         <- (i - 1) * batch_size + 1
    end_idx           <- min(i * batch_size, n_sequences)
    batch_seqs        <- fasta_filtered[start_idx:end_idx]
    batch_freq        <- parallelCGR(batch_seqs, k_mer = 6, len_trim = len_trim,
                                     BPPARAM = BiocParallel::bpparam())
    all_freq_matrices <- c(all_freq_matrices, batch_freq)
}
```

# Understanding CGR

## The CGR Algorithm

Chaos Game Representation creates a 2D fractal representation of DNA sequences:

1. **Initialise**: Start at centre of unit square (0.5, 0.5)
2. **Iterate**: For each nucleotide, move halfway toward its corner —
   A: (0,0), C: (1,0), G: (1,1), T: (0,1)
3. **Result**: Each sequence creates a unique compositional fingerprint

## K-mer Frequencies

- **4^k possible k-mers**: k=6 gives 4,096 possible 6-mers
- **Normalised frequencies**: each count divided by total k-mer count
- **Scale-invariant**: robust to sequence length differences

## Distance Metrics

- **Euclidean**: sqrt(sum((f1-f2)^2)) — standard choice, balanced
- **Manhattan**: sum(|f1-f2|) — more sensitive to small differences
- **Squared Euclidean**: sum((f1-f2)^2) — amplifies large differences

# Performance Benchmarks

| Method | 69 sequences | 106 sequences |
|--------|-------------|--------------|
| Clustal-Omega | 5 minutes | 13.7 minutes |
| CGRphylo (k=6) | 1 minute | 1 minute |
| **Speedup** | **5x** | **13.7x** |

Memory usage scales linearly: each frequency matrix at k=6 is ~32 KB, so
1,000 sequences require ~32 MB.

# Troubleshooting

## Out of Memory
Process sequences in batches (see Memory Optimisation above).

## Slow Performance
Use `parallelCGR`, reduce k-mer size, or filter low-quality sequences first.

## Tree Topology Differs from MSA
CGRphylo captures compositional patterns; MSA captures position-specific
mutations. Both are valid but emphasise different evolutionary signals.

# Session Information

```{r session_info}
sessionInfo()
```

# References

1. Thind AS, Sinha S (2023). Using Chaos-Game-Representation for Analysing
   the SARS-CoV-2 Lineages, Newly Emerging Strains and Recombinants.
   *Current Genomics*, 24(3).

2. Jeffrey HJ (1990). Chaos game representation of gene structure.
   *Nucleic Acids Research*, 18(8):2163-2170.

3. Deschavanne PJ, et al. (1999). Genomic signature: characterization and
   classification of species assessed by chaos game representation of
   sequences. *Molecular Biology and Evolution*, 16(10):1391-1399.