--- title: "Introduction to CGRphylo2: Chaos Game Representation for Phylogenetic Analysis" author: "Amarinder Singh Thind" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Introduction to CGRphylo2} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, warning = FALSE, message = FALSE, fig.width = 7, fig.height = 5, fig.crop = FALSE ## avoids magick dependency during R CMD check ) ``` # Introduction ### About CGRphylo2 CGRphylo2 provides an alignment-free phylogenetic analysis method for viral genomes using Chaos Game Representation (CGR), a technique based on statistical physics concepts. Viruses exhibit high mutation rates, facilitating rapid evolution and emergence of new species, subspecies, strains, and recombinant forms. Accurate classification is crucial for understanding viral evolution and therapeutic development. Traditional phylogenetic methods require sequence alignment, which is computationally intensive. CGRphylo2 addresses this by implementing CGR-based whole-genome comparison that is fast, accurate, and computationally efficient. The package successfully classifies closely related viral lineages (demonstrated on SARS-CoV-2 lineages A and B), identifies recombinants (such as the XBB variant), and distinguishes multiple strains simultaneously. It processes sequences 5-13.7x faster than alignment-based methods (Clustal-Omega) with linear computational scaling. As a k-mer based approach, it enables simultaneous comparison of numerous closely-related sequences of different lengths. The package creates frequency matrices for distance calculations and phylogenetic tree construction, with outputs compatible with standard formats (MEGA, PHYLIP, Newick). Earlier version reference is Thind and Sinha (2023) . ## Why CGRphylo? **CGRphylo2** provides an efficient alignment-free approach for phylogenetic analysis of viral genomes using Chaos Game Representation (CGR). The package offers several key advantages: - **Precision**: Accurately classifies closely related viral strains and recombinants - **Speed**: 5-13.7x faster than traditional alignment methods like Clustal-Omega - **Scalability**: Linear computational cost with dataset size - **Accessibility**: Designed for both high and low-resource settings ## Computational Efficiency CGRphylo's efficiency comes from its alignment-free approach: - For 69 SARS-CoV-2 genomes: **5x faster** than Clustal-Omega - For 106 genomes: **13.7x faster** than Clustal-Omega - Adding sequences requires just one frequency matrix calculation - No quadratic scaling issues like multiple sequence alignment ## Citation If you use CGRphylo in your research, please cite: > Thind AS, Sinha S (2023). Using Chaos-Game-Representation for Analysing the > SARS-CoV-2 Lineages, Newly Emerging Strains and Recombinants. > *Current Genomics*, 24(3). > [doi:10.2174/0113892029264990231013112156](https://doi.org/10.2174/0113892029264990231013112156) # Installation ## Bioconductor Installation (recommended) ```{r eval=FALSE} if (!require("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("CGRphylo2") ``` ## Development Version ```{r eval=FALSE} if (!require("devtools", quietly = TRUE)) { install.packages("devtools") } devtools::install_github("amarinderthind/CGRphylo2") ``` # Quick Start Example This example demonstrates the complete CGRphylo2 workflow using the SARS-CoV-2 sequences bundled with the package in `inst/extdata`. ## Load Required Packages ```{r load_packages} library(CGRphylo2) library(ape) ``` ## Prepare Example Sequences We load the SARS-CoV-2 sequences bundled with the package and filter out any with excessive ambiguous (N) bases. ```{r load_data} library(seqinr) fasta_path <- system.file("extdata", "example_sequences.fasta", package = "CGRphylo2") fastafile <- seqinr::read.fasta(fasta_path, seqtype = "DNA", as.string = TRUE, set.attributes = FALSE) N_filter <- 50 fasta_filtered <- filter_N(fastafile, N_filter) cat("Sequences retained:", length(fasta_filtered), "\n") for (nm in names(fasta_filtered)) { cat(nm, ": length =", nchar(fasta_filtered[[nm]]), "bp\n") } ``` ## Sequence Quality Control ```{r metadata, fig.cap="Sequence lengths (left) and GC content (right) for the bundled SARS-CoV-2 sequences."} meta <- create_meta(fasta_filtered, N_filter = 50) print(meta) par(mfrow = c(1, 2), mar = c(4, 6, 3, 1)) dotchart(meta$length, labels = meta$name, xlab = "Sequence length (bp)", pch = 21, bg = "green", pt.cex = 1.4, cex = 0.9, main = "Sequence Lengths" ) dotchart(meta$GC_content, labels = meta$name, xlab = "GC content (%)", pch = 21, bg = "steelblue", pt.cex = 1.4, cex = 0.9, main = "GC Content" ) par(mfrow = c(1, 1)) ``` ## Calculate CGR Frequency Matrices The k-mer size determines the resolution of the analysis. Typical values for real genome data are 4-8; we use k=4 here. ```{r calculate_frequencies} library(BiocParallel) k_mer <- 4 ## use 6 for real genome-scale data (virus) ~30k bp len_trim <- min(nchar(unlist(fasta_filtered))) ## SerialParam() required in vignettes — R CMD check forbids >2 parallel processes freq_matrices <- parallelCGR(fasta_filtered, k_mer = k_mer, len_trim = len_trim, BPPARAM = SerialParam() ) cat("Sequences processed:", length(freq_matrices), "\n") cat("K-mers per matrix :", nrow(freq_matrices[[1]]), paste0("(4^", k_mer, ")"), "\n") ``` ## CGR Visualization ```{r cgr_plots, fig.cap="CGR plots for the first two sequences, showing the k-mer density patterns characteristic of each genome."} par(mfrow = c(1, 2), mar = c(2, 2, 3, 1)) cgr1 <- cgrplot(1) plot(cgr1[, 1], cgr1[, 2], main = paste("CGR:", names(fasta_filtered)[1]), xlab = "", ylab = "", cex = 0.4, pch = 4, col = "steelblue" ) cgr2 <- cgrplot(2) plot(cgr2[, 1], cgr2[, 2], main = paste("CGR:", names(fasta_filtered)[2]), xlab = "", ylab = "", cex = 0.4, pch = 4, col = "tomato" ) par(mfrow = c(1, 1)) ``` ## Calculate Distance Matrix ```{r calculate_distances} distance_matrix <- calculateDistanceMatrix(freq_matrices, distance_type = "Euclidean" ) print(round(distance_matrix, 4)) ``` ## Distance Matrix Heatmap ```{r heatmap, fig.cap="Pairwise Euclidean distance matrix. Lighter colours indicate smaller distances (more similar sequences)."} col_pal <- colorRampPalette(c("white", "steelblue", "darkblue"))(50) n <- nrow(distance_matrix) image(seq_len(n), seq_len(n), distance_matrix, col = col_pal, xaxt = "n", yaxt = "n", xlab = "", ylab = "", main = "Pairwise Distance Heatmap" ) axis(1, at = seq_len(n), labels = rownames(distance_matrix), las = 2, cex.axis = 0.9) axis(2, at = seq_len(n), labels = colnames(distance_matrix), las = 1, cex.axis = 0.9) for (i in seq_len(n)) { for (j in seq_len(n)) { text(i, j, round(distance_matrix[i, j], 3), cex = 0.75) } } ``` ## Build Phylogenetic Tree ```{r build_tree, fig.cap="Neighbour-joining tree built from CGR Euclidean distances for the bundled SARS-CoV-2 sequences."} nj_tree <- nj(as.dist(distance_matrix)) plot(nj_tree, main = "Neighbour-Joining Tree (CGR distances)", cex = 0.9 ) ``` ## Export Results ```{r export_results} mega_file <- tempfile(fileext = ".meg") phylip_file <- tempfile(fileext = ".phy") saveMegaDistance(mega_file, distance_matrix) savePhylipDistance(phylip_file, distance_matrix, mode = "relaxed") cat("--- MEGA file (first 6 lines) ---\n") writeLines(readLines(mega_file, n = 6)) cat("\n--- PHYLIP file (first 6 lines) ---\n") writeLines(readLines(phylip_file, n = 6)) ``` # Detailed Workflow ## 1. Bioconductor Integration CGRphylo2 integrates with the Bioconductor ecosystem via `from_DNAStringSet()`, which converts a `DNAStringSet` object to the list format used by all CGRphylo2 functions. ```{r biostrings_integration} library(Biostrings) dna <- DNAStringSet(c( seq1 = "ATCGATCGATCGATCG", seq2 = "GCTAGCTAGCTAGCTA" )) seqs <- from_DNAStringSet(dna) cat("Sequences loaded:", length(seqs), "\n") ``` ## 2. Data Preparation We load the same bundled SARS-CoV-2 sequences used in the Quick Start: ```{r detailed_load_data} library(seqinr) fasta_path <- system.file("extdata", "example_sequences.fasta", package = "CGRphylo2") fastafile <- seqinr::read.fasta(fasta_path, seqtype = "DNA", as.string = TRUE, set.attributes = FALSE) N_filter <- 50 fasta_filtered <- filter_N(fastafile, N_filter) meta <- create_meta(fasta_filtered, N_filter) len_trim <- min(meta$length) k_mer <- 6 ``` ## 3. Distance Calculation Methods CGRphylo2 supports three distance metrics: ```{r distance_methods, fig.cap="NJ trees using three distance metrics."} ## Recompute frequency matrices for this section freq_matrices <- parallelCGR(fasta_filtered, k_mer = k_mer, len_trim = len_trim, BPPARAM = SerialParam() ) par(mfrow = c(1, 3), mar = c(1, 1, 3, 1)) for (dtype in c("Euclidean", "Manhattan", "S_Euclidean")) { dist_d <- calculateDistanceMatrix(freq_matrices, distance_type = dtype) tree_d <- nj(as.dist(dist_d)) plot(tree_d, main = dtype, cex = 0.8) } par(mfrow = c(1, 1)) ``` ## 4. Choosing K-mer Size ```{r kmer_comparison, fig.cap="NJ trees at k = 2, 3, and 4."} par(mfrow = c(1, 3), mar = c(1, 1, 3, 1)) for (k in c(2, 3, 4)) { freq_k <- parallelCGR(fasta_filtered, k_mer = k, len_trim = len_trim, BPPARAM = SerialParam()) dist_k <- calculateDistanceMatrix(freq_k, distance_type = "Euclidean") tree_k <- nj(as.dist(dist_k)) plot(tree_k, main = paste0("k = ", k), cex = 0.8) } par(mfrow = c(1, 1)) ``` ## 5. Parallel Processing ```{r parallel_processing, eval=TRUE} freq_matrices <- parallelCGR(fasta_filtered, k_mer = 6, len_trim = len_trim, BPPARAM = SerialParam() ) cat("Matrices computed:", length(freq_matrices), "\n") ``` ## 6. Memory Optimisation for Very Large Datasets ```{r memory_optimization, eval=FALSE} batch_size <- 100 n_sequences <- length(fasta_filtered) n_batches <- ceiling(n_sequences / batch_size) all_freq_matrices <- list() for (i in seq_len(n_batches)) { start_idx <- (i - 1) * batch_size + 1 end_idx <- min(i * batch_size, n_sequences) batch_seqs <- fasta_filtered[start_idx:end_idx] batch_freq <- parallelCGR(batch_seqs, k_mer = 6, len_trim = len_trim, BPPARAM = BiocParallel::bpparam()) all_freq_matrices <- c(all_freq_matrices, batch_freq) } ``` # Understanding CGR ## The CGR Algorithm Chaos Game Representation creates a 2D fractal representation of DNA sequences: 1. **Initialise**: Start at centre of unit square (0.5, 0.5) 2. **Iterate**: For each nucleotide, move halfway toward its corner — A: (0,0), C: (1,0), G: (1,1), T: (0,1) 3. **Result**: Each sequence creates a unique compositional fingerprint ## K-mer Frequencies - **4^k possible k-mers**: k=6 gives 4,096 possible 6-mers - **Normalised frequencies**: each count divided by total k-mer count - **Scale-invariant**: robust to sequence length differences ## Distance Metrics - **Euclidean**: sqrt(sum((f1-f2)^2)) — standard choice, balanced - **Manhattan**: sum(|f1-f2|) — more sensitive to small differences - **Squared Euclidean**: sum((f1-f2)^2) — amplifies large differences # Performance Benchmarks | Method | 69 sequences | 106 sequences | |--------|-------------|--------------| | Clustal-Omega | 5 minutes | 13.7 minutes | | CGRphylo (k=6) | 1 minute | 1 minute | | **Speedup** | **5x** | **13.7x** | Memory usage scales linearly: each frequency matrix at k=6 is ~32 KB, so 1,000 sequences require ~32 MB. # Troubleshooting ## Out of Memory Process sequences in batches (see Memory Optimisation above). ## Slow Performance Use `parallelCGR`, reduce k-mer size, or filter low-quality sequences first. ## Tree Topology Differs from MSA CGRphylo captures compositional patterns; MSA captures position-specific mutations. Both are valid but emphasise different evolutionary signals. # Session Information ```{r session_info} sessionInfo() ``` # References 1. Thind AS, Sinha S (2023). Using Chaos-Game-Representation for Analysing the SARS-CoV-2 Lineages, Newly Emerging Strains and Recombinants. *Current Genomics*, 24(3). 2. Jeffrey HJ (1990). Chaos game representation of gene structure. *Nucleic Acids Research*, 18(8):2163-2170. 3. Deschavanne PJ, et al. (1999). Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. *Molecular Biology and Evolution*, 16(10):1391-1399.