Introduction to CGRphylo2: Chaos Game Representation for Phylogenetic Analysis

Introduction

About CGRphylo2

CGRphylo2 provides an alignment-free phylogenetic analysis method for viral genomes using Chaos Game Representation (CGR), a technique based on statistical physics concepts. Viruses exhibit high mutation rates, facilitating rapid evolution and emergence of new species, subspecies, strains, and recombinant forms. Accurate classification is crucial for understanding viral evolution and therapeutic development. Traditional phylogenetic methods require sequence alignment, which is computationally intensive. CGRphylo2 addresses this by implementing CGR-based whole-genome comparison that is fast, accurate, and computationally efficient. The package successfully classifies closely related viral lineages (demonstrated on SARS-CoV-2 lineages A and B), identifies recombinants (such as the XBB variant), and distinguishes multiple strains simultaneously. It processes sequences 5-13.7x faster than alignment-based methods (Clustal-Omega) with linear computational scaling. As a k-mer based approach, it enables simultaneous comparison of numerous closely-related sequences of different lengths. The package creates frequency matrices for distance calculations and phylogenetic tree construction, with outputs compatible with standard formats (MEGA, PHYLIP, Newick). Earlier version reference is Thind and Sinha (2023) doi:10.2174/0113892029264990231013112156.

Why CGRphylo?

CGRphylo2 provides an efficient alignment-free approach for phylogenetic analysis of viral genomes using Chaos Game Representation (CGR). The package offers several key advantages:

  • Precision: Accurately classifies closely related viral strains and recombinants
  • Speed: 5-13.7x faster than traditional alignment methods like Clustal-Omega
  • Scalability: Linear computational cost with dataset size
  • Accessibility: Designed for both high and low-resource settings

Computational Efficiency

CGRphylo’s efficiency comes from its alignment-free approach:

  • For 69 SARS-CoV-2 genomes: 5x faster than Clustal-Omega
  • For 106 genomes: 13.7x faster than Clustal-Omega
  • Adding sequences requires just one frequency matrix calculation
  • No quadratic scaling issues like multiple sequence alignment

Citation

If you use CGRphylo in your research, please cite:

Thind AS, Sinha S (2023). Using Chaos-Game-Representation for Analysing the SARS-CoV-2 Lineages, Newly Emerging Strains and Recombinants. Current Genomics, 24(3). doi:10.2174/0113892029264990231013112156

Installation

Development Version

if (!require("devtools", quietly = TRUE)) {
    install.packages("devtools")
}
devtools::install_github("amarinderthind/CGRphylo2")

Quick Start Example

This example demonstrates the complete CGRphylo2 workflow using the SARS-CoV-2 sequences bundled with the package in inst/extdata.

Load Required Packages

library(CGRphylo2)
library(ape)

Prepare Example Sequences

We load the SARS-CoV-2 sequences bundled with the package and filter out any with excessive ambiguous (N) bases.

library(seqinr)
fasta_path <- system.file("extdata", "example_sequences.fasta",
                           package = "CGRphylo2")
fastafile  <- seqinr::read.fasta(fasta_path, seqtype = "DNA",
                                  as.string = TRUE, set.attributes = FALSE)
N_filter       <- 50
fasta_filtered <- filter_N(fastafile, N_filter)

cat("Sequences retained:", length(fasta_filtered), "\n")
## Sequences retained: 13
for (nm in names(fasta_filtered)) {
    cat(nm, ": length =", nchar(fasta_filtered[[nm]]), "bp\n")
}
## XBB.1 : length = 29823 bp
## XBB.1 : length = 29823 bp
## XBB.1 : length = 29823 bp
## XBB.1 : length = 29823 bp
## ba.2.75 : length = 29652 bp
## ba.2.75 : length = 29652 bp
## ba.2.75 : length = 29652 bp
## ba.2.75 : length = 29652 bp
## BJ.1 : length = 29754 bp
## BJ.1 : length = 29754 bp
## BJ.1 : length = 29754 bp
## BJ.1 : length = 29754 bp
## MERS-CoV : length = 29994 bp

Sequence Quality Control

meta <- create_meta(fasta_filtered, N_filter = 50)
print(meta)
##        name length GC_content N_content
## 1     XBB.1  29823   37.86675        30
## 2     XBB.1  29818   37.85968        25
## 3     XBB.1  29803   37.85525        22
## 4     XBB.1  29773   37.90347        12
## 5   ba.2.75  29652   37.88615         2
## 6   ba.2.75  29652   37.88277         1
## 7   ba.2.75  29652   37.89289         1
## 8   ba.2.75  29652   37.88277         1
## 9      BJ.1  29754   37.92767         0
## 10     BJ.1  29726   37.90621         1
## 11     BJ.1  29757   37.91041         0
## 12     BJ.1  29724   37.88857         0
## 13 MERS-CoV  29994   41.15490         0
par(mfrow = c(1, 2), mar = c(4, 6, 3, 1))

dotchart(meta$length,
    labels = meta$name, xlab = "Sequence length (bp)",
    pch = 21, bg = "green", pt.cex = 1.4, cex = 0.9,
    main = "Sequence Lengths"
)
dotchart(meta$GC_content,
    labels = meta$name, xlab = "GC content (%)",
    pch = 21, bg = "steelblue", pt.cex = 1.4, cex = 0.9,
    main = "GC Content"
)
Sequence lengths (left) and GC content (right) for the bundled SARS-CoV-2 sequences.

Sequence lengths (left) and GC content (right) for the bundled SARS-CoV-2 sequences.

par(mfrow = c(1, 1))

Calculate CGR Frequency Matrices

The k-mer size determines the resolution of the analysis. Typical values for real genome data are 4-8; we use k=4 here.

library(BiocParallel)

k_mer    <- 4   ## use 6 for real genome-scale data (virus) ~30k bp
len_trim <- min(nchar(unlist(fasta_filtered)))

## SerialParam() required in vignettes — R CMD check forbids >2 parallel processes
freq_matrices <- parallelCGR(fasta_filtered,
    k_mer    = k_mer,
    len_trim = len_trim,
    BPPARAM  = SerialParam()
)

cat("Sequences processed:", length(freq_matrices), "\n")
## Sequences processed: 13
cat("K-mers per matrix  :", nrow(freq_matrices[[1]]),
    paste0("(4^", k_mer, ")"), "\n")
## K-mers per matrix  : 256 (4^4)

CGR Visualization

par(mfrow = c(1, 2), mar = c(2, 2, 3, 1))

cgr1 <- cgrplot(1)
plot(cgr1[, 1], cgr1[, 2],
    main = paste("CGR:", names(fasta_filtered)[1]),
    xlab = "", ylab = "", cex = 0.4, pch = 4, col = "steelblue"
)

cgr2 <- cgrplot(2)
plot(cgr2[, 1], cgr2[, 2],
    main = paste("CGR:", names(fasta_filtered)[2]),
    xlab = "", ylab = "", cex = 0.4, pch = 4, col = "tomato"
)
CGR plots for the first two sequences, showing the k-mer density patterns characteristic of each genome.

CGR plots for the first two sequences, showing the k-mer density patterns characteristic of each genome.

par(mfrow = c(1, 1))

Calculate Distance Matrix

distance_matrix <- calculateDistanceMatrix(freq_matrices,
    distance_type = "Euclidean"
)
print(round(distance_matrix, 4))
##           XBB.1  XBB.1  XBB.1  XBB.1 ba.2.75 ba.2.75 ba.2.75 ba.2.75   BJ.1
## XBB.1    0.0000 0.0002 0.0003 0.0002  0.0007  0.0007  0.0006  0.0007 0.0006
## XBB.1    0.0002 0.0000 0.0003 0.0003  0.0007  0.0007  0.0007  0.0007 0.0006
## XBB.1    0.0003 0.0003 0.0000 0.0003  0.0006  0.0006  0.0006  0.0006 0.0005
## XBB.1    0.0002 0.0003 0.0003 0.0000  0.0007  0.0007  0.0007  0.0007 0.0006
## ba.2.75  0.0007 0.0007 0.0006 0.0007  0.0000  0.0002  0.0002  0.0003 0.0007
## ba.2.75  0.0007 0.0007 0.0006 0.0007  0.0002  0.0000  0.0002  0.0003 0.0007
## ba.2.75  0.0006 0.0007 0.0006 0.0007  0.0002  0.0002  0.0000  0.0002 0.0007
## ba.2.75  0.0007 0.0007 0.0006 0.0007  0.0003  0.0003  0.0002  0.0000 0.0007
## BJ.1     0.0006 0.0006 0.0005 0.0006  0.0007  0.0007  0.0007  0.0007 0.0000
## BJ.1     0.0005 0.0006 0.0005 0.0006  0.0006  0.0006  0.0006  0.0006 0.0003
## BJ.1     0.0006 0.0006 0.0005 0.0006  0.0007  0.0007  0.0007  0.0007 0.0004
## BJ.1     0.0006 0.0006 0.0005 0.0006  0.0007  0.0007  0.0007  0.0007 0.0004
## MERS-CoV 0.0160 0.0161 0.0160 0.0160  0.0161  0.0161  0.0161  0.0161 0.0160
##            BJ.1   BJ.1   BJ.1 MERS-CoV
## XBB.1    0.0005 0.0006 0.0006   0.0160
## XBB.1    0.0006 0.0006 0.0006   0.0161
## XBB.1    0.0005 0.0005 0.0005   0.0160
## XBB.1    0.0006 0.0006 0.0006   0.0160
## ba.2.75  0.0006 0.0007 0.0007   0.0161
## ba.2.75  0.0006 0.0007 0.0007   0.0161
## ba.2.75  0.0006 0.0007 0.0007   0.0161
## ba.2.75  0.0006 0.0007 0.0007   0.0161
## BJ.1     0.0003 0.0004 0.0004   0.0160
## BJ.1     0.0000 0.0003 0.0003   0.0161
## BJ.1     0.0003 0.0000 0.0003   0.0160
## BJ.1     0.0003 0.0003 0.0000   0.0160
## MERS-CoV 0.0161 0.0160 0.0160   0.0000

Distance Matrix Heatmap

col_pal <- colorRampPalette(c("white", "steelblue", "darkblue"))(50)
n       <- nrow(distance_matrix)

image(seq_len(n), seq_len(n), distance_matrix,
    col  = col_pal, xaxt = "n", yaxt = "n",
    xlab = "", ylab = "",
    main = "Pairwise Distance Heatmap"
)
axis(1, at = seq_len(n), labels = rownames(distance_matrix),
    las = 2, cex.axis = 0.9)
axis(2, at = seq_len(n), labels = colnames(distance_matrix),
    las = 1, cex.axis = 0.9)
for (i in seq_len(n)) {
    for (j in seq_len(n)) {
        text(i, j, round(distance_matrix[i, j], 3), cex = 0.75)
    }
}
Pairwise Euclidean distance matrix. Lighter colours indicate smaller distances (more similar sequences).

Pairwise Euclidean distance matrix. Lighter colours indicate smaller distances (more similar sequences).

Build Phylogenetic Tree

nj_tree <- nj(as.dist(distance_matrix))

plot(nj_tree,
    main = "Neighbour-Joining Tree (CGR distances)",
    cex  = 0.9
)
Neighbour-joining tree built from CGR Euclidean distances for the bundled SARS-CoV-2 sequences.

Neighbour-joining tree built from CGR Euclidean distances for the bundled SARS-CoV-2 sequences.

Export Results

mega_file   <- tempfile(fileext = ".meg")
phylip_file <- tempfile(fileext = ".phy")

saveMegaDistance(mega_file, distance_matrix)
savePhylipDistance(phylip_file, distance_matrix, mode = "relaxed")

cat("--- MEGA file (first 6 lines) ---\n")
## --- MEGA file (first 6 lines) ---
writeLines(readLines(mega_file, n = 6))
## #mega
## !Title: Distance Matrix;
## !Format DataType=Distance DataFormat=LowerLeft NTaxa=13;
## 
## [
## #XBB.1
cat("\n--- PHYLIP file (first 6 lines) ---\n")
## 
## --- PHYLIP file (first 6 lines) ---
writeLines(readLines(phylip_file, n = 6))
##   13
## XBB.1             0    0.000247    0.000289    0.000219    0.000674    0.000688    0.000646    0.000661    0.000566    0.000535    0.000557     0.00057    0.015997
## XBB.1      0.000247           0    0.000313    0.000315    0.000677    0.000694    0.000657    0.000673    0.000584    0.000561    0.000574    0.000587     0.01607
## XBB.1      0.000289    0.000313           0    0.000333    0.000626    0.000643    0.000607    0.000611    0.000533    0.000512    0.000515    0.000506    0.016027
## XBB.1      0.000219    0.000315    0.000333           0    0.000697    0.000716    0.000675    0.000687     0.00059     0.00057    0.000585    0.000595    0.016016
## ba.2.75      0.000674    0.000677    0.000626    0.000697           0    0.000155    0.000182     0.00025    0.000688    0.000645    0.000708    0.000706    0.016051

Detailed Workflow

1. Bioconductor Integration

CGRphylo2 integrates with the Bioconductor ecosystem via from_DNAStringSet(), which converts a DNAStringSet object to the list format used by all CGRphylo2 functions.

library(Biostrings)
dna <- DNAStringSet(c(
    seq1 = "ATCGATCGATCGATCG",
    seq2 = "GCTAGCTAGCTAGCTA"
))
seqs <- from_DNAStringSet(dna)
cat("Sequences loaded:", length(seqs), "\n")
## Sequences loaded: 2

2. Data Preparation

We load the same bundled SARS-CoV-2 sequences used in the Quick Start:

library(seqinr)
fasta_path     <- system.file("extdata", "example_sequences.fasta",
                               package = "CGRphylo2")
fastafile      <- seqinr::read.fasta(fasta_path, seqtype = "DNA",
                                      as.string = TRUE, set.attributes = FALSE)
N_filter       <- 50
fasta_filtered <- filter_N(fastafile, N_filter)
meta           <- create_meta(fasta_filtered, N_filter)
len_trim       <- min(meta$length)
k_mer          <- 6

3. Distance Calculation Methods

CGRphylo2 supports three distance metrics:

## Recompute frequency matrices for this section
freq_matrices <- parallelCGR(fasta_filtered,
    k_mer    = k_mer,
    len_trim = len_trim,
    BPPARAM  = SerialParam()
)

par(mfrow = c(1, 3), mar = c(1, 1, 3, 1))
for (dtype in c("Euclidean", "Manhattan", "S_Euclidean")) {
    dist_d <- calculateDistanceMatrix(freq_matrices, distance_type = dtype)
    tree_d <- nj(as.dist(dist_d))
    plot(tree_d, main = dtype, cex = 0.8)
}
NJ trees using three distance metrics.

NJ trees using three distance metrics.

par(mfrow = c(1, 1))

4. Choosing K-mer Size

par(mfrow = c(1, 3), mar = c(1, 1, 3, 1))
for (k in c(2, 3, 4)) {
    freq_k <- parallelCGR(fasta_filtered, k_mer = k, len_trim = len_trim,
                          BPPARAM = SerialParam())
    dist_k <- calculateDistanceMatrix(freq_k, distance_type = "Euclidean")
    tree_k <- nj(as.dist(dist_k))
    plot(tree_k, main = paste0("k = ", k), cex = 0.8)
}
NJ trees at k = 2, 3, and 4.

NJ trees at k = 2, 3, and 4.

par(mfrow = c(1, 1))

5. Parallel Processing

freq_matrices <- parallelCGR(fasta_filtered,
    k_mer    = 6,
    len_trim = len_trim,
    BPPARAM  = SerialParam()
)
cat("Matrices computed:", length(freq_matrices), "\n")
## Matrices computed: 13

6. Memory Optimisation for Very Large Datasets

batch_size        <- 100
n_sequences       <- length(fasta_filtered)
n_batches         <- ceiling(n_sequences / batch_size)
all_freq_matrices <- list()

for (i in seq_len(n_batches)) {
    start_idx         <- (i - 1) * batch_size + 1
    end_idx           <- min(i * batch_size, n_sequences)
    batch_seqs        <- fasta_filtered[start_idx:end_idx]
    batch_freq        <- parallelCGR(batch_seqs, k_mer = 6, len_trim = len_trim,
                                     BPPARAM = BiocParallel::bpparam())
    all_freq_matrices <- c(all_freq_matrices, batch_freq)
}

Understanding CGR

The CGR Algorithm

Chaos Game Representation creates a 2D fractal representation of DNA sequences:

  1. Initialise: Start at centre of unit square (0.5, 0.5)
  2. Iterate: For each nucleotide, move halfway toward its corner — A: (0,0), C: (1,0), G: (1,1), T: (0,1)
  3. Result: Each sequence creates a unique compositional fingerprint

K-mer Frequencies

  • 4^k possible k-mers: k=6 gives 4,096 possible 6-mers
  • Normalised frequencies: each count divided by total k-mer count
  • Scale-invariant: robust to sequence length differences

Distance Metrics

  • Euclidean: sqrt(sum((f1-f2)^2)) — standard choice, balanced
  • Manhattan: sum(|f1-f2|) — more sensitive to small differences
  • Squared Euclidean: sum((f1-f2)^2) — amplifies large differences

Performance Benchmarks

Method 69 sequences 106 sequences
Clustal-Omega 5 minutes 13.7 minutes
CGRphylo (k=6) 1 minute 1 minute
Speedup 5x 13.7x

Memory usage scales linearly: each frequency matrix at k=6 is ~32 KB, so 1,000 sequences require ~32 MB.

Troubleshooting

Out of Memory

Process sequences in batches (see Memory Optimisation above).

Slow Performance

Use parallelCGR, reduce k-mer size, or filter low-quality sequences first.

Tree Topology Differs from MSA

CGRphylo captures compositional patterns; MSA captures position-specific mutations. Both are valid but emphasise different evolutionary signals.

Session Information

sessionInfo()
## R version 4.6.1 (2026-06-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 26.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] Biostrings_2.81.3   Seqinfo_1.3.0       XVector_0.53.0     
##  [4] IRanges_2.47.2      S4Vectors_0.51.3    BiocGenerics_0.59.7
##  [7] generics_0.1.4      BiocParallel_1.47.0 seqinr_4.2-44      
## [10] ape_5.8-1           CGRphylo2_0.99.2    BiocStyle_2.41.0   
## 
## loaded via a namespace (and not attached):
##  [1] ade4_1.7-24         jsonlite_2.0.0      compiler_4.6.1     
##  [4] BiocManager_1.30.27 crayon_1.5.3        Rcpp_1.1.1-1.1     
##  [7] parallel_4.6.1      jquerylib_0.1.4     yaml_2.3.12        
## [10] fastmap_1.2.0       lattice_0.22-9      R6_2.6.1           
## [13] knitr_1.51          MASS_7.3-65         maketools_1.3.2    
## [16] bslib_0.11.0        rlang_1.2.0         cachem_1.1.0       
## [19] xfun_0.59           sass_0.4.10         sys_3.4.3          
## [22] otel_0.2.0          cli_3.6.6           digest_0.6.39      
## [25] grid_4.6.1          lifecycle_1.0.5     nlme_3.1-169       
## [28] evaluate_1.0.5      codetools_0.2-20    buildtools_1.0.0   
## [31] rmarkdown_2.31      tools_4.6.1         htmltools_0.5.9

References

  1. Thind AS, Sinha S (2023). Using Chaos-Game-Representation for Analysing the SARS-CoV-2 Lineages, Newly Emerging Strains and Recombinants. Current Genomics, 24(3).

  2. Jeffrey HJ (1990). Chaos game representation of gene structure. Nucleic Acids Research, 18(8):2163-2170.

  3. Deschavanne PJ, et al. (1999). Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Molecular Biology and Evolution, 16(10):1391-1399.