CGRphylo2 provides an alignment-free phylogenetic analysis method for viral genomes using Chaos Game Representation (CGR), a technique based on statistical physics concepts. Viruses exhibit high mutation rates, facilitating rapid evolution and emergence of new species, subspecies, strains, and recombinant forms. Accurate classification is crucial for understanding viral evolution and therapeutic development. Traditional phylogenetic methods require sequence alignment, which is computationally intensive. CGRphylo2 addresses this by implementing CGR-based whole-genome comparison that is fast, accurate, and computationally efficient. The package successfully classifies closely related viral lineages (demonstrated on SARS-CoV-2 lineages A and B), identifies recombinants (such as the XBB variant), and distinguishes multiple strains simultaneously. It processes sequences 5-13.7x faster than alignment-based methods (Clustal-Omega) with linear computational scaling. As a k-mer based approach, it enables simultaneous comparison of numerous closely-related sequences of different lengths. The package creates frequency matrices for distance calculations and phylogenetic tree construction, with outputs compatible with standard formats (MEGA, PHYLIP, Newick). Earlier version reference is Thind and Sinha (2023) doi:10.2174/0113892029264990231013112156.
CGRphylo2 provides an efficient alignment-free approach for phylogenetic analysis of viral genomes using Chaos Game Representation (CGR). The package offers several key advantages:
CGRphylo’s efficiency comes from its alignment-free approach:
If you use CGRphylo in your research, please cite:
Thind AS, Sinha S (2023). Using Chaos-Game-Representation for Analysing the SARS-CoV-2 Lineages, Newly Emerging Strains and Recombinants. Current Genomics, 24(3). doi:10.2174/0113892029264990231013112156
This example demonstrates the complete CGRphylo2 workflow using the
SARS-CoV-2 sequences bundled with the package in
inst/extdata.
We load the SARS-CoV-2 sequences bundled with the package and filter out any with excessive ambiguous (N) bases.
library(seqinr)
fasta_path <- system.file("extdata", "example_sequences.fasta",
package = "CGRphylo2")
fastafile <- seqinr::read.fasta(fasta_path, seqtype = "DNA",
as.string = TRUE, set.attributes = FALSE)
N_filter <- 50
fasta_filtered <- filter_N(fastafile, N_filter)
cat("Sequences retained:", length(fasta_filtered), "\n")## Sequences retained: 13
## XBB.1 : length = 29823 bp
## XBB.1 : length = 29823 bp
## XBB.1 : length = 29823 bp
## XBB.1 : length = 29823 bp
## ba.2.75 : length = 29652 bp
## ba.2.75 : length = 29652 bp
## ba.2.75 : length = 29652 bp
## ba.2.75 : length = 29652 bp
## BJ.1 : length = 29754 bp
## BJ.1 : length = 29754 bp
## BJ.1 : length = 29754 bp
## BJ.1 : length = 29754 bp
## MERS-CoV : length = 29994 bp
## name length GC_content N_content
## 1 XBB.1 29823 37.86675 30
## 2 XBB.1 29818 37.85968 25
## 3 XBB.1 29803 37.85525 22
## 4 XBB.1 29773 37.90347 12
## 5 ba.2.75 29652 37.88615 2
## 6 ba.2.75 29652 37.88277 1
## 7 ba.2.75 29652 37.89289 1
## 8 ba.2.75 29652 37.88277 1
## 9 BJ.1 29754 37.92767 0
## 10 BJ.1 29726 37.90621 1
## 11 BJ.1 29757 37.91041 0
## 12 BJ.1 29724 37.88857 0
## 13 MERS-CoV 29994 41.15490 0
par(mfrow = c(1, 2), mar = c(4, 6, 3, 1))
dotchart(meta$length,
labels = meta$name, xlab = "Sequence length (bp)",
pch = 21, bg = "green", pt.cex = 1.4, cex = 0.9,
main = "Sequence Lengths"
)
dotchart(meta$GC_content,
labels = meta$name, xlab = "GC content (%)",
pch = 21, bg = "steelblue", pt.cex = 1.4, cex = 0.9,
main = "GC Content"
)Sequence lengths (left) and GC content (right) for the bundled SARS-CoV-2 sequences.
The k-mer size determines the resolution of the analysis. Typical values for real genome data are 4-8; we use k=4 here.
library(BiocParallel)
k_mer <- 4 ## use 6 for real genome-scale data (virus) ~30k bp
len_trim <- min(nchar(unlist(fasta_filtered)))
## SerialParam() required in vignettes — R CMD check forbids >2 parallel processes
freq_matrices <- parallelCGR(fasta_filtered,
k_mer = k_mer,
len_trim = len_trim,
BPPARAM = SerialParam()
)
cat("Sequences processed:", length(freq_matrices), "\n")## Sequences processed: 13
## K-mers per matrix : 256 (4^4)
par(mfrow = c(1, 2), mar = c(2, 2, 3, 1))
cgr1 <- cgrplot(1)
plot(cgr1[, 1], cgr1[, 2],
main = paste("CGR:", names(fasta_filtered)[1]),
xlab = "", ylab = "", cex = 0.4, pch = 4, col = "steelblue"
)
cgr2 <- cgrplot(2)
plot(cgr2[, 1], cgr2[, 2],
main = paste("CGR:", names(fasta_filtered)[2]),
xlab = "", ylab = "", cex = 0.4, pch = 4, col = "tomato"
)CGR plots for the first two sequences, showing the k-mer density patterns characteristic of each genome.
distance_matrix <- calculateDistanceMatrix(freq_matrices,
distance_type = "Euclidean"
)
print(round(distance_matrix, 4))## XBB.1 XBB.1 XBB.1 XBB.1 ba.2.75 ba.2.75 ba.2.75 ba.2.75 BJ.1
## XBB.1 0.0000 0.0002 0.0003 0.0002 0.0007 0.0007 0.0006 0.0007 0.0006
## XBB.1 0.0002 0.0000 0.0003 0.0003 0.0007 0.0007 0.0007 0.0007 0.0006
## XBB.1 0.0003 0.0003 0.0000 0.0003 0.0006 0.0006 0.0006 0.0006 0.0005
## XBB.1 0.0002 0.0003 0.0003 0.0000 0.0007 0.0007 0.0007 0.0007 0.0006
## ba.2.75 0.0007 0.0007 0.0006 0.0007 0.0000 0.0002 0.0002 0.0003 0.0007
## ba.2.75 0.0007 0.0007 0.0006 0.0007 0.0002 0.0000 0.0002 0.0003 0.0007
## ba.2.75 0.0006 0.0007 0.0006 0.0007 0.0002 0.0002 0.0000 0.0002 0.0007
## ba.2.75 0.0007 0.0007 0.0006 0.0007 0.0003 0.0003 0.0002 0.0000 0.0007
## BJ.1 0.0006 0.0006 0.0005 0.0006 0.0007 0.0007 0.0007 0.0007 0.0000
## BJ.1 0.0005 0.0006 0.0005 0.0006 0.0006 0.0006 0.0006 0.0006 0.0003
## BJ.1 0.0006 0.0006 0.0005 0.0006 0.0007 0.0007 0.0007 0.0007 0.0004
## BJ.1 0.0006 0.0006 0.0005 0.0006 0.0007 0.0007 0.0007 0.0007 0.0004
## MERS-CoV 0.0160 0.0161 0.0160 0.0160 0.0161 0.0161 0.0161 0.0161 0.0160
## BJ.1 BJ.1 BJ.1 MERS-CoV
## XBB.1 0.0005 0.0006 0.0006 0.0160
## XBB.1 0.0006 0.0006 0.0006 0.0161
## XBB.1 0.0005 0.0005 0.0005 0.0160
## XBB.1 0.0006 0.0006 0.0006 0.0160
## ba.2.75 0.0006 0.0007 0.0007 0.0161
## ba.2.75 0.0006 0.0007 0.0007 0.0161
## ba.2.75 0.0006 0.0007 0.0007 0.0161
## ba.2.75 0.0006 0.0007 0.0007 0.0161
## BJ.1 0.0003 0.0004 0.0004 0.0160
## BJ.1 0.0000 0.0003 0.0003 0.0161
## BJ.1 0.0003 0.0000 0.0003 0.0160
## BJ.1 0.0003 0.0003 0.0000 0.0160
## MERS-CoV 0.0161 0.0160 0.0160 0.0000
col_pal <- colorRampPalette(c("white", "steelblue", "darkblue"))(50)
n <- nrow(distance_matrix)
image(seq_len(n), seq_len(n), distance_matrix,
col = col_pal, xaxt = "n", yaxt = "n",
xlab = "", ylab = "",
main = "Pairwise Distance Heatmap"
)
axis(1, at = seq_len(n), labels = rownames(distance_matrix),
las = 2, cex.axis = 0.9)
axis(2, at = seq_len(n), labels = colnames(distance_matrix),
las = 1, cex.axis = 0.9)
for (i in seq_len(n)) {
for (j in seq_len(n)) {
text(i, j, round(distance_matrix[i, j], 3), cex = 0.75)
}
}Pairwise Euclidean distance matrix. Lighter colours indicate smaller distances (more similar sequences).
nj_tree <- nj(as.dist(distance_matrix))
plot(nj_tree,
main = "Neighbour-Joining Tree (CGR distances)",
cex = 0.9
)Neighbour-joining tree built from CGR Euclidean distances for the bundled SARS-CoV-2 sequences.
mega_file <- tempfile(fileext = ".meg")
phylip_file <- tempfile(fileext = ".phy")
saveMegaDistance(mega_file, distance_matrix)
savePhylipDistance(phylip_file, distance_matrix, mode = "relaxed")
cat("--- MEGA file (first 6 lines) ---\n")## --- MEGA file (first 6 lines) ---
## #mega
## !Title: Distance Matrix;
## !Format DataType=Distance DataFormat=LowerLeft NTaxa=13;
##
## [
## #XBB.1
##
## --- PHYLIP file (first 6 lines) ---
## 13
## XBB.1 0 0.000247 0.000289 0.000219 0.000674 0.000688 0.000646 0.000661 0.000566 0.000535 0.000557 0.00057 0.015997
## XBB.1 0.000247 0 0.000313 0.000315 0.000677 0.000694 0.000657 0.000673 0.000584 0.000561 0.000574 0.000587 0.01607
## XBB.1 0.000289 0.000313 0 0.000333 0.000626 0.000643 0.000607 0.000611 0.000533 0.000512 0.000515 0.000506 0.016027
## XBB.1 0.000219 0.000315 0.000333 0 0.000697 0.000716 0.000675 0.000687 0.00059 0.00057 0.000585 0.000595 0.016016
## ba.2.75 0.000674 0.000677 0.000626 0.000697 0 0.000155 0.000182 0.00025 0.000688 0.000645 0.000708 0.000706 0.016051
CGRphylo2 integrates with the Bioconductor ecosystem via
from_DNAStringSet(), which converts a
DNAStringSet object to the list format used by all
CGRphylo2 functions.
library(Biostrings)
dna <- DNAStringSet(c(
seq1 = "ATCGATCGATCGATCG",
seq2 = "GCTAGCTAGCTAGCTA"
))
seqs <- from_DNAStringSet(dna)
cat("Sequences loaded:", length(seqs), "\n")## Sequences loaded: 2
We load the same bundled SARS-CoV-2 sequences used in the Quick Start:
library(seqinr)
fasta_path <- system.file("extdata", "example_sequences.fasta",
package = "CGRphylo2")
fastafile <- seqinr::read.fasta(fasta_path, seqtype = "DNA",
as.string = TRUE, set.attributes = FALSE)
N_filter <- 50
fasta_filtered <- filter_N(fastafile, N_filter)
meta <- create_meta(fasta_filtered, N_filter)
len_trim <- min(meta$length)
k_mer <- 6CGRphylo2 supports three distance metrics:
## Recompute frequency matrices for this section
freq_matrices <- parallelCGR(fasta_filtered,
k_mer = k_mer,
len_trim = len_trim,
BPPARAM = SerialParam()
)
par(mfrow = c(1, 3), mar = c(1, 1, 3, 1))
for (dtype in c("Euclidean", "Manhattan", "S_Euclidean")) {
dist_d <- calculateDistanceMatrix(freq_matrices, distance_type = dtype)
tree_d <- nj(as.dist(dist_d))
plot(tree_d, main = dtype, cex = 0.8)
}NJ trees using three distance metrics.
par(mfrow = c(1, 3), mar = c(1, 1, 3, 1))
for (k in c(2, 3, 4)) {
freq_k <- parallelCGR(fasta_filtered, k_mer = k, len_trim = len_trim,
BPPARAM = SerialParam())
dist_k <- calculateDistanceMatrix(freq_k, distance_type = "Euclidean")
tree_k <- nj(as.dist(dist_k))
plot(tree_k, main = paste0("k = ", k), cex = 0.8)
}NJ trees at k = 2, 3, and 4.
freq_matrices <- parallelCGR(fasta_filtered,
k_mer = 6,
len_trim = len_trim,
BPPARAM = SerialParam()
)
cat("Matrices computed:", length(freq_matrices), "\n")## Matrices computed: 13
batch_size <- 100
n_sequences <- length(fasta_filtered)
n_batches <- ceiling(n_sequences / batch_size)
all_freq_matrices <- list()
for (i in seq_len(n_batches)) {
start_idx <- (i - 1) * batch_size + 1
end_idx <- min(i * batch_size, n_sequences)
batch_seqs <- fasta_filtered[start_idx:end_idx]
batch_freq <- parallelCGR(batch_seqs, k_mer = 6, len_trim = len_trim,
BPPARAM = BiocParallel::bpparam())
all_freq_matrices <- c(all_freq_matrices, batch_freq)
}Chaos Game Representation creates a 2D fractal representation of DNA sequences:
| Method | 69 sequences | 106 sequences |
|---|---|---|
| Clustal-Omega | 5 minutes | 13.7 minutes |
| CGRphylo (k=6) | 1 minute | 1 minute |
| Speedup | 5x | 13.7x |
Memory usage scales linearly: each frequency matrix at k=6 is ~32 KB, so 1,000 sequences require ~32 MB.
Process sequences in batches (see Memory Optimisation above).
Use parallelCGR, reduce k-mer size, or filter
low-quality sequences first.
CGRphylo captures compositional patterns; MSA captures position-specific mutations. Both are valid but emphasise different evolutionary signals.
## R version 4.6.1 (2026-06-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 26.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] Biostrings_2.81.3 Seqinfo_1.3.0 XVector_0.53.0
## [4] IRanges_2.47.2 S4Vectors_0.51.3 BiocGenerics_0.59.7
## [7] generics_0.1.4 BiocParallel_1.47.0 seqinr_4.2-44
## [10] ape_5.8-1 CGRphylo2_0.99.2 BiocStyle_2.41.0
##
## loaded via a namespace (and not attached):
## [1] ade4_1.7-24 jsonlite_2.0.0 compiler_4.6.1
## [4] BiocManager_1.30.27 crayon_1.5.3 Rcpp_1.1.1-1.1
## [7] parallel_4.6.1 jquerylib_0.1.4 yaml_2.3.12
## [10] fastmap_1.2.0 lattice_0.22-9 R6_2.6.1
## [13] knitr_1.51 MASS_7.3-65 maketools_1.3.2
## [16] bslib_0.11.0 rlang_1.2.0 cachem_1.1.0
## [19] xfun_0.59 sass_0.4.10 sys_3.4.3
## [22] otel_0.2.0 cli_3.6.6 digest_0.6.39
## [25] grid_4.6.1 lifecycle_1.0.5 nlme_3.1-169
## [28] evaluate_1.0.5 codetools_0.2-20 buildtools_1.0.0
## [31] rmarkdown_2.31 tools_4.6.1 htmltools_0.5.9
Thind AS, Sinha S (2023). Using Chaos-Game-Representation for Analysing the SARS-CoV-2 Lineages, Newly Emerging Strains and Recombinants. Current Genomics, 24(3).
Jeffrey HJ (1990). Chaos game representation of gene structure. Nucleic Acids Research, 18(8):2163-2170.
Deschavanne PJ, et al. (1999). Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Molecular Biology and Evolution, 16(10):1391-1399.