| Title: | Chaos Game Representation for Phylogenetic Analysis |
|---|---|
| Description: | An alignment-free phylogenetic analysis method for viral genomes using Chaos Game Representation (CGR), a technique based on statistical physics concepts. Viruses exhibit high mutation rates, facilitating rapid evolution and emergence of new species, subspecies, strains, and recombinant forms. Accurate classification is crucial for understanding viral evolution and therapeutic development. Traditional phylogenetic methods require sequence alignment, which is computationally intensive. CGRphylo2 addresses this by implementing CGR-based whole-genome comparison that is fast, accurate, and computationally efficient. The package successfully classifies closely related viral lineages (demonstrated on SARS-CoV-2 lineages A and B), identifies recombinants (such as the XBB variant), and distinguishes multiple strains simultaneously. It processes sequences 5-13.7x faster than alignment-based methods (Clustal-Omega) with linear computational scaling. As a k-mer based approach, it enables simultaneous comparison of numerous closely-related sequences of different lengths. The package creates frequency matrices for distance calculations and phylogenetic tree construction, with outputs compatible with standard formats (MEGA, PHYLIP, Newick). Methods are based on Thind and Sinha (2023) <doi:10.2174/0113892029264990231013112156>. |
| Authors: | Amarinder Singh Thind [aut, cre] (ORCID: <https://orcid.org/0000-0003-4592-0380>) |
| Maintainer: | Amarinder Singh Thind <[email protected]> |
| License: | GPL-3 |
| Version: | 0.99.2 |
| Built: | 2026-06-28 11:13:12 UTC |
| Source: | https://github.com/bioc/CGRphylo2 |
Computes a full pairwise distance matrix from a list of CGR frequency matrices.
calculateDistanceMatrix(freq_matrices, distance_type = "Euclidean")

| Argument | Description |
|---|---|
| freq_matrices | List. A named list of frequency matrices, one per sequence, as returned by parallelCGR(). |
| distance_type | Character. Type of distance to calculate: "Euclidean" (default), "S_Euclidean", or "Manhattan". |

This function calculates all pairwise distances between sequences based on their CGR frequency matrices. The resulting distance matrix can be used for phylogenetic tree construction, clustering, or other downstream analyses.
Numeric matrix. A symmetric distance matrix with sequence names as row and column names.
# Build three minimal frequency matrices (4 k-mers each, normalized)
fm1 <- matrix(c(0.4, 0.3, 0.2, 0.1), ncol = 1,
  dimnames = list(c("A", "C", "G", "T"), NULL))
fm2 <- matrix(c(0.1, 0.2, 0.3, 0.4), ncol = 1,
  dimnames = list(c("A", "C", "G", "T"), NULL))
fm3 <- matrix(c(0.25, 0.25, 0.25, 0.25), ncol = 1,
  dimnames = list(c("A", "C", "G", "T"), NULL))
freq_list <- list(Seq1 = fm1, Seq2 = fm2, Seq3 = fm3)
# Compute all pairwise distances and show the rounded matrix
dist_matrix <- calculateDistanceMatrix(freq_list, distance_type = "Euclidean")
print(round(dist_matrix, 4))
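For readers who want to see what an all-pairs computation of this kind involves, the following base-R sketch may help. It is not the package's implementation: it assumes each frequency matrix can be flattened to a vector and that "Euclidean" means the ordinary L2 distance, and it uses stats::dist() in place of calculateDistanceMatrix(). The name pairwise_sketch is hypothetical.

```r
# Hypothetical stand-in for an all-pairs distance computation (base R only).
# Assumption: "Euclidean" is the ordinary L2 distance between flattened matrices.
pairwise_sketch <- function(freq_matrices, method = "euclidean") {
  # One row per sequence; row names are taken from the names of the list
  flat <- do.call(rbind, lapply(freq_matrices, as.vector))
  as.matrix(stats::dist(flat, method = method))
}

freq_list <- list(
  Seq1 = matrix(c(0.4, 0.3, 0.2, 0.1), ncol = 1),
  Seq2 = matrix(c(0.1, 0.2, 0.3, 0.4), ncol = 1),
  Seq3 = matrix(c(0.25, 0.25, 0.25, 0.25), ncol = 1)
)
d <- pairwise_sketch(freq_list)
print(round(d, 4))

# One possible downstream step: average-linkage clustering (UPGMA)
tree <- stats::hclust(stats::as.dist(d), method = "average")
```

The result is a symmetric matrix with a zero diagonal, so it can be passed to stats::as.dist() for clustering or tree building, as in the last line.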
Creates x and y coordinates for visualizing a Chaos Game Representation (CGR) plot of a DNA sequence. The CGR plot is a 2D representation that reveals patterns and composition of genomic sequences.
cgrplot(seq_index)

| Argument | Description |
|---|---|
| seq_index | Integer. The index of the sequence in the global fasta_filtered object to plot. |

The Chaos Game Representation is an iterative mapping technique that creates a 2D representation of DNA sequences. Each base (A, C, G, T) is assigned to a corner of a unit square, and the sequence is plotted by iteratively moving halfway from the current position to the corner corresponding to the next base.
The resulting plot has fractal properties and reveals sequence composition, repeats, and other genomic features. Different sequences create distinct patterns that reflect their underlying genomic structure.
Corner assignments (standard):
A: (0, 0) - bottom left
C: (1, 0) - bottom right
G: (1, 1) - top right
T: (0, 1) - top left
Matrix with two columns (x and y coordinates) for plotting the CGR. Each row represents the position of one nucleotide in the CGR space.
Jeffrey HJ (1990). Chaos game representation of gene structure. Nucleic Acids Research, 18(8):2163-2170.
# cgrplot() reads sequences from a global object named fasta_filtered
assign("fasta_filtered", list(seq1 = "ATCGATCGATCGATCGATCG"), envir = .GlobalEnv)
cgr_coords <- cgrplot(1)
plot(cgr_coords[, 1], cgr_coords[, 2],
  main = "CGR Plot", xlab = "", ylab = "",
  cex = 0.5, pch = 4
)
# Clean up the global object
rm(fasta_filtered, envir = .GlobalEnv)
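The iterative mapping described above (move halfway from the current point to the corner of the next base) can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the code behind cgrplot(): the starting point (0.5, 0.5) and the skipping of ambiguous bases are assumptions, and cgr_coordinates is a hypothetical name. The corner assignments are the ones listed above.

```r
# Hypothetical sketch of the chaos game iteration (base R only).
# Assumptions: start at the centre of the unit square; skip non-ACGT bases.
cgr_coordinates <- function(dna) {
  corners <- rbind(A = c(0, 0), C = c(1, 0), G = c(1, 1), T = c(0, 1))
  bases <- strsplit(toupper(dna), "")[[1]]
  bases <- bases[bases %in% rownames(corners)]
  coords <- matrix(NA_real_, nrow = length(bases), ncol = 2,
                   dimnames = list(NULL, c("x", "y")))
  pos <- c(0.5, 0.5)
  for (i in seq_along(bases)) {
    # Move halfway from the current position to the corner of this base
    pos <- (pos + corners[bases[i], ]) / 2
    coords[i, ] <- pos
  }
  coords
}

xy <- cgr_coordinates("ATCG")
print(xy)
```

Each row holds the position reached after one nucleotide, so the matrix can be passed directly to plot(xy[, "x"], xy[, "y"]).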
Extracts and compiles metadata including sequence length, GC content, and N content from FASTA sequences.
create_meta(fastafile, N_filter)

| Argument | Description |
|---|---|
| fastafile | List. A list of DNA sequences read by seqinr::read.fasta(). |
| N_filter | Integer. N filter threshold (for reference in output). |

This function provides useful summary statistics for quality control and understanding sequence characteristics before phylogenetic analysis.
data.frame. A data frame with columns: name, length, GC_content, N_content
# Create test sequences
test_seqs <- list(
  seq1 = "ATCGATCGATCG",
  seq2 = "GCTAGCTAGCTA",
  seq3 = "AAAATTTTCCCCGGGG"
)
# Get metadata
meta <- create_meta(test_seqs, N_filter = 50)
print(meta)
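As a rough illustration of the summary statistics named above, the following base-R sketch computes length, GC content, and N content for a list of sequences. It is not create_meta() itself: reporting GC content as a percentage of all characters and N content as a raw count are assumptions, and sequence_summary is a hypothetical name.

```r
# Hypothetical sketch of per-sequence summary statistics (base R only).
# Assumptions: GC content is a percentage of all characters; N content is a count.
sequence_summary <- function(seqs) {
  stats <- lapply(seqs, function(s) {
    # Accept either a single string or a vector of single characters
    bases <- strsplit(toupper(paste(s, collapse = "")), "")[[1]]
    data.frame(
      length = length(bases),
      GC_content = 100 * mean(bases %in% c("G", "C")),
      N_content = sum(bases == "N")
    )
  })
  data.frame(name = names(seqs), do.call(rbind, stats), row.names = NULL)
}

meta_sketch <- sequence_summary(list(
  seq1 = "ATCGATCGATCG",
  seq2 = "AAAATTTTCCCCGGGG",
  seq3 = "ATCGNN"
))
print(meta_sketch)
```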
Filters a list of DNA sequences by removing those with too many ambiguous (N) bases.
filter_N(fastafile, N_filter)

| Argument | Description |
|---|---|
| fastafile | List. A named list of DNA sequences (e.g. from …). |
| N_filter | Integer. Maximum number of N bases allowed in a sequence. Sequences with more N bases than this threshold are removed. |

This function is useful for quality control before phylogenetic analysis. Sequences with excessive ambiguous bases can affect the accuracy of distance calculations and tree construction.
List. Filtered list of sequences.
# bad_seq has 5 N bases, okay_seq has 2, good_seq has none
test_seqs <- list(
  good_seq = "ATCGATCG",
  bad_seq = "ATCGNNNNNATCG",
  okay_seq = "ATCGNNATCG"
)
filtered <- filter_N(test_seqs, N_filter = 3)
length(filtered) # Should be 2
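The filtering rule described above (drop any sequence with more N bases than the threshold) can be written as a short base-R sketch. This is an illustration rather than the package's code; drop_high_N is a hypothetical name, and treating lowercase n as N is an assumption.

```r
# Hypothetical sketch of N-based filtering (base R only).
# Assumption: case is ignored, so "n" counts as an ambiguous base.
drop_high_N <- function(seqs, max_N) {
  n_count <- vapply(seqs, function(s) {
    bases <- strsplit(toupper(paste(s, collapse = "")), "")[[1]]
    sum(bases == "N")
  }, integer(1))
  # Keep sequences whose N count does not exceed the threshold
  seqs[n_count <= max_N]
}

kept <- drop_high_N(
  list(good_seq = "ATCGATCG", bad_seq = "ATCGNNNNNATCG", okay_seq = "ATCGNNATCG"),
  max_N = 3
)
names(kept)
```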
Converts a DNAStringSet object to a named list of
character strings suitable for use with CGRphylo2 functions such as
filter_N, create_meta, and parallelCGR.
from_DNAStringSet(dna)

| Argument | Description |
|---|---|
| dna | A DNAStringSet object. |

Bioconductor workflows commonly store DNA sequences as
DNAStringSet objects. This function bridges that format with
CGRphylo2's internal list representation, allowing seamless use of
Bioconductor data structures in CGR-based phylogenetic analysis.
A named list of character strings, one element per sequence.
# Run only when Biostrings is installed
if (requireNamespace("Biostrings", quietly = TRUE)) {
  dna <- Biostrings::DNAStringSet(c(
    seq1 = "ATCGATCGATCGATCG",
    seq2 = "GCTAGCTAGCTAGCTA"
  ))
  seqs <- from_DNAStringSet(dna)
  length(seqs) # 2
  nchar(seqs[[1]]) # 16
}
Computes the distance between two CGR frequency matrices using the specified distance metric.
matrixDistance(matrix1, matrix2, distance_type = "Euclidean")

| Argument | Description |
|---|---|
| matrix1 | Numeric matrix. First CGR frequency matrix. |
| matrix2 | Numeric matrix. Second CGR frequency matrix. |
| distance_type | Character. Type of distance to calculate. Options are "Euclidean" (default), "S_Euclidean", or "Manhattan". |

This function calculates pairwise distances between CGR frequency matrices. The Euclidean distance is most commonly used, but Manhattan and squared Euclidean distances are also available for specific applications.
Numeric. The calculated distance between the two matrices.
# Create two simple frequency matrices
matrix1 <- matrix(c(0.1, 0.2, 0.3, 0.4), ncol = 1)
matrix2 <- matrix(c(0.15, 0.25, 0.25, 0.35), ncol = 1)
# Calculate Euclidean distance
dist_euclidean <- matrixDistance(matrix1, matrix2, distance_type = "Euclidean")
print(dist_euclidean)
# Calculate Manhattan distance
dist_manhattan <- matrixDistance(matrix1, matrix2, distance_type = "Manhattan")
print(dist_manhattan)
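To make the three metrics concrete, here is a minimal base-R sketch of their textbook definitions. It is not matrixDistance() itself: reading "S_Euclidean" as the squared Euclidean distance is an assumption based on the description above, and distance_sketch is a hypothetical name.

```r
# Hypothetical sketch of the three metrics using their textbook definitions.
# Assumption: "S_Euclidean" denotes the squared Euclidean distance.
distance_sketch <- function(m1, m2,
                            type = c("Euclidean", "S_Euclidean", "Manhattan")) {
  type <- match.arg(type)
  stopifnot(identical(dim(m1), dim(m2)))
  delta <- m1 - m2
  switch(type,
    Euclidean = sqrt(sum(delta^2)),  # L2 norm of the element-wise difference
    S_Euclidean = sum(delta^2),      # L2 norm squared (no square root)
    Manhattan = sum(abs(delta))      # L1 norm: sum of absolute differences
  )
}

m1 <- matrix(c(0.1, 0.2, 0.3, 0.4), ncol = 1)
m2 <- matrix(c(0.15, 0.25, 0.25, 0.35), ncol = 1)
d_euc <- distance_sketch(m1, m2, "Euclidean")
d_sq <- distance_sketch(m1, m2, "S_Euclidean")
d_man <- distance_sketch(m1, m2, "Manhattan")
```

Every element differs by 0.05 in this toy input, so the three values differ only in how those four differences are aggregated.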
Efficiently calculates CGR frequency matrices for multiple sequences using BiocParallel for cross-platform parallel processing.
parallelCGR(sequences, k_mer, len_trim, BPPARAM = BiocParallel::bpparam())
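
| Argument | Description |
|---|---|
| sequences | List. A named list of DNA sequences (from …). |
| k_mer | Integer. The k-mer size for frequency calculation. |
| len_trim | Integer. Length to trim all sequences to. |
| BPPARAM | A BiocParallelParam object selecting the parallel back-end. The default is BiocParallel::bpparam(). |
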
This function uses BiocParallel::bplapply to dispatch computation
across available cores. The BPPARAM argument lets callers choose the
back-end: MulticoreParam on Linux/macOS, SnowParam on Windows,
or SerialParam for sequential execution.
Named list. A list of frequency matrices, one per sequence.
seqs <- list(
  Seq1 = "ATCGATCGATCGATCGATCG",
  Seq2 = "GCTAGCTAGCTAGCTAGCTA"
)
# SerialParam() runs sequentially, which is enough for a small example
freq_mats <- parallelCGR(seqs, k_mer = 2, len_trim = 20,
  BPPARAM = BiocParallel::SerialParam())
cat("Matrices computed:", length(freq_mats), "\n")
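The per-sequence work that gets dispatched across cores is, in essence, a k-mer frequency count. The sketch below shows one plausible version in base R, applied sequentially with lapply(). It is an illustration under assumptions, not the package's algorithm: trimming keeps the first len_trim characters, k-mers containing non-ACGT characters are ignored, frequencies are normalized to sum to 1, and the result is a one-column matrix. The name kmer_frequencies_sketch is hypothetical.

```r
# Hypothetical sketch of a normalized k-mer frequency count (base R only).
# Assumptions: trim to the first len_trim characters; ignore k-mers with
# non-ACGT characters; return a one-column matrix that sums to 1.
kmer_frequencies_sketch <- function(dna, k, len_trim) {
  dna <- substr(toupper(dna), 1, len_trim)
  stopifnot(nchar(dna) >= k)
  # Sliding window of width k over the trimmed sequence
  starts <- seq_len(nchar(dna) - k + 1)
  kmers <- substring(dna, starts, starts + k - 1)
  # Enumerate all 4^k possible k-mers so every matrix has the same rows
  alphabet <- c("A", "C", "G", "T")
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  all_kmers <- do.call(paste0, grid)
  counts <- table(factor(kmers, levels = all_kmers))
  matrix(as.numeric(counts) / max(sum(counts), 1), ncol = 1,
         dimnames = list(all_kmers, NULL))
}

seqs <- list(Seq1 = "ATCGATCG", Seq2 = "GGGGCCCC")
freqs <- lapply(seqs, kmer_frequencies_sketch, k = 2, len_trim = 8)
f1 <- freqs$Seq1
```

Because every call is independent of the others, the lapply() step is the natural place to substitute a parallel apply function.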
A convenience wrapper function to create a CGR plot with sensible defaults.
plot_cgr(seq_index, main = NULL, cex = 0.2, pch = 4, col = "black", ...)

| Argument | Description |
|---|---|
| seq_index | Integer. The index of the sequence to plot. |
| main | Character. Main title for the plot. If NULL, uses sequence name. |
| cex | Numeric. Point size (default 0.2 for dense sequences). |
| pch | Integer. Point character type (default 4 for crosses). |
| col | Character. Color for points (default "black"). |
| ... | Additional arguments passed to plot(). |

NULL. Creates a plot as a side effect.
# plot_cgr() reads sequences from a global object named fasta_filtered
assign("fasta_filtered", list(seq1 = "ATCGATCGATCGATCGATCG"), envir = .GlobalEnv)
plot_cgr(1)
plot_cgr(1, main = "My Sequence", col = "blue", cex = 0.3)
# Clean up the global object
rm(fasta_filtered, envir = .GlobalEnv)
Exports a distance matrix to MEGA format for use with MEGA software for phylogenetic tree visualization and analysis.
saveMegaDistance(filename, distance_matrix)

| Argument | Description |
|---|---|
| filename | Character. Output filename (typically with .meg extension). |
| distance_matrix | Numeric matrix. A square distance matrix with row and column names corresponding to sequence identifiers. |

MEGA (Molecular Evolutionary Genetics Analysis) is widely used for phylogenetic analysis. This function creates a distance matrix file compatible with MEGA's format specifications.
NULL. Writes file as a side effect.
# Build a small symmetric distance matrix
dist_mat <- matrix(
  c(0.00, 0.12, 0.25,
    0.12, 0.00, 0.18,
    0.25, 0.18, 0.00),
  nrow = 3
)
rownames(dist_mat) <- colnames(dist_mat) <- c("Seq1", "Seq2", "Seq3")
# Save to a temporary file (no permanent files written during checks)
out <- tempfile(fileext = ".meg")
saveMegaDistance(out, dist_mat)
# Inspect the first few lines of the output
writeLines(readLines(out, n = 8))
Exports a distance matrix to PHYLIP format for phylogenetic analysis with various bioinformatics tools.
savePhylipDistance(filename, distance_matrix, mode = "relaxed")

| Argument | Description |
|---|---|
| filename | Character. Output filename (typically .txt or .phy extension). |
| distance_matrix | Numeric matrix. A square distance matrix with row and column names corresponding to sequence identifiers. |
| mode | Character. PHYLIP format mode; the default is "relaxed". |

PHYLIP format is widely supported by phylogenetic software. The original format limits sequence names to 10 characters, while the relaxed format allows longer names. The relaxed format is recommended for modern applications.
NULL. Writes file as a side effect.
PHYLIP format specification: http://www.phylo.org/index.php/help/relaxed_phylip
# Build a small symmetric distance matrix
dist_mat <- matrix(
  c(0.00, 0.12, 0.25,
    0.12, 0.00, 0.18,
    0.25, 0.18, 0.00),
  nrow = 3
)
rownames(dist_mat) <- colnames(dist_mat) <- c("Seq1", "Seq2", "Seq3")
# Save in relaxed PHYLIP format to a temporary file
out <- tempfile(fileext = ".phy")
savePhylipDistance(out, dist_mat, mode = "relaxed")
# Inspect output
writeLines(readLines(out))
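For orientation, a square PHYLIP distance file conventionally starts with the number of taxa, followed by one line per taxon holding its name and its row of distances. The base-R sketch below writes that layout with names separated from values by a single space, in the spirit of the relaxed format. It is an assumption-laden illustration, not savePhylipDistance(): the six-decimal formatting and the exact spacing are choices made here, and write_phylip_sketch is a hypothetical name.

```r
# Hypothetical sketch of a relaxed, square PHYLIP distance writer (base R only).
# Assumptions: six decimal places; name and values separated by single spaces.
write_phylip_sketch <- function(filename, d) {
  stopifnot(is.matrix(d), nrow(d) == ncol(d), !is.null(rownames(d)))
  rows <- vapply(seq_len(nrow(d)), function(i) {
    values <- formatC(d[i, ], format = "f", digits = 6)
    paste(rownames(d)[i], paste(values, collapse = " "))
  }, character(1))
  # First line: number of taxa; then one line per taxon
  writeLines(c(as.character(nrow(d)), rows), filename)
}

dist_mat <- matrix(
  c(0.00, 0.12, 0.25,
    0.12, 0.00, 0.18,
    0.25, 0.18, 0.00),
  nrow = 3,
  dimnames = list(c("Seq1", "Seq2", "Seq3"), c("Seq1", "Seq2", "Seq3"))
)
out_sketch <- tempfile(fileext = ".phy")
write_phylip_sketch(out_sketch, dist_mat)
phylip_lines <- readLines(out_sketch)
```

The original (strict) format differs mainly in the name field, which is limited to 10 characters as noted above.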