To install this package, start R (version “4.4”) and enter:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BioGA")
You can also install the package directly from GitHub using the
devtools
package:
In this vignette, we illustrate the usage of BioGA for genetic algorithm optimization in the context of high throughput genomic data analysis. We showcase its interoperability with Bioconductor classes, demonstrating how genetic algorithm optimization can be seamlessly integrated into existing genomics pipelines for improved analysis and interpretation.
The BioGA package provides a set of functions for genetic algorithm optimization tailored for analyzing high throughput genomic data. This vignette demonstrates the usage of BioGA in the context of selecting the best combination of genes for predicting a certain trait, such as disease susceptibility.
Genomic data refers to the genetic information stored in an organism’s DNA. It includes the sequence of nucleotides (adenine, thymine, cytosine, and guanine) that make up the DNA molecules. Genomic data can provide valuable insights into various biological processes, such as gene expression, genetic variation, and evolutionary relationships.
Genomic data in this context could consist of gene expression profiles measured across different individuals (e.g., patients).
Each row in the genomic_data matrix represents a gene, and each column represents a patient sample.
The values in the matrix represent the expression levels of each gene in each patient sample.
Here’s an example of genomic data:
Sample 1 Sample 2 Sample 3 Sample 4
Gene1 0.1 0.2 0.3 0.4
Gene2 1.2 1.3 1.4 1.5
Gene3 2.3 2.2 2.1 2.0
In this example, each row represents a gene (or genomic feature), and each column represents a sample. The values in the matrix represent some measurement of gene expression, such as mRNA levels or protein abundance, in each sample.
For instance, the value 0.1 in Sample 1 for Gene1 indicates the expression level of Gene1 in Sample 1. Similarly, the value 2.2 in Sample 2 for Gene3 indicates the expression level of Gene3 in Sample 2.
Genomic data can be used in various analyses, including genetic
association studies, gene expression analysis, and comparative genomics.
In the context of the evaluate_fitness_cpp
function,
genomic data is used to calculate fitness scores for individuals in a
population, typically in the context of genetic algorithm
optimization.
The population represents a set of candidate combinations of genes that could be predictive of the trait. Each individual in the population is represented by a binary vector indicating the presence or absence of each gene. For example, an individual in the population might be represented as [1, 0, 1], indicating the presence of Gene1 and Gene3 but the absence of Gene2. The population undergoes genetic algorithm operations such as selection, crossover, mutation, and replacement to evolve towards individuals with higher predictive power for the trait.
Consider an example scenario of using genetic algorithm optimization to select the best combination of genes for predicting a certain trait, such as disease susceptibility.
# Load necessary packages
library(BioGA)
library(SummarizedExperiment)
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#>
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#>
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#> colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#> colWeightedMeans, colWeightedMedians, colWeightedSds,
#> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#> rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#> rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#>
#> rowMedians
#> The following objects are masked from 'package:matrixStats':
#>
#> anyMissing, rowMedians
# Define parameters
num_genes <- 1000
num_samples <- 10
# Define parameters for genetic algorithm
population_size <- 100
generations <- 20
mutation_rate <- 0.1
# Generate example genomic data using SummarizedExperiment
counts <- matrix(rpois(num_genes * num_samples, lambda = 10),
nrow = num_genes
)
rownames(counts) <- paste0("Gene", 1:num_genes)
colnames(counts) <- paste0("Sample", 1:num_samples)
# Create SummarizedExperiment object
se <-
SummarizedExperiment::SummarizedExperiment(assays = list(counts = counts))
# Convert SummarizedExperiment to matrix for compatibility with BioGA package
genomic_data <- assay(se)
In this example, counts
is a matrix representing the
counts of gene expression levels across different samples. Each row
corresponds to a gene, and each column corresponds to a sample. We use
the SummarizedExperiment
class to store this data, which is
common Bioconductor class for representing rectangular feature x sample
data, such as RNAseq count matrices or microarray data.
head(genomic_data)
#> Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
#> Gene1 8 6 10 22 8 11 11 8 7
#> Gene2 12 9 9 9 12 16 10 12 8
#> Gene3 16 6 10 7 8 10 12 6 7
#> Gene4 6 11 15 11 14 7 11 11 12
#> Gene5 16 12 5 10 10 6 13 10 8
#> Gene6 10 12 9 12 17 8 15 13 6
#> Sample10
#> Gene1 10
#> Gene2 11
#> Gene3 8
#> Gene4 7
#> Gene5 10
#> Gene6 7
# Initialize population (select the number of canditate you wish `population`)
population <- BioGA::initialize_population_cpp(genomic_data,
population_size = 5
)
The population represents a set of candidate combinations of genes that could be predictive of the trait. Each individual in the population is represented by a binary vector indicating the presence or absence of each gene. For example, an individual in the population might be represented as [1, 0, 1], indicating the presence of Gene1 and Gene3 but the absence of Gene2. The population undergoes genetic algorithm operations such as selection, crossover, mutation, and replacement to evolve towards individuals with higher predictive power for the trait.
# Initialize fitness history
fitness_history <- list()
# Initialize time progress
start_time <- Sys.time()
# Run genetic algorithm optimization
generation <- 0
while (TRUE) {
generation <- generation + 1
# Evaluate fitness
fitness <- BioGA::evaluate_fitness_cpp(genomic_data, population)
fitness_history[[generation]] <- fitness
# Check termination condition
if (generation == generations) { # defined number of generations
break
}
# Selection
selected_parents <- BioGA::selection_cpp(population,
fitness,
num_parents = 2
)
# Crossover and Mutation
offspring <- BioGA::crossover_cpp(selected_parents, offspring_size = 2)
# (no mutation in this example)
mutated_offspring <- BioGA::mutation_cpp(offspring, mutation_rate = 0)
# Replacement
population <- BioGA::replacement_cpp(population, mutated_offspring,
num_to_replace = 1
)
# Calculate time progress
elapsed_time <- difftime(Sys.time(), start_time, units = "secs")
# Print time progress
cat(
"\rGeneration:", generation, "- Elapsed Time:",
format(elapsed_time, units = "secs"), " "
)
}
#> Generation: 1 - Elapsed Time: 0.008921385 secs Generation: 2 - Elapsed Time: 0.01110435 secs Generation: 3 - Elapsed Time: 0.01146245 secs Generation: 4 - Elapsed Time: 0.01181459 secs Generation: 5 - Elapsed Time: 0.0121448 secs Generation: 6 - Elapsed Time: 0.01247811 secs Generation: 7 - Elapsed Time: 0.01280761 secs Generation: 8 - Elapsed Time: 0.01313877 secs Generation: 9 - Elapsed Time: 0.01346612 secs Generation: 10 - Elapsed Time: 0.01380491 secs Generation: 11 - Elapsed Time: 0.01416802 secs Generation: 12 - Elapsed Time: 0.01451135 secs Generation: 13 - Elapsed Time: 0.0148406 secs Generation: 14 - Elapsed Time: 0.01516747 secs Generation: 15 - Elapsed Time: 0.01549864 secs Generation: 16 - Elapsed Time: 0.01582909 secs Generation: 17 - Elapsed Time: 0.01618814 secs Generation: 18 - Elapsed Time: 0.01651907 secs Generation: 19 - Elapsed Time: 0.01684713 secs
The fitness calculation described in the provided code calculates a measure of dissimilarity between the gene expression profiles of individuals in the population and the genomic data. This measure of dissimilarity, or “fitness”, quantifies how well the gene expression profile of an individual matches the genomic data.
Mathematically, the fitness calculation can be represented as follows:
Let:
gijk be the gene expression level of gene j in individual i and sample k from the genomic data.
pij be the gene expression level of gene j in individual i from the population.
N be the number of individuals in the population.
G be the number of genes.
S be the number of samples.
Then, the fitness Fi for individual i in the population can be calculated as the sum of squared differences between the gene expression levels of individual i and the corresponding gene expression levels in the genomic data, across all genes and samples: $$ F_i = \sum_{j=1}^{G} \sum_{k=1}^{S} (g_{ijk} - p_{ij})^2 $$
This fitness calculation aims to minimize the overall dissimilarity between the gene expression profiles of individuals in the population and the genomic data. Individuals with lower fitness scores are considered to have gene expression profiles that are more similar to the genomic data and are therefore more likely to be selected for further optimization in the genetic algorithm.
This showcases the integration of genetic algorithms with genomic data analysis and highlights the potential of genetic algorithms for feature selection in genomics.
Here’s how BioGA could work in the context of high throughput genomic data analysis:
Problem Definition: BioGA starts with a clear definition of the problem to be solved. This could include tasks such as identifying genetic markers associated with a particular disease, optimizing gene expression patterns, or clustering genomic data to identify patterns or groupings.
Representation: Genomic data would need to be appropriately represented for use within the genetic algorithm framework. This might involve encoding the data in a suitable format, such as binary strings representing genes or chromosomes.
Fitness Evaluation: BioGA would define a fitness function that evaluates how well a particular solution performs with respect to the problem being addressed. In the context of genomic data analysis, this could involve measures such as classification accuracy, correlation with clinical outcomes, or fitness to a particular model.
Initialization: The algorithm would initialize a population of candidate solutions, typically randomly or using some heuristic method. Each solution in the population represents a potential solution to the problem at hand.
Genetic Operations: BioGA would apply genetic operators such as selection, crossover, and mutation to evolve the population over successive generations. Selection identifies individuals with higher fitness to serve as parents for the next generation. Crossover combines genetic material from two parent solutions to produce offspring. Mutation introduces random changes to the offspring to maintain genetic diversity.
Termination Criteria: The algorithm would continue iterating through generations until a termination criterion is met. This could be a maximum number of generations, reaching a satisfactory solution, or convergence of the population.
Result Analysis: Once the algorithm terminates, BioGA would analyze the final population to identify the best solution(s) found. This could involve further validation or interpretation of the results in the context of the original problem.
Other applications of BioGA in genomic data analysis could include genome-wide association studies (GWAS), gene expression analysis, pathway analysis, and predictive modeling for personalized medicine, among others. By leveraging genetic algorithms, BioGA offers a powerful approach to exploring complex genomic datasets and identifying meaningful patterns and associations.
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.1 (2024-06-14)
#> os Ubuntu 24.04.1 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate C
#> ctype en_US.UTF-8
#> tz Etc/UTC
#> date 2024-10-30
#> pandoc 3.2.1 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> abind 1.4-8 2024-09-12 [2] RSPM (R 4.4.0)
#> animation 2.7 2021-10-07 [2] RSPM (R 4.4.0)
#> Biobase * 2.65.1 2024-10-27 [2] https://bioc.r-universe.dev (R 4.4.1)
#> BiocGenerics * 0.51.3 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> BiocManager 1.30.25 2024-08-28 [2] RSPM (R 4.4.0)
#> BiocStyle * 2.33.1 2024-10-18 [2] https://bioc.r-universe.dev (R 4.4.1)
#> biocViews 1.73.1 2024-10-04 [2] https://bioc.r-universe.dev (R 4.4.1)
#> BioGA * 1.1.0 2024-10-30 [1] https://bioc.r-universe.dev (R 4.4.1)
#> bitops 1.0-9 2024-10-03 [2] RSPM (R 4.4.0)
#> bslib 0.8.0 2024-07-29 [2] RSPM (R 4.4.0)
#> buildtools 1.0.0 2024-10-28 [3] local (/pkg)
#> cachem 1.1.0 2024-05-16 [2] RSPM (R 4.4.0)
#> cli 3.6.3 2024-06-21 [2] RSPM (R 4.4.0)
#> colorspace 2.1-1 2024-07-26 [2] RSPM (R 4.4.0)
#> crayon 1.5.3 2024-06-20 [2] RSPM (R 4.4.0)
#> DelayedArray 0.31.14 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> digest 0.6.37 2024-08-19 [2] RSPM (R 4.4.0)
#> evaluate 1.0.1 2024-10-10 [2] RSPM (R 4.4.0)
#> fansi 1.0.6 2023-12-08 [2] RSPM (R 4.4.0)
#> farver 2.1.2 2024-05-13 [2] RSPM (R 4.4.0)
#> fastmap 1.2.0 2024-05-15 [2] RSPM (R 4.4.0)
#> GenomeInfoDb * 1.41.2 2024-10-02 [2] https://bioc.r-universe.dev (R 4.4.1)
#> GenomeInfoDbData 1.2.13 2024-10-30 [2] Bioconductor
#> GenomicRanges * 1.57.2 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> ggplot2 3.5.1 2024-04-23 [2] RSPM (R 4.4.0)
#> glue 1.8.0 2024-09-30 [2] RSPM (R 4.4.0)
#> graph 1.83.0 2024-10-01 [2] https://bioc.r-universe.dev (R 4.4.1)
#> gtable 0.3.6 2024-10-25 [2] RSPM (R 4.4.0)
#> highr 0.11 2024-05-26 [2] RSPM (R 4.4.0)
#> htmltools 0.5.8.1 2024-04-04 [2] RSPM (R 4.4.0)
#> httr 1.4.7 2023-08-15 [2] RSPM (R 4.4.0)
#> IRanges * 2.39.2 2024-10-25 [2] https://bioc.r-universe.dev (R 4.4.1)
#> jquerylib 0.1.4 2021-04-26 [2] RSPM (R 4.4.0)
#> jsonlite 1.8.9 2024-09-20 [2] RSPM (R 4.4.0)
#> knitr 1.48 2024-07-07 [2] RSPM (R 4.4.0)
#> labeling 0.4.3 2023-08-29 [2] RSPM (R 4.4.0)
#> lattice 0.22-6 2024-03-20 [2] RSPM (R 4.4.0)
#> lifecycle 1.0.4 2023-11-07 [2] RSPM (R 4.4.0)
#> magrittr 2.0.3 2022-03-30 [2] RSPM (R 4.4.0)
#> maketools 1.3.1 2024-10-28 [3] Github (jeroen/maketools@d46f92c)
#> Matrix 1.7-1 2024-10-18 [2] RSPM (R 4.4.0)
#> MatrixGenerics * 1.17.1 2024-10-23 [2] https://bioc.r-universe.dev (R 4.4.1)
#> matrixStats * 1.4.1 2024-09-08 [2] RSPM (R 4.4.0)
#> munsell 0.5.1 2024-04-01 [2] RSPM (R 4.4.0)
#> pillar 1.9.0 2023-03-22 [2] RSPM (R 4.4.0)
#> pkgconfig 2.0.3 2019-09-22 [2] RSPM (R 4.4.0)
#> R6 2.5.1 2021-08-19 [2] RSPM (R 4.4.0)
#> RBGL 1.81.0 2024-10-18 [2] https://bioc.r-universe.dev (R 4.4.1)
#> Rcpp 1.0.13 2024-07-17 [2] RSPM (R 4.4.0)
#> RCurl 1.98-1.16 2024-07-11 [2] RSPM (R 4.4.0)
#> rlang 1.1.4 2024-06-04 [2] RSPM (R 4.4.0)
#> rmarkdown 2.28 2024-08-17 [2] RSPM (R 4.4.0)
#> RUnit 0.4.33 2024-02-22 [2] RSPM (R 4.4.0)
#> S4Arrays 1.5.11 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> S4Vectors * 0.43.2 2024-10-17 [2] https://bioc.r-universe.dev (R 4.4.1)
#> sass 0.4.9 2024-03-15 [2] RSPM (R 4.4.0)
#> scales 1.3.0 2023-11-28 [2] RSPM (R 4.4.0)
#> sessioninfo 1.2.2 2021-12-06 [2] RSPM (R 4.4.0)
#> SparseArray 1.5.45 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> SummarizedExperiment * 1.35.5 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> sys 3.4.3 2024-10-04 [2] RSPM (R 4.4.0)
#> tibble 3.2.1 2023-03-20 [2] RSPM (R 4.4.0)
#> UCSC.utils 1.1.0 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> utf8 1.2.4 2023-10-22 [2] RSPM (R 4.4.0)
#> vctrs 0.6.5 2023-12-01 [2] RSPM (R 4.4.0)
#> withr 3.0.2 2024-10-28 [2] RSPM (R 4.4.0)
#> xfun 0.48 2024-10-03 [2] RSPM (R 4.4.0)
#> XML 3.99-0.17 2024-06-25 [2] RSPM (R 4.4.0)
#> XVector 0.45.0 2024-10-02 [2] https://bioc.r-universe.dev (R 4.4.1)
#> yaml 2.3.10 2024-07-26 [2] RSPM (R 4.4.0)
#> zlibbioc 1.51.2 2024-10-21 [2] Bioconductor 3.20 (R 4.4.1)
#>
#> [1] /tmp/RtmpDmM3KZ/Rinst13f54e49a9e
#> [2] /github/workspace/pkglib
#> [3] /usr/local/lib/R/site-library
#> [4] /usr/lib/R/site-library
#> [5] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────