Statistical inference of the extent at which Darwinian natural
selection had impacted on genetic data from multiple populations
commands a healthy quota of the phylogenetic literature (Jacques et al. 2023). Validation of these
codon-based models relies heavily on simulated data. A search of the
entries on the Bioconductor (Gentleman et al.
2004) platform, on 29 July 2024, with keywords
codon
, mutation
, selection
,
simulate
and simulation
returned a total of 72
unique (out of the 2300 available Software) packages. None of the
retrieved entries was dedicated to codon data simulation for natural
selection analyses. Given the ever increasing diverse types of models of
natural selection inference from molecular data that exist, there is
indeed need for applicable packages on the platform.
Population genomic studies provided the mathematical foundation upon
which phylogenetics thrived (Wright 1931; Fisher
1922; Hardy 1908; Weinberg 1908; Darwin 1859). The thirst to
bridge the gap between these two genres of evolutionary biology continue
to drive the invention of more complex models of evolution (Aris-Brosou and Rodrigue 2012). Consequently,
there is need to develop codon sequence simulators to match the growth.
scoup
is designed on the basis of the mutation-selection
(MutSel) framework (Halpern and Bruno
1998) as a contribution to this quest. Only a couple of existing
selection-focused simulators in the literature used the MutSel framework
(Spielman and Wilke 2015; Arenas and Posada
2014). This is most probably due to the perceived complexity of
the methodology. In scoup
, the versatility of the Uhlenbeck and Ornstein (1930) algorithm as a
framework for evolutionary analyses (Bartoszek et
al. 2017) was exploited to circumvent the complexity.
In a bid to identify an appropriate quantifier that permits direct
comparison between the degree of selection signatures imposed during
simulation and that inferred, the ratio of the variance of the
non-synonymous to synonymous selection coefficients (vN/vS) was
discovered to be appropriate. The vN/vS statistic is consequently
posited as a quality selection discriminant metric. scoup
therefore represents an important contribution to the phylogenetic
modelling literature. Example code of how to successfully use the
package is presented below.
Use the following code to install scoup
from the
Bioconductor platform.
Three primary evolutionary algorithms are available in
scoup
. These include the frequency-dependent (Jones et al. 2017; Ayala and Campbell 1974),
the Ornstein-Uhlenbeck (OU) and the discrete algorithms. Example of
R
(R Core Team 2024) code
where these functions were utilised are presented. The homogeneous (Muse and Gaut 1994; Goldman and Yang 1994) and
heterogeneous (site-wise and branch-wise) (Nielsen and Yang 1998; Kosakovsky Pond and Frost
2005) selection inference modelling contexts were explored. Data
quality was assessed by comparing the maximum likelihood inferred (ω) and the analytically calculated
(dN/dS) estimates to
the magnitude of the imposed selection pressure (measured by vN/vS).
Template code used to analyse the output data (to obtain ω) with PAML
(Álvarez-Carretero, Kapli, and Yang 2023; Yang
2007) and FUBAR
(Murrell et
al. 2013; Kosakovsky Pond, Frost, and Muse 2005) are presented in
the Appendix. The R
code presented subsequently, require
that the user should have already installed the scoup
package.
## Loading required package: Matrix
# Number of extant taxa
## Excluded values contributed to results presented in article
leaves <- 8 # 64
# Number of codon sites
## Excluded values contributed to results presented in article
sSize <- 15 # 250
# Number of data replications for each parameter combination
## Edited count was used for the results presented in article
sims <- 1 # 50
# OU reversion parameter (Theta) value
## Excluded values contributed to results presented in article
eThta <- c(0.01) # c(0.01, 0.1, 1)
# OU asymptotic variance value
## Excluded values contributed to results presented in article
eVary <- c(0.0001) # c(0.0001, 0.01, 1)
# OU landscape shift parameters
hbrunoStat <- hbInput(c(vNvS=1, nsynVar=0.01))
# Sequence alignment size information
seqStat <- seqDetails(c(nsite=sSize, ntaxa=leaves))
# Iterate over all listed OU variance values
for(g in seq(1,length(eVary))){
# Iterate over all listed OU reversion parameter values
for(h in seq(1,length(eThta))){
# Create appropriate simulation function ("ou") object
adaptStat <- ouInput(c(eVar=eVary[g],Theta=eThta[h]))
# Iterate over the specified number of replicates
for(i in seq(1,sims)){
# Execute simulation
simData <- alignsim(adaptStat, seqStat, hbrunoStat, NULL)
}
}
}
# Print simulated alignment
seqCOL(simData)
## DNAStringSet object of length 8:
## width seq names
## [1] 45 AGGCTGCGCTTTCCACGGGTTGCTCTTAATGCCTTCGGCGCGCTG S001
## [2] 45 AAGCTGCGCTTTCCACGGGTTGCTCTTAATGCCTTCGGCGCGCTG S002
## [3] 45 AAGCTGCGCTTTCCACGGGTTGCTTTTAATGCCCTTGGCGCGCTG S003
## [4] 45 AAGCTGCGCTTTTCACGGGTTGCTCTTAATGCCTTAGGCGCGCTG S004
## [5] 45 AAGCGGCGTTTTCCACGGGTTGCTCCTAATGCCTGCGGCGGGCTG S005
## [6] 45 AAGCGGCGTTTTCCACGGGGTGCTCATAATGCCTTCGGCGCGCTG S006
## [7] 45 AAGCGGTGTTTTTCACGGGTTGCTCATAATGCATTCGGCGCGCTG S007
## [8] 45 AAGCGGCGTTTTTCACGGGTTGCTCATAATACATTCGGCGCGCCG S008
As expected, the correlation coefficient estimate was approximately 0.9974 when the means of the inferred (ω) and the calculated (dN/dS) selection effects were compared. The correlation estimation included all the commented values.
# Make package accessible in R session
library(scoup)
# Number of extant taxa
## Omitted value was used for the results presented in article
xtant <- 8 # 64
# Number of codon sites
## Omitted count was used for the results presented in article
siteSize <- 15 # 64
# Number of data replications for each parameter combination
## Omitted count was used for the results presented in article
simSize <- 1 # 50
# Variance of the non-synonymous selection coefficients
## Excluded values contributed to results presented in article
nsynVary <- c(0) # c(0, 0.001, 0.1)
# Ratio of the variance of the non-synonymous to synonymous coeff.
## Excluded values contributed to results presented in article
vNvSvec <- c(0) # c(0, 0.001, 1, 10)
# Sequence alignment size information
seqStat <- seqDetails(c(nsite=siteSize, ntaxa=xtant))
# Iterate over all listed coefficient variance ratios
for(a in seq(1,length(vNvSvec))){
# Iterate over all listed non-synonymous coefficients variance
for(b in seq(1,length(nsynVary))){
# Create appropriate simulation function ("omega") object
adaptData <- wInput(list(vNvS=vNvSvec[a],nsynVar=nsynVary[b]))
# Iterate over the specified number of replicates
for(i in seq(1,simSize)){
# Execute simulation
simulateSeq <- alignsim(adaptData, seqStat, NA)
}
}
}
# Print simulated alignment
cseq(simulateSeq)
## Sequence
## S001 GCAGGGGCTACGAGCGAGGTTTGTGGTGGAATCGACGGATGTGTA
## S002 GTACAGTGTACCAGCGAGGTCTGTGGGGGAACCGCCAGATATGGG
## S003 TTACAGCGTATCAGTGAGATTTGTGGTAGAGCGGTTCGTTGGGTG
## S004 GTACCGCATATCAGTGAAATCTGTGGTAGAACGGTCCGTTGGGTG
## S005 GATCAGCTTATCAACAAGGTCTGCGGTCGGGCAGTGCGTTGGGTG
## S006 GGACAGCTTAGCAACAAGGTCTATGGTAGGACGGTGCGTAGGATG
## S007 GAACGGTCCAGTAATAAGGTCCGTGGCATGACAGTCCGTTGGATA
## S008 GAACGGCCTAGTAGTAAGGCCCGTGGCATGATAGTGCGTTGGATA
Sequences generated with the presented code (with the excluded values activated) produced strongly correlated selection inferences (correlation coefficient ≈ 0.9923) when the average dN/dS and the ω values were compared. This implementation is an example of how to execute the frequency-dependent evolutionary technique with the package.
# Make package accessible in R session
library(scoup)
# Number of codon sites
## Commented value was used for results presented in article
sitesize<- 15 # 100
# Variance of non-synonymous selection coefficients
nsynVary <- 0.01
# Number of extant taxa
## Commented value was used for results presented in article
taxasize <- 8 # 1024
# Sequence alignment size information
seqsEntry <- seqDetails(c(nsite=sitesize, ntaxa=taxasize))
# Create the applicable ("ou") object for simulation function
## eVar= OU asymptotic variance, Theta=OU reversion parameter
adaptEntry <- ouInput(c(eVar=0.1,Theta=1))
# Ratio of the variance of the non-synonymous to synonymous coeff.
## Excluded values contributed to results presented in article
sratio <- c(0) # c(0, 1e-06, 1e-03, 0.1, 1, 10, 1000)
# Iterate over all listed coefficient variance ratios
for(a0 in seq(1,length(sratio))){
# OU landscape shift parameters
mValues <- hbInput(c(vNvS=sratio[a0], nsynVar=nsynVary))
# Execute simulation
simSeq <- alignsim(adaptEntry, seqsEntry, mValues, NA)
}
# Print simulated codon sequence
cseq(simSeq)
## Sequence
## S001 AGTCCTTATTCCAGGATAAGACGTCGATACGGAACTAACAACTGG
## S002 AGACCTTATTCCCGGATAAGGCGTCGGTACGGAGCTAACAACTGG
## S003 AGTCCTTATTCGCGGATAAGACGTCGGTACGGAGCTGACAACTGG
## S004 AGTCCTTATTCGCGGATAGGACGTCGGTACGGAGCTAACAATGGG
## S005 AGTCCCTATTCTTGCTTAAGACGTCGGTACAGCGATAACAACTGG
## S006 AGCCCTTACTCTCGCTTAAGACGTCGGTACGGCGATAACAACTGG
## S007 AGTCCTTATTCCCGGACGAGGCTACGGTACGGGGCTAACAACTGG
## S008 AGTCCTTATTCCCGGACTAGACTTCGGTACGGAGCTGACAACTCG
This is another example of how to call the OU shifting landscape evolutionary approach. The results obtained yielded a pairwise correlation coefficient estimate of approximately 0.9988 between the means of dN/dS and ω. The correlation coefficient estimates were approximately 0.8123 and 0.8305 when the averages were each compared to vN/vS, respectively.
# Make package accessible in R session
library(scoup)
# Number of internal nodes on the desired balanced tree
iNode <- 3
# Number of required codon sites
## Excluded value was used for the results presented in article
siteCount <- 15 # 1000
# Variance of non-synonymous selection coefficients
nsnV <- 0.01
# Number of data replications for each parameter combination
## Edited count was used for the results presented in article
nsim <- 1 # 50
# Ratio of the variance of the non-synonymous to synonymous coeff.
## Excluded values contributed to results presented in article
vNvSvec <- c(0) # c(0, 1e-06, 1e-03, 0.1, 1, 10, 100)
# Sequence alignment size information
seqsBwise <- seqDetails(c(nsite=siteCount, blength=0.10))
# Iterate over all listed coefficient variance ratios
for(h in seq(1,length(vNvSvec))){
# Iterate over the specified number of replicates
for(i in seq(1,nsim)){
# Create the parameter set applicable at each internal tree node
scInput <- rbind(vNvS=c(rep(0,iNode-1),vNvSvec[h]),
nsynVar=rep(nsnV,iNode))
# Create the applicable ("discrete") object for simulation function
adaptBranch <- discreteInput(list(p02xnodes=scInput))
# Execute simulation
genSeq <- alignsim(adaptBranch, seqsBwise, NULL)
}
}
# Print simulated sequence data
seqCOL(genSeq)
## DNAStringSet object of length 8:
## width seq names
## [1] 45 GAATGCAGACAGTATCAAGTCCAGAGAGGAAATTCAGCTCATGCG S001
## [2] 45 GAATGCAGACAGTATCAAGTCCAGCGAGGAAATTCAGCTCATGCG S002
## [3] 45 GAGTGTAGACAGTATCAAGTTCAGAGAGGGAACTCATTTCACACG S003
## [4] 45 GACTGCAGACAGTATCGAGTTCAAAGAGGAAACTCATTTCATGCG S004
## [5] 45 GAATGCCGACAATATCAGGTCCAGAGATGCAACTCATCACATGCG S005
## [6] 45 AAATGCCGACAATATCAGGTCCAGAGATGTAACTCATCACATGCG S006
## [7] 45 GAGTGTCGACAGCACCAGGTCCAGAGGGGGAACTCATCACATGCG S007
## [8] 45 GAGTGCCGACAGCACCAGGTCCAGAGGGGGAACTCGTCACATGCG S008
The correlation coefficient between the averages of the analytical dN/dS and the inferred ω estimates was approximately 0.9998, obtained from 50 independent iterations of the code for all the listed vN/vS values. The correlation coefficient estimate was approximately 0.6349 for vN/vS vs ω and 0.6360 for vN/vS vs dN/dS.
Reference scoup
code were presented to facilitate use of
the package. Although not explicitly presented, it is also possible to
generate data with signatures of directional selection by setting the
aaPlus
element of the wInput
entry of the
alignsim
function accordingly. The capacity of the package
is expected to be extended in future versions.
A more appropriate citation will be provided for the package after it has been accepted to the Bioconductor platform and after the corresponding article has been accepted for publication. In the meantime, to cite this package, use Sadiq, H. 2024. “scoup: Simulate Codon Sequences with Darwinian Selection Incorporated as an Ornstein-Uhlenbeck Process”. R Package. doi:10.18129/B9.bioc.scoup.
CODEML
script executed in PAML
to infer
single alignment-wide ω
estimates for data sets generated from 50 independent executions of each
of the sensitivity analyses code presented above. The same
CODEML
script was used to analyse data (also 50 replicates)
from the episodic analyses code, with the model
entry
replaced with 2
. The scoup
simulated sequence
data and tree are seq.nex
and seq.tre
,
respectively.
seqfile = seq.nex
treefile = seq.tre
outfile = seq.out
noisy = 0
verbose = 0
seqtype = 1
ndata = 1
icode = 0
cleandata = 0
model = 0
NSsites = 0
CodonFreq = 3
estFreq = 0
clock = 0
fix_omega = 0
omega = 1e-05
FUBAR
command executed with HyPhy
through
the terminal in MacBook. Note that HyPhy
was already
installed on the computer. The seq.nex
input is the
scoup
simulated codon sequence data that is saved in the
same NEXUS file with the tree data. The NEXUS file resides in the
working directory.
The output of sessionInfo()
from the computer where this
file was generated is provided below.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scoup_1.1.0 Matrix_1.7-1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] crayon_1.5.3 httr_1.4.7 cli_3.6.3
## [4] knitr_1.48 rlang_1.1.4 xfun_0.48
## [7] UCSC.utils_1.2.0 jsonlite_1.8.9 S4Vectors_0.44.0
## [10] buildtools_1.0.0 Biostrings_2.75.0 htmltools_0.5.8.1
## [13] maketools_1.3.1 sys_3.4.3 stats4_4.4.1
## [16] sass_0.4.9 rmarkdown_2.28 grid_4.4.1
## [19] evaluate_1.0.1 jquerylib_0.1.4 fastmap_1.2.0
## [22] GenomeInfoDb_1.43.0 IRanges_2.41.0 yaml_2.3.10
## [25] lifecycle_1.0.4 BiocManager_1.30.25 compiler_4.4.1
## [28] XVector_0.46.0 lattice_0.22-6 digest_0.6.37
## [31] R6_2.5.1 GenomeInfoDbData_1.2.13 bslib_0.8.0
## [34] tools_4.4.1 zlibbioc_1.52.0 BiocGenerics_0.53.0
## [37] cachem_1.1.0