Kendall, B., Jin, C., Li, K., Ruan, F., Wang, X.A., Wang, J.-P.,
DNAcycP2: improved estimation of intrinsic DNA cyclizability through
data augmentation, Nucleic Acids Research, gkaf145, 2025.
Introduction
DNAcycP2, short for DNA
cyclizability Prediction
v2, is an R package (Python version is also available)
developed for precise and unbiased prediction of DNA intrinsic
cyclizability scores. This tool builds on a deep learning framework that
integrates Inception and Residual network architectures with an LSTM
layer, providing a robust and accurate prediction mechanism.
DNAcycP2 is an updated version of the earlier
DNAcycP tool released by Li et al. in 2021. While
DNAcycP was trained on loop-seq data from Basu et al. (2021), DNAcycP2
improves upon it by training on smoothed predictions derived from this
dataset. The predicted score, termed C-score, exhibits
high accuracy when compared with experimentally measured cyclizability
scores obtained from the loop-seq assay. This makes DNAcycP2 a valuable
tool for researchers studying DNA mechanics and structure.
Key differences between DNAcycP2 and DNAcycP
Following the release of DNAcycP, it was found that the intrinsic
cyclizability scores derived from Basu et al. (2021) retained residual
bias from the biotin effect, resulting in inaccuracies (Kendall et al.,
2025). To address this, we employed a data augmentation + moving average
smoothing method to produce unbiased estimates of intrinsic DNA
cyclizability for each sequence in the original training dataset. A new
model, trained on this corrected data but using the same architecture as
DNAcycP, was developed, resulting in DNAcycP2. This version also
introduces improved computational efficiency through parallelization
options. Further details are available in Kendall et al. (2025).
To demonstrate the differences, we compared predictions from DNAcycP
and DNAcycP2 in a yeast genomic region at base-pair resolution (Figure
1). The predicted biotin-dependent scores (C̃26, C̃29, and $ C_{31}$, model
trained separately) show 10-bp periodic oscillations due to biotin
biases, each with distinct phases. DNAcycP’s predictions improved over
the biotin-dependent scores, while still show substantial local
fluctuations likely caused by residual bias in the training data (the
called intrinsic cyclizability score Ĉ0 from Basu et
al. 2021). In contrast, DNAcycP2, trained on corrected intrinsic
cyclizability scores, produces much smoother local-scale predictions,
indicating a further improvement in removing the biotin bias.
The DNAcycP2 package retains all prediction functions from the
original DNAcycP. The improved prediction model, based on smoothed data,
can be accessed using the argument smooth=TRUE in the main function (see
usage below).
Usage
Main Functions
The DNAcycP2 R package provides two primary
functions for cyclizability prediction:
cycle
: Takes an R object (vector of
strings) as input. Each element in the vector is a DNA sequence.
cycle_fasta
: Takes the path of a fasta
file as input.
Selecting the Prediction Model
Both functions use the smooth
argument to specify the
prediction model:
smooth=TRUE
: DNAcycP2 (trained on
smoothed data, recommended).
smooth=FALSE
: DNAcycP (trained on
original data).
Parallelization with cycle_fasta
The cycle_fasta
function is designed for handling larger
files and supports parallelization. To enable parallelization, use the
following arguments:
n_cores
: Number of cores to use
(default: 1).
chunk_length
: Sequence length (in bp)
each core processes at a time (default: 100,000).
The cycle_fasta
function is designed for larger files,
so it has added parallelization capability. To utilize this capability,
specify the number of cores to be greater than 1 using the
n_cores
argument (default 1). You can also specify the
length of the sequence that each core will predict on at a given time
using the chunk_length
argument (default 100000).
For reference, on a personal computer (16 Gb RAM, M1 chip with 8-core
CPU), prediction at full parallelization directly on the yeast genome
FASTA file completes in 12 minutes, and on the hg38 human genome
Chromosome I FASTA file in just over 4 hours. In our experience,
selection of parallelization parameters (n_cores
and
chunk_length
) has little affect when making predictions on
a personal computer, but if using the package on a high- performance
compute cluster, prediction time should decrease as the number of cores
increases. If you do run into memory issues, we first suggest reducing
chunk_length
.
DNAcycP2 prediction – Normalized vs unnormalized
Both cycle_fasta
and cycle
output the
prediction results in normalized
(C0_norm
,C0S_norm
) and unnomralized
(C0_unnorm
,C0S_unnorm
) version.
In DNAcycP2, the predicted cyclizability always contains
normalized and unnormalized values.
the unnormalized results were based on the model trained on unnormalized
Ĉ0 or Ĉ0s
scores. In contrast, the normalized results were predicted by the model
trained on the normalized Ĉ0 or Ĉ0s
values. The cyclizability score from different loop-seq libraries may be
subject to a systematic library-specific constant difference due to its
definition (see
Basu et al 2021), and hence it’s a relative measure and not directly
comparable between libraries. The normalization will force the training
data to have mean = 0 and standard deviation = 1 such that the 50 bp
sequences from yeast genome roughly have mean = 0 and standard deviation
= 1 for intrinsic cyclizabilty score. Thus for any sequence under
prediciton, the normalized C-score can be more informative in terms of
its cyclizabilty relative to the population. For example, the C-score
provides statisitcal significance indicator, i.e. a C-score of 1.96
indicates 97.5% in the distribution.
Save DNAcycP2 prediciton to external file
Both cycle_fasta
and cycle
provides an
argument save_path_prefix
to save the prediction results
onto local hard drive. For example:
ex2_smooth <- DNAcycP2::cycle(
ex2$V1,
smooth=TRUE,
save_path_prefix="ex2_smooth"
)
This will execute the same predictions as previously, and
additionally save two files named ‘ex2_smooth_C0S_norm.txt’ and
‘ex2_smooth_C0S_unnorm.txt’ to the current working directory. The output
files from cycle_fasta
have the same format as the function
output, but for consistency with the Python pacakge it is
important to note that the output files from cycle
have a
different format than the function output. Namely, rather
than writing a single file for every sequence, the function always
writes two files (regardless of the number of sequences), one containing
normalized predictions for every sequence (ending in ‘C0S_norm.txt’ or
‘C0_norm.txt’) and the other containing unnormalized predictions for
every sequence (ending in ‘C0S_unnorm.txt’ or ‘C0_unnorm.txt’). C-scores
in each line correspond to the sequence from the sequences
input in the same order.
For any input sequence, DNAcycP2 predicts the C-score for every 50
bp. Regardless of the input sequence format the first C-score in the
output file corresponds to the sequence from position 1-50, second for
2-51 and so forth.
Example 3 (Single Sequence):
If you want the predict C-scores for a single sequence, you can
follow the same protocol as Example 1 or 2, depending on the input
format. We have included two example files representing the same 1000bp
stretch of S. Cerevisiae sacCer3 Chromosome I (1:1000) in .fasta and
.txt format.
First, we will consider the .fasta format:
ex3_fasta_file <- system.file(
"extdata", "ex3_single_seq.fasta", package = "DNAcycP2"
)
ex3_fasta_smooth <- DNAcycP2::cycle_fasta(ex3_fasta_file,smooth=TRUE)
#> Sequence length for ID 1: 1000
#> Predicting cyclizability...
#> Chunk size: 100000, num threads: 1
ex3_fasta_original <- DNAcycP2::cycle_fasta(ex3_fasta_file,smooth=FALSE)
#> Sequence length for ID 1: 1000
#> Predicting cyclizability...
#> Chunk size: 100000, num threads: 1
The output (ex3_fasta_smooth
or
ex3_fasta_original
) is a list with 1 entry named
“cycle_1”.
Let’s say we are interested only in the smooth (DNAcycP2), normalized
predictions for the subsequence defined by the first 100bp
(corresponding to subsequences defined by regions [1,50], [2,51], …, and
[51-100], or position
s 25, 26, …, and 75). We can access
the outputs for this subsequence using the following command:
ex3_fasta_smooth[[1]][1:51,c("position", "C0S_norm")]
#> position C0S_norm
#> 1 25 0.94794816
#> 2 26 0.88397890
#> 3 27 1.05313051
#> 4 28 1.31837559
#> 5 29 1.43364513
#> 6 30 1.47490740
#> 7 31 1.68856978
#> 8 32 1.79380059
#> 9 33 1.78517425
#> 10 34 1.51873469
#> 11 35 1.83588862
#> 12 36 1.81811321
#> 13 37 1.89403594
#> 14 38 1.94740176
#> 15 39 1.67138922
#> 16 40 1.55992007
#> 17 41 1.66846383
#> 18 42 1.62409902
#> 19 43 1.54238105
#> 20 44 1.46230042
#> 21 45 1.41134238
#> 22 46 1.27757096
#> 23 47 1.02153277
#> 24 48 1.01221049
#> 25 49 0.93556392
#> 26 50 1.00608754
#> 27 51 1.09152961
#> 28 52 1.32879055
#> 29 53 1.25343060
#> 30 54 1.34268773
#> 31 55 1.13988316
#> 32 56 1.00737357
#> 33 57 1.13837850
#> 34 58 1.11469972
#> 35 59 0.96556205
#> 36 60 0.85989273
#> 37 61 0.91100574
#> 38 62 0.87353730
#> 39 63 0.55437249
#> 40 64 0.60366714
#> 41 65 0.41495007
#> 42 66 0.22158630
#> 43 67 0.14434634
#> 44 68 0.08841632
#> 45 69 -0.09656693
#> 46 70 -0.19039944
#> 47 71 -0.34424984
#> 48 72 -0.41227320
#> 49 73 -0.38969812
#> 50 74 -0.33039850
#> 51 75 -0.44366425
Or, equivalently,
ex3_fasta_smooth$cycle_1[1:51,c("position", "C0S_norm")]
#> position C0S_norm
#> 1 25 0.94794816
#> 2 26 0.88397890
#> 3 27 1.05313051
#> 4 28 1.31837559
#> 5 29 1.43364513
#> 6 30 1.47490740
#> 7 31 1.68856978
#> 8 32 1.79380059
#> 9 33 1.78517425
#> 10 34 1.51873469
#> 11 35 1.83588862
#> 12 36 1.81811321
#> 13 37 1.89403594
#> 14 38 1.94740176
#> 15 39 1.67138922
#> 16 40 1.55992007
#> 17 41 1.66846383
#> 18 42 1.62409902
#> 19 43 1.54238105
#> 20 44 1.46230042
#> 21 45 1.41134238
#> 22 46 1.27757096
#> 23 47 1.02153277
#> 24 48 1.01221049
#> 25 49 0.93556392
#> 26 50 1.00608754
#> 27 51 1.09152961
#> 28 52 1.32879055
#> 29 53 1.25343060
#> 30 54 1.34268773
#> 31 55 1.13988316
#> 32 56 1.00737357
#> 33 57 1.13837850
#> 34 58 1.11469972
#> 35 59 0.96556205
#> 36 60 0.85989273
#> 37 61 0.91100574
#> 38 62 0.87353730
#> 39 63 0.55437249
#> 40 64 0.60366714
#> 41 65 0.41495007
#> 42 66 0.22158630
#> 43 67 0.14434634
#> 44 68 0.08841632
#> 45 69 -0.09656693
#> 46 70 -0.19039944
#> 47 71 -0.34424984
#> 48 72 -0.41227320
#> 49 73 -0.38969812
#> 50 74 -0.33039850
#> 51 75 -0.44366425
Next, we will consider the .txt format:
ex3_txt_file <- system.file(
"extdata",
"ex3_single_seq.txt",
package = "DNAcycP2"
)
ex3_txt <- read.csv(ex3_txt_file, header = FALSE)
ex3_txt_smooth <- DNAcycP2::cycle(ex3_txt$V1, smooth=TRUE)
#> Reading sequences...
#> Not all sequences are length 50, predicting every subsequence...
ex3_txt_original <- DNAcycP2::cycle(ex3_txt$V1, smooth=FALSE)
#> Reading sequences...
#> Not all sequences are length 50, predicting every subsequence...
The output (ex3_txt_smooth
or
ex3_txt_original
) is a list with 1 entry (unnamed).
Note, that ex3_fasta_smooth
and
ex3_txt_smooth
are essentially equivalent. The only
exceptions are perhaps slight rounding differences that come from the
computation, and that the list ex3_fasta_smooth
has named
entries (‘cycle_1’) while ex3_txt_smooth
does not. The same
applies for ex3_fasta_original
and
ex3_txt_original
.
Therefore, we can use a similar command to access the outputs for our
subsequence of interest:
ex3_txt_smooth[[1]][1:51,c("position", "C0S_norm")]
If there is a sequence (or group of sequences) we want to make
predictions on, we can also input them directly as strings. For
example:
input_seq1 =
"CATGACTGCAGCTAAAACGTTGACCTAGTCGTCAGTCTACGTACTAGCGTAGCTATATCGAGTCTAGCGTCTAG"
input_seq2 = "ATCTTTTGTATATCAAAAGACTAGATCGATTAGCGTACGCCCCTGACTAGATAGATCG"
seq1_smooth = DNAcycP2::cycle(c(input_seq1), smooth=TRUE)
both_seqs_smooth = DNAcycP2::cycle(c(input_seq1, input_seq2), smooth=TRUE)