---
title: "DNAcycP2: **DNA** **Cyc**lizability **P**rediction v**2**"
author: "Brody Kendall, Ji-Ping Wang, and Keren Li"
date: "`r Sys.Date()`"
output: 
    BiocStyle::html_document:
        highlight: pygments
        toc: true
        fig_width: 5
vignette: >
    %\VignetteIndexEntry{DNAcycP2}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

**Maintainer**: Ji-Ping Wang, <<jzwang@northwestern.edu>>


```{r setup, include = FALSE}
library(DNAcycP2)
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```


**References for methods**:

Kendall, B., Jin, C., Li, K., Ruan, F., Wang, X.A., Wang, J.-P., DNAcycP2: improved estimation of intrinsic DNA cyclizability through data augmentation, *Nucleic Acids Research*, gkaf145, 2025.

****************************

## Introduction

**DNAcycP2**, short for **DNA** **cyc**lizability **P**rediction v**2**, is an 
R package (Python version is also available) developed for precise and unbiased 
prediction of DNA intrinsic cyclizability scores. This tool builds on a deep 
learning framework that integrates Inception and Residual network architectures 
with an LSTM layer, providing a robust and accurate prediction mechanism.

DNAcycP2 is an updated version of the earlier **DNAcycP** tool released by Li 
et al. in 2021. While DNAcycP was trained on loop-seq data from Basu et al. 
(2021), DNAcycP2 improves upon it by training on smoothed predictions derived 
from this dataset. The predicted score, termed **C-score**, exhibits high 
accuracy when compared with experimentally measured cyclizability scores 
obtained from the loop-seq assay. This makes DNAcycP2 a valuable tool for 
researchers studying DNA mechanics and structure.

### Key differences between DNAcycP2 and DNAcycP

Following the release of DNAcycP, it was found that the intrinsic cyclizability 
scores derived from Basu et al. (2021) retained residual bias from the biotin 
effect, resulting in inaccuracies (Kendall et al., 2025). To address this, we 
employed a data augmentation + moving average smoothing method to produce 
unbiased estimates of intrinsic DNA cyclizability for each sequence in the 
original training dataset. A new model, trained on this corrected data but 
using the same architecture as DNAcycP, was developed, resulting in DNAcycP2. 
This version also introduces improved computational efficiency through 
parallelization options. Further details are available in Kendall et al. (2025).

To demonstrate the differences, we compared predictions from DNAcycP and 
DNAcycP2 in a yeast genomic region at base-pair resolution (Figure 1). The 
predicted biotin-dependent scores ($\tilde C_{26}$, $\tilde C_{29}$, and 
$ \tilde C_{31}$, model trained separately) show 10-bp periodic oscillations 
due to biotin biases, each with distinct phases. DNAcycP's predictions improved 
over the biotin-dependent scores, while still show substantial
local fluctuations likely caused by residual bias in the training data (the 
called intrinsic cyclizability score $\hat C_0$ from Basu et al. 2021). In 
contrast, DNAcycP2, trained on corrected intrinsic cyclizability scores, 
produces much smoother local-scale predictions, indicating a further 
improvement in removing the biotin bias.

The DNAcycP2 package retains all prediction functions from the original 
DNAcycP. The improved prediction model, based on smoothed data, can be accessed 
using the argument smooth=TRUE in the main function (see usage below).

### Available formats of DNAcycP2 and DNAcycP

DNAcycP2 is available in three formats: A web server available at 
http://DNAcycP.stats.northwestern.edu for real-time prediction and 
visualization of C-score up to 20K bp, a standalone Python package avilable for 
free download from https://github.com/jipingw/DNAcycP2-Python, and a new R 
package available for free download from bioconductor 
(https://github.com/jipingw/DNAcycP2). DNAcycP2 R package is a wrapper of its 
Python version, both generate the same prediction results.

DNAcycP Python package is still available for free download from 
https://github.com/jipingw/DNAcycP.
As DNAcycP2 include all functionalities of DNAcycP, users can generate all 
DNAcycP results using DNAcycP2.

## Installation

DNAcycP2 is available on Bioconductor with `R >= 4.5.0`. To install it, run the
following command:

```{r, eval = FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DNAcycP2")
```

## Usage

### Main Functions

The **DNAcycP2** R package provides two primary functions for cyclizability 
prediction:

1. **`cycle`**: Takes an R object (vector of strings) as input. Each element in 
the vector is a DNA sequence.
2. **`cycle_fasta`**: Takes the path of a fasta file as input.

### Selecting the Prediction Model

Both functions use the `smooth` argument to specify the prediction model:

- **`smooth=TRUE`**: DNAcycP2 (trained on smoothed data, recommended).
- **`smooth=FALSE`**: DNAcycP (trained on original data).

### Parallelization with `cycle_fasta`

The `cycle_fasta` function is designed for handling larger files and supports 
parallelization. To enable parallelization, use the following arguments:

- **`n_cores`**: Number of cores to use (default: 1).
- **`chunk_length`**: Sequence length (in bp) each core processes at a time 
(default: 100,000).

The `cycle_fasta` function is designed for larger files, so it has added 
parallelization capability. To utilize this capability, specify the number of 
cores to be greater than 1 using the `n_cores` argument (default 1). You can 
also specify the length of the sequence that each core will predict on at a 
given time using the `chunk_length` argument (default 100000).

For reference, on a personal computer (16 Gb RAM, M1 chip with 8-core CPU), 
prediction at full parallelization directly on the yeast genome FASTA file 
completes in 12 minutes, and on the hg38 human genome Chromosome I FASTA file 
in just over 4 hours. In our experience, selection of parallelization 
parameters (`n_cores` and `chunk_length`) has little affect when making 
predictions on a personal computer, but if using the package on a high-
performance compute cluster, prediction time should decrease as the number of 
cores increases. If you do run into memory issues, we first suggest reducing 
`chunk_length`.


```{r}
library(DNAcycP2)
```


### Example 1: fasta file input

```{r}
ex1_file <- system.file("extdata", "ex1.fasta", package = "DNAcycP2")
ex1_smooth <- DNAcycP2::cycle_fasta(
    ex1_file, smooth=TRUE, n_cores=1, chunk_length=1000
)
ex1_original <- DNAcycP2::cycle_fasta(
    ex1_file, smooth=FALSE, n_cores=1, chunk_length=1000
)
```

`cycle_fasta` takes the file path as input (`ex1_file`). `smooth=TRUE` 
specifies that DNAcycP2 be used to make predictions. `smooth=FALSE` specifies 
that DNAcycP be used to make predictions. `n_cores=2` specifies that 2 cores 
are to be used in parallel. `chunk_length=1000` specifies that each core will 
predict on sequences of length 1000 at a given time.


The output (`ex1_smooth` or `ex1_original`) is a list with element names 
starting with "cycle" followed by the sequence names in the fasta file. For 
example, `ex1.fasta` contains two sequences with IDs "1" and "2" respectively. 
Therefore both both `ex1_smooth` and `ex1_original` will be lists of length 2 
with names `cycle_1` and `cycle_2` for the first and second sequences 
respectively. 

Each item in the list (e.g. `ex1_smooth$cycle_1`) is a data.frame object with 
three columns. The first column is always `position`. When `smooth=TRUE`, the 
second and third columns are `C0S_norm` and `C0S_unnorm`, and when 
`smooth=FALSE` the second and third columns are `C0_norm` and `C0_unnorm`. 

### Example 2: input as a list/vector of sequences

```{r}
ex2_file <- system.file("extdata", "ex2.txt", package = "DNAcycP2")
ex2 <- read.csv(ex2_file, header = FALSE)
ex2_smooth <- DNAcycP2::cycle(ex2$V1, smooth=TRUE)
ex2_original <- DNAcycP2::cycle(ex2$V1, smooth=FALSE)
```

`cycle` takes the sequences themselves as input, so we first read the file 
(`ex2_file`) and then provide the sequences as input (`ex2$V1`)

The output (`ex2_smooth` or `ex2_original`) is a list with indices 
corresponding to each sequence from the `sequences` argument (here it is 
`ex2$V1`). For example, `ex2.txt` contains 100 sequences.
Therefore, both `ex2_smooth` and `ex2_original` will be lists of length 100,  
where each entry in the list corresponds to the sequence with its same index.

Each item in the list (e.g. `ex2_smooth[[1]]`) is a data.frame object with 
three columns. The first columns is always `position`. When `smooth=TRUE`, the 
second and third columns are `C0S_norm` and `C0S_unnorm`, and when 
`smooth=FALSE` the second and third columns are `C0_norm` and `C0_unnorm`.

### DNAcycP2 prediction -- Normalized vs unnormalized

Both `cycle_fasta` and `cycle` output the prediction results in normalized 
(`C0_norm`,`C0S_norm`) and unnomralized (`C0_unnorm`,`C0S_unnorm`) version. 

In DNAcycP2, the predicted cyclizability always contains **normalized** and 
**unnormalized** values. the unnormalized results were based on the model 
trained on unnormalized $\hat C_0$ or $\hat C_0^s$ scores. In contrast, the 
normalized results were predicted by the model trained on the normalized 
$\hat C_0$ or $\hat C_0^s$ values.
The cyclizability score from different loop-seq libraries may be subject to a 
systematic library-specific constant difference due to its definition (see  
Basu et al 2021), and hence it's a relative measure and not directly comparable 
between libraries. The normalization will force the training data to have 
mean = 0 and standard deviation = 1 such that the 50 bp sequences from yeast 
genome roughly have mean = 0 and standard deviation = 1 for intrinsic 
cyclizabilty score. Thus for any sequence under prediciton, the normalized 
C-score can be more informative in terms of its cyclizabilty relative to the 
population. For example, the C-score provides statisitcal significance 
indicator, i.e. a C-score of 1.96 indicates 97.5% in the distribution.


### Save DNAcycP2 prediciton to external file

Both `cycle_fasta` and `cycle` provides an argument `save_path_prefix` to save 
the prediction results onto local hard drive. For example:

```{r, eval=FALSE}
ex2_smooth <- DNAcycP2::cycle(
    ex2$V1, 
    smooth=TRUE, 
    save_path_prefix="ex2_smooth"
)
```

This will execute the same predictions as previously, and additionally save two 
files named 'ex2_smooth_C0S_norm.txt' and 'ex2_smooth_C0S_unnorm.txt' to the 
current working directory. The output files from `cycle_fasta` have the same 
format as the function output, but for consistency with the Python pacakge 
***it is important to note that the output files from `cycle` have a different 
format than the function output.*** Namely, rather than writing a single file 
for every sequence, the function always writes two files (regardless of the 
number of sequences), one containing normalized predictions for every sequence 
(ending in 'C0S_norm.txt' or 'C0_norm.txt') and the other containing 
unnormalized predictions for every sequence (ending in 'C0S_unnorm.txt' or 
'C0_unnorm.txt'). C-scores in each line correspond to the sequence from the 
`sequences` input in the same order.

For any input sequence, DNAcycP2 predicts the C-score for every 50 bp. 
Regardless of the input sequence format the first C-score in the output file 
corresponds to the sequence from position 1-50, second for 2-51 and so forth.

### Example 3 (Single Sequence):

If you want the predict C-scores for a single sequence, you can follow 
the same protocol as Example 1 or 2, depending on the input format. We 
have included two example files representing the same 1000bp stretch of 
S. Cerevisiae sacCer3 Chromosome I (1:1000) in .fasta and .txt format.

First, we will consider the .fasta format:

```{r}
ex3_fasta_file <- system.file(
    "extdata", "ex3_single_seq.fasta", package = "DNAcycP2"
)
ex3_fasta_smooth <- DNAcycP2::cycle_fasta(ex3_fasta_file,smooth=TRUE)
ex3_fasta_original <- DNAcycP2::cycle_fasta(ex3_fasta_file,smooth=FALSE)
```

The output (`ex3_fasta_smooth` or `ex3_fasta_original`) is a list with
1 entry named "cycle_1".

Let's say we are interested only in the smooth (DNAcycP2), normalized
predictions for the subsequence defined by the first 100bp 
(corresponding to subsequences defined by regions [1,50], [2,51],
..., and [51-100], or `position`s 25, 26, ..., and 75). We can 
access the outputs for this subsequence using the following command:

```{r}
ex3_fasta_smooth[[1]][1:51,c("position", "C0S_norm")]
```

Or, equivalently,

```{r}
ex3_fasta_smooth$cycle_1[1:51,c("position", "C0S_norm")]
```

Next, we will consider the .txt format:

```{r}
ex3_txt_file <- system.file(
    "extdata", 
    "ex3_single_seq.txt", 
    package = "DNAcycP2"
)
ex3_txt <- read.csv(ex3_txt_file, header = FALSE)
ex3_txt_smooth <- DNAcycP2::cycle(ex3_txt$V1, smooth=TRUE)
ex3_txt_original <- DNAcycP2::cycle(ex3_txt$V1, smooth=FALSE)
```

The output (`ex3_txt_smooth` or `ex3_txt_original`) is a list with 1 entry 
(unnamed).

Note, that `ex3_fasta_smooth` and `ex3_txt_smooth` are essentially equivalent. 
The only exceptions are perhaps slight rounding differences that come from the 
computation, and that the list `ex3_fasta_smooth` has named entries ('cycle_1') 
while `ex3_txt_smooth` does not. The same applies for `ex3_fasta_original` and 
`ex3_txt_original`.

Therefore, we can use a similar command to access the outputs for our 
subsequence of interest:

```r
ex3_txt_smooth[[1]][1:51,c("position", "C0S_norm")]
```

If there is a sequence (or group of sequences) we want to make predictions on, 
we can also input them directly as strings. For example:

```r
input_seq1 = 
    "CATGACTGCAGCTAAAACGTTGACCTAGTCGTCAGTCTACGTACTAGCGTAGCTATATCGAGTCTAGCGTCTAG"
input_seq2 = "ATCTTTTGTATATCAAAAGACTAGATCGATTAGCGTACGCCCCTGACTAGATAGATCG"
seq1_smooth = DNAcycP2::cycle(c(input_seq1), smooth=TRUE)
both_seqs_smooth = DNAcycP2::cycle(c(input_seq1, input_seq2), smooth=TRUE)
```

### Example 4: `DNAStringSet` object input

```{r}
library(Biostrings)
ex4_string_set <- readDNAStringSet(system.file("extdata", "ex1.fasta", package="DNAcycP2"))
ex4_smooth_output <- DNAcycP2::cycle(ex4_string_set, smooth=TRUE)
```

`ex4_string_set` here is a `DNAStringSet` object using `readDNAStringSet` function from 
`Biostrings` package.


## References

* Li, K., Carroll, M., Vafabakhsh, R., Wang, X.A. and Wang, J.-P., DNAcycP: A 
Deep Learning Tool for DNA Cyclizability Prediction, *Nucleic Acids Research*, 
2021
* Basu, A., Bobrovnikov, D.G., Qureshi, Z., Kayikcioglu, T., Ngo, T.T.M., 
Ranjan, A., Eustermann, S., Cieza, B., Morgan, M.T., Hejna, M. et al. (2021) 
Measuring DNA mechanics on the genome scale. Nature, 589, 462-467.

# Session info

```{r sessionInfo}
sessionInfo()
```