---
title: "cfdnakit vignette"
author: "Pitithat Puranachot"
email: pitithat.pur@cra.ac.th
package: "cfdnakit"
output:
  BiocStyle::html_document:
    toc : true
    toc_float: true
    

vignette: >
  %\VignetteIndexEntry{cfdnakit vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center"
)
```

# Introduction

This package provides basic functions for analyzing next-generation sequencing data of cell-free DNA (cfDNA). The package focuses on extracting the length of cfDNA sample and visualization of genome-wide enrichment of short-fragmented cfDNA. The aberration of fragmentation profile, denoted modified ctDNA estimation score (CES), allows quantification of circulating tumor DNA (ctDNA). We recommend using this package to analysis shallow whole-genome sequencing data (\~0.3X or more). This package was complemented by Bioconductor packages e.g. QDNAseq, Rsamtools and GenomicRanges which could further expand the functionality of this package in the future.

## Installation
### Install via the Bioconductor repository
```{r install via BiocManager, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("cfdnakit")
```
### Install the latest version via github
To install this package is via this github repository, please follow the instruction below.

**Install prerequisites packages**
```{r install devtools and BiocManager, eval=FALSE}
if(! "devtools" %in% rownames(installed.packages()))
    install.packages("devtools")
if(! "BiocManager" %in% rownames(installed.packages()))
    install.packages("BiocManager")
```
**Install cfdnakit package**
```{r install cfdnakit via github, eval=FALSE}
library(devtools)  ### use devtools
install_github("Pitithat-pu/cfdnakit") ### install cfDNAKit 
```
The installation should work fine without non-zero exit status. Try load cfdnakit package into current R session

```{r load the package cfdnakit, eval=FALSE}
library(cfdnakit) ### Load cfdnakit package
```

# Prepare input BAM

A coordination-sorted BAM file of cfDNA from any liquid biopsy source (e.g. blood-plasma, CSF, urine) should be applicable. You should examine if the sequencing coverage reach coverage threshold before the analysis. Please make sure that the sequencing reads are mapped onto the following version of reference genome. cfdnakit supports the human reference genome GRCh37(hg19) and GRCh38(hg38). Preliminary fragment-length analysis can be performed using mouse cfDNA when mapping onto the mouse reference genome GRCm38(mm10).

# Read the BAM file with read_bamfile

Let's read sequence alignment file (.bam) using function **read_bamfile**. A BAM index file (.bai) is necessary for efficiently reading the file. If it doesn't exist in the same directory, this function will automatically create one. This function will split sequencing reads into equal-size non-overlapping windows. Possible size of bin are 100, 500, and 1000 KB. A path to the input file is given to the function **read_bamfile**. A SampleBam object will be created as the result.

For demonstration, we read an example sequence file ex.plasma.bam.

```{r read a bam file and splited into 1000 KB non-overlapping bins, warning=FALSE}
library(cfdnakit)
sample_bamfile <- system.file("extdata",
                             "ex.plasma.bam",
                package = "cfdnakit")
plasma_SampleBam <- read_bamfile(sample_bamfile,
                                apply_blacklist = FALSE)
```

By default, running read_bamfile will split reads into 1000 KB non-overlapping bins based-on the human reference genome GRCh37.

Reading the file should take a while depending on the size of your BAM. We recommend to save the result as RDS file using **saveRDS** function.

```{r save the file, eval=FALSE}
### Optional
saveRDS(plasma_SampleBam, file = "patientcfDNA_SampleBam.RDS")
```


# Analyse the Fragment Length Distribution

Fragment-length distribution of the sample can be visualized with function **plot_fragment_dist**. In the top-right legend, the modal length (bp) is shown in the parenthesis behind each sample name. The x-axis is the length of cfDNA; y-axis is the proportion of cfDNA fragment having a specific length.

```{r plot fragment length distribution, eval=FALSE}
plot_fragment_dist(list("Plasma.Sample"=plasma_SampleBam))
```

This document will demonstate by using SampleBam object (example_patientcfDNA_SampleBam.RDS). Now we load this file into R session.

```{r load example patient cfDNA sample BAM, warning=FALSE}
example_RDS <- "example_patientcfDNA_SampleBam.RDS"
example_RDS_file <-
    system.file("extdata",example_RDS,
                package = "cfdnakit")
sample_bins <- readRDS(example_RDS_file)

```

In general, plasma cfDNA should show non-random fragmentation pattern. The modal length of cfDNA is 167 bp and 10-bp periodic peaks. However, tumor-derived cfDNA are observed to be shorter (\<150 bp) than cfDNA of non-tumor origin. Here, we compare the fragment length distribution of patient's cfDNA with healthy individual. We derived a healthy cfDNA from Snyder et al. (2016) and create a RData file "BH01_chr15.RDS". This file can be loaded in R environment with readRDS function.

```{r getting build-in healthy plasma cfDNA sample, warning=FALSE}
control_rds<-"BH01_CHR15.SampleBam.rds"
control_RDS_file <-
    system.file("extdata",control_rds,
                package = "cfdnakit")
control_bins <-
    readRDS(control_RDS_file)

```

To provide visual comparison of cell-free DNA fragmentation, cfdnakit provide a function that allows plotting multiple distribution plots. We create a list of SampleBam files and plot their distribution together with function plot_fragment_dist. Each element in the list must be given a distinct sample name (e.g. Healthy.cfDNA).

```{r plot fragment length distribution comparing to a healthy cfDNA, fig.width=5.5, fig.height=4}
comparing_list <- list("Healthy.cfDNA"=control_bins,
                      "Patient.1"=sample_bins)
plot_fragment_dist(comparing_list)

```

**Optional Parameters**

maximum_length = Maximum length of cfDNA (Default 550)

minimum_length = Minimum length of cfDNA (Default 20)

# Quantification of Short Fragmented CfDNA

We can extract genome-wide fragment-length profile. First we define the short fragment as fragments having size (by default) between 100 - 150 base and 151 - 250 as the long fragment. Providing a SampleBam object to the function get_fragment_profile will return a SampleFragment object.

```{r getting fragment length profile,warning=FALSE, fig.width=12, fig.height=5}
sample_profile <- 
  get_fragment_profile(sample_bins,
                       sample_id = "Patient.1")
```

This SampleFragment contains a dataframe named **sample_profile**. This table contains information about the BAM file and the ratio number of Short/Long fragments. The table below describes important variables.

```{r show sample profile}
sample_profile$sample_profile
```

| Variable            | Description                                                   |
|---------------------|----------------------------------------------------           |
| Total.Fragments     | Number of DNA fragments (not the number of reads)             |
| Read.Pairs.in.range | Number of fragments within the defined range (short and long) |
| Mode                | Fragment length of the majority (bp)                          |
| Median              | Median Fragment length (bp)                                   |
| Mean                | Average Fragment length (bp)                                  |
| Mad                 | Fragment length median absolute deviation (MAD)               |
| N.Short             | Number of short fragments (n)                                 |
| N.Long              | Number of long fragments (n)                                  |
| S.L.Ratio           | N.Short/N.Long; Ratio of short fragment over long fragment    |
| S.L.Ratio_corrected | Ratio of short fragment over long fragment after GC-bias correction |
| Bin.Size.KB.        | Size of genomic bin of the analysis (KB)

# Plot Genome-wide Short-fragmented Ratio

We can plot genome-wide short-fragment ratio with the function plot_sl_ratio. Given short-fragment profile, short-fragment ratio per bin infer contribution of circulating tumor DNA (ctDNA) into cfDNA. The enrichment of short-fragment cfDNA in a large genomic region could infer the copy-number aberration status. The higher short-fragment ratio indicate amplification event whereas deletion would have relatively lower short-fragment ratio.

```{r plot genome-wide short-fragment ratio, fig.width=12, fig.height=5, warning=FALSE}
## For this demenstration, we load a real patient-derived cfDNA profile.
patient.SampleFragment.file <-
    system.file("extdata",
                "example_patientcfDNA_SampleFragment.RDS",
                package = "cfdnakit")
patient.SampleFragment <- readRDS(patient.SampleFragment.file)
plot_sl_ratio(patient.SampleFragment)
```

The range of S.L.Ratio is broad. It could be as high as 2 or more in sample with high contribution of ctDNA. Adjusting the plot is possible with parameter **ylim** or with other ggplot functions.

# Save SampleFragment as RDS file

Up to this step, we rather save the SampleFragment object as RDS for later use and for creation of a Panel-of-Normal.

```{r save SampleFragment Profile file,eval=FALSE}
destination_dir <- "~/cfdnakit.result"
saveRDS(sample_profile,
        file = paste0(destination_dir,
        "/plasma.SampleFragment.RDS"))
```

# Create Panel of Normal (PoN)

## Creating list of PoN files

Making a Panel-of-Normal is necessary for downstream analysis as we want to compare fragment profile between a cfDNA from patient with pooled of healthy individuals. First, we create a text file where each line is a full path to rds file of the SampleFragment object.

**Example content of PoN list file (let's name it Pon-list.txt)**

```
### Example content of Pon-list.txt
Path.to/SampleFragment_healthyplasma-01.rds
Path.to/SampleFragment_healthyplasma-02.rds
Path.to/SampleFragment_healthyplasma-03.rds
Path.to/SampleFragment_healthyplasma-04.rds
Path.to/SampleFragment_healthyplasma-05.rds
```

## Creating a PoN dataset

**create_PoN** function will read through all fragment profile files and create a set of PoN profile. We save this PoN profile and load into analysis later.

```{r create PoN profile, eval=FALSE}
PoN.profiles <- create_PoN("Path.to/Pon-list.txt")

saveRDS(PoN.profiles, "PoN.rds")
```

# Inferring CNV from short fragment cfDNA

## Normalizing Short-fragmented Ratio

The contribution of short-fragmented cfDNA into genomic regions could infer copy-number aberration harbored in the tumor cells. For this demonstration, we load ready-to-use PoN profile (ex.PoN.rds).

```{r Make a pon profile}
PoN_rdsfile <- system.file("extdata",
                          "ex.PoN.rds",
                          package = "cfdnakit")

PoN.profiles <- readRDS(PoN_rdsfile)
```

cfdnakit transforms S.L.Ratio of each genomic windows with the distribution of samples in PoN by function **get_zscore_profile**.

```{r Reading PoN samples,warning=FALSE,fig.width=8, fig.height=4}
sample_zscore <- 
  get_zscore_profile(patient.SampleFragment,
                     PoN.profiles)
```

## Circular Binary Segmentation (CBS)

cfdnakit implement the circular binary segmentation algorithm. Function **segmentByCBS** of package PSCBS were used through function segmentByPSCB. We can visualize both transformed S.L.Ratio and the segmentation by function **plot_transformed_sl**.

```{r Reading PoN samples and plot scaled value,warning=FALSE,fig.width=8, fig.height=4}
sample_zscore_segment <- segmentByPSCB(sample_zscore)

plot_transformed_sl(sample_zscore,sample_zscore_segment)
```

**Optional parameters** ylim = Y-axis of plot (Default c(-20,20))

## Estimating tumor fraction (TF) and CNV calling

Base on hypothesis that short-fragmented cfDNA originate from tumor cells. Enrichment of short-fragmented cfDNA in a large genomic segments correlates with the absolute copy in the tumor origin. cfdnakit heuristically search for fittest solution per absolute ploidy of 2, 3 and 4. The distance of each solution is calculated similarly to ACE (Poell JB et.al 2019) method.

Function call_cnv returns 3 solutions of the CNV calling and TF estimation. The analysis perform only on autosomal chromosomes. We can plot the distance matrix of all solution where solutions with \* are optimal solutions (minimum distance) per absolute genome ploidy.

```{r cnv calling and plot distance matrix,warning=FALSE,fig.width=6, fig.height=4}
sample_cnv <- call_cnv(sample_zscore_segment,sample_zscore)

plot_distance_matrix(sample_cnv)
```

Function call_cnv returns 3 solutions of the CNV calling and TF estimation. The analysis perform only on autosomal chromosomes. We can plot the distance matrix of all solution where solutions with \* are optimal solutions (minimum distance) per absolute genome ploidy.

The available solution can be obtained with function **get_solution_table**. The table shows the best solution (minimum distance) per absolute genome ploidy and rank them by the distance.

```{r get and print solution table}
solution_table <- get_solution_table(sample_cnv)

solution_table
```

## Plot optimal CNV profile

We can plot copy-number result by function plot_cnv_solution. By default, this function will produce the plot of the best solution (rank 1) and can be changed by specifying solution number to parameter **selected_solution**.

```{r plot cnv-calling first solution,warning=FALSE,fig.width=12, fig.height=5}
plot_cnv_solution(sample_cnv,selected_solution = 1)
```

# Copy-number Abnormality Score

As the result of copy-number solution fitting, the tumor fraction (tf ) indicates the estimated quantity of ctDNA from the amplitude of signals. cfdnakit also implements the ctDNA estimation score (CES) score to quantify the tumor-derived cell-free DNA from the segmentation and the SLRatio.

```{r calculate CES score}
calculate_CES_score(sample_zscore_segment)
```

This score is robust to coverage bias and noisy fragmented signals. Briefly, the Gaussian noise does not affect the score because the z-scores of segments, instead of the z-score of bins, are considered. Second, the average segment length is used as a penalty for sample quality. The signal of a bad quality sample does not strongly affect the score whereas a true highly unstable genome would overcome this penalty.

# Session info

Output of sessionInfo on the system on which this document was compiled:

```{r session info}
sessionInfo()
```