---
title: "DuplexDiscovereR tutorial"
author:
  - name: "Egor Semenchenko"
    affiliation: Max Delbrück Center for molecular medicine; Berlin Institute for Medical Systems Biology; Freie Universität Berlin  
    email: egorsbs01@gmail.com
package: DuplexDiscovereR
abstract: |
    DuplexDiscovereR implements methods for analysing data from RNA cross-linking 
    and proximity ligation protocols such as SPLASH, PARIS, LIGR-seq 
    and other methods which yeild information on RNA-RNA interactions as 
    chimeric read fragments after high-throughput sequencing. 
bibliography: '`r system.file("references.bib", package = "DuplexDiscovereR")`'
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 5
    toc_float: true
    number_sections: true
vignette: |
  %\VignetteIndexEntry{DuplexDiscovereR tutorial}
  %\VignetteEncoding{UTF-8} 
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---


```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown(css.files = c("custom.css"))
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "##",
    crop = FALSE
)
```

# Installation

`DuplexDiscovereR` can be installed from Bioconductor:

```{r install_chunk, eval=FALSE}
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("DuplexDiscovereR")

library(DuplexDiscovereR)
?DuplexDiscovereR
```

or from Github page:

```{r install_chunk2, eval=FALSE}
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")

library(devtools)
devtools::install_github("Egors01/DuplexDiscovereR")
```

## Installing RNAduplex

For calculating hybridization energies `DuplexDiscovereR` uses
[RNAduplex](https://www.tbi.univie.ac.at/RNA/RNAduplex.1.html) program
from the [ViennaRNA](https://www.tbi.univie.ac.at/RNA/) software suite.

Although this step is optional and is not required for using most of the
package methods, we strongly recommend installing ViennaRNA from its
[web-page](https://www.tbi.univie.ac.at/RNA/), as predicting RNA hybrids
is one of the central aims of the analysis of RNA-RNA interaction data.
Please note that ViennaRNA is distributed under its own licence.

# Introduction

RNA cross-linking and proximity ligation methods as SPLASH, PARIS,
LIGR-seq, RIC-seq and others provide infromation on the RNA-RNA
interactions on a transcriptome-wide scale. These experiments generate
high-throughput sequencing data containing a fraction of chimeric reads,
where each chimeric part corresponds to the base-paired stretches of RNA
(RNA duplexes)

`DuplexDiscovereR` is designed for the bioinformatic analysis of RNA data.
It employs a workflow that allows users to identify RNA duplexes and
and provides several options for filtering and ranking the results.

Starting from the set of aligned chimeric reads, it implements
Classification and filtering of non-RNA duplex alignments and their
into duplex groups (DGs). Identified DGs are then annotated
with transcripts or other genomic features. P-values are calculated to test
the hypothesis that DGs are generated by random ligation. Finally, RNA
duplex base pairing and hybridization energies are predicted.

The optimal procedures for identifying RNA duplexes may vary between
methods, as RNA-RNA interaction probing protocols are known to have
biases for certain types of RNA and differ in experimental steps.
`DuplexDiscovereR` allows user to build customized analysis by utilizing
its methods for efficient chimeric read clustering, classification and
convenience functions for flexible data filtering, annotation and
visualization.

RNA duplex data is known to be sparse, with little overlap between the results of
between the results of different probing protocols. To facilitate
comparisons between different experiments and to allow straightforward
cross-checking between replicates, `DuplexDiscovereR` includes
functionality to intersect multiple RNA interaction datasets.

`DuplexDiscovereR` relies on the `GInteractions` container from
`r Biocpkg("InteractionSet")` package for the storing of the RNA-RNA
interaction data and uses tidyverse `tibble` and `dplyr` for internal
the `r Biocpkg("Gviz")` - based `DuplexTrack` track and
supports the exportof the results to the .sam file.

# Quick start

##  Analysis steps

Key steps of RNA duplex data analysis with `DuplexDiscovereR` can be run
through the single function call. It is expected that user already
aligned the raw sequencing reads which produced some chimeric or
split-read alignments. The input can be provided in several formats:

-   Tabular file with the split-read alignments
    -   *Chimeric.out.junction* file generated by STAR
    -   *.bedpe* format file (defined in
        [bedtools](https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format))
-   `GInteractions` object with chimeric alignments. 

Full `DuplexDiscovereR` workflow consists of the following steps:

-   Pre-processing of the input. This step calls the classification of the reads 
    into following classes: 
    -   2arm reads (only one chimeric junction per aligned read)
    -   multi-split (two or more one chimeric junctions per aligned
        read)
    -   multi-mapped reads
-   Comparing the 2arm reads mapped reads with the splice
    junction annotation. This filters out alignments across the exon-exon 
    junctions which typically contaminate the 'true' chimeric alignments 
    originated from the RNA duplexes
-   Deduplication of the identical read alignments
-   Clustering of reads into duplex groups (DGs)

*Optionally*,

-   Annotation of the DGs with transcript or other features
-   Prediction of the base-pairing hybridization energies
-   Computing p-values from testing the probability of random ligation

## Input arguments

Required inputs for minimal run are the following:

-   `data` Chimeric read alignments or coordinates. Object of any
    table type, convertible to `tibble` or `GInteractions`
-   `junctions_gr` `GRanges` object with splice junction
    coordinates.\
-   Library type and type of the read table, if input is not
    `GInteractions`.
    
Optional inputs for executing the full workflow:

-   `anno_gr` a `GRanges` object with genes or transcripts
    annotation object
-   `df_counts` a 2-column table with the `id` in the 1st column and
    raw read count in the 2nd. `id` should correspond to the entries
    in `anno_gr`. Providing this argument triggers p-value
    calculation.
-   `fafile`  path to *.fasta* file with the genome. Providing this
    argument triggers prediction of RNA hybrids.

## Loading example data

Once installed, we can load the example data. Example dataset is based
on the RNA duplex probing of SPLASH ES-cells [Aw. et.al
2016](https://pubmed.ncbi.nlm.nih.gov/27184079/) which was aligned with
STAR [Dobin et.al,
2013](https://academic.oup.com/bioinformatics/article-abstract/29/1/15/272537)
and subset to the chromosome 22.

```{r dataload,echo=T,eval=T,message=FALSE}
library(DuplexDiscovereR)
data(RNADuplexesSampleData)
```

Details on alignment of the reads of example dataset and fetching of the
annotation data are described in the document which can be found at

```{r dataload2 }
system.file("extdata", "scripts", "DD_data_generation.R", package = "DuplexDiscovereR")
```

## Running analysis

```{r run_workflow, echo=T,eval=T,message=FALSE}
genome_fasta <- NULL
result <- DuplexDiscovereR::runDuplexDiscoverer(
    data = RNADuplexesRawChimSTAR,
    junctions_gr = SampleSpliceJncGR,
    anno_gr = SampleGeneAnnoGR,
    df_counts = RNADuplexesGeneCounts,
    sample_name = "example_run",
    lib_type = "SE",
    table_type = "STAR",
    fafile = genome_fasta,
)
```

The `result` is a `DuplexDiscovererResults` object which contains several outputs of different
dimensions.

```{r observe_results1,eval=TRUE}
result
```

The primary output are detected duplex groups. 
This is a `GInteractions`, where each entry represents boundaries 
of the detected RNA duplex. Metadata fields as `n_reads` , `p_val` , `score` and others 
can be used for ranking and sub-setting results for downstream analysis. 

```{r observe_results2,eval=TRUE}
gi_clusters <- dd_get_duplex_groups(result)
head(gi_clusters, 2)
```

Individual read-level data can be accessed by running:

```{r observe_results3,eval=TRUE}
gi_reads <- dd_get_chimeric_reads(result)
head(gi_reads, 2)
```
Each entry in `gi_reads` is a single read. They are tagged by DG with the 
`dg_id`, but not collapsed into groups. This layout is used for
visualization and may be useful for a more detailed review of the structure of 
the individual DGs.

The dimensions of `gi_reads` do not correspond 1:1 to the size of the
to the size of the input, since in general not every element in the input data 
can be represented as a 2-arm split alignment.

To track each input alignment one can query the the read-level classification results, 
which have the same dimension with the input. 

```{r observe_results4a,eval=TRUE}
df_reads <- dd_get_reads_classes(result)
print(dim(df_reads))
print(dim(RNADuplexesRawChimSTAR))
```
`df_reads` has fields such as `read_type`, `map_type` and `junction_type` 
indicating the results of chimeric read classification and
filtering. 

This layout is useful for assessing the statistics and highlighting potential 
problems and optimization points, i.e. how many reads from the input were 
classified as true chimeric and how many of them were assigned to duplex groups. 

```{r observe_results4b,eval=TRUE}
table(df_reads$read_type)
```

Among the 2-arm alignments from `gi_reads`, only unambigously aligned reads 
without self-ovelaps that pass splice-junctions and minimum junction length
filters are subjected to clustering into DGs. To see the determined types of 
2-arm alignments, we can check the `junction_type` metadata field

```{r observe_results5,eval=TRUE}
table(gi_reads$junction_type)
```

## Writing output

The resulting `GInteractions` with duplex groups are ready for
integration into transcriptomic analysis using other Bioconductor
packages.

There are several options available for outputting the results for other
uses.

### Output to table

One can convert the object with duplex groups to dataframe-like object and
write it to file

```{r write_table, eval=TRUE}
clusters_dt <- makeDfFromGi(gi_clusters)
write.table(clusters_dt, file = tempfile(pattern = "dgs_out", fileext = ".tab"))
```

### Output to .sam file

Saving results to .sam file can be a useful tool for downstream
visualization with interactive viewers as IGV

```{r write_sam, eval=TRUE, message = FALSE}
writeGiToSAMfile(gi_clusters, file_out = tempfile(pattern = "dgs_out", fileext = ".sam"), genome = "hg38")
```

## Visualization

Collapsed DGs can be visualized with Gviz -based DuplexTrack, 
allowing overlays of RNA duplexes with the genes of other features

```{r, vis_1, eval=TRUE,message=FALSE}
library(Gviz)
# define plotting region
plotrange <- GRanges(
    seqnames = "chr22",
    ranges = IRanges(
        start = c(37877619),
        end = c(37878053)
    ),
    strand = "+"
)
# make genes track
anno_track <- Gviz::GeneRegionTrack(SampleGeneAnnoGR,
    name = "chr22 Genes",
    range = plotrange
)
# construct DuplexTrack
duplex_track <- DuplexTrack(
    gi = gi_clusters,
    gr_region = plotrange,
    labels.fontsize = 12,
    arcConstrain = 4,
    annotation.column1 = "gene_name.A",
    annotation.column2 = "gene_name.B"
)
plotTracks(list(anno_track, duplex_track),
    sizes = c(1, 3),
    from = start(plotrange) - 100,
    to = end(plotrange) + 100
)
```

## Calcualting hybridization energies

In the example above we have omitted the base-pairing prediction and the
hybridization energies by not passing the path to file with genome as the 
`fafile` argument. However, the prediction of RNA hybrids is
central to the analysis of RNA-RNA interaction data.

To compute the structure formed by two stretches of RNA
`DuplexDiscoverer` uses *RNAduplex* algorithm from ViennaRNA. Note
that ViennaRNA is distributed under its own licence. Please refer to
[ViennaRNA's license](https://viennarna.readthedocs.io/en/latest/license.html)
for details.

If the *RNAduplex* is installed, all steps of the
analysis can be re-run by calling `runDuplexDiscoverer` with the
option `fafile = <path to genome fasta>`. Alternatively, one can predict RNA
hybrids by calling separate function on already existing `GInteractions` object:

```         
sequence <- paste0(
"AGCUAGCGAUAGCUAGCAUCGUAGCAUCGAUCGUAAGCUAGCUAGCUAGCAUCGAUCGUAGCUAGCAUCGAU",
"CGUAGCAUCGUAGCUAGCUAGCUAUGCGAUU")

# Save the sequence to a temp fasta file
fasta_file = tempfile(fileext = '.fa')
chrom <- "test_chrA"
writeLines(c(">test_chrA", sequence), con = fasta_file)

# Create the GInteraction object
# Define start and end positions for the base-pairing regions
regions <- data.frame(
    start1 = c(1, 11, 21, 31, 41),
    end1 = c(10, 20, 30, 40, 50),
    start2 = c(91, 81, 71, 61, 51),
    end2 = c(100, 90, 80, 70, 60)
)
# GRanges objects for the anchors
anchor1 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start1, end = regions$end1))
anchor2 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start2, end = regions$end2))
interaction <- GInteractions(anchor1, anchor2)
# predict hybrids
getRNAHybrids(interaction,fasta_file)
```

## Comparing multiple samples or replicates

Due to the sparse nature of RNA duplex data, it is desirable to be able
to compare replicates of the same experiment or to compare different
data sets.

`DuplexDiscovereR` provides a method for finding intersections between
multiple samples. Comparisons between ranges can result in one-to-many
relations, i.e. when a region (or pair a of regions) of one sample contains
multiple regions of other samples. `DuplexDiscovereR` addresses this
problem by using the following procedure: first it computes a non-redundant
set of interactions that stacks and collapses interactions from all
input samples. This resulting super-set is then compared in a pairwise
manner with each  entry from each individual sample.

To demonstrate how multiple samples can be compared, we will generate
three pseudo-replicates from the example data which we clustered above.

First, we will select the clustered reads and divide them into three groups

```{r comp1, eval=TRUE}
clust_reads <- gi_reads
clust_reads <- clust_reads[!is.na(clust_reads$dg_id)]

# Create separate pseudo-sample objects for each group
set.seed(123)
group_indices <- sample(rep(1:3, length.out = length(clust_reads)))

group1 <- clust_reads[group_indices == 1]
group2 <- clust_reads[group_indices == 2]
group3 <- clust_reads[group_indices == 3]
```

Since individual reads have already been assigned to duplex groups, 
we can collapse them, ensuring that we have some overlap between 
reads that belong to the same 'true' DGs, but are spread across three
artificial sets.

```{r comp2, eval=TRUE}
group1 <- collapse_duplex_groups(group1)
group2 <- collapse_duplex_groups(group2)
group3 <- collapse_duplex_groups(group3)
```

```{r comp3, eval=TRUE, message=FALSE}
a <- list("sample1" = group1, "sample2" = group2, "sample3" = group3)
res_comp <- compareMultipleInteractions(a)
names(res_comp)
```

Result contains slot for interaction superset and overlap information
information. We can visualize the intersections of the samples using the upset plot

```{r comp4, eval=TRUE}
# first, check if UpSetR is installed
if (!requireNamespace("UpSetR", quietly = TRUE)) {
    stop("Install 'UpSetR' to use this function.")
}
library(UpSetR)
upset(res_comp$dt_upset, text.scale = 1.5)
```

# Building customized RNA duplex data analysis

One can customize the analysis by performing classification, clustering and 
subsequent annotation procedures step by step, adapting the procedures to
the RNA-RNA interaction probing protocol or upstream/downstream analysis.

In this section, we show how users can use the core methods in the 
package to achieve more flexibility in this task. 

## Preprocessing 

Most of the methods in `DuplexDiscovereR` work with
the RNA-RNA interaction data stored in `GInteraction` object. There are
several metadata fields which are not strictly required by the package,
but utilized in procedures run classification: `map_type`, `read_id` and 
`n_reads`,`score` for clustering.

The easiest way to run `DuplexDiscovereR` step-by-step is to start with a
pre-processing routine. It is fairly flexible in input type: it can be either
`GInteractions` ,  *Chimeric.out.Junction* file from *STAR* or generic 
*.bedpe* file with pairs of regions. 

```{r,  preproc ex1, message=FALSE}
data("RNADuplexesSampleData")

new_star <- runDuplexDiscoPreproc(
    data = RNADuplexesRawChimSTAR,
    table_type = "STAR",
    library_type = "SE",
)
```

```{r,  preproc ex2, message=FALSE}
new_bedpe <- runDuplexDiscoPreproc(
    data = RNADuplexesRawBed,
    table_type = "bedpe",
    library_type = "SE", return_gi = TRUE
)
```

By default, to make downstream read filtering slightly more convenient,
`tibble` is returned after the pre-processing. In case of the
`GInteractions` input, input reads already are 2-arm split by design, so
we can return `GInteractions` .

```{r,  preproc ex3, message=FALSE}
# keep only readname in metadata
mcols(RNADuplexSampleGI) <- mcols(RNADuplexSampleGI)["readname"]
new_gi <- runDuplexDiscoPreproc(data = RNADuplexSampleGI, return_gi = TRUE)
head(new_gi, 1)
```

## Classification and filtering

The basic classification routines implemented in the package are
comparing the chimeric junctions to the known splice junctions and
determining the types of the junctions. Both can be called by a 
separate method that adds a corresponding metadata field to the input:

```{r ex_classif1}
gi <- getChimericJunctionTypes(new_gi)
gi <- getSpliceJunctionChimeras(gi, sj_gr = SampleSpliceJncGR)
```

One can inspect detected junction types

```{r ex_classif2}
table(gi$junction_type)
```

and amount of reads which are not likely to come from true RNA
duplexes, because their chimeric junctions are too similar to normal
exon-exon junction

```{r ex_cassif3}
table(gi$splicejnc)
```

For determining the duplex groups, we will leave only chimeric junctions
in a read which do not self-overlap:

```{r }
keep <- which((gi$junction_type == "2arm") & (gi$splicejnc == 0))
gi <- gi[keep]
```

## Clustering reads into duplex groups

The most basic way to find the duplex groups is to simply call

```{r clust1, message=FALSE}
gi <- clusterDuplexGroups(gi)
```

This procedure will:

1.  Find overlaps between all chimeric read pairs.
2.  Represent reads as a directed graph, where edges are weighted with
    the amount of overlap between the reads.
3.  Call a community search algorithm on the graph which finds the read
    clusters.
4.  Add the read cluster IDs as the `dg_id` field to the input object.
    `NA` values are used for reads which are not a part of any read duplex
    group (DG)

```{r ex_clust2}
table(is.na(gi$dg_id))
```

To collapse reads to the DGs with the boundaries, containing
every read of DG, one can call

```{r ex_clist3}
gi_dgs <- collapse_duplex_groups(gi)
head(gi_dgs, 1)
# number of DGs
length(gi_dgs)
```
The above procedure can be applied to any subset of data. 
If the user is interested in the RNA structure in a particular region, they
can select all reads overlapping with this region and repeat the clustering several times. 
By adjusting the parameters (minimum overlap, maximum allowed distance between 
alignment arms, coordinates of coordinates) one can obtain more refined or larger
duplex groups for improving subsequent prediction of RNA secondary structure.

## Customization of the read clustering

### Collapsing identical reads

For large datasets, it is possible to identify and collapse identical
reads – those that map to the same coordinates with the same score. In
overlap-based clustering, these reads would belong to the same duplex
groups. Collapsing them into a single entry reduces the number of nodes
and vertices in the clustering graph, resulting in better time and
memory performance.

```{r , dupl1 }
res_collapse <- collapseIdenticalReads(gi)
# returns new object
gi_unique <- res_collapse$gi_collapsed
```

This procedure adds `duplex_id` column as new id and removes `read_id`,
because single entry in a new object can correspond to multiple reads.
`n_reads` records the number of reads in this temporary duplex group.

```{r dupl2 }
head(res_collapse$stats_df, 1)
```

Note that `collapse_duplex_groups()` method has options to handle
temporary duplex groups, so that merging identical reads before clustering
would not lose information on the read support for duplex group.

```{r dupl3, message=FALSE}
# cluster  gi_unqiue
gi_unique_dg <- collapse_duplex_groups(clusterDuplexGroups(gi_unique))
# check if the get the same amount of reads as in basic clustering
sum(gi_unique_dg$n_reads) == sum(gi_dgs$n_reads)
```

### Custom weights for clustering

There is a possibility for a user to re-define the rules by which
overlaps between reads are weighted before community search.

This might be a customization related to protocols or the size of the
dataset. Possible scenarios are weighting the clustering graph vertices
with hybridization energies calculated directly on the reads instead of
duplex groups, or employing other heuristic assumptions on how RNA
duplexes are translated into chimeric read fragments. e.g in RIL-seq, the
second chimeric arm is more likely to be a target RNA rather than sRNA.

To demonstrate such a modification, we will first compute the overlap graph manually,
using the following procedure, second, modify the graph weights, and third, call the 
clustering. 

First, we find pair overlaps in the input `GInteractions` object

```{r cl1}
graph_df <- computeGISelfOverlaps(gi_unique)
head(graph_df)
```

The size of this dataframe is determined by the presence of pairwise overlap
between both arms of read alignments. The first two columns represent
the element index or `duplex_id` of the corresponding reads. It can be
filtered or supplemented with other relevant metrics of the pairwise
relationship between the reads.

Then we change the rule for how we handle overlap between reads.
In this example we will manually change the span-to-overlap ratio per
pair or overlapping alignment arms.  Read pairs that overlap for less than half 
of their length will not be treated as reliable to be considered part of 
the same DG. 

```{r cl2a}
graph_df <- graph_df[graph_df$ratio.A > 0.5 & graph_df$ratio.B > 0.5, ]
```
We will use the sum of the overlap ratios as the new weights and re-scale them to {0,1}.
```{r cl2b}
rescale_vec <- function(x) {
    return((x - min(x)) / (max(x) - min(x)))
}

graph_df$weight_new <- graph_df$ratio.A + graph_df$ratio.B
graph_df$weight_new <- rescale_vec(graph_df$weight_new)
head(graph_df, 2)
```

This dataframe then can be passed to the clustering function:

```{r cl3, message=FALSE}
gi_clust_adj <- clusterDuplexGroups(gi_unique, graphdf = graph_df, weight_column = "weight_new")
gi_clust_adj_dgs <- collapse_duplex_groups(gi_clust_adj)
```

We get smaller number of the DGs, as some of them are
not assembled after restricting the minimum required overlap

```{r cl4}
length(gi_clust_adj)
```

# Session information

```{r session-info,cache = F,echo=T,message=T,warning=FALSE}
sessionInfo()
```