--- title: "DuplexDiscovereR tutorial" author: - name: "Egor Semenchenko" affiliation: Max Delbrück Center for molecular medicine; Berlin Institute for Medical Systems Biology; Freie Universität Berlin email: egorsbs01@gmail.com package: DuplexDiscovereR abstract: | DuplexDiscovereR implements methods for analysing data from RNA cross-linking and proximity ligation protocols such as SPLASH, PARIS, LIGR-seq and other methods which yeild information on RNA-RNA interactions as chimeric read fragments after high-throughput sequencing. bibliography: '`r system.file("references.bib", package = "DuplexDiscovereR")`' output: BiocStyle::html_document: toc: true toc_depth: 5 toc_float: true number_sections: true vignette: | %\VignetteIndexEntry{DuplexDiscovereR tutorial} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown(css.files = c("custom.css")) ``` ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "##", crop = FALSE ) ``` # Installation `DuplexDiscovereR` can be installed from Bioconductor: ```{r install_chunk, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("DuplexDiscovereR") library(DuplexDiscovereR) ?DuplexDiscovereR ``` or from Github page: ```{r install_chunk2, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") library(devtools) devtools::install_github("Egors01/DuplexDiscovereR") ``` ## Installing RNAduplex For calculating hybridization energies `DuplexDiscovereR` uses [RNAduplex](https://www.tbi.univie.ac.at/RNA/RNAduplex.1.html) program from the [ViennaRNA](https://www.tbi.univie.ac.at/RNA/) software suite. Although this step is optional and is not required for using most of the package methods, we strongly recommend installing ViennaRNA from its [web-page](https://www.tbi.univie.ac.at/RNA/), as predicting RNA hybrids is one of the central aims of the analysis of RNA-RNA interaction data. Please note that ViennaRNA is distributed under its own licence. # Introduction RNA cross-linking and proximity ligation methods as SPLASH, PARIS, LIGR-seq, RIC-seq and others provide infromation on the RNA-RNA interactions on a transcriptome-wide scale. These experiments generate high-throughput sequencing data containing a fraction of chimeric reads, where each chimeric part corresponds to the base-paired stretches of RNA (RNA duplexes) `DuplexDiscovereR` is designed for the bioinformatic analysis of RNA data. It employs a workflow that allows users to identify RNA duplexes and and provides several options for filtering and ranking the results. Starting from the set of aligned chimeric reads, it implements Classification and filtering of non-RNA duplex alignments and their into duplex groups (DGs). Identified DGs are then annotated with transcripts or other genomic features. P-values are calculated to test the hypothesis that DGs are generated by random ligation. Finally, RNA duplex base pairing and hybridization energies are predicted. The optimal procedures for identifying RNA duplexes may vary between methods, as RNA-RNA interaction probing protocols are known to have biases for certain types of RNA and differ in experimental steps. `DuplexDiscovereR` allows user to build customized analysis by utilizing its methods for efficient chimeric read clustering, classification and convenience functions for flexible data filtering, annotation and visualization. RNA duplex data is known to be sparse, with little overlap between the results of between the results of different probing protocols. To facilitate comparisons between different experiments and to allow straightforward cross-checking between replicates, `DuplexDiscovereR` includes functionality to intersect multiple RNA interaction datasets. `DuplexDiscovereR` relies on the `GInteractions` container from `r Biocpkg("InteractionSet")` package for the storing of the RNA-RNA interaction data and uses tidyverse `tibble` and `dplyr` for internal the `r Biocpkg("Gviz")` - based `DuplexTrack` track and supports the exportof the results to the .sam file. # Quick start ## Analysis steps Key steps of RNA duplex data analysis with `DuplexDiscovereR` can be run through the single function call. It is expected that user already aligned the raw sequencing reads which produced some chimeric or split-read alignments. The input can be provided in several formats: - Tabular file with the split-read alignments - *Chimeric.out.junction* file generated by STAR - *.bedpe* format file (defined in [bedtools](https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format)) - `GInteractions` object with chimeric alignments. Full `DuplexDiscovereR` workflow consists of the following steps: - Pre-processing of the input. This step calls the classification of the reads into following classes: - 2arm reads (only one chimeric junction per aligned read) - multi-split (two or more one chimeric junctions per aligned read) - multi-mapped reads - Comparing the 2arm reads mapped reads with the splice junction annotation. This filters out alignments across the exon-exon junctions which typically contaminate the 'true' chimeric alignments originated from the RNA duplexes - Deduplication of the identical read alignments - Clustering of reads into duplex groups (DGs) *Optionally*, - Annotation of the DGs with transcript or other features - Prediction of the base-pairing hybridization energies - Computing p-values from testing the probability of random ligation ## Input arguments Required inputs for minimal run are the following: - `data` Chimeric read alignments or coordinates. Object of any table type, convertible to `tibble` or `GInteractions` - `junctions_gr` `GRanges` object with splice junction coordinates.\ - Library type and type of the read table, if input is not `GInteractions`. Optional inputs for executing the full workflow: - `anno_gr` a `GRanges` object with genes or transcripts annotation object - `df_counts` a 2-column table with the `id` in the 1st column and raw read count in the 2nd. `id` should correspond to the entries in `anno_gr`. Providing this argument triggers p-value calculation. - `fafile` path to *.fasta* file with the genome. Providing this argument triggers prediction of RNA hybrids. ## Loading example data Once installed, we can load the example data. Example dataset is based on the RNA duplex probing of SPLASH ES-cells [Aw. et.al 2016](https://pubmed.ncbi.nlm.nih.gov/27184079/) which was aligned with STAR [Dobin et.al, 2013](https://academic.oup.com/bioinformatics/article-abstract/29/1/15/272537) and subset to the chromosome 22. ```{r dataload,echo=T,eval=T,message=FALSE} library(DuplexDiscovereR) data(RNADuplexesSampleData) ``` Details on alignment of the reads of example dataset and fetching of the annotation data are described in the document which can be found at ```{r dataload2 } system.file("extdata", "scripts", "DD_data_generation.R", package = "DuplexDiscovereR") ``` ## Running analysis ```{r run_workflow, echo=T,eval=T,message=FALSE} genome_fasta <- NULL result <- DuplexDiscovereR::runDuplexDiscoverer( data = RNADuplexesRawChimSTAR, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, df_counts = RNADuplexesGeneCounts, sample_name = "example_run", lib_type = "SE", table_type = "STAR", fafile = genome_fasta, ) ``` The `result` is a `DuplexDiscovererResults` object which contains several outputs of different dimensions. ```{r observe_results1,eval=TRUE} result ``` The primary output are detected duplex groups. This is a `GInteractions`, where each entry represents boundaries of the detected RNA duplex. Metadata fields as `n_reads` , `p_val` , `score` and others can be used for ranking and sub-setting results for downstream analysis. ```{r observe_results2,eval=TRUE} gi_clusters <- dd_get_duplex_groups(result) head(gi_clusters, 2) ``` Individual read-level data can be accessed by running: ```{r observe_results3,eval=TRUE} gi_reads <- dd_get_chimeric_reads(result) head(gi_reads, 2) ``` Each entry in `gi_reads` is a single read. They are tagged by DG with the `dg_id`, but not collapsed into groups. This layout is used for visualization and may be useful for a more detailed review of the structure of the individual DGs. The dimensions of `gi_reads` do not correspond 1:1 to the size of the to the size of the input, since in general not every element in the input data can be represented as a 2-arm split alignment. To track each input alignment one can query the the read-level classification results, which have the same dimension with the input. ```{r observe_results4a,eval=TRUE} df_reads <- dd_get_reads_classes(result) print(dim(df_reads)) print(dim(RNADuplexesRawChimSTAR)) ``` `df_reads` has fields such as `read_type`, `map_type` and `junction_type` indicating the results of chimeric read classification and filtering. This layout is useful for assessing the statistics and highlighting potential problems and optimization points, i.e. how many reads from the input were classified as true chimeric and how many of them were assigned to duplex groups. ```{r observe_results4b,eval=TRUE} table(df_reads$read_type) ``` Among the 2-arm alignments from `gi_reads`, only unambigously aligned reads without self-ovelaps that pass splice-junctions and minimum junction length filters are subjected to clustering into DGs. To see the determined types of 2-arm alignments, we can check the `junction_type` metadata field ```{r observe_results5,eval=TRUE} table(gi_reads$junction_type) ``` ## Writing output The resulting `GInteractions` with duplex groups are ready for integration into transcriptomic analysis using other Bioconductor packages. There are several options available for outputting the results for other uses. ### Output to table One can convert the object with duplex groups to dataframe-like object and write it to file ```{r write_table, eval=TRUE} clusters_dt <- makeDfFromGi(gi_clusters) write.table(clusters_dt, file = tempfile(pattern = "dgs_out", fileext = ".tab")) ``` ### Output to .sam file Saving results to .sam file can be a useful tool for downstream visualization with interactive viewers as IGV ```{r write_sam, eval=TRUE, message = FALSE} writeGiToSAMfile(gi_clusters, file_out = tempfile(pattern = "dgs_out", fileext = ".sam"), genome = "hg38") ``` ## Visualization Collapsed DGs can be visualized with Gviz -based DuplexTrack, allowing overlays of RNA duplexes with the genes of other features ```{r, vis_1, eval=TRUE,message=FALSE} library(Gviz) # define plotting region plotrange <- GRanges( seqnames = "chr22", ranges = IRanges( start = c(37877619), end = c(37878053) ), strand = "+" ) # make genes track anno_track <- Gviz::GeneRegionTrack(SampleGeneAnnoGR, name = "chr22 Genes", range = plotrange ) # construct DuplexTrack duplex_track <- DuplexTrack( gi = gi_clusters, gr_region = plotrange, labels.fontsize = 12, arcConstrain = 4, annotation.column1 = "gene_name.A", annotation.column2 = "gene_name.B" ) plotTracks(list(anno_track, duplex_track), sizes = c(1, 3), from = start(plotrange) - 100, to = end(plotrange) + 100 ) ``` ## Calcualting hybridization energies In the example above we have omitted the base-pairing prediction and the hybridization energies by not passing the path to file with genome as the `fafile` argument. However, the prediction of RNA hybrids is central to the analysis of RNA-RNA interaction data. To compute the structure formed by two stretches of RNA `DuplexDiscoverer` uses *RNAduplex* algorithm from ViennaRNA. Note that ViennaRNA is distributed under its own licence. Please refer to [ViennaRNA's license](https://viennarna.readthedocs.io/en/latest/license.html) for details. If the *RNAduplex* is installed, all steps of the analysis can be re-run by calling `runDuplexDiscoverer` with the option `fafile = `. Alternatively, one can predict RNA hybrids by calling separate function on already existing `GInteractions` object: ``` sequence <- paste0( "AGCUAGCGAUAGCUAGCAUCGUAGCAUCGAUCGUAAGCUAGCUAGCUAGCAUCGAUCGUAGCUAGCAUCGAU", "CGUAGCAUCGUAGCUAGCUAGCUAUGCGAUU") # Save the sequence to a temp fasta file fasta_file = tempfile(fileext = '.fa') chrom <- "test_chrA" writeLines(c(">test_chrA", sequence), con = fasta_file) # Create the GInteraction object # Define start and end positions for the base-pairing regions regions <- data.frame( start1 = c(1, 11, 21, 31, 41), end1 = c(10, 20, 30, 40, 50), start2 = c(91, 81, 71, 61, 51), end2 = c(100, 90, 80, 70, 60) ) # GRanges objects for the anchors anchor1 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start1, end = regions$end1)) anchor2 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start2, end = regions$end2)) interaction <- GInteractions(anchor1, anchor2) # predict hybrids getRNAHybrids(interaction,fasta_file) ``` ## Comparing multiple samples or replicates Due to the sparse nature of RNA duplex data, it is desirable to be able to compare replicates of the same experiment or to compare different data sets. `DuplexDiscovereR` provides a method for finding intersections between multiple samples. Comparisons between ranges can result in one-to-many relations, i.e. when a region (or pair a of regions) of one sample contains multiple regions of other samples. `DuplexDiscovereR` addresses this problem by using the following procedure: first it computes a non-redundant set of interactions that stacks and collapses interactions from all input samples. This resulting super-set is then compared in a pairwise manner with each entry from each individual sample. To demonstrate how multiple samples can be compared, we will generate three pseudo-replicates from the example data which we clustered above. First, we will select the clustered reads and divide them into three groups ```{r comp1, eval=TRUE} clust_reads <- gi_reads clust_reads <- clust_reads[!is.na(clust_reads$dg_id)] # Create separate pseudo-sample objects for each group set.seed(123) group_indices <- sample(rep(1:3, length.out = length(clust_reads))) group1 <- clust_reads[group_indices == 1] group2 <- clust_reads[group_indices == 2] group3 <- clust_reads[group_indices == 3] ``` Since individual reads have already been assigned to duplex groups, we can collapse them, ensuring that we have some overlap between reads that belong to the same 'true' DGs, but are spread across three artificial sets. ```{r comp2, eval=TRUE} group1 <- collapse_duplex_groups(group1) group2 <- collapse_duplex_groups(group2) group3 <- collapse_duplex_groups(group3) ``` ```{r comp3, eval=TRUE, message=FALSE} a <- list("sample1" = group1, "sample2" = group2, "sample3" = group3) res_comp <- compareMultipleInteractions(a) names(res_comp) ``` Result contains slot for interaction superset and overlap information information. We can visualize the intersections of the samples using the upset plot ```{r comp4, eval=TRUE} # first, check if UpSetR is installed if (!requireNamespace("UpSetR", quietly = TRUE)) { stop("Install 'UpSetR' to use this function.") } library(UpSetR) upset(res_comp$dt_upset, text.scale = 1.5) ``` # Building customized RNA duplex data analysis One can customize the analysis by performing classification, clustering and subsequent annotation procedures step by step, adapting the procedures to the RNA-RNA interaction probing protocol or upstream/downstream analysis. In this section, we show how users can use the core methods in the package to achieve more flexibility in this task. ## Preprocessing Most of the methods in `DuplexDiscovereR` work with the RNA-RNA interaction data stored in `GInteraction` object. There are several metadata fields which are not strictly required by the package, but utilized in procedures run classification: `map_type`, `read_id` and `n_reads`,`score` for clustering. The easiest way to run `DuplexDiscovereR` step-by-step is to start with a pre-processing routine. It is fairly flexible in input type: it can be either `GInteractions` , *Chimeric.out.Junction* file from *STAR* or generic *.bedpe* file with pairs of regions. ```{r, preproc ex1, message=FALSE} data("RNADuplexesSampleData") new_star <- runDuplexDiscoPreproc( data = RNADuplexesRawChimSTAR, table_type = "STAR", library_type = "SE", ) ``` ```{r, preproc ex2, message=FALSE} new_bedpe <- runDuplexDiscoPreproc( data = RNADuplexesRawBed, table_type = "bedpe", library_type = "SE", return_gi = TRUE ) ``` By default, to make downstream read filtering slightly more convenient, `tibble` is returned after the pre-processing. In case of the `GInteractions` input, input reads already are 2-arm split by design, so we can return `GInteractions` . ```{r, preproc ex3, message=FALSE} # keep only readname in metadata mcols(RNADuplexSampleGI) <- mcols(RNADuplexSampleGI)["readname"] new_gi <- runDuplexDiscoPreproc(data = RNADuplexSampleGI, return_gi = TRUE) head(new_gi, 1) ``` ## Classification and filtering The basic classification routines implemented in the package are comparing the chimeric junctions to the known splice junctions and determining the types of the junctions. Both can be called by a separate method that adds a corresponding metadata field to the input: ```{r ex_classif1} gi <- getChimericJunctionTypes(new_gi) gi <- getSpliceJunctionChimeras(gi, sj_gr = SampleSpliceJncGR) ``` One can inspect detected junction types ```{r ex_classif2} table(gi$junction_type) ``` and amount of reads which are not likely to come from true RNA duplexes, because their chimeric junctions are too similar to normal exon-exon junction ```{r ex_cassif3} table(gi$splicejnc) ``` For determining the duplex groups, we will leave only chimeric junctions in a read which do not self-overlap: ```{r } keep <- which((gi$junction_type == "2arm") & (gi$splicejnc == 0)) gi <- gi[keep] ``` ## Clustering reads into duplex groups The most basic way to find the duplex groups is to simply call ```{r clust1, message=FALSE} gi <- clusterDuplexGroups(gi) ``` This procedure will: 1. Find overlaps between all chimeric read pairs. 2. Represent reads as a directed graph, where edges are weighted with the amount of overlap between the reads. 3. Call a community search algorithm on the graph which finds the read clusters. 4. Add the read cluster IDs as the `dg_id` field to the input object. `NA` values are used for reads which are not a part of any read duplex group (DG) ```{r ex_clust2} table(is.na(gi$dg_id)) ``` To collapse reads to the DGs with the boundaries, containing every read of DG, one can call ```{r ex_clist3} gi_dgs <- collapse_duplex_groups(gi) head(gi_dgs, 1) # number of DGs length(gi_dgs) ``` The above procedure can be applied to any subset of data. If the user is interested in the RNA structure in a particular region, they can select all reads overlapping with this region and repeat the clustering several times. By adjusting the parameters (minimum overlap, maximum allowed distance between alignment arms, coordinates of coordinates) one can obtain more refined or larger duplex groups for improving subsequent prediction of RNA secondary structure. ## Customization of the read clustering ### Collapsing identical reads For large datasets, it is possible to identify and collapse identical reads – those that map to the same coordinates with the same score. In overlap-based clustering, these reads would belong to the same duplex groups. Collapsing them into a single entry reduces the number of nodes and vertices in the clustering graph, resulting in better time and memory performance. ```{r , dupl1 } res_collapse <- collapseIdenticalReads(gi) # returns new object gi_unique <- res_collapse$gi_collapsed ``` This procedure adds `duplex_id` column as new id and removes `read_id`, because single entry in a new object can correspond to multiple reads. `n_reads` records the number of reads in this temporary duplex group. ```{r dupl2 } head(res_collapse$stats_df, 1) ``` Note that `collapse_duplex_groups()` method has options to handle temporary duplex groups, so that merging identical reads before clustering would not lose information on the read support for duplex group. ```{r dupl3, message=FALSE} # cluster gi_unqiue gi_unique_dg <- collapse_duplex_groups(clusterDuplexGroups(gi_unique)) # check if the get the same amount of reads as in basic clustering sum(gi_unique_dg$n_reads) == sum(gi_dgs$n_reads) ``` ### Custom weights for clustering There is a possibility for a user to re-define the rules by which overlaps between reads are weighted before community search. This might be a customization related to protocols or the size of the dataset. Possible scenarios are weighting the clustering graph vertices with hybridization energies calculated directly on the reads instead of duplex groups, or employing other heuristic assumptions on how RNA duplexes are translated into chimeric read fragments. e.g in RIL-seq, the second chimeric arm is more likely to be a target RNA rather than sRNA. To demonstrate such a modification, we will first compute the overlap graph manually, using the following procedure, second, modify the graph weights, and third, call the clustering. First, we find pair overlaps in the input `GInteractions` object ```{r cl1} graph_df <- computeGISelfOverlaps(gi_unique) head(graph_df) ``` The size of this dataframe is determined by the presence of pairwise overlap between both arms of read alignments. The first two columns represent the element index or `duplex_id` of the corresponding reads. It can be filtered or supplemented with other relevant metrics of the pairwise relationship between the reads. Then we change the rule for how we handle overlap between reads. In this example we will manually change the span-to-overlap ratio per pair or overlapping alignment arms. Read pairs that overlap for less than half of their length will not be treated as reliable to be considered part of the same DG. ```{r cl2a} graph_df <- graph_df[graph_df$ratio.A > 0.5 & graph_df$ratio.B > 0.5, ] ``` We will use the sum of the overlap ratios as the new weights and re-scale them to {0,1}. ```{r cl2b} rescale_vec <- function(x) { return((x - min(x)) / (max(x) - min(x))) } graph_df$weight_new <- graph_df$ratio.A + graph_df$ratio.B graph_df$weight_new <- rescale_vec(graph_df$weight_new) head(graph_df, 2) ``` This dataframe then can be passed to the clustering function: ```{r cl3, message=FALSE} gi_clust_adj <- clusterDuplexGroups(gi_unique, graphdf = graph_df, weight_column = "weight_new") gi_clust_adj_dgs <- collapse_duplex_groups(gi_clust_adj) ``` We get smaller number of the DGs, as some of them are not assembled after restricting the minimum required overlap ```{r cl4} length(gi_clust_adj) ``` # Session information ```{r session-info,cache = F,echo=T,message=T,warning=FALSE} sessionInfo() ```