| Title: | Analysis of the data from RNA duplex probing experiments |
|---|---|
| Description: | DuplexDiscovereR is a package designed for analyzing data from RNA cross-linking and proximity ligation protocols such as SPLASH, PARIS, LIGR-seq, and others. DuplexDiscovereR accepts input in the form of chimerically or split-aligned reads. It includes procedures for alignment classification, filtering, and efficient clustering of individual chimeric reads into duplex groups (DGs). Once DGs are identified, the package predicts RNA duplex formation and their hybridization energies. Additional metrics, such as p-values for random ligation hypothesis or mean DG alignment scores, can be calculated to rank final set of RNA duplexes. Data from multiple experiments or replicates can be processed separately and further compared to check the reproducibility of the experimental method. |
| Authors: | Egor Semenchenko [aut, cre, cph] (ORCID: <https://orcid.org/0009-0007-5306-076X>), Volodymyr Tsybulskyi [ctb] (ORCID: <https://orcid.org/0009-0002-4141-6291>), Irmtraud M. Meyer [aut, cph] (ORCID: <https://orcid.org/0000-0002-4048-3479>) |
| Maintainer: | Egor Semenchenko <[email protected]> |
| License: | GPL-3 |
| Version: | 1.7.0 |
| Built: | 2026-05-13 09:46:40 UTC |
| Source: | https://github.com/bioc/DuplexDiscovereR |
Check if there are a temporary duplex records with duplex_id, which consist of more than one read n_reads > 1 , but
does not have assigned any dg_id as the duplex group (DG) index. Creates new dg_id if n_reads > 1
.addDGidsForTmpDGs(gi_input).addDGidsForTmpDGs(gi_input)
gi_input |
GInteractions with the |
Meant to be used in the situations when previous collapsing steps merged two or more reads to the temporary DG with duplex_id, but
global clustering has not identified any overlap between this temporary group and other duplexes, resulting in undefined dg_id.
This function looks up for these cases and creates new dg_id for temporary DGs, marking them as the final DGs.
New dg_id values are unique and allocated sequentially after the maximum value of dg_id
GInteractions object with new dg_id for rows with n_reads > 1
GInteractions
Merges the count dataframe and interactions metadata by id_col
If key is not found, in metadata throws error
.addGeneCounts(gi, df_counts, id_col = "gene_id").addGeneCounts(gi, df_counts, id_col = "gene_id")
gi |
|
df_counts |
dataframe with read counts |
id_col |
key to use in merge |
GInteractions with added counts
Overlays RNA duplexes with GRanges annotation object.
annotateGI( gi, anno_gr, keys = c("gene_name", "gene_type", "gene_id"), ambig_key = keys[1], save_ambig = TRUE, order_key = NULL, order_vec = NULL )annotateGI( gi, anno_gr, keys = c("gene_name", "gene_type", "gene_id"), ambig_key = keys[1], save_ambig = TRUE, order_key = NULL, order_vec = NULL )
gi |
|
anno_gr |
|
keys |
vectors with the names of the features to use for annotation. |
ambig_key |
a which feature to use for recording the annotation ambiguity. Determines the values in |
save_ambig |
When RNA duplex overlaps multiple features, ambiguous annotation for a single key
can be stored in |
order_key |
Experimental. In case RNA duplex overlaps multiple features, this key will be used to sort the overlapping features. |
order_vec |
Experimental. An ordered vector of values in |
For each annotation feature in keys, i.e if keys=c(keyname1),
then <keyname1>.A, <keyname1>.B annotation fields will be created, containing the
names of overlapping features
If no overlap is found for the feature, then filed will have NA
GInteractions object with new fields
data("RNADuplexesSampleData") annotateGI(gi = RNADuplexSampleDGs, anno_gr = SampleGeneAnnoGR) # Prioritisation of the snRNA and lncRNA before mRNA if genes overlap annotateGI( gi = RNADuplexSampleDGs, anno_gr = SampleGeneAnnoGR, keys = c("gene_id", "gene_name", "gene_type"), order_key = "gene_type", order_vec = c("snRNA", "lncRNA", "protein_coding"), save_ambig = TRUE )data("RNADuplexesSampleData") annotateGI(gi = RNADuplexSampleDGs, anno_gr = SampleGeneAnnoGR) # Prioritisation of the snRNA and lncRNA before mRNA if genes overlap annotateGI( gi = RNADuplexSampleDGs, anno_gr = SampleGeneAnnoGR, keys = c("gene_id", "gene_name", "gene_type"), order_key = "gene_type", order_vec = c("snRNA", "lncRNA", "protein_coding"), save_ambig = TRUE )
DuplexTrack objectDuplexTrack inherits from [Gviz::Annotaiontrack()] and its Gviz parents.
Most likely, user doesn't need all dioplay pars for the parents, so only
parameters relevant to the DuplexTrack are returned by default.
availableDisplayPars(class)availableDisplayPars(class)
class |
|
list of the default display parameters.
library(InteractionSet) anchor1 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(100, 600, 1100, 1600, 2100), end = c(200, 700, 1200, 1700, 2200) ), strand = "+" ) anchor2 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(300, 800, 1300, 1800, 2300), end = c(400, 900, 1400, 1900, 2400) ), strand = "+" ) interactions <- GInteractions(anchor1, anchor2, mode = "strict") gr_region <- range(anchor1, anchor2) a <- DuplexTrack(interactions, gr_region = gr_region, stacking = "dense") availableDisplayPars("DuplexTrack") DuplexDiscovereR::availableDisplayPars(a)library(InteractionSet) anchor1 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(100, 600, 1100, 1600, 2100), end = c(200, 700, 1200, 1700, 2200) ), strand = "+" ) anchor2 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(300, 800, 1300, 1800, 2300), end = c(400, 900, 1400, 1900, 2400) ), strand = "+" ) interactions <- GInteractions(anchor1, anchor2, mode = "strict") gr_region <- range(anchor1, anchor2) a <- DuplexTrack(interactions, gr_region = gr_region, stacking = "dense") availableDisplayPars("DuplexTrack") DuplexDiscovereR::availableDisplayPars(a)
Calculates significance statistics for each gene/transcript pair represented
in a GInteractions object. Two optional models are supported:
calculateLigationPvalues( gi, df_counts, id_col = "gene_id", compute_binom = TRUE, compute_odds = FALSE )calculateLigationPvalues( gi, df_counts, id_col = "gene_id", compute_binom = TRUE, compute_odds = FALSE )
gi |
|
df_counts |
|
id_col |
|
compute_binom |
|
compute_odds |
|
A binomial random-ligation model, where pair probability is derived from per-gene/transcript counts and tested using the right-tail binomial test.
An odds-ratio / Fisher exact test based on a 2x2 contingency table using chimeric counts and singleton counts. This test is used in RBP-dependent and RNA probing methods like RIL-seq
The function also outputs abundance fractions for each duplex pair and counts of other RNA duplexes formed by either or neither gene/transcript in the pair.
Binomial model
Null hypothesis: an observed duplex is generated by random ligation.
Alternative hypothesis: the duplex is enriched above the random-ligation expectation.
The random-ligation probability is defined as:
where:
The observed-pair probabilities are renormalized to sum to one and the p-value is computed from the right tail of the binomial distribution.
Odds-ratio / Fisher model
For a pair (A, B), the contingency table is:
| B present | B absent | |
| A present | C1 | C2 |
| A absent | C3 | C4 |
with:
C1: chimeric fragments containing both A and B
C2: fragments containing A but not B
C3: fragments containing B but not A
C4: fragments containing neither A nor B
A one-sided Fisher exact test (alternative = "greater") is used.
GInteractions object with added metadata columns.
data("RNADuplexesSampleData") gi <- calculateLigationPvalues( RNADuplexSampleDGs, df_counts = RNADuplexesGeneCounts ) hist(gi$p.adj, breaks = 20)data("RNADuplexesSampleData") gi <- calculateLigationPvalues( RNADuplexSampleDGs, df_counts = RNADuplexesGeneCounts ) hist(gi$p.adj, breaks = 20)
Wraps two procedures for different types of classification for read alignment:
test if chimeric junction map to two non-overlapped regions or shorter than defined minimum distance
test if chimeric junction is also a splice junction
classifyTwoArmChimeras( gi, min_junction_len = 4, junctions_gr, max_sj_shift = 4 )classifyTwoArmChimeras( gi, min_junction_len = 4, junctions_gr, max_sj_shift = 4 )
gi |
|
min_junction_len |
minimum allowed distance between two chimeric arms |
junctions_gr |
|
max_sj_shift |
maximum shift between either donor and acceptor splice sites and corresponding chimreic junction coordinates to count chimeric junction as splice junction |
Calls detection of the chimeric junction type, annotates short junctions on same chromosome an strand as 'short'. Compares chimeric junctions with splice junctions. Adds results as the new metadata fields parallel to the input.
GInteractions object object of the same size with new columns:
filled with 0 or 1
factor for the junction types
DuplexDiscovereR::getChimericJunctionTypes(), DuplexDiscovereR::getSpliceJunctionChimeras()
data("RNADuplexesSampleData") head(RNADuplexSampleGI) # remove all metadata mcols(RNADuplexSampleGI) <- NULL gi <- classifyTwoArmChimeras(RNADuplexSampleGI, min_junction_len = 5, junctions_gr = SampleSpliceJncGR, max_sj_shift = 10 ) table(gi$splicejnc) table(gi$junction_type)data("RNADuplexesSampleData") head(RNADuplexSampleGI) # remove all metadata mcols(RNADuplexSampleGI) <- NULL gi <- classifyTwoArmChimeras(RNADuplexSampleGI, min_junction_len = 5, junctions_gr = SampleSpliceJncGR, max_sj_shift = 10 ) table(gi$splicejnc) table(gi$junction_type)
GInteractions objectMain method to find duplex groups from the individual interactions
clusterDuplexGroups( gi, graphdf = NULL, maxgap = 40, minoverlap = 10, id_column = "duplex_id", weight_column = "weight", fast_greedy = FALSE, decompose = FALSE, id_columns_grapdf = paste(id_column, c(1, 2), sep = "."), min_arm_ratio = 0.3, dump_graph = FALSE, dump_path = "" )clusterDuplexGroups( gi, graphdf = NULL, maxgap = 40, minoverlap = 10, id_column = "duplex_id", weight_column = "weight", fast_greedy = FALSE, decompose = FALSE, id_columns_grapdf = paste(id_column, c(1, 2), sep = "."), min_arm_ratio = 0.3, dump_graph = FALSE, dump_path = "" )
gi |
|
graphdf |
Optional. Dataframe representing connection edges between entries in gi If not provided, graphdf is created inside the function |
maxgap |
For graph creation only. Max shift between arms starts and ends for pair of overlapping reads |
minoverlap |
For graph creation only. Minimum required overlap between either arm for pair of overlapping reads Other optional arguments, which are not relevant, unless user want to modify clustering weights or modify clustering in some other way |
id_column |
Optional. Column name in the |
weight_column |
Optional. If graphdf is provided, field to use for weight overlaps |
fast_greedy |
Optional. Run the fast_greedy algorithm instead of Louvain. Can speed up calcualtion for the large graphs. |
decompose |
Decompose graph into separate sub-graphs before clustering. |
id_columns_grapdf |
Column in the graph dataframe, which was used for index |
min_arm_ratio |
For graph creation only. Span-to-overlap ratio threshold. If smaller than this value, then edge is not drawn |
dump_graph |
For debug. Export the graph elements. not used |
dump_path |
For debug. PArt to export the graph elements. not used |
Accepts or creates the connections graphdf dataframe, creates graph with
igraph package, uses community detection algoritm to call clusters.
New field dg_id is added to label the clusters (duplex groups).
If no community is found for the read, dg_id is NA
GInteractions object with new dg_id column
data("RNADuplexesSampleData") # run preprocessing and filtering preproc_df <- runDuplexDiscoPreproc(RNADuplexesRawBed, table_type = "bedpe") preproc_gi <- makeGiFromDf(preproc_df) preproc_gi <- classifyTwoArmChimeras(preproc_gi, min_junction_len = 5, junctions_gr = SampleSpliceJncGR, max_sj_shift = 10 ) # collapse duplicates gi <- collapseIdenticalReads(preproc_gi)$gi # run global clustering gi <- clusterDuplexGroups(gi) # check dg_ids table(is.na(gi$dg_id))data("RNADuplexesSampleData") # run preprocessing and filtering preproc_df <- runDuplexDiscoPreproc(RNADuplexesRawBed, table_type = "bedpe") preproc_gi <- makeGiFromDf(preproc_df) preproc_gi <- classifyTwoArmChimeras(preproc_gi, min_junction_len = 5, junctions_gr = SampleSpliceJncGR, max_sj_shift = 10 ) # collapse duplicates gi <- collapseIdenticalReads(preproc_gi)$gi # run global clustering gi <- clusterDuplexGroups(gi) # check dg_ids table(is.na(gi$dg_id))
Collapse each interaction in the input to the duplex group based on the pre-computed dg_id
collapse_duplex_groups( gi, return_unclustered = FALSE, return_collapsed = TRUE, keep_meta = TRUE )collapse_duplex_groups( gi, return_unclustered = FALSE, return_collapsed = TRUE, keep_meta = TRUE )
gi |
|
return_unclustered |
add unclustered reads to output |
return_collapsed |
add duplex groups, which were created as temporary with n_reads > 1 but was not clustered to the DG golabally. This parameter is used internally and should be kept default in most situations. |
keep_meta |
whether to keep metadata, which only unclustered reads have, in case of a mixed output |
'dg_id' is used as the identifier for the duplex group
Reads belonging to the same duplex group are collapsed into a single entry
with start and end are set as min() and max() coordinate of the reads in within the duplex group.
The 'score' column is averaged across the duplex group reads is calculated and
put as the 'score' for the collapsed duplex group
Behavior in case 'dg_id' = NA: Option 'return_unclustered' -
whether unclustered reads with should be added to the output gi
Interaction is not returned in the output. Default.
Interaction is returned in the output, output is mixed duplex groups and individual reads
Internally used argument #'
In case interaction already collapsed and n_read > 1, interaction will not be returned as duplex group
In case interaction has n_read > 1, interaction will be treated as duplex group
GInteractions object with collapsed duplex groups
# load example of clustered data data("RNADuplexesSampleData") # some reads assigned to DG, some are not table(is.na(RNADuplexSampleGI$dg_id)) # Return only DGs gicollapsed <- collapse_duplex_groups(RNADuplexSampleGI, return_unclustered = FALSE) # Return DGs and unclustered reads as well gimixed <- collapse_duplex_groups(RNADuplexSampleGI, return_unclustered = TRUE) # load small sample GInteractions and process it manually data("RNADuplexesSmallGI") # First, collapse duplicated reads. This adds n_reads and duplex ids ginodup <- collapseIdenticalReads(SampleSmallGI)$gi_collapsed # Second, run clustering, get DG ids ginodup <- clusterDuplexGroups(ginodup) # Return all DGs result in n=3 DGS, one of them formed by # identical duplicated alignments collapse_duplex_groups(ginodup, return_collapsed = TRUE) # Return DGs, but drop duplicated returns n=2 DGs collapse_duplex_groups(ginodup, return_collapsed = FALSE)# load example of clustered data data("RNADuplexesSampleData") # some reads assigned to DG, some are not table(is.na(RNADuplexSampleGI$dg_id)) # Return only DGs gicollapsed <- collapse_duplex_groups(RNADuplexSampleGI, return_unclustered = FALSE) # Return DGs and unclustered reads as well gimixed <- collapse_duplex_groups(RNADuplexSampleGI, return_unclustered = TRUE) # load small sample GInteractions and process it manually data("RNADuplexesSmallGI") # First, collapse duplicated reads. This adds n_reads and duplex ids ginodup <- collapseIdenticalReads(SampleSmallGI)$gi_collapsed # Second, run clustering, get DG ids ginodup <- clusterDuplexGroups(ginodup) # Return all DGs result in n=3 DGS, one of them formed by # identical duplicated alignments collapse_duplex_groups(ginodup, return_collapsed = TRUE) # Return DGs, but drop duplicated returns n=2 DGs collapse_duplex_groups(ginodup, return_collapsed = FALSE)
Two entries (reads) are considered identical if they share start, end, strand and score vales Identical entries are collapsed into the single one.
collapseIdenticalReads(gi)collapseIdenticalReads(gi)
gi |
GInteractions(mode='strict') object with chromA, strandA, startA, endA, chromB, strandB, startB, endB, score columns Optionally cigar_alnA, cigar_alnB columns are also considered for collapsing 'read_id' column used as the index in the initial objects. Created, if not exists |
Adds columns to the collapsed object duplex_id (int) unique record id n_reads (int) number of entries collapsed
result_list object with keys ' gi_collapsed': New collapsed GInteraction object ' stats_df': tibble with the mapping of the original entries to the new duplex_id
# load data data("RNADuplexesSmallGI") res_collapse <- collapseIdenticalReads(SampleSmallGI) gi_new <- res_collapse[["gi_collapsed"]] # keeps the mapping of the colapsed object to new read_stats_df <- res_collapse[["stats_df"]]# load data data("RNADuplexesSmallGI") res_collapse <- collapseIdenticalReads(SampleSmallGI) gi_new <- res_collapse[["gi_collapsed"]] # keeps the mapping of the colapsed object to new read_stats_df <- res_collapse[["stats_df"]]
Function calls clustering algorithm several times and collapses highly similar reads to the temporary duplex groups (DGs).
collapseSimilarChimeras( gi, read_stats_df, maxgap = 5, niter = 2, minoverlap = 10, min_nodes = 10 )collapseSimilarChimeras( gi, read_stats_df, maxgap = 5, niter = 2, minoverlap = 10, min_nodes = 10 )
gi |
|
read_stats_df |
|
maxgap |
Maximum relative shift between the overlapping read arms |
niter |
Number of times clustering will be called |
minoverlap |
Minimum required overlap between either read arm |
min_nodes |
Minimum count of nodes to finish the interaction merging |
Calling this procedure before global read clustering substantially reduces time required for calling DGs. Collapsed duplex groups are aggregated only from the reads which are shifted by only a few nucleotides from each other. These DGs are temporary until full library clustering is called. To keep track of the mapping of the temprary DGs to the input, dedicated dataframe is returned. The 'duplex_id' column will be added or updated as identifier for the temporary duplex group. The number of reads under single 'duplex_id' is recorded in the 'n_reads' fields
a list with the following keys
GInteractions object with both collapsed duplex groups
and not-collapsed unchanged reads
tibble With the mapping from the unique read -
with the the infromation about time and memory reaquired for the function
call
Combines all interaction into single superset by clustering & collapsing. Then compares every input entry with the superset. Overlaps between superset and inputs are recorded in a table as 0/1
compareMultipleInteractions( gi_samples_list, min_ratio = 0.3, minoverlap = 5, maxgap = 50, niter = 3, gi_superset = NULL, anno_gr = NULL )compareMultipleInteractions( gi_samples_list, min_ratio = 0.3, minoverlap = 5, maxgap = 50, niter = 3, gi_superset = NULL, anno_gr = NULL )
gi_samples_list |
anmes list with the |
min_ratio |
If the overlap-to-span ratio for either arm (A or B) for pair
of chimeric reads is less than |
minoverlap |
Parameter for read clustering to create a superset. Minimum required overlap to for either arm (A or B) for pair of entries. |
maxgap |
Parameter for read clustering. Minimum required shift between start and end coordinates of arms for pair of overlapping entries..
If the shift is longer than |
niter |
Internal parameter for debugging. Number of cluster& collapse iterations to find superset |
gi_superset |
Optional. Superset defining the space (all) of the interactions, against which inputs from the list will be compared. |
anno_gr |
Optional. |
dataframe recodding the overlaps between samples and supeset
# Create test set of RNA interactions chrom <- "chr1" start1 <- c(1, 11, 21, 31, 41, 51, 61, 71, 81, 91) end1 <- start1 + 9 start2 <- c(101, 111, 121, 131, 141, 151, 161, 171, 181, 191) end2 <- start2 + 9 anchor1 <- GRanges(seqnames = chrom, ranges = IRanges(start = start1, end = end1)) anchor2 <- GRanges(seqnames = chrom, ranges = IRanges(start = start2, end = end2)) interaction <- GInteractions(anchor1, anchor2) # Ensure some overlaps n <- length(interaction) group_size <- ceiling(n / 2) group_indices1 <- sort(sample(seq_len(n), group_size)) group_indices2 <- sort(sample(seq_len(n), group_size)) group_indices3 <- sort(sample(seq_len(n), group_size)) # Create separate GInteractions objects for each group group1 <- interaction[group_indices1] group2 <- interaction[group_indices2] group3 <- interaction[group_indices3] # format input and call comparison a <- list("sample1" = group1, "sample2" = group2, "sample3" = group3) res <- compareMultipleInteractions(a) # comparison result head(res$dt_upset) # superset res$gi_all # dataframe for the Upset plot res$dt_upset# Create test set of RNA interactions chrom <- "chr1" start1 <- c(1, 11, 21, 31, 41, 51, 61, 71, 81, 91) end1 <- start1 + 9 start2 <- c(101, 111, 121, 131, 141, 151, 161, 171, 181, 191) end2 <- start2 + 9 anchor1 <- GRanges(seqnames = chrom, ranges = IRanges(start = start1, end = end1)) anchor2 <- GRanges(seqnames = chrom, ranges = IRanges(start = start2, end = end2)) interaction <- GInteractions(anchor1, anchor2) # Ensure some overlaps n <- length(interaction) group_size <- ceiling(n / 2) group_indices1 <- sort(sample(seq_len(n), group_size)) group_indices2 <- sort(sample(seq_len(n), group_size)) group_indices3 <- sort(sample(seq_len(n), group_size)) # Create separate GInteractions objects for each group group1 <- interaction[group_indices1] group2 <- interaction[group_indices2] group3 <- interaction[group_indices3] # format input and call comparison a <- list("sample1" = group1, "sample2" = group2, "sample3" = group3) res <- compareMultipleInteractions(a) # comparison result head(res$dt_upset) # superset res$gi_all # dataframe for the Upset plot res$dt_upset
GInteractions
Utility function to find overlapping reads in the input and calculate overlap scores. Removes self-hits. Computes overlap/span ratios for each interaction arm. Sum of the scores is recorded in 'weight' field
computeGISelfOverlaps( gi, id_column = "duplex_id", maxgap = 40, minoverlap = 10 )computeGISelfOverlaps( gi, id_column = "duplex_id", maxgap = 40, minoverlap = 10 )
gi |
input gi object |
id_column |
column which use for using as ids for entries |
maxgap |
parameter for call of |
minoverlap |
parameter for call |
dataframe with indexes of pairwise overlapsin input and columns for span, overlap, ratios of either read arm
data("RNADuplexesSmallGI") computeGISelfOverlaps(SampleSmallGI)data("RNADuplexesSmallGI") computeGISelfOverlaps(SampleSmallGI)
GInteractions object to Granges
Creates the 'long' GRanges by stacking the A and B arms one 'on top' of the other.
Adds id and group fields as indicators of original index and interaction
arm (A- left arm, B- right arm)
convert_gi_to_ranges(gi)convert_gi_to_ranges(gi)
gi |
|
GRanges twice the length of the input
data("RNADuplexesSmallGI") convert_gi_to_ranges(SampleSmallGI)data("RNADuplexesSmallGI") convert_gi_to_ranges(SampleSmallGI)
chimeric_reads SlotRetrieves the value of the chimeric_reads slot in a DuplexDiscovererResults object.
dd_get_chimeric_reads(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_chimeric_reads(object)dd_get_chimeric_reads(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_chimeric_reads(object)
object |
A |
GInteractions object from the chimeric_reads slot.
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
chimeric_reads_stats SlotRetrieves the value of the chimeric_reads_stats slot in a DuplexDiscovererResults object.
dd_get_chimeric_reads_stats(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_chimeric_reads_stats(object)dd_get_chimeric_reads_stats(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_chimeric_reads_stats(object)
object |
A |
tibble from the chimeric_reads_stats slot.
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
duplex_groups slotRetrieves the value of the duplex_groups slot in a DuplexDiscovererResults object.
dd_get_duplex_groups(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_duplex_groups(object)dd_get_duplex_groups(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_duplex_groups(object)
object |
A |
GInteractions object from the duplex_groups slot.
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
reads_classes SlotRetrieves the value of the reads_classes slot in a DuplexDiscovererResults object.
dd_get_reads_classes(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_reads_classes(object)dd_get_reads_classes(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_reads_classes(object)
object |
A |
tibble from the reads_classes slot.
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
run_stats SlotRetrieves the value of the run_stats slot in a DuplexDiscovererResults object.
dd_get_run_stats(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_run_stats(object)dd_get_run_stats(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_run_stats(object)
object |
A |
tibble from the run_stats slot.
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
DuplexDiscovereR is a package for analysing data from RNA cross-linking and proximity ligation protocols such as SPLASH, PARIS, LIGR-seq and others, which provide information about intra-molecular RNA-RNA interactions through chimeric RNA-seq reads. Chimerically aligned fragments in these experiments correspond to the base-paired stretches (RNA duplexes) of RNA molecules . DuplexDiscovereR takes input in the form of chimericly or split -aligned reads, It implements procedures for alignment classification, filtering and efficient clustering of individual chimeric reads into duplex groups (DGs). Once DGs are found, RNA duplex formation and their hybridization energies are predicted. Additional metrics, such as p-values or mean DG alignment scores, can be calculated to rank and analyse the final set of RNA duplexes. Data from multiple experiments or replicates can be processed separately and further compared to check the reproducibility of the experimental method.
DuplexDiscovereR
Egor Semenchenko
A helper S4 class to store the results of the full DuplexDiscovereR analysis. This class contains the following output:
duplex_groups: clustered duplex groups.
chimeric_reads: individual two-regions chimeric reads.
Contains both clustered and unclustered reads.
Clustered reads are linked to the duplex groups though 'dg_id' field in metadata
reads_classes: dataframe parallel to the the input containing classification result and detected mapping type for each entry in the input
chimeric_reads_stats: dataframe containing read type classification statistics
run_stats: data frame containing statistics about the time and memory used by the pipeline
DuplexDiscovererResults( duplex_groups, chimeric_reads, reads_classes, chimeric_reads_stats, run_stats )DuplexDiscovererResults( duplex_groups, chimeric_reads, reads_classes, chimeric_reads_stats, run_stats )
duplex_groups |
GInteractions object with duplex groups |
chimeric_reads |
GInteractions object with chimeric reads |
reads_classes |
|
chimeric_reads_stats |
|
run_stats |
|
Each output type has a corresponding accessor:
A DuplexDiscovererResults object.
duplex_groupsGInteractions object with duplex groups
chimeric_readsGInteractions object with chimeric reads
reads_classestibble (tbl_df) with read classification data.
chimeric_reads_statstibble (tbl_df) read type statistics.
run_statstibble (tbl_df) runtime and memory info
dd_get_duplex_groups() ,
dd_get_chimeric_reads() ,
dd_get_reads_classes() ,
dd_get_chimeric_reads_stats() ,
dd_get_run_stats()
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
Inherits the Gviz::AnnotationTrack, plots interaction ranges as boxes.
Arguments from Gviz::AnnotationTrack, as stacking which set boxes
layout are accepted. Parent aesthetics for labels are overwritten with Display
parameters of this class.
Accepts GInteractions object to plot and GRanges to define plot region
Duplexes which can be displayed on the plot range are connected with arcs. Duplexes which are partially outside of the range are displayed without arcs. Labeles and appearance can be controlled with display parameters
gi |
An |
gr_region |
|
from |
Integer start coordinate of subset region. Used if |
to |
Integer end coordinate of subset region. Used if |
chromosome |
Chromosome of subset region. Used if |
strand |
Used if |
fill.column |
used for fill. Default is "" (empty) and triggers IGV color pallete. Display parameters
|
library(InteractionSet) library(Gviz) # generate input anchor1 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(100, 600, 1100, 1600, 2100, 150, 400), end = c(200, 700, 1200, 1700, 2200, 250, 500) ), strand = "+" ) anchor2 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(300, 800, 1300, 1800, 2300, 1500, 1700), end = c(400, 900, 1400, 1900, 2400, 1600, 1800) ), strand = "+" ) interactions <- GInteractions(anchor1, anchor2, mode = "strict") # define plotting range gr_region <- range(anchor1, anchor2) interactions$anno_A <- sample(LETTERS, length(interactions)) interactions$anno_B <- interactions$anno_A a <- DuplexTrack(interactions, gr_region = gr_region, stacking = "dense") plotTracks(a, stacking = "dense") plotTracks(a, stacking = "squish", annotation.column1 = "anno_A") # add interactions which are not fully in plot range: outside the range or on different chromosome() # one left (A) interaction arm outside of the plot, other on different chromosome new_anchor1 <- GRanges( seqnames = c("chr1", "chr2"), ranges = IRanges( start = c(10, 600), end = c(90, 700) ), strand = "+" ) new_anchor2 <- GRanges( seqnames = c("chr1", "chr1"), ranges = IRanges( start = c(1500, 1000), end = c(1600, 1200) ), strand = "+" ) new_interactions <- GInteractions(new_anchor1, new_anchor2) new_interactions$anno_A <- c("A.out", "A.out_chr") new_interactions$anno_B <- c("B.in", "B.in") all_interactions <- c(interactions, new_interactions) b <- DuplexDiscovereR::DuplexTrack(all_interactions, gr_region = gr_region, annotation.column1 = "anno_A", annotation.column2 = "anno_B" ) plotTracks(b) # to customize plot, one can call, to see options DuplexDiscovereR::availableDisplayPars(b)library(InteractionSet) library(Gviz) # generate input anchor1 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(100, 600, 1100, 1600, 2100, 150, 400), end = c(200, 700, 1200, 1700, 2200, 250, 500) ), strand = "+" ) anchor2 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(300, 800, 1300, 1800, 2300, 1500, 1700), end = c(400, 900, 1400, 1900, 2400, 1600, 1800) ), strand = "+" ) interactions <- GInteractions(anchor1, anchor2, mode = "strict") # define plotting range gr_region <- range(anchor1, anchor2) interactions$anno_A <- sample(LETTERS, length(interactions)) interactions$anno_B <- interactions$anno_A a <- DuplexTrack(interactions, gr_region = gr_region, stacking = "dense") plotTracks(a, stacking = "dense") plotTracks(a, stacking = "squish", annotation.column1 = "anno_A") # add interactions which are not fully in plot range: outside the range or on different chromosome() # one left (A) interaction arm outside of the plot, other on different chromosome new_anchor1 <- GRanges( seqnames = c("chr1", "chr2"), ranges = IRanges( start = c(10, 600), end = c(90, 700) ), strand = "+" ) new_anchor2 <- GRanges( seqnames = c("chr1", "chr1"), ranges = IRanges( start = c(1500, 1000), end = c(1600, 1200) ), strand = "+" ) new_interactions <- GInteractions(new_anchor1, new_anchor2) new_interactions$anno_A <- c("A.out", "A.out_chr") new_interactions$anno_B <- c("B.in", "B.in") all_interactions <- c(interactions, new_interactions) b <- DuplexDiscovereR::DuplexTrack(all_interactions, gr_region = gr_region, annotation.column1 = "anno_A", annotation.column2 = "anno_B" ) plotTracks(b) # to customize plot, one can call, to see options DuplexDiscovereR::availableDisplayPars(b)
Takes CIGAR operands i.e M,N,S and sums the associated blocks length It is vectorized. i.e supports vector with CIGAR strings
get_char_count_cigar(strings, s)get_char_count_cigar(strings, s)
strings |
CIGAR string vector |
s |
CIGAR operands |
vector with length values
# From a vector get_char_count_cigar(c("4S18M22S", "25S26M"), "S") get_char_count_cigar(c("18M22S", "20M20S"), "M")# From a vector get_char_count_cigar(c("4S18M22S", "25S26M"), "S") get_char_count_cigar(c("18M22S", "20M20S"), "M")
Chimeric reads which can be represented ans two-arm interactions can be divided into several categories based on the distance between the chimeric fragments and existence of the overlap between these fragments.
getChimericJunctionTypes(gi, normal_gap_threshold = 10)getChimericJunctionTypes(gi, normal_gap_threshold = 10)
gi |
|
normal_gap_threshold |
minimum allowed distance between chimeric arms |
Takes GInteractions object and classifies junctions into following categories
normal chimeric read
normal chimeric read with junction < normal_gap_threshold
arms overlap
arms overlap on the opposite strand
gi object of the same size with the 'junction_type' field added
data("RNADuplexesSampleData") preproc_df <- runDuplexDiscoPreproc(RNADuplexesRawBed, table_type = "bedpe") preproc_gi <- makeGiFromDf(preproc_df) preproc_gi <- getChimericJunctionTypes(preproc_gi) table(preproc_gi$junction_type)data("RNADuplexesSampleData") preproc_df <- runDuplexDiscoPreproc(RNADuplexesRawBed, table_type = "bedpe") preproc_gi <- makeGiFromDf(preproc_df) preproc_gi <- getChimericJunctionTypes(preproc_gi) table(preproc_gi$junction_type)
Calls RNAduplex from ViennaRNA to find base-pairs for every entry in the input, throws a message and system warning if it is not installed
getRNAHybrids(gi, fafile)getRNAHybrids(gi, fafile)
gi |
|
fafile |
path to the .fasta file with genome |
object parallel to input with added energy GC content, dot-format base-pairings and lenghts of RNA hybrids will return the input, if RNAhybrids cannot be run
sequence <- paste0( "AGCUAGCGAUAGCUAGCAUCGUAGCAUCGAUCGUAAGCUAGCUAGCUAGCAUCGAUCGUAGCUAGCAUCGAU", "CGUAGCAUCGUAGCUAGCUAGCUAUGCGAUU" ) # Save the sequence to a temp fasta file fasta_file <- tempfile(fileext = ".fa") chrom <- "test_chrA" writeLines(c(">test_chrA", sequence), con = fasta_file) # Create the GInteraction object # Define start and end positions for the base-pairing regions regions <- data.frame( start1 = c(1, 11, 21, 31, 41), end1 = c(10, 20, 30, 40, 50), start2 = c(91, 81, 71, 61, 51), end2 = c(100, 90, 80, 70, 60) ) # GRanges objects for the anchors anchor1 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start1, end = regions$end1)) anchor2 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start2, end = regions$end2)) interaction <- GInteractions(anchor1, anchor2) # predict hybrids # In case ViennaRNA is installed ## Not run: gi_with_hybrids <- getRNAHybrids(interaction, fasta_file) ## End(Not run)sequence <- paste0( "AGCUAGCGAUAGCUAGCAUCGUAGCAUCGAUCGUAAGCUAGCUAGCUAGCAUCGAUCGUAGCUAGCAUCGAU", "CGUAGCAUCGUAGCUAGCUAGCUAUGCGAUU" ) # Save the sequence to a temp fasta file fasta_file <- tempfile(fileext = ".fa") chrom <- "test_chrA" writeLines(c(">test_chrA", sequence), con = fasta_file) # Create the GInteraction object # Define start and end positions for the base-pairing regions regions <- data.frame( start1 = c(1, 11, 21, 31, 41), end1 = c(10, 20, 30, 40, 50), start2 = c(91, 81, 71, 61, 51), end2 = c(100, 90, 80, 70, 60) ) # GRanges objects for the anchors anchor1 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start1, end = regions$end1)) anchor2 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start2, end = regions$end2)) interaction <- GInteractions(anchor1, anchor2) # predict hybrids # In case ViennaRNA is installed ## Not run: gi_with_hybrids <- getRNAHybrids(interaction, fasta_file) ## End(Not run)
Marks interactions which starts/ends within specified shift from the known splice junctions.
getSpliceJunctionChimeras( gi, sj_gr, sj_tolerance = 20, sj_tolerance_strict = 10 )getSpliceJunctionChimeras( gi, sj_gr, sj_tolerance = 20, sj_tolerance_strict = 10 )
gi |
GInteractions object |
sj_gr |
Granges object with the splice junctions data |
sj_tolerance |
maximum shift between either donor and acceptor splice sites and corresponding chimreic junction coordinates to count chimeric junction as splice junction |
sj_tolerance_strict |
maximum shift between either donor and acceptor splice sites irrespective of the particular splice junction. If both chimeric junction start and end correspond to donor or acceptor of any known junction, it is marked as splice junction. Used to catch novel combinations of known 3' and 5' sites |
gi object with added 'splicejnc' and field Additionally 'splicejnc_donor' 'splicejnc_acceptor' fields are added
data("RNADuplexesSampleData") gi <- getSpliceJunctionChimeras(RNADuplexSampleGI, SampleSpliceJncGR) table(gi$splicejnc) table(gi$splicejnc_acceptor, gi$splicejnc_donor)data("RNADuplexesSampleData") gi <- getSpliceJunctionChimeras(RNADuplexSampleGI, SampleSpliceJncGR) table(gi$splicejnc) table(gi$splicejnc_acceptor, gi$splicejnc_donor)
Converts GInteractions to tibble, preserves metadata
makeDfFromGi(gi)makeDfFromGi(gi)
gi |
GInteracttions |
Following naming conventions is used for region coordinates:
c('chromA','startA','endA','strandA', 'chromB','startB','endB','strandB')
tibble preserving metadata columns
data(RNADuplexesSmallGI) converted_to_df <- makeDfFromGi(SampleSmallGI) converted_to_gi <- makeGiFromDf(converted_to_df)data(RNADuplexesSmallGI) converted_to_df <- makeDfFromGi(SampleSmallGI) converted_to_gi <- makeGiFromDf(converted_to_df)
Converts dataframe-like object to the GInteractions.
makeGiFromDf(df)makeGiFromDf(df)
df |
dataframe-like object. Should be convertable to tibble::tibble() |
arms will be consistent between different objects of same reference
Following columns are looked up in input dataframe to parse region coordinates:
c("chromA','startA','endA','strandA',"chromB",'startB','endB','strandB')
GInteractions(mode='strict') is enforced, to ensure that the order of the regions
Extra columns are stored as metadata fields
GInteractions(mode='strict')
# load example GInteractions data(RNADuplexesSmallGI) converted_to_df <- makeDfFromGi(SampleSmallGI) converted_to_gi <- makeGiFromDf(converted_to_df)# load example GInteractions data(RNADuplexesSmallGI) converted_to_df <- makeDfFromGi(SampleSmallGI) converted_to_gi <- makeGiFromDf(converted_to_df)
Calculates alignment coordinates and returns reads with categories
preproc_chim_junction_out_se(dt, keep_all_columns = FALSE)preproc_chim_junction_out_se(dt, keep_all_columns = FALSE)
dt |
Chimeric.out.junction with the correct column names |
keep_all_columns |
|
#'
multi-mapped read
more than one junction (more than two 'N' in CIGAR string)
Artifacts. I.e alignments for both arms are continious, but with 'backward' chimeric junction was wrongly put
tibble with annotated reads
File generated by mapping with STAR using --quantMode GeneCounts
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR") for details on the pre-processing and sub-setting the
data(RNADuplexesSampleData)data(RNADuplexesSampleData)
An object of class spec_tbl_df (inherits from tbl_df, tbl, data.frame) with 1445 rows and 2 columns.
tibble with columns of Chimeric.junction.out
A Chimeric.out.Junction file with a subset of chr 22 Chimeric reads detected by SPLASH protocol in Human embryonic stem cells.
data(RNADuplexesSampleData)data(RNADuplexesSampleData)
An object of class spec_tbl_df (inherits from tbl_df, tbl, data.frame) with 2040 rows and 10 columns.
tibble with columns of bedpe format
SequenceReadArcive
Reads were aligned with STAR and filtered to contain only reads
which could be represented as 2-arm chimeric alignments. Converted to the
bedpe
format
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR") for details on the pre-processing and sub-setting the
data
A Chimeric.out.Junction file with a subset of chr 22 Chimeric reads detected by SPLASH protocol in Human embryonic stem cells.
data(RNADuplexesSampleData)data(RNADuplexesSampleData)
An object of class tbl_df (inherits from tbl, data.frame) with 5000 rows and 21 columns.
tibble with columns of Chimeric.junction.out
SequenceReadArcive
Reads were aligned with STAR
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR") for details on the pre-processing and sub-setting the
data
GInteractions read-level object containing processed reads,annotated with duplex group
ids, read types gene names and p-values
data(RNADuplexesSampleData)data(RNADuplexesSampleData)
An object of class StrictGInteractions of length 2090.
GInteractions with
n_reads_dg : number of reads in the duplex group (DG)
duplex_id : temporary id for RNA duplexes which could be found before
clustering (duplicated or shifted by couple of nt )
dg_id :id of the duplex group
score : median alignment score in duplex group
other columns inherited from the STAR Chimeric.out.Junction
SequenceReadArcive
Reads were aligned with STAR and duplex groups were identified
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR") for details on the data generation proccedure.
GInteractions duplex group -level object containing detected duplex groups,
annotated with duplex group ids, gene_names and p-values
data(RNADuplexesSampleData)data(RNADuplexesSampleData)
An object of class StrictGInteractions of length 79.
GInteractions with
n_reads : number of reads in the duplex group (DG)
dg_id :id of the duplex group
p_val : BH adjusted p-value of testing to reject hypothesis of DG arising
from random ligation
score : median alignment score in duplex group
other columns with .A and .B annotating to which genes either arm of the
DG maps
SequenceReadArcive
Reads were aligned with STAR and duplex groups were identified
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR") for details on the data generation procedure.
GInteractions read-level object containing two-arm chimeric reads extracted from
mapping output and which can be represented in the GInteraction object
data(RNADuplexesSampleData)data(RNADuplexesSampleData)
An object of class StrictGInteractions of length 2090.
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR") for details on the data generation proccedure.
GInteractions with
readname : read name
map_type : type of the mapped read (2arm by design of pre-filtering)
junction_type : if read jucntion is too short, or it not a 'true' ligated
reads because of the jucntoin coincides with splice junction
cigar_aln* columns inherited from the STAR Chimeric.out.Junction output
Imports dataframe with reads (.bedpe or Chimeric.out.junction ) or GInteractions
object.
Checks column names or tries to quess them if not provided.
Adds necessary annotation depending on the input type,
For STAR input, calculates length of the alignments
and marks unique 2-arm alignments.
For the .bedpe or GInteractions input, all entries are already represented as reads
with two different aligned parts (2-arm), so only check for unique readname
is performed.
runDuplexDiscoPreproc( data, table_type, library_type = "SE", keep_metadata = TRUE, return_gi = FALSE, min_arm_len = 15 )runDuplexDiscoPreproc( data, table_type, library_type = "SE", keep_metadata = TRUE, return_gi = FALSE, min_arm_len = 15 )
data |
Either dataframe-like object: Chimeric.out.junction from STAR or
.bedpe - formatted or |
table_type |
in |
library_type |
|
keep_metadata |
|
return_gi |
if the return object should be |
min_arm_len |
minimum allowed length of the alignment arm. Read will be dropped if either arm is shorter |
If not existed, adds fields required for the downstream steps: 'readname', 'map_type', 'score', 'n_reads'. 'map_type' field determines the type of the chimeric read:
multi-mapped read
more than one junction (more than two 'N' in CIGAR string)
Artifacts or possibly unaccounted types. I.e alignments for both arms are continuous, but with 'backward' chimeric junction was wrongly introduced in the mapping
tibble with new metadata fields OR GInteractions if return_gi is
set to TRUE
# load data data(RNADuplexesSampleData) # with bedpe input preproc_reads <- runDuplexDiscoPreproc(RNADuplexesRawBed, table_type = "bedpe") # with STAR input preproc_reads_star <- runDuplexDiscoPreproc(RNADuplexesRawChimSTAR, table_type = "STAR", keep_metadata = FALSE )# load data data(RNADuplexesSampleData) # with bedpe input preproc_reads <- runDuplexDiscoPreproc(RNADuplexesRawBed, table_type = "bedpe") # with STAR input preproc_reads_star <- runDuplexDiscoPreproc(RNADuplexesRawChimSTAR, table_type = "STAR", keep_metadata = FALSE )
Generates GInteractions object with duplex groups from the STAR Chimeric.out.junction or bedpe file. Classifies reads, annotates reads by overlap with the gene or transcript features, calculates p-values and hybridization energies. Additionally, returns mappings from duplex groupd back to genes.
runDuplexDiscoverer( data, table_type = "", junctions_gr = NULL, anno_gr = NULL, anno_gr_keys = c("gene_id", "gene_name", "gene_type"), fafile = NULL, df_counts = NULL, sample_name = "sample", lib_type = "SE", min_junction_len = 5, max_gap = 50, min_arm_ratio = 0.1, min_overlap = 10, max_sj_shift = 10, gap_collapse_similar = 2, collapse_n_inter = 5, trim_alignments = FALSE, trim_length = 40, min_arm_len = 9, compute_p_values = TRUE, compute_odds_test_p_values = FALSE )runDuplexDiscoverer( data, table_type = "", junctions_gr = NULL, anno_gr = NULL, anno_gr_keys = c("gene_id", "gene_name", "gene_type"), fafile = NULL, df_counts = NULL, sample_name = "sample", lib_type = "SE", min_junction_len = 5, max_gap = 50, min_arm_ratio = 0.1, min_overlap = 10, max_sj_shift = 10, gap_collapse_similar = 2, collapse_n_inter = 5, trim_alignments = FALSE, trim_length = 40, min_arm_len = 9, compute_p_values = TRUE, compute_odds_test_p_values = FALSE )
data |
dataframe-like object with the split reads. Output of Chimeric.out.junction or dataframe with fileds defined by bedpe format:
c("chromA","startA",'endA',"chromB",'startB','endB','readname','flag','strandA','strandB', ... )
Alternatively, |
table_type |
one in c("STAR","bedpe") Defines the type of the input dataframe.
ignored if input data is |
junctions_gr |
GRanges object with the splice junction coordinates |
anno_gr |
GRanges object to use for the annotation of the interactions. Optional |
anno_gr_keys |
c() vector with names of metadata fields in anno_gr which will be used for the annotation. Argument passed to
|
fafile |
path to the genome .fasta file. Used to calculate hybridization energy with RNADuplex. Sequence names should correspond to the sequences from which the mapping index was created. Optional |
df_counts |
A two- column dataframe with counts. Counts are used for p-value calculation. The first column should match the 'gene_id' feature in anno_gr. The second column is the respective count. Optional |
sample_name |
A name of the sample, used for assembling the analysis statistics dataframe |
lib_type |
one in c('SE','PE'). Type of the seqeuncing library. Default is 'SE' |
min_junction_len |
a minimum allowed distance between chimeric arms for the read input.
Reads with the junction closer than |
max_gap |
Parameter for read clustering. Minimum required shift between start and end coordinates of arms for pair of overlapping chimeric reads.
If the shift is longer than |
min_arm_ratio |
Parameter for read clustering.
If the overlap-to-span ratio for either arm (A or B) for pair of chimeric reads is less than |
min_overlap |
Parameter for read clustering. Minimum required overlap to for either arm (A or B) for pair of chimeric reads. |
max_sj_shift |
Maximum shift between either donor and acceptor splice sites and chimeric junction coordinates to count chimeric junction as splice junction |
gap_collapse_similar |
Parameter for read clustering (iterative step). Analogous to the max_gap, but applied |
collapse_n_inter |
Parameter for read clustering (iterative step). Number of iterations to repeat step of collapsing of the highly similar chimeric reads. Increasing this from i.e 0 to 5 reduces clustering time and memory for the libraries with many overlapping reads. |
trim_alignments |
TRUE or FALSE. Whether to trim arms alignments to 'trim_length' nucleotide around chimeric junction |
trim_length |
target size of trimmed alignment |
min_arm_len |
minimum allowed length of the alignment arm. Read will be dropped if either arm is shorter |
compute_p_values |
TRUE or FALSE. whether to calcualte random ligation test with binominal ligation model |
compute_odds_test_p_values |
TRUE or FALSE. Whether to calcualte odds-ratio random ligation test. |
This is a main function to do the initial discovery of the RNA duplexes after the chimeric read mapping. It wraps following procedures:
Classifies the input reads by the mapping type. Keeps 2-arm chimeric reads for downstream analysis
Compares 2arm duplex reads against provided splice junctions
Classifies 2arm duplexes into spurious self-overlapping, splice junction categoris
Performs clustering of the remaining reads into duplex groups
Collapses identically mapped reads
Collapses closely located reads, almost identical reads
Finds duplex groups throughout whole data set
Annotates duplex groups with genomic features if annotation is provided
Calculates p-values if gene counts and annotation are provided
Calculates hybridization energies if path to the .fasta file is provided
a list with the following keys
duplex_groupsGInteractions object with chimeric reads clustered duplex groups
chimeric_readsGInteractions object with non-collapsed chimeric reads
reads_classestbl_df dataframe parallel to the the input dataframe, annotated with read categories and duplex groups
chimeric_reads_statstbl_df dataframe containing read type classification statistics
run_statstbl_df dataframe with the time and memory info about the run
library(DuplexDiscovereR) # load data data("RNADuplexesSampleData") result <- runDuplexDiscoverer( data = RNADuplexesRawChimSTAR, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, df_counts = RNADuplexesGeneCounts, sample_name = "test clustering", fafile = NULL, collapse_n_inter = 3, lib_type = "SE", table_type = "STAR" ) # see results object print(result) # duplex groups dd_get_duplex_groups(result) # individual chimeric reads dd_get_chimeric_reads(result) # counts of detected read tyoes dd_get_chimeric_reads_stats(result)library(DuplexDiscovereR) # load data data("RNADuplexesSampleData") result <- runDuplexDiscoverer( data = RNADuplexesRawChimSTAR, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, df_counts = RNADuplexesGeneCounts, sample_name = "test clustering", fafile = NULL, collapse_n_inter = 3, lib_type = "SE", table_type = "STAR" ) # see results object print(result) # duplex groups dd_get_duplex_groups(result) # individual chimeric reads dd_get_chimeric_reads(result) # counts of detected read tyoes dd_get_chimeric_reads_stats(result)
Granges containing gene coordinates of human chromosome 22 obtained from
GENCODEv44 annotaion
data(RNADuplexesSampleData)data(RNADuplexesSampleData)
An object of class GRanges of length 1445.
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR") for details
statdatd GENCODE gtf fields
GInteractions object containing two-arm chimeric reads extracted from
mapping output and which can be represented in the GInteraction object
and subset to chr22: 23877144-45562960 '*'
data(RNADuplexesSmallGI)data(RNADuplexesSmallGI)
An object of class StrictGInteractions of length 14.
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR") for details on the data generation procedure.
Granges containing coordinates of splice junctions human chromosome 22 obtained from
GENCODEv44 annotaion
data(RNADuplexesSampleData)data(RNADuplexesSampleData)
An object of class GRanges of length 8465.
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR") for details
statdatd GENCODE gtf fields
This method provides a summary of the DuplexDiscovererResults object.
It prints chimeric_reads_stats followed by the run_stats.
## S4 method for signature 'DuplexDiscovererResults' show(object)## S4 method for signature 'DuplexDiscovererResults' show(object)
object |
A |
None. Prints a formatted summary.
Show method for DuplexTrack
## S4 method for signature 'DuplexTrack' show(object)## S4 method for signature 'DuplexTrack' show(object)
object |
DuplexTrack. |
class representation
library(InteractionSet) anchor1 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(100, 600, 1100, 1600, 2100), end = c(200, 700, 1200, 1700, 2200) ), strand = "+" ) anchor2 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(300, 800, 1300, 1800, 2300), end = c(400, 900, 1400, 1900, 2400) ), strand = "+" ) interactions <- GInteractions(anchor1, anchor2, mode = "strict") gr_region <- range(anchor1, anchor2) a <- DuplexTrack(interactions, gr_region = gr_region, stacking = "dense") show(a)library(InteractionSet) anchor1 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(100, 600, 1100, 1600, 2100), end = c(200, 700, 1200, 1700, 2200) ), strand = "+" ) anchor2 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(300, 800, 1300, 1800, 2300), end = c(400, 900, 1400, 1900, 2400) ), strand = "+" ) interactions <- GInteractions(anchor1, anchor2, mode = "strict") gr_region <- range(anchor1, anchor2) a <- DuplexTrack(interactions, gr_region = gr_region, stacking = "dense") show(a)
Trim alignements to contain only 'extract len' nucleotides adajcent to the chimeric junction
trimAroundJunction(dt, extract_len = 30)trimAroundJunction(dt, extract_len = 30)
dt |
table with the |
extract_len |
the length of the seqeunce to trim |
In case of the long alignemtns, it may be necessary trim chimeric alignments to identify RNA duplex. If 'extract_len' is longer than the read alignemnt length, then no trimming is performed
dataframe with the trimmed alignments
data("RNADuplexesSampleData") dt_preproc <- runDuplexDiscoPreproc(RNADuplexesRawChimSTAR, table_type = "STAR", library_type = "SE" ) trimAroundJunction(dt_preproc, 40)data("RNADuplexesSampleData") dt_preproc <- runDuplexDiscoPreproc(RNADuplexesRawChimSTAR, table_type = "STAR", library_type = "SE" ) trimAroundJunction(dt_preproc, 40)
Writes interactions to the sam file for visualization in extrnal browsers. Takes input as GInteractions object containing reads or duplex groups.
writeGiToSAMfile( gi_coords, file_out, distance_chim_junction = 10000, read_name_column = "readname", id_column = "dg_id", genome = "", sample_name = "noname_sample" )writeGiToSAMfile( gi_coords, file_out, distance_chim_junction = 10000, read_name_column = "readname", id_column = "dg_id", genome = "", sample_name = "noname_sample" )
gi_coords |
input Ginteraction object |
file_out |
path to write output file |
distance_chim_junction |
maximum distance between input duplex groups/reads, which will be represented as the single-line in .sam file. Junction will be output as N- gap. For the interactions with longer distances, chimeric junction will be represented as MR:Z:i tag |
read_name_column |
character field, pointing out to read names. Read names are generated automatically if not provided. |
id_column |
character name of the field containing integer duplex group ids. NA are replaced with zeros |
genome |
character. Genome version. Required for the retrieval of sequence lengths for sam file header- SQ and SN tags. For convenience, hg38 and hg19 chromosome lengths will be assigned automatically. If the value is not in c('hg38','hg19'), seqlengths will be looked for be in attribute in seqlengths() of regions(gi_coords) |
sample_name |
name to use in RG SAM tag in header |
no object is returned
# Load test data data("RNADuplexesSampleData") # if the input is read-based, it should have integer duplex group ids # here, we have 2090 reads length(RNADuplexSampleGI) # among them 300 reads does not belong to any DG # missing ids will be converted to 0 table(is.na(RNADuplexSampleGI$dg_id)) tmpf <- tempfile(".sam") writeGiToSAMfile( gi_coords = RNADuplexSampleGI, id_column = "dg_id", file_out = tmpf, distance_chim_junction = 1e5, genome = "hg38" )# Load test data data("RNADuplexesSampleData") # if the input is read-based, it should have integer duplex group ids # here, we have 2090 reads length(RNADuplexSampleGI) # among them 300 reads does not belong to any DG # missing ids will be converted to 0 table(is.na(RNADuplexSampleGI$dg_id)) tmpf <- tempfile(".sam") writeGiToSAMfile( gi_coords = RNADuplexSampleGI, id_column = "dg_id", file_out = tmpf, distance_chim_junction = 1e5, genome = "hg38" )