Title: | Analysis of the data from RNA duplex probing experiments |
---|---|
Description: | DuplexDiscovereR is a package designed for analyzing data from RNA cross-linking and proximity ligation protocols such as SPLASH, PARIS, LIGR-seq, and others. DuplexDiscovereR accepts input in the form of chimerically or split-aligned reads. It includes procedures for alignment classification, filtering, and efficient clustering of individual chimeric reads into duplex groups (DGs). Once DGs are identified, the package predicts RNA duplex formation and their hybridization energies. Additional metrics, such as p-values for random ligation hypothesis or mean DG alignment scores, can be calculated to rank final set of RNA duplexes. Data from multiple experiments or replicates can be processed separately and further compared to check the reproducibility of the experimental method. |
Authors: | Egor Semenchenko [aut, cre, cph] , Volodymyr Tsybulskyi [ctb] , Irmtraud M. Meyer [aut, cph] |
Maintainer: | Egor Semenchenko <[email protected]> |
License: | GPL-3 |
Version: | 1.1.1 |
Built: | 2025-01-10 03:03:28 UTC |
Source: | https://github.com/bioc/DuplexDiscovereR |
Check if there are a temporary duplex records with duplex_id
, which consist of more than one read n_reads > 1
, but
does not have assigned any dg_id
as the duplex group (DG) index. Creates new dg_id
if n_reads > 1
.addDGidsForTmpDGs(gi_input)
.addDGidsForTmpDGs(gi_input)
gi_input |
GInteractions with the |
Meant to be used in the situations when previous collapsing steps merged two or more reads to the temporary DG with duplex_id
, but
global clustering has not identified any overlap between this temporary group and other duplexes, resulting in undefined dg_id
.
This function looks up for these cases and creates new dg_id
for temporary DGs, marking them as the final DGs.
New dg_id
values are unique and allocated sequentially after the maximum value of dg_id
GInteractions object with new dg_id
for rows with n_reads > 1
GInteractions
Merges the count dataframe and interactions metadata by id_col
If key is not found, in metadata throws error
.addGeneCounts(gi, df_counts, id_col = "gene_id")
.addGeneCounts(gi, df_counts, id_col = "gene_id")
gi |
|
df_counts |
dataframe with read counts |
id_col |
key to use in merge |
GInteractions
with added counts
Overlays RNA duplexes with GRanges
annotation object.
annotateGI( gi, anno_gr, keys = c("gene_name", "gene_type", "gene_id"), save_ambig = TRUE )
annotateGI( gi, anno_gr, keys = c("gene_name", "gene_type", "gene_id"), save_ambig = TRUE )
gi |
|
anno_gr |
|
keys |
names of the features to use for annotation. |
save_ambig |
in case RNA duplex overlaps multiple features of the first key,
mark the existense of ambiguous annotation in the fields
|
For each annotation feature in keys
, i.e if keys=c(keyname1),
then <keyname1>.A
, <keyname1>.B
annotation fields will be created, containing the
names of overlapping features
If no overlap is found for the feature, then filed will have NA
GInteractions
object with new fields
data("RNADuplexesSampleData") annotateGI(gi = RNADuplexSampleDGs, anno_gr = SampleGeneAnnoGR)
data("RNADuplexesSampleData") annotateGI(gi = RNADuplexSampleDGs, anno_gr = SampleGeneAnnoGR)
DuplexTrack
objectDuplexTrack
inherits from [Gviz::Annotaiontrack()]
and its Gviz parents.
Most likely, user doesn't need all dioplay pars for the parents, so only
parameters relevant to the DuplexTrack
are returned by default.
availableDisplayPars(class)
availableDisplayPars(class)
class |
|
list of the default display parameters.
library(InteractionSet) anchor1 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(100, 600, 1100, 1600, 2100), end = c(200, 700, 1200, 1700, 2200) ), strand = "+" ) anchor2 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(300, 800, 1300, 1800, 2300), end = c(400, 900, 1400, 1900, 2400) ), strand = "+" ) interactions <- GInteractions(anchor1, anchor2, mode = "strict") gr_region <- range(anchor1, anchor2) a <- DuplexTrack(interactions, gr_region = gr_region, stacking = "dense") availableDisplayPars("DuplexTrack") DuplexDiscovereR::availableDisplayPars(a)
library(InteractionSet) anchor1 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(100, 600, 1100, 1600, 2100), end = c(200, 700, 1200, 1700, 2200) ), strand = "+" ) anchor2 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(300, 800, 1300, 1800, 2300), end = c(400, 900, 1400, 1900, 2400) ), strand = "+" ) interactions <- GInteractions(anchor1, anchor2, mode = "strict") gr_region <- range(anchor1, anchor2) a <- DuplexTrack(interactions, gr_region = gr_region, stacking = "dense") availableDisplayPars("DuplexTrack") DuplexDiscovereR::availableDisplayPars(a)
Calculates p-values by applying Fisher test to each gene/transcript pair Uses BH correction, outputs duplex abundance relative to the per - gene/transcript count, and counts of other RNA duplexes formed by either or none gene/transcript in this pair.
calculateLigationPvalues(gi, df_counts, id_col = "gene_id")
calculateLigationPvalues(gi, df_counts, id_col = "gene_id")
gi |
|
df_counts |
|
id_col |
the prefix for gene/transcript metadata id fields in input gi. Two fields of <id_col>.A and <id.col>.B are expected. Otherwise throws error. |
H0: RNA duplex not existing and reported due to the random ligation of fragments H1: RNA duplex is true and formed because of existing the RNA-RNA interaction
The probability of random ligation is modeled as \(P(a, b)\)
given by the following equation:
The probability is defined as:
where The probability (P(a)) (same as for P(b) ) is calculated as:
p-value calculated by comparing observed duplex abundance to the expected as the are under the curve distribution to the right of the observed. P(a, b) is normalized to sum up to one.
GInteractions
object with new fields
data("RNADuplexesSampleData") gi = calculateLigationPvalues(RNADuplexSampleDGs,df_counts = RNADuplexesGeneCounts) hist(gi$p.adj,breaks = 20)
data("RNADuplexesSampleData") gi = calculateLigationPvalues(RNADuplexSampleDGs,df_counts = RNADuplexesGeneCounts) hist(gi$p.adj,breaks = 20)
Wraps two procedures for different types of classification for read alignment:
test if chimeric junction map to two non-overlapped regions or shorter than defined minimum distance
test if chimeric junction is also a splice junction
classifyTwoArmChimeras( gi, min_junction_len = 4, junctions_gr, max_sj_shift = 4 )
classifyTwoArmChimeras( gi, min_junction_len = 4, junctions_gr, max_sj_shift = 4 )
gi |
|
min_junction_len |
minimum allowed distance between two chimeric arms |
junctions_gr |
|
max_sj_shift |
maximum shift between either donor and acceptor splice sites and corresponding chimreic junction coordinates to count chimeric junction as splice junction |
Calls detection of the chimeric junction type, annotates short junctions on same chromosome an strand as 'short'. Compares chimeric junctions with splice junctions. Adds results as the new metadata fields parallel to the input.
GInteractions
object object of the same size with new columns:
filled with 0 or 1
factor for the junction types
DuplexDiscovereR::getChimericJunctionTypes(), DuplexDiscovereR::getSpliceJunctionChimeras()
data("RNADuplexesSampleData") head(RNADuplexSampleGI) # remove all metadata mcols(RNADuplexSampleGI) <- NULL gi <- classifyTwoArmChimeras(RNADuplexSampleGI, min_junction_len = 5, junctions_gr = SampleSpliceJncGR, max_sj_shift = 10 ) table(gi$splicejnc) table(gi$junction_type)
data("RNADuplexesSampleData") head(RNADuplexSampleGI) # remove all metadata mcols(RNADuplexSampleGI) <- NULL gi <- classifyTwoArmChimeras(RNADuplexSampleGI, min_junction_len = 5, junctions_gr = SampleSpliceJncGR, max_sj_shift = 10 ) table(gi$splicejnc) table(gi$junction_type)
GInteractions
objectMain method to find duplex groups from the individual interactions
clusterDuplexGroups( gi, graphdf = NULL, maxgap = 40, minoverlap = 10, id_column = "duplex_id", weight_column = "weight", fast_greedy = FALSE, decompose = FALSE, id_columns_grapdf = paste(id_column, c(1, 2), sep = "."), min_arm_ratio = 0.3, dump_graph = FALSE, dump_path = "" )
clusterDuplexGroups( gi, graphdf = NULL, maxgap = 40, minoverlap = 10, id_column = "duplex_id", weight_column = "weight", fast_greedy = FALSE, decompose = FALSE, id_columns_grapdf = paste(id_column, c(1, 2), sep = "."), min_arm_ratio = 0.3, dump_graph = FALSE, dump_path = "" )
gi |
|
graphdf |
Optional. Dataframe representing connection edges between entries in gi If not provided, graphdf is created inside the function |
maxgap |
For graph creation only. Max shift between arms starts and ends for pair of overlapping reads |
minoverlap |
For graph creation only. Minimum required overlap between either arm for pair of overlapping reads Other optional arguments, which are not relevant, unless user want to modify clustering weights or modify clustering in some other way |
id_column |
Optional. Column name in the |
weight_column |
Optional. If graphdf is provided, field to use for weight overlaps |
fast_greedy |
Optional. Run the fast_greedy algorithm instead of Louvain. Can speed up calcualtion for the large graphs. |
decompose |
Decompose graph into separate sub-graphs before clustering. |
id_columns_grapdf |
Column in the graph dataframe, which was used for index |
min_arm_ratio |
For graph creation only. Span-to-overlap ratio threshold. If smaller than this value, then edge is not drawn |
dump_graph |
For debug. Export the graph elements. not used |
dump_path |
For debug. PArt to export the graph elements. not used |
Accepts or creates the connections graphdf dataframe, creates graph with
igraph
package, uses community detection algoritm to call clusters.
New field dg_id
is added to label the clusters (duplex groups).
If no community is found for the read, dg_id
is NA
GInteractions
object with new dg_id
column
data("RNADuplexesSampleData") # run preprocessing and filtering preproc_df <- runDuplexDiscoPreproc(RNADuplexesRawBed, table_type = "bedpe") preproc_gi <- makeGiFromDf(preproc_df) preproc_gi <- classifyTwoArmChimeras(preproc_gi, min_junction_len = 5, junctions_gr = SampleSpliceJncGR, max_sj_shift = 10 ) # collapse duplicates gi <- collapseIdenticalReads(preproc_gi)$gi # run global clustering gi <- clusterDuplexGroups(gi) # check dg_ids table(is.na(gi$dg_id))
data("RNADuplexesSampleData") # run preprocessing and filtering preproc_df <- runDuplexDiscoPreproc(RNADuplexesRawBed, table_type = "bedpe") preproc_gi <- makeGiFromDf(preproc_df) preproc_gi <- classifyTwoArmChimeras(preproc_gi, min_junction_len = 5, junctions_gr = SampleSpliceJncGR, max_sj_shift = 10 ) # collapse duplicates gi <- collapseIdenticalReads(preproc_gi)$gi # run global clustering gi <- clusterDuplexGroups(gi) # check dg_ids table(is.na(gi$dg_id))
Collapse each interaction in the input to the duplex group based on the pre-computed dg_id
collapse_duplex_groups( gi, return_unclustered = FALSE, return_collapsed = TRUE, keep_meta = TRUE )
collapse_duplex_groups( gi, return_unclustered = FALSE, return_collapsed = TRUE, keep_meta = TRUE )
gi |
|
return_unclustered |
add unclustered reads to output |
return_collapsed |
add duplex groups, which were created as temporary with n_reads > 1 but was not clustered to the DG golabally. This parameter is used internally and should be kept default in most situations. |
keep_meta |
whether to keep metadata, which only unclustered reads have, in case of a mixed output |
'dg_id' is used as the identifier for the duplex group
Reads belonging to the same duplex group are collapsed into a single entry
with start and end are set as min() and max() coordinate of the reads in within the duplex group.
The 'score' column is averaged across the duplex group reads is calculated and
put as the 'score' for the collapsed duplex group
Behavior in case 'dg_id' = NA: Option 'return_unclustered
' -
whether unclustered reads with should be added to the output gi
Interaction is not returned in the output. Default.
Interaction is returned in the output, output is mixed duplex groups and individual reads
Internally used argument #'
In case interaction already collapsed and n_read > 1, interaction will not be returned as duplex group
In case interaction has n_read > 1, interaction will be treated as duplex group
GInteractions
object with collapsed duplex groups
# load example of clustered data data("RNADuplexesSampleData") # some reads assigned to DG, some are not table(is.na(RNADuplexSampleGI$dg_id)) # Return only DGs gicollapsed <- collapse_duplex_groups(RNADuplexSampleGI, return_unclustered = FALSE) # Return DGs and unclustered reads as well gimixed <- collapse_duplex_groups(RNADuplexSampleGI, return_unclustered = TRUE) # load small sample GInteractions and process it manually data("RNADuplexesSmallGI") # First, collapse duplicated reads. This adds n_reads and duplex ids ginodup <- collapseIdenticalReads(SampleSmallGI)$gi_collapsed # Second, run clustering, get DG ids ginodup <- clusterDuplexGroups(ginodup) # Return all DGs result in n=3 DGS, one of them formed by # identical duplicated alignments collapse_duplex_groups(ginodup, return_collapsed = TRUE) # Return DGs, but drop duplicated returns n=2 DGs collapse_duplex_groups(ginodup, return_collapsed = FALSE)
# load example of clustered data data("RNADuplexesSampleData") # some reads assigned to DG, some are not table(is.na(RNADuplexSampleGI$dg_id)) # Return only DGs gicollapsed <- collapse_duplex_groups(RNADuplexSampleGI, return_unclustered = FALSE) # Return DGs and unclustered reads as well gimixed <- collapse_duplex_groups(RNADuplexSampleGI, return_unclustered = TRUE) # load small sample GInteractions and process it manually data("RNADuplexesSmallGI") # First, collapse duplicated reads. This adds n_reads and duplex ids ginodup <- collapseIdenticalReads(SampleSmallGI)$gi_collapsed # Second, run clustering, get DG ids ginodup <- clusterDuplexGroups(ginodup) # Return all DGs result in n=3 DGS, one of them formed by # identical duplicated alignments collapse_duplex_groups(ginodup, return_collapsed = TRUE) # Return DGs, but drop duplicated returns n=2 DGs collapse_duplex_groups(ginodup, return_collapsed = FALSE)
Two entries (reads) are considered identical if they share start, end, strand and score vales Identical entries are collapsed into the single one.
collapseIdenticalReads(gi)
collapseIdenticalReads(gi)
gi |
GInteractions(mode='strict') object with chromA, strandA, startA, endA, chromB, strandB, startB, endB, score columns Optionally cigar_alnA, cigar_alnB columns are also considered for collapsing 'read_id' column used as the index in the initial objects. Created, if not exists |
Adds columns to the collapsed object duplex_id (int) unique record id n_reads (int) number of entries collapsed
result_list object with keys ' gi_collapsed': New collapsed GInteraction object ' stats_df': tibble with the mapping of the original entries to the new duplex_id
# load data data("RNADuplexesSmallGI") res_collapse <- collapseIdenticalReads(SampleSmallGI) gi_new <- res_collapse[["gi_collapsed"]] # keeps the mapping of the colapsed object to new read_stats_df <- res_collapse[["stats_df"]]
# load data data("RNADuplexesSmallGI") res_collapse <- collapseIdenticalReads(SampleSmallGI) gi_new <- res_collapse[["gi_collapsed"]] # keeps the mapping of the colapsed object to new read_stats_df <- res_collapse[["stats_df"]]
Function calls clustering algorithm several times and collapses highly similar reads to the temporary duplex groups (DGs).
collapseSimilarChimeras( gi, read_stats_df, maxgap = 5, niter = 2, minoverlap = 10 )
collapseSimilarChimeras( gi, read_stats_df, maxgap = 5, niter = 2, minoverlap = 10 )
gi |
|
read_stats_df |
|
maxgap |
Maximum relative shift between the overlapping read arms |
niter |
Number of times clustering will be called |
minoverlap |
Minimum required overlap between either read arm |
Calling this procedure before global read clustering substantially reduces time required for calling DGs. Collapsed duplex groups are aggregated only from the reads which are shifted by only a few nucleotides from each other. These DGs are temporary until full library clustering is called. To keep track of the mapping of the temprary DGs to the input, dedicated dataframe is returned. The 'duplex_id' column will be added or updated as identifier for the temporary duplex group. The number of reads under single 'duplex_id' is recorded in the 'n_reads' fields
a list with the following keys
GInteractions
object with both collapsed duplex groups
and not-collapsed unchanged reads
tibble
With the mapping from the unique read -
with the the infromation about time and memory reaquired for the function
call
Combines all interaction into single superset by clustering & collapsing. Then compares every input entry with the superset. Overlaps between superset and inputs are recorded in a table as 0/1
compareMultipleInteractions( gi_samples_list, min_ratio = 0.3, minoverlap = 5, maxgap = 50, niter = 3, gi_superset = NULL, anno_gr = NULL )
compareMultipleInteractions( gi_samples_list, min_ratio = 0.3, minoverlap = 5, maxgap = 50, niter = 3, gi_superset = NULL, anno_gr = NULL )
gi_samples_list |
anmes list with the |
min_ratio |
If the overlap-to-span ratio for either arm (A or B) for pair
of chimeric reads is less than |
minoverlap |
Parameter for read clustering to create a superset. Minimum required overlap to for either arm (A or B) for pair of entries. |
maxgap |
Parameter for read clustering. Minimum required shift between start and end coordinates of arms for pair of overlapping entries..
If the shift is longer than |
niter |
Internal parameter for debugging. Number of cluster& collapse iterations to find superset |
gi_superset |
Optional. Superset defining the space (all) of the interactions, against which inputs from the list will be compared. |
anno_gr |
Optional. |
dataframe recodding the overlaps between samples and supeset
# Create test set of RNA interactions chrom <- "chr1" start1 <- c(1, 11, 21, 31, 41, 51, 61, 71, 81, 91) end1 <- start1 + 9 start2 <- c(101, 111, 121, 131, 141, 151, 161, 171, 181, 191) end2 <- start2 + 9 anchor1 <- GRanges(seqnames = chrom, ranges = IRanges(start = start1, end = end1)) anchor2 <- GRanges(seqnames = chrom, ranges = IRanges(start = start2, end = end2)) interaction <- GInteractions(anchor1, anchor2) # Ensure some overlaps n <- length(interaction) group_size <- ceiling(n / 2) group_indices1 <- sort(sample(seq_len(n), group_size)) group_indices2 <- sort(sample(seq_len(n), group_size)) group_indices3 <- sort(sample(seq_len(n), group_size)) # Create separate GInteractions objects for each group group1 <- interaction[group_indices1] group2 <- interaction[group_indices2] group3 <- interaction[group_indices3] # format input and call comparison a <- list("sample1" = group1, "sample2" = group2, "sample3" = group3) res <- compareMultipleInteractions(a) # comparison result head(res$dt_upset) # superset res$gi_all # dataframe for the Upset plot res$dt_upset
# Create test set of RNA interactions chrom <- "chr1" start1 <- c(1, 11, 21, 31, 41, 51, 61, 71, 81, 91) end1 <- start1 + 9 start2 <- c(101, 111, 121, 131, 141, 151, 161, 171, 181, 191) end2 <- start2 + 9 anchor1 <- GRanges(seqnames = chrom, ranges = IRanges(start = start1, end = end1)) anchor2 <- GRanges(seqnames = chrom, ranges = IRanges(start = start2, end = end2)) interaction <- GInteractions(anchor1, anchor2) # Ensure some overlaps n <- length(interaction) group_size <- ceiling(n / 2) group_indices1 <- sort(sample(seq_len(n), group_size)) group_indices2 <- sort(sample(seq_len(n), group_size)) group_indices3 <- sort(sample(seq_len(n), group_size)) # Create separate GInteractions objects for each group group1 <- interaction[group_indices1] group2 <- interaction[group_indices2] group3 <- interaction[group_indices3] # format input and call comparison a <- list("sample1" = group1, "sample2" = group2, "sample3" = group3) res <- compareMultipleInteractions(a) # comparison result head(res$dt_upset) # superset res$gi_all # dataframe for the Upset plot res$dt_upset
GInteractions
Utility function to find overlapping reads in the input and calculate overlap scores. Removes self-hits. Computes overlap/span ratios for each interaction arm. Sum of the scores is recorded in 'weight' field
computeGISelfOverlaps( gi, id_column = "duplex_id", maxgap = 40, minoverlap = 10 )
computeGISelfOverlaps( gi, id_column = "duplex_id", maxgap = 40, minoverlap = 10 )
gi |
input gi object |
id_column |
column which use for using as ids for entries |
maxgap |
parameter for call of |
minoverlap |
parameter for call |
dataframe with indexes of pairwise overlapsin input and columns for span, overlap, ratios of either read arm
data("RNADuplexesSmallGI") computeGISelfOverlaps(SampleSmallGI)
data("RNADuplexesSmallGI") computeGISelfOverlaps(SampleSmallGI)
GInteractions
object to Granges
Creates the 'long' GRanges
by stacking the A and B arms one 'on top' of the other.
Adds id
and group
fields as indicators of original index and interaction
arm (A- left arm, B- right arm)
convert_gi_to_ranges(gi)
convert_gi_to_ranges(gi)
gi |
|
GRanges
twice the length of the input
data("RNADuplexesSmallGI") convert_gi_to_ranges(SampleSmallGI)
data("RNADuplexesSmallGI") convert_gi_to_ranges(SampleSmallGI)
chimeric_reads
SlotRetrieves the value of the chimeric_reads
slot in a DuplexDiscovererResults
object.
dd_get_chimeric_reads(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_chimeric_reads(object)
dd_get_chimeric_reads(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_chimeric_reads(object)
object |
A |
GInteractions object from the chimeric_reads
slot.
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
chimeric_reads_stats
SlotRetrieves the value of the chimeric_reads_stats
slot in a DuplexDiscovererResults
object.
dd_get_chimeric_reads_stats(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_chimeric_reads_stats(object)
dd_get_chimeric_reads_stats(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_chimeric_reads_stats(object)
object |
A |
tibble
from the chimeric_reads_stats
slot.
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
duplex_groups
slotRetrieves the value of the duplex_groups
slot in a DuplexDiscovererResults
object.
dd_get_duplex_groups(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_duplex_groups(object)
dd_get_duplex_groups(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_duplex_groups(object)
object |
A |
GInteractions object from the duplex_groups
slot.
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
reads_classes
SlotRetrieves the value of the reads_classes
slot in a DuplexDiscovererResults
object.
dd_get_reads_classes(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_reads_classes(object)
dd_get_reads_classes(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_reads_classes(object)
object |
A |
tibble
from the reads_classes
slot.
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
run_stats
SlotRetrieves the value of the run_stats
slot in a DuplexDiscovererResults
object.
dd_get_run_stats(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_run_stats(object)
dd_get_run_stats(object) ## S4 method for signature 'DuplexDiscovererResults' dd_get_run_stats(object)
object |
A |
tibble
from the run_stats
slot.
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
DuplexDiscovereR is a package for analysing data from RNA cross-linking and proximity ligation protocols such as SPLASH, PARIS, LIGR-seq and others, which provide information about intra-molecular RNA-RNA interactions through chimeric RNA-seq reads. Chimerically aligned fragments in these experiments correspond to the base-paired stretches (RNA duplexes) of RNA molecules . DuplexDiscovereR takes input in the form of chimericly or split -aligned reads, It implements procedures for alignment classification, filtering and efficient clustering of individual chimeric reads into duplex groups (DGs). Once DGs are found, RNA duplex formation and their hybridization energies are predicted. Additional metrics, such as p-values or mean DG alignment scores, can be calculated to rank and analyse the final set of RNA duplexes. Data from multiple experiments or replicates can be processed separately and further compared to check the reproducibility of the experimental method.
DuplexDiscovereR
Egor Semenchenko
A helper S4 class to store the results of the full DuplexDiscovereR analysis. This class contains the following output:
duplex_groups
: clustered duplex groups.
chimeric_reads
: individual two-regions chimeric reads.
Contains both clustered and unclustered reads.
Clustered reads are linked to the duplex groups though 'dg_id' field in metadata
reads_classes
: dataframe parallel to the the input containing classification result and detected mapping type for each entry in the input
chimeric_reads_stats
: dataframe containing read type classification statistics
run_stats
: data frame containing statistics about the time and memory used by the pipeline
DuplexDiscovererResults( duplex_groups, chimeric_reads, reads_classes, chimeric_reads_stats, run_stats )
DuplexDiscovererResults( duplex_groups, chimeric_reads, reads_classes, chimeric_reads_stats, run_stats )
duplex_groups |
GInteractions object with duplex groups |
chimeric_reads |
GInteractions object with chimeric reads |
reads_classes |
|
chimeric_reads_stats |
|
run_stats |
|
Each output type has a corresponding accessor:
A DuplexDiscovererResults
object.
duplex_groups
GInteractions object with duplex groups
chimeric_reads
GInteractions object with chimeric reads
reads_classes
tibble
(tbl_df) with read classification data.
chimeric_reads_stats
tibble
(tbl_df) read type statistics.
run_stats
tibble
(tbl_df) runtime and memory info
dd_get_duplex_groups()
,
dd_get_chimeric_reads()
,
dd_get_reads_classes()
,
dd_get_chimeric_reads_stats()
,
dd_get_run_stats()
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
# load example input data("RNADuplexesSmallGI") data("RNADuplexesSampleData") # run whole pipeline result <- runDuplexDiscoverer( data = SampleSmallGI, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, sample_name = "run_example", lib_type = "SE", table_type = "STAR" ) # access results show(result) gi_clusters <- dd_get_duplex_groups(result) gi_reads <- dd_get_chimeric_reads(result) df_reads <- dd_get_reads_classes(result) dd_get_reads_classes(result) dd_get_run_stats(result)
Inherits the Gviz::AnnotationTrack
, plots interaction ranges as boxes.
Arguments from Gviz::AnnotationTrack
, as stacking
which set boxes
layout are accepted. Parent aesthetics for labels are overwritten with Display
parameters of this class.
Accepts GInteractions
object to plot and GRanges
to define plot region
Duplexes which can be displayed on the plot range are connected with arcs. Duplexes which are partially outside of the range are displayed without arcs. Labeles and appearance can be controlled with display parameters
gi |
An |
gr_region |
|
from |
Integer start coordinate of subset region. Used if |
to |
Integer end coordinate of subset region. Used if |
chromosome |
Chromosome of subset region. Used if |
strand |
Used if |
fill.column |
used for fill. Default is "" (empty) and triggers IGV color pallete. Display parameters
|
library(InteractionSet) library(Gviz) # generate input anchor1 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(100, 600, 1100, 1600, 2100, 150, 400), end = c(200, 700, 1200, 1700, 2200, 250, 500) ), strand = "+" ) anchor2 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(300, 800, 1300, 1800, 2300, 1500, 1700), end = c(400, 900, 1400, 1900, 2400, 1600, 1800) ), strand = "+" ) interactions <- GInteractions(anchor1, anchor2, mode = "strict") # define plotting range gr_region <- range(anchor1, anchor2) interactions$anno_A <- sample(LETTERS, length(interactions)) interactions$anno_B <- interactions$anno_A a <- DuplexTrack(interactions, gr_region = gr_region, stacking = "dense") plotTracks(a, stacking = "dense") plotTracks(a, stacking = "squish", annotation.column1 = "anno_A") # add interactions which are not fully in plot range: outside the range or on different chromosome() # one left (A) interaction arm outside of the plot, other on different chromosome new_anchor1 <- GRanges( seqnames = c("chr1", "chr2"), ranges = IRanges( start = c(10, 600), end = c(90, 700) ), strand = "+" ) new_anchor2 <- GRanges( seqnames = c("chr1", "chr1"), ranges = IRanges( start = c(1500, 1000), end = c(1600, 1200) ), strand = "+" ) new_interactions <- GInteractions(new_anchor1, new_anchor2) new_interactions$anno_A <- c("A.out", "A.out_chr") new_interactions$anno_B <- c("B.in", "B.in") all_interactions <- c(interactions, new_interactions) b <- DuplexDiscovereR::DuplexTrack(all_interactions, gr_region = gr_region, annotation.column1 = "anno_A", annotation.column2 = "anno_B" ) plotTracks(b) # to customize plot, one can call, to see options DuplexDiscovereR::availableDisplayPars(b)
library(InteractionSet) library(Gviz) # generate input anchor1 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(100, 600, 1100, 1600, 2100, 150, 400), end = c(200, 700, 1200, 1700, 2200, 250, 500) ), strand = "+" ) anchor2 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(300, 800, 1300, 1800, 2300, 1500, 1700), end = c(400, 900, 1400, 1900, 2400, 1600, 1800) ), strand = "+" ) interactions <- GInteractions(anchor1, anchor2, mode = "strict") # define plotting range gr_region <- range(anchor1, anchor2) interactions$anno_A <- sample(LETTERS, length(interactions)) interactions$anno_B <- interactions$anno_A a <- DuplexTrack(interactions, gr_region = gr_region, stacking = "dense") plotTracks(a, stacking = "dense") plotTracks(a, stacking = "squish", annotation.column1 = "anno_A") # add interactions which are not fully in plot range: outside the range or on different chromosome() # one left (A) interaction arm outside of the plot, other on different chromosome new_anchor1 <- GRanges( seqnames = c("chr1", "chr2"), ranges = IRanges( start = c(10, 600), end = c(90, 700) ), strand = "+" ) new_anchor2 <- GRanges( seqnames = c("chr1", "chr1"), ranges = IRanges( start = c(1500, 1000), end = c(1600, 1200) ), strand = "+" ) new_interactions <- GInteractions(new_anchor1, new_anchor2) new_interactions$anno_A <- c("A.out", "A.out_chr") new_interactions$anno_B <- c("B.in", "B.in") all_interactions <- c(interactions, new_interactions) b <- DuplexDiscovereR::DuplexTrack(all_interactions, gr_region = gr_region, annotation.column1 = "anno_A", annotation.column2 = "anno_B" ) plotTracks(b) # to customize plot, one can call, to see options DuplexDiscovereR::availableDisplayPars(b)
Takes CIGAR operands i.e M,N,S and sums the associated blocks length It is vectorized. i.e supports vector with CIGAR strings
get_char_count_cigar(strings, s)
get_char_count_cigar(strings, s)
strings |
CIGAR string vector |
s |
CIGAR operands |
vector with length values
# From a vector get_char_count_cigar(c("4S18M22S", "25S26M"), "S") get_char_count_cigar(c("18M22S", "20M20S"), "M")
# From a vector get_char_count_cigar(c("4S18M22S", "25S26M"), "S") get_char_count_cigar(c("18M22S", "20M20S"), "M")
Chimeric reads which can be represented ans two-arm interactions can be divided into several categories based on the distance between the chimeric fragments and existence of the overlap between these fragments.
getChimericJunctionTypes(gi, normal_gap_threshold = 10)
getChimericJunctionTypes(gi, normal_gap_threshold = 10)
gi |
|
normal_gap_threshold |
minimum allowed distance between chimeric arms |
Takes GInteractions
object and classifies junctions into following categories
normal chimeric read
normal chimeric read with junction < normal_gap_threshold
arms overlap
arms overlap on the opposite strand
gi object of the same size with the 'junction_type' field added
data("RNADuplexesSampleData") preproc_df <- runDuplexDiscoPreproc(RNADuplexesRawBed, table_type = "bedpe") preproc_gi <- makeGiFromDf(preproc_df) preproc_gi <- getChimericJunctionTypes(preproc_gi) table(preproc_gi$junction_type)
data("RNADuplexesSampleData") preproc_df <- runDuplexDiscoPreproc(RNADuplexesRawBed, table_type = "bedpe") preproc_gi <- makeGiFromDf(preproc_df) preproc_gi <- getChimericJunctionTypes(preproc_gi) table(preproc_gi$junction_type)
Calls RNAduplex from ViennaRNA to find base-pairs for every entry in the input, throws a message and system warning if it is not installed
getRNAHybrids(gi, fafile)
getRNAHybrids(gi, fafile)
gi |
|
fafile |
path to the .fasta file with genome |
object parallel to input with added energy GC content, dot-format base-pairings and lenghts of RNA hybrids will return the input, if RNAhybrids cannot be run
sequence <- paste0( "AGCUAGCGAUAGCUAGCAUCGUAGCAUCGAUCGUAAGCUAGCUAGCUAGCAUCGAUCGUAGCUAGCAUCGAU", "CGUAGCAUCGUAGCUAGCUAGCUAUGCGAUU" ) # Save the sequence to a temp fasta file fasta_file <- tempfile(fileext = ".fa") chrom <- "test_chrA" writeLines(c(">test_chrA", sequence), con = fasta_file) # Create the GInteraction object # Define start and end positions for the base-pairing regions regions <- data.frame( start1 = c(1, 11, 21, 31, 41), end1 = c(10, 20, 30, 40, 50), start2 = c(91, 81, 71, 61, 51), end2 = c(100, 90, 80, 70, 60) ) # GRanges objects for the anchors anchor1 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start1, end = regions$end1)) anchor2 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start2, end = regions$end2)) interaction <- GInteractions(anchor1, anchor2) # predict hybrids # In case ViennaRNA is installed ## Not run: gi_with_hybrids <- getRNAHybrids(interaction, fasta_file) ## End(Not run)
sequence <- paste0( "AGCUAGCGAUAGCUAGCAUCGUAGCAUCGAUCGUAAGCUAGCUAGCUAGCAUCGAUCGUAGCUAGCAUCGAU", "CGUAGCAUCGUAGCUAGCUAGCUAUGCGAUU" ) # Save the sequence to a temp fasta file fasta_file <- tempfile(fileext = ".fa") chrom <- "test_chrA" writeLines(c(">test_chrA", sequence), con = fasta_file) # Create the GInteraction object # Define start and end positions for the base-pairing regions regions <- data.frame( start1 = c(1, 11, 21, 31, 41), end1 = c(10, 20, 30, 40, 50), start2 = c(91, 81, 71, 61, 51), end2 = c(100, 90, 80, 70, 60) ) # GRanges objects for the anchors anchor1 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start1, end = regions$end1)) anchor2 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start2, end = regions$end2)) interaction <- GInteractions(anchor1, anchor2) # predict hybrids # In case ViennaRNA is installed ## Not run: gi_with_hybrids <- getRNAHybrids(interaction, fasta_file) ## End(Not run)
Marks interactions which starts/ends within specified shift from the known splice junctions.
getSpliceJunctionChimeras( gi, sj_gr, sj_tolerance = 20, sj_tolerance_strict = 10 )
getSpliceJunctionChimeras( gi, sj_gr, sj_tolerance = 20, sj_tolerance_strict = 10 )
gi |
GInteractions object |
sj_gr |
Granges object with the splice junctions data |
sj_tolerance |
maximum shift between either donor and acceptor splice sites and corresponding chimreic junction coordinates to count chimeric junction as splice junction |
sj_tolerance_strict |
maximum shift between either donor and acceptor splice sites irrespective of the particular splice junction. If both chimeric junction start and end correspond to donor or acceptor of any known junction, it is marked as splice junction. Used to catch novel combinations of known 3' and 5' sites |
gi object with added 'splicejnc' and field Additionally 'splicejnc_donor' 'splicejnc_acceptor' fields are added
data("RNADuplexesSampleData") gi <- getSpliceJunctionChimeras(RNADuplexSampleGI, SampleSpliceJncGR) table(gi$splicejnc) table(gi$splicejnc_acceptor, gi$splicejnc_donor)
data("RNADuplexesSampleData") gi <- getSpliceJunctionChimeras(RNADuplexSampleGI, SampleSpliceJncGR) table(gi$splicejnc) table(gi$splicejnc_acceptor, gi$splicejnc_donor)
Converts GInteractions
to tibble, preserves metadata
makeDfFromGi(gi)
makeDfFromGi(gi)
gi |
GInteracttions |
Following naming conventions is used for region coordinates:
c('chromA','startA','endA','strandA', 'chromB','startB','endB','strandB')
tibble preserving metadata columns
data(RNADuplexesSmallGI) converted_to_df <- makeDfFromGi(SampleSmallGI) converted_to_gi <- makeGiFromDf(converted_to_df)
data(RNADuplexesSmallGI) converted_to_df <- makeDfFromGi(SampleSmallGI) converted_to_gi <- makeGiFromDf(converted_to_df)
Converts dataframe-like object to the GInteractions
.
makeGiFromDf(df)
makeGiFromDf(df)
df |
dataframe-like object. Should be convertable to tibble::tibble() |
arms will be consistent between different objects of same reference
Following columns are looked up in input dataframe to parse region coordinates:
c("chromA','startA','endA','strandA',"chromB",'startB','endB','strandB')
GInteractions(mode='strict')
is enforced, to ensure that the order of the regions
Extra columns are stored as metadata fields
GInteractions(mode='strict')
# load example GInteractions data(RNADuplexesSmallGI) converted_to_df <- makeDfFromGi(SampleSmallGI) converted_to_gi <- makeGiFromDf(converted_to_df)
# load example GInteractions data(RNADuplexesSmallGI) converted_to_df <- makeDfFromGi(SampleSmallGI) converted_to_gi <- makeGiFromDf(converted_to_df)
Calculates alignment coordinates and returns reads with categories
preproc_chim_junction_out_se(dt, keep_all_columns = FALSE)
preproc_chim_junction_out_se(dt, keep_all_columns = FALSE)
dt |
Chimeric.out.junction with the correct column names |
keep_all_columns |
|
#'
multi-mapped read
more than one junction (more than two 'N' in CIGAR string)
Artifacts. I.e alignments for both arms are continious, but with 'backward' chimeric junction was wrongly put
tibble with annotated reads
File generated by mapping with STAR using --quantMode GeneCounts
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR")
for details on the pre-processing and sub-setting the
data(RNADuplexesSampleData)
data(RNADuplexesSampleData)
An object of class spec_tbl_df
(inherits from tbl_df
, tbl
, data.frame
) with 1445 rows and 2 columns.
tibble
with columns of Chimeric.junction.out
A Chimeric.out.Junction file with a subset of chr 22 Chimeric reads detected by SPLASH protocol in Human embryonic stem cells.
data(RNADuplexesSampleData)
data(RNADuplexesSampleData)
An object of class spec_tbl_df
(inherits from tbl_df
, tbl
, data.frame
) with 2040 rows and 10 columns.
tibble
with columns of bedpe format
SequenceReadArcive
Reads were aligned with STAR and filtered to contain only reads
which could be represented as 2-arm chimeric alignments. Converted to the
bedpe
format
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR")
for details on the pre-processing and sub-setting the
data
A Chimeric.out.Junction file with a subset of chr 22 Chimeric reads detected by SPLASH protocol in Human embryonic stem cells.
data(RNADuplexesSampleData)
data(RNADuplexesSampleData)
An object of class tbl_df
(inherits from tbl
, data.frame
) with 5000 rows and 21 columns.
tibble
with columns of Chimeric.junction.out
SequenceReadArcive
Reads were aligned with STAR
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR")
for details on the pre-processing and sub-setting the
data
GInteractions
read-level object containing processed reads,annotated with duplex group
ids, read types gene names and p-values
data(RNADuplexesSampleData)
data(RNADuplexesSampleData)
An object of class StrictGInteractions
of length 2090.
GInteractions
with
n_reads_dg
: number of reads in the duplex group (DG)
duplex_id
: temporary id for RNA duplexes which could be found before
clustering (duplicated or shifted by couple of nt )
dg_id
:id of the duplex group
score
: median alignment score in duplex group
other columns inherited from the STAR Chimeric.out.Junction
SequenceReadArcive
Reads were aligned with STAR and duplex groups were identified
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR")
for details on the data generation proccedure.
GInteractions
duplex group -level object containing detected duplex groups,
annotated with duplex group ids, gene_names and p-values
data(RNADuplexesSampleData)
data(RNADuplexesSampleData)
An object of class StrictGInteractions
of length 79.
GInteractions
with
n_reads
: number of reads in the duplex group (DG)
dg_id
:id of the duplex group
p_val
: BH adjusted p-value of testing to reject hypothesis of DG arising
from random ligation
score
: median alignment score in duplex group
other columns with .A
and .B
annotating to which genes either arm of the
DG maps
SequenceReadArcive
Reads were aligned with STAR and duplex groups were identified
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR")
for details on the data generation procedure.
GInteractions
read-level object containing two-arm chimeric reads extracted from
mapping output and which can be represented in the GInteraction
object
data(RNADuplexesSampleData)
data(RNADuplexesSampleData)
An object of class StrictGInteractions
of length 2090.
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR")
for details on the data generation proccedure.
GInteractions
with
readname
: read name
map_type
: type of the mapped read (2arm by design of pre-filtering)
junction_type
: if read jucntion is too short, or it not a 'true' ligated
reads because of the jucntoin coincides with splice junction
cigar_aln* columns inherited from the STAR Chimeric.out.Junction output
Imports dataframe with reads (.bedpe or Chimeric.out.junction ) or GInteractions
object.
Checks column names or tries to quess them if not provided.
Adds necessary annotation depending on the input type,
For STAR input, calculates length of the alignments
and marks unique 2-arm alignments.
For the .bedpe or GInteractions
input, all entries are already represented as reads
with two different aligned parts (2-arm), so only check for unique readname
is performed.
runDuplexDiscoPreproc( data, table_type, library_type = "SE", keep_metadata = TRUE, return_gi = FALSE )
runDuplexDiscoPreproc( data, table_type, library_type = "SE", keep_metadata = TRUE, return_gi = FALSE )
data |
Either dataframe-like object: Chimeric.out.junction from STAR or
.bedpe - formatted or |
table_type |
in |
library_type |
|
keep_metadata |
|
return_gi |
if the return object should be |
If not existed, adds fields required for the downstream steps: 'readname', 'map_type', 'score', 'n_reads'. 'map_type' field determines the type of the chimeric read:
multi-mapped read
more than one junction (more than two 'N' in CIGAR string)
Artifacts or possibly unaccounted types. I.e alignments for both arms are continuous, but with 'backward' chimeric junction was wrongly introduced in the mapping
tibble with new metadata fields OR GInteractions if return_gi
is
set to TRUE
# load data data(RNADuplexesSampleData) # with bedpe input preproc_reads <- runDuplexDiscoPreproc(RNADuplexesRawBed, table_type = "bedpe") # with STAR input preproc_reads_star <- runDuplexDiscoPreproc(RNADuplexesRawChimSTAR, table_type = "STAR", keep_metadata = FALSE )
# load data data(RNADuplexesSampleData) # with bedpe input preproc_reads <- runDuplexDiscoPreproc(RNADuplexesRawBed, table_type = "bedpe") # with STAR input preproc_reads_star <- runDuplexDiscoPreproc(RNADuplexesRawChimSTAR, table_type = "STAR", keep_metadata = FALSE )
Generates GInteractions object with duplex groups from the STAR Chimeric.out.junction or bedpe file. Classifies reads, annotates reads by overlap with the gene or transcript features, calculates p-values and hybridization energies. Additionally, returns mappings from duplex groupd back to genes.
runDuplexDiscoverer( data, table_type = "", junctions_gr = NULL, anno_gr = NULL, fafile = NULL, df_counts = NULL, sample_name = "sample", lib_type = "SE", min_junction_len = 5, max_gap = 50, min_arm_ratio = 0.1, min_overlap = 10, max_sj_shift = 10, gap_collapse_similar = 2, collapse_n_inter = 5 )
runDuplexDiscoverer( data, table_type = "", junctions_gr = NULL, anno_gr = NULL, fafile = NULL, df_counts = NULL, sample_name = "sample", lib_type = "SE", min_junction_len = 5, max_gap = 50, min_arm_ratio = 0.1, min_overlap = 10, max_sj_shift = 10, gap_collapse_similar = 2, collapse_n_inter = 5 )
data |
dataframe-like object with the split reads. Output of Chimeric.out.junction or dataframe with fileds defined by bedpe format:
c("chromA","startA",'endA',"chromB",'startB','endB','readname','flag','strandA','strandB', ... )
Alternatively, |
table_type |
one in c("STAR","bedpe") Defines the type of the input dataframe.
ignored if input data is |
junctions_gr |
GRanges object with the splice junction coordinates |
anno_gr |
GRanges object to use for the annotation of the interactions. The c('gene_id','gene_name','gene_types') columns in anno_gr are used by default. Optional |
fafile |
path to the genome .fasta file. Used to calculate hybridization energy with RNADuplex. Sequence names should correspond to the sequences from which the mapping index was created. Optional |
df_counts |
A two- column dataframe with counts to use for p-value calculation. The first column should match the 'gene_id' feature in anno_gr. The second column is the respective count. Optional |
sample_name |
A name of the sample, used for assembling the analysis statistics dataframe |
lib_type |
one in c('SE','PE'). Type of the seqeuncing library. Default is 'SE' |
min_junction_len |
a minimum allowed distance between chimeric arms for the read input.
Reads with the junction closer than |
max_gap |
Parameter for read clustering. Minimum required shift between start and end coordinates of arms for pair of overlapping chimeric reads.
If the shift is longer than |
min_arm_ratio |
Parameter for read clustering.
If the overlap-to-span ratio for either arm (A or B) for pair of chimeric reads is less than |
min_overlap |
Parameter for read clustering. Minimum required overlap to for either arm (A or B) for pair of chimeric reads. |
max_sj_shift |
Maximum shift between either donor and acceptor splice sites and chimeric junction coordinates to count chimeric junction as splice junction |
gap_collapse_similar |
Parameter for read clustering (iterative step). Analogous to the max_gap, but applied |
collapse_n_inter |
Parameter for read clustering (iterative step). Number of iterations to repeat step of collapsing of the highly similar chimeric reads. Increasing this from i.e 0 to 5 reduces clustering time and memory for the libraries with many overlapping reads. |
This is a main function to do the initial discovery of the RNA duplexes after the chimeric read mapping. It wraps following procedures:
Classifies the input reads by the mapping type. Keeps 2-arm chimeric reads for downstream analysis
Compares 2arm duplex reads against provided splice junctions
Classifies 2arm duplexes into spurious self-overlapping, splice junction categoris
Performs clustering of the remaining reads into duplex groups
Collapses identically mapped reads
Collapses closely located reads, almost identical reads
Finds duplex groups throughout whole data set
Annotates duplex groups with genomic features if annotation is provided
Calculates p-values if gene counts and annotation are provided
Calculates hybridization energies if path to the .fasta file is provided
a DuplexDiscovererResults
with the following output
duplex_groups
GInteractions
object with chimeric reads clustered duplex groups
chimeric_reads
GInteractions
object with non-collapsed chimeric reads
reads_classes
tbl_df
dataframe parallel to the the input dataframe, annotated with read categories and duplex groups
chimeric_reads_stats
tbl_df
dataframe containing read type classification statistics
run_stats
tbl_df
dataframe with the time and memory info about the run
library(DuplexDiscovereR) # load data data("RNADuplexesSampleData") result <- runDuplexDiscoverer( data = RNADuplexesRawChimSTAR, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, df_counts = RNADuplexesGeneCounts, sample_name = "test clustering", fafile = NULL, collapse_n_inter = 3, lib_type = "SE", table_type = "STAR" ) # see results object print(result) # duplex groups dd_get_duplex_groups(result) # individual chimeric reads dd_get_chimeric_reads(result) # counts of detected read tyoes dd_get_chimeric_reads_stats(result)
library(DuplexDiscovereR) # load data data("RNADuplexesSampleData") result <- runDuplexDiscoverer( data = RNADuplexesRawChimSTAR, junctions_gr = SampleSpliceJncGR, anno_gr = SampleGeneAnnoGR, df_counts = RNADuplexesGeneCounts, sample_name = "test clustering", fafile = NULL, collapse_n_inter = 3, lib_type = "SE", table_type = "STAR" ) # see results object print(result) # duplex groups dd_get_duplex_groups(result) # individual chimeric reads dd_get_chimeric_reads(result) # counts of detected read tyoes dd_get_chimeric_reads_stats(result)
Granges
containing gene coordinates of human chromosome 22 obtained from
GENCODEv44 annotaion
data(RNADuplexesSampleData)
data(RNADuplexesSampleData)
An object of class GRanges
of length 1445.
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR")
for details
statdatd GENCODE gtf fields
GInteractions
object containing two-arm chimeric reads extracted from
mapping output and which can be represented in the GInteraction
object
and subset to chr22: 23877144-45562960 '*'
data(RNADuplexesSmallGI)
data(RNADuplexesSmallGI)
An object of class StrictGInteractions
of length 14.
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR")
for details on the data generation procedure.
Granges
containing coordinates of splice junctions human chromosome 22 obtained from
GENCODEv44 annotaion
data(RNADuplexesSampleData)
data(RNADuplexesSampleData)
An object of class GRanges
of length 8465.
see system.file("extdata/scripts", "DD_data_generation.R", package =
"DuplexDiscovereR")
for details
statdatd GENCODE gtf fields
This method provides a summary of the DuplexDiscovererResults object.
It prints chimeric_reads_stats
followed by the run_stats
.
## S4 method for signature 'DuplexDiscovererResults' show(object)
## S4 method for signature 'DuplexDiscovererResults' show(object)
object |
A |
None. Prints a formatted summary.
Show method for DuplexTrack
## S4 method for signature 'DuplexTrack' show(object)
## S4 method for signature 'DuplexTrack' show(object)
object |
DuplexTrack. |
class representation
library(InteractionSet) anchor1 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(100, 600, 1100, 1600, 2100), end = c(200, 700, 1200, 1700, 2200) ), strand = "+" ) anchor2 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(300, 800, 1300, 1800, 2300), end = c(400, 900, 1400, 1900, 2400) ), strand = "+" ) interactions <- GInteractions(anchor1, anchor2, mode = "strict") gr_region <- range(anchor1, anchor2) a <- DuplexTrack(interactions, gr_region = gr_region, stacking = "dense") show(a)
library(InteractionSet) anchor1 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(100, 600, 1100, 1600, 2100), end = c(200, 700, 1200, 1700, 2200) ), strand = "+" ) anchor2 <- GRanges( seqnames = "chr1", ranges = IRanges( start = c(300, 800, 1300, 1800, 2300), end = c(400, 900, 1400, 1900, 2400) ), strand = "+" ) interactions <- GInteractions(anchor1, anchor2, mode = "strict") gr_region <- range(anchor1, anchor2) a <- DuplexTrack(interactions, gr_region = gr_region, stacking = "dense") show(a)
Writes interactions to the sam file for visualization in extrnal browsers. Takes input as GInteractions object containing reads or duplex groups.
writeGiToSAMfile( gi_coords, file_out, distance_chim_junction = 10000, read_name_column = "readname", id_column = "dg_id", genome = "", sample_name = "noname_sample" )
writeGiToSAMfile( gi_coords, file_out, distance_chim_junction = 10000, read_name_column = "readname", id_column = "dg_id", genome = "", sample_name = "noname_sample" )
gi_coords |
input Ginteraction object |
file_out |
path to write output file |
distance_chim_junction |
maximum distance between input duplex groups/reads, which will be represented as the single-line in .sam file. Junction will be output as N- gap. For the interactions with longer distances, chimeric junction will be represented as MR:Z:i tag |
read_name_column |
character field, pointing out to read names. Read names are generated automatically if not provided. |
id_column |
character name of the field containing integer duplex group ids. NA are replaced with zeros |
genome |
character. Genome version. Required for the retrieval of sequence lengths for sam file header- SQ and SN tags. For convenience, hg38 and hg19 chromosome lengths will be assigned automatically. If the value is not in c('hg38','hg19'), seqlengths will be looked for be in attribute in seqlengths() of regions(gi_coords) |
sample_name |
name to use in RG SAM tag in header |
no object is returned
# Load test data data("RNADuplexesSampleData") # if the input is read-based, it should have integer duplex group ids # here, we have 2090 reads length(RNADuplexSampleGI) # among them 300 reads does not belong to any DG # missing ids will be converted to 0 table(is.na(RNADuplexSampleGI$dg_id)) tmpf <- tempfile(".sam") writeGiToSAMfile( gi_coords = RNADuplexSampleGI, id_column = "dg_id", file_out = tmpf, distance_chim_junction = 1e5, genome = "hg38" )
# Load test data data("RNADuplexesSampleData") # if the input is read-based, it should have integer duplex group ids # here, we have 2090 reads length(RNADuplexSampleGI) # among them 300 reads does not belong to any DG # missing ids will be converted to 0 table(is.na(RNADuplexSampleGI$dg_id)) tmpf <- tempfile(".sam") writeGiToSAMfile( gi_coords = RNADuplexSampleGI, id_column = "dg_id", file_out = tmpf, distance_chim_junction = 1e5, genome = "hg38" )