Title: | Irreproducible Discovery Rate for Genomic Interactions Data |
---|---|
Description: | A tool to measure reproducibility between genomic experiments that produce two-dimensional peaks (interactions between peaks), such as ChIA-PET, HiChIP, and HiC. idr2d is an extension of the original idr package, which is intended for (one-dimensional) ChIP-seq peaks. |
Authors: | Konstantin Krismer [aut, cre, cph] , David Gifford [ths, cph] |
Maintainer: | Konstantin Krismer <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.21.0 |
Built: | 2024-11-30 05:20:13 UTC |
Source: | https://github.com/bioc/idr2d |
Calculates the distance in nucleotides between the midpoints of two peaks.
Note: peaks must be on the same chromosome; start coordinate is always less than end coordinate
calculate_midpoint_distance1d(peak1_start, peak1_end, peak2_start, peak2_end)
calculate_midpoint_distance1d(peak1_start, peak1_end, peak2_start, peak2_end)
peak1_start |
integer vector; genomic start coordinate(s) of peak in replicate 1 |
peak1_end |
integer vector; genomic end coordinate(s) of peak in replicate 1 |
peak2_start |
integer vector; genomic start coordinate(s) of peak in replicate 2 |
peak2_end |
integer vector; genomic end coordinate(s) of peak in replicate 2 |
positive integer vector; distances between peak pairs
# identical, zero distance calculate_midpoint_distance1d(100, 120, 100, 120) # centered, zero distance calculate_midpoint_distance1d(100, 120, 90, 130) # off by 10 per anchor calculate_midpoint_distance1d(100, 120, 110, 130) # vectorized example calculate_midpoint_distance1d(c(100, 100, 100), c(120, 120, 120), c(100, 90, 110), c(120, 130, 130))
# identical, zero distance calculate_midpoint_distance1d(100, 120, 100, 120) # centered, zero distance calculate_midpoint_distance1d(100, 120, 90, 130) # off by 10 per anchor calculate_midpoint_distance1d(100, 120, 110, 130) # vectorized example calculate_midpoint_distance1d(c(100, 100, 100), c(120, 120, 120), c(100, 90, 110), c(120, 130, 130))
Calculates the distance in nucleotides between the anchor midpoints of two interactions, which is the sum of the distance between midpoints of anchor A in interaction 1 and anchor A in interaction 2, and the distance between midpoints of anchor B in interaction 1 and anchor B in interaction 2.
Note: all anchors must be on the same chromosome; start coordinate is always less than end coordinate
calculate_midpoint_distance2d( int1_anchor_a_start, int1_anchor_a_end, int1_anchor_b_start, int1_anchor_b_end, int2_anchor_a_start, int2_anchor_a_end, int2_anchor_b_start, int2_anchor_b_end )
calculate_midpoint_distance2d( int1_anchor_a_start, int1_anchor_a_end, int1_anchor_b_start, int1_anchor_b_end, int2_anchor_a_start, int2_anchor_a_end, int2_anchor_b_start, int2_anchor_b_end )
int1_anchor_a_start |
integer vector; genomic start coordinate(s) of anchor A in replicate 1 interaction |
int1_anchor_a_end |
integer vector; genomic end coordinate(s) of anchor A in replicate 1 interaction |
int1_anchor_b_start |
integer vector; genomic start coordinate(s) of anchor B in replicate 1 interaction |
int1_anchor_b_end |
integer vector; genomic end coordinate(s) of anchor B in replicate 1 interaction |
int2_anchor_a_start |
integer vector; genomic start coordinate(s) of anchor A in replicate 2 interaction |
int2_anchor_a_end |
integer vector; genomic end coordinate(s) of anchor A in replicate 2 interaction |
int2_anchor_b_start |
integer vector; genomic start coordinate(s) of anchor B in replicate 2 interaction |
int2_anchor_b_end |
integer vector; genomic end coordinate(s) of anchor B in replicate 2 interaction |
positive integer vector; distances between interaction pairs
# identical, zero distance calculate_midpoint_distance2d(100, 120, 240, 260, 100, 120, 240, 260) # centered, zero distance calculate_midpoint_distance2d(100, 120, 240, 260, 90, 130, 230, 270) # off by 10 per anchor calculate_midpoint_distance2d(100, 120, 240, 250, 110, 130, 230, 240) # off by 10 (anchor B only) calculate_midpoint_distance2d(100, 120, 240, 250, 90, 130, 250, 260) # vectorized example calculate_midpoint_distance2d(c(100, 100, 100, 100), c(120, 120, 120, 120), c(240, 240, 240, 240), c(260, 260, 250, 250), c(100, 90, 110, 90), c(120, 130, 130, 130), c(240, 230, 230, 250), c(260, 270, 240, 260))
# identical, zero distance calculate_midpoint_distance2d(100, 120, 240, 260, 100, 120, 240, 260) # centered, zero distance calculate_midpoint_distance2d(100, 120, 240, 260, 90, 130, 230, 270) # off by 10 per anchor calculate_midpoint_distance2d(100, 120, 240, 250, 110, 130, 230, 240) # off by 10 (anchor B only) calculate_midpoint_distance2d(100, 120, 240, 250, 90, 130, 250, 260) # vectorized example calculate_midpoint_distance2d(c(100, 100, 100, 100), c(120, 120, 120, 120), c(240, 240, 240, 240), c(260, 260, 250, 250), c(100, 90, 110, 90), c(120, 130, 130, 130), c(240, 230, 230, 250), c(260, 270, 240, 260))
Calculates the overlap between anchor A of interaction 1 and anchor A of interaction 2, as well as anchor B of interaction 1 and anchor B of interaction 2. The overlap (in nucleotides) is then normalized by the length of the anchors.
calculate_relative_overlap1d(peak1_start, peak1_end, peak2_start, peak2_end)
calculate_relative_overlap1d(peak1_start, peak1_end, peak2_start, peak2_end)
peak1_start |
integer vector; genomic start coordinate(s) of peak in replicate 1 |
peak1_end |
integer vector; genomic end coordinate(s) of peak in replicate 1 |
peak2_start |
integer vector; genomic start coordinate(s) of peak in replicate 2 |
peak2_end |
integer vector; genomic end coordinate(s) of peak in replicate 2 |
numeric vector; relative overlaps between peak pairs
# 100% overlap calculate_relative_overlap1d(100, 120, 100, 120) # 50% overlap calculate_relative_overlap1d(100, 120, 100, 110) # negative overlap calculate_relative_overlap1d(100, 120, 130, 140) # larger negative overlap calculate_relative_overlap1d(100, 120, 200, 220) # vectorized example calculate_relative_overlap1d(c(100, 100, 100, 100), c(120, 120, 120, 120), c(100, 100, 130, 200), c(120, 110, 140, 220))
# 100% overlap calculate_relative_overlap1d(100, 120, 100, 120) # 50% overlap calculate_relative_overlap1d(100, 120, 100, 110) # negative overlap calculate_relative_overlap1d(100, 120, 130, 140) # larger negative overlap calculate_relative_overlap1d(100, 120, 200, 220) # vectorized example calculate_relative_overlap1d(c(100, 100, 100, 100), c(120, 120, 120, 120), c(100, 100, 130, 200), c(120, 110, 140, 220))
Calculates the overlap between anchor A of interaction 1 and anchor A of interaction 2, as well as anchor B of interaction 1 and anchor B of interaction 2. The overlap (in nucleotides) is then normalized by the length of the anchors.
Note: anchors A and B of the same interaction have to be on the same chromosome; start coordinate is always less than end coordinate
calculate_relative_overlap2d( int1_anchor_a_start, int1_anchor_a_end, int1_anchor_b_start, int1_anchor_b_end, int2_anchor_a_start, int2_anchor_a_end, int2_anchor_b_start, int2_anchor_b_end )
calculate_relative_overlap2d( int1_anchor_a_start, int1_anchor_a_end, int1_anchor_b_start, int1_anchor_b_end, int2_anchor_a_start, int2_anchor_a_end, int2_anchor_b_start, int2_anchor_b_end )
int1_anchor_a_start |
integer vector; genomic start coordinate(s) of anchor A in replicate 1 interaction |
int1_anchor_a_end |
integer vector; genomic end coordinate(s) of anchor A in replicate 1 interaction |
int1_anchor_b_start |
integer vector; genomic start coordinate(s) of anchor B in replicate 1 interaction |
int1_anchor_b_end |
integer vector; genomic end coordinate(s) of anchor B in replicate 1 interaction |
int2_anchor_a_start |
integer vector; genomic start coordinate(s) of anchor A in replicate 2 interaction |
int2_anchor_a_end |
integer vector; genomic end coordinate(s) of anchor A in replicate 2 interaction |
int2_anchor_b_start |
integer vector; genomic start coordinate(s) of anchor B in replicate 2 interaction |
int2_anchor_b_end |
integer vector; genomic end coordinate(s) of anchor B in replicate 2 interaction |
numeric vector; relative overlaps between interaction pairs
# 100% overlap calculate_relative_overlap2d(100, 120, 240, 260, 100, 120, 240, 260) # 50% overlap calculate_relative_overlap2d(100, 120, 240, 250, 100, 110, 240, 260) # negative overlap calculate_relative_overlap2d(100, 120, 240, 250, 130, 140, 260, 280) # larger negative overlap calculate_relative_overlap2d(100, 120, 240, 250, 200, 220, 340, 350) # vectorized example calculate_relative_overlap2d(c(100, 100, 100, 100), c(120, 120, 120, 120), c(240, 240, 240, 240), c(260, 250, 250, 250), c(100, 100, 130, 200), c(120, 110, 140, 220), c(240, 240, 260, 340), c(260, 260, 280, 350))
# 100% overlap calculate_relative_overlap2d(100, 120, 240, 260, 100, 120, 240, 260) # 50% overlap calculate_relative_overlap2d(100, 120, 240, 250, 100, 110, 240, 260) # negative overlap calculate_relative_overlap2d(100, 120, 240, 250, 130, 140, 260, 280) # larger negative overlap calculate_relative_overlap2d(100, 120, 240, 250, 200, 220, 340, 350) # vectorized example calculate_relative_overlap2d(c(100, 100, 100, 100), c(120, 120, 120, 120), c(240, 240, 240, 240), c(260, 250, 250, 250), c(100, 100, 130, 200), c(120, 110, 140, 220), c(240, 240, 260, 340), c(260, 260, 280, 350))
This object contains genomic interactions on chromosomes 1 to 5, which could be the results of Hi-C or ChIA-PET experiments, done in duplicates.
chiapet
chiapet
A list with two components, the data frames rep1_df
and
rep2_df
, which have the following seven columns:
column 1: | chr_a |
character; genomic location of anchor A -
chromosome (e.g., "chr3" ) |
column 2: | start_a |
integer; genomic location of anchor A - start coordinate |
column 3: | end_a |
integer; genomic location of anchor A - end coordinate |
column 4: | chr_b |
character; genomic location of anchor B -
chromosome (e.g., "chr3" ) |
column 5: | start_b |
integer; genomic location of anchor B - start coordinate |
column 6: | end_b |
integer; genomic location of anchor B - end coordinate |
column 7: | fdr |
numeric; False Discovery Rate - significance of interaction |
This object contains genomic peaks from two replicate ChIP-seq experiments.
chipseq
chipseq
A list with two components, the data frames rep1_df
and
rep2_df
, which have the following four columns:
column 1: | chr |
character; genomic location of peak -
chromosome (e.g., "chr3" ) |
column 2: | start |
integer; genomic location of peak - start coordinate |
column 3: | end |
integer; genomic location of peak - end coordinate |
column 4: | value |
numeric; heuristic used to rank the peaks |
Identifies all overlapping anchor pairs (m:n mapping).
determine_anchor_overlap(rep1_anchor, rep2_anchor, max_gap = -1L)
determine_anchor_overlap(rep1_anchor, rep2_anchor, max_gap = -1L)
rep1_anchor |
data frame with the following columns:
|
|||||||||
rep2_anchor |
data frame with the following columns:
|
|||||||||
max_gap |
integer; maximum gap in nucleotides allowed between two anchors for them to be considered as overlapping (defaults to -1, i.e., overlapping anchors) |
A data frame containing overlapping anchor pairs with the following columns:
column 1: | rep1_idx |
anchor index in data frame
rep1_anchor |
column 2: | rep2_idx |
anchor index in data frame
rep2_anchor
|
rep1_df <- idr2d:::chiapet$rep1_df rep2_df <- idr2d:::chiapet$rep2_df rep1_anchor_a <- data.frame(chr = rep1_df[, 1], start = rep1_df[, 2], end = rep1_df[, 3]) rep2_anchor_a <- data.frame(chr = rep2_df[, 1], start = rep2_df[, 2], end = rep2_df[, 3]) anchor_a_overlap <- determine_anchor_overlap(rep1_anchor_a, rep2_anchor_a)
rep1_df <- idr2d:::chiapet$rep1_df rep2_df <- idr2d:::chiapet$rep2_df rep1_anchor_a <- data.frame(chr = rep1_df[, 1], start = rep1_df[, 2], end = rep1_df[, 3]) rep2_anchor_a <- data.frame(chr = rep2_df[, 1], start = rep2_df[, 2], end = rep2_df[, 3]) anchor_a_overlap <- determine_anchor_overlap(rep1_anchor_a, rep2_anchor_a)
Creates Hi-C contact maps to visualize the results of
estimate_idr2d_hic
.
draw_hic_contact_map( df, idr_cutoff = NULL, chromosome = NULL, start_coordinate = NULL, end_coordinate = NULL, title = NULL, values_normalized = FALSE, log_values = TRUE )
draw_hic_contact_map( df, idr_cutoff = NULL, chromosome = NULL, start_coordinate = NULL, end_coordinate = NULL, title = NULL, values_normalized = FALSE, log_values = TRUE )
df |
output of
|
||||||||||||||||||
idr_cutoff |
numeric; only show blocks with IDR < |
||||||||||||||||||
chromosome |
character; chromsome name or list of chromosome names to
be analyzed, e.g., UCSC chromosome 1, |
||||||||||||||||||
start_coordinate |
integer; only show contact map window between
|
||||||||||||||||||
end_coordinate |
integer; only show contact map window between
|
||||||||||||||||||
title |
character; plot title |
||||||||||||||||||
values_normalized |
logical; are read counts in value column
raw or normalized? Defaults to |
||||||||||||||||||
log_values |
logical; log-transform value column? Defaults to
|
ggplot2 object; Hi-C contact map
idr_results_df <- estimate_idr2d_hic(idr2d:::hic$rep1_df, idr2d:::hic$rep2_df) draw_hic_contact_map(idr_results_df, idr_cutoff = 0.05, chromosome = "chr1")
idr_results_df <- estimate_idr2d_hic(idr2d:::hic$rep1_df, idr2d:::hic$rep2_df) draw_hic_contact_map(idr_results_df, idr_cutoff = 0.05, chromosome = "chr1")
Creates diagnostic plots to visualize the results of
estimate_idr
.
draw_idr_distribution_histogram( df, remove_na = TRUE, xlab = "IDR", ylab = "density", title = "IDR value distribution" )
draw_idr_distribution_histogram( df, remove_na = TRUE, xlab = "IDR", ylab = "density", title = "IDR value distribution" )
df |
part of output of
|
||
remove_na |
logical; should NA values be removed? |
||
xlab |
character; x axis label |
||
ylab |
character; y axis label |
||
title |
character; plot title |
ggplot2 object; IDR distribution histogram
idr_results <- estimate_idr1d(idr2d:::chipseq$rep1_df, idr2d:::chipseq$rep2_df, value_transformation = "log") draw_idr_distribution_histogram(idr_results$rep1_df)
idr_results <- estimate_idr1d(idr2d:::chipseq$rep1_df, idr2d:::chipseq$rep2_df, value_transformation = "log") draw_idr_distribution_histogram(idr_results$rep1_df)
Creates diagnostic plots to visualize the results of
estimate_idr
.
draw_rank_idr_scatterplot( df, remove_na = TRUE, xlab = "rank in replicate 1", ylab = "rank in replicate 2", log_idr = FALSE, title = "rank - IDR dependence", color_gradient = c("rainbow", "default"), alpha = 1, max_points_shown = 2500 )
draw_rank_idr_scatterplot( df, remove_na = TRUE, xlab = "rank in replicate 1", ylab = "rank in replicate 2", log_idr = FALSE, title = "rank - IDR dependence", color_gradient = c("rainbow", "default"), alpha = 1, max_points_shown = 2500 )
df |
part of output of
|
||||||
remove_na |
logical; should NA values be removed? |
||||||
xlab |
character; x axis label |
||||||
ylab |
character; y axis label |
||||||
log_idr |
logical; use logarithmized IDRs for colors to better distinguish highly significant IDRs |
||||||
title |
character; plot title |
||||||
color_gradient |
character; either "rainbow" or "default" |
||||||
alpha |
numeric; transparency of dots, from 0.0 - 1.0, where 1.0 is completely opaque; default is 1.0 |
||||||
max_points_shown |
integer; default is 2500 |
ggplot2 object; IDR rank scatterplot
idr_results <- estimate_idr1d(idr2d:::chipseq$rep1_df, idr2d:::chipseq$rep2_df, value_transformation = "log") draw_rank_idr_scatterplot(idr_results$rep1_df)
idr_results <- estimate_idr1d(idr2d:::chipseq$rep1_df, idr2d:::chipseq$rep2_df, value_transformation = "log") draw_rank_idr_scatterplot(idr_results$rep1_df)
Creates diagnostic plots to visualize the results of
estimate_idr
.
draw_value_idr_scatterplot( df, remove_na = TRUE, remove_outliers = TRUE, xlab = "transformed value in replicate 1", ylab = "transformed value in replicate 2", log_axes = FALSE, log_idr = FALSE, title = "value - IDR dependence", color_gradient = c("rainbow", "default"), alpha = 1, max_points_shown = 2500 )
draw_value_idr_scatterplot( df, remove_na = TRUE, remove_outliers = TRUE, xlab = "transformed value in replicate 1", ylab = "transformed value in replicate 2", log_axes = FALSE, log_idr = FALSE, title = "value - IDR dependence", color_gradient = c("rainbow", "default"), alpha = 1, max_points_shown = 2500 )
df |
part of output of
|
||||||
remove_na |
logical; should NA values be removed? |
||||||
remove_outliers |
logical; removes extreme data points |
||||||
xlab |
character; x axis label |
||||||
ylab |
character; y axis label |
||||||
log_axes |
logical; show logarithmized values from replicate 1 and 2 (default value is FALSE) |
||||||
log_idr |
logical; use logarithmized IDRs for colors to better distinguish highly significant IDRs (default value is FALSE) |
||||||
title |
character; plot title |
||||||
color_gradient |
character; either "rainbow" or "default" |
||||||
alpha |
numeric; transparency of dots, from 0.0 - 1.0, where 1.0 is completely opaque; default is 1.0 |
||||||
max_points_shown |
integer; default is 2500 |
ggplot2 object; IDR value scatterplot
idr_results <- estimate_idr1d(idr2d:::chipseq$rep1_df, idr2d:::chipseq$rep2_df, value_transformation = "log") draw_value_idr_scatterplot(idr_results$rep1_df)
idr_results <- estimate_idr1d(idr2d:::chipseq$rep1_df, idr2d:::chipseq$rep2_df, value_transformation = "log") draw_value_idr_scatterplot(idr_results$rep1_df)
This method establishes a bijective assignment between observations
(genomic peaks in case of ChIP-seq, genomic interactions in case of
ChIA-PET, HiChIP, and Hi-C) from
replicate 1 and 2. An observation in replicate 1 is assigned to an
observation in replicate 2 if and only if (1) the observation loci in both
replicates overlap (or the gap between them is less than
or equal to max_gap
), and (2) there is no other observation in
replicate 2 that overlaps with the observation in replicate 1 and has a
lower ambiguity resolution value.
establish_bijection( rep1_df, rep2_df, analysis_type = c("IDR1D", "IDR2D"), ambiguity_resolution_method = c("overlap", "midpoint", "value"), max_gap = -1L )
establish_bijection( rep1_df, rep2_df, analysis_type = c("IDR1D", "IDR2D"), ambiguity_resolution_method = c("overlap", "midpoint", "value"), max_gap = -1L )
rep1_df |
data frame of observations (i.e., genomic peaks or genomic
interactions) of
replicate 1. If |
rep2_df |
data frame of observations (i.e., genomic peaks or genomic
interactions) of replicate 2. Same columns as |
analysis_type |
"IDR2D" for genomic interaction data sets, "IDR1D" for genomic peak data sets |
ambiguity_resolution_method |
defines how ambiguous assignments
(when one interaction or peak in replicate 1 overlaps with
multiple interactions or peaks in replicate 2 or vice versa)
are resolved. For available methods, see |
max_gap |
integer; maximum gap in nucleotides allowed between two anchors for them to be considered as overlapping (defaults to -1, i.e., overlapping anchors) |
See establish_bijection1d
or
establish_bijection2d
, respectively.
rep1_df <- idr2d:::chipseq$rep1_df rep1_df$value <- preprocess(rep1_df$value, "log") rep2_df <- idr2d:::chipseq$rep2_df rep2_df$value <- preprocess(rep2_df$value, "log") mapping <- establish_bijection(rep1_df, rep2_df, analysis_type = "IDR1D")
rep1_df <- idr2d:::chipseq$rep1_df rep1_df$value <- preprocess(rep1_df$value, "log") rep2_df <- idr2d:::chipseq$rep2_df rep2_df$value <- preprocess(rep2_df$value, "log") mapping <- establish_bijection(rep1_df, rep2_df, analysis_type = "IDR1D")
This method establishes a bijective assignment between peaks from
replicate 1 and 2. A peak in replicate 1 is assigned to a
peak in replicate 2 if and only if (1) they overlap (or the gap between the
peaks is less than or equal to max_gap
), and (2) there is no other
peak in
replicate 2 that overlaps with the peak in replicate 1 and has a
lower ambiguity resolution value.
establish_bijection1d( rep1_df, rep2_df, ambiguity_resolution_method = c("overlap", "midpoint", "value"), max_gap = -1L )
establish_bijection1d( rep1_df, rep2_df, ambiguity_resolution_method = c("overlap", "midpoint", "value"), max_gap = -1L )
rep1_df |
data frame of observations (i.e., genomic peaks) of replicate 1, with at least the following columns (position of columns matter, column names are irrelevant):
|
||||||||||||
rep2_df |
data frame of observations (i.e., genomic peaks) of replicate 2, with the following columns (position of columns matter, column names are irrelevant):
|
||||||||||||
ambiguity_resolution_method |
defines how ambiguous assignments (when one interaction in replicate 1 overlaps with multiple interactions in replicate 2 or vice versa) are resolved. Available methods:
|
||||||||||||
max_gap |
integer; maximum gap in nucleotides allowed between two anchors for them to be considered as overlapping (defaults to -1, i.e., overlapping anchors) |
Data frames rep1_df
and rep2_df
with
the following columns:
column 1: | chr |
character; genomic location of peak -
chromosome (e.g., "chr3" ) |
column 2: | start |
integer; genomic location of peak - start coordinate |
column 3: | end |
integer; genomic location of peak - end coordinate |
column 4: | value |
numeric; p-value, FDR, or heuristic used to rank the peaks |
column 5: | rep_value |
numeric; value of corresponding
replicate peak. If no corresponding peak was found, rep_value is set
to NA . |
column 6: | rank |
integer; rank of the peak, established by value column, ascending order |
column 7: | rep_rank |
integer; rank of corresponding
replicate peak. If no corresponding peak was found, rep_rank is
set to NA . |
column 8: | idx |
integer; peak index, primary key |
column 9: | rep_idx |
integer; specifies the index of the
corresponding peak in the other replicate (foreign key). If no
corresponding peak was found, rep_idx is set to NA .
|
rep1_df <- idr2d:::chipseq$rep1_df rep1_df$value <- preprocess(rep1_df$value, "log") rep2_df <- idr2d:::chipseq$rep2_df rep2_df$value <- preprocess(rep2_df$value, "log") mapping <- establish_bijection1d(rep1_df, rep2_df)
rep1_df <- idr2d:::chipseq$rep1_df rep1_df$value <- preprocess(rep1_df$value, "log") rep2_df <- idr2d:::chipseq$rep2_df rep2_df$value <- preprocess(rep2_df$value, "log") mapping <- establish_bijection1d(rep1_df, rep2_df)
This method establishes a bijective assignment between interactions from
replicate 1 and 2. An interaction in replicate 1 is assigned to an
interaction in replicate 2 if and only if (1) both anchors of the
interactions
overlap (or the gap between anchor A/B in replicate 1 and 2 is less than
or equal to max_gap
), and (2) there is no other interaction in
replicate 2 that overlaps with the interaction in replicate 1 and has a
lower ambiguity resolution value.
establish_bijection2d( rep1_df, rep2_df, ambiguity_resolution_method = c("overlap", "midpoint", "value"), max_gap = -1L )
establish_bijection2d( rep1_df, rep2_df, ambiguity_resolution_method = c("overlap", "midpoint", "value"), max_gap = -1L )
rep1_df |
data frame of observations (i.e., genomic interactions) of replicate 1, with at least the following columns (position of columns matter, column names are irrelevant):
|
|||||||||||||||||||||
rep2_df |
data frame of observations (i.e., genomic interactions) of replicate 2, with the following columns (position of columns matter, column names are irrelevant):
|
|||||||||||||||||||||
ambiguity_resolution_method |
defines how ambiguous assignments (when one interaction in replicate 1 overlaps with multiple interactions in replicate 2 or vice versa) are resolved. Available methods:
|
|||||||||||||||||||||
max_gap |
integer; maximum gap in nucleotides allowed between two anchors for them to be considered as overlapping (defaults to -1, i.e., overlapping anchors) |
Data frames rep1_df
and rep2_df
with
the following columns:
column 1: | chr_a |
character; genomic location of anchor A -
chromosome (e.g., "chr3" ) |
column 2: | start_a |
integer; genomic location of anchor A - start coordinate |
column 3: | end_a |
integer; genomic location of anchor A - end coordinate |
column 4: | chr_b |
character; genomic location of anchor B -
chromosome (e.g., "chr3" ) |
column 5: | start_b |
integer; genomic location of anchor B - start coordinate |
column 6: | end_b |
integer; genomic location of anchor B - end coordinate |
column 7: | value |
numeric; p-value, FDR, or heuristic used to rank the interactions |
column 8: | "rep_value" |
numeric; value of corresponding
replicate interaction. If no corresponding interaction was found,
rep_value is set to NA . |
column 9: | "rank" |
integer; rank of the interaction, established by value column, ascending order |
column 10: | "rep_rank" |
integer; rank of corresponding
replicate interaction. If no corresponding interaction was found,
rep_rank is set to NA . |
column 11: | "idx" |
integer; interaction index, primary key |
column 12: | "rep_idx" |
integer; specifies the index of the
corresponding interaction in the other replicate (foreign key). If no
corresponding interaction was found, rep_idx is set to NA .
|
rep1_df <- idr2d:::chiapet$rep1_df rep1_df$fdr <- preprocess(rep1_df$fdr, "log_additive_inverse") rep2_df <- idr2d:::chiapet$rep2_df rep2_df$fdr <- preprocess(rep2_df$fdr, "log_additive_inverse") mapping <- establish_bijection2d(rep1_df, rep2_df)
rep1_df <- idr2d:::chiapet$rep1_df rep1_df$fdr <- preprocess(rep1_df$fdr, "log_additive_inverse") rep2_df <- idr2d:::chiapet$rep2_df rep2_df$fdr <- preprocess(rep2_df$fdr, "log_additive_inverse") mapping <- establish_bijection2d(rep1_df, rep2_df)
This method returns all overlapping interactions between two replicates.
For each pair of overlapping interactions, the
ambiguity resolution value (ARV) is calculated, which helps to reduce
the m:n mapping to a 1:1 mapping. The semantics of the ARV depend on the
specified ambiguity_resolution_method
, but in general interaction
pairs with lower ARVs have priority over interaction pairs with higher ARVs
when the bijective mapping is established.
establish_overlap1d( rep1_df, rep2_df, ambiguity_resolution_method = c("overlap", "midpoint", "value"), max_gap = -1L )
establish_overlap1d( rep1_df, rep2_df, ambiguity_resolution_method = c("overlap", "midpoint", "value"), max_gap = -1L )
rep1_df |
data frame of observations (i.e., genomic peaks) of replicate 1, with at least the following columns (position of columns matter, column names are irrelevant):
|
||||||||||||
rep2_df |
data frame of observations (i.e., genomic peaks) of replicate 2, with the following columns (position of columns matter, column names are irrelevant):
|
||||||||||||
ambiguity_resolution_method |
defines how ambiguous assignments (when one interaction in replicate 1 overlaps with multiple interactions in replicate 2 or vice versa) are resolved. Available methods:
|
||||||||||||
max_gap |
integer; maximum gap in nucleotides allowed between two anchors for them to be considered as overlapping (defaults to -1, i.e., overlapping anchors) |
data frame with the following columns:
column 1: | rep1_idx |
index of interaction in replicate 1
(i.e., row index in rep1_df ) |
column 2: | rep2_idx |
index of interaction in replicate 2
(i.e., row index in rep2_df ) |
column 3: | arv |
ambiguity resolution value used turn
m:n mapping into 1:1 mapping. Interaction pairs with lower arv
are prioritized.
|
rep1_df <- idr2d:::chipseq$rep1_df rep1_df$value <- preprocess(rep1_df$value, "log_additive_inverse") rep2_df <- idr2d:::chipseq$rep2_df rep2_df$value <- preprocess(rep2_df$value, "log_additive_inverse") # shuffle to break preexisting order rep1_df <- rep1_df[sample.int(nrow(rep1_df)), ] rep2_df <- rep2_df[sample.int(nrow(rep2_df)), ] # sort by value column rep1_df <- dplyr::arrange(rep1_df, value) rep2_df <- dplyr::arrange(rep2_df, value) pairs_df <- establish_overlap1d(rep1_df, rep2_df)
rep1_df <- idr2d:::chipseq$rep1_df rep1_df$value <- preprocess(rep1_df$value, "log_additive_inverse") rep2_df <- idr2d:::chipseq$rep2_df rep2_df$value <- preprocess(rep2_df$value, "log_additive_inverse") # shuffle to break preexisting order rep1_df <- rep1_df[sample.int(nrow(rep1_df)), ] rep2_df <- rep2_df[sample.int(nrow(rep2_df)), ] # sort by value column rep1_df <- dplyr::arrange(rep1_df, value) rep2_df <- dplyr::arrange(rep2_df, value) pairs_df <- establish_overlap1d(rep1_df, rep2_df)
This method returns all overlapping interactions between two replicates.
For each pair of overlapping interactions, the
ambiguity resolution value (ARV) is calculated, which helps to reduce
the m:n mapping to a 1:1 mapping. The semantics of the ARV depend on the
specified ambiguity_resolution_method
, but in general interaction
pairs with lower ARVs have priority over interaction pairs with higher ARVs
when the bijective mapping is established.
establish_overlap2d( rep1_df, rep2_df, ambiguity_resolution_method = c("overlap", "midpoint", "value"), max_gap = -1L )
establish_overlap2d( rep1_df, rep2_df, ambiguity_resolution_method = c("overlap", "midpoint", "value"), max_gap = -1L )
rep1_df |
data frame of observations (i.e., genomic interactions) of replicate 1, with at least the following columns (position of columns matter, column names are irrelevant):
|
|||||||||||||||||||||
rep2_df |
data frame of observations (i.e., genomic interactions) of replicate 2, with the following columns (position of columns matter, column names are irrelevant):
|
|||||||||||||||||||||
ambiguity_resolution_method |
defines how ambiguous assignments (when one interaction in replicate 1 overlaps with multiple interactions in replicate 2 or vice versa) are resolved. Available methods:
|
|||||||||||||||||||||
max_gap |
integer; maximum gap in nucleotides allowed between two anchors for them to be considered as overlapping (defaults to -1, i.e., overlapping anchors) |
data frame with the following columns:
column 1: | rep1_idx |
index of interaction in replicate 1
(i.e., row index in rep1_df ) |
column 2: | rep2_idx |
index of interaction in replicate 2
(i.e., row index in rep2_df ) |
column 3: | arv |
ambiguity resolution value used turn
m:n mapping into 1:1 mapping. Interaction pairs with lower arv
are prioritized.
|
rep1_df <- idr2d:::chiapet$rep1_df rep1_df$fdr <- preprocess(rep1_df$fdr, "log_additive_inverse") rep2_df <- idr2d:::chiapet$rep2_df rep2_df$fdr <- preprocess(rep2_df$fdr, "log_additive_inverse") # shuffle to break preexisting order rep1_df <- rep1_df[sample.int(nrow(rep1_df)), ] rep2_df <- rep2_df[sample.int(nrow(rep2_df)), ] # sort by value column rep1_df <- dplyr::arrange(rep1_df, rep1_df$fdr) rep2_df <- dplyr::arrange(rep2_df, rep2_df$fdr) pairs_df <- establish_overlap2d(rep1_df, rep2_df)
rep1_df <- idr2d:::chiapet$rep1_df rep1_df$fdr <- preprocess(rep1_df$fdr, "log_additive_inverse") rep2_df <- idr2d:::chiapet$rep2_df rep2_df$fdr <- preprocess(rep2_df$fdr, "log_additive_inverse") # shuffle to break preexisting order rep1_df <- rep1_df[sample.int(nrow(rep1_df)), ] rep2_df <- rep2_df[sample.int(nrow(rep2_df)), ] # sort by value column rep1_df <- dplyr::arrange(rep1_df, rep1_df$fdr) rep2_df <- dplyr::arrange(rep2_df, rep2_df$fdr) pairs_df <- establish_overlap2d(rep1_df, rep2_df)
Estimates IDR for Genomic Peaks or Genomic Interactions
estimate_idr( rep1_df, rep2_df, analysis_type = "IDR2D", value_transformation = c("identity", "additive_inverse", "multiplicative_inverse", "log", "log_additive_inverse"), ambiguity_resolution_method = c("overlap", "midpoint", "value"), remove_nonstandard_chromosomes = TRUE, max_factor = 1.5, jitter_factor = 1e-04, max_gap = -1L, mu = 0.1, sigma = 1, rho = 0.2, p = 0.5, eps = 0.001, max_iteration = 30, local_idr = TRUE )
estimate_idr( rep1_df, rep2_df, analysis_type = "IDR2D", value_transformation = c("identity", "additive_inverse", "multiplicative_inverse", "log", "log_additive_inverse"), ambiguity_resolution_method = c("overlap", "midpoint", "value"), remove_nonstandard_chromosomes = TRUE, max_factor = 1.5, jitter_factor = 1e-04, max_gap = -1L, mu = 0.1, sigma = 1, rho = 0.2, p = 0.5, eps = 0.001, max_iteration = 30, local_idr = TRUE )
rep1_df |
data frame of observations (i.e., genomic peaks or genomic
interactions) of
replicate 1. If |
||||||||||
rep2_df |
data frame of observations (i.e., genomic peaks or genomic
interactions) of replicate 2. Same columns as |
||||||||||
analysis_type |
"IDR2D" for genomic interaction data sets, "IDR1D" for genomic peak data sets |
||||||||||
value_transformation |
the values in
either |
||||||||||
ambiguity_resolution_method |
defines how ambiguous assignments
(when one interaction or peak in replicate 1 overlaps with
multiple interactions or peaks in replicate 2 or vice versa)
are resolved. For available methods, see |
||||||||||
remove_nonstandard_chromosomes |
removes peaks and interactions
containing
genomic locations on non-standard chromosomes using
|
||||||||||
max_factor |
numeric; controls the replacement values for |
||||||||||
jitter_factor |
numeric; controls the magnitude of the noise that
is added to |
||||||||||
max_gap |
integer; maximum gap in nucleotides allowed between two anchors for them to be considered as overlapping (defaults to -1, i.e., overlapping anchors) |
||||||||||
mu |
a starting value for the mean of the reproducible component. |
||||||||||
sigma |
a starting value for the standard deviation of the reproducible component. |
||||||||||
rho |
a starting value for the correlation coefficient of the reproducible component. |
||||||||||
p |
a starting value for the proportion of reproducible component. |
||||||||||
eps |
Stopping criterion. Iterations stop when the increment of log-likelihood is < eps*log-likelihood, Default=0.001. |
||||||||||
max_iteration |
integer; maximum number of iterations for IDR estimation (defaults to 30) |
||||||||||
local_idr |
see |
See estimate_idr1d
or
estimate_idr2d
, respectively.
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
idr_results <- estimate_idr(idr2d:::chiapet$rep1_df, idr2d:::chiapet$rep2_df, analysis_type = "IDR2D", value_transformation = "log_additive_inverse") summary(idr_results)
idr_results <- estimate_idr(idr2d:::chiapet$rep1_df, idr2d:::chiapet$rep2_df, analysis_type = "IDR2D", value_transformation = "log_additive_inverse") summary(idr_results)
This method estimates Irreproducible Discovery Rates (IDR) for peaks in replicated ChIP-seq experiments.
estimate_idr1d( rep1_df, rep2_df, value_transformation = c("identity", "additive_inverse", "multiplicative_inverse", "log", "log_additive_inverse"), ambiguity_resolution_method = c("overlap", "midpoint", "value"), remove_nonstandard_chromosomes = TRUE, max_factor = 1.5, jitter_factor = 1e-04, max_gap = -1L, mu = 0.1, sigma = 1, rho = 0.2, p = 0.5, eps = 0.001, max_iteration = 30, local_idr = TRUE )
estimate_idr1d( rep1_df, rep2_df, value_transformation = c("identity", "additive_inverse", "multiplicative_inverse", "log", "log_additive_inverse"), ambiguity_resolution_method = c("overlap", "midpoint", "value"), remove_nonstandard_chromosomes = TRUE, max_factor = 1.5, jitter_factor = 1e-04, max_gap = -1L, mu = 0.1, sigma = 1, rho = 0.2, p = 0.5, eps = 0.001, max_iteration = 30, local_idr = TRUE )
rep1_df |
data frame of observations (i.e., genomic peaks) of replicate 1, with at least the following columns (position of columns matter, column names are irrelevant):
|
||||||||||||
rep2_df |
data frame of observations (i.e., genomic peaks) of replicate 2, with the following columns (position of columns matter, column names are irrelevant):
|
||||||||||||
value_transformation |
the values in
either |
||||||||||||
ambiguity_resolution_method |
defines how ambiguous assignments (when one interaction in replicate 1 overlaps with multiple interactions in replicate 2 or vice versa) are resolved. Available methods:
|
||||||||||||
remove_nonstandard_chromosomes |
removes peaks containing
genomic locations on non-standard chromosomes using
|
||||||||||||
max_factor |
numeric; controls the replacement values for |
||||||||||||
jitter_factor |
numeric; controls the magnitude of the noise that
is added to |
||||||||||||
max_gap |
integer; maximum gap in nucleotides allowed between two anchors for them to be considered as overlapping (defaults to -1, i.e., overlapping anchors) |
||||||||||||
mu |
a starting value for the mean of the reproducible component. |
||||||||||||
sigma |
a starting value for the standard deviation of the reproducible component. |
||||||||||||
rho |
a starting value for the correlation coefficient of the reproducible component. |
||||||||||||
p |
a starting value for the proportion of reproducible component. |
||||||||||||
eps |
Stopping criterion. Iterations stop when the increment of log-likelihood is < eps*log-likelihood, Default=0.001. |
||||||||||||
max_iteration |
integer; maximum number of iterations for IDR estimation (defaults to 30) |
||||||||||||
local_idr |
see |
List with three components, (rep1_df
, rep2_df
,
and analysis_type
) containing the interactions from input
data frames rep1_df
and rep2_df
with
the following additional columns:
column 1: | chr |
character; genomic location of peak -
chromosome (e.g., "chr3" ) |
column 2: | start |
integer; genomic location of peak - start coordinate |
column 3: | end |
integer; genomic location of peak - end coordinate |
column 4: | value |
numeric; p-value, FDR, or heuristic used to rank the peaks |
column 5: | rep_value |
numeric; value of corresponding
replicate peak. If no corresponding peak was found, rep_value is set
to NA . |
column 6: | rank |
integer; rank of the peak, established by value column, ascending order |
column 7: | rep_rank |
integer; rank of corresponding
replicate peak. If no corresponding peak was found, rep_rank is
set to NA . |
column 8: | idx |
integer; peak index, primary key |
column 9: | rep_idx |
integer; specifies the index of the
corresponding peak in the other replicate (foreign key). If no
corresponding peak was found, rep_idx is set to NA . |
column 10: | idr |
IDR of the peak and the
corresponding peak in the other replicate. If no corresponding
peak was found, idr is set to NA .
|
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
idr_results <- estimate_idr1d(idr2d:::chipseq$rep1_df, idr2d:::chipseq$rep2_df, value_transformation = "log") summary(idr_results)
idr_results <- estimate_idr1d(idr2d:::chipseq$rep1_df, idr2d:::chipseq$rep2_df, value_transformation = "log") summary(idr_results)
This method estimates Irreproducible Discovery Rates (IDR) between two replicates of experiments identifying genomic interactions, such as Hi-C, ChIA-PET, and HiChIP.
estimate_idr2d( rep1_df, rep2_df, value_transformation = c("identity", "additive_inverse", "multiplicative_inverse", "log", "log_additive_inverse"), ambiguity_resolution_method = c("overlap", "midpoint", "value"), remove_nonstandard_chromosomes = TRUE, max_factor = 1.5, jitter_factor = 1e-04, max_gap = -1L, mu = 0.1, sigma = 1, rho = 0.2, p = 0.5, eps = 0.001, max_iteration = 30, local_idr = TRUE )
estimate_idr2d( rep1_df, rep2_df, value_transformation = c("identity", "additive_inverse", "multiplicative_inverse", "log", "log_additive_inverse"), ambiguity_resolution_method = c("overlap", "midpoint", "value"), remove_nonstandard_chromosomes = TRUE, max_factor = 1.5, jitter_factor = 1e-04, max_gap = -1L, mu = 0.1, sigma = 1, rho = 0.2, p = 0.5, eps = 0.001, max_iteration = 30, local_idr = TRUE )
rep1_df |
data frame of observations (i.e., genomic interactions) of replicate 1, with at least the following columns (position of columns matter, column names are irrelevant):
|
|||||||||||||||||||||
rep2_df |
data frame of observations (i.e., genomic interactions) of replicate 2, with the following columns (position of columns matter, column names are irrelevant):
|
|||||||||||||||||||||
value_transformation |
the values in
either |
|||||||||||||||||||||
ambiguity_resolution_method |
defines how ambiguous assignments (when one interaction in replicate 1 overlaps with multiple interactions in replicate 2 or vice versa) are resolved. Available methods:
|
|||||||||||||||||||||
remove_nonstandard_chromosomes |
removes interactions
containing
genomic locations on non-standard chromosomes using
|
|||||||||||||||||||||
max_factor |
numeric; controls the replacement values for |
|||||||||||||||||||||
jitter_factor |
numeric; controls the magnitude of the noise that
is added to |
|||||||||||||||||||||
max_gap |
integer; maximum gap in nucleotides allowed between two anchors for them to be considered as overlapping (defaults to -1, i.e., overlapping anchors) |
|||||||||||||||||||||
mu |
a starting value for the mean of the reproducible component. |
|||||||||||||||||||||
sigma |
a starting value for the standard deviation of the reproducible component. |
|||||||||||||||||||||
rho |
a starting value for the correlation coefficient of the reproducible component. |
|||||||||||||||||||||
p |
a starting value for the proportion of reproducible component. |
|||||||||||||||||||||
eps |
Stopping criterion. Iterations stop when the increment of log-likelihood is < eps*log-likelihood, Default=0.001. |
|||||||||||||||||||||
max_iteration |
integer; maximum number of iterations for IDR estimation (defaults to 30) |
|||||||||||||||||||||
local_idr |
see |
List with three components, (rep1_df
, rep2_df
,
and analysis_type
) containing the interactions from input
data frames rep1_df
and rep2_df
with
the following additional columns:
column 1: | chr_a |
character; genomic location of anchor A -
chromosome (e.g., "chr3" ) |
column 2: | start_a |
integer; genomic location of anchor A - start coordinate |
column 3: | end_a |
integer; genomic location of anchor A - end coordinate |
column 4: | chr_b |
character; genomic location of anchor B -
chromosome (e.g., "chr3" ) |
column 5: | start_b |
integer; genomic location of anchor B - start coordinate |
column 6: | end_b |
integer; genomic location of anchor B - end coordinate |
column 7: | value |
numeric; p-value, FDR, or heuristic used to rank the interactions |
column 8: | "rep_value" |
numeric; value of corresponding
replicate interaction. If no corresponding interaction was found,
rep_value is set to NA . |
column 9: | "rank" |
integer; rank of the interaction, established by value column, ascending order |
column 10: | "rep_rank" |
integer; rank of corresponding
replicate interaction. If no corresponding interaction was found,
rep_rank is set to NA . |
column 11: | "idx" |
integer; interaction index, primary key |
column 12: | "rep_idx" |
integer; specifies the index of the
corresponding interaction in the other replicate (foreign key). If no
corresponding interaction was found, rep_idx is set to NA . |
idr |
IDR of the interaction and the
corresponding interaction in the other replicate. If no corresponding
interaction was found, idr is set to NA .
|
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
idr_results <- estimate_idr2d(idr2d:::chiapet$rep1_df, idr2d:::chiapet$rep2_df, value_transformation = "log_additive_inverse") summary(idr_results)
idr_results <- estimate_idr2d(idr2d:::chiapet$rep1_df, idr2d:::chiapet$rep2_df, value_transformation = "log_additive_inverse") summary(idr_results)
This method estimates Irreproducible Discovery Rates (IDR) of genomic interactions between two replicates of Hi-C experiments.
Before calling this method, call Juicer .hic contact matrix c
The contact matrix is subdivided into blocks, where the block size is
determined by resolution
. The reads per block are used to rank blocks
and replicate blocks are easily matched by genomic location.
estimate_idr2d_hic( rep1_df, rep2_df, combined_min_value = 30, combined_max_value = Inf, min_value = -Inf, max_value = Inf, max_factor = 1.5, jitter_factor = 1e-04, mu = 0.1, sigma = 1, rho = 0.2, p = 0.5, eps = 0.001, max_iteration = 30, local_idr = TRUE )
estimate_idr2d_hic( rep1_df, rep2_df, combined_min_value = 30, combined_max_value = Inf, min_value = -Inf, max_value = Inf, max_factor = 1.5, jitter_factor = 1e-04, mu = 0.1, sigma = 1, rho = 0.2, p = 0.5, eps = 0.001, max_iteration = 30, local_idr = TRUE )
rep1_df |
data frame of either parsed .hic file from Juicer (output of
|
rep2_df |
data frame of either parsed .hic file from Juicer (output of
|
combined_min_value |
exclude blocks with a combined (replicate 1 +
replicate 2) read count or normalized read count of less than
|
combined_max_value |
exclude blocks with a combined (replicate 1 +
replicate 2) read count or normalized read count of more than
|
min_value |
exclude blocks with a read count or normalized read count
of less than |
max_value |
exclude blocks with a read count or normalized read count
of more than |
max_factor |
numeric; controls the replacement values for |
jitter_factor |
numeric; controls the magnitude of the noise that
is added to |
mu |
a starting value for the mean of the reproducible component. |
sigma |
a starting value for the standard deviation of the reproducible component. |
rho |
a starting value for the correlation coefficient of the reproducible component. |
p |
a starting value for the proportion of reproducible component. |
eps |
Stopping criterion. Iterations stop when the increment of log-likelihood is < eps*log-likelihood, Default=0.001. |
max_iteration |
integer; maximum number of iterations for IDR estimation (defaults to 30) |
local_idr |
see |
Data frame with the following columns:
column 1: | interaction |
character; genomic location of
interaction block (e.g., "chr1:204940000-204940000" ) |
column 2: | value |
numeric; p-value, FDR, or heuristic used to rank the interactions |
column 3: | "rep_value" |
numeric; value of corresponding replicate interaction |
column 4: | "rank" |
integer; rank of the interaction, established by value column, ascending order |
column 5: | "rep_rank" |
integer; rank of corresponding replicate interaction |
column 6: | "idr" |
integer; IDR of the block and the corresponding block in the other replicate |
Q. Li, J. B. Brown, H. Huang and P. J. Bickel. (2011) Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.
idr_results_df <- estimate_idr2d_hic(idr2d:::hic$rep1_df, idr2d:::hic$rep2_df) summary(idr_results_df)
idr_results_df <- estimate_idr2d_hic(idr2d:::hic$rep1_df, idr2d:::hic$rep2_df) summary(idr_results_df)
This object contains data from a Hi-C contact map of human chromosome 1 and a resolution of 2.5 * 10^6, extracted from GEO series GSE71831.
hic
hic
A list with two components, the data frames rep1_df
and
rep2_df
, which have the following four columns:
column 1: | chr |
character; genomic location of block -
chromosome (e.g., "chr3" ) |
column 2: | region1 |
integer; genomic location of block - coordinate A |
column 3: | region2 |
integer; genomic location of block - coordinate B |
column 4: | value |
numeric; heuristic used to rank blocks, in this case: number of reads |
This function is used to convert the contact matrix from a HiC-Pro pipeline
analysis run into an IDR2D compatible format. It takes one .matrix and one
.bed file per replicate from HiC-Pro and returns the contact matrix for a
specific chromosome for IDR2D analysis (see estimate_idr2d_hic
)
parse_hic_pro_matrix(matrix_file, bed_file, chromosome = "chr1")
parse_hic_pro_matrix(matrix_file, bed_file, chromosome = "chr1")
matrix_file |
path to .matrix file from HiC-Pro analysis run |
bed_file |
path to .bed file from HiC-Pro analysis run |
chromosome |
chromsome name to be analyzed, defaults to
UCSC chromosome 1 ( |
Data frame with the following columns:
column 1: | chr |
character; chromosome of block
(e.g., "chr3" ) |
column 2: | region1 |
integer; genomic location of side A of block in Hi-C contact matrix |
column 3: | region2 |
integer; genomic location of side B of block in Hi-C contact matrix |
column 4: | value |
numeric; (normalized) read count in block |
Servant, N., Varoquaux, N., Lajoie, B.R. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol 16, 259 (2015) doi:10.1186/s13059-015-0831-x
parse_juicer_matrix
uses the Python package hic-straw
internally to read .hic contact matrix files (see
hic-straw on PyPI or the
Aiden lab GitHub repository
for more information).
The contact matrix is subdivided into blocks, where the block size is
determined by resolution
. The reads per block are used to rank blocks
and replicate blocks are easily matched by genomic location.
parse_juicer_matrix( hic_file, resolution = 1e+06, normalization = c("NONE", "VC", "VC_SQRT", "KR"), chromosome = "chr1", use_python = NULL, use_virtualenv = NULL, use_condaenv = NULL )
parse_juicer_matrix( hic_file, resolution = 1e+06, normalization = c("NONE", "VC", "VC_SQRT", "KR"), chromosome = "chr1", use_python = NULL, use_virtualenv = NULL, use_condaenv = NULL )
hic_file |
path to .hic file (either local file path or URL). |
resolution |
block resolution of Hi-C contact matrix in base pairs, defaults to 1,000,000 bp (usually one of the following: 2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000) |
normalization |
normalization step performed by Python package
|
chromosome |
chromsome name to be analyzed,
defaults to UCSC chromosome 1 ( |
use_python |
if Python is not on PATH, specify path to Python binary
here (see |
use_virtualenv |
if Python package |
use_condaenv |
if Python package |
Data frame with the following columns:
column 1: | chr |
character; chromosome of block
(e.g., "chr3" ) |
column 2: | region1 |
integer; genomic location of side A of block in Hi-C contact matrix |
column 3: | region2 |
integer; genomic location of side B of block in Hi-C contact matrix |
column 4: | value |
numeric; (normalized) read count in block |
Neva C. Durand, James T. Robinson, Muhammad S. Shamim, Ido Machol, Jill P. Mesirov, Eric S. Lander, and Erez Lieberman Aiden. "Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom." Cell Systems 3(1), 2016.
This method removes invalid values, establishes the correct ranking, and breaks ties prior to IDR analysis.
Inf
and -Inf
are replaced by max(x) * max_factor
and min(x) / max_factor
, respectively.
NA
values in x
are replaced by mean(x)
.
All values in x
are transformed using the transformation specified
in value_transformation
.
Lastly, a small amount of noise is added to x
to break ties. The
magnitude of the noise is controlled by jitter_factor
.
preprocess( x, value_transformation = c("identity", "additive_inverse", "multiplicative_inverse", "log", "log_additive_inverse"), max_factor = 1.5, jitter_factor = 1e-04 )
preprocess( x, value_transformation = c("identity", "additive_inverse", "multiplicative_inverse", "log", "log_additive_inverse"), max_factor = 1.5, jitter_factor = 1e-04 )
x |
numeric vector of values |
||||||||||
value_transformation |
the values in
either |
||||||||||
max_factor |
numeric; controls the replacement values for |
||||||||||
jitter_factor |
numeric; controls the magnitude of the noise that
is added to |
numeric vector; transformed and stripped values of x
, ready
for IDR analysis
rep1_df <- idr2d:::chiapet$rep1_df rep1_df$fdr <- preprocess(rep1_df$fdr, "log_additive_inverse")
rep1_df <- idr2d:::chiapet$rep1_df rep1_df$fdr <- preprocess(rep1_df$fdr, "log_additive_inverse")
Removes Peaks on Non-standard Chromosomes
remove_nonstandard_chromosomes1d(x)
remove_nonstandard_chromosomes1d(x)
x |
data frame of genomic peaks, with the following columns (position of columns matter, column names are irrelevant):
|
x
without non-standard chromosomes.
rep1_df <- remove_nonstandard_chromosomes1d(idr2d:::chipseq$rep1_df)
rep1_df <- remove_nonstandard_chromosomes1d(idr2d:::chipseq$rep1_df)
Removes Interactions on Non-standard Chromosomes
remove_nonstandard_chromosomes2d(x)
remove_nonstandard_chromosomes2d(x)
x |
data frame of genomic interactions, with the following columns (position of columns matter, column names are irrelevant):
|
x
without non-standard chromosomes.
rep1_df <- remove_nonstandard_chromosomes2d(idr2d:::chiapet$rep1_df)
rep1_df <- remove_nonstandard_chromosomes2d(idr2d:::chiapet$rep1_df)