TADCompare is an R package for differential analysis of TAD
boundaries. It is designed to work on a wide range of formats and
resolutions of Hi-C data. The TADCompare package contains four
functions: TADCompare, TimeCompare, ConsensusTADs, and DiffPlot.
The TADCompare function identifies differential TAD boundaries between
two contact matrices. The TimeCompare function takes a set of contact
matrices, one matrix per time point, identifies TAD boundaries, and
classifies how they change over time. The ConsensusTADs function takes
a list of TADs and identifies a consensus of TAD boundaries across all
matrices using our novel consensus boundary score. The DiffPlot
function visualizes TAD boundary differences between two matrices.
Input matrices may be in sparse 3-column, n × n, or n × (n + 3) format.
This vignette provides a complete overview of input data formats.
n × n contact matrices are most commonly associated with data coming from the Bing Ren lab (http://chromosome.sdsc.edu/mouse/hi-c/download.html). These contact matrices are square and symmetric, with entry ij corresponding to the number of contacts between region i and region j. Below is an example of a 5 × 5 region of an n × n contact matrix derived from the GM12878 cell line (Rao et al. 2014), chromosome 22, at 50kb resolution. Note the symmetry around the diagonal, the typical shape of a chromatin interaction matrix. The figure was created using the pheatmap package.
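As a small illustration of the n × n layout, the sketch below builds a toy 4 × 4 contact matrix in base R and checks its symmetry. The counts and bin coordinates are invented for illustration and are not taken from the Rao et al. data.

```r
# Toy 4 x 4 contact matrix; bin names and counts are invented for illustration
bins <- c("18500000", "18550000", "18600000", "18650000")
toy_mat <- matrix(c(20,  8,  3,  1,
                     8, 25,  9,  2,
                     3,  9, 18,  7,
                     1,  2,  7, 22),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(bins, bins))

# Entry [i, j] equals entry [j, i], so the matrix is symmetric
isSymmetric(toy_mat)

# Number of contacts between the first and the third bin
toy_mat["18500000", "18600000"]
```

Row and column names hold the start coordinate of each bin, which is the same convention the benchmark code later in this vignette relies on when it calls colnames() on the example matrices.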
n × (n + 3) matrices are commonly associated with the TopDom TAD caller
(http://zhoulab.usc.edu/TopDom/). These matrices consist of an n × n
matrix with three additional leading columns containing the chromosome,
the start of the region, and the end of the region. Regions in this case
are determined by the resolution of the data. A subset of a typical
n × (n + 3) matrix is shown below.
## chr start end X18500000 X18550000 X18600000 X18650000
## 1 chr22 18500000 18550000 13313 4817 1664 96
## 2 chr22 18550000 18600000 4817 15500 5120 178
## 3 chr22 18600000 18650000 1664 5120 11242 316
## 4 chr22 18650000 18700000 96 178 316 162
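The relationship between the two formats can be sketched in a few lines of base R. The example below rebuilds the four-bin subset printed above as an n × (n + 3) data frame and then drops the three leading columns to recover the n × n matrix. It is only an illustration of the layout, not the package's internal code.

```r
# Rebuild the 4-bin subset printed above as an n x (n + 3) data frame
starts <- c(18500000L, 18550000L, 18600000L, 18650000L)
counts <- matrix(c(13313,  4817,  1664,  96,
                    4817, 15500,  5120, 178,
                    1664,  5120, 11242, 316,
                      96,   178,   316, 162),
                 nrow = 4, byrow = TRUE)
colnames(counts) <- paste0("X", starts)
n_n_3 <- data.frame(chr = "chr22", start = starts,
                    end = starts + 50000L, counts)

# Dropping the three leading columns recovers the n x n contact matrix
n_n <- as.matrix(n_n_3[, -(1:3)])
dimnames(n_n) <- list(as.character(n_n_3$start), as.character(n_n_3$start))
dim(n_n)
isSymmetric(n_n)
```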
Sparse 3-column matrices are matrices where the first and second
columns refer to region i and region j of the chromosome, and the third
column is the number of contacts between them. This style is becoming
increasingly popular and is associated with raw data from the
Lieberman-Aiden lab (e.g., https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63525),
and it is the output format of the Juicer tool (Durand et al. 2016).
3-column matrices are handled internally in the package by converting
them to n × n matrices using the HiCcompare package’s sparse2full()
function. The first 6 rows of a typical sparse 3-column matrix are
shown below.
## region1 region2 IF
## <num> <num> <num>
## 1: 16050000 16050000 12
## 2: 16200000 16200000 4
## 3: 16150000 16300000 1
## 4: 16200000 16300000 1
## 5: 16250000 16300000 1
## 6: 16300000 16300000 10
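To make the conversion concrete, here is a base-R sketch of what turning a sparse 3-column matrix into a full symmetric matrix involves, using the six rows printed above. The helper sparse_to_full is a hypothetical name introduced for this illustration; it is not the HiCcompare implementation, and bins that never appear in the sparse table are simply absent from the result.

```r
# The six rows of the sparse matrix shown above
sparse <- data.frame(
  region1 = c(16050000, 16200000, 16150000, 16200000, 16250000, 16300000),
  region2 = c(16050000, 16200000, 16300000, 16300000, 16300000, 16300000),
  IF      = c(12, 4, 1, 1, 1, 10)
)

# Hypothetical helper for illustration; not HiCcompare::sparse2full()
sparse_to_full <- function(sparse) {
  # Every bin that occurs in either coordinate column, in genomic order
  bins <- sort(unique(c(sparse$region1, sparse$region2)))
  labels <- format(bins, scientific = FALSE, trim = TRUE)
  full <- matrix(0, nrow = length(bins), ncol = length(bins),
                 dimnames = list(labels, labels))
  i <- match(sparse$region1, bins)
  j <- match(sparse$region2, bins)
  full[cbind(i, j)] <- sparse$IF  # fill one triangle
  full[cbind(j, i)] <- sparse$IF  # mirror it to make the matrix symmetric
  full
}

full <- sparse_to_full(sparse)
full
```

Pairs of regions with no recorded contacts become zeros in the full matrix, which is why the sparse format is more compact on disk.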
.hic files are commonly associated with the lab of Erez Lieberman-Aiden (http://aidenlab.org/data.html). To use .hic files, follow these steps:
1. Download straw from https://github.com/aidenlab/straw/ and follow the installation instructions.
2. Download the .hic files, for example:
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE63nnn/GSE63525/suppl/GSE63525_GM12878_insitu_primary_30.hic
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE63nnn/GSE63525/suppl/GSE63525_GM12878_insitu_replicate_30.hic
3. Extract the chromosome 22 contact matrices at 50kb resolution:
./straw NONE GSE63525_GM12878_insitu_primary_30.hic 22 22 BP 50000 > primary.chr22.50kb.txt
./straw NONE GSE63525_GM12878_insitu_replicate_30.hic 22 22 BP 50000 > replicate.chr22.50kb.txt
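The text files written by straw hold three columns (the two bin coordinates and the contact count), so they can be read in as sparse 3-column matrices. Below is a minimal sketch of that import step, assuming the two files produced above are in the working directory. The helper name read_straw and the column names are chosen here for readability and are not part of straw or TADCompare.

```r
# Hypothetical helper: read a straw text dump as a sparse 3-column matrix
read_straw <- function(path) {
  read.table(path, header = FALSE,
             col.names = c("region1", "region2", "IF"))
}

prim_file <- "primary.chr22.50kb.txt"
rep_file  <- "replicate.chr22.50kb.txt"

# Only run the comparison when the straw output and the package are available
if (file.exists(prim_file) && file.exists(rep_file) &&
    requireNamespace("TADCompare", quietly = TRUE)) {
  prim <- read_straw(prim_file)
  repl <- read_straw(rep_file)
  # Resolution must match the bin size passed to straw (50000 bp)
  tad_diff <- TADCompare::TADCompare(prim, repl, resolution = 50000)
}
```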
Users can also find TADs from data output by cooler
(http://cooler.readthedocs.io/en/latest/index.html) and HiC-Pro
(https://github.com/nservant/HiC-Pro) with minor pre-processing using
the HiCcompare package.
The cooler software can be downloaded from https://mirnylab.github.io/cooler/. A catalog of popular Hi-C datasets can be found at ftp://cooler.csail.mit.edu/coolers. We can extract chromatin interaction data from .cool files using the following steps:
cooler dump --join Zuin2014-HEK293CtcfControl-HindIII-allreps-filtered.50kb.cool > Zuin.HEK293.50kb.Control.txt
cooler dump --join Zuin2014-HEK293CtcfDepleted-HindIII-allreps-filtered.50kb.cool > Zuin.HEK293.50kb.Depleted.txt
# Read in data
cool_mat1 <- read.table("Zuin.HEK293.50kb.Control.txt")
cool_mat2 <- read.table("Zuin.HEK293.50kb.Depleted.txt")
# Convert to sparse 3-column matrices using cooler2sparse from HiCcompare
sparse_mat1 <- HiCcompare::cooler2sparse(cool_mat1)
sparse_mat2 <- HiCcompare::cooler2sparse(cool_mat2)
# Run TADCompare on each chromosome
diff_tads <- lapply(names(sparse_mat1), function(x) {
  TADCompare(sparse_mat1[[x]], sparse_mat2[[x]], resolution = 50000)
})
# Label each result with its chromosome
names(diff_tads) <- names(sparse_mat1)
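For orientation, `cooler dump --join` writes one row per pair of bins with seven columns: chromosome, start, and end for each of the two bins, followed by the contact count. The base-R sketch below uses a toy table with invented values to show how such rows can be split into one sparse 3-column matrix per chromosome. It only illustrates the idea and is not the HiCcompare implementation of cooler2sparse.

```r
# Toy stand-in for `cooler dump --join` output; all values are invented
cool_toy <- data.frame(
  chrom1 = c("chr1", "chr1", "chr2", "chr1"),
  start1 = c(0, 0, 0, 50000),
  end1   = c(50000, 50000, 50000, 100000),
  chrom2 = c("chr1", "chr1", "chr2", "chr2"),
  start2 = c(0, 50000, 0, 0),
  end2   = c(50000, 100000, 50000, 50000),
  count  = c(10, 4, 7, 1)
)

# Keep intra-chromosomal contacts only, then split by chromosome
cis <- cool_toy[cool_toy$chrom1 == cool_toy$chrom2, ]
sparse_by_chr <- lapply(split(cis, cis$chrom1), function(d) {
  data.frame(region1 = d$start1, region2 = d$start2, IF = d$count)
})

names(sparse_by_chr)
sparse_by_chr[["chr1"]]
```

The result is a list named by chromosome, which is the shape the lapply() call over names(sparse_mat1) above expects.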
HiC-Pro data is represented as two files, the .matrix file and the .bed
file. The .bed file contains four columns (chromosome, start, end, ID).
The .matrix file is a three-column matrix where the first and second
columns contain region IDs that map back to the coordinates in the .bed
file, and the third column contains the number of contacts between the
two regions. In this example, we analyze two matrix files,
sample1_100000.matrix and sample2_100000.matrix, and their corresponding
bed files, sample1_100000_abs.bed and sample2_100000_abs.bed. We do not
include HiC-Pro data in the package, so these names serve as
placeholders for the files typically output by HiC-Pro. The steps for
analyzing these files are shown below:
# Read in both files for sample 1
mat1 <- read.table("sample1_100000.matrix")
bed1 <- read.table("sample1_100000_abs.bed")
# Read in both files for sample 2
mat2 <- read.table("sample2_100000.matrix")
bed2 <- read.table("sample2_100000_abs.bed")
# Convert to modified bed format
sparse_mats1 <- HiCcompare::hicpro2bedpe(mat1, bed1)
sparse_mats2 <- HiCcompare::hicpro2bedpe(mat2, bed2)
# Remove empty matrices if necessary
# sparse_mats1$cis <- sparse_mats1$cis[sapply(sparse_mats1$cis, nrow) != 0]
# Go through all chromosomes and run TADCompare
sparse_tads <- lapply(seq_along(sparse_mats1$cis), function(z) {
  x <- sparse_mats1$cis[[z]]
  y <- sparse_mats2$cis[[z]]
  # Pull out chromosome
  chr <- x[, 1][1]
  # Subset to make three-column matrix
  x <- x[, c(2, 5, 7)]
  y <- y[, c(2, 5, 7)]
  # Run TADCompare
  comp <- TADCompare(x, y, resolution = 100000)
  # Name the elements so they can be retrieved with $comp and $chr below
  list(comp = comp, chr = chr)
})
# Pull out differential TAD results
diff_res <- lapply(sparse_tads, function(x) x$comp)
# Pull out chromosomes
chr <- lapply(sparse_tads, function(x) x$chr)
# Name list by corresponding chr
names(diff_res) <- chr
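The mapping between the two HiC-Pro files can be illustrated with a small base-R sketch. The toy tables below use invented values, and the column names are chosen for this example only. The sketch shows the lookup of region IDs in the .bed table and is not the hicpro2bedpe implementation.

```r
# Toy stand-ins for the HiC-Pro .bed and .matrix files; values are invented
bed_toy <- data.frame(chr   = "chr1",
                      start = c(0, 100000, 200000),
                      end   = c(100000, 200000, 300000),
                      id    = 1:3)
mat_toy <- data.frame(id1 = c(1, 1, 2),
                      id2 = c(1, 3, 3),
                      IF  = c(15, 2, 6))

# Replace each region ID with the start coordinate of the matching bed row
sparse_toy <- data.frame(
  region1 = bed_toy$start[match(mat_toy$id1, bed_toy$id)],
  region2 = bed_toy$start[match(mat_toy$id2, bed_toy$id)],
  IF      = mat_toy$IF
)
sparse_toy
```

The result has the same sparse 3-column layout (region i, region j, contact count) that TADCompare accepts directly.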
The input format affects the runtime of the algorithm. n × n matrices require no conversion and are the fastest. n × (n + 3) matrices take slightly longer because the first three columns must be removed. Sparse 3-column matrices have the highest runtimes because they must first be converted to n × n matrices. The runtimes are summarized below, holding all other parameters constant.
library(TADCompare)
library(dplyr)
library(microbenchmark)
# Reading in both matrices
data("rao_chr22_prim")
data("rao_chr22_rep")
# Converting to sparse
prim_sparse <- HiCcompare::full2sparse(rao_chr22_prim)
rep_sparse <- HiCcompare::full2sparse(rao_chr22_rep)
# Converting to n x (n + 3)
# Primary
prim_n_n_3 <- data.frame(chr = "chr22",
                         start = as.numeric(colnames(rao_chr22_prim)),
                         end = as.numeric(colnames(rao_chr22_prim)) + 50000,
                         rao_chr22_prim)
# Replicate
rep_n_n_3 <- data.frame(chr = "chr22",
                        start = as.numeric(colnames(rao_chr22_rep)),
                        end = as.numeric(colnames(rao_chr22_rep)) + 50000,
                        rao_chr22_rep)
# Running each input type once
# Sparse
sparse <- TADCompare(cont_mat1 = prim_sparse, cont_mat2 = rep_sparse, resolution = 50000)
# n x n
n_by_n <- TADCompare(cont_mat1 = rao_chr22_prim, cont_mat2 = rao_chr22_rep, resolution = 50000)
# n x (n + 3)
n_by_n_3 <- TADCompare(cont_mat1 = prim_n_n_3, cont_mat2 = rep_n_n_3, resolution = 50000)
# Benchmarking the three input types
bench <- microbenchmark(
  # Sparse
  sparse <- TADCompare(cont_mat1 = prim_sparse, cont_mat2 = rep_sparse, resolution = 50000),
  # n x n
  n_by_n <- TADCompare(cont_mat1 = rao_chr22_prim, cont_mat2 = rao_chr22_rep, resolution = 50000),
  # n x (n + 3)
  n_by_n_3 <- TADCompare(cont_mat1 = prim_n_n_3, cont_mat2 = rep_n_n_3, resolution = 50000),
  times = 5, unit = "s"
)
summary_bench <- summary(bench) %>% dplyr::select(mean, median)
rownames(summary_bench) <- c("sparse", "n_by_n", "n_by_n_3")
summary_bench
## mean median
## sparse 0.21688475 0.13016980
## n_by_n 0.07472458 0.07615179
## n_by_n_3 0.08733772 0.08471655
The table above shows the mean and median runtimes, in seconds, for
each type of contact matrix. TADCompare is fast regardless of the input
format: all mean runtimes are well under a second. Sparse matrix input
is the slowest, with a mean of about 0.22 s compared with about 0.07 s
for n × n input, and the difference can become more apparent as the
size of the contact matrices increases.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] microbenchmark_1.5.0 TADCompare_1.17.0 SpectralTAD_1.23.0
## [4] dplyr_1.1.4 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 fastmap_1.2.0
## [3] digest_0.6.37 lifecycle_1.0.4
## [5] cluster_2.1.6 HiCcompare_1.29.0
## [7] magrittr_2.0.3 compiler_4.4.2
## [9] rlang_1.1.4 sass_0.4.9
## [11] tools_4.4.2 utf8_1.2.4
## [13] yaml_2.3.10 data.table_1.16.2
## [15] ggsignif_0.6.4 knitr_1.49
## [17] S4Arrays_1.7.1 PRIMME_3.2-6
## [19] DelayedArray_0.33.2 plyr_1.8.9
## [21] RColorBrewer_1.1-3 abind_1.4-8
## [23] BiocParallel_1.41.0 KernSmooth_2.23-24
## [25] withr_3.0.2 purrr_1.0.2
## [27] BiocGenerics_0.53.3 sys_3.4.3
## [29] grid_4.4.2 stats4_4.4.2
## [31] fansi_1.0.6 ggpubr_0.6.0
## [33] colorspace_2.1-1 Rhdf5lib_1.29.0
## [35] ggplot2_3.5.1 scales_1.3.0
## [37] gtools_3.9.5 SummarizedExperiment_1.37.0
## [39] cli_3.6.3 rmarkdown_2.29
## [41] crayon_1.5.3 generics_0.1.3
## [43] reshape2_1.4.4 httr_1.4.7
## [45] cachem_1.1.0 rhdf5_2.51.0
## [47] stringr_1.5.1 zlibbioc_1.52.0
## [49] splines_4.4.2 parallel_4.4.2
## [51] BiocManager_1.30.25 XVector_0.47.0
## [53] matrixStats_1.4.1 vctrs_0.6.5
## [55] Matrix_1.7-1 carData_3.0-5
## [57] jsonlite_1.8.9 car_3.1-3
## [59] IRanges_2.41.1 S4Vectors_0.45.2
## [61] rstatix_0.7.2 Formula_1.2-5
## [63] maketools_1.3.1 tidyr_1.3.1
## [65] jquerylib_0.1.4 glue_1.8.0
## [67] codetools_0.2-20 cowplot_1.1.3
## [69] stringi_1.8.4 gtable_0.3.6
## [71] GenomeInfoDb_1.43.1 GenomicRanges_1.59.1
## [73] UCSC.utils_1.3.0 munsell_0.5.1
## [75] tibble_3.2.1 pillar_1.9.0
## [77] htmltools_0.5.8.1 rhdf5filters_1.19.0
## [79] GenomeInfoDbData_1.2.13 R6_2.5.1
## [81] evaluate_1.0.1 lattice_0.22-6
## [83] Biobase_2.67.0 backports_1.5.0
## [85] pheatmap_1.0.12 broom_1.0.7
## [87] bslib_0.8.0 Rcpp_1.0.13-1
## [89] InteractionSet_1.35.0 gridExtra_2.3
## [91] SparseArray_1.7.2 nlme_3.1-166
## [93] mgcv_1.9-1 xfun_0.49
## [95] MatrixGenerics_1.19.0 buildtools_1.0.0
## [97] pkgconfig_2.0.3