HiCool

Processing sequencing Hi-C libraries with HiCool

The HiCool R/Bioconductor package provides an end-to-end interface to process and normalize Hi-C paired-end fastq reads into .(m)cool files.

  1. The heavy lifting (fastq mapping, pairs parsing and pairs filtering) is performed by the underlying lightweight hicstuff python library (https://github.com/koszullab/hicstuff).
  2. Pairs filtering is done using the approach described in Cournac et al., 2012 and implemented in hicstuff.
  3. The cooler (https://github.com/open2c/cooler) library is used to parse pairs into a multi-resolution, balanced .mcool file. .(m)cool is a compact, indexed HDF5 file format specifically tailored for efficiently storing Hi-C data. The .(m)cool file format was developed by Abdennur and Mirny and published in 2019.
  4. Internally, all these external dependencies are automatically installed and managed in R by a basilisk environment.

The main processing function offered in this package is HiCool(). To process .fastq reads into .pairs & .mcool files, one needs to provide:

  • The path to each fastq file (r1 and r2);
  • The genome reference, as a path to a .fasta sequence file, a path to a pre-computed bowtie2 index, or a supported genome ID given as a character string (hg38, mm10, dm6, R64-1-1, WBcel235, GRCz10, Galgal4);
  • The restriction enzyme(s) used for Hi-C.
x <- HiCool(
    r1 = '<PATH-TO-R1.fq.gz>', 
    r2 = '<PATH-TO-R2.fq.gz>', 
    restriction = '<RE1(,RE2)>', 
    resolutions = "<resolutions of interest>", 
    genome = '<GENOME_ID>'
)

Here is a concrete example of Hi-C data processing.

  • Example fastq files are retrieved using the HiContactsData package.
  • Two restriction enzymes are used (these are the enzymes used in the Arima Kit).
  • The final .mcool file will have three resolution levels, from 4000 bp to 16000 bp.
  • The data will be mapped on R64-1-1, the yeast genome reference.
  • All output files will be placed in the ./HiCool/ directory.
library(HiCool)
hcf <- HiCool(
    r1 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R1'), 
    r2 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R2'), 
    restriction = 'DpnII,HinfI', 
    resolutions = c(4000, 8000, 16000), 
    genome = 'R64-1-1', 
    output = './HiCool/'
)
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> HiCool :: Fetching bowtie genome index files from AWS iGenomes S3 bucket...
#> HiCool :: Recovering bowtie2 genome index from AWS iGenomes...
#> + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda create --yes --prefix /github/home/.cache/R/basilisk/1.19.0/HiCool/1.7.0/env 'python=3.7.12' --quiet -c conda-forge -c bioconda --override-channels
#> + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda install --yes --prefix /github/home/.cache/R/basilisk/1.19.0/HiCool/1.7.0/env 'python=3.7.12' -c conda-forge -c bioconda --override-channels
#> + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda install --yes --prefix /github/home/.cache/R/basilisk/1.19.0/HiCool/1.7.0/env -c conda-forge -c bioconda 'python=3.7.12' 'python=3.7.12' 'bowtie2=2.5.0' 'samtools=1.16.1' 'hicstuff=3.1.5' 'chromosight=1.6.3' 'cooler=0.9.1' --override-channels
#> HiCool :: Initiating processing of fastq files [tmp folder: /tmp/RtmpDIfLCT/U5R96O]...
#> HiCool :: Mapping fastq files...
#> HiCool :: Removing unwanted chromosomes...
#> HiCool :: Parsing pairs into .cool file...
#> HiCool :: Generating multi-resolution .mcool file...
#> HiCool :: Balancing .mcool file...
#> HiCool :: Tidying up everything for you...
#> HiCool :: .fastq to .mcool processing done!
#> HiCool :: Check ./HiCool/folder to find the generated files
#> HiCool :: Generating HiCool report. This might take a while.
#> HiCool :: Report generated and available @ /tmp/RtmpfUFKpM/Rbuild1d517969d7/HiCool/vignettes/HiCool/1ebe40b33894_7833^mapped-R64-1-1^U5R96O.html
#> HiCool :: All processing successfully achieved. Congrats!
hcf
#> CoolFile object
#> .mcool file: ./HiCool//matrices/1ebe40b33894_7833^mapped-R64-1-1^U5R96O.mcool 
#> resolution: 4000 
#> pairs file: ./HiCool//pairs/1ebe40b33894_7833^mapped-R64-1-1^U5R96O.pairs 
#> metadata(3): log args stats
S4Vectors::metadata(hcf)
#> $log
#> [1] "./HiCool//logs/1ebe40b33894_7833^mapped-R64-1-1^U5R96O.log"
#> 
#> $args
#> $args$r1
#> [1] "/github/home/.cache/R/ExperimentHub/1ebe40b33894_7833"
#> 
#> $args$r2
#> [1] "/github/home/.cache/R/ExperimentHub/1ebe1a0e9453_7834"
#> 
#> $args$genome
#> [1] "/tmp/RtmpDIfLCT/R64-1-1"
#> 
#> $args$resolutions
#> [1] "4000"
#> 
#> $args$resolutions
#> [1] "8000"
#> 
#> $args$resolutions
#> [1] "16000"
#> 
#> $args$restriction
#> [1] "DpnII,HinfI"
#> 
#> $args$iterative
#> [1] TRUE
#> 
#> $args$balancing_args
#> [1] " --min-nnz 10 --mad-max 5 "
#> 
#> $args$threads
#> [1] 1
#> 
#> $args$output
#> [1] "./HiCool/"
#> 
#> $args$exclude_chr
#> [1] "Mito|chrM|MT"
#> 
#> $args$keep_bam
#> [1] FALSE
#> 
#> $args$scratch
#> [1] "/tmp/RtmpDIfLCT"
#> 
#> $args$wd
#> [1] "/tmp/RtmpfUFKpM/Rbuild1d517969d7/HiCool/vignettes"
#> 
#> 
#> $stats
#> $stats$nFragments
#> [1] 1e+05
#> 
#> $stats$nPairs
#> [1] 73993
#> 
#> $stats$nDangling
#> [1] 10027
#> 
#> $stats$nSelf
#> [1] 2205
#> 
#> $stats$nDumped
#> [1] 83
#> 
#> $stats$nFiltered
#> [1] 61678
#> 
#> $stats$nDups
#> [1] 719
#> 
#> $stats$nUnique
#> [1] 60959
#> 
#> $stats$threshold_uncut
#> [1] 7
#> 
#> $stats$threshold_self
#> [1] 7
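
The counts stored in $stats can be cross-checked with a little arithmetic. The sketch below copies the values printed above into a plain list, so it runs without HiCool installed. It assumes, based only on how these particular numbers add up, that nFiltered is what remains of nPairs after removing dangling, self and dumped pairs, and that nUnique is nFiltered minus duplicates. These relationships are an observation about this run, not a documented guarantee.

```r
# Values copied from the S4Vectors::metadata(hcf)$stats output above
stats <- list(
    nPairs = 73993, nDangling = 10027, nSelf = 2205, nDumped = 83,
    nFiltered = 61678, nDups = 719, nUnique = 60959
)

# Assumed relationships (they hold for this run; not taken from the HiCool docs)
filtered_expected <- with(stats, nPairs - nDangling - nSelf - nDumped)
unique_expected <- with(stats, nFiltered - nDups)

# Do the reported counts match the assumed bookkeeping?
consistent <- filtered_expected == stats$nFiltered &&
    unique_expected == stats$nUnique

# Fraction of parsed pairs that survive filtering and deduplication
retained <- round(stats$nUnique / stats$nPairs, 3)
```

A low retained fraction would suggest looking at the diagnostic plots to see which category of pairs dominates the discarded set.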

Optional parameters

Extra optional arguments can be passed to the hicstuff workhorse library:

  • iterative (default: TRUE): By default, hicstuff first truncates your reads to 20bp and attempts to align the truncated reads, then moves on to aligning 40bp-truncated reads for those which could not be mapped, etc. This procedure is slower than a traditional mapping but allows more pairs to be rescued. Set to FALSE if you want to perform standard alignment of fastq files without iterative alignment;
  • balancing_args (default: " --min-nnz 10 --mad-max 5 "): Specify here any balancing argument to be used by cooler when normalizing the binned contact matrices. The full list of options is available on the cooler documentation website;
  • threads (default: 1L): Number of CPUs to use to process data;
  • exclude_chr (default: 'Mito|chrM|MT'): List here any chromosome you wish to remove from the final contact matrix file;
  • keep_bam (default: FALSE): Set to TRUE if you wish to keep the pair of .bam files;
  • scratch (default: tempdir()): Points to a temporary directory to be used for processing.
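
As a minimal sketch, the optional arguments listed above can be collected in a named list and passed to HiCool() with do.call(). The argument names are those listed above; the values (for example threads = 4L and keep_bam = TRUE) are illustrative assumptions, and the fastq paths are placeholders. The call is guarded so that nothing is executed unless HiCool is installed and both input files exist.

```r
# Placeholder input paths; replace with real fastq files
r1 <- "<PATH-TO-R1.fq.gz>"
r2 <- "<PATH-TO-R2.fq.gz>"

# Argument names as listed above; values are illustrative only
args <- list(
    r1 = r1,
    r2 = r2,
    restriction = "DpnII,HinfI",
    resolutions = c(4000, 8000, 16000),
    genome = "R64-1-1",
    output = "./HiCool/",
    iterative = FALSE,                            # standard alignment, no iterative rescue
    balancing_args = " --min-nnz 10 --mad-max 5 ", # same as the default
    threads = 4L,                                 # assumed number of available CPUs
    exclude_chr = "Mito|chrM|MT",                 # same as the default
    keep_bam = TRUE,                              # keep the pair of .bam files
    scratch = tempdir()
)

# Only run the pipeline when HiCool is installed and both inputs exist
if (requireNamespace("HiCool", quietly = TRUE) && all(file.exists(r1, r2))) {
    hcf <- do.call(HiCool::HiCool, args)
}
```

Keeping the arguments in a list makes it easy to record them alongside the results or to reuse them across several samples.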

Output files

The important files generated by HiCool are the following:

  • A log file: <output_folder>/logs/<prefix>^mapped-<genome>^<hash>.log
  • A multi-resolution, balanced contact matrix file: <output_folder>/matrices/<prefix>^mapped-<genome>^<hash>.mcool
  • A .pairs file: <output_folder>/pairs/<prefix>^mapped-<genome>^<hash>.pairs
  • Several diagnostic plots: <output_folder>/plots/<prefix>^mapped-<genome>^<hash>_*.pdf.
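
The naming scheme above can be reproduced with a few lines of base R, for instance to check that a run produced the expected files. This is only a sketch: hicool_outputs() is a hypothetical helper, not a function of the HiCool package, and the prefix, genome and hash values are taken from the example run shown earlier.

```r
# Hypothetical helper (not part of HiCool): rebuild the expected output paths
hicool_outputs <- function(output, prefix, genome, hash) {
    base <- paste0(prefix, "^mapped-", genome, "^", hash)
    c(
        log   = file.path(output, "logs", paste0(base, ".log")),
        mcool = file.path(output, "matrices", paste0(base, ".mcool")),
        pairs = file.path(output, "pairs", paste0(base, ".pairs"))
    )
}

# Values taken from the example run above
paths <- hicool_outputs(
    output = "HiCool", prefix = "1ebe40b33894_7833",
    genome = "R64-1-1", hash = "U5R96O"
)

# Which of the expected files are actually present on disk
found <- file.exists(paths)
```

The plots are left out of the helper because their file names carry an additional suffix (the `_*` part of the pattern above).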

The diagnostic plots illustrate how pairs were filtered during the processing, using a strategy described in Cournac et al., BMC Genomics 2012. The event_distance chart represents the frequency of ++, +-, -+ and -- pairs in the library, as a function of the number of restriction sites between the two ends of each pair, and shows the inferred filtering threshold. The event_distribution chart indicates the proportion of each type of pairs (e.g. dangling, uncut, abnormal, …) and the total number of pairs retained (3D intra + 3D inter).

Notes:

System dependencies

Processing Hi-C sequencing libraries into .pairs and .mcool files requires several dependencies, to (1) align reads to a reference genome, (2) manage alignment files (SAM), (3) filter pairs, (4) bin them to a specific resolution and (5)

All system dependencies are internally managed by basilisk. HiCool maintains a basilisk environment containing:

  • python 3.7.12
  • bowtie2 2.5.0
  • samtools 1.16.1
  • hicstuff 3.1.5
  • cooler 0.9.1
  • chromosight 1.6.3

The first time HiCool() is executed, a fresh basilisk environment will be created and required dependencies automatically installed. This ensures compatibility between the different system dependencies needed to process Hi-C fastq files.

Session info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] HiContactsData_1.8.0 ExperimentHub_2.15.0 AnnotationHub_3.15.0
#>  [4] BiocFileCache_2.15.0 dbplyr_2.5.0         BiocGenerics_0.53.3 
#>  [7] generics_0.1.3       HiCool_1.7.0         HiCExperiment_1.7.0 
#> [10] BiocStyle_2.35.0    
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.2.3                   rlang_1.1.4                
#>   [3] magrittr_2.0.3              matrixStats_1.4.1          
#>   [5] compiler_4.4.2              RSQLite_2.3.8              
#>   [7] dir.expiry_1.15.0           png_0.1-8                  
#>   [9] vctrs_0.6.5                 stringr_1.5.1              
#>  [11] pkgconfig_2.0.3             crayon_1.5.3               
#>  [13] fastmap_1.2.0               XVector_0.47.0             
#>  [15] rmdformats_1.0.4            utf8_1.2.4                 
#>  [17] rmarkdown_2.29              sessioninfo_1.2.2          
#>  [19] tzdb_0.4.0                  UCSC.utils_1.3.0           
#>  [21] strawr_0.0.92               purrr_1.0.2                
#>  [23] bit_4.5.0                   xfun_0.49                  
#>  [25] zlibbioc_1.52.0             cachem_1.1.0               
#>  [27] GenomeInfoDb_1.43.1         jsonlite_1.8.9             
#>  [29] blob_1.2.4                  rhdf5filters_1.19.0        
#>  [31] DelayedArray_0.33.2         Rhdf5lib_1.29.0            
#>  [33] BiocParallel_1.41.0         parallel_4.4.2             
#>  [35] R6_2.5.1                    bslib_0.8.0                
#>  [37] stringi_1.8.4               reticulate_1.40.0          
#>  [39] GenomicRanges_1.59.1        jquerylib_0.1.4            
#>  [41] Rcpp_1.0.13-1               bookdown_0.41              
#>  [43] SummarizedExperiment_1.37.0 knitr_1.49                 
#>  [45] IRanges_2.41.1              Matrix_1.7-1               
#>  [47] tidyselect_1.2.1            abind_1.4-8                
#>  [49] yaml_2.3.10                 codetools_0.2-20           
#>  [51] curl_6.0.1                  lattice_0.22-6             
#>  [53] tibble_3.2.1                withr_3.0.2                
#>  [55] KEGGREST_1.47.0             InteractionSet_1.35.0      
#>  [57] Biobase_2.67.0              basilisk.utils_1.19.0      
#>  [59] evaluate_1.0.1              Biostrings_2.75.1          
#>  [61] pillar_1.9.0                BiocManager_1.30.25        
#>  [63] filelock_1.0.3              MatrixGenerics_1.19.0      
#>  [65] stats4_4.4.2                plotly_4.10.4              
#>  [67] vroom_1.6.5                 BiocVersion_3.21.1         
#>  [69] S4Vectors_0.45.2            ggplot2_3.5.1              
#>  [71] munsell_0.5.1               scales_1.3.0               
#>  [73] glue_1.8.0                  lazyeval_0.2.2             
#>  [75] maketools_1.3.1             tools_4.4.2                
#>  [77] BiocIO_1.17.0               sys_3.4.3                  
#>  [79] data.table_1.16.2           buildtools_1.0.0           
#>  [81] rhdf5_2.51.0                grid_4.4.2                 
#>  [83] tidyr_1.3.1                 crosstalk_1.2.1            
#>  [85] AnnotationDbi_1.69.0        colorspace_2.1-1           
#>  [87] GenomeInfoDbData_1.2.13     basilisk_1.19.0            
#>  [89] cli_3.6.3                   rappdirs_0.3.3             
#>  [91] fansi_1.0.6                 S4Arrays_1.7.1             
#>  [93] viridisLite_0.4.2           dplyr_1.1.4                
#>  [95] gtable_0.3.6                sass_0.4.9                 
#>  [97] digest_0.6.37               SparseArray_1.7.2          
#>  [99] htmlwidgets_1.6.4           memoise_2.0.1              
#> [101] htmltools_0.5.8.1           lifecycle_1.0.4            
#> [103] httr_1.4.7                  mime_0.12                  
#> [105] bit64_4.5.2