In this vignette, we provide a brief overview of the ChIPexoQual package, which implements a statistical quality control (QC) pipeline for exploring and analyzing ChIP-exo/nexus experiments. To illustrate the pipeline, we use the reads aligned to chr1 from the mouse liver ChIP-exo experiment (Serandour et al. 2013). To load the packages we use:
library(ChIPexoQual)
library(ChIPexoQualExample)
ChIPexoQual takes a set of aligned reads from a ChIP-exo (or ChIP-nexus) experiment as input and performs the following steps:
We analyzed a larger collection of ChIP-exo/nexus experiments in (Welch et al. 2016), including complete versions of these samples.
The minimum input required by ChIPexoQual is the set of aligned reads of a ChIP-exo/nexus experiment. ChIPexoQual accepts either the name of the bam file or the reads in a GAlignments object:
files = list.files(system.file("extdata",
package = "ChIPexoQualExample"),full.names = TRUE)
basename(files[1])
## [1] "ChIPexo_carroll_FoxA1_mouse_rep1_chr1.bam"
ex1 = ExoData(file = files[1],mc.cores = 2L,verbose = FALSE)
ex1
## ExoData object with 655785 ranges and 11 metadata columns:
## seqnames ranges strand | fwdReads revReads fwdPos
## <Rle> <IRanges> <Rle> | <integer> <integer> <integer>
## [1] chr1 3000941-3000976 * | 2 0 1
## [2] chr1 3001457-3001492 * | 0 1 0
## [3] chr1 3001583-3001618 * | 0 2 0
## [4] chr1 3001647-3001682 * | 1 0 1
## [5] chr1 3001852-3001887 * | 1 0 1
## ... ... ... ... . ... ... ...
## [655781] chr1 197192012-197192047 * | 0 1 0
## [655782] chr1 197192421-197192456 * | 0 1 0
## [655783] chr1 197193059-197193094 * | 1 0 1
## [655784] chr1 197193694-197193729 * | 0 3 0
## [655785] chr1 197194986-197195021 * | 0 2 0
## revPos depth uniquePos ARC URC FSR
## <integer> <integer> <integer> <numeric> <numeric> <numeric>
## [1] 0 2 1 0.0555556 0.5 1
## [2] 1 1 1 0.0277778 1.0 0
## [3] 1 2 1 0.0555556 0.5 0
## [4] 0 1 1 0.0277778 1.0 1
## [5] 0 1 1 0.0277778 1.0 1
## ... ... ... ... ... ... ...
## [655781] 1 1 1 0.0277778 1.000000 0
## [655782] 1 1 1 0.0277778 1.000000 0
## [655783] 0 1 1 0.0277778 1.000000 1
## [655784] 1 3 1 0.0833333 0.333333 0
## [655785] 1 2 1 0.0555556 0.500000 0
## M A
## <numeric> <numeric>
## [1] -Inf Inf
## [2] -Inf -Inf
## [3] -Inf -Inf
## [4] -Inf Inf
## [5] -Inf Inf
## ... ... ...
## [655781] -Inf -Inf
## [655782] -Inf -Inf
## [655783] -Inf Inf
## [655784] -Inf -Inf
## [655785] -Inf -Inf
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
reads = readGAlignments(files[1],param = NULL)
ex2 = ExoData(reads = reads,mc.cores = 2L,verbose = FALSE)
identical(GRanges(ex1),GRanges(ex2))
## [1] FALSE
For the rest of the vignette, we generate an ExoData object for each replicate:
files = files[grep("bai",files,invert = TRUE)] ## ignore index files
exampleExoData = lapply(files,ExoData,mc.cores = 2L,verbose = FALSE)
Finally, we can recover the number of reads that compose an ExoData object by using the nreads function:
sapply(exampleExoData,nreads)
## [1] 1654985 1766665 1670117
To create the ARC vs URC plot proposed in (Welch et al. 2016), we use the ARCvURCplot function, which allows us to visually compare different samples:
ARCvURCplot(exampleExoData,names.input = paste("Rep",1:3,sep = "-"))
This plot typically exhibits one of the following three patterns for any given sample. In all three panels we can observe two arms: the first with low Average Read Coefficient (ARC) and varying Unique Read Coefficient (URC), and the second where the URC decreases as the ARC increases. The first and third replicates show a well-defined decreasing trend in URC as the ARC increases, which indicates that these samples have higher ChIP enrichment than the second replicate. On the other hand, the overall URC level of the first two replicates is higher than that of the third, indicating that the libraries of the first two replicates are more complex than that of the third replicate.
To create the FSR distribution and Region Composition plots suggested in (Welch et al. 2016), we use the FSRDistplot and regionCompplot functions, respectively:
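As a quick check of how these coefficients relate to the ExoData columns printed earlier, the following sketch recomputes ARC, URC and FSR for three of the islands shown in that output. It assumes, consistently with the printed values, that ARC is the depth divided by the island width, URC is the number of unique positions divided by the depth, and FSR is the number of forward reads divided by the depth. The data frame name `islands` is hypothetical and only used for illustration.

```r
# Values copied from rows 1, 2 and 655784 of the ExoData output shown earlier
islands <- data.frame(
  start     = c(3000941, 3001457, 197193694),
  end       = c(3000976, 3001492, 197193729),
  fwdReads  = c(2, 0, 0),
  revReads  = c(0, 1, 3),
  uniquePos = c(1, 1, 1)
)

islands$width <- islands$end - islands$start + 1   # inclusive coordinates
islands$depth <- islands$fwdReads + islands$revReads

# Assumed definitions, chosen to be consistent with the printed columns
islands$ARC <- islands$depth / islands$width
islands$URC <- islands$uniquePos / islands$depth
islands$FSR <- islands$fwdReads / islands$depth

print(round(islands[, c("depth", "ARC", "URC", "FSR")], 7))
```

The rounded values can be compared against the ARC, URC and FSR columns of the corresponding rows in the printed object.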
p1 = regionCompplot(exampleExoData,names.input = paste("Rep",1:3,
sep = "-"),depth.values = seq_len(50))
p2 = FSRDistplot(exampleExoData,names.input = paste("Rep",1:3,sep = "-"),
quantiles = c(.25,.5,.75),depth.values = seq_len(100))
gridExtra::grid.arrange(p1,p2,nrow = 1)
The left panel displays the Region Composition plot and the right panel shows the Forward Strand Ratio (FSR) distribution plot; both highlight specific problems with replicates 2 and 3. The Region Composition plot shows decreasing trends in the proportion of regions formed by fragments on a single strand. High quality experiments tend to show exponential decay in the proportion of single-stranded regions, while for lower quality experiments the trend may be linear or even constant. The FSR distributions of replicates 2 and 3 are more spread around their respective medians. The rate at which the FSR distribution concentrates around the median reflects the aforementioned lower enrichment in the second replicate and the low complexity in the third one. The asymmetric behavior of the second replicate is characteristic of low enrichment, while the constant values of the third replicate at low minimum numbers of reads indicate that this replicate has islands composed of reads aligned to very few unique positions.
All the plot functions in ChIPexoQual accept either a list or several separate ExoData objects. This allows us to explore island subsets for each replicate. For example, to show that the first arm is composed of regions formed by reads aligned to few positions, we can generate the following plot:
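To make the idea behind the Region Composition plot concrete, the sketch below tabulates, for a few depth thresholds, the proportion of islands with reads on the forward strand only, the reverse strand only, or both strands. The data are simulated, the names `toy` and `comp` are hypothetical, and filtering by "depth greater than the threshold" is an assumption rather than a description of what regionCompplot does internally.

```r
set.seed(1)
# Hypothetical islands: forward and reverse read counts per island
n <- 500
toy <- data.frame(fwdReads = rpois(n, 2), revReads = rpois(n, 2))
toy$depth <- toy$fwdReads + toy$revReads
toy <- toy[toy$depth > 0, ]

# Classify each island by strand composition
toy$label <- ifelse(toy$revReads == 0, "fwd",
             ifelse(toy$fwdReads == 0, "rev", "both"))
toy$label <- factor(toy$label, levels = c("both", "fwd", "rev"))

# Proportion of each class among islands with depth above each threshold
depth.values <- 1:5
comp <- t(sapply(depth.values, function(d) {
  prop.table(table(toy$label[toy$depth > d]))
}))
rownames(comp) <- paste0("depth>", depth.values)
print(round(comp, 3))
```

In this simulation, the "fwd" and "rev" columns correspond to the single-stranded regions discussed above, so their decay across thresholds is the quantity of interest.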
ARCvURCplot(exampleExoData[[1]],
subset(exampleExoData[[1]],uniquePos > 10),
subset(exampleExoData[[1]],uniquePos > 20),
names.input = c("All", "uniquePos > 10", "uniquePos > 20"))
For this figure, we used the ARC vs URC plot to show that several of the regions with low ARC values are composed of reads that align to a small number of unique positions. This illustrates a strategy for exploring the data further: with any of the plotting functions listed above, we may compare different subsets of the islands in the partition.
The last step of the quality control pipeline is to evaluate the linear model:
$$ \begin{align*} D_i = \beta_1 U_i + \beta_2 W_i + \epsilon_i, \end{align*} $$
where $D_i$, $U_i$ and $W_i$ are the depth (number of reads), the number of unique positions and the width of the $i$-th island, respectively.
The distribution of the parameters of this model is built by sampling nregions regions (the default value is 1,000), fitting the model and repeating the process ntimes (the default value is 100). We visualize the distributions of the parameters with box-plots:
p1 = paramDistBoxplot(exampleExoData,which.param = "beta1", names.input = paste("Rep",1:3,sep = "-"))
p2 = paramDistBoxplot(exampleExoData,which.param = "beta2", names.input = paste("Rep",1:3,sep = "-"))
gridExtra::grid.arrange(p1,p2,nrow = 1)
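The resampling scheme described above can be sketched in base R as follows. This is only an illustration on simulated island summaries, not the package's implementation: the data frame `islands`, the simulated coefficients (2 and 0.01), and sampling without replacement are all assumptions. The second covariate is taken to be the island width, and the model is fitted without an intercept, as in the equation.

```r
set.seed(42)
# Hypothetical island summaries standing in for an ExoData object
n <- 5000
width     <- sample(36:300, n, replace = TRUE)
uniquePos <- pmax(1, rpois(n, width / 30))
depth     <- 2 * uniquePos + 0.01 * width + rnorm(n, sd = 0.5)
islands   <- data.frame(depth, uniquePos, width)

nregions <- 1000   # islands drawn per fit (default stated in the text)
ntimes   <- 100    # number of repetitions (default stated in the text)

# Fit depth = beta1 * uniquePos + beta2 * width + error on each random sample
fits <- t(replicate(ntimes, {
  idx <- sample(nrow(islands), nregions)
  coef(lm(depth ~ 0 + uniquePos + width, data = islands[idx, ]))
}))
colnames(fits) <- c("beta1", "beta2")

# Median of each parameter distribution
print(apply(fits, 2, median))
```

Each row of `fits` is one pair of estimates, so box-plots of its two columns would play the role of the parameter distributions shown above.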
Further details about this analysis are in (Welch et al. 2016). In short, when the ChIP-exo/nexus sample is not deeply sequenced, high values of $\hat{\beta}_1$ indicate that the library complexity is low. In contrast, lower values correspond to higher quality ChIP-exo experiments. We concluded that samples with estimated $\hat{\beta}_1 \leq 10$ seem to be high quality samples. Similarly, samples with estimated $\hat{\beta}_2 \approx 0$ can be considered high quality samples. The estimated values for these parameters can be accessed with the beta1, beta2, and paramDist methods. For example, using the median to summarize these parameter distributions, we conclude that these three replicates (in chr1) are high quality samples:
sapply(exampleExoData,function(x)median(beta1(x)))
## [1] 1.863641 1.500497 8.103250
sapply(exampleExoData,function(x)median(beta2(x)))
## [1] 0.01426162 0.00952127 0.04592548
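The rule of thumb for $\hat{\beta}_1$ can be applied directly to the medians printed above. The sketch below copies those values by hand and flags the replicates that satisfy $\hat{\beta}_1 \leq 10$. The text gives no numeric cutoff for $\hat{\beta}_2 \approx 0$, so that column is only reported, not thresholded. The object names are hypothetical.

```r
# Medians copied from the output above
beta1.med <- c(1.863641, 1.500497, 8.103250)
beta2.med <- c(0.01426162, 0.00952127, 0.04592548)

qc <- data.frame(
  replicate = paste("Rep", 1:3, sep = "-"),
  beta1     = beta1.med,
  beta2     = beta2.med,
  beta1.ok  = beta1.med <= 10   # cutoff stated in the text
)
print(qc)
```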
The behavior of the third FoxA1 replicate may indicate problems in the sample. However, this pattern is also common in deeply sequenced experiments. For convenience, we added the function ExoDataSubsampling, which performs the analysis suggested in (Welch et al. 2016) when the experiment is deeply sequenced. To use this function, we proceed as follows:
sample.depth = seq(1e5,2e5,5e4)
exoList = ExoDataSubsampling(file = files[3],sample.depth = sample.depth, verbose=FALSE)
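Conceptually, the subsampling step draws a fixed number of reads at each requested depth before the islands are rebuilt. The base R sketch below shows only that drawing step on hypothetical read identifiers. Sampling without replacement and drawing each depth independently are assumptions, not a description of how ExoDataSubsampling is implemented.

```r
set.seed(7)
# Hypothetical read identifiers standing in for the aligned reads
reads <- seq_len(250000)
sample.depth <- seq(1e5, 2e5, 5e4)

# One independent draw without replacement per requested depth
subsamples <- lapply(sample.depth, function(d) sample(reads, d))
names(subsamples) <- format(sample.depth, scientific = FALSE)

print(sapply(subsamples, length))
```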
The output of ExoDataSubsampling is a list of ExoData objects, so it can be used with any of the plotting functions to assess the quality of the samples. For example, we may use paramDistBoxplot to get the following figures:
p1 = paramDistBoxplot(exoList,which.param = "beta1")
p2 = paramDistBoxplot(exoList,which.param = "beta2")
gridExtra::grid.arrange(p1,p2,nrow = 1)
Both plots show clearly increasing trends. Because we are only using the reads in chromosome 1, we observe fewer reads than in a typical ChIP-exo/nexus experiment. In a higher quality experiment, we expect lower $\hat{\beta}_1$ and $\hat{\beta}_2$ levels. Additionally, the estimated $\hat{\beta}_2$ parameter increases faster in a lower quality experiment.
We presented a systematic exploration of a ChIP-exo experiment and showed how to use the QC pipeline provided in ChIPexoQual. ChIPexoQual takes aligned reads as input and automatically generates several diagnostic plots and summary measures that enable assessing enrichment and library complexity. The implications of the diagnostic plots and the summary measures align well with more elaborate analyses that are computationally more expensive to perform and/or require additional inputs that often may not be available.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## character(0)
##
## other attached packages:
## [1] ChIPexoQual_1.31.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 bitops_1.0-9
## [3] gridExtra_2.3 rlang_1.1.4
## [5] magrittr_2.0.3 biovizBase_1.55.0
## [7] matrixStats_1.4.1 compiler_4.4.2
## [9] RSQLite_2.3.9 GenomicFeatures_1.59.1
## [11] png_0.1-8 vctrs_0.6.5
## [13] ProtGenerics_1.39.0 stringr_1.5.1
## [15] pkgconfig_2.0.3 crayon_1.5.3
## [17] fastmap_1.2.0 backports_1.5.0
## [19] XVector_0.47.0 labeling_0.4.3
## [21] Rsamtools_2.23.1 rmarkdown_2.29
## [23] grDevices_4.4.2 UCSC.utils_1.3.0
## [25] purrr_1.0.2 bit_4.5.0.1
## [27] xfun_0.49 zlibbioc_1.52.0
## [29] cachem_1.1.0 graphics_4.4.2
## [31] GenomeInfoDb_1.43.2 jsonlite_1.8.9
## [33] blob_1.2.4 DelayedArray_0.33.3
## [35] BiocParallel_1.41.0 broom_1.0.7
## [37] parallel_4.4.2 cluster_2.1.8
## [39] VariantAnnotation_1.53.0 R6_2.5.1
## [41] bslib_0.8.0 stringi_1.8.4
## [43] RColorBrewer_1.1-3 rtracklayer_1.67.0
## [45] rpart_4.1.23 GenomicRanges_1.59.1
## [47] jquerylib_0.1.4 SummarizedExperiment_1.37.0
## [49] knitr_1.49 base64enc_0.1-3
## [51] IRanges_2.41.2 Matrix_1.7-1
## [53] nnet_7.3-19 tidyselect_1.2.1
## [55] viridis_0.6.5 rstudioapi_0.17.1
## [57] dichromat_2.0-0.1 abind_1.4-8
## [59] yaml_2.3.10 codetools_0.2-20
## [61] curl_6.0.1 lattice_0.22-6
## [63] tibble_3.2.1 withr_3.0.2
## [65] Biobase_2.67.0 KEGGREST_1.47.0
## [67] evaluate_1.0.1 foreign_0.8-87
## [69] base_4.4.2 Biostrings_2.75.3
## [71] pillar_1.10.0 BiocManager_1.30.25
## [73] MatrixGenerics_1.19.0 checkmate_2.3.2
## [75] stats4_4.4.2 generics_0.1.3
## [77] RCurl_1.98-1.16 ensembldb_2.31.0
## [79] S4Vectors_0.45.2 ggplot2_3.5.1
## [81] munsell_0.5.1 scales_1.3.0
## [83] BiocStyle_2.35.0 stats_4.4.2
## [85] glue_1.8.0 Hmisc_5.2-1
## [87] lazyeval_0.2.2 ChIPexoQualExample_1.30.0
## [89] maketools_1.3.1 tools_4.4.2
## [91] datasets_4.4.2 hexbin_1.28.5
## [93] BiocIO_1.17.1 sys_3.4.3
## [95] data.table_1.16.4 BSgenome_1.75.0
## [97] GenomicAlignments_1.43.0 buildtools_1.0.0
## [99] XML_3.99-0.17 grid_4.4.2
## [101] utils_4.4.2 tidyr_1.3.1
## [103] methods_4.4.2 AnnotationDbi_1.69.0
## [105] colorspace_2.1-1 GenomeInfoDbData_1.2.13
## [107] htmlTable_2.4.3 restfulr_0.0.15
## [109] Formula_1.2-5 cli_3.6.3
## [111] viridisLite_0.4.2 S4Arrays_1.7.1
## [113] dplyr_1.1.4 AnnotationFilter_1.31.0
## [115] gtable_0.3.6 sass_0.4.9
## [117] digest_0.6.37 BiocGenerics_0.53.3
## [119] SparseArray_1.7.2 farver_2.1.2
## [121] rjson_0.2.23 htmlwidgets_1.6.4
## [123] memoise_2.0.1 htmltools_0.5.8.1
## [125] lifecycle_1.0.4 httr_1.4.7
## [127] bit64_4.5.2