The explosion of the usage of Next Generation Sequencing techniques during the past few years due to the seemingly endless portfolio of applications has created the need for new NGS data analytics tools which are able to offer comprehensive and at the same time flexible visualizations under several experimental settings and factors. An established visualization tool in NGS experiments is the visualization of the signal created by short reads after the application of every NGS protocol. Genome Browsers (e.g. the UCSC Genome Browser) serve very well this purpose considering single genomic areas. They are very good when it comes to the visualization of the abundance of a single or a few genes or the strength of a few protein-DNA interaction sites. However, when it comes to the visualization of average signal profiles over multiple genomic locations (gene regions or others like DNA methylation sites or transcription factor binding sites), Genome Browsers fail to depict such information in a comprehensive way. Furthermore, they cannot visualize average signal profiles of factored data, for example a set of genes categorized in high, medium and low expression or even by strand and they cannot visualize all signals of interest mapped on all the genomic regions of interest in a compact way, something that can be done using for example a heatmap.
In such cases, bioinformaticians use several toolkits like BEDTools
and facilities from R/Bioconductor to read/import short reads and
overlap them with genomic regions of interest, summarize them by
binning/averaging overlaps to control the resolution of final graphics
etc. This procedure often requires the usage of multiple tools and
custom scripts with multiple steps. One of the most comprehensive and
easy-to-use tools up to date is ngs.plot. It is
sufficiently fast for most applications and has a low memory footprint
which allows users to run it in low end machines too. It is command line
based and users can run it by using a simple configuration file most of
the times, has a rich database of genome annotation and genomic features
and uses R/Bioconductor for underlying calculations and plotting of
profiles. However, ngs.plot is not up to date with modern R graphics
systems like ggplot2
and ComplexHeatmap
. As a
result, among others, it is impossible to create faceted genomic
profiles using a statistical design and in such cases, a lot of
additional manual work and computational time is required in order to
reach the desired outcomes. The same applies to heatmap profiles.
Furthermore, the resolution of genomic profiles (e.g. per base coverage
or per bin of base-pairs coverage) cannot be controlled and this can
cause problems in cases where extreme levels of resolution (e.g.
DNAse-Seq experiments) is required to reach meaningful biological
conclusions. Last but not least, ngs.plot requires a not so
straightforward setup in order to run, does not run in a unified working
environment (e.g. R environment with its graphics displaying mechanisms)
and in some cases produces oversized and complex output.
The recoup package comes to fill such gaps by stepping on the shoulders of giants. It uses the now standardized and stable Bioconductor facilities to read and import short reads from BAM/BED files or processed genomic signals from BIGWig files and also modern R graphics systems, namely ggplot2 and ComplexHeatmap in order to create comprehensive averaged genomic profiles and genomic profile heatmaps. In addition it offers a lot of (easy to use) customization options and automation at various levels. Inexperienced users can gather their data in a simple text file and just choose one of the supported organisms and recoup does the rest for them. More experienced users can play with several options and provide more flexible input so as to produce optimal results. This vignette, although it covers basic usage of the package, it offers the basis for more sophisticated usage. recoup is not as fast as ngs.plot but we are working on this! Also, recoup is not here to replace other more mature packages. It is here to offer more options to users that need more sophisticated genomic profile visualizations. Finally, it offers a very flexible way to reuse genomic profiles whose calculations may be computationally and time expensive by offering a smart way to recognize which parameters have changed and acting based on these.
Specifically, recoup creates three types of plots:
All plots can be faceted using a design file with the categories into which the plots should be separated. All the above are illustrated in the examples in the vignettes of this package.
Detailed instructions on how to run the recoup genomic profile creation pipeline can be found under the main documentation of the package:
## Warning: multiple methods tables found for 'intersect'
## Warning: multiple methods tables found for 'union'
## Warning: multiple methods tables found for 'intersect'
## Warning: multiple methods tables found for 'setdiff'
## Warning: multiple methods tables found for 'intersect'
Briefly, to run recoup you need:
recoup
man page). They can also be provided as an organism
version keyword, e.g. hg19
or mm9
and the
respective regions will be either retrieved from a local
recoup
annotation database setup, or downloaded on the fly
(takes significantly more time as some extra operations are
required).The package contains a small dataset which serves only for package building and testing purposes as well as complying with Bioconductor guidelines. These data are useful for the user to check how the input data to recoup should look like. For a more complete test dataset (a small one) have a look and download from here
The following commands should provide some information on the embedded test data, which are subsets of the complete test data in the above link.
In order to run smoothly some usage examples and produce some realistic results, you need to download a set of example BAM files, genomic regions and design files from here. Following, a description of each file in the archive (the tissue is always liver):
recoup
annotation databaseIn order to run the vignette examples, you should download and
extract the archive to a path of your preference,
e.g. /home/me/recoup_tutorial
.
Apart from a user specified file, the reference genomic regions used
by recoup to construct average profiles over, can be predefined gene set
from a few common organisms supported by recoup. See the
recoup
man page for a list of these organisms. In order to
use this “database” of predefined genomic areas, you should run the
function buildAnnotationStore
with a list of organisms, a
list of annotation sources (Ensembl, RefSeq and UCSC supported) and a
desired path to store the annotations (defaults to
/home/me/.recoup
). For example:
See the man page of buildAnnotationStore
for more
details. This step is not necessary for recoup to run as these
annotations can be also downloaded on the fly. However, if subsets of
the supported organisms are to be used often, it is much more
preferrable to spend some time building the local store as it can save a
lot of running time later by investing some time in the beginning to
build this store.
The recoup
function can be used to create coverage
profiles from ChIP-Seq like experiments (signals over continuous genomic
regions) or from RNA-Seq experiments (signals over non-continuous
genomic regions). Due to restrictions (logical ones) imposed by the
Bioconductor development core team, the data that are required for a
realistic tutorial of the recoup
package cannot be
included. Thus, this page along with the rest of the tutorial on how to
create genomic profiles from ChIP-Seq or RNA-Seq data can be found
either here or
in the wiki recoup
pages in github
When working with gene bodies, it happens very often that gene
lengths are a little to a lot smaller than the number of bins into which
they should be split and averaged in order to be able to create the
average curve and heatmap profiles. While other packages line ngs.plot
deal with this by always plotting in a pre-specifed axis system
(e.g. 1-100 in the x-axis for ngs.plot) and using splines to sample
coverages at equal spaces, recoup supports a more dynamic resolution by
allowing the user to set the number of bins into which genomic areas
will be binned or by allowing a per-base resolution where possible.
Thus, when genes are smaller than the desired number of bins, there
might be a problem. recoup
deals with that in four possible
ways:
spline
function with the
default method)approx
function with the
default method)NA
values are distributed randomly across
the small area coverage vector (e.g. gene with length smaller than the
desired number of bins), excluding the first and the last two positions,
in order to reach the desired number of bins. Then, each NA
position is filled with the mean value of the two values before and the
two values after the NA
position. This method should be
avoided when >20% of the values of the extended vector are
NA
’s as it may cause a crash. However, it should be the
most accurate one in the opposite case (few NA
’s). See the
recoup
man page for further information.In the two vignettes explaining the useage of the
recoup
package and its functions and in the examples, you
will notice the direct use of plot
function instead of
direct plotting (plotParams$plot=FALSE
instead of the
default TRUE
) or the recoupPlot
function. This
has been done on purpose as at this point, it is not straighforward how
to tell the knitr
vignette builder how to plot objects
created within called functions.
Due to strict Bioconductor guidelines regarding package size, these vignettes can also be found in the github package page with a few more examples.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] recoup_1.35.0 ComplexHeatmap_2.23.0
## [3] ggplot2_3.5.1 GenomicAlignments_1.43.0
## [5] Rsamtools_2.23.0 Biostrings_2.75.1
## [7] XVector_0.47.0 SummarizedExperiment_1.37.0
## [9] Biobase_2.67.0 MatrixGenerics_1.19.0
## [11] matrixStats_1.4.1 GenomicRanges_1.59.0
## [13] GenomeInfoDb_1.43.0 IRanges_2.41.1
## [15] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [17] generics_0.1.3 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 bitops_1.0-9 httr2_1.0.6
## [4] biomaRt_2.63.0 rlang_1.1.4 magrittr_2.0.3
## [7] clue_0.3-66 GetoptLong_1.0.5 compiler_4.4.2
## [10] RSQLite_2.3.7 GenomicFeatures_1.59.1 txdbmaker_1.3.0
## [13] png_0.1-8 vctrs_0.6.5 stringr_1.5.1
## [16] pkgconfig_2.0.3 shape_1.4.6.1 crayon_1.5.3
## [19] fastmap_1.2.0 dbplyr_2.5.0 utf8_1.2.4
## [22] rmarkdown_2.29 UCSC.utils_1.3.0 bit_4.5.0
## [25] xfun_0.49 zlibbioc_1.52.0 cachem_1.1.0
## [28] jsonlite_1.8.9 progress_1.2.3 blob_1.2.4
## [31] DelayedArray_0.33.2 BiocParallel_1.41.0 parallel_4.4.2
## [34] prettyunits_1.2.0 cluster_2.1.6 R6_2.5.1
## [37] bslib_0.8.0 stringi_1.8.4 RColorBrewer_1.1-3
## [40] rtracklayer_1.67.0 jquerylib_0.1.4 iterators_1.0.14
## [43] knitr_1.49 Matrix_1.7-1 tidyselect_1.2.1
## [46] abind_1.4-8 yaml_2.3.10 doParallel_1.0.17
## [49] codetools_0.2-20 curl_6.0.1 lattice_0.22-6
## [52] tibble_3.2.1 withr_3.0.2 KEGGREST_1.47.0
## [55] evaluate_1.0.1 BiocFileCache_2.15.0 xml2_1.3.6
## [58] circlize_0.4.16 filelock_1.0.3 pillar_1.9.0
## [61] BiocManager_1.30.25 foreach_1.5.2 RCurl_1.98-1.16
## [64] hms_1.1.3 munsell_0.5.1 scales_1.3.0
## [67] glue_1.8.0 maketools_1.3.1 tools_4.4.2
## [70] BiocIO_1.17.0 sys_3.4.3 XML_3.99-0.17
## [73] buildtools_1.0.0 AnnotationDbi_1.69.0 colorspace_2.1-1
## [76] GenomeInfoDbData_1.2.13 restfulr_0.0.15 cli_3.6.3
## [79] rappdirs_0.3.3 fansi_1.0.6 S4Arrays_1.7.1
## [82] dplyr_1.1.4 gtable_0.3.6 sass_0.4.9
## [85] digest_0.6.37 SparseArray_1.7.2 rjson_0.2.23
## [88] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4
## [91] httr_1.4.7 GlobalOptions_0.1.2 bit64_4.5.2