Install OGRE using Bioconductor’s package installer.
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("OGRE")
Load the OGRE package:
To start up OGRE you have to generate an OGREDataSet
that is used to store your datasets and additional information about the
analysis that you are conducting. Query and subjects files can be
conveniently stored in their own folders as GenomicRanges objects in
form of stored .rds / .RDS files. We point OGRE to the correct location
by supplying a path for each folder with the character vectors
queryFolder
and subjectFolder
. In this
vignette we are using lightweight query and subject example data sets to
show OGRE’s functionality.
myQueryFolder <- file.path(system.file('extdata', package = 'OGRE'),"query")
mySubjectFolder <- file.path(system.file('extdata', package = 'OGRE'),"subject")
myOGRE <- OGREDataSetFromDir(queryFolder=myQueryFolder,
subjectFolder=mySubjectFolder)
## Initializing OGREDataSet...
By monitoring OGRE’s metadata information you can make sure the input paths you supplied are stored correctly.
## $queryFolder
## [1] "/tmp/Rtmp0EKTm9/Rinst214d13dba438/OGRE/extdata/query"
##
## $subjectFolder
## [1] "/tmp/Rtmp0EKTm9/Rinst214d13dba438/OGRE/extdata/subject"
##
## $outputFolder
## [1] "/tmp/Rtmp0EKTm9/Rinst214d13dba438/OGRE/extdata/output"
##
## $gvizPlotsFolder
## [1] "/tmp/Rtmp0EKTm9/Rinst214d13dba438/OGRE/extdata/gvizPlots"
##
## $summaryDT
## list()
##
## $itracks
## list()
Query and subject datasets are read by loadAnnotations()
and stored in the OGREDataSet
as GRanges
objects. We are going to read in the following example datasets:
## Reading query dataset...
## Reading subject datasets...
OGRE uses your dataset file names to label query and subjects
internally, we can check these names by using the names()
function since every OGREDataSet
is a
GRangesList
.
## [1] "genes" "CGI" "TFBS"
Let’s have a look at the stored datasets:
## GRangesList object of length 3:
## $genes
## GRanges object with 242 ranges and 3 metadata columns:
## seqnames ranges strand | ID name
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] 21 10906201-11029719 - | ENSG00000166157 TPTE
## [2] 21 14741931-14745386 - | ENSG00000256715 AL050302.1
## [3] 21 14982498-15013906 + | ENSG00000166351 POTED
## [4] 21 15051621-15053459 - | ENSG00000269011 AL050303.1
## [5] 21 15481134-15583166 - | ENSG00000188992 LIPI
## ... ... ... ... . ... ...
## [238] 21 47720095-47743789 - | ENSG00000160298 C21orf58
## [239] 21 47744036-47865682 + | ENSG00000160299 PCNT
## [240] 21 47878812-47989926 + | ENSG00000160305 DIP2A
## [241] 21 48018875-48025121 - | ENSG00000160307 S100B
## [242] 21 48055079-48085036 + | ENSG00000160310 PRMT2
## score
## <numeric>
## [1] NA
## [2] NA
## [3] NA
## [4] NA
## [5] NA
## ... ...
## [238] NA
## [239] NA
## [240] NA
## [241] NA
## [242] NA
## -------
## seqinfo: 25 sequences (1 circular) from hg19 genome
##
## $CGI
## GRanges object with 365 ranges and 3 metadata columns:
## seqnames ranges strand | ID name score
## <Rle> <IRanges> <Rle> | <character> <character> <numeric>
## [1] 21 9437273-9439473 * | 26635 CpG:_285 NA
## [2] 21 9483486-9484663 * | 26636 CpG:_165 NA
## [3] 21 9647867-9648116 * | 26637 CpG:_18 NA
## [4] 21 9708936-9709231 * | 26638 CpG:_31 NA
## [5] 21 9825443-9826296 * | 26639 CpG:_120 NA
## ... ... ... ... . ... ... ...
## [361] 21 48018543-48018791 * | 26995 CpG:_21 NA
## [362] 21 48055200-48056060 * | 26996 CpG:_88 NA
## [363] 21 48068518-48068808 * | 26997 CpG:_24 NA
## [364] 21 48081242-48081849 * | 26998 CpG:_55 NA
## [365] 21 48087201-48088106 * | 26999 CpG:_93 NA
## -------
## seqinfo: 25 sequences (1 circular) from hg19 genome
##
## $TFBS
## GRanges object with 48761 ranges and 3 metadata columns:
## seqnames ranges strand | ID name
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] 21 29884415-29884427 + | GATA1.85108 GATA1_04
## [2] 21 46923766-46923780 + | CDP.81529 CDP_02
## [3] 21 9491627-9491638 - | HFH1.46541 HFH1_01
## [4] 21 9491706-9491725 - | PPARA.24892 PPARA_01
## [5] 21 9491792-9491815 + | GFI1.35413 GFI1_01
## ... ... ... ... . ... ...
## [48757] 21 48083381-48083404 + | STAT5A.43326 STAT5A_02
## [48758] 21 48083400-48083419 + | ARNT.19751 ARNT_02
## [48759] 21 48084826-48084841 + | BRN2.40426 BRN2_01
## [48760] 21 48084830-48084847 + | FOXJ2.121681 FOXJ2_01
## [48761] 21 48084834-48084845 + | NKX3A.47953 NKX3A_01
## score
## <numeric>
## [1] 891
## [2] 831
## [3] 865
## [4] 757
## [5] 817
## ... ...
## [48757] 751
## [48758] 792
## [48759] 803
## [48760] 889
## [48761] 851
## -------
## seqinfo: 25 sequences (1 circular) from hg19 genome
To find overlaps between your query and subject datasets we call
fOverlaps()
. Internally OGRE makes use of the
GenomicRanges
package to calculate full and partial overlap
as schematically shown.
Any existing subject - query hits are then listed in
detailDT
and stored as a data.table
.
## queryID queryType subjID subjType queryChr queryStart queryEnd
## <char> <char> <char> <char> <char> <int> <int>
## 1: ENSG00000166157 genes 26649 CGI 21 10906201 11029719
## 2: ENSG00000269011 genes 26654 CGI 21 15051621 15053459
## queryStrand subjChr subjStart subjEnd subjStrand overlapWidth overlapRatio
## <char> <char> <int> <int> <char> <int> <num>
## 1: - 21 10989914 10991413 * 1500 0.01214388
## 2: - 21 15052411 15052644 * 234 0.12724307
The summary plot provides us with useful information about the number of overlaps between your datasets.
Using the Gviz
visualization each query can be displayed
with all overlapping subject elements. Choose labels for all region
tracks by supplying a trackRegionLabels
vector. Plots are
stored in the same location as your dataset files.
myOGRE <- gvizPlot(myOGRE,"ENSG00000142168",showPlot = TRUE,
trackRegionLabels = setNames(c("name","name"),c("genes","CGI")))
## Plotting query: ENSG00000142168
The overlap distribution can be generated with
summarizeOverlap(myOGRE)
and outputs a table with
informative statistics such as minimum, lower quantile, mean, median,
upper quantile, and maximum number of overlaps per region and per
dataset. Overlap distribution can also be displayed as histograms using
plotHist(myOGRE)
and accessed by
metadata(myOGRE)$hist
and
metadata(myOGRE)$summaryDT
. Two tables / plots are
generated. The first one showing numbers for regions with and without
overlap and the second one showing numbers only for regions with overlap
by excluding all others. Next, we generate an histogram with the number
of TFBS per gene (x-axis, log scale) and the TFBS frequency (y-axis).
When focusing only on regions with overlap, we see that genes have on
average (median) 54 TFBS overlaps (black dashed line).
## $includes0
## CGI TFBS
## Min. 0.000000 0.0000
## 1st Qu. 0.000000 8.0000
## Median 1.000000 36.0000
## Mean 1.210744 119.6116
## 3rd Qu. 1.750000 129.7500
## Max. 14.000000 3136.0000
##
## $excludes0
## CGI TFBS
## Min. 1.00000 1.0000
## 1st Qu. 1.00000 15.0000
## Median 1.00000 54.0000
## Mean 2.02069 139.8357
## 3rd Qu. 2.00000 159.5000
## Max. 14.00000 3136.0000
## NA's 97.00000 35.0000
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It is possible to create an average coverage profile of all gene-TFBS
overlaps, split in 100 bins, which represent gene bodies of all 242
genes. Both, forward and reverse coding genes are arranged on the x-Axis
and peaks indicate an TFBS overlap enrichment. Overlap coverage is
calculated as the sum of all gene TFBS overlaps in 5’-3’direction.
Generated plots can be accessed by
metadata(myOGRE)$covPlot$TFBS
and the resulting profile
shows an accumulation of TFBS around gene start and end positions.
## Generating coverage plot(s), this might take a while...
## Excluding regions with nucleotides<nbin
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
AnnotationHub offers a wide range of annotated datasets which can be
manually aquired but need some parsing to work with OGRE as detailed in
vignette section Frequently Asked Questions(FAQ). For
convenience addDataSetFromHub()
adds one of the predefined
human datasets of listPredefinedDataSets()
to an
OGREDataSet. Those are taken from AnnotationHub and are ready to use for
OGRE. We start by creating an empty OGREDataSet and attaching one
dataset after another, whereby one query and two subjects are added. The
datasets are now ready for further analysis.
myOGRE <- OGREDataSet()
listPredefinedDataSets()
myOGRE <- addDataSetFromHub(myOGRE,"protCodingGenes","query")
myOGRE <- addDataSetFromHub(myOGRE,"CGI","subject")
myOGRE <- addDataSetFromHub(myOGRE,"TFBS","subject")
names(myOGRE)
As you can see, the three datasets proteinCodingGenes, CGI and TFBS
are stored within OGRE. You can then continue with overlap analysis
using fOverlaps()
.
To offer more flexibility addGRanges()
enables the user
to attach additional datasets to OGRE in form of GenomicRanges objects.
Again we start by creating an empty OGREDataSet and generate an example
GenomicRanges object which is then added to your OGREDataSet either as
“query” or “subject”.
Use AnnotationHub()
to connect to AnnotationHub. Each
dataset is stored under a unique ID and can be accessed in a list like
fashion i.e. aH[["AH5086"]]
. Queries like
c("GRanges","Homo sapiens", "CpG")
enable browsing through
datasets. In this case we are searching for human CpG islands ranges
stored as GenomicRanges objects. For more information refer to
?AnnotationHub
To make those datasets compatible with OGRE
additional parsing is needed as stated in How to add custom
GenomicRanges datasets?
Any GenomicRanges datasets can be added that fulfill basic compatibility requirements. GenomicRanges objects must:
Use GenomeInfoDb::genome()
on any GenomicRanges object
to get/set genome information
Use GenomeInfoDb::seqinfo()
on any GenomicRanges object
to get/set chromosome information
Use S4Vectors::mcols()
on any GenomicRanges object to
get/set metadata information
Datasets from external sources are often stored as .gff (v2&v3) files. Once those files exist in the query or subject folder and their attribute columns contain “ID” and “name” information, OGRE tries to load them. Working example .gff files can be found on OGRE’s github page in folder: inst/extdata/gffTest.
Datasets stored as tabular files like .csv or .bed may need some
preprocessing for them work with OGRE. We recommend reading them in with
read.table()
or data.table::fread()
to obtain
a data frame object. After making sure the dataset complies with the
requirements in section How to add custom
GenomicRanges datasets?,
GenomicRanges::makeGRangesFromDataFrame()
offers a
convenient way to generate GenomicRanges object from data frames.
Both, partial overlap, where only a part of two (or more) regions are overlapping and complete overlap, where one region is completely overlapped by another, are reported.
OGRE automatically infers dataset names based on your file names. You
can either change your file names before you start OGRE or you can use
names(myOGRE) <- c("NewName1", "NewName2","...")
after
you read in your datasets.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] OGRE_1.11.0 S4Vectors_0.45.0 BiocGenerics_0.53.1
## [4] generics_0.1.3 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] rstudioapi_0.17.1 jsonlite_1.8.9
## [5] magrittr_2.0.3 GenomicFeatures_1.59.0
## [7] farver_2.1.2 fs_1.6.5
## [9] BiocIO_1.17.0 zlibbioc_1.52.0
## [11] vctrs_0.6.5 memoise_2.0.1
## [13] Rsamtools_2.23.0 RCurl_1.98-1.16
## [15] base64enc_0.1-3 htmltools_0.5.8.1
## [17] S4Arrays_1.7.1 progress_1.2.3
## [19] AnnotationHub_3.15.0 curl_6.0.0
## [21] SparseArray_1.7.1 Formula_1.2-5
## [23] shinyFiles_0.9.3 sass_0.4.9
## [25] bslib_0.8.0 htmlwidgets_1.6.4
## [27] Gviz_1.51.0 httr2_1.0.6
## [29] cachem_1.1.0 buildtools_1.0.0
## [31] GenomicAlignments_1.43.0 mime_0.12
## [33] lifecycle_1.0.4 pkgconfig_2.0.3
## [35] Matrix_1.7-1 R6_2.5.1
## [37] fastmap_1.2.0 shiny_1.9.1
## [39] GenomeInfoDbData_1.2.13 MatrixGenerics_1.19.0
## [41] digest_0.6.37 colorspace_2.1-1
## [43] AnnotationDbi_1.69.0 Hmisc_5.2-0
## [45] GenomicRanges_1.59.0 RSQLite_2.3.7
## [47] labeling_0.4.3 filelock_1.0.3
## [49] fansi_1.0.6 mgcv_1.9-1
## [51] httr_1.4.7 abind_1.4-8
## [53] compiler_4.4.2 withr_3.0.2
## [55] bit64_4.5.2 htmlTable_2.4.3
## [57] backports_1.5.0 BiocParallel_1.41.0
## [59] DBI_1.2.3 highr_0.11
## [61] biomaRt_2.63.0 rappdirs_0.3.3
## [63] DelayedArray_0.33.1 rjson_0.2.23
## [65] tools_4.4.2 foreign_0.8-87
## [67] httpuv_1.6.15 nnet_7.3-19
## [69] glue_1.8.0 restfulr_0.0.15
## [71] nlme_3.1-166 promises_1.3.0
## [73] grid_4.4.2 checkmate_2.3.2
## [75] cluster_2.1.6 gtable_0.3.6
## [77] BSgenome_1.75.0 shinyBS_0.61.1
## [79] tidyr_1.3.1 ensembldb_2.31.0
## [81] data.table_1.16.2 hms_1.1.3
## [83] xml2_1.3.6 utf8_1.2.4
## [85] XVector_0.47.0 BiocVersion_3.21.1
## [87] pillar_1.9.0 stringr_1.5.1
## [89] later_1.3.2 splines_4.4.2
## [91] dplyr_1.1.4 BiocFileCache_2.15.0
## [93] lattice_0.22-6 rtracklayer_1.67.0
## [95] bit_4.5.0 deldir_2.0-4
## [97] biovizBase_1.55.0 tidyselect_1.2.1
## [99] maketools_1.3.1 Biostrings_2.75.0
## [101] knitr_1.48 gridExtra_2.3
## [103] IRanges_2.41.0 ProtGenerics_1.39.0
## [105] SummarizedExperiment_1.37.0 shinydashboard_0.7.2
## [107] xfun_0.49 Biobase_2.67.0
## [109] matrixStats_1.4.1 DT_0.33
## [111] stringi_1.8.4 UCSC.utils_1.3.0
## [113] lazyeval_0.2.2 yaml_2.3.10
## [115] evaluate_1.0.1 codetools_0.2-20
## [117] interp_1.1-6 tibble_3.2.1
## [119] BiocManager_1.30.25 cli_3.6.3
## [121] rpart_4.1.23 xtable_1.8-4
## [123] munsell_0.5.1 jquerylib_0.1.4
## [125] dichromat_2.0-0.1 Rcpp_1.0.13-1
## [127] GenomeInfoDb_1.43.0 dbplyr_2.5.0
## [129] png_0.1-8 XML_3.99-0.17
## [131] parallel_4.4.2 assertthat_0.2.1
## [133] ggplot2_3.5.1 blob_1.2.4
## [135] prettyunits_1.2.0 latticeExtra_0.6-30
## [137] jpeg_0.1-10 AnnotationFilter_1.31.0
## [139] bitops_1.0-9 VariantAnnotation_1.53.0
## [141] scales_1.3.0 purrr_1.0.2
## [143] crayon_1.5.3 rlang_1.1.4
## [145] KEGGREST_1.47.0