KnowYourCG (KYCG) is a supervised learning framework designed for the functional analysis of DNA methylation data. Unlike existing tools that focus on genes or genomic intervals, KnowYourCG directly targets CpG dinucleotides, featuring automated supervised screenings of diverse biological and technical influences, including sequence motifs, transcription factor binding, histone modifications, replication timing, cell-type-specific methylation, and trait associations. KnowYourCG addresses the challenges of data sparsity in various methylation datasets, including low-pass Nanopore sequencing, single-cell DNA methylomes, 5-hydroxymethylation profiles, spatial DNA methylation maps, and array-based datasets for epigenome-wide association studies and epigenetic clocks.
The input to KYCG is a CpG set (query). The CpG sets can represent differential methylation, results from an epigenome-wide association studies, or any sets that may be derived from analysis. If array experiment, the preferred format of the query sets is a character vector of cg-numbers, following the standard Infinium array naming nomenclature.
The following commands prepare the use of knowYourCG. Several database sets are retrieved and caching is performed to enable faster access in future enrichment testing. More information on viewing and accessing available database sets is discussed later on.
library(knowYourCG)
library(sesameData)
sesameDataCache(data_titles=c("genomeInfo.hg38","genomeInfo.mm10",
"KYCG.MM285.tissueSignature.20211211",
"MM285.tissueSignature","MM285.address",
"probeIDSignature","KYCG.MM285.designGroup.20210210",
"KYCG.MM285.chromHMM.20210210",
"KYCG.MM285.TFBSconsensus.20220116",
"KYCG.MM285.HMconsensus.20220116",
"KYCG.MM285.chromosome.mm10.20210630"
))
The following example uses a query of CpGs methylated in mouse primordial germ cells (design group PGCMeth). First get the CG list using the following code.
## [1] "cg36615889_TC11" "cg36646136_BC21" "cg36647910_BC11" "cg36857173_TC21"
## [5] "cg36877289_BC21" "cg36899653_BC21"
Now test the enrichment. By default, KYCG will select all the categorical groups available but we can specify a subset of databases.
dbs <- c("KYCG.MM285.chromHMM.20210210",
"KYCG.HM450.TFBSconsensus.20211013",
"KYCG.MM285.HMconsensus.20220116",
"KYCG.MM285.tissueSignature.20211211",
"KYCG.MM285.chromosome.mm10.20210630",
"KYCG.MM285.designGroup.20210210")
results_pgc <- testEnrichment(query,databases = dbs,platform="MM285")
head(results_pgc)
## estimate p.value log10.p.value test nU nQ nD overlap
## 123 1024.000000 0.000000e+00 -1529.07602 Log2(OR) 296070 474 474 474
## 54 7.622407 0.000000e+00 -528.31906 Log2(OR) 296070 474 10603 415
## 6 5.943042 5.091536e-244 -243.29315 Log2(OR) 296070 474 3575 197
## 48 3.942719 1.890431e-113 -112.72344 Log2(OR) 296070 474 9641 160
## 37 1.809729 2.626803e-13 -12.58057 Log2(OR) 296070 474 10089 52
## 9 1.794176 1.324557e-06 -5.87793 Log2(OR) 296070 474 4113 22
## cf_Jaccard cf_MCC cf_overlap cf_NPMI cf_SorensenDice FDR
## 123 1.000000000 1.00000000 1.0000000 1.0000000 1.000000000 0.000000e+00
## 54 0.038923279 0.18095636 0.8755274 0.4865289 0.074930035 0.000000e+00
## 6 0.051142264 0.14795184 0.4156118 0.4837397 0.097307977 2.342107e-242
## 48 0.016072325 0.06880970 0.3375527 0.3108444 0.031636184 6.521987e-112
## 37 0.004947198 0.01669266 0.1097046 0.1352113 0.009845688 7.249978e-12
## 9 0.004819277 0.01112670 0.0464135 0.1268791 0.009592326 3.046480e-05
## group dbname n_min
## 123 KYCG.MM285.designGroup.20210210 PGCMeth NA
## 54 KYCG.MM285.HMconsensus.20220116 H3K9me3 14
## 6 KYCG.MM285.chromHMM.20210210 Het NA
## 48 KYCG.MM285.HMconsensus.20220116 H3K79me3 2
## 37 KYCG.MM285.HMconsensus.20220116 H3K36me3 30
## 9 KYCG.MM285.chromHMM.20210210 Quies3 NA
As expected, the PGCMeth group itself appears on the top of the list.
But one can also find histone H3K9me3, chromHMM Het
and
transcription factor Trim28
binding enriched in this CG
group.
The curated target features are called the knowledgebase sets. We have curated a variety of knowledgebases that represent different categorical and continuous methylation features such as CpGs associated with chromatin states, technical artifacts, gene association and gene expression correlation, transcription factor binding sites, tissue specific methylation, CpG density, etc.
Array knowledgebases are available as listed in the following tables.
Array | Link |
---|---|
MSA | https://github.com/zhou-lab/KYCG_knowledgebase_MSA |
Human (EPICv2) | https://github.com/zhou-lab/KYCG_knowledgebase_EPICv2 |
Human (EPIC) | https://github.com/zhou-lab/KYCG_knowledgebase_EPIC |
Human (HM450) | https://github.com/zhou-lab/KYCG_knowledgebase_HM450 |
The success of enrichment testing depends on the availability of biologically-relevant databases. To reflect the biological meaning of databases and facilitate selective testing, we have organized our database sets into different groups. Each group contains one or multiple databases. Here is how to find the names of pre-built database groups:
## # A tibble: 11 × 3
## Title Description type
## <chr> <chr> <chr>
## 1 KYCG.MM285.chromHMM.20210210 chromHMM consensus from mouseENCODE cate…
## 2 KYCG.MM285.chromosome.mm10.20210630 CpG position by mm10 chromosomes f… cate…
## 3 KYCG.MM285.designGroup.20210210 MM285 probe design categories cate…
## 4 KYCG.MM285.HMconsensus.20220116 CpGs associated with consensus his… cate…
## 5 KYCG.MM285.Mask.20220123 MM285 probe masking 20220123_MM285… cate…
## 6 KYCG.MM285.metagene.20220126 metagene coordinates with respect … cate…
## 7 KYCG.MM285.probeType.20210630 Probe type database sets (rs, cg, … cate…
## 8 KYCG.MM285.seqContext.20210630 Sequence context groups, e.g., CpG… cate…
## 9 KYCG.MM285.seqContextN.20210630 KYCG numerical database holding Se… nume…
## 10 KYCG.MM285.TFBSconsensus.20220116 CpGs associated with consensus TFB… cate…
## 11 KYCG.MM285.tissueSignature.20211211 MM285 probes associated with tissu… cate…
The listDBGroups()
function returns a data frame
containing information of these databases. The Title column is the
accession key one needs for the testEnrichment
function.
With the accessions, one can either directly use them in the
testEnrichment
function or explicitly call the
getDBs()
function to retrieve databases themselves. Caching
these databases on the local machine is important, for two reasons: it
limits the number of requests sent to the Bioconductor server, and
secondly it limits the amount of time the user needs to wait when
re-downloading database sets. For this reason, one should run
sesameDataCache()
before loading in any database sets. This
will take some time to download all of the database sets but this only
needs to be done once per installation. During the analysis the database
sets can be identified using these accessions. knowYourCG also does some
guessing when a unique substring is given. For example, the string
“MM285.designGroup” retrieves the “KYCG.MM285.designGroup.20210210”
database. Let’s look at the database group which we had used as the
query (query and database are reciprocal) in our first example:
## Selected the following database groups:
## 1. KYCG.MM285.designGroup.20210210
In total, 32 datasets have been loaded for this group. We can get the “PGCMeth” as an element of the list:
## chr [1:474] "cg36615889_TC11" "cg36646136_BC21" "cg36647910_BC11" ...
## - attr(*, "group")= chr "KYCG.MM285.designGroup.20210210"
## - attr(*, "dbname")= chr "PGCMeth"
On subsequent runs of the getDBs()
function, the
database loading can be faster thanks to the sesameData in-memory caching, if the
corresponding database has been loaded.
The main work horse function for testing enrichment of a categorical
query against categorical databases is the testEnrichment
function. This function will perform Fisher’s exact testing of the query
against each database set (one-tailed by default, but two-tailed
optionally) and reports overlap and enrichment statistics.
Choice of universe set: Universe set is the set of all probes for a given platform. It can either be passed in as an argument called
universeSet
or the platform name can be passed with argumentplatform
. If neither of these are supplied, the universe set will be inferred from the probes in the query.
library(SummarizedExperiment)
## prepare a query
df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "fetal_brain" & df$type == "Hypo"]
results <- testEnrichment(query, "TFBS", platform="MM285")
results %>% dplyr::filter(overlap>10) %>% head
## estimate p.value log10.p.value test nU nQ nD overlap
## 1 3.058586 4.813228e-18 -17.317564 Log2(OR) 296070 200 6645 32
## 2 3.245138 5.187778e-18 -17.285019 Log2(OR) 296070 200 5228 29
## 3 2.959109 1.916663e-14 -13.717454 Log2(OR) 296070 200 5604 26
## 4 2.244055 3.357158e-12 -11.474028 Log2(OR) 296070 200 12296 34
## 5 2.536127 3.448999e-10 -9.462307 Log2(OR) 296070 200 6195 22
## 6 1.876932 1.114057e-04 -3.953092 Log2(OR) 296070 200 5509 13
## cf_Jaccard cf_MCC cf_overlap cf_NPMI cf_SorensenDice FDR
## 1 0.004696903 0.024144761 0.160 0.2150698 0.009349890 1.110185e-15
## 2 0.005371365 0.025138207 0.145 0.2280937 0.010685335 1.110185e-15
## 3 0.004499827 0.021191758 0.130 0.2063000 0.008959338 2.734439e-12
## 4 0.002728294 0.016741332 0.170 0.1553535 0.005441741 3.592159e-10
## 5 0.003452063 0.016180543 0.110 0.1745582 0.006880375 2.952344e-08
## 6 0.002282303 0.008925972 0.065 0.1246681 0.004554213 7.946942e-03
## n_min group dbname
## 1 1 KYCG.MM285.TFBSconsensus.20220116 LHX3
## 2 1 KYCG.MM285.TFBSconsensus.20220116 POU3F1
## 3 8 KYCG.MM285.TFBSconsensus.20220116 ISL1
## 4 2 KYCG.MM285.TFBSconsensus.20220116 ONECUT2
## 5 1 KYCG.MM285.TFBSconsensus.20220116 SOX3
## 6 1 KYCG.MM285.TFBSconsensus.20220116 NKX2-1
## prepare another query
query <- df$Probe_ID[df$branch == "fetal_liver" & df$type == "Hypo"]
results <- testEnrichment(query, "TFBS", platform="MM285")
results %>% dplyr::filter(overlap>10) %>%
dplyr::select(dbname, estimate, test, FDR) %>% head
## dbname estimate test FDR
## 1 TAL1 4.253039 Log2(OR) 8.749398e-42
## 2 GATA1 3.738643 Log2(OR) 9.060254e-30
## 3 SMAD1 3.162168 Log2(OR) 2.924401e-26
## 4 LDB1 2.084605 Log2(OR) 4.586066e-06
## 5 MYB 1.497470 Log2(OR) 8.785247e-04
## 6 GATA2 1.433997 Log2(OR) 7.724124e-03
The output of each test contains multiple variables including: the estimate (fold enrichment), p-value, overlap statistics, type of test, as well as the name of the database set and the database group. By default, the results are sorted by -log10 of the of p-value and the fold enrichment.
The nQ
and nD
columns identify the length
of the query set and the database set, respectively. Often, it’s
important to examine the extent of overlap between the two sets, so that
metric is reported as well in the overlap
column.
A query set represents probes of interest. It may either be in the form of a character vector where the values correspond to probe IDs or a named numeric vector where the names correspond to probe IDs. The query and database definition is rather arbitrary. One can regard a database as a query and turn a query into a database, like in our first example. In real world scenario, query can come from differential methylation testing, unsupervised clustering, correlation with a phenotypic trait, and many others. For example, we could consider CpGs that show tissue-specific methylation as the query. We are getting the B-cell-specific hypomethylation.
df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "B_cell"]
head(query)
## [1] "cg32668003_TC11" "cg45118317_TC11" "cg37563895_TC11" "cg46105105_BC11"
## [5] "cg47206675_TC21" "cg38855216_TC21"
This query set represents hypomethylated probes in Mouse B-cells from the MM285 platform. This specific query set has 168 probes.
A special case of set enrichment is to test whether CpGs are
associated with specific genes. Automating the enrichment test process
only works when the number of database sets is small. This is important
when targeting all genes as there are tens of thousands of genes on each
platform. By testing only those genes that overlap with the query set,
we can greatly reduce the number of tests. For this reason, the gene
enrichment analysis is a special case of these enrichment tests. We can
perform this analysis using the buildGeneDBs()
function.
library(knowYourCG)
library(sesameData)
library(SummarizedExperiment)
query <- names(sesameData_getProbesByGene("Dnmt3a", "MM285"))
results <- testEnrichment(query,
buildGeneDBs(query, max_distance=100000, platform="MM285"),
platform="MM285")
main_stats <- c("dbname","estimate","gene_name","FDR", "nQ", "nD", "overlap")
results[,main_stats]
## dbname estimate gene_name FDR nQ nD overlap
## 5 ENSMUSG00000073242.4 1024.00000 Dnmt3aos 1.563399e-137 36 63 36
## 3 ENSMUSG00000020661.15 1024.00000 Dnmt3a 5.221704e-134 36 75 36
## 7 ENSMUSG00000112271.1 16.36791 Gm9088 3.532663e-118 36 60 32
## 2 ENSMUSG00000020660.6 15.45278 Pomc 6.421680e-108 36 63 30
## 1 ENSMUSG00000020658.10 13.96872 Efr3b 6.004251e-88 36 74 26
## 8 ENSMUSG00000112517.1 13.37524 Gm48001 3.814402e-70 36 60 21
## 4 ENSMUSG00000071454.13 12.51988 Dtnb 2.249011e-48 36 51 15
## 6 ENSMUSG00000092286.1 12.62688 Dtnbos 2.335881e-33 36 28 10
As expected, we recover our targeted gene (Dnmt3a).
Gene enrichment testing can easily be included with default or user specified database sets by setting include_genes=TRUE:
query <- names(sesameData_getProbesByGene("Dnmt3a", "MM285"))
dbs <- c("KYCG.MM285.chromHMM.20210210","KYCG.HM450.TFBSconsensus.20211013",
"KYCG.MM285.chromosome.mm10.20210630")
results <- testEnrichment(query,databases=dbs,
platform="MM285",include_genes=TRUE)
main_stats <- c("dbname","estimate","gene_name","FDR", "nQ", "nD", "overlap")
results[,main_stats] %>%
head()
## dbname estimate gene_name FDR nQ nD overlap
## 41 ENSMUSG00000020661.15 1024.000000 Dnmt3a 6.204571e-153 36 37 36
## 22 chr12 1024.000000 <NA> 5.564283e-51 36 10949 36
## 42 ENSMUSG00000073242.4 1024.000000 Dnmt3aos 2.893398e-31 36 8 8
## 2 EnhG 4.008313 <NA> 5.126860e-05 36 3640 6
## 17 Tx 2.588614 <NA> 3.081189e-04 36 17801 10
## 1 Enh 2.800184 <NA> 3.142022e-03 36 8269 6
One can get all the genes associated with a probe set and test the
Gene Ontology of the probe-associated genes using the
testGO()
function, which internally utilizes g:Profiler2 for the
enrichment analysis:
library(gprofiler2)
df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "fetal_liver" & df$type == "Hypo"]
res <- testGO(query, platform="MM285",organism = "mmusculus")
head(res$result)
## query significant p_value term_size query_size intersection_size
## 6 query_1 TRUE 7.964257e-09 11561 128 92
## 7 query_1 TRUE 4.732933e-05 17412 128 109
## 25 query_1 TRUE 1.986984e-04 5296 130 59
## 12 query_1 TRUE 2.442191e-04 441 126 13
## 13 query_1 TRUE 2.442191e-04 441 126 13
## 26 query_1 TRUE 2.975598e-04 8023 130 77
## precision recall term_id source
## 6 0.7187500 0.007957789 GO:0005737 GO:CC
## 7 0.8515625 0.006260051 GO:0005622 GO:CC
## 25 0.4538462 0.011140483 TF:M05599 TF
## 12 0.1031746 0.029478458 GO:0030695 GO:MF
## 13 0.1031746 0.029478458 GO:0060589 GO:MF
## 26 0.5923077 0.009597407 TF:M00189 TF
## term_name effective_domain_size
## 6 cytoplasm 26995
## 7 intracellular anatomical structure 26995
## 25 Factor: WT1; motif: NGCGGGGGGGTSMMCYN 21629
## 12 GTPase regulator activity 25063
## 13 nucleoside-triphosphatase regulator activity 25063
## 26 Factor: AP-2; motif: MKCCCSCNGGCG 21629
## source_order parents
## 6 309 GO:00056....
## 7 237 GO:0110165
## 25 3828 TF:M00000
## 12 3988 GO:0060589
## 13 8391 GO:0030234
## 26 118 TF:M00000
Sometimes it may be of interest whether a query set of probes share
close genomic proximity. Co-localization may suggest co-regulation or
co-occupancy in the same regulatory element. KYCG can test for genomic
proximity using the testProbeProximity()
function. Poisson
statistics for the expected # of co-localized hits from the given query
size (lambda) and the actual co-localized CpG pairs along with the p
value are returned:
df <- rowData(sesameDataGet('MM285.tissueSignature'))
probes <- df$Probe_ID[df$branch == "fetal_liver" & df$type == "Hypo"]
res <- testProbeProximity(probeIDs=probes)
head(res)
## $Stats
## nQ Hits Lambda P.val
## 1 194 4 0.03 1.97502e-10
##
## $Clusters
## seqnames start end distance
## 1 chr1 165770666 165770667 11
## 2 chr1 165770677 165770678 377829
## 3 chr5 75601915 75601916 29
## 4 chr5 75601944 75601945 73617660
## 5 chr9 110235046 110235047 26
## 6 chr9 110235072 110235073 NA
## 7 chr11 32245638 32245639 95
## 8 chr11 32245733 32245734 63088309
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] gprofiler2_0.2.3 SummarizedExperiment_1.37.0
## [3] Biobase_2.67.0 GenomicRanges_1.59.1
## [5] GenomeInfoDb_1.43.2 IRanges_2.41.2
## [7] S4Vectors_0.45.2 MatrixGenerics_1.19.0
## [9] matrixStats_1.4.1 sesameData_1.24.0
## [11] ExperimentHub_2.15.0 AnnotationHub_3.15.0
## [13] BiocFileCache_2.15.0 dbplyr_2.5.0
## [15] BiocGenerics_0.53.3 generics_0.1.3
## [17] knowYourCG_1.3.15
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2 dplyr_1.1.4
## [4] blob_1.2.4 bitops_1.0-9 filelock_1.0.3
## [7] Biostrings_2.75.3 RCurl_1.98-1.16 lazyeval_0.2.2
## [10] fastmap_1.2.0 digest_0.6.37 mime_0.12
## [13] lifecycle_1.0.4 KEGGREST_1.47.0 RSQLite_2.3.9
## [16] magrittr_2.0.3 compiler_4.4.2 rlang_1.1.4
## [19] sass_0.4.9 tools_4.4.2 utf8_1.2.4
## [22] yaml_2.3.10 data.table_1.16.4 knitr_1.49
## [25] htmlwidgets_1.6.4 S4Arrays_1.7.1 bit_4.5.0.1
## [28] curl_6.0.1 DelayedArray_0.33.3 plyr_1.8.9
## [31] RColorBrewer_1.1-3 abind_1.4-8 withr_3.0.2
## [34] purrr_1.0.2 sys_3.4.3 grid_4.4.2
## [37] wheatmap_0.2.0 colorspace_2.1-1 ggplot2_3.5.1
## [40] scales_1.3.0 cli_3.6.3 rmarkdown_2.29
## [43] crayon_1.5.3 httr_1.4.7 reshape2_1.4.4
## [46] tzdb_0.4.0 DBI_1.2.3 cachem_1.1.0
## [49] stringr_1.5.1 AnnotationDbi_1.69.0 BiocManager_1.30.25
## [52] XVector_0.47.1 vctrs_0.6.5 Matrix_1.7-1
## [55] jsonlite_1.8.9 hms_1.1.3 bit64_4.5.2
## [58] ggrepel_0.9.6 maketools_1.3.1 plotly_4.10.4
## [61] tidyr_1.3.1 jquerylib_0.1.4 glue_1.8.0
## [64] stringi_1.8.4 gtable_0.3.6 BiocVersion_3.21.1
## [67] UCSC.utils_1.3.0 munsell_0.5.1 tibble_3.2.1
## [70] pillar_1.10.0 rappdirs_0.3.3 htmltools_0.5.8.1
## [73] GenomeInfoDbData_1.2.13 R6_2.5.1 lattice_0.22-6
## [76] evaluate_1.0.1 readr_2.1.5 png_0.1-8
## [79] memoise_2.0.1 bslib_0.8.0 Rcpp_1.0.13-1
## [82] SparseArray_1.7.2 xfun_0.49 buildtools_1.0.0
## [85] pkgconfig_2.0.3