This vignette exemplifies how to perform unsupervised footprint detection and quantification using FootprintCharter as per Baderna & Barzaghi et al., 2024 and Barzaghi et al., 2024.
FootprintCharter partitions molecules by their methylation
patterns without relying on orthogonal genomic annotations such as TF
motifs.
At the core, FootprintCharter performs a series of filtering
steps to obtain a dense matrix of cytosines spanning 80bps. The
molecules populating this matrix are than partitioned using the
k-medoids algorithm. Footprints for TFs and nucleosomes are
then detected by segmenting the average SMF signal separately for each
partition. Optionally, TF footprints can be annotated with a given list
of annotated motifs. Finally, biologically equivalent TF footprints are
aggregated based on coordinates overlaps. For more details consult Barzaghi et al., 2024.
The function FootprintCharter is tuned with 5 key parameters
FootprintCharter(
MethSM = MethSM,
RegionOfInterest = RegionOfInterest,
coverage = 30,
k = 16,
n = 5,
TF.length = c(5,75),
nucleosome.length = c(120,1000),
cytosine.coverage.thr = 5,
verbose = TRUE
) -> FC_results
## 1. Pooling molecules from all samples
## 2. Computing sliding windows
## Discarding 1973/2233 molecules that do not entirely cover the RegionOfInterest
## Discarding 41/59 cytosines that are NA in more than 0% of the reads
## Discarding 0/260 molecules that have NAs in more than 0% of the remaining cytosines
## 3. computing distance matrix
## 4. Partitioning
## partitions too slim detected...retrying with k=15
## partitions too slim detected...retrying with k=14
## partitions too slim detected...retrying with k=13
## partitions too slim detected...retrying with k=12
## partitions too slim detected...retrying with k=11
## partitions too slim detected...retrying with k=10
## partitions too slim detected...retrying with k=9
## partitions too slim detected...retrying with k=8
## 5. Footprints detection
## 6. Footprints annotation (skipping)
## 7. Footprints aggregation
## 8. Results wrangling
The function FootprintCharter outputs a list of two
items.
The first is a vector of partitioned.molecules, structured in
the same way as the one output by the function SortReads.
rrapply::rrapply(FC_results$partitioned.molecules, f = head, n = 2)
## $SMF_MM_TKO_DE_
## $SMF_MM_TKO_DE_$`1`
## [1] "D00404:273:H5757BCXY:1:1102:14949:3420"
## [2] "D00404:273:H5757BCXY:1:2116:7810:26400"
##
## $SMF_MM_TKO_DE_$`2`
## [1] "D00404:273:H5757BCXY:1:1102:20457:73436"
## [2] "D00404:273:H5757BCXY:1:1110:7624:25325"
##
## $SMF_MM_TKO_DE_$`3`
## [1] "D00404:273:H5757BCXY:1:1205:4416:100764"
## [2] "D00404:273:H5757BCXY:1:1206:16663:42295"
##
## $SMF_MM_TKO_DE_$`4`
## [1] "D00404:273:H5757BCXY:1:1209:3401:67031"
## [2] "D00404:273:H5757BCXY:1:2215:17233:16460"
##
## $SMF_MM_TKO_DE_$`5`
## [1] "D00404:273:H5757BCXY:1:1211:14401:50622"
## [2] "D00404:273:H5757BCXY:2:1101:1697:87040"
##
## $SMF_MM_TKO_DE_$`6`
## [1] "D00404:273:H5757BCXY:1:1215:8843:22685"
## [2] "D00404:273:H5757BCXY:2:1107:2792:41093"
##
## $SMF_MM_TKO_DE_$`7`
## [1] "D00404:273:H5757BCXY:1:2212:1496:27605"
## [2] "D00404:273:H5757BCXY:2:1102:9323:16615"
##
## $SMF_MM_TKO_DE_$`8`
## [1] "D00404:273:H5757BCXY:2:1103:16878:7787"
## [2] "D00404:283:HCFT7BCXY:2:1103:6140:48231"
The second is a data.frame of detected footprints. The
biological.state column indicates the putative nature of the
detected footprints and it can be either of TF, nucleosome, accessible,
unrecognized and noise.
The last two represent cases in which footprints couldn’t be interpreted
either because too short (noise) or because they are detected at the
edge of molecules and hence their true width cannot be established.
head(FC_results$footprints.df)
## partition.nr sample partition.coverage start end width
## 1 2 SMF_MM_TKO_DE_ 47 88106070 88106131 62
## 2 1 SMF_MM_TKO_DE_ 46 88106098 88106209 112
## 3 3 SMF_MM_TKO_DE_ 42 88106098 88106104 7
## 4 3 SMF_MM_TKO_DE_ 42 88106105 88106209 105
## 5 6 SMF_MM_TKO_DE_ 21 88106113 88106113 1
## 6 7 SMF_MM_TKO_DE_ 40 88106113 88106317 205
## nr.cytosines biological.state seqnames TF TF.name footprint.idx
## 1 6 accessible chr6 NA NA 9
## 2 15 accessible chr6 NA NA 1
## 3 1 unrecognized chr6 NA NA 36
## 4 14 accessible chr6 NA NA 1
## 5 1 noise chr6 NA NA 40
## 6 30 accessible chr6 NA NA 1
Optionally, it is possible to annotate TF footprints by providing the argument TFBSs with user-provided motifs. TFBSs needs a GRanges with at least two metadata columns: TF and absolute.idx for TF identity and unique motif index, respectively.
TFBSs = qs::qread(system.file("extdata", "TFBSs_1.qs", package="SingleMoleculeFootprinting"))
TFBSs$absolute.idx = names(TFBSs)
TFBSs
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | TF absolute.idx
## <Rle> <IRanges> <Rle> | <character> <character>
## TFBS_4305215 chr6 88106216-88106226 - | NRF1 TFBS_4305215
## TFBS_4305216 chr6 88106253-88106263 - | NRF1 TFBS_4305216
## -------
## seqinfo: 66 sequences (1 circular) from mm10 genome
FootprintCharter(
MethSM = MethSM,
RegionOfInterest = RegionOfInterest,
TFBSs = TFBSs,
coverage = 30,
k = 16,
n = 5,
TF.length = c(5,75),
nucleosome.length = c(120,1000),
cytosine.coverage.thr = 5,
verbose = FALSE
) -> FC_results
FC_results$footprints.df %>%
filter(biological.state == "TF") %>%
dplyr::select(start, end, partition.nr, TF, TF.name) %>%
tidyr::unnest(cols = c(TF, TF.name))
## # A tibble: 11 × 5
## start end partition.nr TF TF.name
## <int> <int> <chr> <chr> <chr>
## 1 88106132 88106144 2 NA <NA>
## 2 88106156 88106160 5 NA <NA>
## 3 88106210 88106237 1 NRF1 TFBS_4305215
## 4 88106210 88106237 2 NRF1 TFBS_4305215
## 5 88106210 88106237 3 NRF1 TFBS_4305215
## 6 88106238 88106255 8 NA <NA>
## 7 88106241 88106275 3 NRF1 TFBS_4305216
## 8 88106246 88106275 1 NRF1 TFBS_4305216
## 9 88106246 88106275 5 NRF1 TFBS_4305216
## 10 88106318 88106342 1 NA <NA>
## 11 88106318 88106388 2 NA <NA>
This way, the footprints data.frame reports TF motif annotations by populating two columns: TF and TF.name, each a list of annotations for each putative TF footprint.
The function PlotFootprints allows to visually inspect the footprint detection results.
x.axis.breaks = c(
start(resize(RegionOfInterest, width = 500, fix = "center")),
end(resize(RegionOfInterest, width = 500, fix = "center"))
)
PlotFootprints(
MethSM = MethSM,
partitioned.molecules = FC_results$partitioned.molecules,
footprints.df = FC_results$footprints.df,
TFBSs = TFBSs
) +
scale_x_continuous(
limits = x.axis.breaks,
breaks = x.axis.breaks,
labels = format(x.axis.breaks, nsmall=1, big.mark=",")
) +
theme(legend.position = "top")
The function plots the average SMF signal separately for each partition alongside the biological.state annotation. The dashed line indicates the footprint detection threshold.
The function PlotSM can be used to plot binary heatmaps of single molecules partitioned by FootprintCharter.
PlotSM(
MethSM = MethSM,
RegionOfInterest = IRanges::resize(RegionOfInterest, 500, "center"),
SortedReads = FC_results$partitioned.molecules,
sorting.strategy = "custom"
) +
scale_x_continuous(
limits = x.axis.breaks,
breaks = x.axis.breaks,
labels = format(x.axis.breaks, nsmall=1, big.mark=",")
)
## Arranging reads according to custom sorting.strategy
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
It might be desirable to arrange partitions according to a biologically meaningful order, such as TF binding and cobinding.
Users can leverage the plot produced with the PlotFootprints function as exemplified above and instruct a preferred order. In this case, we plot the partitions with two NRF1 footprints (3,1) on top, followed by partitions with single TF footprints (2,5), accessibility (6,7) and nucleosome footprints (4,8).
partitions.order = c(3,1,2,5,6,7,4,8)
ordered.molecules = lapply(FC_results$partitioned.molecules, function(x){x[rev(partitions.order)]})
PlotSM(
MethSM = MethSM,
RegionOfInterest = IRanges::resize(RegionOfInterest, 500, "center"),
SortedReads = ordered.molecules,
sorting.strategy = "custom"
) +
scale_x_continuous(
limits = x.axis.breaks,
breaks = x.axis.breaks,
labels = format(x.axis.breaks, nsmall=1, big.mark=",")
)
## Arranging reads according to custom sorting.strategy
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
Single molecules displaying FootprintCharter annotations can be plotted using the Plot_FootprintCharter_SM function.
Plot_FootprintCharter_SM(
footprints.df = FC_results$footprints.df,
RegionOfInterest = IRanges::resize(RegionOfInterest, 500, "center"),
partitions.order = partitions.order
) +
scale_x_continuous(
limits = x.axis.breaks,
breaks = x.axis.breaks,
labels = format(x.axis.breaks, nsmall=1, big.mark=",")
)
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] lubridate_1.9.4 forcats_1.0.0
## [3] stringr_1.5.1 dplyr_1.1.4
## [5] purrr_1.0.4 readr_2.1.5
## [7] tidyr_1.3.1 tibble_3.2.1
## [9] ggplot2_3.5.1 tidyverse_2.0.0
## [11] GenomicRanges_1.59.1 GenomeInfoDb_1.43.4
## [13] IRanges_2.41.3 S4Vectors_0.45.4
## [15] BiocGenerics_0.53.6 generics_0.1.3
## [17] SingleMoleculeFootprinting_2.1.4
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] jsonlite_1.8.9 magrittr_2.0.3
## [5] GenomicFeatures_1.59.1 farver_2.1.2
## [7] rmarkdown_2.29 BiocIO_1.17.1
## [9] vctrs_0.6.5 memoise_2.0.1
## [11] Rsamtools_2.23.1 RCurl_1.98-1.16
## [13] QuasR_1.47.2 ggpointdensity_0.1.0
## [15] htmltools_0.5.8.1 S4Arrays_1.7.3
## [17] progress_1.2.3 curl_6.2.0
## [19] SparseArray_1.7.5 sass_0.4.9
## [21] bslib_0.9.0 httr2_1.1.0
## [23] cachem_1.1.0 buildtools_1.0.0
## [25] GenomicAlignments_1.43.0 lifecycle_1.0.4
## [27] pkgconfig_2.0.3 Matrix_1.7-2
## [29] R6_2.6.1 fastmap_1.2.0
## [31] GenomeInfoDbData_1.2.13 MatrixGenerics_1.19.1
## [33] digest_0.6.37 colorspace_2.1-1
## [35] miscTools_0.6-28 ShortRead_1.65.0
## [37] patchwork_1.3.0 AnnotationDbi_1.69.0
## [39] RSQLite_2.3.9 hwriter_1.3.2.1
## [41] labeling_0.4.3 filelock_1.0.3
## [43] timechange_0.3.0 httr_1.4.7
## [45] abind_1.4-8 compiler_4.4.2
## [47] Rbowtie_1.47.0 bit64_4.6.0-1
## [49] withr_3.0.2 BiocParallel_1.41.1
## [51] viridis_0.6.5 DBI_1.2.3
## [53] qs_0.27.2 biomaRt_2.63.1
## [55] rappdirs_0.3.3 DelayedArray_0.33.6
## [57] rjson_0.2.23 tools_4.4.2
## [59] glue_1.8.0 restfulr_0.0.15
## [61] grid_4.4.2 cluster_2.1.8
## [63] gtable_0.3.6 BSgenome_1.75.1
## [65] tzdb_0.4.0 RApiSerialize_0.1.4
## [67] hms_1.1.3 utf8_1.2.4
## [69] stringfish_0.16.0 xml2_1.3.6
## [71] XVector_0.47.2 pillar_1.10.1
## [73] BiocFileCache_2.15.1 lattice_0.22-6
## [75] rtracklayer_1.67.0 bit_4.5.0.1
## [77] deldir_2.0-4 tidyselect_1.2.1
## [79] maketools_1.3.2 Biostrings_2.75.3
## [81] knitr_1.49 gridExtra_2.3
## [83] SummarizedExperiment_1.37.0 xfun_0.50
## [85] Biobase_2.67.0 matrixStats_1.5.0
## [87] stringi_1.8.4 UCSC.utils_1.3.1
## [89] yaml_2.3.10 evaluate_1.0.3
## [91] codetools_0.2-20 interp_1.1-6
## [93] GenomicFiles_1.43.0 cli_3.6.4
## [95] RcppParallel_5.1.10 munsell_0.5.1
## [97] jquerylib_0.1.4 Rcpp_1.0.14
## [99] dbplyr_2.5.0 png_0.1-8
## [101] XML_3.99-0.18 parallel_4.4.2
## [103] blob_1.2.4 prettyunits_1.2.0
## [105] latticeExtra_0.6-30 jpeg_0.1-10
## [107] parallelDist_0.2.6 plyranges_1.27.0
## [109] bitops_1.0-9 pwalign_1.3.2
## [111] txdbmaker_1.3.1 viridisLite_0.4.2
## [113] VariantAnnotation_1.53.1 scales_1.3.0
## [115] crayon_1.5.3 rrapply_1.2.7
## [117] rlang_1.1.5 KEGGREST_1.47.0