FootprintCharter

Introduction

This vignette exemplifies how to perform unsupervised footprint detection and quantification using FootprintCharter as per Baderna & Barzaghi et al., 2024 and Barzaghi et al., 2024.

FootprintCharter partitions molecules by their methylation patterns without relying on orthogonal genomic annotations such as TF motifs.
At the core, FootprintCharter performs a series of filtering steps to obtain a dense matrix of cytosines spanning 80bps. The molecules populating this matrix are than partitioned using the k-medoids algorithm. Footprints for TFs and nucleosomes are then detected by segmenting the average SMF signal separately for each partition. Optionally, TF footprints can be annotated with a given list of annotated motifs. Finally, biologically equivalent TF footprints are aggregated based on coordinates overlaps. For more details consult Barzaghi et al., 2024.

Loading libraries

suppressWarnings(library(SingleMoleculeFootprinting))
suppressWarnings(library(GenomicRanges))
suppressWarnings(library(tidyverse))
suppressWarnings(library(ggplot2))

Prepare inputs

Methylation = qs::qread(system.file("extdata", "Methylation_4.qs", package="SingleMoleculeFootprinting"))
MethSM = Methylation[[2]]
RegionOfInterest = GRanges("chr6", IRanges(88106000, 88106500))
RegionOfInterest = IRanges::resize(RegionOfInterest, 80, "center")

Run FootprintCharter

The function FootprintCharter is tuned with 5 key parameters

  • k, the number of desired partitions. This number is better set higher than lower. The tool will reduce it iteratively if partitions are not covered enough (set by n).
  • n, the minimum number of molecules per partition.
  • TF.length, length thresholds of expected TF footprints (in bps).
  • nucleosome.length, length thresholds of expected nucleosome footprints (in bps).
  • cytosine.coverage.thr, discards cytosines not covered enough from footprint detection step.
FootprintCharter(
  MethSM = MethSM,
  RegionOfInterest = RegionOfInterest,
  coverage = 30, 
  k = 16,
  n = 5, 
  TF.length = c(5,75),
  nucleosome.length = c(120,1000),
  cytosine.coverage.thr = 5,
  verbose = TRUE
) -> FC_results
## 1. Pooling molecules from all samples
## 2. Computing sliding windows
## Discarding 1973/2233 molecules that do not entirely cover the RegionOfInterest
## Discarding 41/59 cytosines that are NA in more than 0% of the reads
## Discarding 0/260 molecules that have NAs in more than 0% of the remaining cytosines
## 3. computing distance matrix
## 4. Partitioning
## partitions too slim detected...retrying with k=15
## partitions too slim detected...retrying with k=14
## partitions too slim detected...retrying with k=13
## partitions too slim detected...retrying with k=12
## partitions too slim detected...retrying with k=11
## partitions too slim detected...retrying with k=10
## partitions too slim detected...retrying with k=9
## partitions too slim detected...retrying with k=8
## 5. Footprints detection
## 6. Footprints annotation (skipping)
## 7. Footprints aggregation
## 8. Results wrangling

The function FootprintCharter outputs a list of two items.
The first is a vector of partitioned.molecules, structured in the same way as the one output by the function SortReads.

rrapply::rrapply(FC_results$partitioned.molecules, f = head, n = 2)
## $SMF_MM_TKO_DE_
## $SMF_MM_TKO_DE_$`1`
## [1] "D00404:273:H5757BCXY:1:1102:14949:3420"
## [2] "D00404:273:H5757BCXY:1:2116:7810:26400"
## 
## $SMF_MM_TKO_DE_$`2`
## [1] "D00404:273:H5757BCXY:1:1102:20457:73436"
## [2] "D00404:273:H5757BCXY:1:1110:7624:25325" 
## 
## $SMF_MM_TKO_DE_$`3`
## [1] "D00404:273:H5757BCXY:1:1205:4416:100764"
## [2] "D00404:273:H5757BCXY:1:1206:16663:42295"
## 
## $SMF_MM_TKO_DE_$`4`
## [1] "D00404:273:H5757BCXY:1:1209:3401:67031" 
## [2] "D00404:273:H5757BCXY:1:2215:17233:16460"
## 
## $SMF_MM_TKO_DE_$`5`
## [1] "D00404:273:H5757BCXY:1:1211:14401:50622"
## [2] "D00404:273:H5757BCXY:2:1101:1697:87040" 
## 
## $SMF_MM_TKO_DE_$`6`
## [1] "D00404:273:H5757BCXY:1:1215:8843:22685"
## [2] "D00404:273:H5757BCXY:2:1107:2792:41093"
## 
## $SMF_MM_TKO_DE_$`7`
## [1] "D00404:273:H5757BCXY:1:2212:1496:27605"
## [2] "D00404:273:H5757BCXY:2:1102:9323:16615"
## 
## $SMF_MM_TKO_DE_$`8`
## [1] "D00404:273:H5757BCXY:2:1103:16878:7787"
## [2] "D00404:283:HCFT7BCXY:2:1103:6140:48231"

The second is a data.frame of detected footprints. The biological.state column indicates the putative nature of the detected footprints and it can be either of TF, nucleosome, accessible, unrecognized and noise.
The last two represent cases in which footprints couldn’t be interpreted either because too short (noise) or because they are detected at the edge of molecules and hence their true width cannot be established.

head(FC_results$footprints.df)
##   partition.nr         sample partition.coverage    start      end width
## 1            2 SMF_MM_TKO_DE_                 47 88106070 88106131    62
## 2            1 SMF_MM_TKO_DE_                 46 88106098 88106209   112
## 3            3 SMF_MM_TKO_DE_                 42 88106098 88106104     7
## 4            3 SMF_MM_TKO_DE_                 42 88106105 88106209   105
## 5            6 SMF_MM_TKO_DE_                 21 88106113 88106113     1
## 6            7 SMF_MM_TKO_DE_                 40 88106113 88106317   205
##   nr.cytosines biological.state seqnames TF TF.name footprint.idx
## 1            6       accessible     chr6 NA      NA             9
## 2           15       accessible     chr6 NA      NA             1
## 3            1     unrecognized     chr6 NA      NA            36
## 4           14       accessible     chr6 NA      NA             1
## 5            1            noise     chr6 NA      NA            40
## 6           30       accessible     chr6 NA      NA             1

Optionally, it is possible to annotate TF footprints by providing the argument TFBSs with user-provided motifs. TFBSs needs a GRanges with at least two metadata columns: TF and absolute.idx for TF identity and unique motif index, respectively.

TFBSs = qs::qread(system.file("extdata", "TFBSs_1.qs", package="SingleMoleculeFootprinting"))
TFBSs$absolute.idx = names(TFBSs)
TFBSs
## GRanges object with 2 ranges and 2 metadata columns:
##                seqnames            ranges strand |          TF absolute.idx
##                   <Rle>         <IRanges>  <Rle> | <character>  <character>
##   TFBS_4305215     chr6 88106216-88106226      - |        NRF1 TFBS_4305215
##   TFBS_4305216     chr6 88106253-88106263      - |        NRF1 TFBS_4305216
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome
FootprintCharter(
  MethSM = MethSM,
  RegionOfInterest = RegionOfInterest,
  TFBSs = TFBSs,
  coverage = 30, 
  k = 16,
  n = 5, 
  TF.length = c(5,75), 
  nucleosome.length = c(120,1000), 
  cytosine.coverage.thr = 5,
  verbose = FALSE
) -> FC_results

FC_results$footprints.df %>%
  filter(biological.state == "TF") %>%
  dplyr::select(start, end, partition.nr, TF, TF.name) %>%
  tidyr::unnest(cols = c(TF, TF.name))
## # A tibble: 11 × 5
##       start      end partition.nr TF    TF.name     
##       <int>    <int> <chr>        <chr> <chr>       
##  1 88106132 88106144 2            NA    <NA>        
##  2 88106156 88106160 5            NA    <NA>        
##  3 88106210 88106237 1            NRF1  TFBS_4305215
##  4 88106210 88106237 2            NRF1  TFBS_4305215
##  5 88106210 88106237 3            NRF1  TFBS_4305215
##  6 88106238 88106255 8            NA    <NA>        
##  7 88106241 88106275 3            NRF1  TFBS_4305216
##  8 88106246 88106275 1            NRF1  TFBS_4305216
##  9 88106246 88106275 5            NRF1  TFBS_4305216
## 10 88106318 88106342 1            NA    <NA>        
## 11 88106318 88106388 2            NA    <NA>

This way, the footprints data.frame reports TF motif annotations by populating two columns: TF and TF.name, each a list of annotations for each putative TF footprint.

Inspect detected footprints

The function PlotFootprints allows to visually inspect the footprint detection results.

x.axis.breaks = c(
  start(resize(RegionOfInterest, width = 500, fix = "center")), 
  end(resize(RegionOfInterest, width = 500, fix = "center"))
  )
PlotFootprints(
  MethSM = MethSM, 
  partitioned.molecules = FC_results$partitioned.molecules, 
  footprints.df = FC_results$footprints.df, 
  TFBSs = TFBSs
    ) +
  scale_x_continuous(
    limits = x.axis.breaks, 
    breaks = x.axis.breaks, 
    labels = format(x.axis.breaks, nsmall=1, big.mark=",")
    ) +
  theme(legend.position = "top")

The function plots the average SMF signal separately for each partition alongside the biological.state annotation. The dashed line indicates the footprint detection threshold.

Plot single molecule heatmaps

The function PlotSM can be used to plot binary heatmaps of single molecules partitioned by FootprintCharter.

PlotSM(
  MethSM = MethSM, 
  RegionOfInterest = IRanges::resize(RegionOfInterest, 500, "center"), 
  SortedReads = FC_results$partitioned.molecules, 
  sorting.strategy = "custom"
  ) +
  scale_x_continuous(
    limits = x.axis.breaks, 
    breaks = x.axis.breaks, 
    labels = format(x.axis.breaks, nsmall=1, big.mark=",")
    )
## Arranging reads according to custom sorting.strategy
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.

It might be desirable to arrange partitions according to a biologically meaningful order, such as TF binding and cobinding.

Users can leverage the plot produced with the PlotFootprints function as exemplified above and instruct a preferred order. In this case, we plot the partitions with two NRF1 footprints (3,1) on top, followed by partitions with single TF footprints (2,5), accessibility (6,7) and nucleosome footprints (4,8).

partitions.order = c(3,1,2,5,6,7,4,8)
ordered.molecules = lapply(FC_results$partitioned.molecules, function(x){x[rev(partitions.order)]})

PlotSM(
  MethSM = MethSM, 
  RegionOfInterest = IRanges::resize(RegionOfInterest, 500, "center"), 
  SortedReads = ordered.molecules, 
  sorting.strategy = "custom"
  ) +
  scale_x_continuous(
    limits = x.axis.breaks, 
    breaks = x.axis.breaks, 
    labels = format(x.axis.breaks, nsmall=1, big.mark=",")
    )
## Arranging reads according to custom sorting.strategy
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.

Single molecules displaying FootprintCharter annotations can be plotted using the Plot_FootprintCharter_SM function.

Plot_FootprintCharter_SM(
  footprints.df = FC_results$footprints.df, 
  RegionOfInterest = IRanges::resize(RegionOfInterest, 500, "center"), 
  partitions.order = partitions.order
) +
  scale_x_continuous(
    limits = x.axis.breaks, 
    breaks = x.axis.breaks, 
    labels = format(x.axis.breaks, nsmall=1, big.mark=",")
    )

sessionInfo

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] lubridate_1.9.4                  forcats_1.0.0                   
##  [3] stringr_1.5.1                    dplyr_1.1.4                     
##  [5] purrr_1.0.4                      readr_2.1.5                     
##  [7] tidyr_1.3.1                      tibble_3.2.1                    
##  [9] ggplot2_3.5.1                    tidyverse_2.0.0                 
## [11] GenomicRanges_1.59.1             GenomeInfoDb_1.43.4             
## [13] IRanges_2.41.3                   S4Vectors_0.45.4                
## [15] BiocGenerics_0.53.6              generics_0.1.3                  
## [17] SingleMoleculeFootprinting_2.1.4
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          sys_3.4.3                  
##   [3] jsonlite_1.8.9              magrittr_2.0.3             
##   [5] GenomicFeatures_1.59.1      farver_2.1.2               
##   [7] rmarkdown_2.29              BiocIO_1.17.1              
##   [9] vctrs_0.6.5                 memoise_2.0.1              
##  [11] Rsamtools_2.23.1            RCurl_1.98-1.16            
##  [13] QuasR_1.47.2                ggpointdensity_0.1.0       
##  [15] htmltools_0.5.8.1           S4Arrays_1.7.3             
##  [17] progress_1.2.3              curl_6.2.0                 
##  [19] SparseArray_1.7.5           sass_0.4.9                 
##  [21] bslib_0.9.0                 httr2_1.1.0                
##  [23] cachem_1.1.0                buildtools_1.0.0           
##  [25] GenomicAlignments_1.43.0    lifecycle_1.0.4            
##  [27] pkgconfig_2.0.3             Matrix_1.7-2               
##  [29] R6_2.6.1                    fastmap_1.2.0              
##  [31] GenomeInfoDbData_1.2.13     MatrixGenerics_1.19.1      
##  [33] digest_0.6.37               colorspace_2.1-1           
##  [35] miscTools_0.6-28            ShortRead_1.65.0           
##  [37] patchwork_1.3.0             AnnotationDbi_1.69.0       
##  [39] RSQLite_2.3.9               hwriter_1.3.2.1            
##  [41] labeling_0.4.3              filelock_1.0.3             
##  [43] timechange_0.3.0            httr_1.4.7                 
##  [45] abind_1.4-8                 compiler_4.4.2             
##  [47] Rbowtie_1.47.0              bit64_4.6.0-1              
##  [49] withr_3.0.2                 BiocParallel_1.41.1        
##  [51] viridis_0.6.5               DBI_1.2.3                  
##  [53] qs_0.27.2                   biomaRt_2.63.1             
##  [55] rappdirs_0.3.3              DelayedArray_0.33.6        
##  [57] rjson_0.2.23                tools_4.4.2                
##  [59] glue_1.8.0                  restfulr_0.0.15            
##  [61] grid_4.4.2                  cluster_2.1.8              
##  [63] gtable_0.3.6                BSgenome_1.75.1            
##  [65] tzdb_0.4.0                  RApiSerialize_0.1.4        
##  [67] hms_1.1.3                   utf8_1.2.4                 
##  [69] stringfish_0.16.0           xml2_1.3.6                 
##  [71] XVector_0.47.2              pillar_1.10.1              
##  [73] BiocFileCache_2.15.1        lattice_0.22-6             
##  [75] rtracklayer_1.67.0          bit_4.5.0.1                
##  [77] deldir_2.0-4                tidyselect_1.2.1           
##  [79] maketools_1.3.2             Biostrings_2.75.3          
##  [81] knitr_1.49                  gridExtra_2.3              
##  [83] SummarizedExperiment_1.37.0 xfun_0.50                  
##  [85] Biobase_2.67.0              matrixStats_1.5.0          
##  [87] stringi_1.8.4               UCSC.utils_1.3.1           
##  [89] yaml_2.3.10                 evaluate_1.0.3             
##  [91] codetools_0.2-20            interp_1.1-6               
##  [93] GenomicFiles_1.43.0         cli_3.6.4                  
##  [95] RcppParallel_5.1.10         munsell_0.5.1              
##  [97] jquerylib_0.1.4             Rcpp_1.0.14                
##  [99] dbplyr_2.5.0                png_0.1-8                  
## [101] XML_3.99-0.18               parallel_4.4.2             
## [103] blob_1.2.4                  prettyunits_1.2.0          
## [105] latticeExtra_0.6-30         jpeg_0.1-10                
## [107] parallelDist_0.2.6          plyranges_1.27.0           
## [109] bitops_1.0-9                pwalign_1.3.2              
## [111] txdbmaker_1.3.1             viridisLite_0.4.2          
## [113] VariantAnnotation_1.53.1    scales_1.3.0               
## [115] crayon_1.5.3                rrapply_1.2.7              
## [117] rlang_1.1.5                 KEGGREST_1.47.0