SingleMoleculeFootprinting is an R package providing functions to analyze Single Molecule Footprinting (SMF) data. Following the workflow exemplified in this vignette, the user will be able to perform basic data analysis of SMF data with minimal coding effort.
Starting from an aligned bam file, we show how to
For installation, the user can use the following:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SingleMoleculeFootprinting")
For compatibility with our analysis tools, we recommend the user to
perform genomic alignments using the qAlign
function from QuasR as exemplified in the chunk below.
library(parallel)
library(QuasR)
library(BSgenome.Mmusculus.UCSC.mm10)
cl <- makeCluster(40)
prj <- QuasR::qAlign(sampleFile = Qinput,
genome = "BSgenome.Mmusculus.UCSC.mm10",
aligner = "Rbowtie",
projectName = "prj",
paired = "fr",
bisulfite = "undir",
alignmentParameter = "-e 70 -X 1000 -k 2 --best -strata",
alignmentsDir = "./",
cacheDir = tempdir(),
clObj = cl)
SingleMoleculeFootprinting inherits QuasR’s
philosophy of working with pointer files. Briefly, a pointer is a
tab-delimited file with two or three fields indicating the location of
the input data files. For more details, please check the
qAlign
documentation.
Here we show how our pointer file looks like.
N.b.: The execution of the example code in this vignette, as well as
some functions of this software package, depend on accessory data that
can be downloaded and cached through the data package
SingleMoleculeFootprintingData. Under the hood, we download
this data for the user and create the QuasR sampleSheet file for the
example data at tempdir()
under the name
NRF1Pair_Qinput.txt
. We just ask the user to make sure that
their default cache directory has enough storage capacity (~1 Gb). The
user can check this via
ExperimentHub::getExperimentHubOption(arg = "CACHE")
and
eventually change this value via
ExperimentHub::setExperimentHubOption(arg = "CACHE", value = "/your/favourite/location")
.
Before investing in a deep sequencing run for a SMF experiment, it is advisable to first perform a shallow sequencing run and to perform quality controls on the sequencing libraries.
It is always a good idea to examine canonical quality measures after
aligning. We advice the user to employ the qQCreport
function from QuasR.
If SMF was performed on an large genome (e.g Mouse) it is possible that bait capture was performed to focus the sequencing efforts: here we check how efficient that process was by essentially computing the ratio of genomic alignments inside bait regions over the total ones.
BaitRegions <- SingleMoleculeFootprintingData::EnrichmentRegions_mm10.rds()
## see ?SingleMoleculeFootprintingData and browseVignettes('SingleMoleculeFootprintingData') for documentation
## loading from cache
clObj = makeCluster(5)
BaitCaptureEfficiency <- BaitCapture(sampleSheet = Qinput,
genome = BSgenome.Mmusculus.UCSC.mm10,
baits = BaitRegions, clObj = clObj)
## all necessary alignment files found
## preparing to run on 5 nodes...done
## counting alignments...done
## preparing to run on 5 nodes...done
## counting alignments...done
stopCluster(clObj)
BaitCaptureEfficiency
## [1] 1
In this case the capture efficiency equals to 1 because the example data was purposefully subset for an interesting region which falls entirely within baits. Under normal circumstances, one expects this value to be lower than 1 for the mouse genome for example we observe values around 0.7.
Here we ask how much of the observed Cytosine methylation falls in the expected contexts (CpG or GpC). Beware, this is a much slower computation than the above.
ConversionRateValue <- ConversionRate(sampleSheet = Qinput,
genome = BSgenome.Mmusculus.UCSC.mm10,
chr = 6)
For this sample, the observed conversion rate is 92.4%. This value happens to be slightly below the expected value of ~95%
Methylation values at the single molecule level can be extracted for
the region of interest from aligned data using the
CallContextMethylation
function. Under the hood, the
function performs the following operations:
ExperimentType | substring | RelevanContext | Notes |
---|---|---|---|
Single enzyme SMF | _NO_ | DGCHN & NWCGW | Methylation info is reported separately for each context |
Double enzyme SMF | _DE_ | GCH + HCG | Methylation information is aggregated across the contexts |
No enzyme (endogenous mCpG only) | _SS_ | CG | - |
MySample <- suppressMessages(readr::read_delim(Qinput, delim = "\t")[[2]])
Region_of_interest <- GRanges(seqnames = "chr6",
ranges = IRanges(start = 88106000,
end = 88106500),
strand = "*")
Methylation <- CallContextMethylation(sampleSheet = Qinput,
sample = MySample,
genome = BSgenome.Mmusculus.UCSC.mm10,
range = Region_of_interest,
coverage = 20,
ConvRate.thr = 0.2)
## Setting QuasR project
## all necessary alignment files found
## Calling methylation at all Cytosines
## 0.9% of reads found with conversion rate above 0.2
## Subsetting Cytosines by permissive genomic context (NGCNN, NNCGN)
## Collapsing strands
## 0 reads found mapping to the + strand, collapsing to -
## 0 reads found mapping to the + strand, collapsing to -
## Filtering Cs for coverage
## Detected experiment type: DE
## Subsetting Cytosines by strict genomic context (GCH, HCG) based on the detected experiment type: DE
## Merging matrixes
Methylation[[1]]
## GRanges object with 38 ranges and 1 metadata column:
## seqnames ranges strand | NRF1pair_DE__M
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr6 88106112 - | 0.741117
## [2] chr6 88106116 - | 0.589141
## [3] chr6 88106118 - | 0.838407
## [4] chr6 88106127 - | 0.543319
## [5] chr6 88106138 - | 0.400601
## ... ... ... ... . ...
## [34] chr6 88106434 - | 0.554072
## [35] chr6 88106444 - | 0.607526
## [36] chr6 88106455 - | 0.679593
## [37] chr6 88106492 - | 0.720641
## [38] chr6 88106495 - | 0.809129
## -------
## seqinfo: 239 sequences (1 circular) from mm10 genome
Methylation[[2]][1:10, 1:10]
## 88106112 88106116 88106118
## M00758:819:000000000-CB7NB:1:1101:10081:9865 0 1 1
## M00758:819:000000000-CB7NB:1:1101:10119:12887 0 0 0
## M00758:819:000000000-CB7NB:1:1101:10172:5248 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10214:24193 0 0 0
## M00758:819:000000000-CB7NB:1:1101:10219:24481 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10408:27375 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10428:10873 0 1 1
## M00758:819:000000000-CB7NB:1:1101:10428:22900 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10453:11606 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10489:13730 1 1 1
## 88106127 88106138 88106153
## M00758:819:000000000-CB7NB:1:1101:10081:9865 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10119:12887 0 1 1
## M00758:819:000000000-CB7NB:1:1101:10172:5248 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10214:24193 0 0 0
## M00758:819:000000000-CB7NB:1:1101:10219:24481 0 0 0
## M00758:819:000000000-CB7NB:1:1101:10408:27375 0 0 0
## M00758:819:000000000-CB7NB:1:1101:10428:10873 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10428:22900 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10453:11606 0 1 0
## M00758:819:000000000-CB7NB:1:1101:10489:13730 1 0 1
## 88106155 88106158 88106164
## M00758:819:000000000-CB7NB:1:1101:10081:9865 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10119:12887 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10172:5248 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10214:24193 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10219:24481 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10408:27375 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10428:10873 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10428:22900 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10453:11606 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10489:13730 1 1 1
## 88106172
## M00758:819:000000000-CB7NB:1:1101:10081:9865 1
## M00758:819:000000000-CB7NB:1:1101:10119:12887 1
## M00758:819:000000000-CB7NB:1:1101:10172:5248 1
## M00758:819:000000000-CB7NB:1:1101:10214:24193 1
## M00758:819:000000000-CB7NB:1:1101:10219:24481 1
## M00758:819:000000000-CB7NB:1:1101:10408:27375 1
## M00758:819:000000000-CB7NB:1:1101:10428:10873 1
## M00758:819:000000000-CB7NB:1:1101:10428:22900 1
## M00758:819:000000000-CB7NB:1:1101:10453:11606 1
## M00758:819:000000000-CB7NB:1:1101:10489:13730 1
The following chunk has the sole scope of downsampling reads to make the vignette lighter. The user should skip this.
Already at this stage it is possible to visually inspect results. To
that end, we provide the function PlotAvgSMF
to plot the
average SMF signal, defined as 1 - average methylation,
at a genomic locus of interest. Optionally, the user can pass a GRanges
object of genomic coordinates for the transcription factor binding sites
(TFBSs) of interest to include in the plot, we show an
example of such an object.
TFBSs <- GenomicRanges::GRanges("chr6",
IRanges(c(88106216, 88106253),
c(88106226, 88106263)),
strand = "-")
elementMetadata(TFBSs)$name <- c("NRF1", "NRF1")
names(TFBSs) <- c(paste0("TFBS_", c(4305215, 4305216)))
TFBSs
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## TFBS_4305215 chr6 88106216-88106226 - | NRF1
## TFBS_4305216 chr6 88106253-88106263 - | NRF1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
PlotAvgSMF(MethGR = Methylation[[1]],
range = Region_of_interest,
TFBSs = TFBSs)
Furthermore, the function PlotSM
can be used to plot the
single molecule SMF information at a given site.
PlotSM(MethSM = Methylation[[2]],
range = Region_of_interest)
## No sorting passed or specified, will plot unsorted reads
Optionally, hierarchical clustering can be performed by setting the
parameter SortedReads = "HC"
. This can be useful to
aggregate reads visually in order to spot PCR artifacts. N.b. reads will
be downsampled to 500.
Ultimately though, we want to classify reads based on their patterns
of molecular occupancy. To that end we provide the functions
SortReadsBySingleTF
and SortReadsByTFCluster
to classify reads based either on the occupancy patterns of one or
multiple transcription factors.
Under the hood, the classification is based on the definition of n + 2 bins (with n being the number of TFs). The n bins are each centered around one of the TFBSs of interest, while the 2 extra bins are located upstream and downstream of the two outmost TFBSs.
For SortReadsBySingleTF
, the coordinates of the bins
relative to the center of the TFBS are [-35;-25], [-15;+15], [+25,+35].
Instead, the function SortReadsByTFCluster
draws a bin with
coordinates [-7;+7] around the center of each TFBS, a bin with
coordinates [-35;-25] relative to center of the most upstream TFBS and a
bin with coordinates [+25,+35] relative to the center of the most
downstream TFBS. The user can also employ custom coordinates by
specifying them explicitly using the function
SortReads
.
For each read, the binary methylation status of the cytosines contained in each bin is averaged to either a 0 or a 1 such that each read is eventually represented as sequence of n + 2 binary digits, for a total of 2(n + 2) possible sequences.
Here, we show a usage case for the SortReadsByTFCluster
function, as we have already identified the double binding of NRF1 at
the genomic site under scrutiny. Usage and exploration of the output is
identical for the other function, except for the the format of the
TFBSs argument which should consist of a GRanges object of
length 1 for SortReadsBySingleTF
and of length > 1 for
SortReadsByTFCluster
.
SortedReads_TFCluster <- SortReadsByTFCluster(MethSM = Methylation[[2]],
TFBSs = TFBSs)
## Collecting summarized methylation for bins
## TF cluster mode
## Number of retrieved states: 13
## States population:
## 0000 0001 0011 0101 0111 1000 1001 1010 1011 1100 1101 1110 1111
## 4 13 8 4 9 20 311 8 158 34 108 12 306
N.b. non-populated states are not returned.
The output of each of these sorting functions can be passed directly
as the SortedReads
argument of the PlotSM
function.
PlotSM(MethSM = Methylation[[2]],
range = Region_of_interest,
SortedReads = SortedReads_TFCluster)
## Sorting reads according to passed values before plotting
## Inferring sorting was performed by TF pair
N.b. despite sorting reads by a TF cluster is in principle possible
with clusters of any size, as of now the PlotSM
function
can only deal with TF pairs.
In order to be quantitative about these observations, the user can
employ the StateQuantificationPlot
. The function outputs a
bar plot annotated with the percentage of reads found in each state. The
function takes, as argument, the output of either of the two sorting
functions.
StateQuantificationPlot(SortedReads = SortedReads_TFCluster)
## Inferring sorting was performed by TF pair
Finally, we provide the wrapper function
PlotSingleSiteSMF
to plot at once the three kinds of
information detailed above and to export results as a pdf.
PlotSingleSiteSMF(ContextMethylation = Methylation,
sample = MySample,
range = Region_of_interest,
SortedReads = SortedReads_TFCluster,
TFBSs = TFBSs,
saveAs = NULL)
## Subsetting data by range (extended)
## Sorting reads according to passed values before plotting
## Inferring sorting was performed by TF pair
## Inferring sorting was performed by TF pair
This vignette was produced under:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] SingleMoleculeFootprintingData_1.13.0 SingleMoleculeFootprinting_1.13.0
## [3] BSgenome.Mmusculus.UCSC.mm10_1.4.3 BSgenome_1.73.0
## [5] rtracklayer_1.65.0 BiocIO_1.15.2
## [7] Biostrings_2.73.1 XVector_0.45.0
## [9] GenomicRanges_1.57.1 GenomeInfoDb_1.41.1
## [11] IRanges_2.39.2 S4Vectors_0.43.2
## [13] BiocGenerics_0.51.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 bitops_1.0-8
## [3] QuasR_1.45.2 deldir_2.0-4
## [5] httr2_1.0.3 biomaRt_2.61.3
## [7] rlang_1.1.4 magrittr_2.0.3
## [9] Rbowtie_1.45.0 matrixStats_1.3.0
## [11] compiler_4.4.1 RSQLite_2.3.7
## [13] GenomicFeatures_1.57.0 png_0.1-8
## [15] vctrs_0.6.5 txdbmaker_1.1.1
## [17] stringr_1.5.1 pwalign_1.1.0
## [19] pkgconfig_2.0.3 crayon_1.5.3
## [21] fastmap_1.2.0 dbplyr_2.5.0
## [23] utf8_1.2.4 Rsamtools_2.21.1
## [25] rmarkdown_2.28 tzdb_0.4.0
## [27] UCSC.utils_1.1.0 purrr_1.0.2
## [29] bit_4.0.5 xfun_0.47
## [31] zlibbioc_1.51.1 cachem_1.1.0
## [33] jsonlite_1.8.8 progress_1.2.3
## [35] blob_1.2.4 highr_0.11
## [37] DelayedArray_0.31.11 BiocParallel_1.39.0
## [39] jpeg_0.1-10 prettyunits_1.2.0
## [41] R6_2.5.1 VariantAnnotation_1.51.0
## [43] bslib_0.8.0 stringi_1.8.4
## [45] RColorBrewer_1.1-3 jquerylib_0.1.4
## [47] Rcpp_1.0.13 SummarizedExperiment_1.35.1
## [49] knitr_1.48 readr_2.1.5
## [51] Matrix_1.7-0 tidyselect_1.2.1
## [53] abind_1.4-5 yaml_2.3.10
## [55] codetools_0.2-20 hwriter_1.3.2.1
## [57] curl_5.2.2 plyr_1.8.9
## [59] lattice_0.22-6 tibble_3.2.1
## [61] withr_3.0.1 Biobase_2.65.1
## [63] ShortRead_1.63.1 KEGGREST_1.45.1
## [65] evaluate_0.24.0 BiocFileCache_2.13.0
## [67] xml2_1.3.6 ExperimentHub_2.13.1
## [69] BiocManager_1.30.25 pillar_1.9.0
## [71] filelock_1.0.3 MatrixGenerics_1.17.0
## [73] generics_0.1.3 vroom_1.6.5
## [75] RCurl_1.98-1.16 BiocVersion_3.20.0
## [77] hms_1.1.3 GenomicFiles_1.41.0
## [79] glue_1.7.0 maketools_1.3.0
## [81] tools_4.4.1 interp_1.1-6
## [83] AnnotationHub_3.13.3 sys_3.4.2
## [85] data.table_1.16.0 GenomicAlignments_1.41.0
## [87] buildtools_1.0.0 XML_3.99-0.17
## [89] grid_4.4.1 latticeExtra_0.6-30
## [91] AnnotationDbi_1.67.0 GenomeInfoDbData_1.2.12
## [93] restfulr_0.0.15 cli_3.6.3
## [95] rappdirs_0.3.3 fansi_1.0.6
## [97] S4Arrays_1.5.7 dplyr_1.1.4
## [99] sass_0.4.9 digest_0.6.37
## [101] SparseArray_1.5.31 rjson_0.2.22
## [103] memoise_2.0.1 htmltools_0.5.8.1
## [105] lifecycle_1.0.4 httr_1.4.7
## [107] mime_0.12 bit64_4.0.5