Transcription factors often bind to specific DNA nucleotide patterns referred to as sequence motifs. Although sequence motifs are well-characterized for many transcription factors, detecting the actual binding sites is not always a straightforward task. Chromatin immunoprecipitation (ChIP) followed by DNA sequencing (ChIP-seq) is the major technology to detect binding regions of transcription factors. However, the binding regions detected by ChIP-seq may contains several or no DNA sequence motifs. Motif2Site is a novel R package which uses ChIP-seq information and detect transcription factor binding sites from a user provided DNA sequence motifs set.
Motif2Site gets two different input, motif and ChIP-seq alignment information, to detect binding sites. First input is ChIP-seq alignment short reads in the bam or bed format. For each aligned short read the center of the alignment is calculated using GenomicAlignments (Lawrence et al. 2013) and Rsamtools (Morgan et al. 2020) packages. Motif information is the second input which is provided by user either as a bed file or a DNA string with a mismatch number. In the case of DNA string input, Biostrings (H. Pagès et al. 2019) and BSgenome (Hervé Pagès 2019) packages are used to find motif locations on the genome in GenomicRanges (Lawrence et al. 2013) format.
Negative binomial distribution is used to model count data by using edgeR package (Robinson, McCarthy, and Smyth 2010). mixtools (Benaglia et al. 2009) is used to deconvolve binding intensities of closely spaced binding sites. Motif2Site also combines binding sites across different experiments. It calls differential binding sites using TMM normalization and GLM test using edgeR package (Robinson, McCarthy, and Smyth 2010).
First install and load the libraries needed to run the examples of this document:
library(GenomicRanges)
library(Motif2Site)
library(BSgenome.Scerevisiae.UCSC.sacCer3)
library(BSgenome.Ecoli.NCBI.20080805)
The functions, implemented in Motif2Site, perform three tasks: 1. To assist users to select better sequence motif input. 2. To detect binding sites from sequence motifs and ChIP-seq datasets 3. To combine and compare binding sites across different experiments, conditions, or tissues. Each of these functions are explained in a separate section.
Motif2Site uses DNA motif information as one of its
input. To facilitate choosing proper sequence motif,
compareBedFiless2UserProvidedRegions
and
compareMotifs2UserProvidedRegions
functions compare motif
regions with a user provided confident regions in terms of precision and
recall.
As the first example, an artificially generated
“YeastSampleMotif.bed” bed file in yeast is considered as a confident
binding regions set. compareMotifs2UserProvidedRegions
function compares these regions with locations of ‘TGATTSCAGGANT’
1-mismatch, ‘TGATTCCAGGANT’ 0-mismatch, ‘TGATWSCAGGANT’ 2-mismatches on
the yeast genome in terms of precision and recall.
yeastExampleFile =
system.file("extdata", "YeastSampleMotif.bed", package="Motif2Site")
YeastRegionsChIPseq <- Bed2Granges(yeastExampleFile)
SequenceComparison <-
compareMotifs2UserProvidedRegions(
givenRegion=YeastRegionsChIPseq,
motifs = c("TGATTSCAGGANT", "TGATTCCAGGANT", "TGATWSCAGGANT"),
mismatchNumbers = c(1,0,2),
genome="Scerevisiae",
genomeBuild="sacCer3"
)
## motifName regionCoverage motifCoverage
## 1 TGATTSCAGGANT_1mismatch 0.350877193 0.6060606
## 2 TGATTCCAGGANT_0mismatch 0.005847953 1.0000000
## 3 TGATWSCAGGANT_2mismatch 1.000000000 0.0475132
In the following example, an artificially generated
“YeastSampleMotif.bed” bed file in yeast is considered as a confident
binding regions set. compareBedFiless2UserProvidedRegions
function compares these regions with two bed files “YeastBedFile1.bed”
and “YeastBedFile2.bed” in terms of precision and recall.
# Yeast artificial dataset for comparison bed files
yeastExampleFile =
system.file("extdata", "YeastSampleMotif.bed", package="Motif2Site")
YeastRegionsChIPseq <- Bed2Granges(yeastExampleFile)
bed1 <- system.file("extdata", "YeastBedFile1.bed", package="Motif2Site")
bed2 <- system.file("extdata", "YeastBedFile2.bed", package="Motif2Site")
BedFilesVector <- c(bed1, bed2)
SequenceComparison <-
compareBedFiless2UserProvidedRegions(
givenRegion=YeastRegionsChIPseq,
bedfiles = BedFilesVector,
motifnames = c("YeastBed1", "YeastBed2")
)
## motifName regionCoverage motifCoverage
## 1 YeastBed1 0.3099415 1.000000
## 2 YeastBed2 0.4853801 0.209068
DetectBindingSitesMotif
function detects binding sites
from provided sequence motif information. Here, Artificial ChIP-seq data
for FUR transcription factor in Escherichia coli was generated in two
conditions fe2+ and dpd inspired by (Seo et al.
2014). In the following examples, artificial sequence motif
locations have been and provided as a bed file called ‘FurMotifs.bed’.
The alignment of ChIP-seq short reads is the other input of this
function. The alignment files can be passed to the function as bam or
bed files. In the following examples both IP and background alignment
files have been passed as single-end bed files to this function.
# FUR candidate motifs in NC_000913 E. coli
FurMotifs = system.file("extdata", "FurMotifs.bed", package="Motif2Site")
# ChIP-seq FUR fe datasets binding sites from user provided bed file
# ChIP-seq datasets in bed single end format
IPFe <- c(system.file("extdata", "FUR_fe1.bed", package="Motif2Site"),
system.file("extdata", "FUR_fe2.bed", package="Motif2Site"))
Inputs <- c(system.file("extdata", "Input1.bed", package="Motif2Site"),
system.file("extdata", "Input2.bed", package="Motif2Site"))
FURfeBedInputStats <-
DetectBindingSitesBed(BedFile=FurMotifs,
IPfiles=IPFe,
BackgroundFiles=Inputs,
genome="Ecoli",
genomeBuild="20080805",
DB="NCBI",
expName="FUR_Fe_BedInput",
format="BEDSE"
)
FURfeBedInputStats
## $FRiP
## [1] 0.3668289 0.1997165
##
## $sequencingStatitic
## nonSequenced underBinding overBinding
## 1 0.02293399 0 0.005022041
##
## $motifStatistics
## skewnessTestRejected decompositionRejected accepted
## 1 98 52 199
# ChIP-seq FUR dpd datasets binding sites from user provided bed file
# ChIP-seq datasets in bed single end format
IPDpd <- c(system.file("extdata", "FUR_dpd1.bed", package="Motif2Site"),
system.file("extdata", "FUR_dpd2.bed", package="Motif2Site"))
FURdpdBedInputStats <-
DetectBindingSitesBed(BedFile=FurMotifs,
IPfiles=IPDpd,
BackgroundFiles=Inputs,
genome="Ecoli",
genomeBuild="20080805",
DB="NCBI",
expName="FUR_Dpd_BedInput",
format="BEDSE"
)
FURdpdBedInputStats
## $FRiP
## [1] 0.2826046 0.2823270
##
## $sequencingStatitic
## nonSequenced underBinding overBinding
## 1 0.02254338 0 0.005022041
##
## $motifStatistics
## skewnessTestRejected decompositionRejected accepted
## 1 63 62 189
DetectBindingSitesMotif
function also works with DNA
string motifs. In the following example, FUR binding sites in fe2+
condition are detected from ‘GWWTGAGAA’ with 1-mismatch motif. The
dataset is generated only for ‘NC_000913’ build. Therefore, in this
example the coordinates of this regions are provided as a ‘GivenRegion’
field. Providing this field ensures that binding sites are only detected
in the given regions. This will accelerate the peak calling, and also
improve the prediction accuracy.
# Granages region for motif search
NC_000913_Coordiante <- GRanges(seqnames=Rle("NC_000913"),
ranges = IRanges(1, 4639675))
# ChIP-seq FUR fe datasets binding sites from user provided string motif
# ChIP-seq datasets in bed single end format
FURfeStringInputStats <-
DetectBindingSitesMotif(motif = "GWWTGAGAA",
mismatchNumber = 1,
IPfiles=IPFe,
BackgroundFiles=Inputs,
genome="Ecoli",
genomeBuild="20080805",
DB= "NCBI",
expName="FUR_Fe_StringInput",
format="BEDSE",
GivenRegion = NC_000913_Coordiante
)
FURfeStringInputStats
## $FRiP
## [1] 0.07620002 0.08057296
##
## $sequencingStatitic
## nonSequenced underBinding overBinding
## 1 0.02464332 0 0.003891051
##
## $motifStatistics
## skewnessTestRejected decompositionRejected accepted
## 1 3 2 36
recenterBindingSitesAcrossExperiments
function combines
the binding sites of different tissues or conditions into a single count
matrix. In the FUR example, at the first step it combines fe2+ and dpd
binding sites. At the next step, it recalculates the p-adjusted values.
To ensure the high quality of the combined binding site set, an
stringent cross-experiment FDR cutoff is applied (default 0.001). The
accepted binding sites should fullfill this cutoff at least in one
experiment. Another FDR cutoff value (default 0.05) is used to assign
binding or non-binding labels to each binding site for each
experiment.
# Combine FUR binding sites from bed input into one table
corMAT <- recenterBindingSitesAcrossExperiments(
expLocations=c("FUR_Fe_BedInput","FUR_Dpd_BedInput"),
experimentNames=c("FUR_Fe","FUR_Dpd"),
expName="combinedFUR"
)
corMAT
## FUR_Fe FUR_Fe FUR_Dpd FUR_Dpd
## FUR_Fe 1.0000000 0.4172015 -0.4277362 -0.3712685
## FUR_Fe 0.4172015 1.0000000 -0.4812927 -0.4519734
## FUR_Dpd -0.4277362 -0.4812927 1.0000000 0.7274417
## FUR_Dpd -0.3712685 -0.4519734 0.7274417 1.0000000
FurTable <-
read.table(file.path("combinedFUR","CombinedMatrix"),
header = TRUE,
check.names = FALSE
)
FurBindingTotal <-
GRanges(seqnames=Rle(FurTable[,1]),
ranges = IRanges(FurTable[,2], FurTable[,3])
)
FurFe <- FurBindingTotal[which((FurTable$FUR_Fe_binding =="Binding")==TRUE)]
FurDpd <- FurBindingTotal[which((FurTable$FUR_Dpd_binding =="Binding")==TRUE)]
findOverlaps(FurFe,FurDpd)
## Hits object with 107 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 1
## [2] 7 5
## [3] 10 7
## [4] 11 8
## [5] 17 10
## ... ... ...
## [103] 207 187
## [104] 208 188
## [105] 209 189
## [106] 210 190
## [107] 212 192
## -------
## queryLength: 216 / subjectLength: 193
pairwisDifferential
function uses edgeR TMM
normalization and GLM test to detect differential binding sites across
two experiments. In the following example it takes combined FUR count
matrix and detect differential binding sites across fe2+ and dpd.
# Differential binding sites across FUR conditions fe vs dpd
diffFUR <- pairwisDifferential(tableOfCountsDir="combinedFUR",
exp1="FUR_Fe",
exp2="FUR_Dpd",
FDRcutoff = 0.05,
logFCcuttoff = 1
)
FeUp <- diffFUR[[1]]
DpdUp <- diffFUR[[2]]
TotalComparison <- diffFUR[[3]]
head(TotalComparison)
## seqnames start end logFC logCPM LR PValue
## 1 NC_000913 8346 8355 0.3894958 11.71413 0.05503008 0.814530921
## 2 NC_000913 58862 58874 7.7288074 11.52722 9.39574933 0.002174891
## 3 NC_000913 70522 70531 5.4750865 11.81084 6.76519659 0.009295268
## 4 NC_000913 111527 111535 5.0360670 11.44461 5.09591510 0.023982257
## 5 NC_000913 146492 146501 -6.8326623 10.77875 4.92341789 0.026495052
## 6 NC_000913 167150 167158 -4.4346309 11.37027 3.96157917 0.046550014
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BSgenome.Ecoli.NCBI.20080805_1.3.1000
## [2] BSgenome.Scerevisiae.UCSC.sacCer3_1.4.0
## [3] BSgenome_1.75.0
## [4] rtracklayer_1.67.0
## [5] BiocIO_1.17.1
## [6] Biostrings_2.75.1
## [7] XVector_0.47.0
## [8] Motif2Site_1.11.0
## [9] GenomicRanges_1.59.1
## [10] GenomeInfoDb_1.43.1
## [11] IRanges_2.41.1
## [12] S4Vectors_0.45.2
## [13] BiocGenerics_0.53.3
## [14] generics_0.1.3
## [15] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2
## [3] dplyr_1.1.4 mixtools_2.0.0
## [5] bitops_1.0-9 lazyeval_0.2.2
## [7] fastmap_1.2.0 RCurl_1.98-1.16
## [9] GenomicAlignments_1.43.0 XML_3.99-0.17
## [11] digest_0.6.37 lifecycle_1.0.4
## [13] survival_3.7-0 statmod_1.5.0
## [15] kernlab_0.9-33 magrittr_2.0.3
## [17] compiler_4.4.2 rlang_1.1.4
## [19] sass_0.4.9 tools_4.4.2
## [21] utf8_1.2.4 yaml_2.3.10
## [23] data.table_1.16.2 knitr_1.49
## [25] S4Arrays_1.7.1 htmlwidgets_1.6.4
## [27] curl_6.0.1 DelayedArray_0.33.2
## [29] abind_1.4-8 BiocParallel_1.41.0
## [31] purrr_1.0.2 sys_3.4.3
## [33] grid_4.4.2 fansi_1.0.6
## [35] colorspace_2.1-1 edgeR_4.5.0
## [37] ggplot2_3.5.1 scales_1.3.0
## [39] MASS_7.3-61 SummarizedExperiment_1.37.0
## [41] cli_3.6.3 rmarkdown_2.29
## [43] crayon_1.5.3 httr_1.4.7
## [45] rjson_0.2.23 cachem_1.1.0
## [47] splines_4.4.2 zlibbioc_1.52.0
## [49] parallel_4.4.2 BiocManager_1.30.25
## [51] restfulr_0.0.15 matrixStats_1.4.1
## [53] vctrs_0.6.5 Matrix_1.7-1
## [55] jsonlite_1.8.9 maketools_1.3.1
## [57] locfit_1.5-9.10 limma_3.63.2
## [59] plotly_4.10.4 tidyr_1.3.1
## [61] jquerylib_0.1.4 glue_1.8.0
## [63] codetools_0.2-20 gtable_0.3.6
## [65] UCSC.utils_1.3.0 munsell_0.5.1
## [67] tibble_3.2.1 pillar_1.9.0
## [69] htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
## [71] R6_2.5.1 evaluate_1.0.1
## [73] lattice_0.22-6 Biobase_2.67.0
## [75] Rsamtools_2.23.0 segmented_2.1-3
## [77] bslib_0.8.0 nlme_3.1-166
## [79] SparseArray_1.7.2 xfun_0.49
## [81] MatrixGenerics_1.19.0 buildtools_1.0.0
## [83] pkgconfig_2.0.3