ATAC-seq, an assay for Transposase-Accessible Chromatin using sequencing, is a widely used technique for chromatin accessibility analysis. Detecting the differential activity of transcription factors (TFs) between two experimental conditions helps to decode the key factors behind a phenotype. Many tools have been developed to detect the differential activity of TFs (DATFs) between groups of samples. These tools can be divided into two groups. One group detects DATFs from differential accessibility analysis, such as MEME [1], HOMER [2], enrichr [3], and ChEA [4]. The other group finds DATFs by enrichment tests, such as BiFET [5], diffTF [6], and TFEA [7]. For single-cell ATAC-seq analysis, Signac and chromVAR are widely used tools.
All of these tools detect DATFs by considering only the open status of chromatin. None of them takes the TF footprint into account. Open chromatin only indicates that a TF can bind at that position, whereas the TF footprint in ATAC-seq data reflects whether a TF is actually bound.
To help researchers quickly assess the differential activity of hundreds of TFs by detecting differences in TF footprints via an enrichment score [8], we have developed the ATACseqTFEA package. The ATACseqTFEA package is a robust and reliable computational tool to identify the key regulators responding to a phenotype.
Here is an example using ATACseqTFEA with a subset of ATAC-seq data.
First, install ATACseqTFEA and the other packages required to run the examples. Please note that the example dataset used here is from zebrafish. To run an analysis with a dataset from a different species or a different assembly, please install the corresponding “BSgenome” and “TxDb” packages. For example, to analyze mouse data aligned to “mm10”, please install “BSgenome.Mmusculus.UCSC.mm10” and “TxDb.Mmusculus.UCSC.mm10.knownGene”. You can also generate a TxDb object with the function makeTxDbFromGFF from a local “gff” file, or with makeTxDbFromUCSC, makeTxDbFromBiomart, and makeTxDbFromEnsembl from online resources. These functions are in the GenomicFeatures package (in recent Bioconductor releases they have moved to the txdbmaker package).
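As a minimal sketch of the installation step, the snippet below installs only the packages that are missing. The package list is an assumption based on the packages used in this example (ATACseqTFEA, ATACseqQC, Rsamtools, and the zebrafish danRer10 BSgenome); swap in the BSgenome and TxDb packages that match your own species and assembly.

```r
# Packages used in this walkthrough (an assumed list; adjust to your genome).
pkgs <- c("ATACseqTFEA", "ATACseqQC", "Rsamtools",
          "BSgenome.Drerio.UCSC.danRer10")

# Install BiocManager from CRAN if it is not available yet.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install only the packages that are not already installed.
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) {
  BiocManager::install(to_install)
}
```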
TFEA needs two inputs: the binding sites and the change ranks. To get the binding sites, the ATACseqTFEA package provides the prepareBindingSites function. Users can also obtain the list of binding sites with other tools such as “fimo” [9].
The prepareBindingSites function requires a set of position weight matrices (PWMs) of TF motifs. ATACseqTFEA provides a merged PWMatrixList of 405 motifs. This PWMatrixList is a collection of the jaspar2018, jolma2013, and cisbp_1.02 motifs from the MotifDb package (v 1.28.0), in which motifs with a distance smaller than 1e-9, as calculated by the MotIV::motifDistances function (v 1.42.0), were merged. The merged motifs were exported by motifStack (v 1.30.0).
The best_curated_Human data set is a list of TF motifs downloaded from the TFEA GitHub repository [7]. It contains 1279 human motifs.
Another list of non-redundant TF motifs, downloaded from DeepSTARR [10], is also available. There are 6502 motifs in the data set.
To scan for binding sites along a genome, the prepareBindingSites function requires a BSgenome object.
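library(ATACseqTFEA)
library(BSgenome.Drerio.UCSC.danRer10)
# Load the merged PWMatrixList described above.
# The file name "PWMatrixList.rds" is an assumption; list the files in
# system.file("extdata", package="ATACseqTFEA") to confirm it.
motifs <- readRDS(system.file("extdata", "PWMatrixList.rds",
                              package="ATACseqTFEA"))
# for a test run, we use a subset of data within chr1:5000-100000
# for real data, use the merged peak list as the grange input.
# Drerio is the short name of BSgenome.Drerio.UCSC.danRer10
seqlev <- "chr1"
bindingSites <-
  prepareBindingSites(motifs, Drerio, seqlev,
                      grange=GRanges("chr1", IRanges(5000, 100000)),
                      p.cutoff = 5e-05) # set a higher p.cutoff to get more sites.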
The correct insertion site is the key to the enrichment analysis of TF binding sites. The parameters positive and negative of the TFEA function are used to shift the 5’ ends of the reads to the correct insertion positions. However, this shift does not account for soft clipping of the reads. The best way to generate correctly shifted bam files is to use ATACseqQC::shiftGAlignmentsList [11] for paired-end reads or shiftGAlignments for single-end reads. The one-step TFEA function requires at least two biological replicates per condition.
bamExp <- system.file("extdata",
                      c("KD.shift.rep1.bam",
                        "KD.shift.rep2.bam"),
                      package="ATACseqTFEA")
bamCtl <- system.file("extdata",
                      c("WT.shift.rep1.bam",
                        "WT.shift.rep2.bam"),
                      package="ATACseqTFEA")
res <- TFEA(bamExp, bamCtl, bindingSites=bindingSites,
            positive=0, negative=0) # the bam files already contain shifted reads
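To make the role of positive and negative concrete, here is a small base-R sketch of how a 5’ end can be moved to the insertion site. It is an illustration only, not the package's code: the helper name insertion_site is hypothetical, and the offsets of 4 and 5 bases are the commonly used Tn5 offsets, assumed here as defaults. For bam files that were already shifted, both offsets are 0.

```r
# Hypothetical helper: move the 5' end of each read to the insertion site.
# start/end are 1-based inclusive alignment coordinates.
insertion_site <- function(start, end, strand,
                           positive = 4L, negative = 5L) {
  stopifnot(all(strand %in% c("+", "-")))
  # The 5' end is the start on the plus strand and the end on the minus strand.
  ifelse(strand == "+", start + positive, end - negative)
}

reads <- data.frame(start  = c(100L, 200L),
                    end    = c(150L, 250L),
                    strand = c("+", "-"))

# Unshifted reads: apply the assumed +4/-5 offsets.
shifted <- with(reads, insertion_site(start, end, strand))

# Reads from an already shifted bam file: no further shift.
unshifted <- with(reads, insertion_site(start, end, strand,
                                        positive = 0L, negative = 0L))
```

The sketch ignores soft clipping, which is the reason the text recommends producing shifted bam files with ATACseqQC instead.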
The results are saved in a TFEAresults object. Several functions are available to present the results. The plotES function returns a ggplot object when a single TF is given and no outfolder is defined. The ESvolcanoplot function provides an overview of the enrichment of all TFs. We can also borrow the factorFootprints function from the ATACseqQC package to view the footprints of one TF.
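library(ATACseqQC)
## TF to inspect; "Tal1::Gata1" is only an example choice and is assumed
## to be one of names(motifs). Replace it with your TF of interest.
TF <- "Tal1::Gata1"
## footprint
sigs <- factorFootprints(c(bamCtl, bamExp),
                         pfm = as.matrix(motifs[[TF]]),
                         bindingSites = getBindingSites(res, TF=TF),
                         seqlev = seqlev, genome = Drerio,
                         upstream = 100, downstream = 100,
                         group = rep(c("WT", "KD"), each=2))
## export the results into a csv file
write.csv(res$resultsTable, tempfile(fileext = ".csv"),
          row.names=FALSE)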
A sample command-line script, sample_scripts.R, is available in the extdata folder of the package.
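The one-step TFEA function wraps multiple steps, which include:
1. counting the insertion sites in the binding sites and in the proximal and distal regions;
2. filtering the binding sites by their read counts;
3. normalizing the counts to counts per base (CPB);
4. weighting the binding scores by the open scores;
5. running the differential binding analysis with DBscore;
6. optionally filtering the differential binding results with eventsFilter; and
7. computing the enrichment scores with doTFEA.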
If you want to tune the parameters, it will be much better to do it step by step to avoid repeating the computation for the same step. Here are the details for each step.
We count the insertion sites in the binding sites and in the proximal and distal regions by counting the 5’ ends of the reads in a shifted bam file. We suggest setting proximal and distal to the same value.
# prepare the counting region
exbs <- expandBindingSites(bindingSites=bindingSites,
                           proximal=40,
                           distal=40,
                           gap=10)
## count reads by 5'ends
counts <- count5ends(bam=c(bamExp, bamCtl),
                     positive=0L, negative=0L,
                     bindingSites = bindingSites,
                     bindingSitesWithGap=exbs$bindingSitesWithGap,
                     bindingSitesWithProximal=exbs$bindingSitesWithProximal,
                     bindingSitesWithProximalAndGap=
                       exbs$bindingSitesWithProximalAndGap,
                     bindingSitesWithDistal=exbs$bindingSitesWithDistal)
We keep only the binding sites that have at least 1 read in the proximal region. Users may want to filter the sites by more stringent criteria such as “proximalRegion>1”.
## [1] "bindingSites" "proximalRegion" "distalRegion"
We will normalize the counts to count per base (CPB).
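As a rough illustration of what counts per base means, the base-R sketch below divides the read counts of each region by the width of that region. The toy counts, the region widths, and the helper name counts_per_base are assumptions made for this example; the package's own normalization may differ in detail.

```r
# Hypothetical helper: convert raw counts to counts per base (CPB).
# counts: matrix with one row per region and one column per sample.
# widths: width in bases of each region (one value per row).
counts_per_base <- function(counts, widths) {
  stopifnot(nrow(counts) == length(widths), all(widths > 0))
  # Dividing a matrix by a vector of length nrow() recycles down the columns,
  # so every row is divided by its own width.
  counts / widths
}

toy_counts <- matrix(c(20,  40,
                       80, 160),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("bindingSite", "proximalRegion"),
                                     c("rep1", "rep2")))

# Assumed widths: a 10-bp motif and 40 bp of proximal region on each side.
toy_widths <- c(bindingSite = 10, proximalRegion = 80)

cpb <- counts_per_base(toy_counts, toy_widths)
cpb
```

Normalizing by width makes counts from regions of different sizes comparable, which matters because the binding site and its flanking regions rarely have the same length.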
Here we use the open score to weight the binding score. Users can also define the weight for the binding score via the weight parameter of the getWeightedBindingScore function.
Here we use DBscore, which borrows the power of the limma package, to do the differential binding analysis.
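Because limma works from a design matrix, it may help to see what such a matrix could look like for the four samples counted earlier, in the order c(bamExp, bamCtl). This is a sketch under assumptions: the column names CTL and EXPvsCTL and the sample labels are chosen here for illustration, and how the matrix and the coefficient of interest are passed to DBscore should be checked against ?DBscore.

```r
# Sample order follows c(bamExp, bamCtl): two KD replicates, then two WT.
samples <- c("KD.rep1", "KD.rep2", "WT.rep1", "WT.rep2")

# Intercept column for the control level plus an indicator for the
# experimental (KD) samples. Column names are illustrative assumptions.
design <- cbind(CTL = 1, EXPvsCTL = c(1, 1, 0, 0))
rownames(design) <- samples
design
```

With this coding, the intercept estimates the WT level and the EXPvsCTL coefficient estimates the KD minus WT difference.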
We can filter the binding results with the eventsFilter function to decrease the data size. For the sample data, we skip this step.
Last, we use the doTFEA function to get the enrichment scores.
## This is an object of TFEAresults with
## slot enrichmentScore ( matrix: 399 x 2166 ),
## slot bindingSites ( GRanges object with 2166 ranges and 12 metadata columns ),
## slot motifID ( a list of the positions of binding sites for 399 TFs ), and
## slot resultsTable ( 399 x 5 ). Here is the top 2 rows:
## TF enrichmentScore normalizedEnrichmentScore p_value adjPval
## NRF1 NRF1 0.1923613 0.7960275 0.7253614 0.9994472
## Gfi1b Gfi1b 0.3099024 1.3769160 0.1143751 0.9585665
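To give an intuition for the enrichment score reported in the table above, here is a toy, unweighted running-sum score in the style of gene set enrichment analysis, written in base R. It is a simplified sketch, not the computation performed by doTFEA: the helper name running_es and the equal-step weighting are assumptions made for illustration.

```r
# Hypothetical helper: unweighted running-sum enrichment score.
# is_hit: logical vector over binding sites ordered by their change rank,
#         TRUE where the site belongs to the TF of interest.
running_es <- function(is_hit) {
  n_hit  <- sum(is_hit)
  n_miss <- sum(!is_hit)
  stopifnot(n_hit > 0, n_miss > 0)
  # Step up at each hit and down at each miss so the walk ends at zero.
  steps   <- ifelse(is_hit, 1 / n_hit, -1 / n_miss)
  running <- cumsum(steps)
  # The score is the running-sum value with the largest absolute deviation.
  running[which.max(abs(running))]
}

# Sites of the TF concentrated at the top of the ranking: positive score.
es_top <- running_es(c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE))

# Sites of the TF concentrated at the bottom of the ranking: negative score.
es_bottom <- running_es(c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE))
```

A score far from zero means the binding sites of that TF cluster at one end of the ranked list, which is the pattern expected when the TF changes activity between conditions.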
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ATACseqQC_1.31.0 Rsamtools_2.23.1
## [3] BSgenome.Drerio.UCSC.danRer10_1.4.2 BSgenome_1.75.0
## [5] rtracklayer_1.67.0 BiocIO_1.17.1
## [7] Biostrings_2.75.3 XVector_0.47.1
## [9] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [11] IRanges_2.41.2 S4Vectors_0.45.2
## [13] BiocGenerics_0.53.3 generics_0.1.3
## [15] ATACseqTFEA_1.9.1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.4.2 bitops_1.0-9
## [3] filelock_1.0.3 tibble_3.2.1
## [5] R.oo_1.27.0 graph_1.85.1
## [7] XML_3.99-0.18 DirichletMultinomial_1.49.0
## [9] lifecycle_1.0.4 httr2_1.0.7
## [11] pwalign_1.3.1 edgeR_4.5.1
## [13] lattice_0.22-6 ensembldb_2.31.0
## [15] MASS_7.3-63 magrittr_2.0.3
## [17] limma_3.63.2 sass_0.4.9
## [19] rmarkdown_2.29 jquerylib_0.1.4
## [21] yaml_2.3.10 grImport2_0.3-3
## [23] DBI_1.2.3 buildtools_1.0.0
## [25] CNEr_1.43.0 ade4_1.7-22
## [27] abind_1.4-8 zlibbioc_1.53.0
## [29] purrr_1.0.2 R.utils_2.12.3
## [31] preseqR_4.0.0 AnnotationFilter_1.31.0
## [33] RCurl_1.98-1.16 pracma_2.4.4
## [35] rappdirs_0.3.3 GenomeInfoDbData_1.2.13
## [37] ggrepel_0.9.6 maketools_1.3.1
## [39] seqLogo_1.73.0 annotate_1.85.0
## [41] codetools_0.2-20 DelayedArray_0.33.3
## [43] xml2_1.3.6 tidyselect_1.2.1
## [45] futile.logger_1.4.3 farver_2.1.2
## [47] UCSC.utils_1.3.0 universalmotif_1.25.1
## [49] base64enc_0.1-3 matrixStats_1.4.1
## [51] BiocFileCache_2.15.0 GenomicAlignments_1.43.0
## [53] jsonlite_1.8.9 multtest_2.63.0
## [55] motifStack_1.51.0 survival_3.8-3
## [57] motifmatchr_1.29.0 tools_4.4.2
## [59] progress_1.2.3 TFMPvalue_0.0.9
## [61] Rcpp_1.0.13-1 glue_1.8.0
## [63] ChIPpeakAnno_3.41.0 SparseArray_1.7.2
## [65] xfun_0.49 MatrixGenerics_1.19.0
## [67] dplyr_1.1.4 HDF5Array_1.35.2
## [69] withr_3.0.2 formatR_1.14
## [71] BiocManager_1.30.25 fastmap_1.2.0
## [73] rhdf5filters_1.19.0 caTools_1.18.3
## [75] digest_0.6.37 R6_2.5.1
## [77] colorspace_2.1-1 GO.db_3.20.0
## [79] gtools_3.9.5 poweRlaw_0.80.0
## [81] jpeg_0.1-10 biomaRt_2.63.0
## [83] RSQLite_2.3.9 R.methodsS3_1.8.2
## [85] tidyr_1.3.1 data.table_1.16.4
## [87] prettyunits_1.2.0 InteractionSet_1.35.0
## [89] httr_1.4.7 htmlwidgets_1.6.4
## [91] S4Arrays_1.7.1 TFBSTools_1.45.1
## [93] regioneR_1.39.0 pkgconfig_2.0.3
## [95] gtable_0.3.6 blob_1.2.4
## [97] sys_3.4.3 htmltools_0.5.8.1
## [99] RBGL_1.83.0 ProtGenerics_1.39.1
## [101] scales_1.3.0 Biobase_2.67.0
## [103] png_0.1-8 knitr_1.49
## [105] lambda.r_1.2.4 tzdb_0.4.0
## [107] reshape2_1.4.4 rjson_0.2.23
## [109] curl_6.0.1 cachem_1.1.0
## [111] rhdf5_2.51.1 stringr_1.5.1
## [113] BiocVersion_3.21.1 KernSmooth_2.23-26
## [115] parallel_4.4.2 AnnotationDbi_1.69.0
## [117] restfulr_0.0.15 pillar_1.10.0
## [119] grid_4.4.2 vctrs_0.6.5
## [121] randomForest_4.7-1.2 dbplyr_2.5.0
## [123] xtable_1.8-4 evaluate_1.0.1
## [125] readr_2.1.5 VennDiagram_1.7.3
## [127] GenomicFeatures_1.59.1 cli_3.6.3
## [129] locfit_1.5-9.10 compiler_4.4.2
## [131] futile.options_1.0.1 rlang_1.1.4
## [133] crayon_1.5.3 labeling_0.4.3
## [135] plyr_1.8.9 stringi_1.8.4
## [137] BiocParallel_1.41.0 munsell_0.5.1
## [139] lazyeval_0.2.2 Matrix_1.7-1
## [141] hms_1.1.3 bit64_4.5.2
## [143] ggplot2_3.5.1 Rhdf5lib_1.29.0
## [145] KEGGREST_1.47.0 statmod_1.5.0
## [147] SummarizedExperiment_1.37.0 AnnotationHub_3.15.0
## [149] GenomicScores_2.19.0 memoise_2.0.1
## [151] bslib_0.8.0 bit_4.5.0.1
## [153] polynom_1.4-1