A central concern of genome biology is improving understanding of gene transcription. In simple terms, transcription factors (TFs) are proteins that bind to DNA, typically near gene promoter regions. The role of TFs in gene expression variation is of great interest. Progress in deciphering genetic and epigenetic processes that affect TF abundance and function will be essential in clarifying and interpreting gene expression variation patterns and their effects on phenotype. Difficulties of identifying functional binding of TFs, and opportunities for using information of TF binding in systems biology contexts, are reviewed in Lambert et al. (2018) and Weirauch et al. (2014).
This paper describes an R/Bioconductor package called TFutils, which assembles various resources intended to clarify and unify approaches to working with TF concepts in bioinformatic analysis. Computations described in this paper can be carried out with Bioconductor version 3.8. The package can be installed with
# use install.packages("BiocManager") if not already available
library(BiocManager)
install("TFutils")
In the next section we describe the basic concepts of enumerating and classifying TFs, enumerating TF targets, and representing genome-wide quantification of TF binding affinity. This is followed by a review of the key data structures and functions provided in the package, and an example in cancer informatics.
The present paper does not deal directly with the manipulation or interpretation of sequence motifs. An excellent Bioconductor package that synthesizes many approaches to these tasks is universalmotif.
Given the importance of the topic, it is not surprising that a number of bioinformatic research groups have published catalogs of transcription factors along with metadata about their features. Standard nomenclature for TFs has yet to be established. Gene symbols, motif sequences, and position-weight matrix catalog entries have all been used as TF identifiers.
In TFutils we have gathered information from four widely used
resources, focusing specifically on human TFs: Gene Ontology (GO, Ashburner et al. (2000), in which
GO:0003700
is the tag for the molecular function concept
“DNA binding transcription factor activity”), CISBP (Weirauch et al. (2014)), HOCOMOCO (Kulakovskiy et al. (2018)), and the “c3 TFT
(transcription factor target)” signature set of MSigDb (Subramanian et al. (2005)). Figure
@ref(fig:lkupset) depicts the sizes of these catalogs, measured using
counts of unique HGNC gene symbols. The enumeration for GO uses
Bioconductor’s org.Hs.eg.db
package to find direct associations from GO:0003700
to HGNC
symbols. The enumeration for MSigDb is heuristic and involves parsing
the gene set identifiers used in MSigDb for exact or close matches to
HGNC symbols. For CISBP and HOCOMOCO, the associated web servers provide
easily parsed tabular catalogs.
As noted by Weirauch et al. (2014), interpretation of the “function and evolution of DNA sequences” is dependent on the analysis of sequence-specific DNA binding domains. These domains are dynamic and cell-type specific (Gertz et al. (2013)). Classifying TFs according to features of the binding domain is an ongoing process of increasing intricacy. Figure @ref(fig:TFclass) shows excerpts of hierarchies of terms related to TF type derived from GO (on the left) and TFclass (Wingender et al. (2018)). There is a disagreement between our enumeration of TFs based on GO in Figure @ref(fig:lkupset) and the 1919 shown in AmiGO, as the latter includes a broader collection of receptor activities.
Table @ref(tab:classtab) provides examples of frequently encountered TF classifications in the CISBP and HOCOMOCO catalogs. The numerical components of the HOCOMOCO classes correspond to TFClass subfamilies (Wingender et al. (2018)).
CISBP | Nc | HOCOMOCO | Nh |
---|---|---|---|
C2H2 ZF | 655 | More than 3 adjacent zinc finger factors{2.3.3} | 106 |
Homeodomain | 199 | HOX-related factors{3.1.1} | 41 |
bHLH | 104 | NK-related factors{3.1.2} | 36 |
bZIP | 66 | Paired-related HD factors{3.1.3} | 35 |
Unknown | 49 | Factors with multiple dispersed zinc fingers{2.3.4} | 30 |
Forkhead | 48 | Forkhead box (FOX) factors{3.3.1} | 27 |
Sox | 48 | Ets-related factors{3.5.2} | 25 |
Nuclear receptor | 46 | Three-zinc finger Krueppel-related factors{2.3.1} | 20 |
Myb/SANT | 30 | POU domain factors{3.1.10} | 18 |
Ets | 27 | Tal-related factors{1.2.3} | 18 |
The Broad Institute MSigDb (Subramanian et al.
(2005)) includes a gene set collection devoted to cataloging TF
targets. We have used Bioconductor’s GSEABase
package to import and serialize the gmt
representation of
this collection.
## GeneSetCollection
## names: AAANWWTGC_UNKNOWN, AAAYRNCTG_UNKNOWN, ..., GCCATNTTG_YY1_Q6 (615 total)
## unique identifiers: 4208, 481, ..., 56903 (12774 total)
## types in collection:
## geneIdType: EntrezIdentifier (1 total)
## collectionType: NullCollection (1 total)
Names of TFs for which target sets are assembled are encoded in a systematic way, with underscores separating substrings describing motifs, genes, and versions. Some peculiarity in nomenclature in the MSigDb labels can be observed:
## [1] "NFKAPPAB65_01" "NFKAPPAB_01" "NFKB_Q6"
## [4] "NFKB_C" "NFKB_Q6_01" "GGGNNTTTCC_NFKB_Q6_01"
Manual curation will be needed to improve the precision with which MSigDb TF target sets can be associated with specific TFs or motifs.
In this subsection we address representation of putative binding sites. First we illustrate how to represent sequence-based affinity measures and the binding site locations implied by these. We then discuss use of results of ChIP-seq experiments for cell-type-specific binding site enumeration.
Affinity scores based on reference sequence. The FIMO algorithm of the MEME suite (Grant, Bailey, and Noble (2011)) was used to score the human reference genome for TF binding affinity for 689 motif matrices to which genes are associated. Full details are provided in Sonawane et al. (2017). Sixteen (16) tabix-indexed BED files are lodged in an AWS S3 bucket for illustration purposes.
## GenomicFiles object with 0 ranges and 16 files:
## files: M0635_1.02sort.bed.gz, M3433_1.02sort.bed.gz, ..., M6159_1.02sort.bed.gz, M6497_1.02sort.bed.gz
## detail: use files(), rowRanges(), colData(), ...
## DataFrame with 6 rows and 2 columns
## Mtag HGNC
## <character> <character>
## 1 M0635_1 DMRTC2
## 2 M3433_1 HOXA3
## 3 M3467_1 IRF1
## 4 M3675_1 POU2F1
## 5 M3698_1 TP53
## 6 M3966_1 STAT1
We harvest scores in a genomic interval of interest (bound to
fimo16
in the rowRanges
assignment below)
using reduceByFile
. This yields a list with one element per
file. Each such element holds a list of scanTabix
results,
one per query range.
if (.Platform$OS.type != "windows") {
library(BiocParallel)
register(SerialParam()) # important for macosx?
rowRanges(fimo16) = GRanges("chr17", IRanges(38.077e6, 38.084e6))
rr = GenomicFiles::reduceByFile(fimo16, MAP=function(r,f)
scanTabix(f, param=r))
} else {
rr = readRDS(system.file("fimoser/fimo16rr.rds", package="TFutils"))
}
scanTabix produces a list of vectors of text strings, which we parse
with data.table::fread
. The resulting tables are then
reduced to a genomic location and -log10 of the p-value derived from the
binding affinity statistic of FIMO in the vicinity of that location.
asdf = function(x) data.table::fread(paste0(x, collapse="\n"), header=FALSE)
gg = lapply(rr, function(x) {
tmp = asdf(x[[1]][[1]])
data.frame(loc=tmp$V2, score=-log10(tmp$V7))
})
for (i in 1:length(gg)) gg[[i]]$tf = colData(fimo16)[i,2]
It turns out there are too many distinct TFs to display names individually, so we label the scores with the names of the associated TF families as defined in CISBP.
matchcis = match(colData(fimo16)[,2], cisbpTFcat[,2])
famn = cisbpTFcat[matchcis,]$Family_Name
for (i in 1:length(gg)) gg[[i]]$tffam = famn[i]
nn = do.call(rbind, gg)
A simple display of predicted TF binding affinity near the gene ORMDL3 is provided in Figure @ref(fig:finish).
TF binding predictions based on ChIP-seq data from
ENCODE. The ENCODE project provides BED-formatted reports on
ChIP-seq experiments for many combinations of cell type and DNA-binding
factors. TFutils includes a table encode690
that gives
information on 690 experiments involving pairs formed from 91 cell lines
and 161 TFs for which results have been recorded as GRanges instances
that can be acquired with the AnnotationHub
package. Positional relationships between cell-type specific binding
sites and genomic features can be investigated. An illustration is given
in Figure @ref(fig:lkbi), in which is it suggested that in HepG2 cells,
CEBPB exhibits a distinctive pattern of binding in the vicinity of
ORMDL3.
We have compared enumerations of human transcription factors by different projects, provided access to two forms of binding domain classification, and illustrated the use of cloud-resident genome-wide binding predictions. In the next section we review selected details of data structures and methods of the TFutils package.
The TFutils package is designed to lower barriers to usage of key findings of TF biology in human genome research. TFutils is supplied as a conventional R package distributed with, and making use of, the Bioconductor software ecosystem. TFutils includes ready-to-use reference data, tools for visualizing binding sites, and tools that simplify integrative use of TF binding information with GWAS findings.
Catalogs. Two reference resources have been
collected into the TFutils package as data.frame instances. These are
cisbpTFcat
(CISBP: 7592 x 28), and
hocomoco.mono.sep2018
(mononucleotide models, full catalog,
769 x 9). These data.frames are snapshots of the CISBP and HOCOMOCO
catalogs
Indexed BED in AWS S3. As described above
fimo16
provides programmatic access to FIMO scores for 16
TFs, using the GenomicFiles
protocol.
Annotated reference to ENCODE ChIP-seq results.
encode690
simplifies programmatic access to TF:cell-line
combinations available in Bioconductor AnnotationHub.
TF targets enumerated in MsigDb. The c3-TFT (TF targets) subset from MSigDb is provided as a GeneSetCollection instance as defined in GSEABase.
Illustrative GWAS records. The full EBI/EMBL GWAS
catalog is available in the gwascat
package; for convenience, an excerpt focusing on chromosome 17 is
supplied with TFutils as gwascat_hg19_chr17
.
Interactive enumeration of TF targets implicated in
GWAS. The TFtargs
function runs a shiny app that
permits selection of a TF in the nomenclature of the MSigDb c3/TFT gene
set collection. The app will search an object provided by the gwascat
package for references in the MAPPED_GENE
field that match
the targets of the selected TF. Figure @ref(fig:lktarapp) gives an
illustration.
The TFCatalog S4 class. Reference catalogs for TF
biology are structured with the TFCatalog
S4 class. Two
essential components for managing a catalog are the native TF identifier
for the catalog and the HGNC gene symbol typically used to name the TF.
The TFCatalog
class includes a name field to name the
catalog, and a character vector with elements comprised of the native
identifiers for catalogued TFs.
For example, CISBP uses T004843_1.02
to refer to motifs
associated with gene TFAP2B. There are five such motifs, three derived
from SELEX, one from Transfac, and one from Hocomoco.
A data.frame
instance that has an obligatory column
named ‘HGNC’ can include any collection of fields that offer metadata
about the TF in the specified catalog. Here is how we construct and view
a TFCatalog object using the CISBP reference data.
data(cisbpTFcat)
TFs_CISBP = TFCatalog(name="CISBP.info",
nativeIds=cisbpTFcat[,1],
HGNCmap = cisbpTFcat)
TFs_CISBP
## TFutils TFCatalog instance CISBP.info
## 7592 native Ids, including
## T004843_1.02 ... T153733_1.02
## 1551 unique HGNC tags, including
## TFAP2B TFAP2B ... ZNF10 ZNF350
In this section we consider applications of the tools in genetic epidemiology. First we look for TFs that may harbor variants associated with traits in the EBI GWAS catalog. Then we show how to enumerate traits associated with targets of a selected TF.
TFs that are direct GWAS hits for a given trait.
directHitsInCISBP
accepts a string naming a trait , and
returns a data.frame of TFs identified as “mapped genes” for the trait,
with their TF “family name”.
library(dplyr)
library(magrittr)
library(gwascat)
#data(ebicat37)
load(system.file("legacy/ebicat37.rda", package="gwascat"))
directHitsInCISBP("Rheumatoid arthritis", ebicat37)
## Joining with `by = join_by(HGNC)`
## HGNC Family_Name
## 1 ARID5B ARID/BRIGHT
## 7 EOMES T-box
## 15 GATA3 GATA
## 35 JAZF1 C2H2 ZF
## 37 MECP2 MBD
## 45 MTF1 C2H2 ZF
## 57 REL Rel
## 65 STAT4 STAT
## 79 AIRE SAND
## 82 IRF5 IRF
Traits mapped to genes that are targets of a given TF
topTraitsOfTargets
will acquire the targets of a
selected TF, check for hits in these genes in a given GWAS catalog
instance, and tabulate the most commonly reported traits.
## remapping identifiers of input GeneSetCollection to Symbol...
## done
## DISEASE.TRAIT MAPPED_GENE SNPS CHR_ID
## 1 Atopic dermatitis TNXB rs41268896 6
## 2 Atopic dermatitis TNXB rs12153855 6
## 3 Atopic dermatitis KIF3A rs2897442 5
## 4 Attention deficit hyperactivity disorder SEMA3A rs797820 7
## 5 Attention deficit hyperactivity disorder DNM1 rs2502731 9
## 6 Attention deficit hyperactivity disorder GPC6 rs7995215 13
## CHR_POS
## 1 32102292
## 2 32107027
## 3 132713335
## 4 83979723
## 5 128214278
## 6 93756253
##
## Atopic dermatitis
## 3
## Attention deficit hyperactivity disorder
## 3
## Height
## 7
## Menarche (age at onset)
## 4
## Obesity-related traits
## 11
## Rheumatoid arthritis
## 3
Sources and consequences of variations in DNA transcription are fundamental problems for cell biology, and the projects we have made use of for cataloging transcription factors are at the boundaries of current knowledge.
It is noteworthy that the four resources used for Figure @ref(fig:lkupset) agree on names of only 119 TFs. The fact that CISBP distinguishes 475 TFs that are not identified in any other source should be better understood. We observe that the ascription of TF status to AHRR is based on its sharing motifs with AHR (see ).
Figure @ref(fig:TFclass) and Table @ref(tab:classtab) show that the classification of TFs is now fairly elaborate. Use of the precise terminology of the TFClass system to label TFs of interest at present relies on associations provided with the HOCOMOCO catalog.
As population studies in genomic and genetic epidemiology grow in
size and scope, principles for organizing and prioritizing loci
associated with phenotypes of interest are urgently needed. Figure
@ref(fig:lktarapp) shows that loci associated with phenotypes related to
kidney function, lung function, and IL-8 levels are potentially unified
through the fact that the GWAS hits are connected with genes identified
as targets of VDR (vitamin D receptor). This example limited attention
to hits on chromosome 17; the TFtargs
tool permits
exploration of phenotype-locus-gene-TF associations. Our hope is that
the tools and resources collected in TFutils will foster systematic
development of evidence-based mechanistic network models for
transcription regulation in human disease contexts, thereby contributing
to the development of personalized genomic medicine.
Support for the development of this software was provided by NIH grants U01 CA214846 (Carey, PI), U24 CA180996 (Morgan, PI), Chan Zuckerberg Initiative DAF 2018-183436 (Carey, PI), and R01 NHLBI HL118455 (Raby, PI).
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] UpSetR_1.4.0 magrittr_2.0.3
## [3] dplyr_1.1.4 gwascat_2.39.0
## [5] GSEABase_1.69.0 graph_1.85.0
## [7] annotate_1.85.0 XML_3.99-0.17
## [9] png_0.1-8 ggplot2_3.5.1
## [11] knitr_1.49 data.table_1.16.2
## [13] GO.db_3.20.0 GenomicFiles_1.43.0
## [15] rtracklayer_1.67.0 Rsamtools_2.23.0
## [17] Biostrings_2.75.1 XVector_0.47.0
## [19] BiocParallel_1.41.0 SummarizedExperiment_1.37.0
## [21] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
## [23] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [25] org.Hs.eg.db_3.20.0 AnnotationDbi_1.69.0
## [27] IRanges_2.41.0 S4Vectors_0.45.1
## [29] Biobase_2.67.0 BiocGenerics_0.53.2
## [31] generics_0.1.3 TFutils_1.27.1
## [33] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 bitops_1.0-9 gridExtra_2.3
## [4] readxl_1.4.3 rlang_1.1.4 compiler_4.4.2
## [7] RSQLite_2.3.7 GenomicFeatures_1.59.1 vctrs_0.6.5
## [10] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
## [13] dbplyr_2.5.0 labeling_0.4.3 utf8_1.2.4
## [16] promises_1.3.0 rmarkdown_2.29 tzdb_0.4.0
## [19] UCSC.utils_1.3.0 bit_4.5.0 xfun_0.49
## [22] zlibbioc_1.52.0 cachem_1.1.0 jsonlite_1.8.9
## [25] blob_1.2.4 later_1.3.2 DelayedArray_0.33.1
## [28] parallel_4.4.2 R6_2.5.1 VariantAnnotation_1.53.0
## [31] bslib_0.8.0 jquerylib_0.1.4 cellranger_1.1.0
## [34] Rcpp_1.0.13-1 readr_2.1.5 splines_4.4.2
## [37] httpuv_1.6.15 Matrix_1.7-1 tidyselect_1.2.1
## [40] abind_1.4-8 yaml_2.3.10 codetools_0.2-20
## [43] miniUI_0.1.1.1 curl_6.0.0 plyr_1.8.9
## [46] lattice_0.22-6 tibble_3.2.1 withr_3.0.2
## [49] shiny_1.9.1 KEGGREST_1.47.0 evaluate_1.0.1
## [52] survival_3.7-0 BiocFileCache_2.15.0 snpStats_1.57.0
## [55] pillar_1.9.0 BiocManager_1.30.25 filelock_1.0.3
## [58] RCurl_1.98-1.16 hms_1.1.3 munsell_0.5.1
## [61] scales_1.3.0 xtable_1.8-4 glue_1.8.0
## [64] maketools_1.3.1 tools_4.4.2 BiocIO_1.17.0
## [67] sys_3.4.3 BSgenome_1.75.0 GenomicAlignments_1.43.0
## [70] buildtools_1.0.0 colorspace_2.1-1 GenomeInfoDbData_1.2.13
## [73] restfulr_0.0.15 cli_3.6.3 fansi_1.0.6
## [76] S4Arrays_1.7.1 gtable_0.3.6 sass_0.4.9
## [79] digest_0.6.37 SparseArray_1.7.1 farver_2.1.2
## [82] rjson_0.2.23 memoise_2.0.1 htmltools_0.5.8.1
## [85] lifecycle_1.0.4 httr_1.4.7 mime_0.12
## [88] bit64_4.5.2