RcisTarget: Transcription factor binding motif enrichment

Vignette built on Nov 30, 2024 with RcisTarget version 1.27.0.

What is RcisTarget?

RcisTarget is an R-package to identify transcription factor (TF) binding motifs over-represented on a gene list.

RcisTarget is based on the methods previously implemented in i-cisTarget (web interface, region-based) and iRegulon (Cytoscape plug-in, gene-based).

If you use RcisTarget in your research, please cite:

## Aibar et al. (2017) SCENIC: single-cell regulatory network 
##   inference and clustering. Nature Methods. doi: 10.1038/nmeth.4463

Overview of the workflow

The function cisTarget() allows to perform the motif-enrichment analysis on a gene list. The main input parameters are the gene list and the motif databases, which should be chosen depending on the organism and the search space around the TSS of the genes. This is a sample on how to run the analysis (see the following sections for details):

library(RcisTarget)

# Load gene sets to analyze. e.g.:
geneList1 <- read.table(file.path(system.file('examples', package='RcisTarget'), "hypoxiaGeneSet.txt"), stringsAsFactors=FALSE)[,1]
geneLists <- list(geneListName=geneList1)
# geneLists <- GSEABase::GeneSet(genes, setName="geneListName") # alternative

# Select motif database to use (i.e. organism and distance around TSS)
data(motifAnnotations_hgnc)
motifRankings <- importRankings("~/databases/hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather")

# Motif enrichment analysis:
motifEnrichmentTable_wGenes <- cisTarget(geneLists, motifRankings,
                               motifAnnot=motifAnnotations)

Advanced use: The cisTarget() function is enough for most simple analyses. However, for further flexibility (e.g. removing unnecessary steps on bigger analyses), RcisTarget also provides the possibility to run the inner functions individually. Running cisTarget() is equivalent to running this code:

# 1. Calculate AUC
motifs_AUC <- calcAUC(geneLists, motifRankings)

# 2. Select significant motifs, add TF annotation & format as table
motifEnrichmentTable <- addMotifAnnotation(motifs_AUC, 
                          motifAnnot=motifAnnotations)

# 3. Identify significant genes for each motif
# (i.e. genes from the gene set in the top of the ranking)
# Note: Method 'iCisTarget' instead of 'aprox' is more accurate, but slower
motifEnrichmentTable_wGenes <- addSignificantGenes(motifEnrichmentTable, 
                                                   geneSets=geneLists,
                                                   rankings=motifRankings, 
                                                   nCores=1,
                                                   method="aprox")

Before starting

Setup

RcisTarget uses species-specific databases which are provided as independent R-packages. Prior to running RcisTarget, you will need to download and install the databases for the relevant organism (more details in the “Motif databases” section).

In addition, some extra packages can be installed to run RcisTarget in parallel or run the interactive examples in this tutorial:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
# To support paralell execution:
BiocManager::install(c("doMC", "doRNG"))
# For the examples in the follow-up section of the tutorial:
BiocManager::install(c("DT", "visNetwork"))

Some tips…

Help

At any time, remember you an access the help files for any function (i.e. ?cisTarget), and the other tutorials in the package with the following commands:

# Explore tutorials in the web browser:
browseVignettes(package="RcisTarget") 

# Commnad line-based:
vignette(package="RcisTarget") # list
vignette("RcisTarget") # open

Reports

To generate an HTML report with your own data and comments, you can use the .Rmd file of this tutorial as template (i.e. copy the .Rmd file, and edit it as R notebook in RStudio).

vignetteFile <- paste(file.path(system.file('doc', package='RcisTarget')), 
                      "RcisTarget.Rmd", sep="/")
# Copy to edit as markdown
file.copy(vignetteFile, ".")
# Alternative: extract R code
Stangle(vignetteFile)

Running RcisTarget

Input: Gene sets

The main input to RcisTarget is the gene set(s) to analyze. The gene sets can be provided as a ‘named list’ in which each element is a gene-set (character vector containing gene names) or as a GSEABase::GeneSet. The gene-set name will be used as ID in the following steps.

library(RcisTarget)
geneSet1 <- c("gene1", "gene2", "gene3")
geneLists <- list(geneSetName=geneSet1)
# or: 
# geneLists <- GSEABase::GeneSet(geneSet1, setName="geneSetName") 

Some extra help:

class(geneSet1)
class(geneLists)
geneSet2 <- c("gene2", "gene4", "gene5", "gene6")
geneLists[["anotherGeneSet"]] <- geneSet2
names(geneLists)
geneLists$anotherGeneSet
lengths(geneLists)

In this example we will be using a list of genes up-regulated in MCF7 cells under hypoxia conditions (PMID:16565084).

The original work highlighted the role of hypoxia-inducible factor (HIF1-alpha or HIF1A) pathways under hypoxia. This gene list is also used for the turorials in iRegulon (http://iregulon.aertslab.org/tutorial.html).

txtFile <- paste(file.path(system.file('examples', package='RcisTarget')),
                 "hypoxiaGeneSet.txt", sep="/")
geneLists <- list(hypoxia=read.table(txtFile, stringsAsFactors=FALSE)[,1])
head(geneLists$hypoxia)
## [1] "ADM"     "ADORA2B" "AHNAK2"  "AK4"     "AKAP12"  "ALDOC"

Required databases

To analyze the gene list, RcisTarget needs two types of databases:

    1. Gene-motif rankings: which provides the rankings (based on the motif score) of all the genes for each motif.
    1. The annotation of motifs to transcription factors.

Gene-motif rankings

The score of each pair of gene-motif can be performed using different parameters. Therefore, we provide multiple databases (motif-rankings, or alternative: mirror), according to several possibilities:

  • Species: Species of the input gene set. Available values: Human (Homo sapiens), mouse (Mus musculus) or fly (Drosophila melanogaster)
  • Scoring/search space: determine the search space around the transcription start site (TSS). Available values: 500 bp uptream the TSS, 5kbp or 10kbp around the TSS (e.g. 10kbp upstream and 10kbp downstream of the TSS).
  • Motif collection: We recommend to use the latest one (mc10_clust).

If you don’t know which database to choose, for an analisis of a gene list we would suggest using the 500bp upstream TSS, and a larger search space (e.g. TSS+/-5kbp or TSS+/-10kbp). Of course, selecting Human, Mouse or fly depending on your input gene list.

For other settings (e.g. new species), you can check the tutorial on how to create databases.

Each database is stored in a .feather file. Note that the download size is typically over 1GB (100GB for human region databases), we recommend downloading the files with zsync_curl` (see the Help with downloads).

Once you have the .feather files, they can be loaded with importRankings():

# Search space: 10k bp around TSS - HUMAN
motifRankings <- importRankings("hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather")
# Load the annotation to human transcription factors
data(motifAnnotations_hgnc)

Motif annotations

All the calculations performed by RcisTarget are motif-based. However, most users will be interested on TFs potentially regulating the gene-set. The association of motif to transcription factors are provided in an independent file. In addition to the motifs annotated by the source datatabse (i.e. direct annotation), we have also inferred some further annotations based on motif similarity and gene ortology (e.g. similarity to other genes annotated to the motif). These annotations are typically used with the functions cisTarget() or addMotifAnnotation().

For the motifs in version ‘mc10nr’ of the rankings, these annotations are already included in RcisTarget package and can be loaded with the command:

# mouse:
# data(motifAnnotations_mgi)

# human:
data(motifAnnotations_hgnc)
motifAnnotations[199:202,]
## Key: <motif, TF>
##            motif     TF directAnnotation inferred_Orthology inferred_MotifSimil
##           <char> <char>           <lgcl>             <lgcl>              <lgcl>
## 1: cisbp__M00192  SMAD3            FALSE               TRUE               FALSE
## 2: cisbp__M00193   TCF7            FALSE               TRUE               FALSE
## 3: cisbp__M00194   HBP1            FALSE               TRUE               FALSE
## 4: cisbp__M00195    CIC            FALSE               TRUE               FALSE
##        annotationSource
##                  <fctr>
## 1: inferredBy_Orthology
## 2: inferredBy_Orthology
## 3: inferredBy_Orthology
## 4: inferredBy_Orthology
##                                                                                                         description
##                                                                                                              <char>
## 1: gene is orthologous to ENSMUSG00000032402 in M. musculus (identity = 100%) which is directly annotated for motif
## 2:  gene is orthologous to ENSMUSG00000000782 in M. musculus (identity = 93%) which is directly annotated for motif
## 3:  gene is orthologous to ENSMUSG00000002996 in M. musculus (identity = 95%) which is directly annotated for motif
## 4:                       motif is annotated for orthologous gene ENSMUSG00000005442 in M. musculus (identity = 92%)
##    keptInRanking
##           <lgcl>
## 1:         FALSE
## 2:         FALSE
## 3:         FALSE
## 4:          TRUE

For other versions of the rankings, the function importAnnotations allows to import the annotations from the source file.

These annotations can be easily queried to obtain further information about specific motifs or TFs:

showLogo(motifAnnotations[(directAnnotation==TRUE) & keptInRanking 
                               & (TF %in% c("HIF1A", "HIF2A", "EPAS1")), ])

Database example (subset)

In addition to the full versions of the databases (~20k motifs), we also provide a subset containing only the 4.6k motifs from cisbp (human only: RcisTarget.hg19.motifDBs.cisbpOnly.500bp). These subsets are available in Bioconductor for demonstration purposes. They will provide the same AUC score for the existing motifs. However, we highly recommend to use the full version (~20k motifs) for more accurate results, since the normalized enrichment score (NES) of the motif depends on the total number of motifs in the database.

To install this package:

# From Bioconductor
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("RcisTarget.hg19.motifDBs.cisbpOnly.500bp")

For this vignette (demonstration purposes), we will use this database:

library(RcisTarget.hg19.motifDBs.cisbpOnly.500bp)
# Rankings
data(hg19_500bpUpstream_motifRanking_cispbOnly)
motifRankings <- hg19_500bpUpstream_motifRanking_cispbOnly
motifRankings
## Rankings for RcisTarget.
##   Organism: human (hg19)
##   Number of genes: 22284 (22285 available in the full DB)
##   Number of MOTIF: 4687
## ** This database includes rankings up to 5050
## 
## Subset (4.6k cisbp motifs) of the Human database scoring motifs 500bp upstream the TSS (hg19-500bp-upstream-7species.mc9nr)
# Annotations
data(hg19_motifAnnotation_cisbpOnly) 
motifAnnotations <- hg19_motifAnnotation_cisbpOnly

Running the analysis

Once the gene lists and databases are loaded, they can be used by cisTarget(). cisTarget() runs sequentially the steps to perform the (1) motif enrichment analysis, (2) motif-TF annotation, and (3) selection of significant genes.

It is also possible to run these steps as individual commands. For example, to skip steps, for analyses in which the user is interested in one of the outputs, or to optimize the workflow to run it on multiple gene lists (see Advanced section for details).

motifEnrichmentTable_wGenes <- cisTarget(geneLists, 
         motifRankings,
         motifAnnot=motifAnnotations)

Advanced: Execute step by step

1. Calculate enrichment

The first step to estimate the over-representation of each motif on the gene-set is to calculate the Area Under the Curve (AUC) for each pair of motif-geneSet. This is calculated based on the recovery curve of the gene-set on the motif ranking (genes ranked decreasingly by the score of motif in its proximity, as provided in the motifRanking database).

motifs_AUC <- calcAUC(geneLists, motifRankings, nCores=1)

The AUC is provided as a matrix of Motifs by GeneSets. In principle, the AUC is mostly meant as input for the next step. However, it is also possible to explore the distribution of the scores, for example in a gene-set of interest:

auc <- getAUC(motifs_AUC)["hypoxia",]
hist(auc, main="hypoxia", xlab="AUC histogram",
     breaks=100, col="#ff000050", border="darkred")
nes3 <- (3*sd(auc)) + mean(auc)
abline(v=nes3, col="red")

2. Select significant motifs and/or annotate to TFs

The selection of significant motifs is done based on the Normalized Enrichment Score (NES). The NES is calculated -for each motif- based on the AUC distribution of all the motifs for the gene-set [(x-mean)/sd]. Those motifs that pass the given threshold (3.0 by default) are considered significant.

Furthermore, this step also allows to add the TFs annotated to the motif.

motifEnrichmentTable <- addMotifAnnotation(motifs_AUC, nesThreshold=3,
     motifAnnot=motifAnnotations,
     highlightTFs=list(hypoxia=c("HIF1A", "EPAS1")))
class(motifEnrichmentTable)
## [1] "data.table" "data.frame"
dim(motifEnrichmentTable)
## [1] 17  8
head(motifEnrichmentTable[,-"TF_lowConf", with=FALSE])
##    geneSet        motif   NES    AUC highlightedTFs TFinDB
##     <char>       <char> <num>  <num>         <char> <char>
## 1: hypoxia cisbp__M6275  4.41 0.0966   HIF1A, EPAS1     **
## 2: hypoxia cisbp__M0062  3.57 0.0841   HIF1A, EPAS1       
## 3: hypoxia cisbp__M6279  3.56 0.0840   HIF1A, EPAS1       
## 4: hypoxia cisbp__M6212  3.49 0.0829   HIF1A, EPAS1     **
## 5: hypoxia cisbp__M2332  3.48 0.0828   HIF1A, EPAS1       
## 6: hypoxia cisbp__M0387  3.41 0.0817   HIF1A, EPAS1       
##                   TF_highConf
##                        <char>
## 1: HIF1A (directAnnotation). 
## 2:                           
## 3: HMGA1 (directAnnotation). 
## 4: EPAS1 (directAnnotation). 
## 5:                           
## 6:

The cathegories considered high/low confidence can me modified with the arguments motifAnnot_highConfCat and motifAnnot_lowConfCat.

3. Identify the genes with the best enrichment for each Motif

Since RcisTarget searches for enrichment of a motif within a gene list, finding a motif ‘enriched’ does not imply that all the genes in the gene-list have a high score for the motif. In this way, the third step of the workflow is to identify which genes (of the gene-set) are highly ranked for each of the significant motifs.

There are two methods to identify these genes: (1) equivalent to the ones used in iRegulon and i-cisTarget (method="iCisTarget", recommended if running time is not an issue), and (2) a faster implementation based on an approximate distribution using the average at each rank (method="aprox", useful to scan multiple gene sets).

IMPORTANT: Make sure that the motifRankings are the same as in Step 1.

motifEnrichmentTable_wGenes <- addSignificantGenes(motifEnrichmentTable,
                      rankings=motifRankings, 
                      geneSets=geneLists)
dim(motifEnrichmentTable_wGenes)
## [1] 17 11

Plot for a few motifs:

geneSetName <- "hypoxia"
selectedMotifs <- c(sample(motifEnrichmentTable$motif, 2))
par(mfrow=c(2,2))
getSignificantGenes(geneLists[[geneSetName]], 
                    rankings=motifRankings,
                    signifRankingNames=selectedMotifs,
                    plotCurve=TRUE, maxRank=5000, genesFormat="none",
                    method="aprox")

Output

The final output of RcisTarget is a data.table containing the information about the motif enrichment and its annotation organized in the following fields:

  • geneSet: Name of the gene set
  • motif: ID of the motif
  • NES: Normalized enrichment score of the motif in the gene-set
  • AUC: Area Under the Curve (used to calculate the NES)
  • TFinDB: Indicates whether the highlightedTFs are included within the high confidence annotation (two asterisks) or low confidence annotation (one asterisk).
  • TF_highConf: Transcription factors annotated to the motif according to ‘motifAnnot_highConfCat’.
  • TF_lowConf: Transcription factors annotated to the motif according to ‘motifAnnot_lowConfCat’.
  • enrichedGenes: Genes that are highly ranked for the given motif.
  • nErnGenes: Number of genes highly ranked
  • rankAtMax: Ranking at the maximum enrichment, used to determine the number of enriched genes.
resultsSubset <- motifEnrichmentTable_wGenes[1:10,]
showLogo(resultsSubset)

Follow up examples

TFs annotated to the enriched motifs

Note that the TFs are provided based on the motif annotation. They can be used as a guide to select relevant motifs or to prioritize some TFs, but the motif annotation does not imply that all the TFs appearing in the table regulate the gene list.

anotatedTfs <- lapply(split(motifEnrichmentTable_wGenes$TF_highConf,
                            motifEnrichmentTable$geneSet),
                      function(x) {
                        genes <- gsub(" \\(.*\\). ", "; ", x, fixed=FALSE)
                        genesSplit <- unique(unlist(strsplit(genes, "; ")))
                        return(genesSplit)
                        })
                      
anotatedTfs$hypoxia
##  [1] "HIF1A" "HMGA1" "EPAS1" "FOXJ1" "FOXJ2" "FOXJ3" "FOXP1" "FOXP2" "FOXP3"
## [10] "FOXP4" "FOXG1"

Building a network

signifMotifNames <- motifEnrichmentTable$motif[1:3]

incidenceMatrix <- getSignificantGenes(geneLists$hypoxia, 
                                       motifRankings,
                                       signifRankingNames=signifMotifNames,
                                       plotCurve=TRUE, maxRank=5000, 
                                       genesFormat="incidMatrix",
                                       method="aprox")$incidMatrix

library(reshape2)
edges <- melt(incidenceMatrix)
edges <- edges[which(edges[,3]==1),1:2]
colnames(edges) <- c("from","to")

Output not shown:

library(visNetwork)
motifs <- unique(as.character(edges[,1]))
genes <- unique(as.character(edges[,2]))
nodes <- data.frame(id=c(motifs, genes),   
      label=c(motifs, genes),    
      title=c(motifs, genes), # tooltip 
      shape=c(rep("diamond", length(motifs)), rep("elypse", length(genes))),
      color=c(rep("purple", length(motifs)), rep("skyblue", length(genes))))
visNetwork(nodes, edges) %>% visOptions(highlightNearest = TRUE, 
                                        nodesIdSelection = TRUE)

sessionInfo()

This is the output of sessionInfo() on the system on which this document was compiled:

date()
## [1] "Sat Nov 30 03:50:28 2024"
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.16.2                              
## [2] DT_0.33                                        
## [3] RcisTarget.hg19.motifDBs.cisbpOnly.500bp_1.26.0
## [4] RcisTarget_1.27.0                              
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.37.0 KEGGREST_1.47.0            
##  [3] xfun_0.49                   bslib_0.8.0                
##  [5] htmlwidgets_1.6.4           Biobase_2.67.0             
##  [7] lattice_0.22-6              crosstalk_1.2.1            
##  [9] vctrs_0.6.5                 tools_4.4.2                
## [11] generics_0.1.3              stats4_4.4.2               
## [13] tibble_3.2.1                fansi_1.0.6                
## [15] AnnotationDbi_1.69.0        RSQLite_2.3.8              
## [17] blob_1.2.4                  R.oo_1.27.0                
## [19] pkgconfig_2.0.3             Matrix_1.7-1               
## [21] sparseMatrixStats_1.19.0    S4Vectors_0.45.2           
## [23] assertthat_0.2.1            graph_1.85.0               
## [25] lifecycle_1.0.4             GenomeInfoDbData_1.2.13    
## [27] compiler_4.4.2              Biostrings_2.75.1          
## [29] codetools_0.2-20            GenomeInfoDb_1.43.2        
## [31] htmltools_0.5.8.1           sys_3.4.3                  
## [33] buildtools_1.0.0            sass_0.4.9                 
## [35] yaml_2.3.10                 pillar_1.9.0               
## [37] crayon_1.5.3                jquerylib_0.1.4            
## [39] R.utils_2.12.3              cachem_1.1.0               
## [41] DelayedArray_0.33.2         abind_1.4-8                
## [43] tidyselect_1.2.1            digest_0.6.37              
## [45] purrr_1.0.2                 dplyr_1.1.4                
## [47] arrow_17.0.0.1              maketools_1.3.1            
## [49] AUCell_1.29.0               fastmap_1.2.0              
## [51] grid_4.4.2                  SparseArray_1.7.2          
## [53] cli_3.6.3                   magrittr_2.0.3             
## [55] S4Arrays_1.7.1              XML_3.99-0.17              
## [57] utf8_1.2.4                  GSEABase_1.69.0            
## [59] DelayedMatrixStats_1.29.0   UCSC.utils_1.3.0           
## [61] bit64_4.5.2                 rmarkdown_2.29             
## [63] XVector_0.47.0              httr_1.4.7                 
## [65] matrixStats_1.4.1           bit_4.5.0                  
## [67] zoo_1.8-12                  R.methodsS3_1.8.2          
## [69] png_0.1-8                   memoise_2.0.1              
## [71] evaluate_1.0.1              knitr_1.49                 
## [73] GenomicRanges_1.59.1        IRanges_2.41.1             
## [75] rlang_1.1.4                 Rcpp_1.0.13-1              
## [77] xtable_1.8-4                glue_1.8.0                 
## [79] DBI_1.2.3                   BiocGenerics_0.53.3        
## [81] annotate_1.85.0             jsonlite_1.8.9             
## [83] R6_2.5.1                    MatrixGenerics_1.19.0      
## [85] zlibbioc_1.52.0