Vignette built on Nov 30, 2024 with RcisTarget version 1.27.0.
RcisTarget is an R-package to identify transcription factor (TF) binding motifs over-represented on a gene list.
RcisTarget is based on the methods previously implemented in i-cisTarget (web interface, region-based) and iRegulon (Cytoscape plug-in, gene-based).
If you use RcisTarget in your research, please cite:
## Aibar et al. (2017) SCENIC: single-cell regulatory network
## inference and clustering. Nature Methods. doi: 10.1038/nmeth.4463
The function cisTarget()
allows to perform the
motif-enrichment analysis on a gene list. The main input parameters are
the gene list and the motif databases, which should be chosen depending
on the organism and the search space around the TSS of the genes. This
is a sample on how to run the analysis (see the following sections for
details):
library(RcisTarget)
# Load gene sets to analyze. e.g.:
geneList1 <- read.table(file.path(system.file('examples', package='RcisTarget'), "hypoxiaGeneSet.txt"), stringsAsFactors=FALSE)[,1]
geneLists <- list(geneListName=geneList1)
# geneLists <- GSEABase::GeneSet(genes, setName="geneListName") # alternative
# Select motif database to use (i.e. organism and distance around TSS)
data(motifAnnotations_hgnc)
motifRankings <- importRankings("~/databases/hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather")
# Motif enrichment analysis:
motifEnrichmentTable_wGenes <- cisTarget(geneLists, motifRankings,
motifAnnot=motifAnnotations)
Advanced use: The cisTarget()
function
is enough for most simple analyses. However, for further flexibility
(e.g. removing unnecessary steps on bigger analyses), RcisTarget also
provides the possibility to run the inner functions individually.
Running cisTarget()
is equivalent to running this code:
# 1. Calculate AUC
motifs_AUC <- calcAUC(geneLists, motifRankings)
# 2. Select significant motifs, add TF annotation & format as table
motifEnrichmentTable <- addMotifAnnotation(motifs_AUC,
motifAnnot=motifAnnotations)
# 3. Identify significant genes for each motif
# (i.e. genes from the gene set in the top of the ranking)
# Note: Method 'iCisTarget' instead of 'aprox' is more accurate, but slower
motifEnrichmentTable_wGenes <- addSignificantGenes(motifEnrichmentTable,
geneSets=geneLists,
rankings=motifRankings,
nCores=1,
method="aprox")
RcisTarget uses species-specific databases which are provided as independent R-packages. Prior to running RcisTarget, you will need to download and install the databases for the relevant organism (more details in the “Motif databases” section).
In addition, some extra packages can be installed to run RcisTarget in parallel or run the interactive examples in this tutorial:
At any time, remember you an access the help files for any function
(i.e. ?cisTarget
), and the other tutorials in the package
with the following commands:
To generate an HTML report with your own data and comments, you can
use the .Rmd file
of this tutorial as template (i.e. copy
the .Rmd file, and edit it as R notebook in
RStudio).
The main input to RcisTarget is the gene set(s) to analyze. The gene
sets can be provided as a ‘named list’ in which each element is a
gene-set (character vector containing gene names) or as a
GSEABase::GeneSet
. The gene-set name will be used as ID in
the following steps.
library(RcisTarget)
geneSet1 <- c("gene1", "gene2", "gene3")
geneLists <- list(geneSetName=geneSet1)
# or:
# geneLists <- GSEABase::GeneSet(geneSet1, setName="geneSetName")
Some extra help:
class(geneSet1)
class(geneLists)
geneSet2 <- c("gene2", "gene4", "gene5", "gene6")
geneLists[["anotherGeneSet"]] <- geneSet2
names(geneLists)
geneLists$anotherGeneSet
lengths(geneLists)
In this example we will be using a list of genes up-regulated in MCF7 cells under hypoxia conditions (PMID:16565084).
The original work highlighted the role of hypoxia-inducible factor (HIF1-alpha or HIF1A) pathways under hypoxia. This gene list is also used for the turorials in iRegulon (http://iregulon.aertslab.org/tutorial.html).
txtFile <- paste(file.path(system.file('examples', package='RcisTarget')),
"hypoxiaGeneSet.txt", sep="/")
geneLists <- list(hypoxia=read.table(txtFile, stringsAsFactors=FALSE)[,1])
head(geneLists$hypoxia)
## [1] "ADM" "ADORA2B" "AHNAK2" "AK4" "AKAP12" "ALDOC"
To analyze the gene list, RcisTarget needs two types of databases:
The score of each pair of gene-motif can be performed using different parameters. Therefore, we provide multiple databases (motif-rankings, or alternative: mirror), according to several possibilities:
If you don’t know which database to choose, for an analisis of a gene list we would suggest using the 500bp upstream TSS, and a larger search space (e.g. TSS+/-5kbp or TSS+/-10kbp). Of course, selecting Human, Mouse or fly depending on your input gene list.
For other settings (e.g. new species), you can check the tutorial on how to create databases.
Each database is stored in a .feather
file. Note that
the download size is typically over 1GB (100GB for human region
databases), we recommend downloading the files with
zsync_curl
` (see the Help with
downloads).
Once you have the .feather files, they can be loaded with
importRankings()
:
All the calculations performed by RcisTarget are motif-based.
However, most users will be interested on TFs potentially regulating the
gene-set. The association of motif to transcription factors are provided
in an independent file. In addition to the motifs annotated by the
source datatabse (i.e. direct annotation), we have also
inferred some further annotations based on motif
similarity and gene ortology (e.g. similarity to other genes annotated
to the motif). These annotations are typically used with the functions
cisTarget()
or addMotifAnnotation()
.
For the motifs in version ‘mc10nr’ of the rankings, these annotations are already included in RcisTarget package and can be loaded with the command:
# mouse:
# data(motifAnnotations_mgi)
# human:
data(motifAnnotations_hgnc)
motifAnnotations[199:202,]
## Key: <motif, TF>
## motif TF directAnnotation inferred_Orthology inferred_MotifSimil
## <char> <char> <lgcl> <lgcl> <lgcl>
## 1: cisbp__M00192 SMAD3 FALSE TRUE FALSE
## 2: cisbp__M00193 TCF7 FALSE TRUE FALSE
## 3: cisbp__M00194 HBP1 FALSE TRUE FALSE
## 4: cisbp__M00195 CIC FALSE TRUE FALSE
## annotationSource
## <fctr>
## 1: inferredBy_Orthology
## 2: inferredBy_Orthology
## 3: inferredBy_Orthology
## 4: inferredBy_Orthology
## description
## <char>
## 1: gene is orthologous to ENSMUSG00000032402 in M. musculus (identity = 100%) which is directly annotated for motif
## 2: gene is orthologous to ENSMUSG00000000782 in M. musculus (identity = 93%) which is directly annotated for motif
## 3: gene is orthologous to ENSMUSG00000002996 in M. musculus (identity = 95%) which is directly annotated for motif
## 4: motif is annotated for orthologous gene ENSMUSG00000005442 in M. musculus (identity = 92%)
## keptInRanking
## <lgcl>
## 1: FALSE
## 2: FALSE
## 3: FALSE
## 4: TRUE
For other versions of the rankings, the function
importAnnotations
allows to import the annotations from the
source file.
These annotations can be easily queried to obtain further information about specific motifs or TFs:
showLogo(motifAnnotations[(directAnnotation==TRUE) & keptInRanking
& (TF %in% c("HIF1A", "HIF2A", "EPAS1")), ])
In addition to the full versions of the databases (~20k motifs), we also provide a subset containing only the 4.6k motifs from cisbp (human only: RcisTarget.hg19.motifDBs.cisbpOnly.500bp). These subsets are available in Bioconductor for demonstration purposes. They will provide the same AUC score for the existing motifs. However, we highly recommend to use the full version (~20k motifs) for more accurate results, since the normalized enrichment score (NES) of the motif depends on the total number of motifs in the database.
To install this package:
# From Bioconductor
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("RcisTarget.hg19.motifDBs.cisbpOnly.500bp")
For this vignette (demonstration purposes), we will use this database:
library(RcisTarget.hg19.motifDBs.cisbpOnly.500bp)
# Rankings
data(hg19_500bpUpstream_motifRanking_cispbOnly)
motifRankings <- hg19_500bpUpstream_motifRanking_cispbOnly
motifRankings
## Rankings for RcisTarget.
## Organism: human (hg19)
## Number of genes: 22284 (22285 available in the full DB)
## Number of MOTIF: 4687
## ** This database includes rankings up to 5050
##
## Subset (4.6k cisbp motifs) of the Human database scoring motifs 500bp upstream the TSS (hg19-500bp-upstream-7species.mc9nr)
Once the gene lists and databases are loaded, they can be used by
cisTarget()
. cisTarget()
runs sequentially the
steps to perform the (1) motif enrichment analysis,
(2) motif-TF annotation, and (3)
selection of significant genes.
It is also possible to run these steps as individual commands. For example, to skip steps, for analyses in which the user is interested in one of the outputs, or to optimize the workflow to run it on multiple gene lists (see Advanced section for details).
The first step to estimate the over-representation of each motif on the gene-set is to calculate the Area Under the Curve (AUC) for each pair of motif-geneSet. This is calculated based on the recovery curve of the gene-set on the motif ranking (genes ranked decreasingly by the score of motif in its proximity, as provided in the motifRanking database).
The AUC is provided as a matrix of Motifs by GeneSets. In principle, the AUC is mostly meant as input for the next step. However, it is also possible to explore the distribution of the scores, for example in a gene-set of interest:
The selection of significant motifs is done based on the Normalized Enrichment Score (NES). The NES is calculated -for each motif- based on the AUC distribution of all the motifs for the gene-set [(x-mean)/sd]. Those motifs that pass the given threshold (3.0 by default) are considered significant.
Furthermore, this step also allows to add the TFs annotated to the motif.
motifEnrichmentTable <- addMotifAnnotation(motifs_AUC, nesThreshold=3,
motifAnnot=motifAnnotations,
highlightTFs=list(hypoxia=c("HIF1A", "EPAS1")))
## [1] "data.table" "data.frame"
## [1] 17 8
## geneSet motif NES AUC highlightedTFs TFinDB
## <char> <char> <num> <num> <char> <char>
## 1: hypoxia cisbp__M6275 4.41 0.0966 HIF1A, EPAS1 **
## 2: hypoxia cisbp__M0062 3.57 0.0841 HIF1A, EPAS1
## 3: hypoxia cisbp__M6279 3.56 0.0840 HIF1A, EPAS1
## 4: hypoxia cisbp__M6212 3.49 0.0829 HIF1A, EPAS1 **
## 5: hypoxia cisbp__M2332 3.48 0.0828 HIF1A, EPAS1
## 6: hypoxia cisbp__M0387 3.41 0.0817 HIF1A, EPAS1
## TF_highConf
## <char>
## 1: HIF1A (directAnnotation).
## 2:
## 3: HMGA1 (directAnnotation).
## 4: EPAS1 (directAnnotation).
## 5:
## 6:
The cathegories considered high/low confidence can me modified with
the arguments motifAnnot_highConfCat
and
motifAnnot_lowConfCat
.
Since RcisTarget searches for enrichment of a motif within a gene list, finding a motif ‘enriched’ does not imply that all the genes in the gene-list have a high score for the motif. In this way, the third step of the workflow is to identify which genes (of the gene-set) are highly ranked for each of the significant motifs.
There are two methods to identify these genes: (1) equivalent to the
ones used in iRegulon and i-cisTarget (method="iCisTarget"
,
recommended if running time is not an issue), and (2) a faster
implementation based on an approximate distribution using the average at
each rank (method="aprox"
, useful to scan multiple gene
sets).
IMPORTANT: Make sure that the motifRankings are the same as in Step 1.
motifEnrichmentTable_wGenes <- addSignificantGenes(motifEnrichmentTable,
rankings=motifRankings,
geneSets=geneLists)
dim(motifEnrichmentTable_wGenes)
## [1] 17 11
Plot for a few motifs:
The final output of RcisTarget is a data.table
containing the information about the motif enrichment and its annotation
organized in the following fields:
Note that the TFs are provided based on the motif annotation. They can be used as a guide to select relevant motifs or to prioritize some TFs, but the motif annotation does not imply that all the TFs appearing in the table regulate the gene list.
anotatedTfs <- lapply(split(motifEnrichmentTable_wGenes$TF_highConf,
motifEnrichmentTable$geneSet),
function(x) {
genes <- gsub(" \\(.*\\). ", "; ", x, fixed=FALSE)
genesSplit <- unique(unlist(strsplit(genes, "; ")))
return(genesSplit)
})
anotatedTfs$hypoxia
## [1] "HIF1A" "HMGA1" "EPAS1" "FOXJ1" "FOXJ2" "FOXJ3" "FOXP1" "FOXP2" "FOXP3"
## [10] "FOXP4" "FOXG1"
signifMotifNames <- motifEnrichmentTable$motif[1:3]
incidenceMatrix <- getSignificantGenes(geneLists$hypoxia,
motifRankings,
signifRankingNames=signifMotifNames,
plotCurve=TRUE, maxRank=5000,
genesFormat="incidMatrix",
method="aprox")$incidMatrix
library(reshape2)
edges <- melt(incidenceMatrix)
edges <- edges[which(edges[,3]==1),1:2]
colnames(edges) <- c("from","to")
Output not shown:
library(visNetwork)
motifs <- unique(as.character(edges[,1]))
genes <- unique(as.character(edges[,2]))
nodes <- data.frame(id=c(motifs, genes),
label=c(motifs, genes),
title=c(motifs, genes), # tooltip
shape=c(rep("diamond", length(motifs)), rep("elypse", length(genes))),
color=c(rep("purple", length(motifs)), rep("skyblue", length(genes))))
visNetwork(nodes, edges) %>% visOptions(highlightNearest = TRUE,
nodesIdSelection = TRUE)
This is the output of sessionInfo()
on the system on
which this document was compiled:
## [1] "Sat Nov 30 03:50:28 2024"
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.16.2
## [2] DT_0.33
## [3] RcisTarget.hg19.motifDBs.cisbpOnly.500bp_1.26.0
## [4] RcisTarget_1.27.0
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.37.0 KEGGREST_1.47.0
## [3] xfun_0.49 bslib_0.8.0
## [5] htmlwidgets_1.6.4 Biobase_2.67.0
## [7] lattice_0.22-6 crosstalk_1.2.1
## [9] vctrs_0.6.5 tools_4.4.2
## [11] generics_0.1.3 stats4_4.4.2
## [13] tibble_3.2.1 fansi_1.0.6
## [15] AnnotationDbi_1.69.0 RSQLite_2.3.8
## [17] blob_1.2.4 R.oo_1.27.0
## [19] pkgconfig_2.0.3 Matrix_1.7-1
## [21] sparseMatrixStats_1.19.0 S4Vectors_0.45.2
## [23] assertthat_0.2.1 graph_1.85.0
## [25] lifecycle_1.0.4 GenomeInfoDbData_1.2.13
## [27] compiler_4.4.2 Biostrings_2.75.1
## [29] codetools_0.2-20 GenomeInfoDb_1.43.2
## [31] htmltools_0.5.8.1 sys_3.4.3
## [33] buildtools_1.0.0 sass_0.4.9
## [35] yaml_2.3.10 pillar_1.9.0
## [37] crayon_1.5.3 jquerylib_0.1.4
## [39] R.utils_2.12.3 cachem_1.1.0
## [41] DelayedArray_0.33.2 abind_1.4-8
## [43] tidyselect_1.2.1 digest_0.6.37
## [45] purrr_1.0.2 dplyr_1.1.4
## [47] arrow_17.0.0.1 maketools_1.3.1
## [49] AUCell_1.29.0 fastmap_1.2.0
## [51] grid_4.4.2 SparseArray_1.7.2
## [53] cli_3.6.3 magrittr_2.0.3
## [55] S4Arrays_1.7.1 XML_3.99-0.17
## [57] utf8_1.2.4 GSEABase_1.69.0
## [59] DelayedMatrixStats_1.29.0 UCSC.utils_1.3.0
## [61] bit64_4.5.2 rmarkdown_2.29
## [63] XVector_0.47.0 httr_1.4.7
## [65] matrixStats_1.4.1 bit_4.5.0
## [67] zoo_1.8-12 R.methodsS3_1.8.2
## [69] png_0.1-8 memoise_2.0.1
## [71] evaluate_1.0.1 knitr_1.49
## [73] GenomicRanges_1.59.1 IRanges_2.41.1
## [75] rlang_1.1.4 Rcpp_1.0.13-1
## [77] xtable_1.8-4 glue_1.8.0
## [79] DBI_1.2.3 BiocGenerics_0.53.3
## [81] annotate_1.85.0 jsonlite_1.8.9
## [83] R6_2.5.1 MatrixGenerics_1.19.0
## [85] zlibbioc_1.52.0