The drugTargetInteractions
package provides utilities
for identifying drug-target interactions for sets of small molecule
and/or gene/protein identifiers (Wu et al.
2006). The required drug-target interaction information is
obained from a downloaded SQLite instance of the ChEMBL database (Gaulton et al. 2012; Bento et al. 2014). ChEMBL
has been chosen for this purpose, because it provides one of the most
comprehensive and best annotatated knowledge resources for drug-target
information available in the public domain.
As Bioconductor package drugTargetInteraction
can be
installed with the BiocManager::install()
function.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("drugTargetInteractions")
Alternatively, the package can be installed from GitHub as follows.
To use drugTargetInteractions
, the package needs to be
loaded in a user’s R session.
The following commands are useful to list the available help files and open the the vignette of the package.
The drugTargetInteractions
package uses a downloaded
SQLite instance of the ChEMBL database. The following
code in this vignette uses a slimmed down toy version of this database
that is small enough to be included in this package for demonstrating
the usage of its functions. For real drug-target analysis work, it is
important that users download and uncompress a recent version of the
ChEMBL SQLite database from here,
and then replace the path assigned to chembldb
below with
the path to the full version of the ChEMBL database they have downloaded
to their system. Since the SQLite database from ChEMBL can be used by
this package as is, creating a copy of the ChEMBL SQLite database on
Bioconductor’s AnnotationHub
is not necessary at this point. This way users can always use the latest
or historical versions of ChEMBL without the need of maintaining a
mirror instance.
The following genConfig
function call creates a list
containing the paths to input and output directories used by the sample
code introduced in this vignette. For real analyses, users want to
customize these paths to match the environment on their system. Usually,
this means all paths generated by the system.file
function
need to be changed, as those are specific to work with the test data of
a package (e.g. toy ChEMBL SQLite instance).
chembldb <- system.file("extdata", "chembl_sample.db", package="drugTargetInteractions")
resultsPath <- system.file("extdata", "results", package="drugTargetInteractions")
config <- genConfig(chemblDbPath=chembldb, resultsPath=resultsPath)
In addition, a lookup table is downloaded on the fly from UniChem
(see web
page and corresponding ftp
site. This lookup table is used by
drugTargetInteractions
to translate compound identifiers
across different drug databases. Currently, this includes three types of
compound identifiers: DrugBank, PubChem, and ChEBI.
Users mainly interested in generating analysis results can skip the technical details in the following sections and continue with the section entitled Workflow to Run Everything.
The following returns for a set of query IDs (e.g. ENSEMBL
gene IDs) the corresponding UniProt IDs based on a stict ID matching as
well as a more relaxed sequence similarity-based approach. The latter
sequence similarity associations are obtained with the
getUniprotIDs
or the getParalogs
functions
using UniProt’s UNIREF cluster or BioMart’s paralog annotations,
respectively.
The UniProt.ws
package is used to to return for a set of
query IDs (here Ensembl gene IDs) the corresponding UniProt IDs based on
two independent approaches: ID mappings (IDMs) and sequence similarity
nearest neighbors (SSNNs) using UNIREF clusters. The latter have been
generated by UniProt with the MMSeqs2 and Linclust algorithms (Steinegger and Söding 2017; Steinegger and Soeding
2018). Additional details on the UNIREF clusters are available here. The organism, query
ID type and sequence similarity level can be selected under the
taxId
, kt
and seq_cluster
arguments, respectively. The seq_cluster
argument can be
assigned one of: UNIREF100
, UNIREF90
or
UNIREF50
. The result is a list with two
data.frames
. The first one is based on IDMs and the second
one on SSNNs.
keys <- c("ENSG00000145700", "ENSG00000135441", "ENSG00000120071")
res_list90 <- getUniprotIDs(taxId=9606, kt="ENSEMBL", keys=keys, seq_cluster="UNIREF90")
The following shows the first data.frame
containing the
ID mapping results.
library(DT)
datatable(res_list90[[1]], options = list(scrollX=TRUE, scrollY="600px", autoWidth = TRUE))
The following shows how to return the dimensions of the two
data.frames
and how to obtain the UniProt IDs as character
vectors required for the downstream analysis steps.
## $IDM
## [1] 18 8
##
## $SSNN
## [1] 22 8
## $IDM
## [1] "A0A669KAY2" "D6RJB7" "Q8N7Z5" "F8VP73" "F8W606" "G8JLQ3" "H0YHA4"
## [8] "P78537" "A0A0G2JQP8" "A0A1W2PPV8" "A0A1W2PQT4" "A0A1W2PRA9" "A0A1W2PRB5" "A0A1W2PRR3"
## [15] "A0A1W2PS83" "A0A3B3IT55" "I3L233" "Q7Z3B3"
##
## $SSNN
## [1] "A0A669KAY2" "D6RJB7" "Q8N7Z5" "A0A087WSV2" "F8VP73" "F8W606" "G8JLQ3"
## [8] "F8VNQ1" "H0YHA4" "P78537" "A0A1W2PPV8" "A0A1W2PQT4" "A0A1W2PRA9" "A0A1W2PRB5"
## [15] "A0A1W2PRR3" "A0A1W2PS83" "A0A3B3IT55" "A0A024R9Y2" "A0A0G2JNT7" "A0A0G2JQF4" "I3L233"
## [22] "Q7Z3B3"
The following obtains via biomaRt
for a set of query
genes the corresponding UniProt IDs as well as their paralogs. The query
genes can be Gene Names or ENSEMBL Gene IDs from H. sapiens.
The result is similar to IDMs and SSNNs from the
getUniprotIDs
function, but instead of UNIREF clusters,
biomaRt’s paralogs are used to obtain SSNNs.
queryBy <- list(molType="gene", idType="external_gene_name", ids=c("ANKRD31", "BLOC1S1", "KANSL1"))
queryBy <- list(molType="gene", idType="ensembl_gene_id", ids=c("ENSG00000145700", "ENSG00000135441", "ENSG00000120071"))
res_list <- getParalogs(queryBy)
The following shows the first data.frame
containing the
ID mapping results.
library(DT)
datatable(res_list[[1]], options = list(scrollX = TRUE, scrollY="400px", autoWidth = TRUE))
The following shows how to return the dimensions of the two
data.frames
and how to obtain the UniProt IDs as character
vectors required for the downstream analysis steps.
## $IDM
## [1] 30 5
##
## $SSNN
## [1] 33 7
## $IDM
## [1] "Q7Z3B3" "P78537" "Q8N7Z5"
##
## $SSNN
## [1] "Q7Z3B3" "A0AUZ9" "P78537" "Q8N7Z5" "Q5JPF3" "A6QL64" "Q8N2N9"
The drugTargetAnnot
function returns for a set of
compound or gene/protein IDs the corresponding known drug-target
annotation data available in ChEMBL. A related function called
getDrugTarget
is described in the subsequent subsection.
This method generates very similar results, but uses internally
pre-computed annotation summary tables which is less flexible than the
usage of pure SQL statements.
drugTargetAnnot
The drugTargetAnnot
function queries the ChEMBL database
with SQL statements without depending on pre-computed annotation
tables.
The following returns drug-target annotations for a set of query Ensembl gene IDs. For this the Ensembl gene IDs are translated into UniProt IDs using both the IDM and SSNN approaches.
keys <- c("ENSG00000120088", "ENSG00000135441", "ENSG00000120071")
res_list90 <- getUniprotIDs(taxId=9606, kt="ENSEMBL", keys=keys, seq_cluster="UNIREF90")
Next, drug-target annotations are returned for the Uniprot IDs with
the IDM or SSNN methods. The following example uses the UniProt IDs of
the SSNN method. Note, to include the upstream Ensembl gene IDs in the
final result table, the below Ensembl ID collapse step via a
tapply
is necessary since occasionally UniProt IDs are
assigned to several Ensembl gene IDs (e.g. recent gene
duplications).
queryBy <- list(molType="protein", idType="UniProt_ID", ids=id_list[[2]])
qresultSSNN <- drugTargetAnnot( queryBy, config=config)
ensidsSSNN <- tapply(res_list90[[2]]$ENSEMBL, res_list90[[2]]$ID, paste, collapse=", ")
qresultSSNN <- data.frame(Ensembl_IDs=ensidsSSNN[as.character(qresultSSNN$QueryIDs)], qresultSSNN)
getDrugTarget
The getDrugTarget
function generates similar results as
drugTargetAnnot
, but depends on a pre-computed query table
(here drugTargetAnnot.xls
).
id_mapping <- c(chembl="chembl_id", pubchem="PubChem_ID", uniprot="UniProt_ID", drugbank="DrugBank_ID")
queryBy <- list(molType="cmp", idType="pubchem", ids=c("2244", "65869", "2244"))
queryBy <- list(molType="protein", idType="uniprot", ids=c("P43166", "P00915", "P43166"))
queryBy <- list(molType="cmp", idType="drugbank", ids=c("DB00945", "DB01202"))
#qresult3 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17,39,46:53),config=config)
qresult3 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17),config=config)
queryBy <- list(molType="protein", idType="chembl", ids=c("CHEMBL25", "nomatch", "CHEMBL1742471"))
#qresult4 <- getDrugTarget(queryBy=queryBy, id_mapping, columns=c(1,5,8,16,17,39,46:52),config=config)
qresult4 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17),config=config)
The drugTargetBioactivity
function returns for a set of
compound or gene/protein IDs the corresponding bioassay data available
in ChEMBL.
Example query for compounds IDs.
Example query for protein IDs. Note, the Ensembl gene to UniProt ID
mappings are derived from above and stored in the named character vector
ensidsSSNN
.
This section explains how to run all of the above drug-target interaction analysis steps with a few convenience meta functions. Users mainly interested in generating analysis results quickly can focus on this section only.
The getSymEnsUp
function returns for a query of gene or
protein IDs a mapping table containing: ENSEMBL Gene IDs, Gene
Names/Symbols, UniProt IDs and ENSEMBL Protein IDs. Internally, the
function uses the ensembldb
package. Its results are
returned in a list where the first slot contains the ID mapping table,
while the subsequent slots include the corresponding named character
vectors: ens_gene_id
, up_ens_id
, and
up_gene_id
. Currently, the following query IDs are
supported by getSymEnsUp
: GENE_NAME
,
ENSEMBL_GENE_ID
and UNIPROT_ID
.
gene_name <- c("CA7", "CFTR")
idMap <- getSymEnsUp(EnsDb="EnsDb.Hsapiens.v86", ids=gene_name, idtype="GENE_NAME")
ens_gene_id <- idMap$ens_gene_id
ens_gene_id
## ENSG00000168748 ENSG00000001626
## "CA7" "CFTR"
The perfect match and nearest neighbor UniProt IDs can be obtained from UniProt’s UNIREF cluster or BioMart’s paralog annotations.
The corresponding IDM and SSNN UniProt IDs for the above ENSEMBL gene
IDs are obtained with the getUniprotIDs
function. This step
is slow since the queries have to be performed with
chunksize=1
in order to reliably track the query ENSEMBL
gene ID information in the results.
Here the corresponding perfect match (IDM) and paralog (SSNN) UniProt
IDs for the above ENSEMBL gene IDs are obtained with the
getParalogs
function. The latter is much faster than
getUniprotIDs
, and also covers wider evolutionary
distances. Thus, it may be the preferred methods for many use cases.
queryBy <- list(molType="gene", idType="ensembl_gene_id", ids=names(ens_gene_id))
res_list <- getParalogs(queryBy)
## IDM SSNN
## [1,] 13 127
## [2,] 5 7
Both drug-target annotation and bioassay data are obtained with a
meta function called runDrugTarget_Annot_Bioassay
that
internally uses the main processing functions
drugTargetAnnot
and drugTargetBioactivity
. It
organizes the result in a list with the annotation and bioassay data
(data.frames
) in the first and second slot, respectively.
Importantly, the results from the IDM and SSNN UniProt IDs are combined
in a single table, where duplicated rows have been removed. To track in
the result table, which method was used for obtaining UniProt IDs, an
IDM_Mapping_Type
column has been added. Note, the gene IDs
in the SSNN rows have the string Query_
prepended to
indicate that they are not necessarily the genes encoding the SSNN
UniProt proteins listed in the corresponding rows. Instead they are the
genes encoding the query proteins used for searching for SSNNs.
drug_target_list <- runDrugTarget_Annot_Bioassay(res_list=res_list, up_col_id="ID_up_sp", ens_gene_id, config=config)
sapply(drug_target_list, dim)
## Annotation Bioassay
## [1,] 55 35
## [2,] 18 17
View content of annotation result slot:
datatable(drug_target_list$Annotation, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
View content of bioassay result slot (restricted to first 500 rows):
The following generates a summary table containing drug-target frequency information.
df <- drug_target_list$Annotation
df[,"GeneName"] <- gsub("Query_", "", as.character(df$GeneName))
stats <- tapply(df$CHEMBL_CMP_ID, as.factor(df$GeneName), function(x) unique(x))
stats <- sapply(names(stats), function(x) stats[[x]][nchar(stats[[x]]) > 0])
stats <- sapply(names(stats), function(x) stats[[x]][!is.na(stats[[x]])])
statsDF <- data.frame(GeneNames=names(stats), Drugs=sapply(stats, paste, collapse=", "), N_Drugs=sapply(stats, length))
Print drug-target frequency table.
Both the annotation and bioassay data of the
drug_target_list
object can be exported to separate tabular
files as follows.
write.table(drug_target_list$Annotation, "DrugTargetAnnotation.xls", row.names=FALSE, quote=FALSE, na="", sep="\t")
write.table(drug_target_list$Bioassay, "DrugTargetBioassay.xls", row.names=FALSE, quote=FALSE, na="", sep="\t")
write.table(statDF, "statDF.xls", row.names=FALSE, quote=FALSE, na="", sep="\t")
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.31.0 AnnotationFilter_1.31.0
## [4] GenomicFeatures_1.59.0 AnnotationDbi_1.69.0 Biobase_2.67.0
## [7] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0 IRanges_2.41.0
## [10] S4Vectors_0.45.0 BiocGenerics_0.53.1 generics_0.1.3
## [13] DT_0.33 drugTargetInteractions_1.15.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4 blob_1.2.4
## [4] filelock_1.0.3 Biostrings_2.75.0 bitops_1.0-9
## [7] lazyeval_0.2.2 fastmap_1.2.0 RCurl_1.98-1.16
## [10] BiocFileCache_2.15.0 GenomicAlignments_1.43.0 XML_3.99-0.17
## [13] digest_0.6.37 lifecycle_1.0.4 ProtGenerics_1.39.0
## [16] KEGGREST_1.47.0 RSQLite_2.3.7 magrittr_2.0.3
## [19] compiler_4.4.1 rlang_1.1.4 sass_0.4.9
## [22] progress_1.2.3 tools_4.4.1 utf8_1.2.4
## [25] yaml_2.3.10 rtracklayer_1.67.0 knitr_1.48
## [28] htmlwidgets_1.6.4 S4Arrays_1.7.1 prettyunits_1.2.0
## [31] bit_4.5.0 curl_5.2.3 DelayedArray_0.33.1
## [34] xml2_1.3.6 abind_1.4-8 BiocParallel_1.41.0
## [37] withr_3.0.2 purrr_1.0.2 sys_3.4.3
## [40] grid_4.4.1 fansi_1.0.6 biomaRt_2.63.0
## [43] SummarizedExperiment_1.37.0 cli_3.6.3 rmarkdown_2.28
## [46] crayon_1.5.3 httr_1.4.7 rjson_0.2.23
## [49] BiocBaseUtils_1.9.0 DBI_1.2.3 cachem_1.1.0
## [52] stringr_1.5.1 zlibbioc_1.52.0 parallel_4.4.1
## [55] BiocManager_1.30.25 XVector_0.47.0 restfulr_0.0.15
## [58] matrixStats_1.4.1 vctrs_0.6.5 Matrix_1.7-1
## [61] jsonlite_1.8.9 hms_1.1.3 bit64_4.5.2
## [64] crosstalk_1.2.1 maketools_1.3.1 jquerylib_0.1.4
## [67] glue_1.8.0 codetools_0.2-20 stringi_1.8.4
## [70] BiocIO_1.17.0 UCSC.utils_1.3.0 tibble_3.2.1
## [73] pillar_1.9.0 rappdirs_0.3.3 htmltools_0.5.8.1
## [76] GenomeInfoDbData_1.2.13 R6_2.5.1 dbplyr_2.5.0
## [79] httr2_1.0.5 lattice_0.22-6 evaluate_1.0.1
## [82] png_0.1-8 Rsamtools_2.23.0 memoise_2.0.1
## [85] bslib_0.8.0 rjsoncons_1.3.1 SparseArray_1.7.0
## [88] xfun_0.48 MatrixGenerics_1.19.0 UniProt.ws_2.47.0
## [91] buildtools_1.0.0 pkgconfig_2.0.3