Title: | Customize and Query Compound Annotation Database |
---|---|
Description: | This package serves as a query interface for important community collections of small molecules, while also allowing users to include custom compound collections. |
Authors: | Yuzhu Duan [aut, cre], Thomas Girke [aut] |
Maintainer: | Yuzhu Duan <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.17.0 |
Built: | 2024-11-19 03:42:38 UTC |
Source: | https://github.com/bioc/customCMPdb |
This package serves as the query and customization interface for compound annotations from the DrugAge, DrugBank, CMAP02 and LINCS databases. It also stores the structure SDF datasets for the compounds in these four databases.
Specifically, the annotation database created by this package is an SQLite database
containing 5 tables: 4 compound annotation tables from the DrugAge,
DrugBank, CMAP02 and LINCS databases, respectively, and an ID
mapping table from ChEMBL IDs to the IDs of the individual databases. The other 4 datasets
store the structures of the compounds in the DrugAge, DrugBank, CMAP02 and LINCS
databases as SDF files. For a detailed description of the 5 datasets generated
by this package, please consult the package vignette by running
browseVignettes("customCMPdb"). The actual datasets are hosted on AnnotationHub.
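The layout described above (annotation tables connected through a ChEMBL ID mapping table) can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the table name `id_mapping`, all column names, and all rows are hypothetical toy values, and only the annotation table names `drugAgeAnnot` and `lincsAnnot` are taken from this documentation.

```r
# In-memory stand-in for the annotation database; all rows are toy values
conn <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical ID mapping table: one row per ChEMBL ID, one column per source
id_mapping <- data.frame(
  chembl_id  = c("CHEMBL_A", "CHEMBL_B"),
  drugage_id = c("ida1", NA),
  lincs_id   = c("lincs1", "lincs2"),
  stringsAsFactors = FALSE)

# Toy annotation tables keyed by their own internal IDs (assumed columns)
drugAgeAnnot <- data.frame(drugage_id = "ida1", species = "toy species",
                           stringsAsFactors = FALSE)
lincsAnnot <- data.frame(lincs_id = c("lincs1", "lincs2"),
                         pert_iname = c("cmpA", "cmpB"),
                         stringsAsFactors = FALSE)

DBI::dbWriteTable(conn, "id_mapping", id_mapping)
DBI::dbWriteTable(conn, "drugAgeAnnot", drugAgeAnnot)
DBI::dbWriteTable(conn, "lincsAnnot", lincsAnnot)

# Join the annotation tables through the mapping table
res <- DBI::dbGetQuery(conn, "
  SELECT m.chembl_id, d.species, l.pert_iname
  FROM id_mapping m
  LEFT JOIN drugAgeAnnot d ON m.drugage_id = d.drugage_id
  LEFT JOIN lincsAnnot  l ON m.lincs_id  = l.lincs_id
  ORDER BY m.chembl_id")
DBI::dbDisconnect(conn)
```

A compound without an entry in one source (here the second toy compound has no DrugAge ID) simply gets missing values in the columns from that source.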
This package also provides functionality to customize and query the compound
annotation SQLite database. Users can add their own compound annotation
tables to the SQLite database and query both the default (DrugAge, DrugBank, CMAP02,
LINCS) and custom annotations by providing the ChEMBL IDs of the query compounds.
The customization and query functions are documented under customAnnot
and queryAnnotDB, respectively.
The description of the 5 datasets in this package is as follows.
Annotation SQLite database:
It is an SQLite database storing the compound annotation tables for DrugAge, DrugBank, CMAP02 and LINCS. It also contains an ID mapping table from ChEMBL IDs to the IDs of the individual databases.
DrugAge SDF:
It is an SDF (Structure-Data File) storing the molecular structures of
DrugAge compounds. The source DrugAge annotation file was downloaded from
http://genomics.senescence.info/drugs/dataset.zip. The extracted csv
file contains only drug names, without ID mappings to external resources
such as PubChem or ChEMBL. The extracted 'drugage.csv' file was further processed by the
processDrugage function in this package. The resulting DrugAge annotation table
and the ID mapping table (DrugAge internal ID to ChEMBL ID) were then
stored in the SQLite annotation database named 'compoundCollection'.
The drug structures were obtained from PubChem CIDs with the getIds
function from the ChemmineR package. The SDFset object was then
written to the drugage_build2.sdf file.
DrugBank SDF:
This SDF file stores the structures of compounds in the
DrugBank database. The full DrugBank xml
file was downloaded from https://www.drugbank.ca/releases/latest.
The most recent release at the time of writing was 5.1.5.
The extracted xml file was processed by the dbxml2df function in this package.
The resulting DrugBank annotation table was then stored in the compoundCollection
SQLite database. The DrugBank to ChEMBL ID mappings were obtained from
UniChem.
The DrugBank SDF file was downloaded from
https://www.drugbank.ca/releases/latest#structures.
Some validity checks and modifications were made with utilities from the
ChemmineR package. The results were written to the drugbank_5.1.5.sdf file.
CMAP SDF:
The CMAP compound instance table was downloaded from the
CMAP02 website and processed by the buildCMAPdb function
in this package. The resulting 'cmap.db' contains both compound annotation and
structure information.
Since the annotation table only contains PubChem CIDs, the ChEMBL IDs were added
via PubChem CID to ChEMBL ID mappings from UniChem.
Internal CMAP IDs were created for the ChEMBL ID to CMAP ID mappings. The
structures were written to the cmap02.sdf file.
LINCS SDF:
The LINCS compound annotation table was downloaded from
GEO, where only entries of the compound type were selected.
The LINCS IDs were mapped to ChEMBL IDs via InChIKey. The structures of the
LINCS compounds were obtained from PubChem CIDs with the getIds function from the
ChemmineR package. The structures were written to the lincs_pilot1.sdf file.
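The InChIKey-based mapping mentioned for the LINCS dataset can be illustrated with a plain table join. This is a minimal sketch with toy data, not the script used to build the package data: the column names and the placeholder keys and IDs are assumptions made for this example.

```r
# Toy LINCS annotation with structure keys (placeholder values, not real InChIKeys)
lincs <- data.frame(
  lincs_id  = c("L1", "L2", "L3"),
  inchi_key = c("KEY-A", "KEY-B", "KEY-C"),
  stringsAsFactors = FALSE)

# Toy ChEMBL lookup table keyed by the same structure key
chembl <- data.frame(
  chembl_id = c("CHEMBL_A", "CHEMBL_C"),
  inchi_key = c("KEY-A", "KEY-C"),
  stringsAsFactors = FALSE)

# Left join: every LINCS compound is kept, unmatched ones get NA as chembl_id
lincs_map <- merge(lincs, chembl, by = "inchi_key", all.x = TRUE)

# LINCS compounds that could not be mapped to ChEMBL
unmapped <- lincs_map$lincs_id[is.na(lincs_map$chembl_id)]
```

Compounds left in `unmapped` are the ones that can later be queried only through their own database ID.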
The R script for generating the above 5 datasets is available in the
'inst/scripts/make-data.R' file of this package. The file location can
be found by running system.file("scripts/make-data.R", package="customCMPdb")
in the user's R session, or the script can be viewed in the
GitHub repository of this package.
Yuzhu Duan ([email protected])
Thomas Girke ([email protected])

    library(AnnotationHub)
    ## Not run:
    ah <- AnnotationHub()

    ## Load compoundCollection annotation SQLite database
    query(ah, c("customCMPdb", "annot_0.1"))
    annot_path <- ah[["AH79563"]]
    library(RSQLite)
    conn <- dbConnect(SQLite(), annot_path)
    dbListTables(conn)
    drugAgeAnnot <- dbReadTable(conn, "drugAgeAnnot")
    head(drugAgeAnnot)
    dbDisconnect(conn)

    ## Load DrugAge SDF file
    query(ah, c("customCMPdb", "drugage_build2"))
    da_path <- ah[["AH79564"]]
    da_sdfset <- ChemmineR::read.SDFset(da_path)

    ## Load DrugBank SDF file
    query(ah, c("customCMPdb", "drugbank_5.1.5"))
    db_path <- ah[["AH79565"]]
    db_sdfset <- ChemmineR::read.SDFset(db_path)

    ## Load CMAP SDF file
    query(ah, c("customCMPdb", "cmap02"))
    cmap_path <- ah[["AH79566"]]
    cmap_sdfset <- ChemmineR::read.SDFset(cmap_path)

    ## Load LINCS SDF file
    query(ah, c("customCMPdb", "lincs_pilot1"))
    lincs_path <- ah[["AH79567"]]
    lincs_sdfset <- ChemmineR::read.SDFset(lincs_path)
    ## End(Not run)

These functions can be used to add or delete a user's custom compound annotations
in the annotation SQLite database.
A custom compound annotation table to be added should contain a column named
chembl_id that holds the ChEMBL IDs of the added compounds.
The listAnnot function lists the annotation
resources available in the SQLite annotation database.
The defaultAnnot function resets the annotation SQLite
database to the default one by deleting the existing one and re-downloading
it from AnnotationHub.
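Before adding a custom table, it can help to check the `chembl_id` requirement described above. The helper below is hypothetical (`check_annot_tb` is not part of customCMPdb) and only sketches one way to do such a check in base R; the toy table mirrors the shape used in the examples of this page.

```r
# Hypothetical helper, not a customCMPdb function: verifies that a custom
# annotation table has a 'chembl_id' column and reports rows without an ID
check_annot_tb <- function(annot_tb) {
  stopifnot(is.data.frame(annot_tb))
  if (!"chembl_id" %in% colnames(annot_tb)) {
    stop("annot_tb needs a column named 'chembl_id'")
  }
  n_missing <- sum(is.na(annot_tb$chembl_id))
  if (n_missing > 0) {
    message(n_missing, " row(s) have no ChEMBL ID")
  }
  invisible(n_missing)
}

annot_tb <- data.frame(
  compound_name = paste0("name", 1:3),
  chembl_id     = c("CHEMBL10", "CHEMBL100", NA),
  feature1      = paste0("f", 1:3),
  stringsAsFactors = FALSE)

n_missing <- check_annot_tb(annot_tb)
```

A table that passes the check could then be handed to addCustomAnnot(annot_tb, annot_name = ...), as in the examples on this page.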

    addCustomAnnot(annot_tb, id_col = NULL, annot_name, overwrite = FALSE)
    deleteAnnot(annot_name)
    listAnnot()
    defaultAnnot()

annot_tb | data.frame representing the custom annotation table. Note that it should contain a 'chembl_id' column holding the compound ChEMBL IDs |
id_col | column name in |
annot_name | character(1), user-defined name of the annotation table |
overwrite | a logical specifying whether to overwrite an existing table or not. Its default is FALSE. |
character vector of names of the annotation tables in the SQLite DB
character(1), path to the annotation SQLite database

    chembl_id <- c("CHEMBL1000309", "CHEMBL100014", "CHEMBL10",
                   "CHEMBL100", "CHEMBL1000", NA)
    annot_tb <- data.frame(compound_name=paste0("name", 1:6),
                           chembl_id=chembl_id,
                           feature1=paste0("f", 1:6),
                           feature2=rnorm(6))
    addCustomAnnot(annot_tb, annot_name="mycustom3")
    deleteAnnot("mycustom3")
    annot_names <- listAnnot()
    # defaultAnnot()

This function builds an SQLite database named 'cmap.db' that contains ID mappings of CMAP names to PubChem/DrugBank IDs as well as compound structure information.
buildCMAPdb(dest_dir = ".")
dest_dir | character(1), destination directory in which the resulting SQLite database 'cmap.db' is stored. The default is the user's current working directory. |
For about 2/3 of the CMAP drugs, one can obtain their PubChem/DrugBank IDs from the DMAP site: http://bio.informatics.iupui.edu/cmaps. Since this website is no longer supported, the processed CMAP name to PubChem and DrugBank ID mapping table is stored in the "inst/extdata" folder of this package as "dmap_unique.txt". The SMILES strings for CMAP entries were obtained from ChemBank. Compounds were matched by name using the 'stringdist' library, where each cmap_name from CMAP was mapped to the closest name in ChemBank.
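The closest-name matching described above can be sketched as follows. Note that this sketch swaps in base R's `utils::adist` (Levenshtein distance) instead of the 'stringdist' library so that it runs without extra packages; the name lists are toy values chosen for illustration, and the distance measure actually used by the package is not stated here.

```r
# Toy name lists; illustrative only, not taken from CMAP or ChemBank
cmap_names     <- c("metformin", "acetylsalicylic acid", "valproic acid")
chembank_names <- c("Metformin", "Acetylsalicylic Acid", "Valproic acid", "Caffeine")

# Edit-distance matrix on lower-cased names
# (rows: CMAP names, columns: ChemBank names)
d <- utils::adist(tolower(cmap_names), tolower(chembank_names))

# For each CMAP name, pick the ChemBank name with the smallest distance
closest <- apply(d, 1, which.min)
matches <- data.frame(
  cmap_name     = cmap_names,
  chembank_name = chembank_names[closest],
  distance      = d[cbind(seq_along(cmap_names), closest)],
  stringsAsFactors = FALSE)
```

Keeping the distance next to each match makes it easy to review weak matches by hand before trusting them.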
Writes the "cmap.db" SQLite database to the destination directory defined by the user.

    library(ChemmineR)
    ## Query database
    # buildCMAPdb(dest_dir="./inst/scripts")
    # conn <- initDb("./inst/scripts/cmap.db")
    # results <- getAllCompoundIds(conn)
    # sdfset <- getCompounds(conn, results, keepOrder=TRUE)
    # sdfset
    # as.data.frame(datablock2ma(datablock(sdfset)))[1:4,]
    # myfeat <- listFeatures(conn)
    # feat <- getCompoundFeatures(conn, results, myfeat)
    # feat[1:4,]

This function builds the DrugAge annotation SQLite database from the 'drugage_id_mapping' table stored in the 'inst/extdata' directory of this package. The 'drugage_id_mapping.tsv' table contains the DrugAge compound annotation information (such as species, avg_lifespan_change, etc.) as well as the compound name to ChEMBL ID and PubChem ID mappings.

    buildDrugAgeDB(
      da_path = system.file("extdata/drugage_id_mapping.tsv", package = "customCMPdb"),
      dest_path
    )

da_path | character(1), file path to the tabular file generated from |
dest_path | character(1), destination path of the resulting DrugAge annotation SQLite database |
Part of the ID mappings in the 'drugage_id_mapping.tsv' table was generated
by the processDrugage function for compound names that have ChEMBL
IDs in the ChEMBL database (version 24). The missing IDs were added
manually. A semi-manual approach was to use this web service:
https://cts.fiehnlab.ucdavis.edu/batch. After the semi-manual process,
the remaining entries were manually mapped to ChEMBL, PubChem and DrugBank IDs.
The mixed items were commented.
DrugAge annotation SQLite database
buildDrugAgeDB(dest_path=tempfile(fileext="_drugage.db"))
Download the original DrugBank database (xml file)
from http://www.drugbank.ca/releases/latest into your current
working directory, rename it to "drugbank.xml",
then run:
drugbank_df = dbxml2df(xmlfile="drugbank.xml", version="5.0.10")
dbxml2df(xmlfile, version)
xmlfile | Character(1), file path to the xml file downloaded from the DrugBank website at https://www.drugbank.ca/releases/latest |
version | Character(1), DrugBank version of the xml file |
Data frame of the DrugBank xml database.
This process will take about 20 minutes.
Yuzhu Duan [email protected]
http://www.drugbank.ca/releases/latest

    library(XML)
    ## Not run:
    ## download the original drugbank database (xml file) at
    ## http://www.drugbank.ca/releases/latest
    ## into your current directory and rename as drugbank.xml
    ## convert drugbank database (xml file) into dataframe:
    drugbank_df <- dbxml2df(xmlfile="drugbank.xml", version="5.0.10")
    ## End(Not run)

Stores a specific version of the DrugBank data frame in an SQLite database under a user-defined directory. The default is the present working directory of the user's R session.
df2SQLite(dbdf, version, dest_dir = ".")
dbdf | DrugBank data frame generated by |
version | Character(1), version of the input DrugBank data frame generated by |
dest_dir | Character(1), destination directory in which the resulting SQLite database is stored. The default is the user's current working directory |
SQLite database named "drugbank_<versionNumber>.db", stored in the present working directory of the user's R session or in the directory specified by the user.
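The general idea of storing a versioned data frame in an SQLite file named after its version can be sketched with DBI and RSQLite. This is not the implementation of df2SQLite: the table name `dbtable`, the placeholder version label, and the toy data frame are assumptions made for this example.

```r
# Toy stand-in for a DrugBank data frame (placeholder IDs and names)
dbdf <- data.frame(
  drugbank_id = c("DB_toy1", "DB_toy2"),
  name        = c("cmpA", "cmpB"),
  stringsAsFactors = FALSE)

version  <- "0.0.0"    # placeholder version label
dest_dir <- tempdir()  # write to a temporary directory for this sketch

# File name follows the "drugbank_<versionNumber>.db" pattern
db_path <- file.path(dest_dir, paste0("drugbank_", version, ".db"))

conn <- DBI::dbConnect(RSQLite::SQLite(), db_path)
# 'dbtable' is an assumed table name used only in this sketch
DBI::dbWriteTable(conn, "dbtable", dbdf, overwrite = TRUE)
stored <- DBI::dbReadTable(conn, "dbtable")
DBI::dbDisconnect(conn)
```

Encoding the version in the file name keeps several releases side by side in the same directory.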
Yuzhu Duan [email protected]

    library(RSQLite)
    ## Not run:
    # download the original drugbank database (http://www.drugbank.ca/releases/latest) (xml file)
    # to your current R working directory, and rename as "drugbank.xml".
    # Read in the xml file and convert to a data.frame in R
    drugbank_df = dbxml2df(xmlfile="drugbank.xml", version="5.1.5")
    # store the converted drugbank dataframe into SQLite database under user's
    # present R working directory, or other directory defined by 'dest_dir'
    df2SQLite(dbdf=drugbank_df, version="5.1.5") # set version as version of xml file
    ## End(Not run)

The compound annotation tables from different databases/sources are stored in one SQLite database. This function can be used to load the SQLite annotation database.
loadAnnot()
conn <- loadAnnot()
This function can be used to get the SDFset of compounds in the CMAP2, LINCS 2017,
DrugAge build 2 or DrugBank 5.1.5 databases. The cid values of the SDFset are
compound names rather than internal IDs.
loadSDFwithName(source = "LINCS")
source | character(1), one of "CMAP2", "LINCS", "DrugBank", "DrugAge" |
SDFset object of compounds in the source
database; the cid values of the SDFset are compound names.
da_sdf <- loadSDFwithName(source="DrugAge")
This function processes the source DrugAge dataset by adding ChEMBL, PubChem and DrugBank ID mapping information to the source DrugAge table, which only has compound names without ID mapping information. The DrugAge source file is available at: http://genomics.senescence.info/drugs/dataset.zip
processDrugage(dest_file = "drugage_id_mapping.tsv", verbose = TRUE)
dest_file | character(1), file path of the generated DrugAge annotation tabular file with ID mappings. By default, the file is written as "drugage_id_mapping.tsv" to the user's current working directory. |
verbose | logical(1), whether descriptive messages and a list of issues should be included as output |
This function only annotates compound names that have ChEMBL IDs in the ChEMBL database (version 24). The missing IDs were added manually. A semi-manual approach was to use this web service: https://cts.fiehnlab.ucdavis.edu/batch. After the semi-manual process, the remaining entries were manually mapped to ChEMBL, PubChem and DrugBank IDs. The mixed items were commented.
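To prepare the manual step described above, one could list the compounds that still lack a ChEMBL ID after automated processing. The following base R sketch uses a toy table written to a temporary file; the column names `compound_name` and `chembl_id` are assumptions, and the real 'drugage_id_mapping.tsv' layout may differ.

```r
# Toy mapping table; real files may use different column names
drugage <- data.frame(
  compound_name = c("cmpA", "cmpB", "cmpC"),
  chembl_id     = c("CHEMBL_A", NA, NA),
  stringsAsFactors = FALSE)

# Round trip through a TSV file, as the mapping table is a tabular text file
tsv <- tempfile(fileext = ".tsv")
write.table(drugage, tsv, sep = "\t", quote = FALSE,
            row.names = FALSE, na = "")

# Treat empty fields as missing IDs when reading the file back
da <- read.delim(tsv, stringsAsFactors = FALSE, na.strings = c("", "NA"))

# Compound names that still need an ID, e.g. for a batch lookup service
todo <- da$compound_name[is.na(da$chembl_id)]

# One name per line, ready to paste into a batch conversion form
todo_file <- tempfile(fileext = ".txt")
writeLines(todo, todo_file)
```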
Writes the default 'drugage_id_mapping.tsv' file to the user's current working
directory, or to the file path passed to the dest_file argument.

    library(ChemmineR)
    ## Not run:
    processDrugage(dest_file="drugage_id_mapping.tsv")
    # Now the missing IDs need to be added manually. A semi-manual approach is to
    # use this web service: https://cts.fiehnlab.ucdavis.edu/batch
    ## End(Not run)

This function can be used to query compound annotations from the default
resources as well as the custom resources stored in the SQLite annotation
database. The default annotation resources are DrugAge, DrugBank, CMAP02 and
LINCS. Custom compound annotations can be added or deleted with the
customAnnot utilities.

    queryAnnotDB(
      chembl_id,
      annot = c("drugAgeAnnot", "DrugBankAnnot", "cmapAnnot", "lincsAnnot")
    )

chembl_id | character vector of ChEMBL IDs or compound IDs from another annotation system. |
annot | character vector of annotation resources, such as |
The input of this query function is a set of ChEMBL IDs. It returns a
data.frame storing the annotations of the input compounds from the
annotation resources selected by the annot argument.
In the SQLite annotation database, identifiers from different ID systems, such as DrugBank and LINCS, are connected through ChEMBL IDs. It is therefore hard to tell whether two IDs, such as DB00341 and BRD-A42571354, refer to the same compound if either of them lacks an ID mapping to ChEMBL. For this reason, compounds without ChEMBL IDs can only be queried against the single database they belong to. For example, the compound with the LINCS ID "BRD-A00150179" has no ChEMBL ID mapping. When it is passed to the 'chembl_id' argument, 'annot' must be set to 'lincsAnnot' only, and the result will be the compound annotation table from the LINCS annotation.
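The reasoning above can be made concrete with a small base R sketch. It is not how queryAnnotDB is implemented: `query_sketch`, the toy tables, their column names, and all IDs are hypothetical. It only shows why an ID without a ChEMBL mapping cannot be connected to other resources and has to be looked up in its own table.

```r
# Toy mapping and annotation tables (all IDs are placeholders)
id_mapping <- data.frame(
  chembl_id   = c("CHEMBL_A", "CHEMBL_B"),
  drugbank_id = c("DB_1", NA),
  lincs_id    = c("BRD_1", "BRD_2"),
  stringsAsFactors = FALSE)
drugbank_annot <- data.frame(drugbank_id = "DB_1", name = "cmpA",
                             stringsAsFactors = FALSE)
lincs_annot <- data.frame(lincs_id = c("BRD_1", "BRD_2", "BRD_3"),
                          pert_iname = c("cmpA", "cmpB", "cmpC"),
                          stringsAsFactors = FALSE)

# Hypothetical query: join across resources only when every ID has a
# ChEMBL mapping; otherwise fall back to the LINCS table alone
query_sketch <- function(ids) {
  if (all(ids %in% id_mapping$chembl_id)) {
    res <- id_mapping[match(ids, id_mapping$chembl_id), ]
    res <- merge(res, drugbank_annot, by = "drugbank_id", all.x = TRUE)
    res <- merge(res, lincs_annot, by = "lincs_id", all.x = TRUE)
    return(res)
  }
  lincs_annot[lincs_annot$lincs_id %in% ids, ]
}

q1 <- query_sketch(c("CHEMBL_A", "CHEMBL_B"))  # joined across both resources
q2 <- query_sketch("BRD_3")                    # LINCS-only lookup
```

In this toy setup the ID `BRD_3` has no row in `id_mapping`, so nothing links it to a DrugBank entry, which mirrors the single-database restriction described above.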
data.frame of annotation result

    query_id <- c("CHEMBL1000309", "CHEMBL100014", "CHEMBL100109",
                  "CHEMBL100", "CHEMBL1000", "CHEMBL10")
    qres <- queryAnnotDB(query_id, annot=c("drugAgeAnnot", "lincsAnnot"))

    # Add a custom compound annotation table
    chembl_id <- c("CHEMBL1000309", "CHEMBL100014", "CHEMBL10",
                   "CHEMBL100", "CHEMBL1000", NA)
    annot_tb <- data.frame(compound_name=paste0("name", 1:6),
                           chembl_id=chembl_id,
                           feature1=paste0("f", 1:6),
                           feature2=rnorm(6))
    addCustomAnnot(annot_tb, annot_name="myCustom2")

    # query custom annotation
    qres2 <- queryAnnotDB(query_id, annot=c("lincsAnnot", "myCustom2"))

    # query compounds that don't have ChEMBL IDs
    query_id <- c("BRD-A00474148", "BRD-A00150179", "BRD-A00763758", "BRD-A00267231")
    qres3 <- queryAnnotDB(chembl_id=query_id, annot=c("lincsAnnot"))
    qres3