An introduction to biodbNci

Purpose

biodbNci is a biodb extension package that implements a connector to the National Cancer Institute (USA) CACTUS API (Institute 2022).

Installation

Install using Bioconductor:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install('biodbNci')

Initialization

The first step in using biodbNci is to create an instance of the Biodb class from the main biodb package. This is done by calling the class constructor:

mybiodb <- biodb::newInst()

During this step the configuration is set up, the cache system is initialized and extension packages are loaded.

We will see at the end of this vignette that the biodb instance needs to be terminated with a call to the terminate() method.

Creating a connector to CACTUS

In biodb the connection to a database is handled by a connector instance that you can get from the factory. biodbNci implements a connector to a remote database. Here is the code to instantiate a connector:

conn <- mybiodb$getFactory()$createConn('nci.cactus')
## Loading required package: biodbNci

For this vignette, we avoid downloading the full NCI CACTUS database and instead use an extract containing a few entries:

dbExtract <- system.file("extdata", 'generated', "cactus_extract.txt.gz",
    package="biodbNci")
conn$setPropValSlot('urls', 'db.gz.url', dbExtract)
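
Because system.file() returns an empty string when the file or the package cannot be found, it can be worth checking the path before handing it to the connector. A minimal sketch using base R only; the messages are illustrative:

```r
# Locate the extract shipped with biodbNci; system.file() returns ""
# when the package is not installed or the file is missing.
dbExtract <- system.file("extdata", "generated", "cactus_extract.txt.gz",
    package = "biodbNci")

if (nzchar(dbExtract)) {
    message("Using extract at ", dbExtract)
} else {
    message("Extract not found: is biodbNci installed?")
}
```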

Accessing entries

To get some of the first entry IDs (accession numbers) from the database, run:

ids <- conn$getEntryIds(2)
## INFO  [04:09:59.800] Create cache folder "/github/home/.cache/R/biodb/nci.cactus-8efc65070d01b214e760b1d4932ee427" for "nci.cactus-8efc65070d01b214e760b1d4932ee427".
## INFO  [04:09:59.801] Downloading whole database of nci.cactus.
## INFO  [04:09:59.801] Downloading NCI CACTUS database at "/tmp/RtmpuWeC8n/Rinst14f023827ccc/biodbNci/extdata/generated/cactus_extract.txt.gz" ...
## INFO  [04:09:59.805] Extract whole database of nci.cactus.
## INFO  [04:09:59.806] Extracting content of downloaded biodbNci, a library for connecting to the National Cancer Institute (USA) CACTUS Database....
ids
## [1] "749674" "750690"

To retrieve entries, use:

entries <- conn$getEntry(ids)
entries
## [[1]]
## Biodb NCI CACTUS entry instance 749674.
## 
## [[2]]
## Biodb NCI CACTUS entry instance 750690.
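
Each element of the returned list is an entry object whose fields can be read one at a time. A minimal sketch, assuming the getFieldValue() method of biodb entry objects and assuming that an entry that cannot be found is returned as NULL; fieldValues is a hypothetical helper name, not part of biodb:

```r
# Hypothetical helper (not part of biodb): read one field from each entry.
# Assumes every entry exposes getFieldValue(), and that a missing entry
# is represented by NULL in the list.
fieldValues <- function(entries, field) {
    lapply(entries, function(e) {
        if (is.null(e)) NA else e$getFieldValue(field)
    })
}
```

With the entries retrieved above, fieldValues(entries, 'formula') would give a list holding one formula per entry.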

To convert a list of entries into a data frame, run:

x <- mybiodb$entriesToDataframe(entries)
x
##   accession     formula molecular.mass
## 1    749674   C16H14N4O       278.3128
## 2    750690 C22H27FN4O2       398.4793
##                                                                                                                                                                     inchi
## 1                                                           InChI=1S/C16H14N4O/c1-11-15(20-19-12-7-3-2-4-8-12)16(21)18-14-10-6-5-9-13(14)17-11/h2-10,19H,1H3,(H,18,20,21)
## 2 InChI=1S/C22H27FN4O2/c1-5-27(6-2)10-9-24-22(29)20-13(3)19(25-14(20)4)12-17-16-11-15(23)7-8-18(16)26-21(17)28/h7-8,11-12,25H,5-6,9-10H2,1-4H3,(H,24,29)(H,26,28)/b17-12-
##                      inchikey nci.cactus.id      cas.id
## 1 RWIQZKLIGWLCEK-UHFFFAOYSA-N        749674        <NA>
## 2 WINHZLLDWRZWRT-ATVHPVEESA-N        750690 557795-19-4
##                                                                                                                                                    name
## 1                                                                                                                                                  <NA>
## 2 Sunitinib (free base);1H-Pyrrole-3-carboxamide, N-[2-(diethylamino)ethyl]-5-[(Z)-(5-fluoro-1,2-dihydro-2-oxo-3H-indol-3-ylidene)methyl]-2,4-dimethyl-
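
In the output above, the name column holds several names joined by a semicolon. The sketch below splits them back into one vector per entry with base R. It rebuilds a two-column stand-in for the data frame so that it runs on its own, and it assumes that no individual name contains a semicolon:

```r
# Stand-in for two columns of the data frame shown above.
x <- data.frame(
    accession = c("749674", "750690"),
    name = c(NA, paste0("Sunitinib (free base);1H-Pyrrole-3-carboxamide, ",
        "N-[2-(diethylamino)ethyl]-5-[(Z)-(5-fluoro-1,2-dihydro-2-oxo-",
        "3H-indol-3-ylidene)methyl]-2,4-dimethyl-")),
    stringsAsFactors = FALSE)

# Split on the literal ";" separator; entries without a name stay NA.
namesByEntry <- strsplit(x$name, ";", fixed = TRUE)
names(namesByEntry) <- x$accession
namesByEntry[["750690"]]
```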

Chemical Identifier Resolver web service

Here is an example of calling the Chemical Identifier Resolver to convert a SMILES string into an InChI:

conn$wsChemicalIdentifierResolver(structid='C=O', repr='InChI')
## [1] "InChI=1/CH2O/c1-2/h1H2"
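
The public resolver is addressed with URLs of the form base/identifier/representation. The sketch below only builds such a URL and sends no request. resolverUrl is a hypothetical helper; whether biodbNci encodes the identifier in the same way is an assumption, and the representation token is passed through unchanged without checking which tokens the server accepts:

```r
# Hypothetical helper (not part of biodbNci): build a resolver URL.
# The identifier is percent-encoded so that characters such as "=" or "#"
# in a SMILES string cannot be misread as URL syntax.
resolverUrl <- function(structid, repr,
    base = "https://cactus.nci.nih.gov/chemical/structure") {
    paste(base, utils::URLencode(structid, reserved = TRUE), repr, sep = "/")
}

resolverUrl("C=O", "InChI")
```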

Convert CAS IDs

The NCI CACTUS connector currently provides two methods for converting CAS IDs to InChI or InChIKey:

conn$convCasToInchi('87605-72-9')
## [1] "InChI=1/C25H30O5/c1-15(2)6-5-7-16(3)8-9-30-19-11-17-10-18-13-25(4,29)14-21(27)23(18)24(28)22(17)20(26)12-19/h6,8,10-12,26,28-29H,5,7,9,13-14H2,1-4H3/b16-8+"
conn$convCasToInchikey('87605-72-9')
## [1] "KZPCPZBBGCTGCN-LZYBPNLTNA-N"

The conversion is performed by the Chemical Identifier Resolver web service.
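
Since each conversion goes through a remote service, a malformed CAS ID can be rejected locally before any request is sent. A minimal sketch of the standard CAS check-digit rule in base R; isValidCas is a hypothetical helper, not part of biodbNci:

```r
# Hypothetical helper (not part of biodbNci): validate a CAS Registry Number.
# The last digit is a checksum: the other digits, read right to left, are
# weighted 1, 2, 3, ... and the sum modulo 10 must equal the check digit.
isValidCas <- function(cas) {
    if (!grepl("^[0-9]{2,7}-[0-9]{2}-[0-9]$", cas)) return(FALSE)
    digits <- as.integer(strsplit(gsub("-", "", cas, fixed = TRUE), "")[[1]])
    check <- digits[length(digits)]
    body <- rev(digits[-length(digits)])
    sum(body * seq_along(body)) %% 10 == check
}

isValidCas("87605-72-9")   # the CAS ID used above
isValidCas("87605-72-8")   # same ID with a wrong check digit
```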

Closing biodb instance

When you are done with your biodb instance, you must terminate it to release its resources (file handles, database connections, etc.):

mybiodb$terminate()
## INFO  [04:10:02.129] Closing BiodbMain instance...
## INFO  [04:10:02.135] Connector "nci.cactus" deleted.
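
To make sure terminate() runs even when an error interrupts the work, the call can be registered with on.exit() inside a function. A minimal sketch; withBiodb is a hypothetical helper, not part of biodb, and its newInst argument exists only so that the constructor can be swapped out:

```r
# Hypothetical helper (not part of biodb): run `f` with a fresh instance
# and terminate that instance when `f` returns or fails.
# The default constructor is only evaluated when no replacement is given.
withBiodb <- function(f, newInst = biodb::newInst) {
    inst <- newInst()
    on.exit(inst$terminate(), add = TRUE)
    f(inst)
}
```

A call such as withBiodb(function(b) b$getFactory()$createConn('nci.cactus')$getEntryIds(2)) would then release the instance on exit, whatever happens inside the function.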

Session information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] biodbNci_1.11.0  BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##  [1] rappdirs_0.3.3       sass_0.4.9           utf8_1.2.4          
##  [4] generics_0.1.3       bitops_1.0-9         stringi_1.8.4       
##  [7] RSQLite_2.3.8        hms_1.1.3            digest_0.6.37       
## [10] magrittr_2.0.3       evaluate_1.0.1       fastmap_1.2.0       
## [13] blob_1.2.4           plyr_1.8.9           jsonlite_1.8.9      
## [16] progress_1.2.3       DBI_1.2.3            BiocManager_1.30.25 
## [19] httr_1.4.7           fansi_1.0.6          XML_3.99-0.17       
## [22] jquerylib_0.1.4      cli_3.6.3            rlang_1.1.4         
## [25] chk_0.9.2            crayon_1.5.3         dbplyr_2.5.0        
## [28] bit64_4.5.2          withr_3.0.2          cachem_1.1.0        
## [31] yaml_2.3.10          tools_4.4.2          memoise_2.0.1       
## [34] biodb_1.15.0         dplyr_1.1.4          filelock_1.0.3      
## [37] curl_6.0.1           buildtools_1.0.0     vctrs_0.6.5         
## [40] R6_2.5.1             BiocFileCache_2.15.0 lifecycle_1.0.4     
## [43] stringr_1.5.1        bit_4.5.0            pkgconfig_2.0.3     
## [46] pillar_1.9.0         bslib_0.8.0          glue_1.8.0          
## [49] Rcpp_1.0.13-1        lgr_0.4.4            xfun_0.49           
## [52] tibble_3.2.1         tidyselect_1.2.1     sys_3.4.3           
## [55] knitr_1.49           htmltools_0.5.8.1    rmarkdown_2.29      
## [58] maketools_1.3.1      compiler_4.4.2       prettyunits_1.2.0   
## [61] askpass_1.2.1        RCurl_1.98-1.16      openssl_2.2.2

References

Institute, National Cancer. 2022. “CADD Group Chemoinformatics Tools and User Services (CACTUS).” https://cactus.nci.nih.gov/.