biodbNcbi is a biodb extension package that implements connectors to the following NCBI databases (Sayers et al. 2022): Gene, CCDS (Pruitt et al. 2009; Harte et al. 2012; Farrell et al. 2013), PubChem Comp and PubChem Subst (Kim et al. 2015).
Install using Bioconductor:
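A minimal sketch of the usual Bioconductor installation pattern, assuming the package is distributed under the name biodbNcbi and that BiocManager may not yet be installed:

```r
# Install BiocManager from CRAN if it is not available yet.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install biodbNcbi (and its dependencies, including biodb) from Bioconductor.
BiocManager::install("biodbNcbi")
```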
The first step in using biodbNcbi is to create an instance of the biodb class BiodbMain from the main biodb package. This is done by calling the constructor of the class:
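As a sketch, the instance can be created with the `biodb::newInst()` helper. The variable name `mybiodb` is an assumption chosen for illustration:

```r
# Create the biodb instance; `mybiodb` is an illustrative variable name.
mybiodb <- biodb::newInst()
```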
During this step the configuration is set up, the cache system is initialized and extension packages are loaded.
We will see at the end of this vignette that the biodb instance needs to be terminated with a call to the terminate() method.
In biodb the connection to a database is handled by a connector instance that you can get from the factory. biodbNcbi implements a connector to a remote database. Here is the code to instantiate a connector:
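A sketch of the instantiation, assuming `mybiodb` is the biodb instance returned by `biodb::newInst()`. The connector identifier `ncbi.gene` is the one reported in this vignette's log output, and the variable name `gene` matches the one used in the efetch example:

```r
# Ask the factory for a connector to the NCBI Gene database.
# `mybiodb` is assumed to be an existing biodb instance.
gene <- mybiodb$getFactory()$createConn('ncbi.gene')
```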
## Loading required package: biodbNcbi
Creating the other connectors follows the same process:
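A sketch for the three remaining databases, assuming `mybiodb` is an existing biodb instance. The connector identifiers are those listed in the log output at the end of this vignette, and the variable names are illustrative:

```r
# One connector per remaining NCBI database.
ccds <- mybiodb$getFactory()$createConn('ncbi.ccds')
pubchem.comp <- mybiodb$getFactory()$createConn('ncbi.pubchem.comp')
pubchem.subst <- mybiodb$getFactory()$createConn('ncbi.pubchem.subst')
```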
To get the number of entries stored inside the database, run:
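A minimal sketch, assuming `gene` is a connector to ncbi.gene. The call queries the remote server, so the number returned changes over time:

```r
# Number of entries currently available in NCBI Gene.
nb <- gene$getNbEntries()
nb
```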
## INFO [06:25:47.328] Create cache folder "/github/home/.cache/R/biodb/ncbi.gene-b6d7417e507eb4f1a2e0047bde5295e8" for "ncbi.gene-b6d7417e507eb4f1a2e0047bde5295e8".
## [1] 82951329
To get some of the first entry IDs (accession numbers) from the database, run:
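A sketch assuming `gene` is a connector to ncbi.gene. The argument limits the number of IDs returned; the value 2 is chosen here only because two IDs are shown in the output:

```r
# Retrieve at most two entry IDs from the database.
ids <- gene$getEntryIds(2)
ids
```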
## [1] "138918163" "138916285"
To retrieve entries, use:
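A sketch assuming `gene` is a connector to ncbi.gene and `ids` is a character vector of accession numbers, such as the one returned by `getEntryIds()`:

```r
# Download (or read from the cache) the entries matching the IDs.
# The result is a list of entry objects, one per ID.
entries <- gene$getEntry(ids)
entries
```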
## INFO [06:25:49.600] Get entry content(s) for 2 id(s)...
## [[1]]
## Biodb NCBI Gene entry instance 138918163.
##
## [[2]]
## Biodb NCBI Gene entry instance 138916285.
To convert a list of entries into a data frame, run:
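A sketch assuming `mybiodb` is the biodb instance and `entries` is a list of entry objects returned by `getEntry()`. The variable name `x` is illustrative:

```r
# Build a data frame with one row per entry and one column per field.
x <- mybiodb$entriesToDataframe(entries)
x
```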
##   accession                                 description  gene.symbol
## 1 138918163                       PRR23 family member E       PRR23E
## 2 138916285 V-type proton ATPase subunit E 1 pseudogene LOC138916285
##                    name ncbi.gene.id
## 1 C16H3orf56;CUNH3orf56    138918163
## 2                  <NA>    138916285
The efetch web service is accessible through the wsEfetch() method, available on Entrez connectors: ncbi.gene, ncbi.pubchem.comp and ncbi.pubchem.subst.
Get a Gene entry as an XML object and print the Entrezgene_prot node:
entryxml <- gene$wsEfetch('2833', retmode='xml', retfmt='parsed')
XML::getNodeSet(entryxml, "//Entrezgene_prot")
## [[1]]
## <Entrezgene_prot>
## <Prot-ref>
## <Prot-ref_name>
## <Prot-ref_name_E>G protein-coupled receptor 9</Prot-ref_name_E>
## <Prot-ref_name_E>IP-10 receptor</Prot-ref_name_E>
## <Prot-ref_name_E>Mig receptor</Prot-ref_name_E>
## <Prot-ref_name_E>chemokine (C-X-C motif) receptor 3</Prot-ref_name_E>
## <Prot-ref_name_E>chemokine receptor 3</Prot-ref_name_E>
## <Prot-ref_name_E>interferon-inducible protein 10 receptor</Prot-ref_name_E>
## </Prot-ref_name>
## <Prot-ref_desc>C-X-C chemokine receptor type 3</Prot-ref_desc>
## </Prot-ref>
## </Entrezgene_prot>
##
## attr(,"class")
## [1] "XMLNodeSet"
The object returned by wsEfetch() with retfmt='parsed' is an XML::XMLInternalDocument.
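Such a document can be queried further with XPath. The sketch below does not contact NCBI: it parses a shortened copy of the Entrezgene_prot node printed above (the variable names are illustrative) and extracts the text of some of its children:

```r
# Shortened copy of the XML fragment shown above, parsed locally.
xml_text <- paste0(
    "<Entrezgene_prot><Prot-ref>",
    "<Prot-ref_name>",
    "<Prot-ref_name_E>IP-10 receptor</Prot-ref_name_E>",
    "<Prot-ref_name_E>Mig receptor</Prot-ref_name_E>",
    "</Prot-ref_name>",
    "<Prot-ref_desc>C-X-C chemokine receptor type 3</Prot-ref_desc>",
    "</Prot-ref></Entrezgene_prot>")
doc <- XML::xmlParse(xml_text, asText = TRUE)

# Extract the text content of the nodes matched by each XPath expression.
prot_names <- XML::xpathSApply(doc, "//Prot-ref_name_E", XML::xmlValue)
prot_desc <- XML::xpathSApply(doc, "//Prot-ref_desc", XML::xmlValue)
```

The same `XML::xpathSApply()` calls can be applied to the document returned by wsEfetch().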
The esearch web service is accessible through the wsEsearch() method, available on Entrez connectors: ncbi.gene, ncbi.pubchem.comp and ncbi.pubchem.subst.
Search for Gene entries by name and get the IDs of the matching entries (equivalent to running gene$searchForEntries()):
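A sketch assuming `gene` is a connector to ncbi.gene. The search term, the `[Gene Name]` Entrez field qualifier and the `retmax` value are assumptions chosen for illustration, and may differ from the query that produced the output shown here:

```r
# Query the Entrez esearch service and return only the matching IDs.
# The term and retmax value are illustrative assumptions.
ids <- gene$wsEsearch(term = '"chemokine"[Gene Name]', retmax = 10,
                      retfmt = 'ids')
ids
```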
## [1] "395552" "417536" "128014773" "108261914" "128599176"
The same result can be obtained with a call to searchForEntries():
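A sketch of the generic biodb search, assuming `gene` is a connector to ncbi.gene. The searched name and the `max.results` value are assumptions chosen for illustration:

```r
# Search on the entry name field, independently of the underlying web service.
# The name and max.results value are illustrative assumptions.
ids <- gene$searchForEntries(list(name = 'chemokine'), max.results = 10)
ids
```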
## [1] "395552" "417536" "128014773" "108261914" "128599176"
The einfo web service is accessible through the wsEinfo() method, available on Entrez connectors: ncbi.gene, ncbi.pubchem.comp and ncbi.pubchem.subst.
Get PubChem Comp database information as an XML object and print information on the first field:
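A sketch assuming `pubchem.comp` is a connector created with `createConn('ncbi.pubchem.comp')`. The variable name `infoxml` and the XPath expression are illustrative choices:

```r
# Retrieve the database description as a parsed XML document.
infoxml <- pubchem.comp$wsEinfo(retfmt = 'parsed')

# Select the first Field element of the field list.
XML::getNodeSet(infoxml, "//Field[1]")
```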
## INFO [06:25:53.135] Create cache folder "/github/home/.cache/R/biodb/ncbi.pubchem.comp-e80afa72f6aa7425ca0c72f99f7b9d75" for "ncbi.pubchem.comp-e80afa72f6aa7425ca0c72f99f7b9d75".
## [[1]]
## <Field>
## <Name>ALL</Name>
## <FullName>All Fields</FullName>
## <Description>All terms from all searchable fields</Description>
## <TermCount>1155315532</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>N</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>N</IsHidden>
## <IsTruncatable>Y</IsTruncatable>
## <IsRangable>N</IsRangable>
## </Field>
##
## attr(,"class")
## [1] "XMLNodeSet"
When you are done with your biodb instance, you must terminate it to ensure that resources (file handles, database connections, etc.) are released:
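A minimal sketch, assuming `mybiodb` is the biodb instance created at the start of the session:

```r
# Close the instance; this also deletes all connectors created from it.
mybiodb$terminate()
```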
## INFO [06:25:53.936] Closing BiodbMain instance...
## INFO [06:25:53.937] Connector "ncbi.gene" deleted.
## INFO [06:25:53.943] Connector "ncbi.ccds" deleted.
## INFO [06:25:53.943] Connector "ncbi.pubchem.comp" deleted.
## INFO [06:25:53.944] Connector "ncbi.pubchem.subst" deleted.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] biodbNcbi_1.11.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 sass_0.4.9 generics_0.1.3
## [4] bitops_1.0-9 stringi_1.8.4 RSQLite_2.3.9
## [7] hms_1.1.3 digest_0.6.37 magrittr_2.0.3
## [10] evaluate_1.0.1 fastmap_1.2.0 blob_1.2.4
## [13] plyr_1.8.9 jsonlite_1.8.9 progress_1.2.3
## [16] DBI_1.2.3 BiocManager_1.30.25 httr_1.4.7
## [19] XML_3.99-0.17 jquerylib_0.1.4 cli_3.6.3
## [22] rlang_1.1.4 chk_0.9.2 crayon_1.5.3
## [25] dbplyr_2.5.0 bit64_4.5.2 withr_3.0.2
## [28] cachem_1.1.0 yaml_2.3.10 tools_4.4.2
## [31] memoise_2.0.1 biodb_1.15.0 dplyr_1.1.4
## [34] filelock_1.0.3 curl_6.0.1 buildtools_1.0.0
## [37] vctrs_0.6.5 R6_2.5.1 BiocFileCache_2.15.0
## [40] lifecycle_1.0.4 stringr_1.5.1 bit_4.5.0.1
## [43] pkgconfig_2.0.3 pillar_1.10.0 bslib_0.8.0
## [46] glue_1.8.0 Rcpp_1.0.13-1 lgr_0.4.4
## [49] xfun_0.49 tibble_3.2.1 tidyselect_1.2.1
## [52] sys_3.4.3 knitr_1.49 htmltools_0.5.8.1
## [55] rmarkdown_2.29 maketools_1.3.1 compiler_4.4.2
## [58] prettyunits_1.2.0 askpass_1.2.1 RCurl_1.98-1.16
## [61] openssl_2.3.0