An introduction to biodbNcbi


biodbNcbi is a biodb extension package that implements connectors to the following NCBI databases (Sayers et al. 2022): Gene, CCDS (Pruitt et al. 2009; Harte et al. 2012; Farrell et al. 2013), PubChem Comp and PubChem Subst (Kim et al. 2015).


Install using Bioconductor:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install('biodbNcbi')


The first step in using biodbNcbi is to create an instance of the biodb class Biodb from the main biodb package. This is done by calling the constructor of the class:

mybiodb <- biodb::newInst()

During this step the configuration is set up, the cache system is initialized and extension packages are loaded.

We will see at the end of this vignette that the biodb instance needs to be terminated with a call to the terminate() method.
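
Because the instance must always be terminated, it can help to tie termination to the scope in which the instance is used. The following is a minimal sketch, not part of biodb: withBiodb and its newInst argument are hypothetical names introduced here, and the only biodb calls assumed are biodb::newInst() and terminate(), both mentioned above.

```r
# Hypothetical helper (not provided by biodb): creates an instance, passes it
# to `fun`, and terminates the instance when the helper exits, whether `fun`
# returned normally or raised an error.
# `newInst` is an assumption added for illustration; it defaults to the
# constructor shown above and can be swapped for a stand-in when testing.
withBiodb <- function(fun, newInst = function() biodb::newInst()) {
    mybiodb <- newInst()
    # on.exit() runs when withBiodb() exits, including after an error.
    on.exit(mybiodb$terminate(), add = TRUE)
    fun(mybiodb)
}
```

A call such as withBiodb(function(b) b$getFactory()$createConn('ncbi.gene')$getEntryIds(2)) would then return the IDs and terminate the instance in one step.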

Creating a connector to Gene

In biodb the connection to a database is handled by a connector instance that you can get from the factory. biodbNcbi implements a connector to a remote database. Here is the code to instantiate a connector:

gene <- mybiodb$getFactory()$createConn('ncbi.gene')
## Loading required package: biodbNcbi

Creating other connectors follows the same process:

ccds <- mybiodb$getFactory()$createConn('ncbi.ccds')
pubchem.comp <- mybiodb$getFactory()$createConn('ncbi.pubchem.comp')
pubchem.subst <- mybiodb$getFactory()$createConn('ncbi.pubchem.subst')
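
Since every connector is created the same way, the calls above can also be written as a loop over connector IDs. This is only a sketch: createConns is a hypothetical helper, not a biodb function, and it assumes the factory accepts each ID exactly as in the calls above.

```r
# Hypothetical helper: creates one connector per ID and returns them in a
# list named by connector ID.
createConns <- function(factory, connIds) {
    conns <- lapply(connIds, function(id) factory$createConn(id))
    names(conns) <- connIds
    conns
}
```

With a live instance, createConns(mybiodb$getFactory(), c('ncbi.ccds', 'ncbi.pubchem.comp', 'ncbi.pubchem.subst')) would return a list from which each connector can be picked by name, for example conns[['ncbi.ccds']].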

Accessing entries

To get the number of entries stored inside the database, run:

gene$getNbEntries()
## [1] 84064753

To get some of the first entry IDs (accession numbers) from the database, run:

ids <- gene$getEntryIds(2)
ids
## [1] "14910" "7157"

To retrieve entries, use:

entries <- gene$getEntry(ids)
entries
## [[1]]
## Biodb NCBI Gene entry instance 14910.
## [[2]]
## Biodb NCBI Gene entry instance 7157.

To convert a list of entries into a dataframe, run:

x <- mybiodb$entriesToDataframe(entries)
x
##   accession                         description aa.seq.location   gene.symbol
## 1     14910 gene trap ROSA 26, Philippe Soriano      6 52.73 cM Gt(ROSA)26Sor
## 2      7157                   tumor protein p53         17p13.1          TP53
##                                      name
## 1 R26;ROSA26;Gtrgeo26;Gtrosa26;Thumpd3as1        14910
## 2               P53;BCC7;LFS1;BMFS5;TRP53         7157
## 1 <NA>
## 2 P04637;Q15086;Q15087;Q15088;Q16535;Q16807;Q16808;Q16809;Q16810;Q16811;Q16848;Q2XN98;Q3LRW1;Q3LRW2;Q3LRW3;Q3LRW4;Q3LRW5;Q86UG1;Q8J016;Q99659;Q9BTM4;Q9HAQ8;Q9NP68;Q9NPJ2;Q9NZD0;Q9UBI2;Q9UQ61
## 1         <NA>
## 2  CCDS11118.1
##   aa.seq
## 1   <NA>
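
In the output above, multi-valued fields such as name appear as a single string with values separated by semicolons. A minimal sketch of splitting them with base R follows. To run without network access, it rebuilds by hand three columns of the data frame from the values printed above; the variable name synonyms is illustrative.

```r
# Hand-built copy of three columns from the output above.
x <- data.frame(
    accession = c("14910", "7157"),
    gene.symbol = c("Gt(ROSA)26Sor", "TP53"),
    name = c("R26;ROSA26;Gtrgeo26;Gtrosa26;Thumpd3as1",
             "P53;BCC7;LFS1;BMFS5;TRP53"),
    stringsAsFactors = FALSE
)

# Split each semicolon-separated string into a character vector.
synonyms <- strsplit(x$name, ";", fixed = TRUE)
# Index the result by gene symbol for easier lookup.
names(synonyms) <- x$gene.symbol

synonyms[["TP53"]]
```

The same strsplit() call should apply to any other column that uses the semicolon separator.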

Accessing efetch web service

The efetch web service is accessible through the wsEfetch() method, available on the Entrez connectors: ncbi.gene, ncbi.pubchem.comp and ncbi.pubchem.subst.

Get a Gene entry as an XML object and print the Entrezgene_prot node:

entryxml <- gene$wsEfetch('2833', retmode='xml', retfmt='parsed')
XML::getNodeSet(entryxml, "//Entrezgene_prot")
## [[1]]
## <Entrezgene_prot>
##   <Prot-ref>
##     <Prot-ref_name>
##       <Prot-ref_name_E>G protein-coupled receptor 9</Prot-ref_name_E>
##       <Prot-ref_name_E>IP-10 receptor</Prot-ref_name_E>
##       <Prot-ref_name_E>Mig receptor</Prot-ref_name_E>
##       <Prot-ref_name_E>chemokine (C-X-C motif) receptor 3</Prot-ref_name_E>
##       <Prot-ref_name_E>chemokine receptor 3</Prot-ref_name_E>
##       <Prot-ref_name_E>interferon-inducible protein 10 receptor</Prot-ref_name_E>
##     </Prot-ref_name>
##     <Prot-ref_desc>C-X-C chemokine receptor type 3</Prot-ref_desc>
##   </Prot-ref>
## </Entrezgene_prot> 
## attr(,"class")
## [1] "XMLNodeSet"

The object returned is an XML::XMLInternalDocument.
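
Once the document is parsed, individual values can be pulled out with XPath. The sketch below assumes only the XML package. To stay runnable offline, it parses a copy of the node printed above instead of calling wsEfetch(); the names xmltext, protNames and protDesc are illustrative.

```r
# Copy of the Entrezgene_prot node printed above, used as offline input.
xmltext <- '<Entrezgene_prot>
  <Prot-ref>
    <Prot-ref_name>
      <Prot-ref_name_E>G protein-coupled receptor 9</Prot-ref_name_E>
      <Prot-ref_name_E>IP-10 receptor</Prot-ref_name_E>
      <Prot-ref_name_E>Mig receptor</Prot-ref_name_E>
      <Prot-ref_name_E>chemokine (C-X-C motif) receptor 3</Prot-ref_name_E>
      <Prot-ref_name_E>chemokine receptor 3</Prot-ref_name_E>
      <Prot-ref_name_E>interferon-inducible protein 10 receptor</Prot-ref_name_E>
    </Prot-ref_name>
    <Prot-ref_desc>C-X-C chemokine receptor type 3</Prot-ref_desc>
  </Prot-ref>
</Entrezgene_prot>'

# Parse the string into an XML::XMLInternalDocument.
doc <- XML::xmlParse(xmltext, asText = TRUE)

# Extract the text of every matching node as a character vector.
protNames <- XML::xpathSApply(doc, "//Prot-ref_name_E", XML::xmlValue)
protDesc <- XML::xpathSApply(doc, "//Prot-ref_desc", XML::xmlValue)
```

The same XPath expressions should work on the entryxml object, since both are parsed XML documents.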

Accessing esearch web service

The esearch web service is accessible through the wsEsearch() method, available on the Entrez connectors: ncbi.gene, ncbi.pubchem.comp and ncbi.pubchem.subst.

Search for Gene entries by name and get the IDs of the matching entries (equivalent to calling gene$searchForEntries()):

gene$wsEsearch(term='"chemokine"[Gene Name]', retmax=10, retfmt='ids')
## [1] "395552"    "417536"    "128014773" "108261914" "128599176"

The same result can be obtained with a call to searchForEntries():

gene$searchForEntries(fields=list(name='chemokine'), max.results=10)
## [1] "395552"    "417536"    "128014773" "108261914" "128599176"
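
The search result is a vector of IDs, so it can be passed directly to the calls from the "Accessing entries" section. The following is a minimal sketch that assumes the methods behave as shown earlier; searchToDataframe is a hypothetical helper, not a biodb function.

```r
# Hypothetical helper: searches by name, fetches the matching entries and
# converts them into a data frame.
searchToDataframe <- function(mybiodb, conn, name, max.results = 10) {
    ids <- conn$searchForEntries(fields = list(name = name),
                                 max.results = max.results)
    # Nothing matched: return an empty data frame instead of fetching.
    if (length(ids) == 0)
        return(data.frame())
    entries <- conn$getEntry(ids)
    # Defensive guard (an assumption): wrap a bare entry in a list in case
    # getEntry() does not return a list for a single ID.
    if (!is.list(entries))
        entries <- list(entries)
    mybiodb$entriesToDataframe(entries)
}
```

With the objects created above, searchToDataframe(mybiodb, gene, 'chemokine') would chain the three steps in one call.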

Accessing einfo web service

The einfo web service is accessible through the wsEinfo() method, available on the Entrez connectors: ncbi.gene, ncbi.pubchem.comp and ncbi.pubchem.subst.

Get PubChem Comp database information as an XML object and print information on the first field:

infoxml <- pubchem.comp$wsEinfo(retfmt='parsed')
XML::getNodeSet(infoxml, "//Field[1]")
## [[1]]
## <Field>
##   <Name>ALL</Name>
##   <FullName>All Fields</FullName>
##   <Description>All terms from all searchable fields</Description>
##   <TermCount>1157821217</TermCount>
##   <IsDate>N</IsDate>
##   <IsNumerical>N</IsNumerical>
##   <SingleToken>N</SingleToken>
##   <Hierarchy>N</Hierarchy>
##   <IsHidden>N</IsHidden>
##   <IsTruncatable>Y</IsTruncatable>
##   <IsRangable>N</IsRangable>
## </Field> 
## attr(,"class")
## [1] "XMLNodeSet"
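
To work with field descriptions as a table rather than as XML nodes, the values can be collected into a data frame. This sketch relies only on the XML package and, to run offline, parses a copy of the Field node printed above rather than the full einfo response. The column names are illustrative, and the approach assumes every Field node carries each of the selected child elements.

```r
# Copy of the first Field node printed above, used as offline input.
xmltext <- '<Field>
  <Name>ALL</Name>
  <FullName>All Fields</FullName>
  <Description>All terms from all searchable fields</Description>
  <TermCount>1157821217</TermCount>
  <IsDate>N</IsDate>
  <IsNumerical>N</IsNumerical>
  <SingleToken>N</SingleToken>
  <Hierarchy>N</Hierarchy>
  <IsHidden>N</IsHidden>
  <IsTruncatable>Y</IsTruncatable>
  <IsRangable>N</IsRangable>
</Field>'

doc <- XML::xmlParse(xmltext, asText = TRUE)

# One row per Field node, one column per selected child element.
fields <- data.frame(
    name = XML::xpathSApply(doc, "//Field/Name", XML::xmlValue),
    fullName = XML::xpathSApply(doc, "//Field/FullName", XML::xmlValue),
    termCount = as.numeric(XML::xpathSApply(doc, "//Field/TermCount",
                                            XML::xmlValue)),
    stringsAsFactors = FALSE
)
```

Applied to infoxml, the same XPath expressions would be expected to yield one row per field of the database.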

Closing biodb instance

When you are done with your biodb instance, you must terminate it to release its resources (file handles, database connections, etc.):

mybiodb$terminate()

