This document presents several aspects of biodb in more detail: the object-oriented programming model it adopts, its architecture, how to configure the package, how to create and delete connectors, how the request scheduler works, and how to tune the logging system.
In R, you may already know the two classical OOP models, S3 and S4. S4 is widely used, in particular in Bioconductor packages, and is based on an original system of generic methods that can be specialized for each class.
Two more recent OOP models exist in R: Reference Classes (aka RC or R5, from package methods) and R6 (from package R6). These two models are very similar. The biodb package uses a mix of RC and R6, but will switch completely to R6 in the near future.
Both implement an object model based on references. This means that each object is unique: the object itself is never copied, only its reference is. Any modification of an object in one part of the code is visible from all other parts of the code. In particular, when you pass an instance to a function, that function is able to modify the instance.
Also, in RC and R6, functions are attached to the object, so the mechanism to call a function on an object differs slightly from S4. In RC or R6 we write
myObject$myFunction()
instead of
myFunction(myObject)
in S4.
See the Reference Classes chapter of package methods and the introduction to R6 for more details.
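The difference between the two calling styles, and the consequence of reference semantics, can be illustrated with a small sketch that relies only on package methods. The `Counter` and `ValueCounter` classes, the `incremented()` generic and the `bump()` function are hypothetical names invented for this illustration; they are not part of biodb.

```r
library(methods)

# Hypothetical Reference Class: the method is attached to the object.
Counter <- setRefClass("Counter",
    fields = list(count = "numeric"),
    methods = list(
        increment = function() {
            count <<- count + 1
            invisible(.self)
        }
    )
)

# A function receiving an RC instance works on the original object.
bump <- function(obj) {
    obj$increment()
    invisible(NULL)
}

rcCounter <- Counter$new(count = 0)
bump(rcCounter)
rcCount <- rcCounter$count   # the caller sees the modification

# Hypothetical S4 class: the generic is called on the object, which is copied.
setClass("ValueCounter", representation(count = "numeric"))
setGeneric("incremented", function(obj) standardGeneric("incremented"))
setMethod("incremented", "ValueCounter", function(obj) {
    obj@count <- obj@count + 1
    obj                      # a modified copy is returned
})

s4Counter <- new("ValueCounter", count = 0)
s4Copy <- incremented(s4Counter)
s4Count <- s4Counter@count   # the original is left untouched
```

With the RC instance, the change made inside `bump()` is visible to the caller; with the S4 instance, only the returned copy carries the new value.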
biodb uses an initialization/termination scheme. You must first initialize the library by creating an instance of the main class BiodbMain:
## INFO [03:07:33.400] Loading definitions from package biodb version 1.15.0.
And when you are done with the library, you have to terminate the instance explicitly:
## INFO [03:07:33.472] Closing BiodbMain instance...
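One way to make sure that an instance is always terminated, even when an error occurs, is to wrap its use in a function and register the termination with `on.exit()`. The sketch below assumes that the instance is created with `biodb::newInst()` and closed with its `terminate()` method, as in recent biodb releases; `withBiodb()` is a hypothetical helper, not part of the package. The call is skipped when biodb is not installed.

```r
# Hypothetical helper: run `fun` with a fresh biodb instance and always
# terminate that instance afterwards.
withBiodb <- function(fun) {
    mybiodb <- biodb::newInst()               # assumed constructor function
    on.exit(mybiodb$terminate(), add = TRUE)  # assumed termination method
    fun(mybiodb)
}

if (requireNamespace("biodb", quietly = TRUE)) {
    # The instance only lives for the duration of this call.
    userAgent <- withBiodb(function(b) b$getConfig()$get("useragent"))
}
```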
We will need a biodb instance for the rest of this vignette. Let us call the constructor again:
## INFO [03:07:33.487] Loading definitions from package biodb version 1.15.0.
Several class instances are attached to the biodb instance for managing different aspects of biodb: creating connectors, configuring biodb, accessing the cache system, etc.
See the table below for a list of these instances and their purpose.
Class | Method to get the instance | Description |
---|---|---|
BiodbConfig | mybiodb$getConfig() | Access to configuration values, and modification. |
BiodbDbsInfo | mybiodb$getDbsInfo() | Databases information (name, description, request frequency, etc). |
BiodbEntryFields | mybiodb$getEntryFields() | Entry fields information (description, type, cardinality, etc). |
BiodbFactory | mybiodb$getFactory() | Creation of connectors and entries. |
BiodbPersistentCache | mybiodb$getPersistentCache() | Cache system on disk. |
BiodbRequestScheduler | mybiodb$getRequestScheduler() | Send requests to web servers, respecting the frequency limit for each database server. |
Several configuration values are defined inside the definitions.yml file of biodb. New configuration values can also be defined in extension packages.
To get a list of the existing configuration keys with their current values, run:
## Biodb configuration instance.
## Values:
## allow.huge.downloads : TRUE
## autoload.extra.pkgs : FALSE
## cache.all.requests : TRUE
## cache.directory : NA
## cache.read.only : FALSE
## cache.system : TRUE
## compute.fields : TRUE
## dwnld.chunk.size : NA
## dwnld.timeout : 3600
## entries.sep : |
## force.locale : TRUE
## intra.field.name.sep : .
## multival.field.sep : ;
## offline : FALSE
## persistent.cache.impl : custom
## proton.mass : 1.007276
## svn.binary.path :
## test.functions : NA
## use.cache.for.local.db : FALSE
## useragent : R Bioconductor biodb library.
To get a data frame of all keys with their title (short description), type and default value, run the following call (the result is shown in the table below):
key | title | type | default |
---|---|---|---|
allow.huge.downloads | Authorize download of big files. | logical | TRUE |
autoload.extra.pkgs | Enable automatic loading of extension packages. | logical | FALSE |
cache.all.requests | Enable caching of all requests and their results. | logical | TRUE |
cache.directory | Path to the cache folder. | character | NA |
cache.read.only | Set cache system in read only mode. | logical | FALSE |
cache.system | Enable cache system. | logical | TRUE |
use.cache.for.local.db | Enable the use of the cache system also for local databases. | logical | FALSE |
dwnld.chunk.size | The number of new entries to wait before saving them into the cache. | integer | NA |
dwnld.timeout | Download timeout in seconds. | integer | 3600 |
compute.fields | Enable automatic computing of missing fields. | logical | TRUE |
force.locale | Force change of current locale for the application. | logical | TRUE |
multival.field.sep | The separator used for concatenating values. | character | ; |
intra.field.name.sep | The separator use for building a field name. | character | . |
entries.sep | The separator used between values from different entries. | character | \| |
offline | Stops sending requests to the network. | logical | FALSE |
persistent.cache.impl | The implementation to use for the persistent cache. | character | custom |
proton.mass | The mass of one proton. | numeric | 1.0072765 |
svn.binary.path | The path to the svn binary. | character | |
test.functions | List of functions to test. | character | NA |
useragent | The application name and contact address to send to the contacted web server. | character | R Bioconductor biodb library. |
To get the description of a key, run:
## [1] "The user agent description string. This string is compulsory when connection to remote databases."
To get a value, run:
## [1] "R Bioconductor biodb library."
To set a field value, run:
mybiodb$getConfig()$set('useragent', 'My application ; [email protected]')
mybiodb$getConfig()$get('useragent')
## [1] "My application ; [email protected]"
If the field is boolean, you can use the following methods instead:
## INFO [03:07:33.631] Enable offline.
## INFO [03:07:33.632] Disable offline.
Configuration keys have default values. You can get a key’s default value with this call:
## [1] "R Bioconductor biodb library."
Environment variables can be used to overwrite default values. To get the name of the environment variable associated with a particular key, call the following method:
## [1] "BIODB_USERAGENT"
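The precedence of an environment variable over a default value can be pictured with a few lines of base R. This is only a sketch of the principle: `resolveConfigValue()` is a hypothetical helper, and the way biodb reads environment variables internally may differ.

```r
# Hypothetical helper: use the environment variable when it is set,
# otherwise fall back on the default value.
resolveConfigValue <- function(envVar, default) {
    value <- Sys.getenv(envVar, unset = NA_character_)
    if (is.na(value)) default else value
}

defaultAgent <- "R Bioconductor biodb library."

# Variable not set: the default value is used.
Sys.unsetenv("BIODB_USERAGENT")
withoutEnv <- resolveConfigValue("BIODB_USERAGENT", defaultAgent)

# Variable set: its value replaces the default.
Sys.setenv(BIODB_USERAGENT = "My application")
withEnv <- resolveConfigValue("BIODB_USERAGENT", defaultAgent)

Sys.unsetenv("BIODB_USERAGENT")   # leave the environment clean
```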
Before creating any connector, you can get information on the available databases and their connector classes.
Printing the databases info instance displays a list of all available database connector classes:
## Biodb databases information instance.
## The following databases are defined:
## comp.csv.file: Compound CSV File connector class.
## comp.sqlite: Compound SQLite connector class.
## mass.csv.file: Mass spectra CSV File connector class.
## mass.sqlite: Mass spectra SQLite connector class.
If you want more information on one particular connector, run:
## Mass spectra CSV File class.
## Class: mass.csv.file.
## Package: biodb.
## Description: A connector to handle a mass spectra database stored inside a CSV file. It is possible to choose the separator for the CSV file, as well as match the column names with the biodb entry fields...
## Entry content type: tsv.
This package is delivered with connectors for local databases, such as MassCsvFile and MassSqlite; the list above also shows their compound counterparts. However it is extendable, and in fact other packages already exist or will soon be made available on Bioconductor or GitHub for accessing other databases like ChEBI, Uniprot, HMDB, KEGG, Massbank or Lipidmaps. You may also write your own connector by extending biodb. If you are interested, a vignette explains in detail what you need to do.
When creating the instance of the BiodbMain class, you should have received a message like “Loading definitions from package …” if any extending package has also been installed on your system. Connector definitions found in extending packages are automatically loaded when instantiating BiodbMain, thus you do not need to call library() to individually load each extending package.
To get a list of available connectors, simply print information about your BiodbMain instance:
mybiodb
Connectors are created through the factory instance.
To get the factory instance, run:
## Biodb factory instance.
Here is the creation of a Compound CSV File connector, using a TSV file:
chebi.tsv <- system.file("extdata", "chebi_extract.tsv", package='biodb')
conn <- mybiodb$getFactory()$createConn('comp.csv.file', url=chebi.tsv)
conn
## INFO [03:07:33.711] Loading file database "/tmp/Rtmp3rxLwh/Rinst14cc71df8582/biodb/extdata/chebi_extract.tsv".
## Compound CSV File instance.
## Class: comp.csv.file.
## Package: biodb.
## Description: A connector to handle a compound database stored inside a CSV file. It is possible to choose the separator for the CSV file, as well as match the column names with the biodb entry fields..
## Entry content type: tsv.
## URLs: base.url: /tmp/Rtmp3rxLwh/Rinst14cc71df8582/biodb/extdata/chebi_extract.tsv.
## ID: comp.csv.file.
## The following fields have been defined: accession, formula, monoisotopic.mass, molecular.mass, kegg.compound.id, name, smiles, description.
The connector instance allows you to send requests to the database to retrieve entries directly with getEntry():
## Biodb Compound CSV File entry instance 1018.
or run more complex queries like a search:
## [1] 1390
See vignette Manipulating entry objects to learn how to access database entries and manipulate them.
All connectors inherit from the super class BiodbConn, hence they share a set of common methods like getEntryIds(), getEntry() and searchForEntries(). Moreover, a connector may be specialized as a connector to a compound database or to a mass spectra database, in which case it inherits specific methods. In the case of a mass spectra database, these are methods targeted at mass spectra, like getNbPeaks(), searchForMassSpectra(), msmsSearch(), etc. In the case of a compound database, it is annotateMzValues(). Those methods are generic and can thus be used with any connector inheriting from the corresponding super class.
When creating a connector, the factory keeps a reference to it in a list. If you try to create the same connector again, the method createConn() will throw an error. However you can use the getConn() method to get back the connector instance from its identifier or database name (if there is only one instance for the database):
The factory is also responsible for creating the BiodbEntry objects and, as for the connectors, it stores them in a list (called the “volatile cache”). When asked again for the same entry, the factory returns the reference it has kept in the list.
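The behaviour described above, one shared reference per connector and an error on duplicate creation, can be sketched with an environment used as a registry. `newRegistry()` and the functions it returns are hypothetical and only mimic the principle; they do not reproduce the code of BiodbFactory.

```r
# Hypothetical registry mimicking how a factory can keep references.
newRegistry <- function() {
    conns <- new.env()   # identifier -> connector object
    list(
        createConn = function(id) {
            if (exists(id, envir = conns, inherits = FALSE))
                stop("A connector \"", id, "\" already exists.")
            conn <- new.env()   # environments have reference semantics
            conn$id <- id
            assign(id, conn, envir = conns)
            conn
        },
        getConn = function(id) get(id, envir = conns, inherits = FALSE)
    )
}

registry <- newRegistry()
first <- registry$createConn("comp.csv.file")
again <- registry$getConn("comp.csv.file")
sameObject <- identical(first, again)   # both names point to one object

# Creating the same connector twice fails; the error message is captured.
duplicate <- tryCatch(registry$createConn("comp.csv.file"),
                      error = function(e) conditionMessage(e))
```

The same idea applies to the volatile cache of entries: the stored reference is handed back instead of building a new object.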
To delete a connector, which is good practice when you have finished using it but are not done with biodb, run:
## INFO [03:07:33.804] Connector "comp.csv.file" deleted.
This will free all memory used for this connector, including the created entries. Be careful not to keep those entry objects in variables of your own, otherwise R will not release the memory.
When creating a connector with CompCsvFileConn or MassCsvFileConn, if your CSV file uses standard biodb field names as column names in its header line, then everything will be fine and all values will be read, recognized and set into entry objects. However if your CSV file uses custom column names, those values will be ignored by biodb. To tell biodb to use those columns, you must define a mapping between each custom column and a valid biodb entry field, by using the setField() method.
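What such a mapping amounts to can be pictured with a plain data frame: each custom column is renamed to the biodb field it is associated with. The sketch below uses base R only; `mapColumns()` is a hypothetical helper and the two data rows are a small extract invented for the illustration, so this is not how the connector is implemented.

```r
# A small semicolon-separated file with custom column names.
csvText <- paste(
    "ID;formula;mass;molmass",
    "1018;C2H8AsNO3;168.97201;169.012",
    "1390;C8H8O2;136.05243;136.148",
    sep = "\n")
raw <- read.table(text = csvText, sep = ";", header = TRUE,
                  stringsAsFactors = FALSE)

# Mapping: biodb field name -> custom column name.
mapping <- c(accession = "ID",
             monoisotopic.mass = "mass",
             molecular.mass = "molmass")

# Hypothetical helper renaming the custom columns to their biodb fields.
mapColumns <- function(df, mapping) {
    idx <- match(mapping, names(df))
    if (anyNA(idx))
        stop("Unknown column(s): ",
             paste(mapping[is.na(idx)], collapse = ", "))
    names(df)[idx] <- names(mapping)
    df
}

mapped <- mapColumns(raw, mapping)
mappedNames <- names(mapped)
```

Columns whose names already match a field, like `formula`, are left as they are.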
Here we create a connector to a CSV file database of chemical compounds (see the table further below for its content) that uses the semicolon as a separator:
compUrl <- system.file("extdata", "chebi_extract_custom.csv", package='biodb')
compdb <- mybiodb$getFactory()$createConn('comp.csv.file', url=compUrl)
compdb$setCsvSep(';')
We use the getUnassociatedColumns() method to get a list of custom column names:
## INFO [03:07:33.833] Loading file database "/tmp/Rtmp3rxLwh/Rinst14cc71df8582/biodb/extdata/chebi_extract_custom.csv".
## WARN [03:07:33.834] Column "ID" does not match any biodb field.
## Warning in warn("Column \"%s\" does not match any biodb field.", colname):
## Column "ID" does not match any biodb field.
## WARN [03:07:33.836] Column "molmass" does not match any biodb field.
## Warning in warn("Column \"%s\" does not match any biodb field.", colname):
## Column "molmass" does not match any biodb field.
## WARN [03:07:33.837] Column "kegg" does not match any biodb field.
## Warning in warn("Column \"%s\" does not match any biodb field.", colname):
## Column "kegg" does not match any biodb field.
## [1] "ID" "molmass" "kegg"
The method returns 3 column names that have not been automatically mapped. However there is a catch: the mass column has been mapped automatically, but to the wrong biodb field, molecular.mass, as you can see when calling the method getFieldsAndColumnsAssociation():
## $formula
## [1] "formula"
##
## $molecular.mass
## [1] "mass"
##
## $name
## [1] "name"
##
## $smiles
## [1] "smiles"
##
## $description
## [1] "description"
The mass column of the CSV file in fact stores the monoisotopic masses. So we need to remap this column and, before that, to reset the connector:
## INFO [03:07:33.869] Connector "comp.csv.file" deleted.
compdb <- mybiodb$getFactory()$createConn('comp.csv.file', url=compUrl)
compdb$setCsvSep(';')
compdb$setField('accession', 'ID')
## INFO [03:07:33.873] Loading file database "/tmp/Rtmp3rxLwh/Rinst14cc71df8582/biodb/extdata/chebi_extract_custom.csv".
compdb$setField('kegg.compound.id', 'kegg')
compdb$setField('monoisotopic.mass', 'mass')
compdb$setField('molecular.mass', 'molmass')
Now the connector works fine, and we can for instance get a list of all accession numbers:
## [1] "1018" "1390" "1456" "1549" "1894" "1932" "1997" "10561" "15939"
## [10] "16750" "35485" "40304" "64679"
And get whichever entry we want:
## accession kegg.compound.id monoisotopic.mass molecular.mass formula
## 1 1018 C07279 168.972 169.012 C2H8AsNO3
## name smiles description comp.csv.file.id
## 1 2-Aminoethylarsonate NCC[As](O)(O)=O 1018
ID | formula | mass | molmass | kegg | name | smiles | description |
---|---|---|---|---|---|---|---|
1018 | C2H8AsNO3 | 168.97201 | 169.012 | C07279 | 2-Aminoethylarsonate | NCC[As](O)(O)=O | |
1390 | C8H8O2 | 136.05243 | 136.148 | C06224 | 3,4-Dihydroxystyrene | Oc1ccc(C=C)cc1O | |
1456 | C3H9NO2 | 91.06333 | 91.109 | C06057 | 3-aminopropane-1,2-diol | NC[C@H](O)CO | |
1549 | C3H5O3R | 89.02387 | 89.070 | C03834 | 3-hydroxymonocarboxylic acid | OC([*])CC(O)=O | |
1894 | C5H11NO | 101.08406 | 101.147 | C10974 | 4-Methylaminobutanal | CNCCCC=O | |
1932 | C6H6NR | 92.05002 | 92.119 | C03084 | 4-Substituted aniline | Nc1ccc([*])cc1 | |
The BiodbRequestScheduler instance is responsible for sending requests to web servers, respecting the frequency specified by the scheduler.n and scheduler.t parameters, and for receiving results and saving them to the cache.
The cache is used to give back a result immediately without contacting the server, in case the exact same request has already been run previously.
You do not have to interact with the request scheduler; it runs as a back-end component.
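A request scheduler of this kind can be sketched in a few lines: keep the times of the recent requests, wait when n requests have already been sent during the last t seconds, and store each result in a cache. The `newScheduler()` function below is a hypothetical, simplified model (in-memory cache, a single frequency rule, a fake request function); it is not the implementation of BiodbRequestScheduler.

```r
# Hypothetical scheduler allowing at most `n` requests every `t` seconds.
# `now` and `wait` can be replaced, e.g. for testing without real delays.
newScheduler <- function(n, t, now = function() as.numeric(Sys.time()),
                         wait = Sys.sleep) {
    sent <- numeric(0)   # times (in seconds) of the recent requests
    cache <- new.env()   # request -> result
    function(request, fetch) {
        # An identical request has already been run: answer from the cache.
        if (exists(request, envir = cache, inherits = FALSE))
            return(get(request, envir = cache, inherits = FALSE))
        repeat {
            current <- now()
            sent <<- sent[sent > current - t]   # forget old requests
            if (length(sent) < n) break
            wait(sent[[1]] + t - current)       # wait for a free slot
        }
        sent <<- c(sent, current)
        result <- fetch(request)
        assign(request, result, envir = cache)
        result
    }
}

fetchCount <- 0
fakeFetch <- function(request) {   # stands for a real web request
    fetchCount <<- fetchCount + 1
    paste("result for", request)
}

send <- newScheduler(n = 2, t = 0.2)
r1 <- send("entry/1", fakeFetch)
r2 <- send("entry/2", fakeFetch)
r3 <- send("entry/3", fakeFetch)   # delayed until a slot is free
r4 <- send("entry/1", fakeFetch)   # served from the cache, no request sent
```

Only three requests reach `fakeFetch()`: the fourth call repeats the first request and is answered from the cache.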
The BiodbEntryFields instance stores information about all entry fields declared inside definitions YAML files.
Get the entry fields instance:
## Biodb entry fields information instance.
Get a list of all defined fields:
## [1] "aa.seq" "aa.seq.length"
## [3] "aa.seq.location" "accession"
## [5] "average.mass" "cas.id"
## [7] "catalytic.activity" "charge"
## [9] "chebi.id" "chemspider.id"
## [11] "chrom.col.constructor" "chrom.col.diameter"
## [13] "chrom.col.id" "chrom.col.length"
## [15] "chrom.col.method.protocol" "chrom.col.name"
## [17] "chrom.flow.gradient" "chrom.flow.rate"
## [19] "chrom.rt" "chrom.rt.max"
## [21] "chrom.rt.min" "chrom.rt.unit"
## [23] "chrom.solvent" "cofactor"
## [25] "comp.csv.file.id" "comp.iupac.name.allowed"
## [27] "comp.iupac.name.cas" "comp.iupac.name.pref"
## [29] "comp.iupac.name.syst" "comp.iupac.name.trad"
## [31] "comp.sqlite.id" "comp.super.class"
## [33] "composition" "compound.id"
## [35] "description" "ec"
## [37] "equation" "expasy.enzyme.id"
## [39] "formula" "gene.symbol"
## [41] "hmdb.metabolites.id" "inchi"
## [43] "inchikey" "kegg.compound.id"
## [45] "kegg.genes.id" "logp"
## [47] "mass.csv.file.id" "mass.sqlite.id"
## [49] "molecular.mass" "monoisotopic.mass"
## [51] "ms.level" "ms.mode"
## [53] "msdev" "msdevtype"
## [55] "msprecannot" "msprecmz"
## [57] "mstype" "name"
## [59] "nb.compounds" "nb.peaks"
## [61] "ncbi.gene.id" "ncbi.pubchem.comp.id"
## [63] "nominal.mass" "nt.seq"
## [65] "nt.seq.length" "organism"
## [67] "pathway.class" "peak.attr"
## [69] "peak.comp" "peak.error.ppm"
## [71] "peak.formula" "peak.intensity"
## [73] "peak.mass" "peak.mz"
## [75] "peak.mzexp" "peak.mztheo"
## [77] "peak.relative.intensity" "peaks"
## [79] "products" "smiles"
## [81] "smiles.canonical" "smiles.isomeric"
## [83] "substrates"
Get information about a field:
## Entry field "monoisotopic.mass".
## Description: Monoisotopic mass, in u (unified atomic mass units) or Da (Dalton). It is computed using the mass of the primary isotope of the elements including the mass defect (mass difference between neutron and proton, and nuclear binding energy). Used with high resolution mass spectrometers. See https://en.wikipedia.org/wiki/Monoisotopic_mass.
## Class: double.
## Type: mass.
## Cardinality: one.
## Aliases: exact.mass.
The object returned is a BiodbEntryField instance. See the help page of this class to get a list of all methods you can call on such an instance.
The persistent cache system is responsible for saving entry contents and results of web server requests onto disk, and for reusing them later to avoid contacting the web server again.
Run the following method to get the instance of the BiodbPersistentCache class:
## Biodb persistent cache system instance.
## The used implementation is: custom.
## The path to the cache system is: /github/home/.cache/R/biodb.
## The cache is readable.
## The cache is writable.
It is possible to delete files from the cache directly from the persistent cache instance. However it is much preferable to do it from the connector instance. Let us open an instance of the ChEBI example connector from the vignette Creating a new connector:
source(system.file("extdata", "ChebiExConn.R", package='biodb'))
source(system.file("extdata", "ChebiExEntry.R", package='biodb'))
mybiodb$loadDefinitions(system.file("extdata", "chebi_ex.yml", package='biodb'))
conn <- mybiodb$getFactory()$createConn('chebi.ex')
And load an entry:
## INFO [03:07:34.096] Create cache folder "/github/home/.cache/R/biodb/chebi.ex-0c5076ac2a43d16dbce503a44b09f649" for "chebi.ex-0c5076ac2a43d16dbce503a44b09f649".
The entry is now downloaded into the cache system. We can check that with the following call:
## Warning: `fileExist()` was deprecated in biodb 1.1.0.
## ℹ Please use `fileExists()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## [1] TRUE
And get the path to the cache file:
mybiodb$getPersistentCache()$getFilePath(conn$getCacheId(), name='17001', ext=conn$getEntryFileExt())
## [1] "/github/home/.cache/R/biodb/chebi.ex-0c5076ac2a43d16dbce503a44b09f649/17001.xml"
If we delete the entry content from the cache:
The file does not exist anymore:
## [1] FALSE
But the entry object is still in memory. We need to delete entry instances with the following call:
Note that the results of web server requests are still inside the cache folder. In order to force a new download of the data, we need to erase those files too. The following call will erase all cache files associated with a connector, including the entry files that deleteAllEntriesFromPersistentCache() deletes:
## INFO [03:07:35.973] Erasing all files in "/github/home/.cache/R/biodb/chebi.ex-0c5076ac2a43d16dbce503a44b09f649".
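The life cycle of a cache file (build its path from the connector cache ID, the entry name and the file extension, test its existence, delete it, then erase the whole connector folder) can be reproduced with base R on a temporary folder. The sketch below is an illustration only: `cacheFilePath()` is a hypothetical helper, the folder name `chebi.ex-demo` and the file content are made up, and the persistent cache of biodb is not involved.

```r
# Hypothetical helper building a cache file path from its three parts.
cacheFilePath <- function(cacheDir, cacheId, name, ext) {
    file.path(cacheDir, cacheId, paste(name, ext, sep = "."))
}

cacheDir <- tempfile("biodb-cache-")   # throwaway cache folder
entryFile <- cacheFilePath(cacheDir, "chebi.ex-demo", "17001", "xml")

# Saving an entry content creates the connector folder if needed.
dir.create(dirname(entryFile), recursive = TRUE)
writeLines("<entry/>", entryFile)
afterSave <- file.exists(entryFile)

# Deleting the entry content removes the file.
unlink(entryFile)
afterDelete <- file.exists(entryFile)

# Erasing the whole connector folder removes any remaining file.
unlink(dirname(entryFile), recursive = TRUE)
afterErase <- dir.exists(dirname(entryFile))
```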
biodb uses the lgr package for logging messages. The lgr instance used by biodb can be obtained by calling:
## <Logger> [info] biodb
##
## inherited appenders:
## console: <AppenderConsole> [all] -> console
See the lgr home page for a demonstration of how to use it.
You can use the following biodb shortcuts to send messages of different levels. To send an information message:
## INFO [03:07:36.000] 12 entries have been processed.
To send a debug message:
To send a trace message:
In addition biodb defines two methods to throw an error or a warning and log this error or warning at the same time. These are biodb::error() and biodb::warn().
By default the lgr package displays information messages. If you want to silence all messages, just run lgr::lgr$remove_appender(1). This will remove the default appender and silence all messages from all packages using lgr, including biodb. However if you just want to silence biodb messages, run:
Information messages are now silenced:
To enable them again:
And messages are echoed again:
## INFO [03:07:36.078] hello
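The same effect can be obtained with the lgr API directly, since the logger printed earlier is a regular lgr logger named biodb. The sketch below relies on `lgr::get_logger()` and on the `set_threshold()` method of lgr loggers; it is skipped when lgr is not installed, and the messages are arbitrary examples.

```r
if (requireNamespace("lgr", quietly = TRUE)) {
    logger <- lgr::get_logger("biodb")

    # Only warnings and errors pass: information messages are silenced.
    logger$set_threshold("warn")
    logger$info("This message is not displayed.")

    # Back to the information level: messages are echoed again.
    logger$set_threshold("info")
    logger$info("hello")
    restoredLevel <- logger$threshold
}
```

Changing the threshold of this logger leaves the root logger untouched, so other packages using lgr keep their own settings.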
Do not forget to terminate your biodb instance once you are done with it:
## INFO [03:07:36.092] Closing BiodbMain instance...
## INFO [03:07:36.093] Connector "comp.csv.file" deleted.
## INFO [03:07:36.094] Connector "chebi.ex" deleted.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] biodb_1.15.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 sass_0.4.9 generics_0.1.3
## [4] bitops_1.0-9 stringi_1.8.4 RSQLite_2.3.9
## [7] hms_1.1.3 digest_0.6.37 magrittr_2.0.3
## [10] evaluate_1.0.1 fastmap_1.2.0 blob_1.2.4
## [13] plyr_1.8.9 jsonlite_1.8.9 progress_1.2.3
## [16] DBI_1.2.3 BiocManager_1.30.25 httr_1.4.7
## [19] XML_3.99-0.17 jquerylib_0.1.4 cli_3.6.3
## [22] rlang_1.1.4 chk_0.9.2 crayon_1.5.3
## [25] dbplyr_2.5.0 bit64_4.5.2 withr_3.0.2
## [28] cachem_1.1.0 yaml_2.3.10 tools_4.4.2
## [31] memoise_2.0.1 dplyr_1.1.4 filelock_1.0.3
## [34] curl_6.0.1 buildtools_1.0.0 vctrs_0.6.5
## [37] R6_2.5.1 BiocFileCache_2.15.0 lifecycle_1.0.4
## [40] stringr_1.5.1 bit_4.5.0.1 pkgconfig_2.0.3
## [43] pillar_1.10.0 bslib_0.8.0 glue_1.8.0
## [46] Rcpp_1.0.13-1 lgr_0.4.4 xfun_0.49
## [49] tibble_3.2.1 tidyselect_1.2.1 sys_3.4.3
## [52] knitr_1.49 htmltools_0.5.8.1 rmarkdown_2.29
## [55] maketools_1.3.1 compiler_4.4.2 prettyunits_1.2.0
## [58] askpass_1.2.1 RCurl_1.98-1.16 openssl_2.3.0