Note: the most recent version of this tutorial can be found here and a short overview slide show here.
The R package eiR
provides an index for chemical
compound databases allowing one to quickly find similar compounds in a
very large database. To create this index, r reference
compounds are selected to represent the database. Then each compound in
the database is embedded into d-dimensional space based on
their distance to each reference compound. This requires time linear in
the size of the database, but only needs to be done once for a database.
Within this space, Locality Sensitive Hashing (LSH) is employed to allow
sub-linear time nearest neighbor lookups (Dong,
Wang, Charikar, et al. 2008; Dong, Wang, Josephson, et al. 2008)
that nearest neighbors can be found without doing a linear scan through
the entire compound database. Additional compounds can be added to the
database in time linear to the number of new compounds. No time is spend
processing existing compounds, as long as the set of reference compounds
remains the same. Given the ability to quickly find nearest neighbors,
this method enables fast clustering with the Jarvis-Pattrick algorithm
as well (Jarvis and Patrick 1973). For
details on the whole process see (Cao, Jiang, and
Girke 2010).
This library uses an SQL back-end (SQLite by default) to store
chemical compound definitions, either in SDF or SMILE format, as well as
descriptors. Several different kinds of descriptors can be stored for
each compound, for example, one could compute and store atom-pair and
fingerprint descriptors for each compound. The SQLite database, if used,
is stored in a directory called data
. The
eiInit
function is used to create a new database, it can
import data from SDF or SMILE formated files, or an SDFset
object.
Once a database has been created, an embedding must also be created
(Dimitris K. Agrafiotis, Dmitrii N. Rassokhin,
Victor S. Lobanov 2001). In this step the reference compounds are
chosen and each compound is embedded into the new space. This step
creates a new directory called run-r-d
, where
r
and d
are the corresponding values. This is
the most costly step of the process and is handled by the
eiMakeDb
function. This step can be parallelized by
providing a SNOW cluster to the eiMakeDb function.
Given an embedded database, queries can be run against it with the
eiQuery
function. Additional compounds can also be added to
an existing database and embedding using eiAdd
. Performance
tests can be run using the eiPerformanceTest function, and
Jarvis-Patrick clustering can be done with the eiCluster
function.
eiR
also provides some mechanisms to allow the user to
extend the set of descriptor formats used and to define new distance
functions. See Section Customization for
details.
An initial compound database must be created with the following command:
## Loading required package: RSQLite
## [1] "createing db"
## loaded 99 compounds
## [1] "99 loaded by eiInit"
## setting priorities for 4 groups.
## [1] 295 211 264 272 258 273 228 261 206 299 288 203 220 278 286 244 257 263 270 274 249 239 214 226
## [25] 202 298 296 242 262 235 291 256 207 225 230 280 283 210 289 248 259 290 246 293 265 271 234 236
## [49] 241 245 253 205 240 217 294 223 255 221 281 251 238 268 252 212 247 277 266 250 269 282 254 204
## [73] 260 285 216 227 215 229 279 267 276 243 237 275 218 224 233 284 297 292 287 208 232 231 213 201
## [97] 219 209 222
EiInit can take either an SDFset, or a filename. If a filename is
given it must be in either SDF or SMILE format and the format must be
specified with the format
paramter. It might complain if
your SDF file does not follow the SDF specification. If this happens,
you can create an SDFset with the read.SDFset
command and
then use that instead of the filename.
Descriptors will also be computed at this time. The default
descriptor type is atompair. Other types can be used by setting the
descriptorType
parameter. Currently available types are
“ap” for atompair, and “fp” for fingerprint. The set of available
descriptors can be extended, see Section Customization. EiInit will create a folder
called ‘data’. Commands should always be executed in the folder
containing this directory (ie, the parent directory of “data”), or else
specify the location of that directory with the dir
option.
EiInit
also has some options to support loading data in
parallel. This is only possible if the SQL database being used supports
parallel writes. For now, this means only PostgreSQL. To use this
feature, the inputs
parameter must be set to an array of
filenames, cl
set to a SNOW cluster, and
connSource
set to a function which will return a new
database connection when called with no parameters. Then each file will
be loaded on a different node of the given cluster, in parallel. The
connSource
function must actually create a new connection
on each call, simply returning the same reference to an existing
connection will not work. This is because that function will be called
on potentially different machines which will have to establish their own
connection to the database.
Compounds which already exist in the database will be skipped over,
so it is safe to re-run eiInit
on an input file which has
already been partially loaded. By default it discovers duplicates by
comparing the entire compound definition. However, if you want two
compounds with the same name to be considered equal even if the
definition is different, you can set updateByName
to be
true. In this mode, if a compound being loaded is found to exist in the
database already, by name, but has a different definition, the compound
will be updated with the new definition and any associated descriptors
and/or features will be re-computed.
eiInit will return a list of compound id numbers which represent the compounds just inserted. These numbers can be used to issue queries later.
In this step the compounds in the data directory will be embedded in
another space which allows for more efficient searching. The main two
parameters are refs
and d
. refs
can be either a list of compound ids to use as references, or else an
integer value indicating the number of references to use, which will
then be randomly chosen. We will use r to represent the number
of reference compounds. d is the dimension of the embedding
space. We have found in practice that setting d to around 100
works well. r should be large enough to “represent” the full
compound database. Since creating a database is the longest running
step, a SNOW cluster can be provided to parallelize the task. A
conSource
function is also required in this case, as
described in Section Initialization.
To help tune these values, eiMakeDb
will pick
numSamples
non-reference samples which can later be used by
the eiPerformanceTest
function.
eiMakdDb
does its job in a job folder, named after the
number of reference compounds and the number of embedding dimensions.
For example, using 300 reference compounds to generate a 100-dimensional
embedding (r=300,
d=100
) will result in a job
folder called run-300-100
. The embedding result is the file
matrix.r.d. In the above example, the output would be
run-300-100/matrix.300.100
.
Since more than one type of descriptor can be stored for each
compound, the desired descriptor type must be given to this function
with the descriptorType
parameter. The default value is
“ap”, for atompair. You can also specify a custom distance function that
must be able to take two descriptors in the format specified and return
a distance value. The default distance method used is
1-Tanimoto(descriptor1,descriptor2).
The return value is called the Run Id. This value is needed by other functions to identify the data set and embedding parameters to use.
## found 95 un-embedded descriptors for runId 1
## Regenerating matrix file...
## Regenerating matrix file...
Queries can be given in several formats, defined by the
format
parameter. The default format is “sdf”. The
queries
parameter can be either an SDF file or an SDFset
under this format. Other valid values for format
are “name”
and “compound_id”. Under these two formats the queries
parameter is expected to be a list of compound names (as returned by
sdfid on an SDFset), or a list of compound id numbers from the database,
such as what is returned by the eiInit function.
The runId
parameter is required to determine which
embedded database to use. As with eiMakeDb
, the
distance
parameter may be given if desired, it will default
to the Tanimoto Coefficient otherwise. Finally, the parameter
K
is the number of results that will be returned. In some
cases, particularly if K
is small, you may need to set it
to a larger value and then trim down the result set yourself. This is
because LSH is not an exact algorithm. Internally, it actually searches
for xK
neighbors, where x
is referred to as
the expansion ratio, generally set to 2. This allows it to pick the best
K
matches, according to the true distance function, out of
a larger set of candidates. When K
is small though,
sometimes that expansion ratio is not quite enough.
Note also that this function returns distance values and not
similarities. Similarities can be computed by setting
asSimilarity
to TRUE. This assumes that whatever distance
function is currently in use returns values between 0 and 1. This is
true for the default distance functions for “ap” and “fp”
descriptors.
Then you can perform a query as follows:
## creating annoy index
## loading matrix file
## done. num items in index: 95
## found 1 queries
## query target similarity target_ids
## 1 650046 650046 1.0000000 245
## 2 650046 650054 0.5133418 251
## 3 650046 650011 0.4651163 211
## 4 650046 650092 0.3922652 286
## 5 650046 650069 0.3897316 264
## 6 650046 650048 0.3825758 247
## 7 650046 650015 0.3277592 215
## 8 650046 650008 0.3256659 208
## 9 650046 650070 0.3142112 265
## 10 650046 650068 0.3130545 263
#Compare to traditional similarity search:
data(apset)
print(cmp.search(apset,apset[45],type=3,cutoff=4,quiet=TRUE))
## index cid scores
## 1 45 650046 1.0000000
## 2 51 650054 0.5133418
## 3 11 650011 0.4651163
## 4 86 650092 0.3922652
The result will be a data frame with four columns. The first is a
query id, the second is a target, or hit, id, the third is the distance
(or similarity if asSimilary
was true) between the query
and the target, and the fourth column is the compound id number of the
target. Lsh parameters can be passed in as well, see Section Performance Tests for more details.
New Compounds can be added to an existing database, however, the
reference compounds cannot be changed. To add new compounds, use the
eiAdd function. This function is very similar to the eiQuery function,
except instead of a queries
parameter, there is an
additions
parameter, defining the compounds to be added.
The format of the value of this parameter works the same as in the
eiQuery function. For example, to add one compound from an SDFset you
would do:
## loaded 1 compounds
## [1] "1 loaded by eiInit"
## setting priorities for 0 groups.
## found 1 un-embedded descriptors for runId 1
## Regenerating matrix file...
## [1] 300
The returned value is a list of the compounds ids that were just added. This function will also write out a new matrix file in the run directory. For large databases, this can take a significant amount of time.
The eiPerforamceTest function will run several tests using some
sample data to evaluate the performance of the current embedding. It
takes the usual runId
parameter, as well as on optional
distance function. It also takes several LSH parameters, though the
defaults are usually fine. To evaluate the performance you can run:
## computing embedded distances on test quries
## writing top 50000 to file ./run-60-40/eucsearch.60-40
## A connection with
## description "./run-60-40/eucsearch.60-40"
## class "file"
## mode "w"
## text "text"
## opened "opened"
## can read "no"
## can write "yes"
## [1] "getting test query ids"
## writing top 50000 to file ./data/chemical-search.results
## A connection with
## description "./data/chemical-search.results"
## class "file"
## mode "w"
## text "text"
## opened "opened"
## can read "no"
## can write "yes"
## average embedding performance: 0.474087460669952
## creating annoy index
## loading matrix file
## done. num items in index: 96
## found 9 queries
## average indexing performance: 0.792859724989751
## [1] 0.9492650 0.8852363 0.7826207 0.7602859 0.5526534 0.9128568 0.6304909 0.7271193 0.9352092
This will perform two different tests. The first tests the embedding results in similarity search. The way this works is by approximating 1,000 random similarity searches (determined by data/test_queries.iddb) by nearest neighbor search using the coordinates from the embedding results. The search results are then compared to the reference search results (chemical-search.results.gz).
The comparison results are summarized in two types of files. The first type lists the recall for different k values, k being the number of numbers to retrieve. These files are named as “recall-ratio-k”. For example, if the recall is 70% for top-100 compound search (70 of the 100 results are among the real top-100 compounds) then the value at line 100 is 0.7. Several relaxation ratios are used, each generating a file in this form. For instance, recall.ratio-10 is the file listing the recalls when relaxation ratio is 10. The other file, recall.csv, lists recalls of different relaxation ratios in one file by limiting to selected k value. In this CSV file, the rows correspond to different relaxation ratios, and the columns are different k values. You will be able to pick an appropriate relaxation ratio for the k values you are interested in.
The second test measures the performance of the Locality Sensitive
Hash (LSH). The results for lsh-assisted search will be in
run-r-d/indexed.performance. It’s a 1,000-line file of recall values.
Each line corresponds to one test query. LSH search performance is
highly sensitive to your LSH parameters (K, W, M, L, T). The default
parameters are listed in the man page for
eiPerformanceTest
. When you have your embedding result in a
matrix file, you should follow instruction on http://lshkit.sourceforge.net/dd/d2a/mplsh-tune_8cpp.html
to find the best values for these parameters.
Compounds can be clustered in near linear time using the
Jarvis-Patrick clustering algorithm by taking advantage of the near
constant time nearest neighbor lookup provided by the LSH index.
Clustering is done with the eiCluster
function. It takes a
runId
to identify the data set and embedding, and two
parameters for the Jarvis-Patrick algorithm: K
is the
number of neighbors to fetch for each compound, and minNbrs
is the minimum number of neighbors two compounds must have in common in
order to be joined into the same cluster. A cutoff
value
can also be given to set a maximum distance between neighbors. Any two
compounds farther apart than this cutoff will never be considered
neighbors. This parameter is helpful in preventing compounds which are
very different from almost every other compound from being considered
similar to other distant compounds simply because they happened to be
closest.
By default eiCluster
will cluster the entire dataset
specified by runId
. If you want to only cluster a subset of
those compounds, you can provide their compound id values to the
compoundIds
parameter.
eiR
can be extended to understand new descriptor types
and new distance functions. New distance functions can be set in two
different ways. Any function that takes a distance parameter can be
given a new distance function that will be used for just that call. If
no distance function is given, it will fetch a default distance function
that has been defined for the given descriptor type. This default value
can be changed using the setDefaultDistance
function, which
takes the descriptor type and a distance function. Once this function
has been called, the new distance function will be used for that
descriptor type by all functions using a distance function. The built-in
defaults are defined as follows:
setDefaultDistance("ap", function(d1,d2) 1-cmp.similarity(d1,d2) )
setDefaultDistance("fp", function(d1,d2) 1-fpSim(d1,d2) )
New descriptor types can also be added using the
addTransform
function. These transforms are basically just
ways to read descriptors from compound definitions, and to convert
descriptors between string and object form. This conversion is required
because descriptors are stored as strings in the SQL database, but are
used by the rest of the program as objects.
There are two main components that need to be added. The
addTransform
function takes the name of the transform and
two functions, toString
, and toObject
. These
have slightly different meanings depending on the component you are
adding. The first component to add is a transform from a chemical
compound format, such as SDF, to a descriptor format, such as atom pair
(AP), in either string or object form. The toString function should take
any kind of chemical compound source, such an SDF file, an SDF object or
an SDFset, and output a string representation of the descriptors. Since
this function can be written in terms of other functions that will be
defined, you can usually accept the default value of this function. The
toObject function should take the same kind of input, but output the
descriptors as an object. The actual return value is a list containing
the names of the compounds (in the names field), and the actual
descriptor objects ( in the descriptors field).
The second component to add is a transform that converts between string and object representations of descriptors. In this case the toString function takes descriptors in object form and returns a string representation for each. The toObject function performs the inverse operation. It takes descriptors in string form and returns them as objects. The objects returned by this function will be exactly what is handed to the distance function, so you need to make sure that the two match each other.
For example, to allow atom pair descriptors to be extracted from and SDF source we would make the following call:
addTransform("ap","sdf",
toObject = function(input,conn=NULL,dir="."){
sdfset=if(is.character(input) && file.exists(input)){
read.SDFset(input)
}else if(inherits(input,"SDFset")){
input
}else{
stop(paste("unknown type for 'input',
or filename does not exist. type found:",class(input)))
}
list(names=sdfid(sdfset),descriptors=sdf2ap(sdfset))
}
)
addTransform("ap",
toString = function(apset,conn=NULL,dir="."){
unlist(lapply(ap(apset), function(x) paste(x,collapse=", ")))
},
toObject= function(v,conn=NULL,dir="."){
if(inherits(v,"list") || length(v)==0)
return(v)
as( if(!inherits(v,"APset")){
names(v)=as.character(1:length(v));
read.AP(v,type="ap",isFile=FALSE)
} else v,
"list")
}
)
R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS
Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
LC_TIME=en_US.UTF-8
[4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC tzcode source: system (glibc)
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] RSQLite_2.3.9 eiR_1.47.0 DBI_1.2.3 ChemmineR_3.59.0 BiocStyle_2.35.0
loaded via a namespace (and not attached): [1] sass_0.4.9
bitops_1.0-9 digest_0.6.37 magrittr_2.0.3
[5] evaluate_1.0.1 grid_4.4.2 blob_1.2.4 fastmap_1.2.0
[9] jsonlite_1.8.9 snowfall_1.84-6.3 gridExtra_2.3 BiocManager_1.30.25
[13] scales_1.3.0 codetools_0.2-20 jquerylib_0.1.4 cli_3.6.3
[17] rlang_1.1.4 RcppAnnoy_0.0.22 bit64_4.5.2 munsell_0.5.1
[21] base64enc_0.1-3 cachem_1.1.0 yaml_2.3.10 tools_4.4.2
[25] parallel_4.4.2 memoise_2.0.1 colorspace_2.1-1 ggplot2_3.5.1
[29] DT_0.33 buildtools_1.0.0 vctrs_0.6.5 R6_2.5.1
[33] png_0.1-8 lifecycle_1.0.4 rsvg_2.6.1 bit_4.5.0.1
[37] htmlwidgets_1.6.4 pkgconfig_2.0.3 pillar_1.10.0 bslib_0.8.0
[41] gtable_0.3.6 glue_1.8.0 Rcpp_1.0.13-1 xfun_0.49
[45] tibble_3.2.1 sys_3.4.3 knitr_1.49 rjson_0.2.23
[49] htmltools_0.5.8.1 snow_0.4-4 rmarkdown_2.29 maketools_1.3.1
[53] compiler_4.4.2 RCurl_1.98-1.16
This software was developed with funding from the National Science Foundation: ABI-0957099, 2010-0520325 and IGERT-0504249.