Title: | Accelerated similarity searching of small molecules |
---|---|
Description: | The eiR package provides utilities for accelerated structure similarity searching of very large small molecule data sets using an embedding and indexing approach. |
Authors: | Kevin Horan, Yiqun Cao and Tyler Backman |
Maintainer: | Thomas Girke <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.47.0 |
Built: | 2024-11-18 10:14:51 UTC |
Source: | https://github.com/bioc/eiR |
New descriptor types can be added using the addTransform function. These transforms provide a way to read descriptors from compound definitions and to convert descriptors between string and object form. This conversion is required because descriptors are stored as strings in the SQL database but are used as objects by the rest of the program.
There are two main components that need to be added. The addTransform function takes the name of the transform and two functions, toString and toObject. These have slightly different meanings depending on the component you are adding.
The first component to add is a transform from a chemical compound format, such as SDF, to a descriptor format, such as atom pair (AP), in either string or object form. The toString function should take any kind of chemical compound source, such as an SDF file, an SDF object or an SDFset, and output a string representation of the descriptors. Since this function can be written in terms of other functions that will be defined, you can usually accept the default value of this parameter. The toObject function should take the same kind of input, but output the descriptors as an object. The actual return value is a list containing the names of the compounds (in the names field) and the actual descriptor objects (in the descriptors field).
The second component to add is a transform that converts between string and object representations of descriptors. In this case the toString function takes descriptors in object form and returns a string representation for each. The toObject function performs the inverse operation. It takes descriptors in string form and returns them as objects. The objects returned by this function will be exactly what is handed to the distance function, so you need to make sure that the two match each other.
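Once both components for a new descriptor type are registered, a matching distance function can be set as the default for that type with setDefaultDistance (documented below). A minimal sketch, assuming the "ap-example" type defined in the example below; the function receives exactly the objects produced by the second component's toObject:

# default distance for the hypothetical "ap-example" descriptor type
setDefaultDistance("ap-example", function(d1, d2) 1 - cmp.similarity(d1, d2))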
addTransform(descriptorType, compoundFormat = NULL, toString = NULL, toObject)
descriptorType | The name of the descriptor type being added.
compoundFormat | The format of the compound data the descriptor will be extracted from.
toString | A function with three arguments: the data, an SQL connection object, and a directory name. The last two are optional and can default to NULL if not used in the body of the function. If this parameter is NULL and compoundFormat is not NULL, a default function will be used for this value.
toObject | A function with three arguments: the data, an SQL connection object, and a directory name. The last two are optional and can default to NULL if not used in the body of the function. If compoundFormat is not NULL, the return value of this function should be a list with the fields "names" and "descriptors", containing the compound names and descriptor objects, respectively. If compoundFormat is NULL, the return value should be a collection of descriptor objects, in whatever format the distance function for this descriptor type requires.
No value returned.
Kevin Horan
# adding support for atompair (ap) descriptors extracted from
# sdf formatted data

# first component
addTransform("ap-example", "sdf-example",
   # Any sdf source -> APset
   toObject = function(input, conn=NULL, dir="."){
      sdfset = if(is.character(input) && file.exists(input)){
            read.SDFset(input)
         }else if(inherits(input, "SDFset")){
            input
         }else{
            stop(paste("unknown type for 'input', or filename does not exist. type found:", class(input)))
         }
      list(names=sdfid(sdfset), descriptors=sdf2ap(sdfset))
   }
)

# second component
addTransform("ap-example",
   # APset -> string
   toString = function(apset, conn=NULL, dir="."){
      unlist(lapply(ap(apset), function(x) paste(x, collapse=", ")))
   },
   # string or list -> AP set list
   toObject = function(v, conn=NULL, dir="."){
      if(inherits(v, "list") || length(v) == 0)
         return(v)
      as(
         if(!inherits(v, "APset")){
            names(v) = as.character(1:length(v))
            read.AP(v, type="ap", isFile=FALSE)
         } else v,
         "list")
   }
)
Add additional compounds to an existing database.
eiAdd(runId, additions, dir=".", format="sdf", conn=defaultConn(dir),
      distance=getDefaultDist(descriptorType), updateByName=FALSE, ...)
runId | The id number identifying a particular set of settings for a database. This is generally the number returned by eiMakeDb.
additions | The compounds to add. This can be either a file in SDF format or an SDFset object.
dir | The directory where the "data" directory lives. Defaults to the current directory.
format | The format of the data given in additions.
conn | Database connection to use.
distance | The distance function used to compute the distance between two descriptors. A default function is provided for "ap" and "fp" descriptors.
updateByName | If true, we assume that all compounds, both in the existing database and in the given dataset, have unique names. The function will then avoid re-adding existing, identical compounds, and will update an existing compound with a new definition when a new definition is given under an existing name. If false, duplicate compound names may exist in the database, though not duplicate definitions: identical compounds will not be re-added, but a new version of an existing compound will not update the existing entry; it will instead be added as a completely new compound with a new compound id.
... | Additional options passed to eiInit.
New compounds can be added to an existing database; however, the reference compounds cannot be changed. This will also update the matrix file in the run/job directory with the new compounds.
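For instance, when compound names are known to be unique, updateByName=TRUE lets eiAdd update changed definitions in place. A minimal sketch, assuming runId and dir come from an earlier eiMakeDb() call and newCompounds is a hypothetical SDFset of new or revised compounds:

# identical compounds are skipped; changed definitions replace the old ones
eiAdd(runId, newCompounds, dir=dir, updateByName=TRUE)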
Kevin Horan
eiMakeDb
eiPerformanceTest
eiQuery
library(snow)

r <- 50
d <- 40

# initialize
data(sdfsample)
dir = file.path(tempdir(), "add")
dir.create(dir)
eiInit(sdfsample[1:99], dir=dir, skipPriorities=TRUE)

# create compound db
runId = eiMakeDb(r, d, numSamples=20, dir=dir)

# add a new compound to the existing database
eiAdd(runId, sdfsample[100], dir=dir)
Uses Jarvis-Patrick clustering to cluster the compound database using the LSH algorithm to quickly find nearest neighbors.
eiCluster(runId, K, minNbrs, compoundIds=c(), dir=".", cutoff=NULL,
          distance=getDefaultDist(descriptorType), conn=defaultConn(dir),
          searchK=-1, type="cluster", linkage="single")
runId | The id number identifying a particular set of settings for a database. This is generally the number returned by eiMakeDb.
K | The number of neighbors to consider for each compound.
minNbrs | The minimum number of neighbors that two compounds must have in common in order to be joined.
compoundIds | If this variable is set to a vector of compound ids, then clustering will be done with just those compounds. If left unset or empty, clustering will apply to all compounds in the given run.
dir | The directory where the "data" directory lives. Defaults to the current directory.
distance | The distance function used to compute the distance between two descriptors. A default function is provided for "ap" and "fp" descriptors.
cutoff | Distance cutoff value. Compounds having a distance larger than this value will not be included in the nearest neighbor table. Note that this is a distance value, not a similarity value as is often used in other ChemmineR functions.
conn | Database connection to use.
searchK | Tunable Annoy LSH parameter. A larger value gives more accurate results but takes longer to return. The default value of -1 lets the value be chosen automatically, which sets it to numTrees * (approximate number of nearest neighbors). See the Annoy page for details: https://github.com/spotify/annoy
type | If "cluster", returns a clustering; if "matrix", returns a list in the format expected by the jarvisPatrick function in ChemmineR.
linkage | Can be one of "single", "average", or "complete", for single linkage, average linkage and complete linkage merge requirements, respectively. In the context of Jarvis-Patrick, average linkage means that at least half of the pairs between the clusters under consideration must pass the merge requirement. Similarly, for complete linkage, all pairs must pass the merge requirement. Single linkage is the normal case for Jarvis-Patrick and just means that at least one pair must meet the requirement.
The Jarvis-Patrick clustering algorithm takes a set of items, a distance function, and two parameters, K and minNbrs. For each item, it finds the K nearest neighbors of that item. Normally this requires computing the distance between every pair of items. However, using Locality Sensitive Hashing (LSH), the set of nearest neighbors can be found in near constant time. Once the nearest neighbor matrix is computed, the algorithm makes one pass through the items and merges all pairs that have at least minNbrs neighbors in common.

Although not required, it is advisable to specify a cutoff value. This is the maximum distance two items can have from each other and still be considered neighbors. It is thus possible for an item to end up with fewer than K neighbors if fewer than K items are close enough to it. If a cutoff is not specified, it is possible for highly unrelated items to be listed as neighbors of another item simply because nothing else was nearby. This can lead to items being joined into clusters with which they have no true connection.

The type parameter can be used to return a list which can be used to call the jarvisPatrick function in ChemmineR directly. The advantage of this is that it will contain the similarity matrix, which can then be used to quickly set different cutoff values (using trimNeighbors) without having to re-compute the similarity matrix. Note that this requires that the given distance function return a value between 0 and 1 so it can be converted to a similarity function.
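A minimal sketch of that matrix-based workflow, assuming runId and dir come from an earlier eiMakeDb() call; the exact jarvisPatrick and trimNeighbors signatures should be checked against the ChemmineR documentation:

# fetch the nearest-neighbor matrix once
nnm <- eiCluster(runId, K=5, minNbrs=2, cutoff=0.5, dir=dir, type="matrix")
# drop neighbors below a stricter similarity threshold without re-computing distances
nnm2 <- trimNeighbors(nnm, cutoff=0.6)
# cluster directly with ChemmineR's Jarvis-Patrick implementation
clusters <- jarvisPatrick(nnm2, k=2)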
If type is "cluster", returns a clustering. This will be a vector in which the names are the compound names and the values are the cluster labels.

Otherwise, if type is "matrix", returns a list with the following components:
indexes | Index values of the nearest neighbors, for each item.
names | The database compound id of each item in the set.
similarities | The similarity values of each neighbor to the item for that row. Each similarity value corresponds to the id number in the same position in the indexes entry.
If there are fewer than K neighbors for a compound, that row will be padded with NAs.
Kevin Horan
library(snow)

r <- 50
d <- 40

# initialize
data(sdfsample)
dir = file.path(tempdir(), "cluster")
dir.create(dir)
eiInit(sdfsample, dir=dir, skipPriorities=TRUE)

# create compound db
runId = eiMakeDb(r, d, numSamples=20, dir=dir,
                 cl=makeCluster(1, type="SOCK", outfile=""))

eiCluster(runId, K=5, minNbrs=2, cutoff=0.5, dir=dir)
Takes the raw compound database in whatever format the given measure supports and creates a "data" directory.
eiInit(inputs, dir=".", format="sdf", descriptorType="ap", append=FALSE,
       conn=defaultConn(dir, create=TRUE), updateByName=FALSE, cl=NULL,
       connSource=NULL, priorityFn=forestSizePriorities, skipPriorities=FALSE)
inputs | Either the filename of a file, in the format given by the format argument, containing the compounds, or an SDFset object.
dir | The directory where the "data" directory lives. Defaults to the current directory.
format | The format of the data in inputs.
descriptorType | The format of the descriptor. Currently supported values are "ap" for atom pair and "fp" for fingerprint.
append | If true, the given compounds will be added to an existing database and the <data-dir>/Main.iddb file will be updated with the new compound id numbers. This should not normally be used directly; use eiAdd to add compounds to an existing database.
conn | Database connection to use. If a connection is given, you must ensure that it has been initialized using the initDb function.
updateByName | If true, we assume that all compounds, both in the existing database and in the given dataset, have unique names. The function will then avoid re-adding existing, identical compounds, and will update an existing compound with a new definition when a new definition is given under an existing name. If false, duplicate compound names may exist in the database, though not duplicate definitions: identical compounds will not be re-added, but a new version of an existing compound will not update the existing entry; it will instead be added as a completely new compound with a new compound id.
cl | A SNOW cluster can be given here to run this function in parallel.
connSource | A function returning a new database connection. Note that it is not sufficient to return a reference to an existing connection; it must be a distinct, new connection. This is needed for cluster operations that make use of the database, as each worker will need to create its own connection. If not given, certain parts of this function will not be parallelized. This function can also be used to set up the environment on the cluster worker nodes; for example, you might need to re-load libraries such as RSQLite.
priorityFn | A function that takes a list of compound ids and returns a data frame with the compound ids in the column 'compound_id' and their priority in the column 'priority'. There are two pre-defined functions in ChemmineR: 'randomPriorities' and 'forestSizePriorities' (the default). When several compounds map to the same descriptor and a function needs to go from a descriptor back to a compound, there is ambiguity about which compound to select; in that case, the compound with the highest priority is picked.
skipPriorities | If true, no priority values will be computed. See the priorityFn option.
eiInit can take either an SDFset or a filename; SDF and SMILES are supported by default. It might complain if your SDF file does not follow the SDF specification. If this happens, you can create an SDFset with the read.SDFset command and then use that instead of the filename.
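A minimal sketch of that workaround, assuming a hypothetical file "compounds.sdf" that read.SDFset can parse:

library(ChemmineR)
sdfset <- read.SDFset("compounds.sdf")   # parse (and, if needed, clean up) the compounds first
dir <- file.path(tempdir(), "init2")
dir.create(dir)
eiInit(sdfset, dir=dir)                  # pass the SDFset instead of the filename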
eiInit will create a folder called 'data'. Commands should always be executed in the folder containing this directory (i.e., the parent directory of "data"), or else the location of that directory should be specified with the dir option.
A directory called "data" will have been created in the current working directory. The generated compound ids of the given compounds are returned. These can be used to reference a compound or set of compounds in other functions, such as eiQuery.
Kevin Horan
eiMakeDb
eiPerformanceTest
eiQuery
data(sdfsample)
dir = file.path(tempdir(), "init")
dir.create(dir)
eiInit(sdfsample, dir=dir, priorityFn=randomPriorities)
Uses the initialized compound data to create an embedded compound database with r reference compounds in d dimensions.
eiMakeDb(refs, d, descriptorType="ap", distance=getDefaultDist(descriptorType),
         dir=".", numSamples=getGroupSize(conn, name=file.path(dir, Main)) * 0.1,
         conn=defaultConn(dir), cl=makeCluster(1, type="SOCK", outfile=""),
         connSource=NULL, numTrees=100)
refs | The reference compounds to use to build the database you wish to query against.
d | The number of dimensions used to build the database you wish to query against.
descriptorType | The format of the descriptor. Currently supported values are "ap" for atom pair and "fp" for fingerprint.
distance | The distance function used to compute the distance between two descriptors. A default function is provided for "ap" and "fp" descriptors.
dir | The directory where the "data" directory lives. Defaults to the current directory.
numSamples | The number of non-reference samples to be chosen now and used later by the eiPerformanceTest function.
conn | Database connection to use.
cl | A SNOW cluster can be given here to run this function in parallel.
connSource | A function returning a new database connection. Note that it is not sufficient to return a reference to an existing connection; it must be a distinct, new connection. This is needed for cluster operations that make use of the database, as each worker will need to create its own connection. If not given, certain parts of this function will not be parallelized. This function can also be used to set up the environment on the cluster worker nodes; for example, you might need to re-load libraries such as RSQLite.
numTrees | Affects the build time and the index size. A larger value will produce more accurate results but use more disk space. See https://github.com/spotify/annoy for more details.
This function will embed compounds from the data directory in another space which allows for more efficient searching. The two main parameters are r and d: r is the number of reference compounds to use and d is the dimension of the embedding space. We have found in practice that setting d to around 100 works well. r should be large enough to “represent” the full compound database. Note that an r by r matrix will be constructed during the course of execution, so r should be less than about 46,000 to avoid overflowing an integer. Since this is the longest running step, a SNOW cluster can be provided to parallelize the task.
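For example, a multi-worker SNOW cluster can be passed via the cl argument to speed up this step. A minimal sketch, assuming an initialized database under dir with enough compounds for 300 references:

library(snow)
cl <- makeCluster(4, type="SOCK", outfile="")
runId <- eiMakeDb(300, 100, numSamples=20, dir=dir, cl=cl)
stopCluster(cl)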
To help tune these values, eiMakeDb will pick numSamples non-reference samples which can later be used by the eiPerformanceTest function.

eiMakeDb does its job in a job folder, named after the number of reference compounds and the number of embedding dimensions. For example, using 300 reference compounds to generate a 100-dimensional embedding (r=300, d=100) will result in a job folder called run-300-100. The embedding result is the file matrix.<r>.<d>. In the above example, the output would be run-300-100/matrix.300.100.
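As a sketch, the path of that matrix file can be constructed from r, d, and the parent directory of "data", assuming the default run folder naming described above:

# e.g. <dir>/run-300-100/matrix.300.100 for r=300, d=100
matrixFile <- file.path(dir, paste0("run-", r, "-", d), paste0("matrix.", r, ".", d))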
Creates files in dir ("run-r-d" by default). The return value is an id number called the runId, which needs to be given to other functions such as eiQuery or eiAdd.
Kevin Horan
eiInit
eiPerformanceTest
eiQuery
eiCluster
library(snow)

r <- 50
d <- 40

# initialize
data(sdfsample)
dir = file.path(tempdir(), "makedb")
dir.create(dir)
eiInit(sdfsample, dir=dir, skipPriorities=TRUE)

# create compound db
runId = eiMakeDb(r, d, numSamples=20, dir=dir,
                 cl=makeCluster(1, type="SOCK", outfile=""))
Tests the performance of embedding and LSH.
eiPerformanceTest(runId, distance=getDefaultDist(descriptorType),
                  conn=defaultConn(dir), dir=".", K=200, searchK=-1)
runId | The id number identifying a particular set of settings for a database. This is generally the number returned by eiMakeDb.
distance | The distance function used to compute the distance between two descriptors. A default function is provided for "ap" and "fp" descriptors.
conn | Database connection to use.
dir | The directory where the "data" directory lives. Defaults to the current directory.
K | The number of search results to use for the LSH performance test.
searchK | Tunable Annoy LSH parameter. A larger value gives more accurate results but takes longer to return. See the Annoy page for details: https://github.com/spotify/annoy
This function can be used to tune the two Annoy LSH parameters, numTrees and searchK.

numTrees is provided to the eiMakeDb function and affects the build time and the index size. A larger value will produce more accurate results, but use more disk space.

searchK is given to the eiQuery function, or to this function. A larger value will give more accurate results, but will require more time to run.
This function will perform two different tests. The first test checks how well the embedding is working. When the eiMakeDb function is run, you can specify the number of test samples to use for this test. If not specified, it will default to 10% of the data set size. During this test, we take each sample and compute its distance to every other compound in the dataset using both the given descriptor distance function (e.g., "AP" or "fingerprint") and the Euclidean distance computed on the embedded version. We then measure how similar the resulting ranks of these lists are using Rank Biased Overlap (Webber, 2010) (http://www.williamwebber.com/research/papers/wmz10_tois.pdf). The similarity for each sample is output in a file called 'embedding.performance' in the work directory. Each line corresponds to one sample.
The second test compares the rankings produced using the descriptor distance function to the rankings produced by the final output of the LSH search, for each sample query. Again, rank biased overlap (RBO) is used to compare the rankings. The results are output in the same format as for the first test, in a file called 'indexed.performance'.
RBO is a similarity measure that produces a value in the range of [0,1]. Values closer to 0 are very dissimilar, while values closer to 1 are more similar.
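A minimal sketch of using these results to tune searchK, assuming runId and dir come from an earlier eiMakeDb() call; the searchK value shown is arbitrary:

rbo <- eiPerformanceTest(runId, dir=dir, K=50)
summary(rbo)   # values near 1 mean the indexed search closely tracks the true ranking
# a larger searchK trades query speed for accuracy
rbo2 <- eiPerformanceTest(runId, dir=dir, K=50, searchK=5000)
mean(rbo2) - mean(rbo)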
Returns the results of the indexing test. Each element of the resulting vector is the RBO similarity for the corresponding query. Creates files in dir/run-r-d.
Kevin Horan
library(snow)

r <- 50
d <- 40

# initialize
data(sdfsample)
dir = file.path(tempdir(), "perf")
dir.create(dir)
eiInit(sdfsample, dir=dir, skipPriorities=TRUE)

# create compound db
runId = eiMakeDb(r, d, numSamples=20, dir=dir,
                 cl=makeCluster(1, type="SOCK", outfile=""))

eiPerformanceTest(runId, dir=dir, K=22)
Finds similar compounds for each query.
eiQuery(runId, queries, format="sdf", dir=".",
        distance=getDefaultDist(descriptorType), conn=defaultConn(dir),
        asSimilarity=FALSE, K=200, searchK=-1, lshData=NULL,
        mainIds=readIddb(conn, file.path(dir, Main)))
runId | The id number identifying a particular set of settings for a database. This is generally the number returned by eiMakeDb.
queries | This can be either an SDFset or a file containing one or more query compounds.
format | The format in which the queries are given. Valid values include "sdf", used when queries is an SDF file or an SDFset.
dir | The directory where the "data" directory lives. Defaults to the current directory.
distance | The distance function used to compute the distance between two descriptors. A default function is provided for "ap" and "fp" descriptors; the Tanimoto function is used by default.
conn | Database connection to use.
asSimilarity | If true, return similarity values instead of distance values. This only works if the given distance function returns values between 0 and 1, which is true for the default atom pair and fingerprint distance functions.
K | The number of results to return.
searchK | Tunable Annoy LSH parameter. A larger value gives more accurate results but takes longer to return. The default value of -1 lets the value be chosen automatically, which sets it to numTrees * (approximate number of nearest neighbors). See the Annoy page for details: https://github.com/spotify/annoy
lshData | DEPRECATED. This is no longer used.
mainIds | A vector of all id numbers in the current database. This is mainly provided as an option here to avoid having to re-read the id list multiple times when executing several queries. If not supplied, it will be read automatically.
This function identifies the database by the runId parameter. The queries can be given in a few different formats; see the queries parameter for details. The LSH algorithm is used to quickly identify compounds similar to the queries.
This function must use a distance function rather than a similarity function. However, if the given distance function returns values between 0 and 1, then the asSimilarity parameter may be used to return similarity values rather than distance values.
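A minimal sketch, assuming runId and dir come from an earlier eiMakeDb() call and the default atom pair distance, which returns values in [0,1]:

hits <- eiQuery(runId, sdfsample[1:2], K=10, dir=dir, asSimilarity=TRUE)
head(hits)   # the result contains a "similarity" column instead of "distance"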
Returns a data frame with columns 'query', 'target', 'target_ids', and 'distance'. 'query' and 'target' are the compound names and 'distance' is the distance between them, as computed by the given distance function. 'target_ids' is the compound id of the target. Query names are repeated for each matching target found. If asSimilarity is true, then instead of a "distance" column there will be a "similarity" column.
Kevin Horan
eiInit
eiMakeDb
eiPerformanceTest
library(snow)

r <- 50
d <- 40

# initialize
data(sdfsample)
dir = file.path(tempdir(), "query")
dir.create(dir)
eiInit(sdfsample, dir=dir, skipPriorities=TRUE)

# create compound db
runId = eiMakeDb(r, d, numSamples=20, dir=dir,
                 cl=makeCluster(1, type="SOCK", outfile=""))

# find compounds similar to each query
results = eiQuery(runId, sdfsample[1:2], K=15, dir=dir)
122 compounds in SDF format, stored as a list. Each element of the list is one line of text. This is just used in some unit tests.
The format is: chr [1:12222] "3540" " OpenBabel06051210572D" "" ...
This function is no longer needed with the new LSH package now in use. It will be made defunct in the next release.
Free the memory allocated by loadLSHData.
freeLSHData(lshData)
lshData | A pointer returned by loadLSHData.
No return value.
Kevin Horan
## Not run:
lshData = loadLSHData(r, d)
eiQuery(r, d, refIddb, c("650002","650003"), format="name", K=15, lshData=lshData)
eiQuery(r, d, refIddb, c("650004","650005"), format="name", K=15, lshData=lshData)
freeLSHData(lshData)
## End(Not run)
This function is no longer needed with the new LSH package now in use. It will be made defunct in the next release.
Load the LSH index and data. If many queries are going to be performed, it is advantageous to load this object first and then hand it to eiQuery via the lshData parameter for each query. If the data needs to be freed, you can call the freeLSHData function.
loadLSHData(r, d, W = NA, M = NA, L = NA, K = NA, T = NA, dir = ".", matrixFile = NULL)
r | The number of references used to build the database you wish to query against.
d | The number of dimensions used to build the database you wish to query against.
W | See |
M | See |
L | See |
K | See |
T | See |
dir | The directory where the "data" directory lives. Defaults to the current directory.
matrixFile | The path to the matrix file. If not specified, it will be looked for in the default location.
Returns a pointer to the allocated data. This should only be passed to other functions with an lshData parameter, such as eiQuery.
Kevin Horan
## Not run:
lshData = loadLSHData(r, d)
eiQuery(r, d, refIddb, c("650002","650003"), format="name", K=15, lshData=lshData)
eiQuery(r, d, refIddb, c("650004","650005"), format="name", K=15, lshData=lshData)
freeLSHData(lshData)
## End(Not run)
Set the default distance function for a descriptor type. This is the distance function that will be used if none is given for a particular function call.
setDefaultDistance(descriptorType, distance)
descriptorType | The type of the descriptor to set a distance function for. Built-in values are "ap" and "fp". Additional values can be set as well.
distance | A distance function taking two descriptor objects (as returned by toObject in a descriptor transform; see addTransform) and returning the distance between them.
No return value.
Kevin Horan
setDefaultDistance("ap", function(d1,d2) 1-cmp.similarity(d1,d2) )