Title: | a simple search engine |
---|---|
Description: | Demonstrate tokenization and a search gadget for collections of CSV files. |
Authors: | Vince Carey |
Maintainer: | VJ Carey <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.23.0 |
Built: | 2024-10-31 05:29:21 UTC |
Source: | https://github.com/bioc/ssrch |
ssrch demo with metadata documents from 68 cancer transcriptomics studies
ctxsearch()
Simply starts an app.
The metadata were derived by extracting sample.attributes fields from a search with github.com/seandavi/SRAdbV2. The sample.attributes content varies between studies and sometimes between experiments within studies. The field sets were unified with the sampleAtts function of github.com/vjcitn/HumanTranscriptomeCompendium. After unification, the records were stacked and CSVs were written.
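As a rough illustration of the unify-then-stack step, the following base R sketch pads two per-study tables to a common set of columns and stacks them. It is not the sampleAtts implementation; the helper `unify_fields`, the toy data, and the output file name are hypothetical.

```r
# Two toy per-study tables whose attribute fields differ (hypothetical data).
s1 <- data.frame(experiment.accession = c("X1", "X2"),
                 tissue = c("breast", "lung"),
                 stringsAsFactors = FALSE)
s2 <- data.frame(experiment.accession = "X3",
                 cell.line = "HeLa",
                 stringsAsFactors = FALSE)

# Hypothetical helper: add any missing fields as NA and fix the column order.
unify_fields <- function(df, fields) {
  for (f in setdiff(fields, names(df))) df[[f]] <- NA_character_
  df[, fields, drop = FALSE]
}

all_fields <- union(names(s1), names(s2))
stacked <- do.call(rbind, lapply(list(s1, s2), unify_fields, fields = all_fields))

# Write the stacked records to a CSV in a temporary directory.
out <- file.path(tempdir(), "stacked_demo.csv")
write.csv(stacked, out, row.names = FALSE)
```

Padding with NA keeps every record even when a study lacks a field, at the cost of sparse columns.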
if (interactive()) {
  oask = options()$example.ask
  options(example.ask=FALSE)
  try(ctxsearch2())
  options(example.ask=oask)
}
constructor for DocSet
DocSet(kw2docs = new.env(hash = TRUE), docs2recs = new.env(hash = TRUE), docs2kw = new.env(hash = TRUE), titles = character(), urls = character(), doc_retriever = function(...) NULL)
kw2docs | an environment mapping keywords to documents |
docs2recs | an environment mapping document identifiers to records |
docs2kw | an environment mapping documents to keywords |
titles | a named character vector with titles; names are document identifiers |
urls | a named character vector with document-associated URLs; names are document identifiers |
doc_retriever | a function that, given a document identifier, will produce the document |
instance of DocSet
Titles must be bound post hoc: parseDoc produces data that include parsed titles, but it does not bind the title into the resulting object.
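To make the keyword and document mappings concrete, here is a minimal base R sketch of the same idea using plain environments. It does not use the DocSet class; the document identifiers, tokens, titles, and the helper `index_doc` are hypothetical.

```r
# Plain environments standing in for the kw2docs and docs2kw mappings.
kw2docs <- new.env(hash = TRUE)
docs2kw <- new.env(hash = TRUE)

# Hypothetical helper: register a document and its keywords in both maps.
index_doc <- function(doc_id, keywords) {
  assign(doc_id, keywords, envir = docs2kw)
  for (kw in keywords) {
    prior <- if (exists(kw, envir = kw2docs, inherits = FALSE)) {
      get(kw, envir = kw2docs)
    } else {
      character()
    }
    assign(kw, union(prior, doc_id), envir = kw2docs)
  }
  invisible(NULL)
}

index_doc("DOC1", c("melanoma", "biopsy"))
index_doc("DOC2", c("melanoma", "xenograft"))

# Titles are kept separately as a named character vector keyed by document id.
titles <- c(DOC1 = "Toy study one", DOC2 = "Toy study two")

get("melanoma", envir = kw2docs)          # documents mentioning the keyword
titles[get("biopsy", envir = kw2docs)]    # look up titles for matching documents
```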
getClass("DocSet")
DocSet instance with metadata from 68 cancer studies
docset_cancer68
S4 class DocSet defined in ssrch
interactive app for ssrch DocSet instances
docset_searchapp(docset, se = NULL, sefilter = function(se, ...) se)
docset | an instance of DocSet |
se | (defaults to NULL) an instance of SummarizedExperiment; samples will be filtered by selection method prescribed in sefilter |
sefilter | a function accepting (se, ...) and returning a SummarizedExperiment |
Returns a list of data.frames holding metadata on the requested studies. The app can offer a SummarizedExperiment download when 'se' is non-NULL, but the downloaded object is not yet returned to the session.
The handling of SummarizedExperiments by this app is specialized. The 'sefilter' for the cancer example would be 'function(se, y) se[, which(se$study_accession %in% y)]'; it will be called with 'y' bound to the study accession numbers selected in the app.
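A filter of this kind keeps the samples whose study accession is among those chosen in the app. The selection logic can be tried without the app or Bioconductor: the sketch below applies it to a plain data.frame of sample annotations, which stands in for the column data of a SummarizedExperiment. The accession values are invented and `pick_samples` is a hypothetical name.

```r
# Sample annotations standing in for the column data of a SummarizedExperiment
# (values are invented for illustration).
samples <- data.frame(
  sample_id = c("s1", "s2", "s3", "s4"),
  study_accession = c("SRP001", "SRP002", "SRP001", "SRP003"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: indices of samples whose study is among the selected ones.
pick_samples <- function(annot, y) which(annot$study_accession %in% y)

keep <- pick_samples(samples, y = c("SRP001", "SRP003"))
samples[keep, ]
```

With a real SummarizedExperiment the same index would be applied to the column dimension, since columns correspond to samples.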
if (interactive()) {
  oask = options()$example.ask
  options(example.ask=FALSE)
  n1 = try(docset_searchapp(ssrch::docset_cancer68))
  str(n1)
  options(example.ask=oask)
}
Container for simple documents with arbitrary numbers/shapes of records
utilities for ssrch
kw2docs(sdata)
docs2kw(sdata)
docs2recs(sdata)
searchDocs(string, obj, ...)
retrieve_doc(x, obj, ...)
sdata | instance of srchData class |
string | character(1) query string |
obj | instance of DocSet class |
... | passed to base::grep |
x | character(1) document identifier |
kw2docs, docs2kw, docs2recs: an environment
searchDocs: a data.frame with tokens queried (hits) and associated document ids (docs)
retrieve_doc: result of calling obj@doc_retriever on arguments x, ...
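The following base R sketch mimics the shape of a keyword search over an environment: it matches the query against the keys with grep() and returns a data.frame with `hits` and `docs` columns. It is an assumption about how such a search could work, not the package's code; `toy_search` and the toy index are hypothetical.

```r
# Toy keyword-to-documents index (hypothetical contents).
kw_index <- new.env(hash = TRUE)
assign("melanoma", c("DOC1", "DOC2"), envir = kw_index)
assign("xenograft", "DOC2", envir = kw_index)
assign("biopsy", "DOC1", envir = kw_index)

# Hypothetical search: match the query against the keys, then expand to one
# row per (token, document) pair. Extra arguments are passed on to grep().
toy_search <- function(string, index, ...) {
  hits <- grep(string, ls(index), value = TRUE, ...)
  if (length(hits) == 0) {
    return(data.frame(hits = character(), docs = character(),
                      stringsAsFactors = FALSE))
  }
  docs <- lapply(hits, function(h) get(h, envir = index))
  data.frame(hits = rep(hits, lengths(docs)),
             docs = unlist(docs),
             stringsAsFactors = FALSE)
}

toy_search("melan", kw_index)                      # partial match on a key
toy_search("XENO", kw_index, ignore.case = TRUE)   # option forwarded to grep()
toy_search("zzz", kw_index)                        # no match: zero-row result
```

Because the query is treated as a regular expression, a substring of a token is enough to produce a hit.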
getClass("DocSet")
DocSet instance with metadata from 1009 cancer studies
ds_can1009b()
S4 class DocSet defined in ssrch
DocSet instance
This is part of a sequence of datasets assessing how far we can go with environments of keywords. Annotation for 43000 samples is indexed here.
ds_can1009b()
parse a document and place content in a DocSet
parseDoc(csv, DocSetInstance = new("DocSet"), doctitle = NA_character_, docabst = NA_character_, rec_id_field = "experiment.accession", exclude_fields = c("study.accession"), substrings_to_omit = c("http://purl.obolibrary.org/obo/"), patterns_to_kill = "....-..-..|.*...,...", token_fixups = list(c("t''", "t'"), c(":$", "")), max_tok_nchar = 25, min_tok_nchar = 4, cleanFields = list("..*id$", ".name$", "_name$", "checksum", "isolate", "filename", "^ID$", "barcode", "Sample.Name"))
csv | a character(1) CSV file path |
DocSetInstance | if missing, a DocSet is initialized in this function; otherwise the instance is updated with new content |
doctitle | character(1) document title |
docabst | character(1) the document abstract |
rec_id_field | character(1) field in CSV identifying records |
exclude_fields | character vector of fields to ignore while parsing |
substrings_to_omit | character vector of strings to remove from candidate keywords via gsub |
patterns_to_kill | character(1) regexp that identifies tokens to be omitted from keyword set |
token_fixups | a list of character(2) vectors that will be used with gsub() to repair irregularities. For example, c("t''", "t'") will transform "Burkitt''s" to "Burkitt's" |
max_tok_nchar | numeric(1), defaults to 25; tokens with more characters will be truncated to this length and suffixed with an ellipsis |
min_tok_nchar | numeric(1), defaults to 4; tokens shorter than this are not included in the index |
cleanFields | list of regular expressions identifying fields to ignore |
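The token-related arguments can be pictured as a small cleaning pipeline. The sketch below is a base R approximation under stated assumptions, not parseDoc's actual code: the order of the steps, the split on whitespace, the literal (fixed) removal of substrings, and the helper `clean_tokens` are all hypothetical. The default values are copied from the usage above.

```r
# Hypothetical helper approximating the token-cleaning arguments.
clean_tokens <- function(text,
                         substrings_to_omit = c("http://purl.obolibrary.org/obo/"),
                         patterns_to_kill = "....-..-..|.*...,...",
                         token_fixups = list(c("t''", "t'"), c(":$", "")),
                         max_tok_nchar = 25, min_tok_nchar = 4) {
  # Remove unwanted substrings (treated literally in this sketch).
  for (s in substrings_to_omit) text <- gsub(s, "", text, fixed = TRUE)
  # Split into candidate tokens on whitespace (an assumption).
  toks <- unlist(strsplit(text, "[[:space:]]+"))
  # Apply each (pattern, replacement) fixup with gsub().
  for (fx in token_fixups) toks <- gsub(fx[1], fx[2], toks)
  # Drop tokens matching the kill pattern and tokens that are too short.
  toks <- toks[!grepl(patterns_to_kill, toks)]
  toks <- toks[nchar(toks) >= min_tok_nchar]
  # Truncate long tokens and mark the truncation with an ellipsis.
  long <- nchar(toks) > max_tok_nchar
  toks[long] <- paste0(substr(toks[long], 1, max_tok_nchar), "...")
  unique(toks)
}

clean_tokens("Burkitt''s lymphoma: collected 2015-03-01 http://purl.obolibrary.org/obo/UBERON_0002371")
```

In this example the date-like token is removed by the kill pattern, the trailing colon and doubled quote are repaired by the fixups, and the ontology URL prefix is stripped before tokenizing.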
instance of DocSet
The expected use case has 'DocSetInstance' being updated in a loop. Sharing of environments across multiple DocSetInstances can occur and unexpected behaviors may ensue. Note also that many of the parameter defaults to parseDoc are for the use case of processing SRA metadata.
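The caution about shared environments follows from R's reference semantics: copying an object that holds an environment copies the reference, not the contents. A minimal base R illustration, using plain lists rather than DocSet instances (the names `a`, `b`, and `fresh` are hypothetical):

```r
# Two containers that hold the same environment.
shared <- new.env(hash = TRUE)
a <- list(kw2docs = shared)
b <- a   # copies the list, but not the environment inside it

# Updating through one container is visible through the other.
assign("melanoma", "DOC1", envir = a$kw2docs)
exists("melanoma", envir = b$kw2docs, inherits = FALSE)       # TRUE

# A container built around a fresh environment is unaffected.
fresh <- list(kw2docs = new.env(hash = TRUE))
exists("melanoma", envir = fresh$kw2docs, inherits = FALSE)   # FALSE
```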
myob = ssrch::docset_cancer68
td = tempdir()
alld = ls(docs2kw(myob))
r1 = retrieve_doc(alld[1], myob)
expo = write.csv(r1, paste0(td, "/expo.csv"))
pd = parseDoc(paste0(td, "/expo.csv"), doctitle=ssrch::titles68[alld[1]], docabst="qwerty")
pd
searchDocs("quer", pd) # query will fail
searchDocs("qwer", pd) # should succeed
publication dates for 6000 SRA transcriptome studies
study_publ_dates
data.frame
titles for 68 cancer studies
titles68
character vector
PubMed URLs for subset of 68 cancer studies
urls68
character vector