Package 'ssrch' reference manual

Title:	a simple search engine
Description:	Demonstrate tokenization and a search gadget for collections of CSV files.
Authors:	Vince Carey
Maintainer:	VJ Carey <[email protected]>
License:	Artistic-2.0
Version:	1.23.0
Built:	2025-03-24 05:38:27 UTC
Source:	https://github.com/bioc/ssrch

ssrch demo with metadata documents from 68 cancer transcriptomics studies

Description

ssrch demo with metadata documents from 68 cancer transcriptomics studies

Usage

ctxsearch()

Value

Simply starts an app.

Note

The metadata were derived by extracting sample.attributes fields from a search with github.com/seandavi/SRAdbV2. The sample.attributes content varies between studies and sometimes between experiments within studies. The field sets were unified with the sampleAtts function of github.com/vjcitn/HumanTranscriptomeCompendium. After unification, records were stacked and CSVs were written.

Examples

if (interactive()) {
  oask = options()$example.ask
  options(example.ask=FALSE)
  try(ctxsearch())
  options(example.ask=oask)
}

constructor for DocSet

Description

constructor for DocSet

Usage

DocSet(kw2docs = new.env(hash = TRUE), docs2recs = new.env(hash =
  TRUE), docs2kw = new.env(hash = TRUE), titles = character(),
  urls = character(), doc_retriever = function(...) NULL)

Arguments

`kw2docs`	an environment mapping keywords to documents
`docs2recs`	an environment mapping document identifiers to records
`docs2kw`	an environment mapping documents to keywords
`titles`	a named character vector with titles; names are document identifiers
`urls`	a named character vector with document-associated URLs; names are document identifiers
`doc_retriever`	a function that, given a document identifier, will produce the document

Value

instance of DocSet

Note

Titles must be bound in post hoc: parseDoc produces data including parsed titles but does not bind the title into the resulting object.
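
The keyword and document environments listed under Arguments can be pictured as hash maps keyed by strings. The following is a minimal base R sketch of that idea, not the package's internal code; the keywords, the document identifiers, and the helper add_keyword are hypothetical.

```r
# Hypothetical illustration of keyword <-> document maps held in environments.
kw2docs <- new.env(hash = TRUE)
docs2kw <- new.env(hash = TRUE)

# add_keyword is an illustrative helper, not part of ssrch.
# It records the pair (kw, doc) in both directions.
add_keyword <- function(kw, doc) {
  prev_docs <- if (exists(kw, envir = kw2docs, inherits = FALSE)) {
    get(kw, envir = kw2docs)
  } else {
    character()
  }
  assign(kw, union(prev_docs, doc), envir = kw2docs)

  prev_kws <- if (exists(doc, envir = docs2kw, inherits = FALSE)) {
    get(doc, envir = docs2kw)
  } else {
    character()
  }
  assign(doc, union(prev_kws, kw), envir = docs2kw)
  invisible(NULL)
}

add_keyword("melanoma", "docA")
add_keyword("melanoma", "docB")
add_keyword("lymphoma", "docB")

get("melanoma", envir = kw2docs)  # documents that mention the keyword
get("docB", envir = docs2kw)      # keywords found in the document
ls(docs2kw)                       # all document identifiers
```

Keeping both directions makes lookups by keyword and by document equally cheap, at the cost of updating two maps on every insertion.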

Examples

getClass("DocSet")

DocSet instance with metadata from 68 cancer studies

Description

DocSet instance with metadata from 68 cancer studies

Usage

docset_cancer68

Format

S4 class DocSet defined in ssrch

interactive app for ssrch DocSet instances

Description

interactive app for ssrch DocSet instances

Usage

docset_searchapp(docset, se = NULL, sefilter = function(se, ...) se)

Arguments

`docset`	an instance of DocSet
`se`	(defaults to NULL) an instance of SummarizedExperiment; samples will be filtered by selection method prescribed in sefilter
`sefilter`	a function accepting (se, ...) and returning a SummarizedExperiment

Value

Returns a list of data.frames with metadata on the requested studies. When 'se' is non-NULL the app can offer a SummarizedExperiment download, but that object is not yet returned to the session.

Note

The handling of SummarizedExperiments by this app is specialized. The 'sefilter' for the cancer example would be 'function(se, y) se[, which(se$study_accession %in% y)]'; it will be called with 'y' bound to the study accession numbers selected in the app.
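
The filter described in this note keeps the samples whose study accession is among those selected in the app. The sketch below shows that selection logic on a plain matrix, used here as a stand-in for a SummarizedExperiment so that it runs in base R. The data, the helper keep_samples, and the use of %in% for the membership test are assumptions made for illustration.

```r
# Stand-in for a SummarizedExperiment: a matrix of assay values plus a
# per-sample vector of study accessions (all values are made up).
assay_vals <- matrix(1:12, nrow = 2,
                     dimnames = list(c("g1", "g2"), paste0("s", 1:6)))
study_accession <- c("SRP1", "SRP1", "SRP2", "SRP3", "SRP3", "SRP3")

# keep_samples is hypothetical; it retains columns (samples) whose
# accession is among the selected accessions y.
keep_samples <- function(vals, accessions, y) {
  vals[, which(accessions %in% y), drop = FALSE]
}

sel <- keep_samples(assay_vals, study_accession, y = c("SRP1", "SRP3"))
colnames(sel)
```

With a real SummarizedExperiment the same column subsetting would be written against the object itself, with the accessions read from its sample-level metadata.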

Examples

if (interactive()) {
  oask = options()$example.ask
  options(example.ask=FALSE)
  n1 = try(docset_searchapp(ssrch::docset_cancer68))
  str(n1)
  options(example.ask=oask)
}

Container for simple documents with arbitrary numbers/shapes of records

Description

Container for simple documents with arbitrary numbers/shapes of records

utilities for ssrch

Usage

kw2docs(sdata)

docs2kw(sdata)

docs2recs(sdata)

searchDocs(string, obj, ...)

retrieve_doc(x, obj, ...)

Arguments

`sdata`	instance of DocSet class
`string`	character(1) query string
`obj`	instance of DocSet class
`...`	passed to base::grep
`x`	character(1) document identifier

Value

kw2docs, docs2kw, docs2recs: an environment

searchDocs: a data.frame with tokens queried (hits) and associated document ids (docs)

retrieve_doc: result of calling obj@doc_retriever on arguments x, ...
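
A search of this kind can be pictured as a grep over the keys of a keyword environment, followed by a lookup of the documents stored under each matching key. The sketch below is a base R approximation under that assumption and is not the searchDocs implementation; the environment contents and the helper search_keys are hypothetical.

```r
# Toy keyword -> documents environment (contents are made up).
kw2docs_env <- new.env(hash = TRUE)
assign("qwerty", "doc1", envir = kw2docs_env)
assign("query", c("doc2", "doc3"), envir = kw2docs_env)
assign("melanoma", "doc3", envir = kw2docs_env)

# search_keys is a hypothetical stand-in for searchDocs.
# Extra arguments in ... are passed to base::grep.
search_keys <- function(string, env, ...) {
  hits <- grep(string, ls(env), value = TRUE, ...)
  docs <- lapply(hits, function(h) get(h, envir = env))
  data.frame(hits = rep(hits, lengths(docs)),
             docs = as.character(unlist(docs)),
             stringsAsFactors = FALSE)
}

res1 <- search_keys("^que", kw2docs_env)
res2 <- search_keys("QWER", kw2docs_env, ignore.case = TRUE)
res3 <- search_keys("zzzz", kw2docs_env)  # no match: zero-row data.frame
res1
```

Each row pairs one matching token with one document, so a token stored for several documents appears on several rows.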

Examples

getClass("DocSet")

DocSet instance with metadata from 1009 cancer studies

Description

DocSet instance with metadata from 1009 cancer studies

Usage

ds_can1009b()

Format

S4 class DocSet defined in ssrch

Value

DocSet instance

Note

This is part of a sequence of datasets assessing how far we can go with environments of keywords. Annotation for 43000 samples is indexed here.

Examples

ds_can1009b()

parse a document and place content in a DocSet

Description

parse a document and place content in a DocSet

Usage

parseDoc(csv, DocSetInstance = new("DocSet"), doctitle = NA_character_,
  docabst = NA_character_, rec_id_field = "experiment.accession",
  exclude_fields = c("study.accession"),
  substrings_to_omit = c("http://purl.obolibrary.org/obo/"),
  patterns_to_kill = "....-..-..|.*...,...",
  token_fixups = list(c("t''", "t'"), c(":$", "")), max_tok_nchar = 25,
  min_tok_nchar = 4, cleanFields = list("..*id$", ".name$", "_name$",
  "checksum", "isolate", "filename", "^ID$", "barcode", "Sample.Name"))

Arguments

`csv`	a character(1) CSV file path
`DocSetInstance`	if missing, a DocSet is initialized in this function; otherwise the supplied instance is updated with new content
`doctitle`	character(1) document title
`docabst`	character(1) a string: the document abstract
`rec_id_field`	character(1) field in CSV identifying records
`exclude_fields`	character vector of fields to ignore while parsing
`substrings_to_omit`	character vector of strings to remove from candidate keywords via gsub
`patterns_to_kill`	character(1) regexp that identifies tokens to be omitted from keyword set
`token_fixups`	a list of character(2) vectors that will be used with gsub() to repair irregularities. For example 'c("t''", "t'")' will transform 'Burkitt''s' to 'Burkitt's'
`max_tok_nchar`	numeric(1) defaults to 25, tokens with more characters will be truncated to this length and suffixed with ellipsis
`min_tok_nchar`	numeric(1) defaults to 4, tokens shorter than this are not in index
`cleanFields`	list of regular expressions identifying fields to ignore
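
The token-related arguments above can be read as a small cleaning pipeline. The sketch below applies them to a vector of candidate tokens in base R. It is not the parseDoc implementation: the helper clean_tokens, the order of the steps, the input tokens, and the literal "..." used as the ellipsis are assumptions.

```r
# clean_tokens is a hypothetical helper; defaults are copied from the
# parseDoc usage, but the order of the steps is an assumption.
clean_tokens <- function(tokens,
                         substrings_to_omit = c("http://purl.obolibrary.org/obo/"),
                         patterns_to_kill = "....-..-..|.*...,...",
                         token_fixups = list(c("t''", "t'"), c(":$", "")),
                         max_tok_nchar = 25, min_tok_nchar = 4) {
  # 1. strip unwanted substrings (literal match)
  for (s in substrings_to_omit) tokens <- gsub(s, "", tokens, fixed = TRUE)
  # 2. repair irregularities with gsub(pattern, replacement)
  for (fx in token_fixups) tokens <- gsub(fx[1], fx[2], tokens)
  # 3. drop tokens matching the kill pattern (for example, dates)
  tokens <- tokens[!grepl(patterns_to_kill, tokens)]
  # 4. drop tokens that are too short
  tokens <- tokens[nchar(tokens) >= min_tok_nchar]
  # 5. truncate tokens that are too long and mark the truncation
  too_long <- nchar(tokens) > max_tok_nchar
  if (any(too_long)) {
    tokens[too_long] <- paste0(substr(tokens[too_long], 1, max_tok_nchar), "...")
  }
  unique(tokens)
}

toks <- clean_tokens(c("http://purl.obolibrary.org/obo/UBERON_0002048",
                       "Burkitt''s", "tissue:", "2015-06-01", "abc"))
toks
```

In this run the ontology URL prefix is stripped, the doubled apostrophe and trailing colon are repaired, the date is killed by the pattern, and the three-character token falls below the minimum length.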

Value

instance of DocSet

Note

The expected use case has 'DocSetInstance' being updated in a loop. Sharing of environments across multiple DocSetInstances can occur and unexpected behaviors may ensue. Note also that many of the parameter defaults to parseDoc are for the use case of processing SRA metadata.
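
The caution about shared environments follows from R's reference semantics: an environment is not copied on assignment, so two containers holding the same environment see each other's updates. The base R sketch below uses plain lists as hypothetical stand-ins for DocSet instances.

```r
# Two list "containers" that hold the same environment (illustrative only).
shared_env <- new.env(hash = TRUE)
set_a <- list(kw2docs = shared_env)
set_b <- list(kw2docs = shared_env)  # no copy is made: same environment

assign("melanoma", "doc1", envir = set_a$kw2docs)

# The update made through set_a is visible through set_b.
seen_in_b <- exists("melanoma", envir = set_b$kw2docs, inherits = FALSE)
same_env <- identical(set_a$kw2docs, set_b$kw2docs)

# A fresh environment keeps a third container independent.
set_c <- list(kw2docs = new.env(hash = TRUE))
seen_in_c <- exists("melanoma", envir = set_c$kw2docs, inherits = FALSE)

c(seen_in_b = seen_in_b, same_env = same_env, seen_in_c = seen_in_c)
```

When independent indexes are wanted, supplying freshly created environments for each container avoids this coupling.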

Examples

myob = ssrch::docset_cancer68
td = tempdir()
alld = ls(docs2kw(myob))
r1 = retrieve_doc(alld[1], myob)
expo = write.csv(r1, paste0(td, "/expo.csv"))
pd = parseDoc(paste0(td, "/expo.csv"), doctitle=ssrch::titles68[alld[1]],
    docabst="qwerty")
pd
searchDocs("quer", pd) # query will fail
searchDocs("qwer", pd) # should succeed

publication dates for 6000 SRA transcriptome studies

Description

publication dates for 6000 SRA transcriptome studies

Usage

study_publ_dates

Format

data.frame

titles for 68 cancer studies

Description

titles for 68 cancer studies

Usage

titles68

Format

character vector

pubmed URLs for subset of 68 cancer studies

Description

pubmed URLs for subset of 68 cancer studies

Usage

urls68

Format

character vector
