---
title: "1 Introduction to the Trio Class"
date: "`r BiocStyle::doc_date()`"
output:
  BiocStyle::html_document:
    toc_float: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{1 Introduction to the Trio Class}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE
)
```

```{r load-package, message=FALSE, warning=FALSE}
library(BenchHub)
```

# Creating Trio object

The Trio class is designed to facilitate the storing and sharing of benchmarking datasets. Each Trio is structured around a single dataset but can include multiple metrics and multiple pieces of supporting evidence (such as references or gold standards).

Trio objects can be created using the `Trio$new` constructor. There are 3 ways to create a Trio object:

- Curated Trio Datasets
- Source and ID
- Load an object directly

If the dataset can't be loaded using Trio's built-in loader, a custom loader can be provided.

## Method 1: BenchHub datasets

You can directly use the name from the [BenchHub Datasets](https://docs.google.com/spreadsheets/d/1H8hOxL8D0XTquao8vGZ2cr9-XeaFC48SWAdFn0M3fkg/edit?usp=sharing) sheet to initialise a Trio object populated with some metrics and supporting evidence. This method is useful when you want to quickly start with a predefined dataset.

```{r download-trio}
tempCache <- tempdir()
trio <- downloadSubmissionTrio("D001", cachePath = tempCache)
trio
```

The above output shows that we have a Trio with a dataset, metrics, and supporting evidence. The dataset contains 4 observations of 4 variables, and the metrics and supporting evidence are already populated and printed.

## Method 2: Source and ID

Trio objects can be created by specifying an ID from a source with a valid Trio downloader. This method is useful when you have a specific dataset ID from a supported source like Figshare, GEO, or ExperimentHub.
The identifier format for each source is:

- figshare: `Trio$new("figshare:figshareID[/fileID]")`
  - `fileID` can optionally be provided to select a specific file in the collection.
- GEO: `Trio$new("geo:GSEID[/Supplementary_filename]")`
  - `Supplementary_filename` can optionally be provided to select a specific supplementary file in the series.
- experimenthub: `Trio$new("experimenthub:experimenthubID")`

The example below creates a Trio object from a Figshare dataset, specifying both the Figshare ID and a file ID.

```{r create-trio, message=FALSE, warning=FALSE}
trioA <- Trio$new(
  "figshare:26142922/47361079",
  evidenceColumns = c("time", "status"),
  task = "Risk Estimation",
  metrics = list(
    "Harrell C-index" = harrelCIndexMetric,
    "Begg C-index" = beggCIndexMetric
  ),
  cachePath = tempCache
)
trioA
```

## Method 3: Load an object directly

A Trio can also be created by passing an object directly into the constructor. This method is useful when you already have a dataset loaded in your R environment and want to use it with Trio. Below is an example using a microbiome dataset.

```{r load-object}
exampleEnv <- new.env(parent = emptyenv())
data("lubomski_microbiome_data", envir = exampleEnv, package = "BenchHub")
# Add sample IDs so the evidence matches the dataset rows by name.
names(exampleEnv[["lubomPD"]]) <- rownames(exampleEnv[["x"]])

trioB <- Trio$new(
  data = exampleEnv[["x"]],
  evidence = list(
    `Diagnosis` = list(
      evidence = exampleEnv[["lubomPD"]],
      metrics = "Balanced Accuracy"
    )
  ),
  metrics = list(`Balanced Accuracy` = balAccMetric),
  datasetID = "lubomski_microbiome"
)
trioB
```

In an interactive session, creating `trioB` displays the prompt `Briefly describe the dataset:`; the text you enter is stored in the Trio object's description field.
This is useful for recording any pertinent information that is not captured in the metadata spreadsheet.

## Bonus: Using a custom loader

Trio supports custom loaders for data formats that its built-in loaders do not handle. A loader is any function that takes in a path, provided by a downloader, and returns an object to be loaded into Trio.

Below, we use anonymous functions that combine `GEOquery::getGEO` with `Biobase::exprs` and `Biobase::phenoData`. Each receives the path of the downloaded file, and together they extract the gene expression values and the patient classes from different tables.

```{r custom-loader, eval=FALSE}
trioGEO <- Trio$new(
  "GEO:GSE46474",
  dataLoader = \(path) Biobase::exprs(GEOquery::getGEO(filename = path)),
  task = "Rejection Prediction",
  evidenceLoader = \(path) Biobase::phenoData(GEOquery::getGEO(filename = path))[["procedure status:ch1"]],
  metrics = list(`Balanced Accuracy` = balAccMetric),
  cachePath = tempdir()
)
trioGEO
```

# Adding Components to Trio Sequentially

## Adding metrics

In benchmarking studies, a metric refers to the measurement used to evaluate a specific task. In this example, we define a task called survival model prediction.

In Trio, a metric is any pairwise function of the form `f(expected, predicted)` which returns a single value. We can also add metrics that have additional arguments by passing a list of arguments to the `args` parameter.

```{r add-metric}
eq <- \(expected, predicted, inequality = FALSE) {
  if (inequality) {
    return(expected != predicted)
  }
  expected == predicted
}

trio$addMetric("equality", eq)

# Trio also supports passing through arguments to a metric
# Note: parameter names added for clarity
trio$addMetric(
  name = "inequality",
  metric = eq,
  args = list(inequality = TRUE)
)
```

In the above example, we added two metrics based on the same function: "equality" and "inequality". The "equality" metric checks if the expected and predicted values are equal, while the "inequality" metric checks if they are not equal.
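
Passing arguments through to a metric can be pictured as a small closure that fixes the extra arguments while keeping the `f(expected, predicted)` shape. The sketch below is an illustration only and is not BenchHub's code; `bindMetricArgs`, `eqSketch`, and `neqSketch` are hypothetical names introduced here.

```{r metric-wrapper-sketch}
# Hypothetical helper (not part of BenchHub): fix the extra arguments of a
# metric so the result can still be called as f(expected, predicted).
bindMetricArgs <- function(metric, args = list()) {
  function(expected, predicted) {
    do.call(metric, c(list(expected, predicted), args))
  }
}

# A stand-alone copy of the equality metric so the sketch is self-contained.
eqSketch <- function(expected, predicted, inequality = FALSE) {
  if (inequality) expected != predicted else expected == predicted
}

# The bound metric no longer needs the `inequality` argument at call time.
neqSketch <- bindMetricArgs(eqSketch, list(inequality = TRUE))
neqSketch(1, 2)
```

Here `neqSketch(1, 2)` is equivalent to calling `eqSketch(1, 2, inequality = TRUE)`.
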
Under the hood, Trio creates a wrapper function that calls the metric function with the specified arguments.

```{r see-metric}
trio$metrics$inequality
```

# Other Features

## Caching

Trio uses caching to avoid lengthy downloads after the first time a dataset is accessed. The `cachePath` parameter specifies the path to the cache directory. If not specified, the cache directory defaults to `~/.cache/R/BenchHub/`.

## Data Splitting

Trio supports data splitting for cross-validation. The `split` method takes the outcome variable together with the number of folds and repeats, and partitions the samples into training and test sets. The indices are generated using the `splitTools` package and stored in the `splitIndices` attribute. The `stratify` parameter controls whether the splits are stratified by the outcome variable.

```{r split-function}
trio$split(y = 1:137, n_fold = 2, n_repeat = 5, seed = 1234, stratify = FALSE)
trio$splitIndices
```

# Conclusion

In this vignette, we introduced the Trio class and demonstrated how to create a Trio object using curated datasets, source and ID, or loading an object directly. We also showed how to add metrics to a Trio object, configure caching, and generate cross-validation splits. We hope this vignette helps you get started with Trio and conduct benchmarking studies more effectively.

# Session Info

```{r session-info}
sessionInfo()
```