--- title: "2 Evaluation using Trio" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{2 Evaluation using Trio} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, message=FALSE} library(BenchHub) library(tidyverse) library(glmnet) ``` # Import microbiome data The `Trio` object in `BenchHub` can take datasets provided by users. To demonstrate this workflow, we use the `lubomski_microbiome_data` example dataset distributed with `BenchHub`. We first load the example objects into a temporary environment to avoid placing `x` and `lubomPD` directly into the global workspace. The dataset contains two objects: `x`, a 575 by 1192 matrix of microbial abundances for 575 samples, and `lubomPD`, a binary factor indicating Parkinson's disease (`PD`) or healthy control (`HC`) status for each sample. ```{r import-data} # import the microbiome data into a temporary environment exampleEnv <- new.env(parent = emptyenv()) data("lubomski_microbiome_data", envir = exampleEnv, package = "BenchHub") x <- exampleEnv[["x"]] lubomPD <- exampleEnv[["lubomPD"]] # check the dimension of the microbiome matrix dim(x) # check the length of the patient status length(lubomPD) # Add sample IDs so the evidence matches the dataset rows by name. names(lubomPD) <- rownames(x) ``` The task we'll be evaluating uses a binary classification task where each sample is either a Parkinson's Disease (PD) patient or Healthy Control (HC). Once the data are ready to be inputted to `Trio`, we can load `Trio`. # Initialise a Trio object To initialise a Trio object, we use a `new()` method. ```{r initialise-trio} trio <- Trio$new(data = x, evidence = list( Diagnosis = list( evidence = lubomPD, metrics = "Balanced Accuracy" ) ), metrics = list( `Balanced Accuracy` = balAccMetric ), datasetID = "lubomski_microbiome" ) ``` The supporting evidence is the disease diagnosis, which is the `lubomPD` vector that we extracted from the `Lubomski` data above and the relevant metric to evaluate it is Balanced Accuracy. Next, we will use this microbiome data to illustrate two examples, 1) evaluation with cross validation, 2) evaluation without cross validation. # Cross validation - using the trio split function In this example, we show a typical classification scenario to classify the patient status, where we need to split the data into training and testing set. Step 1: Obtain relevant data for the model. To build a model, the following code will extract the data matrix and patient status outcome from `Trio` object. ```{r gold-standard} # get the gold standard from the Trio object x <- trio$data y <- trio$getEvidence("Diagnosis") ``` Step 2: For repeated cross-validation, the `split()` function will split the `y` vector (i.e., `evidence`) into the number of folds and repeats users want. In this case we used `n_fold = 2` and `n_repeat = 5` (i.e., 2-fold cross-validation with 10 repeats). Then users can get cross-validation indices (`CVindices`) for each sample, through `splitIndices`. This gives a simple list where each element represents a combination of folds and repeats for each sample. ```{r train-test} # get train and test indices trio$split(y = y, n_fold = 2, n_repeat = 5) CVindices <- trio$splitIndices ``` Step 5: Build the classification model and evaluate. We can now use a for loop to cross-validate evaluation results. Using the `CVindices`, user can subset to training and test data. As an example, we build a LASSO regression model as a classification model on the training data, then make predictions on the test data. Once predictions are obtained for the test set, we pass them to Trio using the same evidence name as stored in the Trio object (i.e., `Diagnosis`). Specifically, we call `trio$evaluate(list(lasso = list(Diagnosis = pred)))`, which instructs `evaluate()` to compare `pred` against the reference labels `Diagnosis` stored in `trio`, and then compute the specified metric (Balanced Accuracy). We first construct an explicit cross-validation plan that records fold and repeat identifiers, and then iterate over this plan to evaluate each split. ```{r cross-validation} set.seed(1234) library(tibble) cv_plan <- tibble( trainIDs = CVindices, fold = vapply( strsplit(names(CVindices), ".", fixed = TRUE), `[`, character(1), 1 ), repeat_id = vapply( strsplit(names(CVindices), ".", fixed = TRUE), `[`, character(1), 2 ) ) run_one_cv <- function(trainIDs, fold, repeat_id, trio, x, y) { x_train <- x[trainIDs, ] x_test <- x[-trainIDs, ] y_train <- y[trainIDs] y_test <- y[-trainIDs] cv_lasso <- glmnet::cv.glmnet( x = as.matrix(x_train), y = y_train, alpha = 1, family = "binomial" ) lam <- cv_lasso$lambda.1se fit <- glmnet::glmnet( x = as.matrix(x_train), y = y_train, alpha = 1, lambda = lam, family = "binomial" ) pred <- predict( fit, newx = as.matrix(x_test), s = lam, type = "class" ) pred <- setNames(as.factor(as.vector(pred)), rownames(x_test)) eval_res <- trio$evaluate(list(lasso = list(Diagnosis = pred))) # attach metadata explicitly eval_res$fold <- fold eval_res$repeat_id <- repeat_id eval_res } result_list <- vector("list", nrow(cv_plan)) for (i in seq_len(nrow(cv_plan))) { result_list[[i]] <- run_one_cv( trainIDs = cv_plan$trainIDs[[i]], fold = cv_plan$fold[[i]], repeat_id = cv_plan$repeat_id[[i]], trio = trio, x = x, y = y ) } ``` After cross-validation, we can visualise cross-validation results by averaging results across folds within each repeats. ```{r fig.cap = "Fig.1 Mean cross-validation accuracy across repeats."} result <- dplyr::bind_rows(result_list) result_summary <- result %>% dplyr::group_by(datasetID, method, evidence, metric, repeat_id) %>% dplyr::summarise(result = mean(result), .groups = "drop") # visualise the result boxplot( result_summary$result, ylab = "Accuracy", main = "Cross-validation performance" ) ``` # Without Cross-validation In this example, we show a simpler case without cross validation, where we have a prediction that we want to compare with the ground truth directly. For example, we may have simulated a single-cell count matrix and want to compare with an experimental count matrix on various attributes. Here, we use the microbiome dataset to demonstrate the above case. Step 1: Simulate a matrix of the same size as the microbiome data using negative binomial. Step 2: Define a function that calculate the sparsity of a given matrix. Step 3: Define a metric that compare difference of two values. Step 4: Evaluate the difference in sparsity of the microbiome data and the simulated data. ```{r simulate} set.seed(1) # generate a simulated matrix sim <- rnbinom(nrow(x) * ncol(x), size = 1, mu = 1) sim <- Matrix(sim, nrow = nrow(x), ncol = ncol(x)) # function that calculate the sparsity of a matrix calc_sparsity <- function(data) { sparsity <- sum(data == 0) / length(data) return(sparsity) } # metric that compare the difference of two values calc_diff <- function(evidence, predicted) { return(predicted - evidence) } ``` ```{r add-metric} # add metric that we just defined trio$addMetric(name = "Difference", metric = calc_diff) # calculate the sparsity of the data, and then input into the supporting evidence trio$addEvidence(name = "Sparsity", evidence = calc_sparsity, metrics = "Difference") # because we used negative binomial to simulate a matrix # we will name the method as "negative binomial" # then calculate the sparsity of the simulated matrix # and compare with the Sparsity supporting evidence eval_res <- trio$evaluate( list(negative_binomial = list(Sparsity = sim)) ) eval_res ``` From the evaluation result, we see there is a `r round(eval_res$result, 2)` difference in the sparsity between the microbiome data and the data simulated with negative binomial. # Importing result to BenchmarkInsights The result dataframe obtained from `Trio` evaluation can be passed to `BenchmarkInsights` for subsequent visualisation. `BenchmarkInsights` is another structure in `BenchHub` that provide a list of functions to analyse and visualise the benchmarking results from multiple perspectives. Note, to make the result dataframe compatible for BenchmarkInsights, please make sure the dataframe is in the format shown below, where the column names are "datasetID", "method", "evidence", "metric" , "result", and that there is one row for each result. ```{r get-result} # in the with cross validation result, we need to average the results from multiple repeats to give one value result <- result %>% dplyr::group_by(datasetID, method, evidence, metric) %>% dplyr::summarize(result = mean(result)) result <- rbind(result, eval_res) result ``` Please see Vignette 3 for the details on the visualisations and functions in `BenchmarkInsights`. # Session Info ```{r session-info} sessionInfo() ```