Extending MetMashR

Introduction

There is a number software solutions available to annotate an LCMS metabolomics datasets. We have only included a small number annotation_sources in MetMashR as they are the ones we use the most.

In this document we describe how to extend the provided templates to include new sources bespoke to your requirements.

Getting Started

The latest versions of struct and MetMashR that are compatible with your current R version can be installed using BiocManager.

# install BiocManager if not present
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# install MetMashR and dependencies
BiocManager::install("MetMashR")

Once installed you can activate the packages in the usual way:

# load the packages
library(struct)
library(MetMashR)
library(metabolomicsWorkbenchR)
library(ggplot2)

Example study

In this example we are going to import some annotations from a study on Metabolomics Workbench and build a workflow to clean up the table and search for PubChem CIDs so that we can present some images of some of the annotated metabolites for the study.

First, we need to import the table of annotations from Metabolomics Workbench. For this we make use of the metabolomicsWorkbenchR package.

# get annotations
AN <- do_query(
    context = "study",
    input_item = "analysis_id",
    input_value = "AN000465",
    output_item = "metabolites"
)

The imported table has 747 rows and 8 columns. For brevity in this vignette we have cached the first 10 rows and stored them in the package.

We can convert the imported table into an annotation_table:

AT <- annotation_table(data = AN, id_column = NULL)

All annotation_table objects must have an id for each row in the table. This is so that later, when examining the outputs of workflow steps, you can easily trace an annotation through the workflow. The default id_column = NULL will create a new column ‘.MetMashR_id’ containing the row index as an identifier unless provided with an alternative column name, which should be the name of a column already in the table.

Two steps needed to clean up this table are not provided by MetMashR:

  • removal of empty columns
  • removal of suffix from duplicate molecule names

We will implement these steps here as an example of how to add new workflow steps using the struct package template system. We will also implement an new annotation_source to import the data from Metabolomics Workbench as part of the workflow.

Annotation Sources

Annotation sources are the mechanism used by MetMashR to get tables of annotation data into R and into a well-defined format. Each extension of the base annotation_source template provides methods to import the raw annotation table and convert it into a type of annotation_table suitable for the input data. For example, the cd_source defines methods to import annotation data from Compound Discoverer’s Excel format and convert it into a lcms_table object.

Adding new annotation sources

If we plan to import annotation data from Metabolomics Workbench a lot we could consider implementing a new annotation_source that would import the annotations as part of a workflow model sequence.

We will do this now as an example of implementing a new annotation_source.

All annotation_source objects use the model template from the struct package, so we could create the definition of a source on-the-fly. However, annotation_sources are likely to be used multiple times to read in data from the same source, so instead we define the new source in a more permanent way, by using code that can be included in a script to be sourced, or in a new R package.

There are three key components to an annotation source object, or indeed any struct model object:

  1. A definition of the object which defines what the input and output parameters will be.
  2. A function to create an instance of the object and populate initial values for input parameters.
  3. A model_apply method. For annotation sources this method will read in data from the source file and parse it into an annotation_table.

We will implement each of these steps now.

The class definition

In this example there are no input and output slots, but they could be defined using the slots parameter, and then differentiated by name using .params and .outputs in the prototype.

We have defined libraries in the prototype. This is a list of R package names that are needed to use this object in addition to the depends/imports of MetMashR.

.mwb_source <- setClass(
    "mwb_source",
    contains = c("annotation_database"),
    prototype = list(
        name = "Import from Metabolomics Workbench",
        libraries = "metabolomicsWorkbenchR"
    )
)

The calling function

The calling function is a small function that creates a new instance of the source object using new_struct and provides a mechanism to initialise values based on input parameters.

The use of ellipsis ... allows us to pass additional values to the base struct object like name and description without including them in the function definition, in case we want to override the values defined in the prototype of the class definition in the previous section.

mwb_source <- function(...) {
    # new object
    out <- new_struct(
        "mwb_source",
        ...
    )
    return(out)
}

The import method

Like all model objects, the workhorse method for annotation sources is method_apply. For annotation sources this method is used to import and parse the source files into an annotation_table.

Here we use setMethod to define the method for the new source.

setMethod(
    f = "read_database",
    signature = c("mwb_source"),
    definition = function(obj) {
        ## get annotations using metabolomicsWorkbenchR
        # AN = do_query(
        #    context = "study",
        #    input_item = "analysis_id",
        #    input_value = M$analysis_id,
        #    output_item = "metabolites")

        ## for vignette use locally cached subset
        AN <- readRDS(
            system.file("extdata/AN000465_subset.rds", package = "MetMashR")
        )

        return(AN)
    }
)

Note how we pass M$analysis id to the do_query function so that we can import annotations from any Metabolomics Workbench study when using our new source object.

Using the new source

The new mwb_source and method_apply method are ready to use. To import the table we can use the import_source method:

# initialise source
SRC <- mwb_source(
    source = "AN000465"
)

# import
AT <- read_source(SRC)

Variable AT is the imported and cleaned annotation_database from Metabolomics Workbench and can be used as input to model_apply for other models and sequence, e.g. to look up PubChem Ids etc.

Adding annotation workflow steps

MetMashR workflow steps use the model template from the struct package. Although originally intended for statistical methods, the model template is a flexible way to implement many different kinds of workflow step.

The struct package provides convenience functions for “on-the-fly” implementation of new model objects. Here will will use the convenience functions to implement the first new model. Examples of this are also included in the struct package vignette.

Empty column removal

To define a new model object we can use the set_struct_obj function. The new object is fairly simple in that no input parameters are required. The only output slot will contain the annotation_table after we have removed the empty columns.

For the prototype input we provide a name and a description for our new model. This is good practice as our intentions for this model are explicitly stated and stay with the model wherever we use it, helping to ensure transparency, reproducibility etc. We also defined the default output of the model by providing predicted in the prototype.

set_struct_obj(
    class_name = "drop_empty_columns",
    struct_obj = "model",
    params = character(0),
    outputs = c(updated = "annotation_source"),
    private = character(0),
    prototype = list(
        name = "Drop empty columns",
        description = paste0(
            "A workflow step that removes columns from an annotation table ",
            "where all rows are NA."
        ),
        predicted = "updated"
    )
)

We can now create an instance of our new model, and use the show method to display information about it.

M <- drop_empty_columns()
show(M)
#> A "drop_empty_columns" object
#> -----------------------------
#> name:          Drop empty columns
#> description:   A workflow step that removes columns from an annotation table where all rows are NA.
#> outputs:       updated 
#> predicted:     updated
#> seq_in:        data

Before we can use our new model we need to implement the model_apply method for it. This is the function that actually does all the work. In this case it will search for columns of missing values and then remove them from the annotation_table. To define this method we use the set_obj_method function.

set_obj_method(
    class_name = "drop_empty_columns",
    method_name = "model_apply",
    signature = c("drop_empty_columns", "annotation_source"),
    definition = function(M, D) {
        # search for columns of NA
        W <- lapply( # for each column
            D$data, # in the annotation table
            function(x) {
                all(is.na(x)) # return TRUE if all rows are NA
            }
        )

        # get index of columns with all rows NA
        idx <- which(unlist(W))

        # if any found, remove from annotation table
        if (length(idx) > 0) {
            D$data[, idx] <- NULL
        }

        # update model object
        M$updated <- D

        # return object
        return(M)
    }
)

Note that in the signature we specify that the second input is an annotation_table.

The new model is ready to use. We can test it using the model_apply method, and check that some columns have been removed.

M <- model_apply(M, AT)

The number of columns before was:

ncol(AT$data)
#> [1] 8

The number of columns afterwards is:

ncol(M$updated$data)
#> [1] 5

Suffix removal

The second model removes the suffix from the molecule names. The is necessary if we want to use the molecule names to e.g search for PubChem identifiers using the REST API; we wont get a match if the suffix is part of the molecule name.

As we did for drop_empty_columns we use the set_struct_obj and set_obj_method function to create our new workflow step.

# define new model object
set_struct_obj(
    class_name = "remove_suffix",
    struct_obj = "model",
    params = c(clean = "logical", column_name = "character"),
    outputs = c(updated = "annotation_source"),
    prototype = list(
        name = "Remove suffix",
        description = paste0(
            "A workflow step that removes suffixes from molecule names by ",
            "splitting a string at the last underscore an retaining the part",
            "of the string before the underscore."
        ),
        predicted = "updated",
        clean = FALSE,
        column_name = "V1"
    )
)

# define method for new object
set_obj_method(
    class_name = "remove_suffix",
    method_name = "model_apply",
    signature = c("remove_suffix", "annotation_source"),
    definition = function(M, D) {
        # get list of molecule names
        x <- D$data[[M$column_name]]

        # split string at last underscore
        s <- strsplit(x, "_(?!.*_)", perl = TRUE)

        # get left hand side
        s <- lapply(s, "[", 1)

        # if clean replace existing column, otherwise new column
        if (M$clean) {
            D$data[[M$column_name]] <- unlist(s)
        } else {
            D$data$name.fixed <- unlist(x)
        }

        # update model object
        M$updated <- D

        # return object
        return(M)
    }
)

For this model we included two input parameters and provided default values in the prototype:

  • column_name: the name of the column in the annotation table containing molecule names
  • clean: a flag that will replace the old column if TRUE or add a new column if FALSE

Metabolite mashing

Now that we have defined our new workflow steps we can use them in a model_seq (workflow), alongside other existing steps to mash the annotations with other tables containing identifiers and additional information.

In this case we do the following:

  1. import the annotations from Metabolomics Workbench
  2. remove empty columns
  3. remove suffixes from molecule names
  4. search the Metabolomics Workbench refmet database for identifiers
  5. use the PubChem REST API so search for compound CIDs
  6. combine the CID columns from refmet and pubchem into a single column, giving priority to refmet, to obtain a single column of identifiers we have confidence in for as many metabolites as possible
  7. use the PubChem REST API to obtain SMILES for each metabolite

For clarity in the workflow we import the refmet database before using it in the workflow instead of importing it in-line. We also import some cached REST API responses so that we dont overburden the service when generating the vignette.

# refmet
refmet <- mwb_refmet_database()

# pubchem caches
pubchem_cid_cache <- rds_database(
    source = system.file("cached/pubchem_cid_cache.rds",
        package = "MetMashR"
    )
)
pubchem_smile_cache <- rds_database(
    source = system.file("cached/pubchem_smiles_cache.rds",
        package = "MetMashR"
    )
)

In practice it is better to create your own cache using e.g. rds_database, sql_database or other struct_database objects with write access. See the vignette [TODO] for more details.

# prepare sequence
M <- import_source() +
    drop_empty_columns() +
    remove_suffix(
        clean = TRUE,
        column_name = "metabolite_name"
    ) +
    database_lookup(
        query_column = "refmet_name",
        database_column = "name",
        database = refmet,
        suffix = "_mwb",
        include = "pubchem_cid"
    ) +
    pubchem_compound_lookup(
        query_column = "metabolite_name",
        search_by = "name",
        suffix = "_pc",
        output = "cids",
        records = "best",
        delay = 0.2,
        cache = pubchem_cid_cache
    ) +
    prioritise_columns(
        column_names = c("pubchem_cid_mwb", "CID_pc"),
        output_name = "pubchem_cid",
        source_name = "pubchem_cid_source",
        source_tags = c("mwb", "pc"),
        clean = TRUE
    ) +
    pubchem_property_lookup(
        query_column = "pubchem_cid",
        search_by = "cid",
        suffix = "",
        property = "CanonicalSMILES",
        delay = 0.2,
        cache = pubchem_smile_cache
    )

# apply sequence
M <- model_apply(M, mwb_source(source = "AN000465"))

Note that because the first step in the workflow is to import data from Metabolomics Workbench, we only need to provide an empty annotation_table as input to model_apply as it will be updated when we run the workflow.

Once the table has been processed by our workflow we then use OpenBabel to generate images of the molecular structures from SMILES.

# prepare chart
C <- openbabel_structure(
    smiles_column = "CanonicalSMILES",
    row_index = 1,
    scale_to_fit = FALSE,
    view_port = 400,
    image_size = 500
)

# loop over some records and plot some of the molecules
G <- list()
x <- 1
for (k in c(3, 5)) {
    C$row_index <- k
    G[[x]] <- chart_plot(C, predicted(M))
    x <- x + 1
}
cowplot::plot_grid(plotlist = G, nrow = 1, labels = "AUTO")

Summary

Extending the templates provided by struct to implement workflow steps for metabolite mashing is straight forward using either the provided on-the-fly functions or scripting more permanent solutions. All objects that use the templates will be compatible with other workflow objects provided you follow the templates. As well as new workflow steps MetMashR can also be extended to include additional annotation sources.

Session Info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.5.1                 metabolomicsWorkbenchR_1.17.0
#> [3] MetMashR_1.1.0                struct_1.19.0                
#> [5] BiocStyle_2.35.0             
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1            farver_2.1.2               
#>  [3] dplyr_1.1.4                 blob_1.2.4                 
#>  [5] filelock_1.0.3              fastmap_1.2.0              
#>  [7] BiocFileCache_2.15.0        digest_0.6.37              
#>  [9] lifecycle_1.0.4             rsvg_2.6.1                 
#> [11] RSQLite_2.3.9               magrittr_2.0.3             
#> [13] compiler_4.4.2              rlang_1.1.4                
#> [15] sass_0.4.9                  tools_4.4.2                
#> [17] yaml_2.3.10                 data.table_1.16.4          
#> [19] knitr_1.49                  S4Arrays_1.7.1             
#> [21] labeling_0.4.3              ontologyIndex_2.12         
#> [23] bit_4.5.0.1                 curl_6.0.1                 
#> [25] DelayedArray_0.33.3         plyr_1.8.9                 
#> [27] abind_1.4-8                 withr_3.0.2                
#> [29] purrr_1.0.2                 BiocGenerics_0.53.3        
#> [31] sys_3.4.3                   grid_4.4.2                 
#> [33] stats4_4.4.2                colorspace_2.1-1           
#> [35] scales_1.3.0                MultiAssayExperiment_1.33.4
#> [37] SummarizedExperiment_1.37.0 cli_3.6.3                  
#> [39] rmarkdown_2.29              crayon_1.5.3               
#> [41] generics_0.1.3              httr_1.4.7                 
#> [43] DBI_1.2.3                   cachem_1.1.0               
#> [45] stringr_1.5.1               zlibbioc_1.52.0            
#> [47] ggthemes_5.1.0              BiocManager_1.30.25        
#> [49] XVector_0.47.1              matrixStats_1.4.1          
#> [51] vctrs_0.6.5                 Matrix_1.7-1               
#> [53] jsonlite_1.8.9              IRanges_2.41.2             
#> [55] S4Vectors_0.45.2            bit64_4.5.2                
#> [57] maketools_1.3.1             magick_2.8.5               
#> [59] jquerylib_0.1.4             glue_1.8.0                 
#> [61] ChemmineOB_1.45.0           cowplot_1.1.3              
#> [63] stringi_1.8.4               gtable_0.3.6               
#> [65] GenomeInfoDb_1.43.2         GenomicRanges_1.59.1       
#> [67] UCSC.utils_1.3.0            munsell_0.5.1              
#> [69] tibble_3.2.1                pillar_1.10.0              
#> [71] htmltools_0.5.8.1           GenomeInfoDbData_1.2.13    
#> [73] R6_2.5.1                    dbplyr_2.5.0               
#> [75] evaluate_1.0.1              lattice_0.22-6             
#> [77] Biobase_2.67.0              memoise_2.0.1              
#> [79] bslib_0.8.0                 Rcpp_1.0.13-1              
#> [81] SparseArray_1.7.2           xfun_0.49                  
#> [83] MatrixGenerics_1.19.0       buildtools_1.0.0           
#> [85] pkgconfig_2.0.3