Introduction to the tidysbml package

The aim of this package is to supply easy extraction and manipulation of SBML information by insertion in tabular data structures. Because of descriptive nature of SBML documents, the dataframe format is particularly suitable for easily access data and be able to perform subsequent analysis. Specially, this type of conversion enables easy data interrogation by means of tidyverse verbs in order to facilitate, for instance, usage of biomaRt and igraph-like packages. The involvement in the Bioconductor project establishes a direct and consistent connection with bioinformatics community while providing cooperation of tools useful also within the frame of systems biology and, in general, for the analysis of biological data.

In order to illustrate the package functioning, we used as examples an SBML file (Hucka et al. 2019) extracted from Reactome (Milacic et al. 2023), an open-source, open-access and peer-reviewed biological pathway database. Namely, the pathway is the “Aryl hydrocarbon receptor signalling” (R-HSA-8937144 (Jassal 2016)).

After providing installation instructions, the first section describes the dataframes structure, in the subsequent two sections are described the tidysbml steps to follow for pursuing the SBML conversion, while in the last one are shown some examples to integrate tidysbml dataframes with other Bioconductor packages (i.e. biomaRt and RCy3). In the following, it is useful to distinguish the SBML tags names using italic and the R commands with teletype fonts, respectively. Also, the terms ‘tag’ and ‘component’ are used interchangeably.

Installation

To install tidysbml from Bioconductor, run

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("tidysbml")

General dataframes structure

The SBML components of main interest for this package are listOfCompartments, listOfSpecies and listOfReactions. Due to the underlying SBML format, these dataframes may generally consist of the following three sets of columns: (i) Attributes, involving tags such as id, metaid, name, etc., (ii) Notes, consisting only of notes tag and (iii) Annotation, whose qualifiers are tags like bqmodel:is, bqbiol:is, bqbiol:hasPart, etc. Each tag is exported in one separate column. Columns for tags in the Attributes set are named as the tag name, notes column is named ‘notes’, while Annotation columns are prefixed by ‘annotation_’ followed by tag name after colon symbol (‘:’), for instance column with bqbiol:is tag content is labelled ‘annotation_is’. If one entity possesses multiple tags with the same name, the repeated column name is accompanied by a number (from the second copy it starts from ’_1’). Whether more values are contained in one tag (e.g. as happens for Annotation tags such as bqbiol:hasPart, bqbiol:isDescribedBy, or also Notes column) they are separated by delimiters like ” ” (i.e. single space character) for Annotation values and “|” (i.e. pipe character) for Notes. Also, based on the selected component, the respective dataframe may contain other columns, depending on the xml structure of the underlying component’s class. See for instance the df_species_in_reactions dataframe described in the following.

Converting SBML into a R list

The first step to convert an SBML file into R dataframes is to convert the SBML document into an R list object, by means of the sbml_as_list() function. In fact, all the other functions in this package require a list as input. The sole exception is given by as_dfs() which incorporates this conversion function and therefore may also receive directly an SBML file as first argument (and not exclusively an SBML converted into a list object).

The sbml_as_list() function exploits functions for reading and converting xml files from the xml2 R package (Wickham, Hester, and Ooms 2023) and outputs an appropriate type of list. In the following a such list is referred as SBML-converted list. The first argument reads the file path where the SBML file is located, the second one sets the information about which is the SBML component the user wants to look at (i.e. among the listOfSpecies, listOfCompartments and listOfReactions), if any (the default option gets ‘all’ components). Examples for both options are given below.

After running

library(tidysbml)

example of default option is

filepath <- system.file("extdata", "R-HSA-8937144.sbml", package = "tidysbml")
sbml_list <- sbml_as_list(filepath)

that returns a full SBML model, starting from the sbml tag, converted into a list of lists nested accordingly to the xml nesting rules.

Instead, an example of SBML-list conversion for only the list of species is given by

list_species <- sbml_as_list(filepath, component = "species")

which yields an SBML-converted list of lists starting from the listOfSpecies tag, that is contained inside the sbml and model tags. This last output is required in case the user is interested in the extraction of only one dataframe, e.g., using the as_df() function, as described in the following section.

Converting SBML into dataframes

The main function of this package is as_dfs(), which is able to provide the SBML information about Compartments, Species and Reactions in a tabular format. It returns a list of at most 4 dataframes, depending on the components reported inside the SBML selected.

The dataframe for listOfCompartments (listOfSpecies) component, named df_compartments (df_species), has one row for each Compartment (Species) and one column for each Attributes, Notes and Annotation value. Similarly, the first dataframe about Reactions (i.e. df_reactions) contains one row for each Reaction with their Attributes, Notes and Annotation values as columns, while the second one (i.e. df_species_in_reactions) has one row for each Species involved in each Reaction, here with the addition of two more columns: the reaction_id column, with information about the corresponding Reaction identificator reported in the SBML document, and the container_list column, with the name of the listOf element containing that Species (i.e. listOfReactants, listOfProducts, listOfModifiers). It is possible to use as first argument the SBML file path

list_of_dfs <- as_dfs(filepath, type = "file")
#> Empty notes' column for 'compartment' elements

or directly the SBML-converted list after using sbml_as_list() as described above

list_of_dfs <- as_dfs(sbml_list, type = "list")
#> Empty notes' column for 'compartment' elements

both returning the same output, that is the list with all the dataframes available from extraction. After the list has been extracted once, this second way is preferable, in order to avoid repeated sbml-list conversions.

Another function, namely as_df(), enables the conversion of only one dataframe at a time, depending on the SBML component of interest. Here a SBML-converted list starting from listOfCompartments/listOfSpecies/listOfReactions component is a mandatory input. For instance, converting first the SBML file into a list focusing at the listOfSpecies component

list_species <- sbml_as_list(filepath, component = "species")
df_species <- as_df(list_species)

returns one dataframe containing all information about species. Just for listOfReactions component is possible to obtain two dataframes. Here dfs_about_reactions is a list of 2 dataframes obtained by

list_react <- sbml_as_list(filepath, component = "reactions")
dfs_about_reactions <- as_df(list_react)

whose first component, containing information about reactions, returns 15 columns for the 5 reactions of our example

dfs_about_reactions[[1]]
#>                                                names       compartment  fast
#> 1 notes, annotation, listOfReactants, listOfProducts  compartment_7660 false
#> 2 notes, annotation, listOfReactants, listOfProducts compartment_70101 false
#> 3 notes, annotation, listOfReactants, listOfProducts  compartment_7660 false
#> 4 notes, annotation, listOfReactants, listOfProducts compartment_70101 false
#> 5 notes, annotation, listOfReactants, listOfProducts compartment_70101 false
#>                 id    metaid
#> 1 reaction_8937191 metaid_13
#> 2 reaction_8937169 metaid_14
#> 3 reaction_8937177 metaid_15
#> 4 reaction_8936851 metaid_16
#> 5 reaction_8936849 metaid_17
#>                                                                      name
#> 1                              AHR:TCDD:2xHSP90AB1:AIP:PTGES3 dissociates
#> 2 AHR:TCDD:2xHSP90AB1:AIP:PTGES3 translocates from cytosol to nucleoplasm
#> 3                                                     AHR:TCDD binds ARNT
#> 4                                                         AHRR binds ARNT
#> 5                                       AHR:2xHSP90:AIP:PTGES3 binds TCDD
#>   reversible
#> 1      false
#> 2      false
#> 3      false
#> 4      false
#> 5      false
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        notes
#> 1                                                                                                                                                                                                                                                                                                                                                                                                                                              After ligand binding and activation, the AHR complex translocates to the nucleus where it disassociates from the chaperone subunits to then dimerise with the aryl hydrocarbon receptor nuclear translocator (ARNT) (Lees and Whitelaw 1999).
#> 2                                                                                                                                                                                                                                                                                                After ligand binding and activation, the AHR complex translocates to the nucleus by an unknown mechanism (Beischlag et al. 2008) but it has been hypothesised that ligand binding to AHR promoted p23-associated hsp90 complex-mediated interaction of AHR with the nuclear import receptor protein pendulin and subsequent nuclear translocation of the receptor (Kazlauskas et al. 2001).
#> 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Ligand-bound aryl hydrocarbon receptor (AHR:TCDD) disassociates from the chaperone subunits to then dimerise with the aryl hydrocarbon receptor nuclear translocator (ARNT) (Lees and Whitelaw 1999).
#> 4                                                                                                                                                                                                                                                                                                                                      The aryl hydrocarbon receptor repressor (AHRR) is predominantly localised in the nucleus of cells (Kanno et al. 2007) and can bind AHR nuclear translocator (ARNT). The resultant AHRR:ARNT complex can compete with AHR:ARNT for binding to the xenobiotic response element (XRE) on target genes such CYP1A1 (Evans et al. 2008, Hahn et al. 2009).
#> 5 The aryl hydrocarbon receptor (AHR) is a ligand-activated transcription factor that can control the expression of a diverse set of genes. Two major types of environmental compounds can activate AHR signaling: halogenated aromatic hydrocarbons such as 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) and polycyclic aromatic hydrocarbons (PAH) such as benzo(a)pyrene. Unliganded AHR forms a complex in the cytosol with two copies of 90kD heat shock protein (HSP90AB1) (Forsythe et al. 2001), one X-associated protein (AIP) (Meyer et al. 1998), and one p23 molecular chaperone protein (PTGES3) (Nguyen et al. 2012, Beischlag et al. 2008). Here, the binding of TCDD is shown.
#>     annotation_created
#> 1 2016-08-30T13:13:23Z
#> 2 2016-08-30T13:13:23Z
#> 3 2016-08-30T13:13:23Z
#> 4 2016-08-24T17:20:03Z
#> 5 2016-08-24T17:20:03Z
#>                                                                                                                                       annotation_creator
#> 1 N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Wright Adam; ORG(Orgname): OICR
#> 2 N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Wright Adam; ORG(Orgname): OICR
#> 3 N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Wright Adam; ORG(Orgname): OICR
#> 4 N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Wright Adam; ORG(Orgname): OICR
#> 5 N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Wright Adam; ORG(Orgname): OICR
#>                                                                                                                                                                                                                                   annotation_is
#> 1                                                                                                                                                                                                       https://identifiers.org/pubmed:10409767
#> 2                                                                                                                                                               https://identifiers.org/pubmed:18540824 https://identifiers.org/pubmed:11259606
#> 3                                                                                                                                                                                                       https://identifiers.org/pubmed:10409767
#> 4                                                                                                                       https://identifiers.org/pubmed:17980155 https://identifiers.org/pubmed:18848529 https://identifiers.org/pubmed:18000031
#> 5 https://identifiers.org/pubmed:22759865 https://identifiers.org/pubmed:11274138 https://identifiers.org/pubmed:7961644 https://identifiers.org/pubmed:12573486 https://identifiers.org/pubmed:18540824 https://identifiers.org/pubmed:9447995
#>                         annotation_isDescribedBy
#> 1 https://identifiers.org/reactome:R-HSA-8937191
#> 2 https://identifiers.org/reactome:R-HSA-8937169
#> 3 https://identifiers.org/reactome:R-HSA-8937177
#> 4 https://identifiers.org/reactome:R-HSA-8936851
#> 5 https://identifiers.org/reactome:R-HSA-8936849
#>                                  annotation_is_1  annotation_modified
#> 1 https://identifiers.org/reactome:R-HSA-8937191 2016-08-30T13:13:23Z
#> 2 https://identifiers.org/reactome:R-HSA-8937169 2016-08-30T13:13:23Z
#> 3 https://identifiers.org/reactome:R-HSA-8937177 2016-08-30T13:13:23Z
#> 4 https://identifiers.org/reactome:R-HSA-8936851 2016-08-24T17:20:03Z
#> 5 https://identifiers.org/reactome:R-HSA-8936849 2016-08-24T17:20:03Z
#>   annotation_modified_1
#> 1  2023-11-16T21:40:29Z
#> 2  2023-11-17T00:37:19Z
#> 3  2023-11-16T21:40:29Z
#> 4  2023-11-16T21:40:29Z
#> 5  2023-11-17T00:37:19Z

While df_species_in_reactions, with information about the 14 species involved in the 5 reactions described above, is obtained by taking the second component

dfs_about_reactions[[2]]
#>         reaction_id  container_list constant
#> 1  reaction_8937191 listOfReactants     true
#> 2  reaction_8937191  listOfProducts     true
#> 3  reaction_8937191  listOfProducts     true
#> 4  reaction_8937169 listOfReactants     true
#> 5  reaction_8937169  listOfProducts     true
#> 6  reaction_8937177 listOfReactants     true
#> 7  reaction_8937177 listOfReactants     true
#> 8  reaction_8937177  listOfProducts     true
#> 9  reaction_8936851 listOfReactants     true
#> 10 reaction_8936851 listOfReactants     true
#> 11 reaction_8936851  listOfProducts     true
#> 12 reaction_8936849 listOfReactants     true
#> 13 reaction_8936849 listOfReactants     true
#> 14 reaction_8936849  listOfProducts     true
#>                                         id     sboTerm         species
#> 1   speciesreference_8937191_input_8937154 SBO:0000010 species_8937154
#> 2  speciesreference_8937191_output_8937201 SBO:0000011 species_8937201
#> 3  speciesreference_8937191_output_8937150 SBO:0000011 species_8937150
#> 4   speciesreference_8937169_input_8936846 SBO:0000010 species_8936846
#> 5  speciesreference_8937169_output_8937154 SBO:0000011 species_8937154
#> 6   speciesreference_8937177_input_8936837 SBO:0000010 species_8936837
#> 7   speciesreference_8937177_input_8937150 SBO:0000010 species_8937150
#> 8  speciesreference_8937177_output_8937203 SBO:0000011 species_8937203
#> 9   speciesreference_8936851_input_8936841 SBO:0000010 species_8936841
#> 10  speciesreference_8936851_input_8936837 SBO:0000010 species_8936837
#> 11 speciesreference_8936851_output_8936835 SBO:0000011 species_8936835
#> 12  speciesreference_8936849_input_8936852 SBO:0000010 species_8936852
#> 13  speciesreference_8936849_input_8936844 SBO:0000010 species_8936844
#> 14 speciesreference_8936849_output_8936846 SBO:0000011 species_8936846
#>    stoichiometry
#> 1              1
#> 2              1
#> 3              1
#> 4              1
#> 5              1
#> 6              1
#> 7              1
#> 8              1
#> 9              1
#> 10             1
#> 11             1
#> 12             1
#> 13             1
#> 14             1

Each function described in this section performs a control on the input format correctness. In particular, it returns errors if the input object is an empty list or not a list object, and also if its format is not suitable for extraction (i.e. SBML tags are not properly named or nested). In particular, the SBML file is accepted by as_df() if it contains only one type of tag within the first level of ‘listOf’ components. For instance, if the SBML is restricted to listOfSpecies/listOfCompartments/listOfReactions tag, the only type of tag within the list should be species/compartment/reaction. One more condition, given only in the as_dfs() function, is that the first two tags in the xml hierarchy should be sbml and model, where the former contains the latter. If any one of these conditions does not hold, the respective functions are not executed.

Integration with Bioconductor packages

This section provides R code to incorporate tidysbml dataframes with other Bioconductor packages. Here are shown examples of integration for RCy3 (Gustavsen et al. 2019) and biomaRt (Durinck et al. 2005) packages.

RCy3 package permits communication between R and Cytoscape softwares. After launching Cytoscape, it is possible to import graph in form of edgelist (i.e. dataframe with source and target columns) by simple (or heavier) data manipulation through dataframes as

library(dplyr)
edgelist <- df_species_in_reactions %>% select("reaction_id", "species") %>% `colnames<-`(c("source", "target"))
RCy3::createNetworkFromDataFrames(edges = edgelist) # while running Cytoscape

BiomaRt, instead, is an annotation package providing access to external public databases. One possible usage, for instance, is to visualize information about Uniprot ids reported in SBML for Species, here considering only those composed by multiple entities (i.e. multiple ids). First, extract URIs data about species from Annotation column with bqbiol:hasPart content

vec_uri <- na.omit( unlist(
  lapply(X = list_of_dfs[[2]]$annotation_hasPart, FUN = function(x){
    unlist(strsplit(x, "||", fixed = TRUE))
  })
))

filter only Uniprot URIs

vec_uniprot <- na.omit( unlist(
  lapply( X = vec_uri, FUN = function(x){
    if( all(unlist(gregexpr("uniprot", x)) > -1) ){
      x
    } else {
      NA
    }
  })
))

and extract Uniprot ids

vec_ids <- vapply(vec_uniprot, function(x){
  chr <- "/"
  first <- max(unlist(gregexpr(chr, x)))
  substr(x, first + 1, nchar(x))
}, FUN.VALUE = character(1))

Then, using biomaRt commands, user can set attributes information to look at

library(biomaRt)
mart <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
#> Ensembl site unresponsive, trying asia mirror
df_mart_uniprot <- getBM( attributes = c("uniprot_gn_id", "uniprot_gn_symbol",   "description"),
                          filters = "uniprot_gn_id",
                          values = vec_ids,
                          mart = mart)
df_mart_uniprot
#>   uniprot_gn_id uniprot_gn_symbol
#> 1        O00170               AIP
#> 2        P27540              ARNT
#> 3        P35869               AHR
#>                                                                        description
#> 1  aryl hydrocarbon receptor interacting protein [Source:HGNC Symbol;Acc:HGNC:358]
#> 2 aryl hydrocarbon receptor nuclear translocator [Source:HGNC Symbol;Acc:HGNC:700]
#> 3                      aryl hydrocarbon receptor [Source:HGNC Symbol;Acc:HGNC:348]

Session info

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] biomaRt_2.63.0 tidysbml_1.1.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] KEGGREST_1.47.0         xfun_0.48               bslib_0.8.0            
#>  [4] httr2_1.0.5             Biobase_2.67.0          vctrs_0.6.5            
#>  [7] tools_4.4.1             generics_0.1.3          curl_5.2.3             
#> [10] stats4_4.4.1            tibble_3.2.1            fansi_1.0.6            
#> [13] AnnotationDbi_1.69.0    RSQLite_2.3.7           blob_1.2.4             
#> [16] pkgconfig_2.0.3         dbplyr_2.5.0            S4Vectors_0.44.0       
#> [19] lifecycle_1.0.4         GenomeInfoDbData_1.2.13 compiler_4.4.1         
#> [22] stringr_1.5.1           Biostrings_2.75.0       progress_1.2.3         
#> [25] GenomeInfoDb_1.43.0     htmltools_0.5.8.1       sys_3.4.3              
#> [28] buildtools_1.0.0        sass_0.4.9              yaml_2.3.10            
#> [31] pillar_1.9.0            crayon_1.5.3            jquerylib_0.1.4        
#> [34] cachem_1.1.0            tidyselect_1.2.1        digest_0.6.37          
#> [37] stringi_1.8.4           purrr_1.0.2             dplyr_1.1.4            
#> [40] maketools_1.3.1         fastmap_1.2.0           cli_3.6.3              
#> [43] magrittr_2.0.3          utf8_1.2.4              withr_3.0.2            
#> [46] filelock_1.0.3          prettyunits_1.2.0       UCSC.utils_1.2.0       
#> [49] rappdirs_0.3.3          bit64_4.5.2             rmarkdown_2.28         
#> [52] XVector_0.46.0          httr_1.4.7              bit_4.5.0              
#> [55] png_0.1-8               hms_1.1.3               memoise_2.0.1          
#> [58] evaluate_1.0.1          knitr_1.48              IRanges_2.41.0         
#> [61] BiocFileCache_2.15.0    rlang_1.1.4             glue_1.8.0             
#> [64] DBI_1.2.3               xml2_1.3.6              BiocGenerics_0.53.1    
#> [67] jsonlite_1.8.9          R6_2.5.1                zlibbioc_1.52.0

References

Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, and Wolfgang Huber. 2005. BioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21 (16): 3439–40. https://doi.org/10.1093/bioinformatics/bti525.
Gustavsen, Julia A., Shraddha Pai, Ruth Isserlin, Barry Demchak, and Alexander R. Pico. 2019. RCy3: Network Biology Using Cytoscape from Within R.” F1000Research. https://doi.org/10.12688/f1000research.20887.3.
Hucka, Michael, Frank T. Bergmann, Claudine Chaouiya, Andreas Dräger, Stefan Hoops, Sarah M. Keating, Matthias König, et al. 2019. “The Systems Biology Markup Language (SBML): Language Specification for Level 3 Version 2 Core Release 2.” Journal of Integrative Bioinformatics 16 (2): 20190021. https://doi.org/doi:10.1515/jib-2019-0021.
Jassal, Bijay. 2016. “Aryl Hydrocarbon Receptor Signalling.” https://reactome.org/content/detail/R-HSA-8937144.
Milacic, Marija, Deidre Beavers, Patrick Conley, Chuqiao Gong, Marc Gillespie, Johannes Griss, Robin Haw, et al. 2023. “The Reactome Pathway Knowledgebase 2024.” Nucleic Acids Research 52 (D1): D672–78. https://doi.org/10.1093/nar/gkad1025.
Wickham, Hadley, Jim Hester, and Jeroen Ooms. 2023. Xml2: Parse XML. https://xml2.r-lib.org/.