The aim of this package is to
supply easy extraction and manipulation of SBML information by insertion
in tabular data structures. Because of descriptive nature of SBML
documents, the dataframe format is particularly suitable for easily
access data and be able to perform subsequent analysis. Specially, this
type of conversion enables easy data interrogation by means of tidyverse
verbs in order to facilitate, for instance, usage of
biomaRt
and igraph
-like packages. The
involvement in the Bioconductor project establishes a direct and
consistent connection with bioinformatics community while providing
cooperation of tools useful also within the frame of systems biology
and, in general, for the analysis of biological data.
In order to illustrate the package functioning, we used as examples an SBML file (Hucka et al. 2019) extracted from Reactome (Milacic et al. 2023), an open-source, open-access and peer-reviewed biological pathway database. Namely, the pathway is the “Aryl hydrocarbon receptor signalling” (R-HSA-8937144 (Jassal 2016)).
After providing installation instructions, the first section
describes the dataframes structure, in the subsequent two sections are
described the tidysbml steps to follow for pursuing the SBML conversion,
while in the last one are shown some examples to integrate tidysbml
dataframes with other Bioconductor packages (i.e. biomaRt and RCy3). In
the following, it is useful to distinguish the SBML tags names using
italic and the R commands with teletype
fonts,
respectively. Also, the terms ‘tag’ and ‘component’ are used
interchangeably.
To install tidysbml from Bioconductor, run
The SBML components of main interest for this package are
listOfCompartments, listOfSpecies and
listOfReactions. Due to the underlying SBML format, these
dataframes may generally consist of the following three sets of columns:
(i) Attributes, involving tags such as id, metaid,
name, etc., (ii) Notes, consisting only of notes tag
and (iii) Annotation, whose qualifiers are tags like
bqmodel:is, bqbiol:is, bqbiol:hasPart, etc.
Each tag is exported in one separate column. Columns for tags in the
Attributes set are named as the tag name, notes column is named ‘notes’,
while Annotation columns are prefixed by ‘annotation_’ followed by tag
name after colon symbol (‘:’), for instance column with
bqbiol:is tag content is labelled ‘annotation_is’. If one
entity possesses multiple tags with the same name, the repeated column
name is accompanied by a number (from the second copy it starts from
’_1’). Whether more values are contained in one tag (e.g. as happens for
Annotation tags such as bqbiol:hasPart,
bqbiol:isDescribedBy, or also Notes column) they are separated
by delimiters like ” ” (i.e. single space character) for Annotation
values and “|” (i.e. pipe character) for Notes. Also, based on the
selected component, the respective dataframe may contain other columns,
depending on the xml structure of the underlying component’s class. See
for instance the df_species_in_reactions
dataframe
described in the following.
The first step to convert an SBML file into R dataframes is to
convert the SBML document into an R list object, by means of the
sbml_as_list()
function. In fact, all the other functions
in this package require a list as input. The sole exception is given by
as_dfs()
which incorporates this conversion function and
therefore may also receive directly an SBML file as first argument (and
not exclusively an SBML converted into a list object).
The sbml_as_list()
function exploits functions for
reading and converting xml files from the xml2
R package
(Wickham, Hester, and Ooms 2023) and
outputs an appropriate type of list. In the following a such list is
referred as SBML-converted list. The first argument reads the file path
where the SBML file is located, the second one sets the information
about which is the SBML component the user wants to look at (i.e. among
the listOfSpecies, listOfCompartments and
listOfReactions), if any (the default option gets ‘all’
components). Examples for both options are given below.
After running
example of default option is
filepath <- system.file("extdata", "R-HSA-8937144.sbml", package = "tidysbml")
sbml_list <- sbml_as_list(filepath)
that returns a full SBML model, starting from the sbml tag, converted into a list of lists nested accordingly to the xml nesting rules.
Instead, an example of SBML-list conversion for only the list of species is given by
which yields an SBML-converted list of lists starting from the
listOfSpecies tag, that is contained inside the sbml
and model tags. This last output is required in case the user
is interested in the extraction of only one dataframe, e.g., using the
as_df()
function, as described in the following
section.
The main function of this package is as_dfs()
, which is
able to provide the SBML information about Compartments, Species and
Reactions in a tabular format. It returns a list of at most 4
dataframes, depending on the components reported inside the SBML
selected.
The dataframe for listOfCompartments
(listOfSpecies) component, named df_compartments
(df_species
), has one row for each Compartment (Species)
and one column for each Attributes, Notes and Annotation value.
Similarly, the first dataframe about Reactions
(i.e. df_reactions
) contains one row for each Reaction with
their Attributes, Notes and Annotation values as columns, while the
second one (i.e. df_species_in_reactions
) has one row for
each Species involved in each Reaction, here with the addition of two
more columns: the reaction_id
column, with information
about the corresponding Reaction identificator reported in the SBML
document, and the container_list
column, with the name of
the listOf element containing that Species
(i.e. listOfReactants, listOfProducts,
listOfModifiers). It is possible to use as first argument the
SBML file path
or directly the SBML-converted list after using
sbml_as_list()
as described above
both returning the same output, that is the list with all the dataframes available from extraction. After the list has been extracted once, this second way is preferable, in order to avoid repeated sbml-list conversions.
Another function, namely as_df()
, enables the conversion
of only one dataframe at a time, depending on the SBML component of
interest. Here a SBML-converted list starting from
listOfCompartments/listOfSpecies/listOfReactions
component is a mandatory input. For instance, converting first the SBML
file into a list focusing at the listOfSpecies component
returns one dataframe containing all information about species. Just
for listOfReactions component is possible to obtain two
dataframes. Here dfs_about_reactions
is a list of 2
dataframes obtained by
list_react <- sbml_as_list(filepath, component = "reactions")
dfs_about_reactions <- as_df(list_react)
whose first component, containing information about reactions, returns 15 columns for the 5 reactions of our example
dfs_about_reactions[[1]]
#> names compartment fast
#> 1 notes, annotation, listOfReactants, listOfProducts compartment_7660 false
#> 2 notes, annotation, listOfReactants, listOfProducts compartment_70101 false
#> 3 notes, annotation, listOfReactants, listOfProducts compartment_7660 false
#> 4 notes, annotation, listOfReactants, listOfProducts compartment_70101 false
#> 5 notes, annotation, listOfReactants, listOfProducts compartment_70101 false
#> id metaid
#> 1 reaction_8937191 metaid_13
#> 2 reaction_8937169 metaid_14
#> 3 reaction_8937177 metaid_15
#> 4 reaction_8936851 metaid_16
#> 5 reaction_8936849 metaid_17
#> name
#> 1 AHR:TCDD:2xHSP90AB1:AIP:PTGES3 dissociates
#> 2 AHR:TCDD:2xHSP90AB1:AIP:PTGES3 translocates from cytosol to nucleoplasm
#> 3 AHR:TCDD binds ARNT
#> 4 AHRR binds ARNT
#> 5 AHR:2xHSP90:AIP:PTGES3 binds TCDD
#> reversible
#> 1 false
#> 2 false
#> 3 false
#> 4 false
#> 5 false
#> notes
#> 1 After ligand binding and activation, the AHR complex translocates to the nucleus where it disassociates from the chaperone subunits to then dimerise with the aryl hydrocarbon receptor nuclear translocator (ARNT) (Lees and Whitelaw 1999).
#> 2 After ligand binding and activation, the AHR complex translocates to the nucleus by an unknown mechanism (Beischlag et al. 2008) but it has been hypothesised that ligand binding to AHR promoted p23-associated hsp90 complex-mediated interaction of AHR with the nuclear import receptor protein pendulin and subsequent nuclear translocation of the receptor (Kazlauskas et al. 2001).
#> 3 Ligand-bound aryl hydrocarbon receptor (AHR:TCDD) disassociates from the chaperone subunits to then dimerise with the aryl hydrocarbon receptor nuclear translocator (ARNT) (Lees and Whitelaw 1999).
#> 4 The aryl hydrocarbon receptor repressor (AHRR) is predominantly localised in the nucleus of cells (Kanno et al. 2007) and can bind AHR nuclear translocator (ARNT). The resultant AHRR:ARNT complex can compete with AHR:ARNT for binding to the xenobiotic response element (XRE) on target genes such CYP1A1 (Evans et al. 2008, Hahn et al. 2009).
#> 5 The aryl hydrocarbon receptor (AHR) is a ligand-activated transcription factor that can control the expression of a diverse set of genes. Two major types of environmental compounds can activate AHR signaling: halogenated aromatic hydrocarbons such as 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) and polycyclic aromatic hydrocarbons (PAH) such as benzo(a)pyrene. Unliganded AHR forms a complex in the cytosol with two copies of 90kD heat shock protein (HSP90AB1) (Forsythe et al. 2001), one X-associated protein (AIP) (Meyer et al. 1998), and one p23 molecular chaperone protein (PTGES3) (Nguyen et al. 2012, Beischlag et al. 2008). Here, the binding of TCDD is shown.
#> annotation_created
#> 1 2016-08-30T13:13:23Z
#> 2 2016-08-30T13:13:23Z
#> 3 2016-08-30T13:13:23Z
#> 4 2016-08-24T17:20:03Z
#> 5 2016-08-24T17:20:03Z
#> annotation_creator
#> 1 N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Wright Adam; ORG(Orgname): OICR
#> 2 N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Wright Adam; ORG(Orgname): OICR
#> 3 N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Wright Adam; ORG(Orgname): OICR
#> 4 N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Wright Adam; ORG(Orgname): OICR
#> 5 N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Jassal Bijay; ORG(Orgname): OICR||N(Family Given): Wright Adam; ORG(Orgname): OICR
#> annotation_is
#> 1 https://identifiers.org/pubmed:10409767
#> 2 https://identifiers.org/pubmed:18540824 https://identifiers.org/pubmed:11259606
#> 3 https://identifiers.org/pubmed:10409767
#> 4 https://identifiers.org/pubmed:17980155 https://identifiers.org/pubmed:18848529 https://identifiers.org/pubmed:18000031
#> 5 https://identifiers.org/pubmed:22759865 https://identifiers.org/pubmed:11274138 https://identifiers.org/pubmed:7961644 https://identifiers.org/pubmed:12573486 https://identifiers.org/pubmed:18540824 https://identifiers.org/pubmed:9447995
#> annotation_isDescribedBy
#> 1 https://identifiers.org/reactome:R-HSA-8937191
#> 2 https://identifiers.org/reactome:R-HSA-8937169
#> 3 https://identifiers.org/reactome:R-HSA-8937177
#> 4 https://identifiers.org/reactome:R-HSA-8936851
#> 5 https://identifiers.org/reactome:R-HSA-8936849
#> annotation_is_1 annotation_modified
#> 1 https://identifiers.org/reactome:R-HSA-8937191 2016-08-30T13:13:23Z
#> 2 https://identifiers.org/reactome:R-HSA-8937169 2016-08-30T13:13:23Z
#> 3 https://identifiers.org/reactome:R-HSA-8937177 2016-08-30T13:13:23Z
#> 4 https://identifiers.org/reactome:R-HSA-8936851 2016-08-24T17:20:03Z
#> 5 https://identifiers.org/reactome:R-HSA-8936849 2016-08-24T17:20:03Z
#> annotation_modified_1
#> 1 2023-11-16T21:40:29Z
#> 2 2023-11-17T00:37:19Z
#> 3 2023-11-16T21:40:29Z
#> 4 2023-11-16T21:40:29Z
#> 5 2023-11-17T00:37:19Z
While df_species_in_reactions
, with information about
the 14 species involved in the 5 reactions described above, is obtained
by taking the second component
dfs_about_reactions[[2]]
#> reaction_id container_list constant
#> 1 reaction_8937191 listOfReactants true
#> 2 reaction_8937191 listOfProducts true
#> 3 reaction_8937191 listOfProducts true
#> 4 reaction_8937169 listOfReactants true
#> 5 reaction_8937169 listOfProducts true
#> 6 reaction_8937177 listOfReactants true
#> 7 reaction_8937177 listOfReactants true
#> 8 reaction_8937177 listOfProducts true
#> 9 reaction_8936851 listOfReactants true
#> 10 reaction_8936851 listOfReactants true
#> 11 reaction_8936851 listOfProducts true
#> 12 reaction_8936849 listOfReactants true
#> 13 reaction_8936849 listOfReactants true
#> 14 reaction_8936849 listOfProducts true
#> id sboTerm species
#> 1 speciesreference_8937191_input_8937154 SBO:0000010 species_8937154
#> 2 speciesreference_8937191_output_8937201 SBO:0000011 species_8937201
#> 3 speciesreference_8937191_output_8937150 SBO:0000011 species_8937150
#> 4 speciesreference_8937169_input_8936846 SBO:0000010 species_8936846
#> 5 speciesreference_8937169_output_8937154 SBO:0000011 species_8937154
#> 6 speciesreference_8937177_input_8936837 SBO:0000010 species_8936837
#> 7 speciesreference_8937177_input_8937150 SBO:0000010 species_8937150
#> 8 speciesreference_8937177_output_8937203 SBO:0000011 species_8937203
#> 9 speciesreference_8936851_input_8936841 SBO:0000010 species_8936841
#> 10 speciesreference_8936851_input_8936837 SBO:0000010 species_8936837
#> 11 speciesreference_8936851_output_8936835 SBO:0000011 species_8936835
#> 12 speciesreference_8936849_input_8936852 SBO:0000010 species_8936852
#> 13 speciesreference_8936849_input_8936844 SBO:0000010 species_8936844
#> 14 speciesreference_8936849_output_8936846 SBO:0000011 species_8936846
#> stoichiometry
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 1
#> 6 1
#> 7 1
#> 8 1
#> 9 1
#> 10 1
#> 11 1
#> 12 1
#> 13 1
#> 14 1
Each function described in this section performs a control on the
input format correctness. In particular, it returns errors if the input
object is an empty list or not a list object, and also if its format is
not suitable for extraction (i.e. SBML tags are not properly named or
nested). In particular, the SBML file is accepted by
as_df()
if it contains only one type of tag within the
first level of ‘listOf’ components. For instance, if the SBML is
restricted to
listOfSpecies/listOfCompartments/listOfReactions
tag, the only type of tag within the list should be
species/compartment/reaction. One more
condition, given only in the as_dfs()
function, is that the
first two tags in the xml hierarchy should be sbml and
model, where the former contains the latter. If any one of
these conditions does not hold, the respective functions are not
executed.
This section provides R code to incorporate tidysbml dataframes with other Bioconductor packages. Here are shown examples of integration for RCy3 (Gustavsen et al. 2019) and biomaRt (Durinck et al. 2005) packages.
RCy3 package permits communication between R and Cytoscape softwares. After launching Cytoscape, it is possible to import graph in form of edgelist (i.e. dataframe with source and target columns) by simple (or heavier) data manipulation through dataframes as
library(dplyr)
edgelist <- df_species_in_reactions %>% select("reaction_id", "species") %>% `colnames<-`(c("source", "target"))
RCy3::createNetworkFromDataFrames(edges = edgelist) # while running Cytoscape
BiomaRt, instead, is an annotation package providing access to external public databases. One possible usage, for instance, is to visualize information about Uniprot ids reported in SBML for Species, here considering only those composed by multiple entities (i.e. multiple ids). First, extract URIs data about species from Annotation column with bqbiol:hasPart content
vec_uri <- na.omit( unlist(
lapply(X = list_of_dfs[[2]]$annotation_hasPart, FUN = function(x){
unlist(strsplit(x, "||", fixed = TRUE))
})
))
filter only Uniprot URIs
vec_uniprot <- na.omit( unlist(
lapply( X = vec_uri, FUN = function(x){
if( all(unlist(gregexpr("uniprot", x)) > -1) ){
x
} else {
NA
}
})
))
and extract Uniprot ids
vec_ids <- vapply(vec_uniprot, function(x){
chr <- "/"
first <- max(unlist(gregexpr(chr, x)))
substr(x, first + 1, nchar(x))
}, FUN.VALUE = character(1))
Then, using biomaRt commands, user can set attributes information to look at
library(biomaRt)
mart <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
#> Ensembl site unresponsive, trying asia mirror
df_mart_uniprot <- getBM( attributes = c("uniprot_gn_id", "uniprot_gn_symbol", "description"),
filters = "uniprot_gn_id",
values = vec_ids,
mart = mart)
df_mart_uniprot
#> uniprot_gn_id uniprot_gn_symbol
#> 1 O00170 AIP
#> 2 P27540 ARNT
#> 3 P35869 AHR
#> description
#> 1 aryl hydrocarbon receptor interacting protein [Source:HGNC Symbol;Acc:HGNC:358]
#> 2 aryl hydrocarbon receptor nuclear translocator [Source:HGNC Symbol;Acc:HGNC:700]
#> 3 aryl hydrocarbon receptor [Source:HGNC Symbol;Acc:HGNC:348]
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] biomaRt_2.63.0 tidysbml_1.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.47.0 xfun_0.48 bslib_0.8.0
#> [4] httr2_1.0.5 Biobase_2.67.0 vctrs_0.6.5
#> [7] tools_4.4.1 generics_0.1.3 curl_5.2.3
#> [10] stats4_4.4.1 tibble_3.2.1 fansi_1.0.6
#> [13] AnnotationDbi_1.69.0 RSQLite_2.3.7 blob_1.2.4
#> [16] pkgconfig_2.0.3 dbplyr_2.5.0 S4Vectors_0.44.0
#> [19] lifecycle_1.0.4 GenomeInfoDbData_1.2.13 compiler_4.4.1
#> [22] stringr_1.5.1 Biostrings_2.75.0 progress_1.2.3
#> [25] GenomeInfoDb_1.43.0 htmltools_0.5.8.1 sys_3.4.3
#> [28] buildtools_1.0.0 sass_0.4.9 yaml_2.3.10
#> [31] pillar_1.9.0 crayon_1.5.3 jquerylib_0.1.4
#> [34] cachem_1.1.0 tidyselect_1.2.1 digest_0.6.37
#> [37] stringi_1.8.4 purrr_1.0.2 dplyr_1.1.4
#> [40] maketools_1.3.1 fastmap_1.2.0 cli_3.6.3
#> [43] magrittr_2.0.3 utf8_1.2.4 withr_3.0.2
#> [46] filelock_1.0.3 prettyunits_1.2.0 UCSC.utils_1.2.0
#> [49] rappdirs_0.3.3 bit64_4.5.2 rmarkdown_2.28
#> [52] XVector_0.46.0 httr_1.4.7 bit_4.5.0
#> [55] png_0.1-8 hms_1.1.3 memoise_2.0.1
#> [58] evaluate_1.0.1 knitr_1.48 IRanges_2.41.0
#> [61] BiocFileCache_2.15.0 rlang_1.1.4 glue_1.8.0
#> [64] DBI_1.2.3 xml2_1.3.6 BiocGenerics_0.53.1
#> [67] jsonlite_1.8.9 R6_2.5.1 zlibbioc_1.52.0