Title: | Semantic Similarity on Bio-Ontologies |
---|---|
Description: | This package implements infrastructures for ontology analysis by offering efficient data structures, fast ontology traversal methods, and elegant visualizations. It provides a robust toolbox supporting over 70 methods for semantic similarity analysis. |
Authors: | Zuguang Gu [aut, cre] |
Maintainer: | Zuguang Gu <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.5.0 |
Built: | 2024-10-31 05:24:04 UTC |
Source: | https://github.com/bioc/simona |
Create sub-DAGs
## S4 method for signature 'ontology_DAG,ANY,ANY,missing'
x[i, j, ..., drop = FALSE]

## S4 method for signature 'ontology_DAG,ANY,ANY,ANY'
x[i, j, ..., drop = FALSE]

## S4 method for signature 'ontology_DAG,ANY,missing,missing'
x[i, j, ..., drop = FALSE]

## S4 method for signature 'ontology_DAG,ANY,missing,ANY'
x[i, j, ..., drop = FALSE]

## S4 method for signature 'ontology_DAG,missing,ANY,missing'
x[i, j, ..., drop = FALSE]

## S4 method for signature 'ontology_DAG,missing,ANY,ANY'
x[i, j, ..., drop = FALSE]

## S4 method for signature 'ontology_DAG,missing,missing,missing'
x[i, j, ..., drop = FALSE]

## S4 method for signature 'ontology_DAG,missing,missing,ANY'
x[i, j, ..., drop = FALSE]

## S4 method for signature 'ontology_DAG,character,missing'
x[[i, j, ...]]
x |
An |
i |
One or more term names as a character vector. It corresponds to the roots of the sub-DAG. |
j |
One or more term names as a character vector. It corresponds to the leaves of the sub-DAG. |
... |
Ignored. |
drop |
Ignored. |
It returns a sub-DAG taking node i
as the root and j
as the leaves. If i
is a vector, a super root will be added.
An ontology_DAG
object.
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
dag = create_ontology_DAG(parents, children)
dag["b"]
dag[["b"]]
dag["b", "f"]
dag[, "f"]
Add annotations to the DAG object
add_annotation(dag, annotation)
dag |
An |
annotation |
A list of character vectors which contain items annotated to the terms. Names of the list should be the term names. In the DAG, items annotated to a term will also be annotated to its parents. Such merging is applied automatically in the package. |
An ontology_DAG
object.
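A minimal usage sketch (the annotated items below are made up purely for illustration):

```r
library(simona)

parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
dag = create_ontology_DAG(parents, children)

# a named list: names are term names, values are the annotated items
annotation = list(
    "c" = c("t1", "t2"),
    "e" = "t3",
    "f" = c("t4", "t5")
)
dag = add_annotation(dag, annotation)
dag
```

Since merging is applied automatically, items annotated to "c" are also treated as annotated to its ancestors when annotation-based methods are used.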
Supported methods
all_term_IC_methods(require_anno = NULL)
all_term_sim_methods(require_anno = NULL)
all_group_sim_methods(require_anno = NULL)
require_anno |
If it is set to |
all_term_IC_methods()
: A vector of all supported IC methods.
all_term_sim_methods()
: A vector of all supported term similarity methods.
all_group_sim_methods()
: A vector of all supported group similarity methods.
A character vector of all supported methods.
all_term_IC_methods()
all_term_sim_methods()
all_group_sim_methods()
Create the ontology_DAG object
create_ontology_DAG(
    parents,
    children,
    relations = NULL,
    relations_DAG = NULL,
    source = "Ontology",
    annotation = NULL,
    remove_cyclic_paths = FALSE,
    remove_rings = FALSE,
    alternative_terms = list(),
    verbose = simona_opt$verbose
)
parents |
A character vector of parent terms. You can also construct the |
children |
A character vector of child terms. |
relations |
A character vector of parent-child relations, e.g. "is_a", "part_of", or self-defined semantic relations.
If it is set, it should have the same length as |
relations_DAG |
If the relation types have hierarchical relations, it can also be constructed by |
source |
Source of the ontology. It is only used as a label of the object. |
annotation |
A list of character vectors which contain items annotated to the terms. Names of the list should be the term names. In the DAG, items annotated to a term will also be annotated to its parents. Such merging is applied automatically in the package. |
remove_cyclic_paths |
Whether to remove cyclic paths. If a cyclic path is represented as |
remove_rings |
There might be rings that are isolated to the main DAG where there are no roots on the rings, thus they cannot be attached to the main DAG. If the value
of |
alternative_terms |
A named list or vector that contains mappings from alternative term IDs to terms used in the DAG. In an ontology, there might be old term IDs marked as "replaced_by", "consider" or "alt_id" in the ".obo" file. You can provide mappings from old term IDs to current term IDs with this argument. If it is a one-to-one mapping, the mapping can be a named vector where alternative term IDs are names and DAG term IDs are values. If it is a one-to-many mapping, the variable should be a named list where each member vector will first be matched to the DAG terms. If the mapping is still one-to-many, the first one is selected. |
verbose |
Whether to print messages. |
An ontology_DAG
object.
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
dag = create_ontology_DAG(parents, children)

# with annotations
annotation = list(
    "a" = c("t1", "t2", "t3"),
    "b" = c("t3", "t4"),
    "c" = "t5",
    "d" = "t7",
    "e" = c("t4", "t5", "t6", "t7"),
    "f" = "t8"
)
dag = create_ontology_DAG(parents, children, annotation = annotation)

# with relations
dag = create_ontology_DAG(parents, children,
    relations = c("is_a", "part_of", "is_a", "part_of", "is_a", "part_of"))

# with relations_DAG
relations_DAG = create_ontology_DAG(c("r2", "r2"), c("r3", "r4"))
dag = create_ontology_DAG(parents, children,
    relations = c("r1", "r2", "r1", "r3", "r1", "r4"),
    relations_DAG = relations_DAG)

# with a list of parent-child relations
dag = create_ontology_DAG(c("a-b", "a-c", "b-c", "b-d", "c-e", "e-f"))
Create the ontology_DAG object from the GO.db package
Mappings between alternative GO terms to official GO terms
create_ontology_DAG_from_GO_db(
    namespace = "BP",
    relations = "part of",
    org_db = NULL,
    evidence_code = NULL,
    retrieve_alternative = FALSE,
    verbose = simona_opt$verbose
)

alternative_GO_terms(
    tag = c("replaced_by", "alt_id", "consider"),
    version = NULL,
    verbose = TRUE
)
namespace |
One of "BP", "CC" and "MF". |
relations |
Types of the GO term relations. In the GO.db package, the GO term relations can be "is_a", "part_of", "regulates", "negatively regulates", "positively regulates". Note since "regulates" is a parent relation of "negatively regulates" and "positively regulates", if "regulates" is selected, "negatively regulates" and "positively regulates" are also selected. Note "is_a" is always included. |
org_db |
The name of the organism package or the corresponding database object, e.g. |
evidence_code |
A vector of evidence codes for gene annotation to GO terms. See https://geneontology.org/docs/guide-go-evidence-codes/. |
retrieve_alternative |
Whether to retrieve alternative/obsolete GO terms from geneontology.org? |
verbose |
Whether to print messages. |
tag |
In the |
version |
Version of the |
An ontology_DAG
object.
A list of named vectors where names are alternative GO IDs and value vectors are current GO IDs in use.
dag = create_ontology_DAG_from_GO_db()
dag
Create the ontology_DAG object from the igraph object
create_ontology_DAG_from_igraph(
    g,
    relations = NULL,
    verbose = simona_opt$verbose
)
g |
An |
relations |
A vector of relation types. The length of the vector should be the same as the number of edges in |
verbose |
Whether to print messages. |
An ontology_DAG
object.
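A minimal sketch, assuming the igraph package is installed and that edge directions in the graph go from parent terms to child terms:

```r
library(igraph)
library(simona)

# a small directed graph; each edge points from a parent to a child
g = graph_from_edgelist(rbind(
    c("a", "b"),
    c("a", "c"),
    c("b", "d"),
    c("c", "d")
), directed = TRUE)

dag = create_ontology_DAG_from_igraph(g)
dag_n_terms(dag)
```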
Names of all terms
dag_all_terms(dag)
dag_n_terms(dag)
dag_n_relations(dag)
dag_n_leaves(dag)
dag |
An |
dag_all_terms()
returns a vector of term names. dag_n_terms()
returns
a single integer.
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
dag = create_ontology_DAG(parents, children)
dag_all_terms(dag)
dag_n_terms(dag)
Convert to an igraph object
dag_as_igraph(dag)
dag |
An |
If relations
is already set in create_ontology_DAG()
, relations are also set as an edge attribute in the igraph::igraph
object.
An igraph::igraph
object.
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
dag = create_ontology_DAG(parents, children)
dag_as_igraph(dag)
Visualize the DAG
dag_circular_viz(
    dag,
    highlight = NULL,
    start = 0,
    end = 360,
    partition_by_level = 1,
    partition_by_size = NULL,
    node_col = NULL,
    node_transparency = 0.4,
    node_size = NULL,
    edge_col = NULL,
    edge_transparency = default_edge_transparency(dag),
    legend_labels_from = NULL,
    legend_labels_max_width = 50,
    other_legends = list(),
    use_raster = dag_n_terms(dag) > 10000,
    newpage = TRUE,
    verbose = simona_opt$verbose
)

dag_as_DOT(
    dag,
    node_param = default_node_param,
    edge_param = default_edge_param
)

dag_graphviz(
    dag,
    node_param = default_node_param,
    edge_param = default_edge_param,
    ...
)
dag |
An |
highlight |
A vector of terms to be highlighted on the DAG. |
start |
Start of the circle, measured in degree. |
end |
End of the circle, measured in degree. |
partition_by_level |
If |
partition_by_size |
Similar as |
node_col |
Colors of nodes. If the value is a vector, the order should correspond to terms in |
node_transparency |
Transparency of nodes. The same format as |
node_size |
Size of nodes. The same format as |
edge_col |
A named vector where names correspond to relation types. |
edge_transparency |
A named vector where names correspond to relation types. |
legend_labels_from |
If partitioning is applied on the DAG, a legend is generated showing different top terms. By default, the legend labels are the term IDs. If there are additionally column stored in the meta data frame of the DAG object, the column name can be set here to replace the term IDs as legend labels. |
legend_labels_max_width |
Maximal width of legend labels measured by the number of characters per line. Labels are wrapped into multiple lines if the widths exceed it. |
other_legends |
A list of legends generated by |
use_raster |
Whether to first write the circular image into a temporary png file, then add to the plot as a raster object? |
newpage |
Whether call |
verbose |
Whether to print messages. |
node_param |
A list of parameters. Each parameter has the same format. The value can be
a single scalar, a full length vector with the same order as in |
edge_param |
A list of parameters. Each parameter has the same format. The value can be a single
scalar, or a named vector that contains a subset of terms that need to be customized.
The full set of parameters can be found at https://graphviz.org/docs/edges/.
If the parameter is set to a named vector, it can be named by relation types |
... |
Pass to |
dag_circular_viz()
uses a circular layout for visualizing large DAGs. dag_graphviz()
uses a hierarchical layout for visualizing small DAGs.
dag_as_DOT()
generates the DOT code of the DAG.
dag_graphviz()
visualizes the DAG with the DiagrammeR package.
dag_as_DOT()
returns a vector of DOT code.
http://magjac.com/graphviz-visual-editor/ is a nice place to try the DOT code.
dag = create_ontology_DAG_from_GO_db()
dag_circular_viz(dag)

if(interactive()) {
    dag = create_ontology_DAG_from_GO_db()
    dag_graphviz(dag[, "GO:0010228"])
    dag_graphviz(dag[, "GO:0010228"],
        edge_param = list(
            color = c("is_a" = "purple", "part_of" = "darkgreen"),
            style = c("is_a" = "solid", "part_of" = "dashed")),
        width = 800, height = 800)

    # the DOT code for graphviz
    dag_as_DOT(dag[, "GO:0010228"])
}
Depth and height in the DAG
dag_depth(dag, terms = NULL, use_cache = TRUE)
dag_height(dag, terms = NULL, use_cache = TRUE)
dag_shortest_dist_from_root(dag, terms = NULL, use_cache = TRUE)
dag_shortest_dist_to_leaves(dag, terms = NULL, use_cache = TRUE)
dag |
An |
terms |
A vector of term names. If it is set, the returned vector will be subsetted to the terms that have been set here. |
use_cache |
Internally used. |
The depth of a term in the DAG is defined as the maximal distance from the root. The height of a term in the DAG is the maximal finite distance to all leaf terms.
dag_shortest_dist_from_root()
and dag_shortest_dist_to_leaves()
calculate the minimal distance from the root or to the leaves.
The words "from" and "to" emphasize that the distance is directional.
An integer vector with length the same as the number of total terms in the DAG.
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
dag = create_ontology_DAG(parents, children)
dag_depth(dag)
dag_height(dag)
dag_shortest_dist_from_root(dag)
dag_shortest_dist_to_leaves(dag)
For a given list of terms, it returns a subset of terms which have no ancestor relations to each other.
dag_distinct_ancestors(
    dag,
    terms,
    in_labels = TRUE,
    verbose = simona_opt$verbose
)
dag |
An |
terms |
A vector of term names. |
in_labels |
Whether the terms are represented in their names or as integer indices? |
verbose |
Whether to print messages. Consider a subgraph that contains |
An integer vector or a character vector depending on the value of in_labels
.
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
dag = create_ontology_DAG(parents, children)
dag_distinct_ancestors(dag, c("c", "d", "e", "f"))
The analysis task is to evaluate which terms the given items are enriched in.
dag_enrich_on_items(dag, items, min_hits = 5, min_items = 10)
dag_enrich_on_genes(dag, genes, min_hits = 5, min_genes = 10)
dag |
An |
items |
A vector of item names. |
min_hits |
Minimal number of items in the term set. |
min_items |
Minimal size of the term set. |
genes |
A vector of gene IDs. The gene ID type can be found by directly printing the |
min_genes |
Minimal number of genes. |
The function tests whether the list of items are enriched in terms on the DAG.
The test is based on the hypergeometric distribution. In the following 2x2 contingency table, S
is the set of items
,
for a term t
in the DAG, T
is the set of items annotated to t
(by automatically merging from its offspring terms),
the aim is to test whether S
is over-represented in T
.
The universal set all
corresponds to the full set of items annotated to the DAG.
+----------+------+----------+-----+ | | in S | not in S | all | +----------+------+----------+-----+ | in T | x11 | x12 | x10 | | not in T | x21 | x22 | x20 | +----------+------+----------+-----+ | all | x01 | x02 | x | +----------+------+----------+-----+
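The one-sided p-value from this table can be reproduced with base R's phyper(). The counts below are made up purely for illustration; the remaining cells follow from the margins:

```r
# hypothetical counts matching the table above
x11 = 20    # in S and in T   (n_hits)
x01 = 100   # in S            (n_items)
x10 = 150   # in T            (n_anno)
x   = 1000  # all items       (n_all)

# P(X >= x11) where X ~ Hypergeometric(m = x10, n = x - x10, k = x01)
p = phyper(x11 - 1, x10, x - x10, x01, lower.tail = FALSE)
p
```

This is the same one-sided p-value that fisher.test() reports for the corresponding 2x2 table with alternative = "greater".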
dag_enrich_on_genes()
is the same as dag_enrich_on_items()
which only changes the argument item
to gene
.
A data frame with the following columns:
term
: Term names.
n_hits
: Number of items in items
intersecting to t
's annotated items.
n_anno
: Number of annotated items of t
. Specifically for dag_enrich_on_genes()
, this column
is renamed to n_gs
.
n_items
: Number of items in items
intersecting to all annotated items in the DAG. Specifically
for dag_enrich_on_genes()
, this column is renamed to n_genes
.
n_all
: Number of all annotated items in the DAG.
log2_fold_enrichment
: Defined as log2(observed/expected).
z_score
: Defined as (observed-expected)/sd.
p_value
: P-values from hypergeometric test.
p_adjust
: Adjusted p-values from the BH method.
The number of rows in the data frame is the same as the number of terms in the DAG.
## Not run: 
dag = create_ontology_DAG_from_GO_db(org_db = "org.Hs.eg.db")
items = random_items(dag, 1000)
df = dag_enrich_on_items(dag, items)
## End(Not run)
The analysis task is to evaluate how significantly a term includes the given terms.
dag_enrich_on_offsprings(dag, terms, min_hits = 3, min_offspring = 10)
dag |
An |
terms |
A vector of term names. |
min_hits |
Minimal number of terms in an offspring set. |
min_offspring |
Minimal size of the offspring set. |
Given a list of terms in terms
, the function tests whether they are enriched in a term's offspring terms.
The test is based on the hypergeometric distribution. In the following 2x2 contingency table, S
is the set of terms
,
for a term t
in the DAG, T
is the set of its offspring plus the t
itself, the aim is to test whether S
is over-represented
in T
.
If there is a significant p-value, we can say the term t
preferably includes terms in terms
.
+----------+------+----------+-----+ | | in S | not in S | all | +----------+------+----------+-----+ | in T | x11 | x12 | x10 | | not in T | x21 | x22 | x20 | +----------+------+----------+-----+ | all | x01 | x02 | x | +----------+------+----------+-----+
A data frame with the following columns:
term
: Term names.
n_hits
: Number of terms in terms
intersecting to t
's offspring terms.
n_offspring
: Number of offspring terms of t
(including t
itself).
n_terms
: Number of terms in terms
intersecting to all terms in the DAG.
n_all
: Number of all terms in the DAG.
log2_fold_enrichment
: Defined as log2(observed/expected).
z_score
: Defined as (observed-expected)/sd.
p_value
: P-values from hypergeometric test.
p_adjust
: Adjusted p-values from the BH method.
The number of rows in the data frame is the same as the number of terms in the DAG.
## Not run: 
dag = create_ontology_DAG_from_GO_db()
terms = random_terms(dag, 100)
df = dag_enrich_on_offsprings(dag, terms)
## End(Not run)
Enrichment analysis on offspring terms by permutation test
dag_enrich_on_offsprings_by_permutation(
    dag,
    value,
    perm = 1000,
    min_offspring = 10,
    verbose = simona_opt$verbose
)
dag |
An |
value |
A numeric value. The value should correspond to terms in |
perm |
Number of permutations. |
min_offspring |
Minimal size of the offspring set. |
verbose |
Whether to print messages. |
In the function dag_enrich_on_offsprings()
, the statistic for testing is the number of terms in each category. Here
this function makes the testing procedure more general.
The function tests whether a term t
's offspring terms have an over-represented pattern on values in value
.
Denote T
as the set of t
's offspring terms plus t
itself, and v
as the numeric vector of value
, we first
calculate a score s
based on values in T
:
s = mean_{terms in T}(v)
To construct a random version of s
, we randomly sample n_T
terms from the DAG where n_T
is the size of set T
:
sr_i = mean_{n_T randomly sampled terms}(v)
where index i
represents the i^th sampling. If we sample k
times, the p-value is calculated as:
p = sum_{i in 1..k}(I(sr_i > s))/k
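The sampling scheme above can be sketched in base R; the values and the set T below are made up for illustration:

```r
set.seed(123)
v = runif(500)            # hypothetical values for 500 terms
T_idx = sample(500, 20)   # indices of t plus its offspring (assumed)

s = mean(v[T_idx])        # observed score

k = 1000                  # number of permutations
sr = replicate(k, mean(v[sample(500, length(T_idx))]))

p = sum(sr > s) / k              # permutation p-value
z = (s - mean(sr)) / sd(sr)      # z-score against the permutation distribution
```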
A data frame with the following columns:
term
: Term names.
stats
: The statistics of terms.
n_offspring
: Number of offspring terms of t
(including t
itself).
log2_fold_enrichment
: defined as log2(s/mean)
where mean
is calculated from random permutation.
z_score
: Defined as (s - mean)/sd
where mean
and sd
are calculated from random permutation.
p_value
: P-values from permutation test.
p_adjust
: Adjusted p-values from the BH method.
The number of rows in the data frame is the same as the number of terms in the DAG.
## Not run: 
dag = create_ontology_DAG_from_GO_db()
value = runif(dag_n_terms(dag))  # a set of random values
df = dag_enrich_on_offsprings_by_permutation(dag, value)
## End(Not run)
Filter the DAG
dag_filter(
    dag,
    terms = NULL,
    relations = NULL,
    root = NULL,
    leaves = NULL,
    mcols_filter = NULL,
    namespace = NULL
)
dag |
An |
terms |
A vector of term names. The sub-DAG will only contain these terms. |
relations |
A vector of relations. The sub-DAG will only contain these relations.
Valid values of "relations" should correspond to the values set in the
|
root |
A vector of term names which will be used as roots of the sub-DAG. Only these with their offspring terms will be kept. If there are multiple root terms, a super root will be automatically added. |
leaves |
A vector of leaf terms. Only these with their ancestor terms will be kept. |
mcols_filter |
Filtering on columns in the meta data frame. |
namespace |
The prefix before ":" of the term IDs. |
If the DAG is reduced into several disconnected parts after the filtering, a super root is automatically added.
An ontology_DAG
object.
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
dag = create_ontology_DAG(parents, children)
dag_filter(dag, terms = c("b", "d", "f"))
dag_filter(dag, root = "b")
dag_filter(dag, leaves = c("c", "b"))
dag_filter(dag, root = "b", leaves = "e")

dag = create_ontology_DAG_from_GO_db()
dag_filter(dag, relations = "is_a")
Whether the terms exist in the DAG
dag_has_terms(dag, terms)
dag |
An |
terms |
A vector of term IDs. |
A logical vector.
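A minimal sketch:

```r
library(simona)

parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
dag = create_ontology_DAG(parents, children)

dag_has_terms(dag, c("a", "f", "x"))  # TRUE TRUE FALSE ("x" is not in the DAG)
```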
Distance from all ancestors/to all offspring in the DAG
dag_longest_dist_to_offspring(dag, from, terms = NULL, background = NULL)
dag_shortest_dist_to_offspring(dag, from, terms = NULL, background = NULL)
dag_longest_dist_from_ancestors(dag, to, terms = NULL, background = NULL)
dag_shortest_dist_from_ancestors(dag, to, terms = NULL, background = NULL)
dag |
An |
from |
A single term name or a vector of term names. |
terms |
A vector of term names. If it is set, the returned vector will be subsetted to the terms that have been set here. |
background |
A vector of terms. Then the lookup will only be applied in this set of terms. |
to |
Same format as the |
If from
or to
is a vector, for a specific term, the longest/shortest distance among all from
/to
terms is taken.
As a special case, when from
is the root term, dag_longest_dist_to_offspring()
is the same as dag_depth()
,
and when to
are all leaf terms, dag_longest_dist_from_ancestors()
is the same as dag_height()
.
An integer vector having length the same as the number of terms in the DAG. If terms are not
reachable to the from
or to
terms, the corresponding value is -1.
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
dag = create_ontology_DAG(parents, children)
dag_longest_dist_from_ancestors(dag, "e")
dag_shortest_dist_from_ancestors(dag, "e")
dag_longest_dist_to_offspring(dag, "b")
Parent/child/ancestor/offspring terms
dag_parents(dag, term, in_labels = TRUE)
dag_children(dag, term, in_labels = TRUE)
dag_siblings(dag, term, in_labels = TRUE)
dag_ancestors(dag, term, in_labels = TRUE, include_self = FALSE)
dag_offspring(dag, term, in_labels = TRUE, include_self = FALSE)
dag |
An |
term |
The value can be a vector of multiple term names. If it is a vector, it returns
union of the upstream/downstream terms of the selected set of terms. For |
in_labels |
Whether the terms are represented in their names or as integer indices? |
include_self |
For |
An integer vector or a character vector depending on the value of in_labels
.
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
dag = create_ontology_DAG(parents, children)
dag_parents(dag, "b")
dag_parents(dag, "c", in_labels = FALSE)
dag_children(dag, "b")
dag_siblings(dag, "c")
dag_ancestors(dag, "e")
dag_ancestors(dag, "b")
Generate a random DAG
dag_random_tree(
    n_children = 2,
    p_stop = 0,
    max = 2^10 - 1,
    verbose = simona_opt$verbose
)

dag_add_random_children(
    dag,
    p_add = 0.1,
    new_children = c(1, 4),
    add_random_children_fun = NULL,
    verbose = simona_opt$verbose
)

dag_random(
    n_children = 2,
    p_stop = 0,
    max = 2^10 - 1,
    p_add = 0.1,
    new_children = c(1, 4),
    verbose = simona_opt$verbose
)
n_children |
Number of children of a term. The value can also be a vector of length two representing the range of the number of child terms. |
p_stop |
The probability of a term to stop growing. |
max |
Maximal number of terms. |
verbose |
Whether to print messages. |
dag |
An |
p_add |
The probability to add children on each term. |
new_children |
The number or range of numbers of new children if a term is selected to add more children. |
add_random_children_fun |
A function to randomly add children from the DAG. |
dag_random_tree()
generates a random DAG tree from the root term. In a certain step of
the growing, let's denote the set of all leaf terms as L, then in the next round of growing,
floor(length(L)*p_stop)
leaf terms stop growing, and for the remaining leaf terms that
continue to grow, each term adds child terms whose number is uniformly sampled within [ n_children[1], n_children[2] ]
.
The growing stops when the total number of terms in the DAG exceeds max
.
dag_add_random_children()
adds more links in a DAG. Each term is associated with a probability p_add
to add new links where the term, if it is selected, is as a parent term, linking to other terms in the DAG.
The number of new child terms is controlled by new_children
which can be a single number or a range. By default,
new child terms of a term t
are randomly selected from other terms that are lower than the term t
(check the function simona:::add_random_children
). The way how to randomly select new child terms for t
can be controlled by a self-defined function for the add_random_children_fun
argument.
dag_random()
: it simply wraps dag_random_tree()
and dag_add_random_children()
.
An ontology_DAG
object.
tree = dag_random_tree()
dag = dag_random()
Reorder the DAG
dag_reorder(dag, value, verbose = simona_opt$verbose) dag_permutate_children(dag, verbose = simona_opt$verbose)
dag |
An |
value |
A vector of numeric values. See the Details section. |
verbose |
Whether to print messages. |
In dag_reorder()
, there are two ways to set the value
argument. It can be a vector corresponding
to all terms (in the same order as in dag_all_terms()
) or a vector corresponding
to all leaf terms (in the same order as in dag_leaves()
). If value
corresponds to all terms, the score associated with each term is the average value of all its offspring terms.
If value
corresponds to all leaf terms, the score for each term is the average value of all its
connectable leaves.
The reordering is simply applied on each term to reorder its child terms.
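The scoring-and-reordering idea can be sketched in base R. Here `reorder_children()` is a hypothetical helper, not the package's implementation; children are reordered by increasing score:

```r
# Reorder the child terms of every term by a per-term score: children with
# smaller scores come first (i.e., are placed on the left of the diagram).
reorder_children = function(children_list, score) {
    lapply(children_list, function(ch) ch[order(score[ch])])
}

# children of "b" are c("c", "d"); give "c" a larger score so it moves last
children_list = list(a = c("b", "c"), b = c("c", "d"))
score = c(a = 1, b = 1, c = 10, d = 1)
reorder_children(children_list, score)$b  # "d" now comes before "c"
```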
dag_permutate_children()
randomly permutes the child terms under each term.
An ontology_DAG
object.
parents = c("a", "a", "b", "b", "c", "d") children = c("b", "c", "c", "d", "e", "f") # by default, c and e locate on the left side, d and f locate on the right side dag = create_ontology_DAG(parents, children) dag_children(dag, "b") # move c and e to the right side of the diagram dag2 = dag_reorder(dag, value = c(1, 1, 10, 1, 10, 1)) dag_children(dag2, "b") # we can also only set values for leaf terms # there are two leaf terms c and e # we let v(c) > v(e) to move c to the right side of the diagram dag3 = dag_reorder(dag, value = c(10, 1)) dag_children(dag3, "b")
Root or leaves of the DAG
dag_root(dag, in_labels = TRUE) dag_leaves(dag, in_labels = TRUE) dag_is_leaf(dag, terms)
dag |
An |
in_labels |
Whether to return the terms as their names or as integer indices. |
terms |
A vector of term names. |
A character or an integer vector.
parents = c("a", "a", "b", "b", "c", "d") children = c("b", "c", "c", "d", "e", "f") dag = create_ontology_DAG(parents, children) dag_root(dag) dag_leaves(dag)
A shiny app for the DAG
dag_shiny(dag)
dag |
An |
if(FALSE) { dag = create_ontology_DAG_from_GO_db() dag_shiny(dag) }
Reduce the DAG to a tree
dag_treelize(dag, verbose = simona_opt$verbose) dag_as_dendrogram(dag) ## S3 method for class 'ontology_tree' print(x, ...)
dag |
An |
verbose |
Whether to print messages. |
x |
An |
... |
Ignored. |
A tree is a reduced DAG where each child has only one parent. The reduction is applied by a breadth-first search:
starting from the root, at a certain depth (the depth is the maximal distance to the root), for every term t
at this depth,
its child term c
and the parent-child relation are kept only when depth(c) == depth(t) + 1
. Once c
is selected, it is
marked as visited and will not be checked again.
In this way, the depth of every term in the original DAG is identical to its depth in the tree (see the Examples section).
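The depth-preserving reduction can be sketched in base R on an edge list. Here `treelize()` is a hypothetical helper operating on precomputed depths, not the package's implementation:

```r
# Keep, for each child, only one parent-child link whose parent sits exactly
# one level above the child (depth(child) == depth(parent) + 1); once a child
# is visited, its remaining links are dropped.
treelize = function(parents, children, depth) {
    keep = logical(length(parents))
    visited = character(0)
    for (i in order(depth[parents])) {   # breadth-first: shallow parents first
        p = parents[i]; ch = children[i]
        if (!(ch %in% visited) && depth[ch] == depth[p] + 1) {
            keep[i] = TRUE
            visited = c(visited, ch)
        }
    }
    data.frame(parent = parents[keep], child = children[keep])
}

parents  = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
depth = c(a = 0, b = 1, c = 2, d = 2, e = 3, f = 3)  # maximal distances to root
tr = treelize(parents, children, depth)
# the link a -> c is dropped because depth(c) is 2, not depth(a) + 1
```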
dag_as_dendrogram()
converts the tree to a dendrogram
object.
A tree is also an ontology_DAG
object.
parents = c("a", "a", "b", "b", "c", "d") children = c("b", "c", "c", "d", "e", "f") dag = create_ontology_DAG(parents, children) tree = dag_treelize(dag) d1 = dag_depth(dag) d2 = dag_depth(tree) identical(d1, d2) dend = dag_as_dendrogram(tree) dend
Semantic similarity between two groups of terms
group_sim( dag, group1, group2, method, control = list(), verbose = simona_opt$verbose )
dag |
An |
group1 |
A vector of term names or a list of term vectors. |
group2 |
A vector of term names or a list of term vectors. |
method |
A group similarity method. All available methods are in |
control |
A list of parameters passed to individual methods. The term similarity method is controlled by |
verbose |
Whether to print messages. |
If annotation
is set in create_ontology_DAG()
and you want to directly calculate semantic similarity between two
annotated items, you can first get the associated terms of the two items by annotated_terms()
:
group1 = annotated_terms(dag, item1)[[1]] group2 = annotated_terms(dag, item2)[[1]] group_sim(dag, group1, group2, ...)
A numeric scalar, a numeric vector or a matrix, depending on the data type of group1
and group2
.
Denote S(a, b)
as the semantic similarity between terms a
and b
where a
is from group1
and b
is from group2
.
The similarity between group1
and group2
is the average similarity of every pair of individual terms in the two groups:
group_sim = mean_{a in group1, b in group2}(S(a, b))
The term semantic similarity method and the IC method can be set via control
argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_avg", control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")).
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: doi:10.1093/bioinformatics/btg153.
This is the maximal S(a, b)
among all pairs of terms in group1
and group2
:
group_sim = max_{a in group1, b in group2}(S(a, b))
The term semantic similarity method and the IC method can be set via control
argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_max", control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")).
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: doi:10.1109/TCBB.2005.50.
BMA stands for "best-match average". First define similarity of a term to a group of terms as
S(x, group) = max_{y in group}(S(x, y))
which is the similarity between x
and its most similar term in group
.
Then the BMA similarity is calculated as:
group_sim = 0.5*(mean_{a in group1}(S(a, group2)) + mean_{b in group2}(S(b, group1)))
So it is the average of the similarity of every term in group1
to the whole group2
and every term in group2
to the whole group1
.
The term semantic similarity method and the IC method can be set via control
argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_BMA", control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")).
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: doi:10.1155/2012/975783.
BMM stands for "best-match max". It is defined as:
group_sim = max(mean_{a in group1}(S(a, group2)), mean_{b in group2}(S(b, group1)))
The term semantic similarity method and the IC method can be set via control
argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_BMM", control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")).
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: doi:10.1186/1471-2105-7-302.
ABM stands for "average best-match". It is defined as:
group_sim = (sum_{a in group1}(S(a, group2)) + sum_{b in group2}(S(b, group1)))/(n1 + n2)
where n1
and n2
are the number of terms in group1
and group2
.
The term semantic similarity method and the IC method can be set via control
argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_ABM", control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")).
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: doi:10.1186/1471-2105-14-284.
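The five pairwise strategies above differ only in how the matrix of pairwise term similarities is summarized. A base-R sketch with a made-up 3x2 similarity matrix (in practice the matrix would hold S(a, b) for every pair of terms):

```r
# Rows correspond to terms in group1, columns to terms in group2.
S = matrix(c(0.9, 0.2,
             0.4, 0.8,
             0.1, 0.6), nrow = 3, byrow = TRUE)

avg = mean(S)                                   # GroupSim_pairwise_avg
mx  = max(S)                                    # GroupSim_pairwise_max

r  = apply(S, 1, max)                           # S(a, group2) for each a in group1
c_ = apply(S, 2, max)                           # S(b, group1) for each b in group2
bma = 0.5 * (mean(r) + mean(c_))                # GroupSim_pairwise_BMA
bmm = max(mean(r), mean(c_))                    # GroupSim_pairwise_BMM
abm = (sum(r) + sum(c_)) / (nrow(S) + ncol(S))  # GroupSim_pairwise_ABM
```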
First define the distance of a term to a group of terms:
D(x, group) = 1 - S(x, group)
Then the Hausdorff distance between the two groups is:
HDF = max(max_{a in group1}(D(a, group2)), max_{b in group2}(D(b, group1)))
The final similarity is:
group_sim = 1 - HDF
The term semantic similarity method and the IC method can be set via control
argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_HDF", control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")).
Other parameters for the term_sim_method
can also be set in the control
list.
Instead of using the maximal distance from a group to the other group, MHDF uses the mean distance:
MHDF = max(mean_{a in group1}(D(a, group2)), mean_{b in group2}(D(b, group1)))
The final similarity is:
group_sim = 1 - MHDF
The term semantic similarity method and the IC method can be set via control
argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_MHDF", control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")).
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: doi:10.1109/ICPR.1994.576361.
It is defined as:
VHDF = 0.5*(sqrt(mean_{a in group1}(D(a, group2)^2)) + sqrt(mean_{b in group2}(D(b, group1)^2))) group_sim = 1 - VHDF
The term semantic similarity method and the IC method can be set via control
argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_VHDF", control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")).
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: doi:10.1073/pnas.0702965104.
The similarity is:
group_sim = exp(-HDF(group1, group2))
The term semantic similarity method and the IC method can be set via control
argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_Froehlich_2007", control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")).
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: doi:10.1186/1471-2105-8-166.
Similar to VHDF, but it directly uses the similarity:
group_sim = 0.5*(sqrt(mean_{a in group1}(S(a, group2)^2)) + sqrt(mean_{b in group2}(S(b, group1)^2)))
The term semantic similarity method and the IC method can be set via control
argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_Joeng_2014", control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")).
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: doi:10.1109/TCBB.2014.2343963.
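The Hausdorff-style variants can all be written down from the same kind of pairwise similarity matrix (base-R sketch; the matrix values are made up):

```r
# Rows: terms in group1; columns: terms in group2. D = 1 - S.
S = matrix(c(0.9, 0.2,
             0.4, 0.8,
             0.1, 0.6), nrow = 3, byrow = TRUE)
D = 1 - S

d1 = apply(D, 1, min)   # D(a, group2): distance of each a to its closest b
d2 = apply(D, 2, min)   # D(b, group1): distance of each b to its closest a

hdf  = max(max(d1), max(d2));   sim_hdf  = 1 - hdf   # GroupSim_pairwise_HDF
mhdf = max(mean(d1), mean(d2)); sim_mhdf = 1 - mhdf  # GroupSim_pairwise_MHDF
vhdf = 0.5 * (sqrt(mean(d1^2)) + sqrt(mean(d2^2)))
sim_vhdf = 1 - vhdf                                  # GroupSim_pairwise_VHDF
sim_froehlich = exp(-hdf)                            # GroupSim_pairwise_Froehlich_2007

s1 = apply(S, 1, max); s2 = apply(S, 2, max)         # work on similarities directly
sim_joeng = 0.5 * (sqrt(mean(s1^2)) + sqrt(mean(s2^2)))  # GroupSim_pairwise_Joeng_2014
```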
It is based on the average distance between every pair of terms in the two groups:
exp(-mean_{a in group1, b in group2}(d(a, b)))
d(a, b)
is the distance between a
and b
, which can be the shortest distance between the two terms or
the longest distance via the LCA.
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
group_sim(dag, group1, group2, method = "GroupSim_SimALN", control = list(distance = "shortest_distances_via_NCA"))
Paper link: doi:10.1109/ISCC.2008.4625763.
Denote A
and B
as the two sets of ancestors terms of terms in group1
and group2
respectively,
the SimGIC is:
group_sim = sum_{x in intersect(A, B)}(IC(x))/sum_{x in union(A, B)}(IC(x))
IC method can be set via control = list(IC_method = ...)
.
Similar to GroupSim_SimGIC, it calculates the Dice coefficient:
group_sim = 2*sum_{x in intersect(A, B)}(IC(x))/(sum_{x in A}(IC(x)) + sum_{x in B}(IC(x)))
IC method can be set via control = list(IC_method = ...)
.
Similar to GroupSim_SimGIC, it is calculated as:
group_sim = sum_{x in intersect(A, B)}(IC(x))/max(sum_{x in A}(IC(x)), sum_{x in B}(IC(x)))
IC method can be set via control = list(IC_method = ...)
.
It is only based on the number of terms. A
is the set of all ancestors of group1 terms and B
is the set of all ancestors of group2 terms.
group_sim = length(intersect(A, B))/length(union(A, B))
It is:
group_sim = 2*length(intersect(A, B))/(length(A) + length(B))
It is:
group_sim = length(intersect(A, B))/max(length(A), length(B))
It is:
group_sim = length(intersect(A, B))/min(length(A), length(B))
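All the set-based scores above can be computed directly from the two ancestor sets and, for the IC-weighted ones, a vector of IC values. A base-R sketch with made-up sets and ICs (the variable names are illustrative only):

```r
A  = c("a", "b", "c")          # ancestors of group1 terms (made up)
B  = c("a", "b", "d")          # ancestors of group2 terms (made up)
IC = c(a = 0, b = 1, c = 2, d = 2)

I = intersect(A, B); U = union(A, B)

# IC-weighted overlap scores
sim_gic = sum(IC[I]) / sum(IC[U])                     # SimGIC
sim_dic = 2 * sum(IC[I]) / (sum(IC[A]) + sum(IC[B]))  # Dice-type, IC-weighted
sim_uic = sum(IC[I]) / max(sum(IC[A]), sum(IC[B]))    # overlap-type, IC-weighted

# purely count-based scores
sim_jaccard = length(I) / length(U)
sim_dice    = 2 * length(I) / (length(A) + length(B))
sim_max     = length(I) / max(length(A), length(B))
sim_min     = length(I) / min(length(A), length(B))
```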
It is based on the dot product of two vectors p
and q
which correspond to terms in group1
and group2
.
p
and q
have the same length as the total number of terms. The value at position i in p
or q
corresponds to term t
. The value
takes IC(t)
if t
is an ancestor of any term in group1
(for p
) or group2
(for q
), and takes zero otherwise. The
similarity between group1
terms and group2
terms is calculated as:
<p,q>/||p||/||q||
where <p,q>
is the dot product between the two, and ||p||
or ||q||
is the norm of the vector.
The equation can be written as:
group_sim = sum_{x in intersect(A, B)}(IC(x)^2) / sqrt(sum_{x in A}(IC(x)^2)) / sqrt(sum_{x in B}(IC(x)^2))
IC method can be set via control = list(IC_method = ...)
.
Similar to GroupSim_SimCOU, the similarity is:
<p,q>/(||p||^2 + ||q||^2 - <p,q>)
And it can be rewritten as:
group_sim = sum_{x in intersect(A, B)}(IC(x)^2) / (sum_{x in A}(IC(x)^2) + sum_{x in B}(IC(x)^2) - sum_{x in intersect(A, B)}(IC(x)^2))
IC method can be set via control = list(IC_method = ...)
.
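The vector formulation of the two scores above can be reproduced in base R (made-up ancestor sets and IC values):

```r
terms = c("a", "b", "c", "d")
IC = c(a = 0.5, b = 1, c = 2, d = 2)
A = c("a", "b", "c")   # ancestors of group1 terms (made up)
B = c("a", "b", "d")   # ancestors of group2 terms (made up)

# p and q hold IC(t) at positions of ancestor terms, zero elsewhere
p = ifelse(terms %in% A, IC, 0)
q = ifelse(terms %in% B, IC, 0)

dot = sum(p * q)                                 # <p, q>
sim_cou = dot / sqrt(sum(p^2)) / sqrt(sum(q^2))  # cosine of p and q
sim_cot = dot / (sum(p^2) + sum(q^2) - dot)      # Tanimoto-type variant
```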
It is the maximal depth of the terms in intersect(A, B)
.
group_sim = max(depth(intersect(A, B)))
It is a normalized version of GroupSim_SimLP:
group_sim = max(depth(intersect(A, B)))/max_depth
The score is normalized to [0, 1] since the minimal depth, that of the root, is zero.
It is based on the annotated items. Denote sigma(t)
as the total number of items annotated to t
. The similarity is calculated as
group_sim = log(C/sigma_max)/log(sigma_min/sigma_max)
where C
is min_{x in intersect(A, B)}(sigma(x))
, i.e., the minimal sigma in the intersection of group1 and group2. Note
that here A
and B
are just the two sets of terms in group1
and group2
themselves.
sigma_max
is the total number of items annotated to the DAG, and sigma_min
is the minimal number of items annotated to a term, which
is mostly 1.
A
and B
are just two sets of terms in group1
and group2
. The similarity is calculated as:
group_sim = max_{t in intersect(A, B)}(1 - sigma(t)/N)
Say A
is the set of items annotated to terms in group1
and B
is the set of items annotated to group2
.
This is the Jaccard coefficient between the two sets.
The universe/background can be set via control = list(universe = ...)
.
It is the Dice coefficient between A
and B
.
The universe/background can be set via control = list(universe = ...)
.
It is the Overlap coefficient between A
and B
.
The universe/background can be set via control = list(universe = ...)
.
The universe/background can be set via control = list(universe = ...)
.
parents = c("a", "a", "b", "b", "c", "d") children = c("b", "c", "c", "d", "e", "f") annotation = list( "a" = c("t1", "t2", "t3"), "b" = c("t3", "t4"), "c" = "t5", "d" = "t7", "e" = c("t4", "t5", "t6", "t7"), "f" = "t8" ) dag = create_ontology_DAG(parents, children, annotation = annotation) group_sim(dag, c("c", "e"), c("d", "f"), method = "GroupSim_pairwise_avg", control = list(term_sim_method = "Sim_Lin_1998") )
Import ontology file to an ontology_DAG object
import_obo( file, relation_type = character(0), inherit_relations = TRUE, verbose = simona_opt$verbose, ... ) import_owl( file, relation_type = character(0), inherit_relations = TRUE, verbose = simona_opt$verbose, ... ) import_ontology( file, robot_jar = simona_opt$robot_jar, JAVA_ARGS = "", verbose = simona_opt$verbose, ... ) import_ttl(file, relation_type = "part_of", verbose = simona_opt$verbose, ...)
file |
Path of the ontology file or an URL. |
relation_type |
Semantic relation types to include. Note |
inherit_relations |
Relations may also be structured as a DAG. It controls whether to merge a relation with its offspring relations. |
verbose |
Whether to print messages. |
... |
Pass to |
robot_jar |
The path of the |
JAVA_ARGS |
Options for |
Public bio-ontologies can be obtained from the OBO Foundry or BioPortal.
The import_obo()
function parses the ontology file in .obo
format. To parse other formats, external tool robot.jar
is required.
import_owl()
only recognizes <owl:Class>
and <owl:ObjectProperty>
. If the .owl file does not contain these tags,
please use import_ontology()
directly.
robot.jar
can automatically recognize the following formats:
json
: OBO Graphs JSON
obo
: OBO Format
ofn
: OWL Functional
omn
: Manchester
owl
: RDF/XML
owx
: OWL/XML
ttl
: Turtle
The description of the ROBOT tool is at http://robot.obolibrary.org/convert.
import_ttl()
is a simple parser for the .ttl
format files. It only recognizes
terms that have the owl:Class
object. The "is_a" relation is recognized by the predicate rdfs:subClassOf
or an ontology-specific predicate that contains .*/isa
. Other relation types are defined with
the predicate owl:ObjectProperty
. The format is parsed by a Perl script system.file("scripts", "parse_ttl.pl", package = "simona")
.
An ontology_DAG
object.
# The plant ontology: http://obofoundry.org/ontology/po.html import_obo("https://raw.githubusercontent.com/Planteome/plant-ontology/master/po.obo") import_owl("http://purl.obolibrary.org/obo/po.owl") ## Not run: # The plant ontology: http://obofoundry.org/ontology/po.html dag = import_ontology("http://purl.obolibrary.org/obo/po.owl", robot_jar = ...) ## End(Not run) # file is from https://bioportal.bioontology.org/ontologies/MSTDE import_ttl("https://jokergoo.github.io/simona/MSTDE.ttl")
Get or set meta columns on DAG
## S4 method for signature 'ontology_DAG' mcols(x, use.names = TRUE, ...) ## S4 replacement method for signature 'ontology_DAG' mcols(x, ...) <- value
x |
An |
use.names |
Please ignore. |
... |
Other argument. For |
value |
A data frame or a matrix where rows should correspond to terms in |
A data frame.
parents = c("a", "a", "b", "b", "c", "d") children = c("b", "c", "c", "d", "e", "f") dag = create_ontology_DAG(parents, children) mcols(dag) = data.frame(id = letters[1:6], v = 1:6) mcols(dag) mcols(dag, "id") dag
All parameters of a given method
method_param(IC_method = NULL, term_sim_method = NULL, group_sim_method = NULL)
IC_method |
A single IC method name. |
term_sim_method |
A single term similarity method name. |
group_sim_method |
A single group similarity method name. |
A vector of parameter names.
method_param(IC_method = "IC_annotation") method_param(term_sim_method = "Sim_Wang_2007")
Various types of common ancestors
MICA_term( dag, terms, IC_method, in_labels = TRUE, distance = "longest", verbose = simona_opt$verbose ) MICA_IC(dag, terms, IC_method, verbose = simona_opt$verbose) LCA_term( dag, terms, in_labels = TRUE, distance = "longest", verbose = simona_opt$verbose ) LCA_depth(dag, terms, verbose = simona_opt$verbose) NCA_term(dag, terms, in_labels = TRUE, verbose = simona_opt$verbose) max_ancestor_v(dag, terms, value, verbose = simona_opt$verbose) max_ancestor_id( dag, terms, value, in_labels = FALSE, distance = "longest", verbose = simona_opt$verbose ) max_ancestor_path_sum( dag, terms, value, add_v, distance = "longest", verbose = simona_opt$verbose ) CA_terms(dag, term1, term2, in_labels = TRUE)
dag |
An |
terms |
A vector of term names. |
IC_method |
An IC method. Valid values are in |
in_labels |
Whether to return the terms as their names or as integer indices. |
distance |
If there are multiple LCAs or MICAs of two terms, whether to take the one with the longest or the shortest distance to the two terms. Possible values are "longest" and "shortest". |
verbose |
Whether to print messages. |
value |
A numeric vector. The elements should correspond to terms in |
add_v |
Values to be added along the path to the MICA or LCA. The same format as |
term1 |
A single term ID. |
term2 |
A single term ID. |
There are the following three types of common ancestors:
MICA (most informative common ancestor): The common ancestor with the highest IC value.
LCA (lowest common ancestor): The common ancestor with the largest depth (The depth of a term is the maximal distance from the root term). If there are multiple ancestors having the same max depth, the ancestor with the smallest distance to the two terms is used.
NCA (nearest common ancestor): The common ancestor with the smallest distance to the two terms. If there are multiple ancestors with the same smallest distance, the ancestor with the largest depth is used.
max_ancestor_v()
and max_ancestor_id()
are more general functions which return common ancestors with
the highest value in value
.
Given a path connecting two terms and their MICA/LCA, max_ancestor_path_sum()
calculates the sum of values of the terms along the path. The values
to be added are specified in the add_v
argument.
MICA_term()
returns an integer or a character matrix of the MICA terms depending on the value of in_labels
.
MICA_IC()
returns a numeric matrix of the IC of the MICA terms.
LCA_term()
returns an integer or a character matrix of the LCA term depending on the value of in_labels
.
LCA_depth()
returns an integer matrix of the depth of the LCA terms.
NCA_term()
returns an integer or a character matrix of the NCA term depending on the value of in_labels
. The shortest distance from NCA terms can be calculated by shortest_distances_via_NCA()
.
max_ancestor_v()
returns a numeric matrix.
max_ancestor_id()
returns an integer or a character matrix.
CA_terms()
returns a vector of term IDs.
parents = c("a", "a", "b", "b", "c", "d") children = c("b", "c", "c", "d", "e", "f") dag = create_ontology_DAG(parents, children) MICA_term(dag, letters[1:6], "IC_universal") MICA_IC(dag, letters[1:6], "IC_universal") LCA_term(dag, letters[1:6]) LCA_depth(dag, letters[1:6]) NCA_term(dag, letters[1:6]) CA_terms(dag, "c", "d")
parents = c("a", "a", "b", "b", "c", "d") children = c("b", "c", "c", "d", "e", "f") dag = create_ontology_DAG(parents, children) MICA_term(dag, letters[1:6], "IC_universal") MICA_IC(dag, letters[1:6], "IC_universal") LCA_term(dag, letters[1:6]) LCA_depth(dag, letters[1:6]) NCA_term(dag, letters[1:6]) CA_terms(dag, "c", "d")
Number of annotated items
n_annotations( dag, terms = NULL, uniquify = simona_opt$anno_uniquify, use_cache = simona_opt$use_cache ) has_annotation(dag)
dag |
An |
terms |
A vector of term names. If it is set, the returned vector will be subsetted to the terms that have been set here. |
uniquify |
Whether to uniquify items that are annotated to the term? See Details. It is suggested to always be |
use_cache |
Internally used. |
Due to the nature of the DAG, a parent term includes all annotated items of its child terms, and an ancestor term includes all annotated items from its offspring recursively. In current tools, there are two different implementations to deal with such recursive merging.
For a term t
, denote S_1
, S_2
, ... as the sets of annotated items for its child 1, 2, ..., also denote S_t
as the set
of items that are directly annotated to t
. The first method takes the union of annotated items on t
and all its child terms:
n = length(union(S_t, S_1, S_2, ...))
And the second method takes the sum of numbers of items on t
and on all its child terms:
n = length(S_t) + length(S_1) + length(S_2) + ...
In n_annotations()
, when uniquify = TRUE
, the first method is used; and when uniquify = FALSE
, the second method is used.
For some annotation sources, it is possible that an item is annotated to multiple terms; thus, the second method, which simply
adds the numbers from all child terms, may not be proper because an item may be counted multiple times, over-estimating n
. The two methods
are identical only if every item is annotated to a unique term in the DAG.
We suggest always setting uniquify = TRUE
(the default); the scenario of uniquify = FALSE
is only for testing or benchmarking purposes.
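The difference between the two counting modes is easy to see with an item annotated to two sibling terms (base-R sketch; the sets are made up):

```r
S_t = c("t1")            # items directly annotated to term t
S_1 = c("t2", "t3")      # items annotated to child 1
S_2 = c("t3", "t4")      # "t3" is annotated to both children

n_uniquify = length(union(union(S_t, S_1), S_2))      # uniquify = TRUE
n_sum      = length(S_t) + length(S_1) + length(S_2)  # uniquify = FALSE
# n_uniquify is 4, n_sum is 5: "t3" is counted twice in the second mode
```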
n_annotations()
returns an integer vector.
has_annotation()
returns a logical scalar.
parents = c("a", "a", "b", "b", "c", "d") children = c("b", "c", "c", "d", "e", "f") annotation = list( "a" = c("t1", "t2", "t3"), "b" = c("t3", "t4"), "c" = "t5", "d" = "t7", "e" = c("t4", "t5", "t6", "t7"), "f" = "t8" ) dag = create_ontology_DAG(parents, children, annotation = annotation) n_annotations(dag)
Number of parent/child/ancestor/offspring/leaf terms
n_offspring(dag, terms = NULL, use_cache = TRUE, include_self = FALSE) n_ancestors(dag, terms = NULL, use_cache = TRUE, include_self = FALSE) n_connected_leaves(dag, terms = NULL, use_cache = TRUE) n_parents(dag, terms = NULL) n_children(dag, terms = NULL) avg_parents(dag) avg_children(dag)
dag |
An |
terms |
A vector of term names. If the value is |
use_cache |
Internally used. |
include_self |
For |
For n_connected_leaves()
, leaf terms have a value of 1.
In avg_parents()
, the root term is excluded.
In avg_children()
, leaf terms are excluded.
An integer vector.
parents = c("a", "a", "b", "b", "c", "d") children = c("b", "c", "c", "d", "e", "f") dag = create_ontology_DAG(parents, children) n_parents(dag) n_children(dag) n_offspring(dag) n_ancestors(dag) n_connected_leaves(dag)
This class defines the DAG structure of an ontology.
An ontology_DAG
object.
terms
A character vector of length n
of all term names. Other slots that store term-level information use the integer indices of terms.
n_terms
An integer scalar of the total number of terms in the DAG.
n_relations
An integer scalar of the total number of relations in the DAG.
lt_parents
A list of length n
. Each element in the list is an integer index vector of the parent terms of the i^th term.
lt_children
A list of length n
. Each element in the list is an integer index vector of the child terms of the i^th term.
lt_children_relations
A list of length n
. Each element is a vector of the semantic relations between the i^th term and its child terms, e.g. a child "is_a" parent.
The relations are represented as integers. The character name of the relations is in attr(dag@lt_children_relations, "levels")
.
relations_DAG
A simple ontology_DAG
object but constructed for relation types.
source
The source of the ontology. A character scalar only used as a mark of the returned object.
root
An integer scalar of the root term.
leaves
An integer vector of the indices of leaf terms.
alternative_terms
A named character vector of mappings between alternative terms to DAG terms.
tpl_sorted
An integer vector of reordered term indices which has been topologically sorted in the DAG. Terms are sorted first by the depth (maximal distance from root), then the number of child terms, then the number of parent terms, and last the term names.
tpl_pos
The position of the original term in the topologically sorted path (similar as the rank), e.g. the value of the first element in the vector is the position of term 1 in the topologically sorted path.
annotation
A list of two elements: list
and names
. The dag@annotation$list
element contains a list of length n
and each element
is a vector of integer indices of annotated items. The full list of annotated items is in dag@annotation$names
.
term_env
An environment which contains various term-level statistics. It is mainly for cache purpose.
aspect_ratio
A numeric vector of length two. The aspect ratio is calculated as w/h
. For each term there is a distance to the root;
h
is the maximal distance over all terms, and w
is the maximal number of terms having the same distance. The two values in the aspect_ratio
slot
use the maximal distance to the root (the height) and the shortest distance to the root, respectively, as the distance measure.
elementMetadata
An additional data frame with the same number of rows as the number of terms in DAG. Order of rows should be the same as order of terms in dag@terms
.
# This function should not be used directly.
Import ontologies already having gene annotations
ontology_kw(organism = "human", gene_annotation = TRUE,
    verbose = simona_opt$verbose, ...)

ontology_chebi(organism = c("human", "mouse", "rat", "pig", "dog"),
    gene_annotation = TRUE, verbose = simona_opt$verbose, ...)

ontology_hp(organism = c("human", "mouse"), gene_annotation = TRUE,
    verbose = simona_opt$verbose, ...)

ontology_pw(organism = c("human", "mouse", "rat", "pig", "dog", "chimpanzee"),
    gene_annotation = TRUE, verbose = simona_opt$verbose, ...)

ontology_rdo(organism = c("human", "mouse", "rat", "pig", "dog", "chimpanzee"),
    gene_annotation = TRUE, verbose = simona_opt$verbose, ...)

ontology_vt(organism = c("human", "mouse", "rat", "pig", "dog", "chimpanzee"),
    gene_annotation = TRUE, verbose = simona_opt$verbose, ...)

ontology_go(...)
organism |
Organism. |
gene_annotation |
Whether to add gene annotations to the DAG. |
verbose |
Whether to print messages. |
... |
Pass to |
There are the following ontologies:
ontology_kw()
: UniProt Keywords. The list of supported organisms can be found in UniProtKeywords::load_keyword_genesets()
.
ontology_chebi()
: Chemical Entities of Biological Interest.
ontology_hp()
: The Human Phenotype Ontology.
ontology_pw()
: Pathway Ontology.
ontology_rdo()
: RGD Disease Ontology.
ontology_vt()
: Vertebrate Trait Ontology.
The source of the original files can be found with simona:::RGD_TB
.
ontology_go()
is an alias of create_ontology_DAG_from_GO_db(). All arguments are passed there.
Partition the DAG
partition_by_level(dag, level = 1, from = NULL, term_pos = NULL)
partition_by_size(dag, size = round(dag_n_terms(dag)/5))
dag |
An |
level |
Depth in the DAG at which to cut. The DAG is cut below terms at this depth (i.e. the links to their child terms are cut). |
from |
A list of terms to cut. If it is set, |
term_pos |
Internally used. |
size |
Number of terms in a cluster. The splitting stops on a term if all its child-trees are smaller than |
Let's call the terms below the from term the "top terms", because they will be on top of the sub-DAGs after the partitioning. It is possible that a term in the middle of the DAG can be traced back to more than one top term. To partition all terms exclusively, such a term is assigned to the sub-DAG of the top term with the largest distance to it. If a term has the same largest distance to several top terms, a random top term is selected.
In partition_by_size(), the DAG is first reduced to a tree where each child term has only one parent. The partition is then done recursively by cutting into child-trees. The splitting stops when all the child-trees have sizes less than size.
NA
is assigned to the from
terms, their ancestor terms, and terms having infinite directed distance to from
terms.
A character vector of top terms in each partition.
dag = create_ontology_DAG_from_GO_db()
pa = partition_by_level(dag)
table(pa)
pa = partition_by_size(dag, size = 1000)
table(pa)
Print the source
## S3 method for class 'print_source'
print(x, ...)
x |
An object in the |
... |
Other arguments. |
Internally used.
Randomly sample terms/items
random_terms(dag, n)
random_items(dag, n)
dag |
An |
n |
Number of terms or items. |
A character vector of terms or items.
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
annotation = list(
    "a" = c("t1", "t2", "t3"),
    "b" = c("t3", "t4"),
    "c" = "t5",
    "d" = "t7",
    "e" = c("t4", "t5", "t6", "t7"),
    "f" = "t8"
)
dag = create_ontology_DAG(parents, children, annotation = annotation)
random_terms(dag, 3)
random_items(dag, 3)
Distance on the DAG
shortest_distances_via_NCA(dag, terms, verbose = simona_opt$verbose)
longest_distances_via_LCA(dag, terms, verbose = simona_opt$verbose)
shortest_distances_directed(dag, terms, verbose = simona_opt$verbose)
longest_distances_directed(dag, terms, verbose = simona_opt$verbose)
dag |
An |
terms |
A vector of term names. |
verbose |
Whether to print messages. |
Denote two terms as a and b, a common ancestor as c, and a distance function d() that calculates either the longest or the shortest distance between two terms, depending on the function.
shortest_distances_via_NCA()
: It calculates the smallest d(c, a) + d(c, b)
where d()
calculates the shortest distance between two terms. In this case,
c
is the NCA (nearest common ancestor) of a
and b
.
longest_distances_via_LCA()
: It calculates the largest d(c, a) + d(c, b)
where d()
calculates the longest distance between two terms via the LCA (lowest common ancestor) term. In this case,
c
is the LCA of a
and b
.
shortest_distances_directed()
: It calculates d(a, b)
where d()
calculates the shortest distance between two terms. The distance is only calculated when a
is an ancestor of b
, otherwise the distance value is -1.
longest_distances_directed()
: It calculates d(a, b)
where d()
calculates the longest distance between two terms. The distance is only calculated when a
is an ancestor of b
, otherwise the distance value is -1.
A numeric distance matrix.
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
dag = create_ontology_DAG(parents, children)
shortest_distances_via_NCA(dag, letters[1:6])
longest_distances_via_LCA(dag, letters[1:6])
shortest_distances_directed(dag, letters[1:6])
longest_distances_directed(dag, letters[1:6])
Print the ontology_DAG object
## S4 method for signature 'ontology_DAG'
show(object)
object |
An |
No value is returned.
Global options
simona_opt(..., RESET = FALSE, READ.ONLY = NULL, LOCAL = FALSE, ADD = FALSE)
... |
Name-value pairs for options. |
RESET |
Reset to default option values. |
READ.ONLY |
Only return read only options. |
LOCAL |
Only return local options. |
ADD |
Add new options. |
There are the following global options:
use_cache
: By default, information content of all terms is cached and reused. If use_cache
is set to FALSE
, IC will be re-calculated.
verbose
: Whether to print messages.
anno_uniquify
: In the annotation-based IC method, the union of items annotated to the term and to all its offspring terms is used, which means the set of annotated items for the term is uniquified. If anno_uniquify is set to FALSE, the uniquification is not applied; we simply sum the number of items annotated to the term and the numbers of items annotated to each of its offspring terms.
robot_jar
: Path of the robot.jar
file. The file can be found from https://github.com/ontodev/robot/releases.
To set an option, you can use $
:
simona_opt$verbose = FALSE
or use it as a function:
simona_opt(verbose = FALSE)
A single option value.
simona_opt
Term-item associations
term_annotations(dag, terms, return = "list")
annotated_terms(dag, anno, return = "list")
dag |
An |
terms |
A vector of term names. |
return |
Whether to return a list or a matrix. |
anno |
A vector of annotated item names. |
If an item is annotated to a term, the item is also treated as annotated to all of this term's ancestor terms.
A list or a binary matrix showing annotation relations between terms and items.
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
annotation = list(
    "a" = c("t1", "t2", "t3"),
    "b" = c("t3", "t4"),
    "c" = "t5",
    "d" = "t7",
    "e" = c("t4", "t5", "t6", "t7"),
    "f" = "t8"
)
dag = create_ontology_DAG(parents, children, annotation = annotation)
term_annotations(dag, letters[1:6])
term_annotations(dag, letters[1:6], return = "matrix")
annotated_terms(dag, c("t1", "t2", "t3"))
annotated_terms(dag, c("t1", "t2", "t3"), return = "matrix")
Information content
term_IC(dag, method, terms = NULL, control = list(),
    verbose = simona_opt$verbose)
dag |
An |
method |
An IC method. All available methods are in |
terms |
A vector of term names. If it is set, the returned vector will be subsetted to the terms that have been set here. |
control |
A list of parameters passing to individual methods. See the subsections. |
verbose |
Whether to print messages. |
A numeric vector.
Denote k as the number of offspring terms of a term plus the term itself, and N as the same value for the root (i.e. the total number of terms in the DAG). The information content is calculated as:
IC = -log(k/N)
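To make the formula concrete, here is a minimal base-R sketch (no simona call involved); the offspring counts are tallied by hand for the small example DAG used elsewhere on this page (a->b, a->c, b->c, b->d, c->e, d->f):

```r
# IC from offspring counts: IC = -log(k/N), where k counts a term's
# offspring plus the term itself and N is the root's value.
k <- c(a = 6, b = 5, c = 2, d = 2, e = 1, f = 1)
N <- k[["a"]]      # the root's offspring set covers all terms
ic <- -log(k / N)
round(ic, 3)       # root gets 0, leaves get the maximal IC log(6)
```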
For a term t
in the DAG, denote d
as the maximal distance from root (i.e. the depth) and h
as the maximal distance to leaves (i.e. the height),
the relative position p
on the longest path from root to leaves via term t
is calculated as:
p = (h + 1)/(h + d + 1)
Adding 1 in the formula avoids p = 0. Then the information content is:
IC = -log(p) = -log((h+1)/(h+d+1))
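As a quick numeric check of the formula, a one-line base-R helper (the depth and height values below are hypothetical):

```r
# Relative position on the longest root-to-leaf path through a term:
# p = (h + 1) / (h + d + 1), IC = -log(p).
relative_ic <- function(d, h) -log((h + 1) / (h + d + 1))
relative_ic(d = 0, h = 3)   # the root: p = 1, IC = 0
relative_ic(d = 3, h = 0)   # a leaf at depth 3: IC = log(4)
```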
Denote k
as the number of items annotated to a term t
, and N
is the number of items annotated to the root (which is
the total number of items annotated to the DAG), IC for term t
is calculated as:
IC = -log(k/N)
In current implementations in other tools, there is an inconsistency in how k and N are defined.
Please see n_annotations()
for explanation.
NA
is assigned to terms with no item annotated.
It measures the probability of a term getting full transmission from the root. Each term is associated with a p-value and the root has the p-value of 1.
For example, an intermediate term t
has two parent terms parent1
and parent2
, also assume parent1
has k1
children
and parent2
has k2
children, assume a parent transmits information equally to all its children, then respectively parent1
only transmits 1/k1
and
parent2
only transmits 1/k2
of its content to term t
, or the probability of a parent to reach t
is 1/k1
or 1/k2
.
Let's say p1
and p2
are the accumulated contents from the root to parent1
and parent2
respectively (or the probability
of the two parent terms getting full transmission from root), then the probability of reaching t
via a full transmission graph from parent1
is the multiplication of p1
and 1/k1
, which is p1/k1
, and same for p2/k2
. Then, for term t
, if getting transmitted from parent1
and
parent2
are independent, the probability of t
(denoted as p_t
) to get transmitted from both parents is:
p_t = (p1/k1) * (p2/k2)
Since the two parents are the full set of t
's parents, p_t
is the probability of t
getting full transmission from root. And the final
information content is:
IC = -log(p_t)
Paper link: doi:10.1155/2012/975783.
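The multiplication rule above can be sketched in a few lines of base R; the accumulated probabilities and child counts below are hypothetical toy values:

```r
# p_t = (p1/k1) * (p2/k2): probability of full transmission from root,
# assuming transmissions through the two parents are independent.
p_parent <- c(p1 = 0.5, p2 = 0.25)  # full-transmission probabilities of the parents
k_child  <- c(p1 = 2,   p2 = 4)     # number of children of each parent
p_t <- prod(p_parent / k_child)
ic  <- -log(p_t)
```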
It measures the number of ways from a term to reach leaf terms. E.g. in the following DAG:
  a        upstream
 /|\
b | c
  |/
  d        downstream
term a
has three ways to reach a leaf, which are a->b
, a->d
and a->c->d
.
Let's denote k
as the number of ways for term t
to reach leaves and N
as the maximal value of k
which
is associated to the root term, the information content is calculated as
IC = -log(k/N) = log(N) - log(k)
Paper link: doi:10.1186/1471-2105-7-135.
It is based on the number of offspring terms of term t
.
The information content is calculated as:
IC = 1 - log(k+1)/log(N)
where k
is the number of offspring terms of t
, or you can think k+1
is the number of t
's offspring terms plus itself.
N
is the total number of terms on the DAG.
Paper link: doi:10.5555/3000001.3000272.
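A minimal base-R sketch of the IC_Seco_2004 formula; the counts refer to the six-term toy DAG used in the examples on this page, tallied by hand:

```r
# IC_Seco: IC = 1 - log(k + 1)/log(N), where k is the number of offspring
# of the term and N is the total number of terms in the DAG.
seco_ic <- function(k, N) 1 - log(k + 1) / log(N)
seco_ic(k = 5, N = 6)   # the root (5 offspring in a 6-term DAG) -> 0
seco_ic(k = 0, N = 6)   # a leaf -> 1
```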
It is a correction of IC_Seco_2004 which considers the depth of a term in the DAG. The information content is calculated as:
IC = 0.5*IC_Seco + 0.5*log(depth)/log(max_depth)
where depth
is the depth of term t
in the DAG, defined as the maximal distance from root. max_depth
is the largest depth in the DAG.
So the IC is composed of two parts: the number of offspring terms and the position in the DAG.
Paper link: doi:10.1109/FGCNS.2008.16.
It is also a correction to IC_Seco_2004, but it considers the number of relations connecting a term (i.e. the number of parent terms and child terms). The information content is defined as:
IC = (1-sigma)*IC_Seco + sigma*log(n_parents + n_children + 1)/log(total_edges + 1)
where n_parents
and n_children
are the numbers of parents and children of term t
. The tuning factor sigma
is defined as
sigma = log(total_edges+1)/(log(total_edges) + log(total_terms))
where total_edges
is the number of all relations (all parent-child relations)
and total_terms
is the number of all terms in the DAG.
Paper link: doi:10.5555/1862330.1862343.
It measures the average contribution of term t on leaf terms. First denote zeta as the number of leaf terms that can be reached from term t (i.e. t's offspring terms that are leaves). Since all of t's ancestors can also reach t's leaves, the contribution of t on leaf terms is scaled by n_ancestors, which is the number of t's ancestor terms.
The final information content is normalized by the total number of leaves in the DAG, which is the possible maximal value of zeta
.
The complete definition of information content is:
IC = -log( (zeta/n_ancestor) / n_all_leaves)
Paper link: doi:10.1016/j.knosys.2010.10.001.
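Plugging hypothetical counts into the formula above (a term reaching 1 leaf, with 4 ancestors, in a DAG with 2 leaves in total):

```r
# IC_Sanchez as defined above: IC = -log((zeta/n_ancestors)/n_all_leaves).
sanchez_ic <- function(zeta, n_ancestors, n_all_leaves)
    -log((zeta / n_ancestors) / n_all_leaves)
sanchez_ic(zeta = 1, n_ancestors = 4, n_all_leaves = 2)   # = log(8)
```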
It has a complex form which takes account of the term depth and the downstream of the term. The first factor is calculated as:
f1 = log(depth)/log(max_depth)
The second factor is calculated as:
f2 = 1 - log(1 + sum_{x in t's offspring}(1/depth_x))/log(total_terms)
In the equation, the summation goes over t
's offspring terms.
The final information content is the multiplication of f1
and f2
:
IC = f1 * f2
Paper link: http://article.nadiapub.com/IJGDC/vol5_no3/6.pdf.
There is one parameter correct
. If it is set to TRUE
, the first factor f1
is calculated as:
f1 = log(depth + 1)/log(max_depth + 1)
correct
can be set as:
term_IC(dag, method = "IC_Meng_2012", control = list(correct = TRUE))
Each relation is weighted by a value less than 1 based on the semantic relation, i.e. 0.8 for "is_a" and 0.6 for "part_of".
For a term t
and one of its ancestor term a
, it first calculates an "S-value" which corresponds to a path from a
to t
where
the accumulated multiplication of weights along the path reaches maximal:
S(a->t) = max_{path}(prod_{edge on the path}(w))
Here max
goes over all possible paths from a
to t
, and prod()
multiplies edge weights in a certain path.
The formula can be transformed as (we simply rewrite S(a->t)
to S
):
1/S = min(prod(1/w))
log(1/S) = log(min(prod(1/w))) = min(sum(log(1/w)))
Since w < 1
, log(1/w)
is positive. According to the equation, the path (a->...->t
) is actually the shortest path from a
to t
by taking
log(1/w)
as the weight, and log(1/S)
is the weighted shortest distance.
If S(a->t)
can be thought of as the maximal semantic contribution from a
to t
, the information content is calculated
as the sum from all t
's ancestors (including t
itself):
IC = sum_{a in t's ancestors + t}(S(a->t))
Paper link: doi:10.1093/bioinformatics/btm087.
The contribution of different semantic relations can be set with the contribution_factor
parameter. The value should be a named numeric
vector where names should cover the relations defined in relations
set in create_ontology_DAG()
. For example, if there are two relations
"relation_a" and "relation_b" set in the DAG, the value for contribution_factor
can be set as:
term_IC(dag, method = "IC_Wang", control = list(contribution_factor = c("relation_a" = 0.8, "relation_b" = 0.6)))
Note the IC_Wang_2007 method is normally used within the Sim_Wang_2007 semantic similarity method.
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
annotation = list(
    "a" = c("t1", "t2", "t3"),
    "b" = c("t3", "t4"),
    "c" = "t5",
    "d" = "t7",
    "e" = c("t4", "t5", "t6", "t7"),
    "f" = "t8"
)
dag = create_ontology_DAG(parents, children, annotation = annotation)
term_IC(dag, "IC_annotation")
Semantic similarity
term_sim(dag, terms, method, control = list(), verbose = simona_opt$verbose)
dag |
An |
terms |
A vector of term names. |
method |
A term similarity method. All available methods are in |
control |
A list of parameters passing to individual methods. See the subsections. |
verbose |
Whether to print messages. |
A numeric symmetric matrix.
The similarity between two terms a
and b
is calculated as the IC of their MICA term c
normalized by the average of the IC of the two terms:
sim = IC(c)/((IC(a) + IC(b))/2) = 2*IC(c)/(IC(a) + IC(b))
Although any IC method can be used here, in more applications, it is normally used together with the IC_annotation method.
Paper link: doi:10.5555/645527.657297.
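The normalization can be sketched directly in base R; the IC values below are hypothetical:

```r
# Lin_1998: IC of the MICA normalized by the average IC of the two terms.
lin_sim <- function(ic_a, ic_b, ic_mica) 2 * ic_mica / (ic_a + ic_b)
lin_sim(ic_a = 2, ic_b = 3, ic_mica = 1.5)   # -> 0.6
```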
The IC method is fixed to IC_annotation
.
The original Resnik similarity is the IC of the MICA term. There are three ways to normalize the Resnik similarity into the scale of [0, 1]
:
Nunif
sim = IC(c)/log(N)
where N
is the total number of items annotated to the whole DAG, i.e. number of items annotated to the root. Then the IC
of a term with only one item annotated is -log(1/N)
= log(N), which is the maximal IC value in the DAG.
Nmax
IC_max
is the maximal IC of all terms. If there is a term with only one item annotated, Nmax
is identical to the Nunif method.
sim = IC(c)/IC_max
Nunivers
The IC is normalized by the maximal IC of term a
and b
.
sim = IC(c)/max(IC(a), IC(b))
Paper link: doi:10.1613/jair.514, doi:10.1186/1471-2105-9-S5-S4, doi:10.1186/1471-2105-11-562, doi:10.1155/2013/292063.
The normalization method can be set with the norm_method
parameter:
term_sim(dag, terms, control = list(norm_method = "Nmax"))
Possible values for the norm_method
parameter are "Nunif", "Nmax", "Nunivers" and "none".
It is calculated as:
sim = IC(c)/(IC(a) + IC(b) - IC(c))
The relation between FaITH_2010 similarity and Lin_1998 similarity is:
sim_FaITH = sim_Lin/(2 - sim_Lin)
Paper link: doi:10.1007/978-3-642-17746-0_39.
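The stated relation between the two measures can be checked numerically; the IC values below are hypothetical:

```r
# FaITH_2010 vs Lin_1998: sim_FaITH = sim_Lin / (2 - sim_Lin).
faith_sim <- function(ic_a, ic_b, ic_mica) ic_mica / (ic_a + ic_b - ic_mica)
lin_sim   <- function(ic_a, ic_b, ic_mica) 2 * ic_mica / (ic_a + ic_b)
s_faith <- faith_sim(2, 3, 1.5)
s_lin   <- lin_sim(2, 3, 1.5)
abs(s_faith - s_lin / (2 - s_lin))   # numerically zero
```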
The IC method is fixed to IC_annotation
.
If we think of Lin_1998 as a measure of how close terms a and b are to their MICA term c, the Relevance method corrects it by multiplying
a factor which considers the specificity of how c
brings the information. The factor is calculated as 1-p(c)
where p(c)
is the annotation-based
probability p(c) = k/N
where k
is the number of items annotated to c
and N
is the total number of items annotated to the DAG. Then
the Relevance semantic similarity is calculated as:
sim = (1 - p(c)) * IC_Lin = (1 - p(c)) * 2*IC(c)/(IC(a) + IC(b))
Paper link: doi:10.1186/1471-2105-7-302.
The IC method is fixed to IC_annotation
.
The SimIC method is an improved correction of the Relevance method, because the latter works badly when p(c)
is very small. The SimIC
correction factor for MICA term c
is:
1 - 1/(1 + IC(c))
Then the similarity is:
sim = (1 - 1/(1 + IC(c))) * IC_Lin = (1 - 1/(1 + IC(c))) * 2*IC(c)/(IC(a) + IC(b))
Paper link: doi:10.48550/arXiv.1001.0958.
The IC method is fixed to IC_annotation
.
Being different from the "Relevance" and "SimIC_2010" methods that only use the IC of the MICA term, the XGraSM_2013 uses IC of all common ancestor terms of a
and b
.
First it calculates the mean IC of all common ancestor terms with positive IC values:
IC_mean = mean_t(IC(t)) where t is an ancestor of both a and b, and IC(t) > 0
then similar to the Lin_1998 method, normalize to the average IC of a
and b
:
sim = IC_mean*2/(IC(a) + IC(b))
Paper link: doi:10.1186/1471-2105-14-284.
The IC method is fixed to IC_annotation
.
It also selects a subset of common ancestors of terms a
and b
. It only selects common ancestors which can reach a
or b
via one of its child terms
that does not belong to the common ancestors. In other words, from the common ancestor, there exists a path where
the information is uniquely transmitted to a
or b
, not passing the other.
Then the mean IC of the subset common ancestors is calculated and normalized by the Lin_1998 method.
Paper link: doi:10.1016/j.gene.2014.12.062.
It uses the aggregate information content from ancestors. First define the semantic weight (Sw
) of a term t
in the DAG:
Sw = 1/(1 + exp(-1/IC(t)))
Then calculate the aggregation only in the common ancestors and the aggregation
in the ancestors of the two terms a
and b
separately:
SV_{common ancestors} = sum_{t in common ancestors}(Sw(t))
SV_a = sum_{a' in a's ancestors}(Sw(a'))
SV_b = sum_{b' in b's ancestors}(Sw(b'))
The similarity is calculated as the ratio between the aggregation on the common ancestors and the average on a
's ancestors and b
's ancestors.
sim = 2*SV_{common_ancestors}/(SV_a + SV_b)
Paper link: doi:10.1109/tcbb.2013.176.
It uses the IC_Zhang_2006 IC method and the Lin_1998 method to calculate similarities:
sim = 2*IC_zhang(c)/(IC_zhang(a) + IC_zhang(b))
It uses the IC_universal IC method and the Nunivers method to calculate similarities:
sim = IC_universal(c)/max(IC_universal(a), IC_universal(b))
First, S-value of an ancestor term c
on a term a
(S(c->a)
) is calculated (the definition of the S-value can be found in the help page of term_IC()
).
Similar to the Sim_AIC_2014, aggregation only to common ancestors, to a
's ancestors and to b
's ancestors are calculated.
SV_{common ancestors} = sum_{c in common ancestors}(S(c->a) + S(c->b))
SV_a = sum_{a' in a's ancestors}(S(a'->a))
SV_b = sum_{b' in b's ancestors}(S(b'->b))
Then the similarity is calculated as:
sim = SV_{common_ancestors}*2/(SV_a + SV_b)
Paper link: doi:10.1093/bioinformatics/btm087.
The contribution of different semantic relations can be set with the contribution_factor
parameter. The value should be a named numeric
vector where names should cover the relations defined in relations
set in create_ontology_DAG()
. For example, if there are two relations
"relation_a" and "relation_b" set in the DAG, the value for contribution_factor
can be set as:
term_sim(dag, terms, method = "Sim_Wang_2007", control = list(contribution_factor = c("relation_a" = 0.8, "relation_b" = 0.6)))
It is very similar to Sim_Wang_2007, but with a corrected contribution factor when calculating the S-value. From a parent term to a child term, Sim_Wang_2007 directly uses a weight for the relation between the parent and the child, e.g. 0.8 for the "is_a" relation type and 0.6 for the "part_of" relation type. In Sim_GOGO_2018, the weight is also scaled by the total number of children of that parent:
w = 1/(c + nc) + w_0
where w_0 is the original contribution factor, nc
is the number of child terms of the parent, c
is calculated to ensure that
maximal value of w
is no larger than 1, i.e. c = max(w_0)/(1 - max(w_0))
, assuming minimal value of nc
is 1. By default Sim_GOGO_2018
sets contribution factor of 0.4 for "is_a" and 0.3 for "part_of", then w = 1/(2/3 + nc) + w_0
.
Paper link: doi:10.1038/s41598-018-33219-y.
The contribution of different semantic relations can be set with the contribution_factor
parameter. The value should be a named numeric
vector where names should cover the relations defined in relations
set in create_ontology_DAG()
. For example, if there are two relations
"relation_a" and "relation_b" set in the DAG, the value for contribution_factor
can be set as:
term_sim(dag, terms, method = "Sim_GOGO_2018", control = list(contribution_factor = c("relation_a" = 0.4, "relation_b" = 0.3)))
It is based on the distance between term a
and b
. It is defined as:
sim = 1/(1 + d(a, b))
The distance can be the shortest distance between a
and b
or the longest distance via the LCA term.
Paper link: doi:10.1109/21.24528.
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Rada_1989", control = list(distance = "shortest_distances_via_NCA"))
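The transformation from distance to similarity is a simple reciprocal; a base-R sketch with hypothetical distances:

```r
# Sim_Rada_1989: sim = 1/(1 + d(a, b)).
rada_sim <- function(d) 1 / (1 + d)
rada_sim(0)   # identical terms -> 1
rada_sim(3)   # -> 0.25
```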
It is also based on the distance between term a
and b
:
sim = 1 - d(a, b)/2/max_depth
where max_depth
is the maximal depth (maximal distance from root) in the DAG. Similarly, d(a, b)
can be the shortest
distance or the longest distance via LCA.
Paper link: doi:10.1145/1097047.1097051.
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Resnik_edge_2005", control = list(distance = "shortest_distances_via_NCA"))
It is similar to the Sim_Resnik_edge_2005 method, but it applies a log-transformation to the distance and the depth:
sim = 1 - log(d(a, b) + 1)/log(2*max_depth + 1)
Paper link: doi:10.1186/1471-2105-13-261.
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Leocock_1998", control = list(distance = "shortest_distances_via_NCA"))
It is based on the depth of the LCA term c
and the longest distance between term a
and b
:
sim = 2*depth(c)/(len_c(a, b) + 2*depth(c))
where len_c(a, b)
is the longest distance between a
and b
via LCA c
. The denominator in the equation can also be written as:
len_c(a, b) + 2*depth(c) = depth(c) + len(c, a) + depth(c) + len(c, b) = depth_c(a) + depth_c(b)
where depth_c(a)
is the longest distance from root to a
passing through c
.
Paper link: doi:10.3115/981732.981751.
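Using the rewritten denominator, the Sim_WP_1994 formula can be sketched with hypothetical depth and path lengths:

```r
# Sim_WP_1994: sim = 2*depth(c) / (2*depth(c) + len(c, a) + len(c, b)).
wp_sim <- function(depth_c, len_ca, len_cb)
    2 * depth_c / (2 * depth_c + len_ca + len_cb)
wp_sim(depth_c = 2, len_ca = 1, len_cb = 2)   # = 4/7
```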
It is a correction of the Sim_WP_1994 method. The correction factor for term a
and b
with respect to their LCA c
is:
CF(a, b) = (1-lambda)*(min(depth(a), depth(b)) - depth(c)) + lambda/(1 + abs(depth(a) - depth(b)))
lambda
takes value of 1 if a
and b
are in ancestor-offspring relation, or else it takes 0.
Paper link: https://zenodo.org/record/1075130.
It is a correction of the Sim_WP_1994 method. The correction factor for term a
and b
is:
CF(a, b) = exp(-lambda*d(a, b)/max_depth)
lambda
takes value of 1 if a
and b
are in ancestor-offspring relation, or else it takes 0. d(a, b) is the distance between a and b.
Paper link: doi:10.48550/arXiv.1211.4709.
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Leocock_1998", control = list(distance = "shortest_distances_via_NCA"))
It is very similar to the Sim_WP_1994 method:
sim = depth(c)/(len_c(a, b) + depth(c)) = d(root, c)/(d(c, a) + d(c, b) + d(root, c))
where d(a, b)
is the longest distance between a
and b
.
Paper link: https://aclanthology.org/C02-1090/.
It is purely based on the depth of term a
, b
and their LCA c
.
sim = depth(c)/(depth(a) + depth(b) - depth(c))
The similarity value might be negative because there is no restriction that the path from root to a
or b
must pass c
.
Paper link: doi:10.1145/500737.500762.
It is calculated as:
sim = depth(c)^2/(depth_c(a)*depth_c(b))
where depth_c(a)
is the longest distance between root to a
passing through c
.
Paper link: doi:10.1186/1477-5956-10-s1-s18.
For a term x
, it first calculates a "mile-stone" value as
m(x) = 0.5/2^depth(x)
Then the distance between term a
and b
via LCA term c
is:
D(c, a) + D(c, b) = m(c) - m(a) + m(c) - m(b)
                  = 2*m(c) - m(a) - m(b)
                  = 1/2^depth(c) - 0.5/2^depth(a) - 0.5/2^depth(b)
We change the original depth(a)
to let it go through LCA term c
when calculating the depth:
1/2^depth(c) - 0.5/2^depth(a) - 0.5/2^depth(b)
  = 1/2^depth(c) - 0.5/2^(depth(c) + len(c, a)) - 0.5/2^(depth(c) + len(c, b))
  = 1/2^depth(c) * (1 - 1/2^(len(c, a) + 1) - 1/2^(len(c, b) + 1))
  = 2^-depth(c) * (1 - 2^-(len(c, a) + 1) - 2^-(len(c, b) + 1))
And the final similarity is 1 - distance
:
sim = 1 - 2^-depth(c) * (1 - 2^-(len(c, a) + 1) - 2^-(len(c, b) + 1))
Paper link: doi:10.1007/3-540-45483-7_8.
There is a parameter depth_via_LCA
that can be set to TRUE
or FALSE
. If it is set to TRUE, depth(a) is re-defined so that the path from root passes through the LCA term c. If it is FALSE, the original similarity definition from the paper is used; note the similarity might then be negative.
term_sim(dag, terms, method = "Sim_Zhong_2002", control = list(depth_via_LCA = FALSE))
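With depths measured through the LCA, the final expression can be sketched as follows; the depth and path lengths are hypothetical:

```r
# Sim_Zhong_2002 (depth via LCA):
# sim = 1 - 2^-depth(c) * (1 - 2^-(len(c,a)+1) - 2^-(len(c,b)+1)).
zhong_sim <- function(depth_c, len_ca, len_cb)
    1 - 2^-depth_c * (1 - 2^-(len_ca + 1) - 2^-(len_cb + 1))
zhong_sim(depth_c = 0, len_ca = 0, len_cb = 0)   # root against itself -> 1
zhong_sim(depth_c = 1, len_ca = 1, len_cb = 2)   # -> 0.6875
```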
It also takes account of the distance between term a
and b
, and the depth of the LCA term c
in the DAG.
The distance is calculated as:
D(a, b) = log(1 + d(a, b)*(max_depth - depth(c)))
Here d(a, b)
can be the shortest distance between a
and b
or the longest distance via LCA c
.
Then the distance is transformed into the similarity value scaled by the possible maximal and minimal values of D(a, b)
from the DAG:
D_max = log(1 + 2*max_depth * max_depth)
And the minimal value of D(a, b)
is zero when a
is identical to b
. Then the similarity value is scaled as:
sim = 1 - D(a, b)/D_max
Paper link: doi:10.1109/IEMBS.2006.259235.
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_AlMubaid_2006", control = list(distance = "shortest_distances_via_NCA"))
It is similar to the Sim_AlMubaid_2006 method, but uses a non-linear form:
sim = exp(-0.2*d(a, b)) * atan(0.6*depth(c))
where d(a, b)
can be the shortest distance or the longest distance via LCA.
Paper link: doi:10.1109/TKDE.2003.1209005.
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Li_2003", control = list(distance = "shortest_distances_via_NCA"))
The similarity is adjusted by the positions of term a
, b
and the LCA term c
in the DAG. The similarity is defined as:
sim = max_depth/(max_depth + d(a, b)) * alpha/(alpha + beta)
where d(a, b)
is the distance between a
and b
which can be the shortest distance or the longest distance via LCA.
In the tuning factor, alpha
is the distance of LCA to root, which is depth(c)
. beta
is the distance to leaves, which
is the minimal distance (or the minimal height) of term a
and b
:
alpha/(alpha + beta) = depth(c)/(depth(c) + min(height(a), height(b)))
Paper link: doi:10.1371/journal.pone.0066745.
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_RSS_2013", control = list(distance = "shortest_distances_via_NCA"))
It is similar to the Sim_RSS_2013 method, but it uses information content instead of the distance to adjust the similarity.
It first defines the semantic distance between term a
and b
as the sum of the distance to their MICA term c
:
D(a, b) = D(c, a) + D(c, b)
And the distance between an ancestor to a term is:
D(c, a) = IC(a) - IC(c)                              # if c is an ancestor of a
D(a, b) = D(c, a) + D(c, b) = IC(a) + IC(b) - 2*IC(c)  # if c is the MICA of a and b
Similarly, the similarity is also corrected by the positions of the MICA term c and terms a and b in the DAG:
sim = 1/(1 + D(a, b)) * alpha/(alpha + beta)
Now alpha is the IC of the MICA term:
alpha = IC(c)
And beta is the average of the maximal semantic distances of a and b to the leaves:
beta = 0.5*(IC(l_a) - IC(a) + IC(l_b) - IC(b))
where l_a is the leaf that a can reach with the highest IC (i.e., its most informative leaf), and similarly for l_b.
Paper link: doi:10.1371/journal.pone.0066745.
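The pieces above can be assembled into a small sketch (all IC values are hypothetical; ic_leaf_a and ic_leaf_b stand for the ICs of the most informative leaves reachable from a and b):

```r
# Sketch of the HRSS-style similarity assembled from the formulas above.
sim_hrss = function(ic_a, ic_b, ic_mica, ic_leaf_a, ic_leaf_b) {
    D     = ic_a + ic_b - 2*ic_mica                        # semantic distance via the MICA
    alpha = ic_mica                                        # IC of the MICA term
    beta  = 0.5*((ic_leaf_a - ic_a) + (ic_leaf_b - ic_b))  # mean distance to the leaves
    1/(1 + D) * alpha/(alpha + beta)
}

sim_hrss(ic_a = 3, ic_b = 4, ic_mica = 2, ic_leaf_a = 5, ic_leaf_b = 6)
# D = 3, alpha = 2, beta = 2, so 1/4 * 2/4 = 0.125
```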
It is based on the information content of the terms on the path connecting terms a and b via their MICA term c.
Denote a list of terms a, ..., c, ..., b composed of the shortest path from a to c and from b to c; the difference between a and b is the sum of 1/IC of the terms on the path:
sum_{x in the path}(1/IC(x))
Then the distance is scaled into [0, 1] by an arctangent transformation:
atan(sum_{x in the path}(1/IC(x)))/(pi/2)
And finally the similarity is:
sim = 1 - atan(sum_{x in the path}(1/IC(x)))/(pi/2)
Paper link: doi:10.1109/BIBM.2010.5706623.
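A minimal sketch of this transformation (the IC values for the path terms are made up; since atan is bounded by pi/2, the result always lands in (0, 1)):

```r
# Sketch of Sim_Shen_2010: sum 1/IC over the terms on the path
# a -> ... -> c -> ... -> b, then rescale with atan.
sim_shen_2010 = function(ic_path) {
    1 - atan(sum(1/ic_path))/(pi/2)
}

ic_path = c(4, 2.5, 1.2, 3, 5)   # hypothetical ICs of a, ..., c, ..., b
sim_shen_2010(ic_path)           # lower ICs on the path give lower similarity
```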
It is similar to the Sim_Shen_2010 method, which also sums values along the path passing through the LCA term. Instead of summing the information content, Sim_SSDD_2013 sums up a so-called "T-value":
sim = 1 - atan(sum_{x in the path}(T(x)))/(pi/2)
Each term has a T-value, which measures the semantic content a term on average inherits from its parents and distributes to its offspring. The T-value of the root is 1. Assume a term t has two parents p1 and p2. The T-value of term t is averaged from its parents:
(w1*T(p1) + w2*T(p2))/2
Since a parent may have other child terms, a factor w1 or w2 is multiplied to T(p1) and T(p2). Taking p1 as an example, it has n_p offspring terms (including itself) and t has n_t offspring terms (including itself). This means n_t/n_p of the information is transmitted from p1 downstream via t, thus w1 is defined as n_t/n_p.
Paper link: doi:10.1016/j.ygeno.2013.04.010.
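The weighting scheme can be sketched on a toy DAG (the term names and offspring counts below are hypothetical; with a single parent, the average reduces to w * T(parent)):

```r
# T-values on a tiny hypothetical DAG: a -> b, a -> c, b -> d.
# n(x) counts the offspring of x including x itself (hard-coded here).
n = c(a = 4, b = 2, c = 1, d = 1)

Tval = c(a = 1)                                # the T-value of the root is 1
Tval["b"] = (n[["b"]]/n[["a"]]) * Tval[["a"]]  # single parent: w * T(parent)
Tval["c"] = (n[["c"]]/n[["a"]]) * Tval[["a"]]
Tval["d"] = (n[["d"]]/n[["b"]]) * Tval[["b"]]
Tval   # a = 1, b = 0.5, c = 0.25, d = 0.25
```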
First, the semantic distance between terms a and b via their MICA term c is defined as:
D(a, b) = IC(a) + IC(b) - 2*IC(c)
Then there are several normalization methods to convert the distance into a similarity and scale it into the range [0, 1]:
max: 1 - D(a, b)/2/IC_max
Couto: min(1, D(a, b)/IC_max)
Lin: 1 - D(a, b)/(IC(a) + IC(b)), which is the same as the Sim_Lin_1998 method
Garla: 1 - log(D(a, b) + 1)/log(2*IC_max + 1)
log-Lin: 1 - log(D(a, b) + 1)/log(IC(a) + IC(b) + 1)
Rada: 1/(1 + D(a, b))
Paper link: https://aclanthology.org/O97-1002/.
There is a parameter norm_method which takes a value in "max", "Couto", "Lin", "Garla", "log-Lin", "Rada":
term_sim(dag, terms, method = "Sim_Jiang_1997", control = list(norm_method = "Lin"))
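The normalization variants listed above can be sketched in one helper (the IC values and IC_max below are hypothetical; D is the distance IC(a) + IC(b) - 2*IC(c)):

```r
# Sketch of the six normalization variants for the Jiang-style distance.
jiang_norm = function(d, ic_a, ic_b, ic_max, method = "Lin") {
    switch(method,
        "max"     = 1 - d/2/ic_max,
        "Couto"   = min(1, d/ic_max),
        "Lin"     = 1 - d/(ic_a + ic_b),
        "Garla"   = 1 - log(d + 1)/log(2*ic_max + 1),
        "log-Lin" = 1 - log(d + 1)/log(ic_a + ic_b + 1),
        "Rada"    = 1/(1 + d)
    )
}

d = 3 + 4 - 2*2   # IC(a) = 3, IC(b) = 4, IC(c) = 2, so D = 3
jiang_norm(d, ic_a = 3, ic_b = 4, ic_max = 8, method = "Lin")    # 1 - 3/7
jiang_norm(d, ic_a = 3, ic_b = 4, ic_max = 8, method = "Rada")   # 1/4
```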
Denote two sets A and B as the items annotated to terms a and b. The similarity value is the kappa coefficient of the two sets.
The universe or the background can be set via the parameter anno_universe:
term_sim(dag, terms, method = "Sim_kappa", control = list(anno_universe = ...))
Denote two sets A and B as the items annotated to terms a and b. The similarity value is the Jaccard coefficient of the two sets, defined as length(intersect(A, B))/length(union(A, B)).
The universe or the background can be set via the parameter anno_universe:
term_sim(dag, terms, method = "Sim_Jaccard", control = list(anno_universe = ...))
Denote two sets A and B as the items annotated to terms a and b. The similarity value is the Dice coefficient of the two sets, defined as 2*length(intersect(A, B))/(length(A) + length(B)).
The universe or the background can be set via the parameter anno_universe:
term_sim(dag, terms, method = "Sim_Dice", control = list(anno_universe = ...))
Denote two sets A and B as the items annotated to terms a and b. The similarity value is the overlap coefficient of the two sets, defined as length(intersect(A, B))/min(length(A), length(B)).
The universe or the background can be set via the parameter anno_universe:
term_sim(dag, terms, method = "Sim_Overlap", control = list(anno_universe = ...))
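The four annotation-overlap statistics can be computed directly in base R; the sketch below uses two toy annotation sets and a toy universe (real sets come from the DAG's annotations). The kappa coefficient compares observed vs chance agreement of the two membership vectors over the universe:

```r
# Base-R sketches of the statistics used by Sim_kappa, Sim_Jaccard,
# Sim_Dice and Sim_Overlap, on made-up sets.
A = 1:4
B = 3:6
universe = 1:10

jaccard = length(intersect(A, B))/length(union(A, B))         # 2/6
dice    = 2*length(intersect(A, B))/(length(A) + length(B))   # 4/8
overlap = length(intersect(A, B))/min(length(A), length(B))   # 2/4

in_A = universe %in% A
in_B = universe %in% B
tab  = table(in_A, in_B)                                   # 2x2 agreement table
po   = sum(diag(tab))/length(universe)                     # observed agreement
pe   = sum(rowSums(tab)*colSums(tab))/length(universe)^2   # chance agreement
kappa = (po - pe)/(1 - pe)                                 # 1/6 for these sets
```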
Denote S_a and S_b as the two sets of ancestor terms of terms a and b (including a and b themselves). The semantic similarity is defined as:
length(intersect(S_a, S_b))/length(union(S_a, S_b))
term_sim(dag, terms, method = "Sim_Ancestor")
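The definition is just a Jaccard coefficient on ancestor sets, which can be verified directly in base R (the term IDs below are hypothetical; real ancestor sets come from the DAG):

```r
# Direct base-R computation of the Sim_Ancestor definition on toy sets.
S_a = c("root", "x", "a")   # ancestors of a, including a itself
S_b = c("root", "x", "b")   # ancestors of b, including b itself

length(intersect(S_a, S_b))/length(union(S_a, S_b))   # 2/4 = 0.5
```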
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
annotation = list(
    "a" = 1:3,
    "b" = 3:4,
    "c" = 5,
    "d" = 7,
    "e" = 4:7,
    "f" = 8
)
dag = create_ontology_DAG(parents, children, annotation = annotation)
term_sim(dag, dag_all_terms(dag), method = "Sim_Lin_1998")