| Title: | GO-terms Semantic Similarity Measures |
|---|---|
| Description: | Semantic similarity analysis of Gene Ontology (GO) annotations provides a quantitative framework for comparing GO terms, gene products, and gene clusters. GOSemSim implements widely used information content- and graph-based similarity measures, including the methods of Resnik, Schlicker, Jiang, Lin, Wang and TCSS. It also provides utilities for preparing annotation data and combining term-level similarities into gene- and cluster-level scores. |
| Authors: | Guangchuang Yu [aut, cre], Alexey Stukalov [ctb], Pingfan Guo [ctb], Chuanle Xiao [ctb], Lluís Revilla Sancho [ctb] |
| Maintainer: | Guangchuang Yu <[email protected]> |
| License: | Artistic-2.0 |
| Version: | 2.39.2 |
| Built: | 2026-06-29 14:19:01 UTC |
| Source: | https://github.com/bioc/GOSemSim |
Adding indirect GO annotation
buildGOmap(TERM2GENE)
TERM2GENE |
data.frame with two or three columns of GO TERM, GENE and ONTOLOGY (optional) |
Given a data.frame of GO TERM (column 1), GENE (column 2) and, optionally, ONTOLOGY that describes the direct GO annotation, this function adds the indirect GO annotation of genes.
data.frame, GO annotation with direct and indirect annotation
Yu Guangchuang
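The idea of indirect annotation can be sketched in base R: a gene annotated to a term is also implicitly annotated to every ancestor of that term. The sketch below is illustrative only and is not the package's implementation. The term IDs, the `parents` map and the helper names `ancestors_of()` and `add_indirect()` are hypothetical, and the toy ontology stands in for the real GO graph.

```r
# Toy ontology: each term maps to its direct parents (hypothetical IDs).
parents <- list(
  "GO:C" = c("GO:B"),
  "GO:B" = c("GO:A"),
  "GO:A" = character(0)
)

# Collect all ancestors of a term by walking parent links upward.
ancestors_of <- function(term, parents) {
  seen <- character(0)
  queue <- parents[[term]]
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (!(current %in% seen)) {
      seen <- c(seen, current)
      queue <- c(queue, parents[[current]])
    }
  }
  seen
}

# Direct annotation in TERM2GENE layout: GO term first, gene second.
term2gene <- data.frame(
  GO = c("GO:C", "GO:B"),
  Gene = c("gene1", "gene2"),
  stringsAsFactors = FALSE
)

# For every direct annotation, add one row per ancestor term.
add_indirect <- function(term2gene, parents) {
  expanded <- lapply(seq_len(nrow(term2gene)), function(i) {
    terms <- c(term2gene$GO[i], ancestors_of(term2gene$GO[i], parents))
    data.frame(GO = terms, Gene = term2gene$Gene[i], stringsAsFactors = FALSE)
  })
  unique(do.call(rbind, expanded))
}

full <- add_indirect(term2gene, parents)
print(full)
```

In this toy case gene1 ends up annotated to three terms and gene2 to two, which is the direct-plus-indirect shape that buildGOmap() is described as returning.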
Semantic similarity between two gene clusters
clusterSim(cluster1, cluster2, semData, measure = "Wang", drop = "IEA", combine = "BMA")
cluster1 |
A set of gene IDs |
cluster2 |
Another set of gene IDs |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
drop |
Evidence codes to drop; use |
combine |
One of "max", "avg", "rcmax", "BMA" methods, used to combine multiple term scores. |
similarity
Guangchuang Yu https://yulab-smu.top
goSim() mgoSim() geneSim() mgeneSim() clusterSim() mclusterSim()
d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE)
cluster1 <- c("835", "5261", "241", "994")
cluster2 <- c("307", "308", "317", "321", "506", "540", "378", "388", "396")
clusterSim(cluster1, cluster2, semData = d, measure = "Wang")
Function for combining a similarity matrix into a single similarity score
combineScores(SimScores, combine)
SimScores |
similarity matrix |
combine |
combine method |
similarity value
Guangchuang Yu https://yulab-smu.top
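To make the four combine methods named in this manual ("max", "avg", "rcmax", "BMA") concrete, here is a minimal base R sketch under their usual definitions. The function name `combine_sketch()` is hypothetical, and the package's own combineScores() may differ in details such as NA handling or rounding.

```r
# Combine a term-by-term similarity matrix into one score.
# max:   largest entry of the matrix
# avg:   mean of all entries
# rcmax: larger of the mean row-wise maximum and the mean column-wise maximum
# BMA:   best-match average, i.e. mean over all row and column maxima
combine_sketch <- function(sim, method = c("max", "avg", "rcmax", "BMA")) {
  method <- match.arg(method)
  row_max <- apply(sim, 1, max, na.rm = TRUE)
  col_max <- apply(sim, 2, max, na.rm = TRUE)
  switch(method,
    max   = max(sim, na.rm = TRUE),
    avg   = mean(sim, na.rm = TRUE),
    rcmax = max(mean(row_max), mean(col_max)),
    BMA   = sum(row_max, col_max) / (length(row_max) + length(col_max))
  )
}

# Toy similarity matrix: three terms of one gene against two of another.
sim <- matrix(c(1.0, 0.2,
                0.4, 0.6,
                0.1, 0.3),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("a1", "a2", "a3"), c("b1", "b2")))

scores <- vapply(c("max", "avg", "rcmax", "BMA"),
                 function(m) combine_sketch(sim, m), numeric(1))
print(scores)
```

"max" rewards a single strong term pair, while "BMA" asks that every term on either side has a reasonable partner, so it tends to be less sensitive to one outlying match.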
Given two genes, calculate their semantic similarity and return the score with corresponding GO terms.
geneSim(gene1, gene2, semData, measure = "Wang", drop = "IEA", combine = "BMA")
gene1 |
Entrez gene ID |
gene2 |
Another Entrez gene ID |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
drop |
Evidence codes to drop; use |
combine |
One of "max", "avg", "rcmax", "BMA" methods, used to combine multiple term scores. |
A list containing similarity value and corresponding GO terms
Guangchuang Yu https://yulab-smu.top
goSim() mgoSim() mgeneSim() clusterSim() mclusterSim()
d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE)
geneSim("241", "251", semData = d, measure = "Wang")
Get organism name from OrgDb object
get_organism(object)
object |
OrgDb object or OrgDb package name |
Organism name
Guangchuang Yu
These datasets are the information contents of GO terms.
Yu et al. (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products Bioinformatics (Oxford, England), 26:7 976–978, April 2010. ISSN 1367-4803 http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/7/976 PMID: 20179076
prepare GO DATA for measuring semantic similarity
godata(OrgDb = NULL, annoDb = NULL, keytype = "ENTREZID", ont, computeIC = TRUE, processTCSS = FALSE, cutoff = NULL)
OrgDb |
OrgDb object (will be removed in future, please use annoDb instead) |
annoDb |
GO annotation database; can be an OrgDb object or a data.frame containing three columns: 'GENE', 'GO' and 'ONTOLOGY'. |
keytype |
keytype |
ont |
one of 'BP', 'MF', 'CC' |
computeIC |
logical, whether to compute IC |
processTCSS |
logical, whether to prepare TCSS data. TCSS requires
|
cutoff |
topology cutoff for TCSS subgraph construction. If |
GOSemSimDATA object
Guangchuang Yu
Class "GOSemSimDATA"
This class stores IC and gene-to-GO mapping for semantic similarity measurement.
keys: gene ID
ont: ontology
IC: IC data
geneAnno: gene to GO mapping
tcssdata: tcssdata
metadata: metadata
Given two GO IDs, calculate their semantic similarity.
goSim(GOID1, GOID2, semData, measure = "Wang")
GOID1 |
GO ID 1 |
GOID2 |
GO ID 2 |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
similarity
Guangchuang Yu https://yulab-smu.top
goSim() mgoSim() geneSim() mgeneSim() clusterSim() mclusterSim()
d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE)
goSim("GO:0004022", "GO:0005515", semData = d, measure = "Wang")
Information Content Based Methods for semantic similarity measuring
infoContentMethod(ID1, ID2, method, godata)
ID1 |
Ontology Term |
ID2 |
Ontology Term |
method |
One of "Resnik", "Jiang", "Lin", "Rel" and "TCSS". |
godata |
GOSemSimDATA object |
Implements the methods proposed by Resnik, Jiang, Lin and Schlicker.
semantic similarity score
Guangchuang Yu https://yulab-smu.top
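The information content (IC) based measures can be illustrated with their textbook formulas. The sketch below assumes IC(t) = -log p(t), where p(t) is the annotation frequency of term t, and takes the IC of the most informative common ancestor (MICA) as given. The function names `ic()` and `ic_similarity()` are hypothetical, the frequencies are made up for illustration, and the package may normalize or bound the scores differently. TCSS is omitted because it needs additional topology data.

```r
# Information content from an annotation probability: IC(t) = -log(p(t)).
ic <- function(p) -log(p)

# Term similarity from the ICs of both terms and of their MICA.
# Resnik: IC of the MICA
# Lin:    2 * IC(MICA) / (IC(t1) + IC(t2))
# Jiang:  1 - min(1, IC(t1) + IC(t2) - 2 * IC(MICA))
# Rel:    Lin weighted by (1 - p(MICA)), with p(MICA) = exp(-IC(MICA))
ic_similarity <- function(ic1, ic2, ic_mica,
                          method = c("Resnik", "Lin", "Jiang", "Rel")) {
  method <- match.arg(method)
  switch(method,
    Resnik = ic_mica,
    Lin    = 2 * ic_mica / (ic1 + ic2),
    Jiang  = 1 - min(1, ic1 + ic2 - 2 * ic_mica),
    Rel    = 2 * ic_mica / (ic1 + ic2) * (1 - exp(-ic_mica))
  )
}

# Made-up frequencies: two specific terms sharing a more general ancestor.
ic1 <- ic(0.01)
ic2 <- ic(0.02)
ic_mica <- ic(0.1)

resnik <- ic_similarity(ic1, ic2, ic_mica, "Resnik")
lin    <- ic_similarity(ic1, ic2, ic_mica, "Lin")
jiang  <- ic_similarity(ic1, ic2, ic_mica, "Jiang")
rel    <- ic_similarity(ic1, ic2, ic_mica, "Rel")
print(c(Resnik = resnik, Lin = lin, Jiang = jiang, Rel = rel))
```

Resnik depends only on the shared ancestor, so it is unbounded above, whereas Lin, Jiang and Rel also account for how specific the two terms themselves are and stay within [0, 1] in this formulation.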
Calculate pairwise semantic similarities for a list of gene clusters.
mclusterSim(clusters, semData, measure = "Wang", drop = "IEA", combine = "BMA", BPPARAM = NULL)
clusters |
A list of gene clusters |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
drop |
Evidence codes to drop; use |
combine |
One of "max", "avg", "rcmax", "BMA" methods, used to combine multiple term scores. |
BPPARAM |
optional BiocParallel::BiocParallelParam object for
parallel pairwise similarity calculation. The default |
similarity matrix
Guangchuang Yu https://yulab-smu.top
goSim() mgoSim() geneSim() mgeneSim() clusterSim() mclusterSim()
d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE)
cluster1 <- c("835", "5261", "241")
cluster2 <- c("578", "582")
cluster3 <- c("307", "308", "317")
clusters <- list(a = cluster1, b = cluster2, c = cluster3)
mclusterSim(clusters, semData = d, measure = "Wang")
Calculate pairwise semantic similarities for a given list of genes.
mgeneSim(genes, semData, measure = "Wang", drop = "IEA", combine = "BMA", verbose = TRUE, BPPARAM = NULL)
genes |
A list of Entrez gene IDs |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
drop |
Evidence codes to drop; use |
combine |
One of "max", "avg", "rcmax", "BMA" methods, used to combine multiple term scores. |
verbose |
Whether to show a progress bar |
BPPARAM |
optional BiocParallel::BiocParallelParam object for
parallel pairwise similarity calculation. The default |
Parallel calculation is opt-in. With the default BPPARAM = NULL,
mgeneSim() keeps the original serial behavior and can show a progress bar
when verbose = TRUE. When a BPPARAM object is supplied, pairwise
similarities are calculated through BiocParallel::bplapply() and the
progress bar is not shown.
similarity matrix
Guangchuang Yu https://yulab-smu.top
goSim() mgoSim() geneSim() mgeneSim() clusterSim() mclusterSim()
d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE)
mgeneSim(c("835", "5261", "241"), semData = d, measure = "Wang")
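The serial and parallel modes described for mgeneSim() can be pictured as the same pairwise loop with a swappable apply function. The sketch below is a base R illustration under that assumption, not the package's code. `toy_sim()` (a Jaccard index) stands in for a real gene similarity, and `pairwise_matrix()`, its `apply_fun` argument and the toy annotations are hypothetical.

```r
# Toy stand-in for a gene-level similarity: Jaccard index of annotation sets.
toy_sim <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Hypothetical gene-to-term annotations.
annotations <- list(
  g1 = c("GO:1", "GO:2", "GO:3"),
  g2 = c("GO:2", "GO:3"),
  g3 = c("GO:4")
)

# Compute each unordered pair once, then mirror it into a symmetric matrix.
# apply_fun defaults to lapply (serial); a parallel lapply-like function
# could be supplied instead.
pairwise_matrix <- function(items, sim_fun, apply_fun = lapply) {
  n <- length(items)
  idx <- which(upper.tri(matrix(0, n, n), diag = TRUE), arr.ind = TRUE)
  scores <- unlist(apply_fun(seq_len(nrow(idx)), function(k) {
    sim_fun(items[[idx[k, 1]]], items[[idx[k, 2]]])
  }))
  m <- matrix(NA_real_, n, n, dimnames = list(names(items), names(items)))
  m[idx] <- scores
  m[idx[, 2:1, drop = FALSE]] <- scores
  m
}

sim_matrix <- pairwise_matrix(annotations, toy_sim)
print(sim_matrix)
```

If BiocParallel is installed, passing `apply_fun = function(X, FUN) BiocParallel::bplapply(X, FUN, BPPARAM = BiocParallel::SerialParam())` should exercise the same code path through bplapply(). Because the work is split by pair, a progress bar tied to the serial loop has no natural place in that mode.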
Given two sets of GO terms, calculate their semantic similarity.
mgoSim(GO1, GO2, semData, measure = "Wang", combine = "BMA")
GO1 |
A set of GO terms |
GO2 |
Another set of GO terms |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
combine |
One of "max", "avg", "rcmax", "BMA" methods, used to combine multiple term scores. |
similarity
Guangchuang Yu https://yulab-smu.top
goSim() mgoSim() geneSim() mgeneSim() clusterSim() mclusterSim()
d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE)
go1 <- c("GO:0004022", "GO:0004024", "GO:0004023")
go2 <- c("GO:0009055", "GO:0020037")
mgoSim("GO:0003824", go2, semData = d, measure = "Wang")
mgoSim(go1, go2, semData = d, measure = "Wang")
Given a BLAST2GO file, this function extracts the information from it and makes it usable as TERM2GENE.
read.blast2go(file, add_indirect_GO = FALSE)
file |
BLAST2GO file |
add_indirect_GO |
whether to add indirect GO annotation |
a data frame with three columns: GENE, GO and ONTOLOGY
parse GAF files
read.gaf(file, asis = FALSE, add_indirect_GO = FALSE)
parse_gff(file, asis = FALSE, add_indirect_GO = FALSE)
file |
GAF file |
asis |
logical, whether to output the original contents of the file; only works if 'add_indirect_GO = FALSE' |
add_indirect_GO |
whether to add indirect GO annotation |
Given a GAF file, this function extracts the information from it.
A data.frame. The original table if 'asis' applies; otherwise it contains 3 columns: 'GENE', 'GO' and 'ONTOLOGY'.
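As a rough picture of what reducing a GAF file to GENE, GO and ONTOLOGY columns involves, the following base R sketch parses two made-up GAF 2.x lines. It assumes that GENE is taken from the DB Object ID (column 2), GO from the GO ID (column 5) and ONTOLOGY from the Aspect code (column 9, with P, F and C mapped to BP, MF and CC). The package's read.gaf() may pick different columns or handle edge cases differently, and `parse_gaf_lines()` is a hypothetical helper.

```r
# Two made-up annotation lines in GAF 2.x layout (17 tab-separated columns),
# preceded by a header line starting with "!".
gaf_text <- c(
  "!gaf-version: 2.2",
  paste(c("UniProtKB", "P00001", "GENE1", "enables", "GO:0003824",
          "PMID:1", "IDA", "", "F", "", "", "protein", "taxon:9606",
          "20200101", "UniProt", "", ""), collapse = "\t"),
  paste(c("UniProtKB", "P00002", "GENE2", "involved_in", "GO:0008150",
          "PMID:2", "IEA", "", "P", "", "", "protein", "taxon:9606",
          "20200101", "UniProt", "", ""), collapse = "\t")
)

# Map GAF aspect codes to ontology abbreviations.
aspect_to_ont <- c(P = "BP", F = "MF", C = "CC")

parse_gaf_lines <- function(lines) {
  # Drop empty lines and "!" header/comment lines.
  lines <- lines[nzchar(lines) & !startsWith(lines, "!")]
  fields <- strsplit(lines, "\t", fixed = TRUE)
  data.frame(
    GENE = vapply(fields, `[`, character(1), 2),
    GO = vapply(fields, `[`, character(1), 5),
    ONTOLOGY = unname(aspect_to_ont[vapply(fields, `[`, character(1), 9)]),
    stringsAsFactors = FALSE
  )
}

anno <- parse_gaf_lines(gaf_text)
print(anno)
```

A table in this three-column shape matches what the godata() documentation describes for a data.frame passed as annoDb.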
When enabled, outdated local ontology SQLite databases will be automatically updated from remote mirrors.
set_auto_update(x = TRUE)
x |
logical, whether to enable auto-update (default: TRUE) |
determine the topological cutoff for TCSS method
tcss_cutoff(OrgDb = NULL, keytype = "ENTREZID", ont, combine_method = "max", ppidata)
OrgDb |
OrgDb object |
keytype |
keytype |
ont |
ontology : "BP", "MF", "CC" |
combine_method |
one of "max", "BMA", "avg", "rcmax", "rcmax.avg" |
ppidata |
A data.frame containing a positive set and a negative set. The positive set consists of PPI pairs that have already been verified. ppidata has three columns: columns 1 and 2 are character, and column 3 must be logical (TRUE/FALSE). |
numeric, topological cutoff for TCSS subgraph construction. The
returned value can be passed to godata() via the cutoff argument together
with processTCSS = TRUE.
## Not run:
library(org.Hs.eg.db)
library(STRINGdb)
string_db <- STRINGdb$new(version = "11.0", species = 9606, score_threshold = 700)
string_proteins <- string_db$get_proteins()
# get relationship
ppi <- string_db$get_interactions(string_proteins$protein_external_id)
ppi$from <- vapply(ppi$from, function(e) strsplit(e, "9606.")[[1]][2], character(1))
ppi$to <- vapply(ppi$to, function(e) strsplit(e, "9606.")[[1]][2], character(1))
len <- nrow(ppi)
# select length
s_len <- 100
pos_1 <- sample(len, s_len, replace = TRUE)
# negative set
pos_2 <- sample(len, s_len, replace = TRUE)
pos_3 <- sample(len, s_len, replace = TRUE)
# union as ppidata
ppidata <- data.frame(pro1 = c(ppi$from[pos_1], ppi$from[pos_2]),
                      pro2 = c(ppi$to[pos_1], ppi$to[pos_3]),
                      label = c(rep(TRUE, s_len), rep(FALSE, s_len)),
                      stringsAsFactors = FALSE)
cutoff <- tcss_cutoff(OrgDb = org.Hs.eg.db, keytype = "ENSEMBLPROT", ont = "BP", combine_method = "max", ppidata = ppidata)
semData <- godata(annoDb = org.Hs.eg.db, keytype = "ENSEMBLPROT", ont = "BP", computeIC = TRUE, processTCSS = TRUE, cutoff = cutoff)
## End(Not run)
Measure similarities between two term vectors.
termSim(t1, t2, semData, method = c("Wang", "Resnik", "Rel", "Jiang", "Lin", "TCSS"))
t1 |
Term vector |
t2 |
Term vector |
semData |
GOSemSimDATA object |
method |
One of "Wang", "Resnik", "Rel", "Jiang", "Lin", "TCSS" |
Given two term vectors, this function calculates their pairwise similarities.
The TCSS method requires semData generated by godata() with
computeIC = TRUE and processTCSS = TRUE.
score matrix
Guangchuang Yu https://yulab-smu.top
Wang's method for measuring semantic similarity
wangMethod_internal(ID1, ID2, ont = "BP")
ID1 |
Ontology Term |
ID2 |
Ontology Term |
ont |
Ontology |
semantic similarity score
Guangchuang Yu https://yulab-smu.top
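Wang's graph-based measure can be sketched on a toy DAG. In the published method, each ancestor t of a term A receives an S-value: S_A(A) = 1, and S_A(t) is the maximum over the children t' of t of w * S_A(t'), where the weight w is 0.8 for is_a and 0.6 for part_of edges. The similarity of A and B is the sum of S_A(t) + S_B(t) over their shared ancestors, divided by the total S-values of both terms. The code below is an illustration under those definitions and is not the package's wangMethod_internal(). The term names, the `edges` table and the helpers `s_values()` and `wang_sim()` are hypothetical.

```r
# Toy DAG: edges point from child to parent and carry a relation type.
edges <- data.frame(
  child  = c("B", "C", "D", "D"),
  parent = c("A", "A", "B", "C"),
  rel    = c("is_a", "part_of", "is_a", "is_a"),
  stringsAsFactors = FALSE
)
weights <- c(is_a = 0.8, part_of = 0.6)

# S-values of a term and all of its ancestors: walk upward and keep the
# best (largest) contribution reaching each ancestor.
s_values <- function(term, edges, weights) {
  s <- c(1)
  names(s) <- term
  frontier <- term
  while (length(frontier) > 0) {
    nxt <- character(0)
    for (node in frontier) {
      up <- edges[edges$child == node, , drop = FALSE]
      for (i in seq_len(nrow(up))) {
        p <- up$parent[i]
        candidate <- unname(s[node] * weights[up$rel[i]])
        if (is.na(s[p]) || candidate > s[p]) {
          s[p] <- candidate
          nxt <- union(nxt, p)  # revisit p so its parents are updated too
        }
      }
    }
    frontier <- nxt
  }
  s
}

# Similarity: shared S-values over the total S-values of both terms.
wang_sim <- function(t1, t2, edges, weights) {
  s1 <- s_values(t1, edges, weights)
  s2 <- s_values(t2, edges, weights)
  common <- intersect(names(s1), names(s2))
  sum(s1[common] + s2[common]) / (sum(s1) + sum(s2))
}

sim_bc <- wang_sim("B", "C", edges, weights)
sim_db <- wang_sim("D", "B", edges, weights)
sim_dd <- wang_sim("D", "D", edges, weights)
print(c(B_C = sim_bc, D_B = sim_db, D_D = sim_dd))
```

Because the score depends only on graph topology and edge types, it needs no annotation frequencies, which is consistent with the examples in this manual calling godata() with computeIC = FALSE when measure = "Wang".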