| Title: | GO-terms Semantic Similarity Measures |
|---|---|
| Description: | The semantic comparisons of Gene Ontology (GO) annotations provide quantitative ways to compute similarities between genes and gene groups, and have became important basis for many bioinformatics analysis approaches. GOSemSim is an R package for semantic similarity computation among GO terms, sets of GO terms, gene products and gene clusters. GOSemSim implemented five methods proposed by Resnik, Schlicker, Jiang, Lin and Wang respectively. |
| Authors: | Guangchuang Yu [aut, cre], Alexey Stukalov [ctb], Pingfan Guo [ctb], Chuanle Xiao [ctb], Lluís Revilla Sancho [ctb] |
| Maintainer: | Guangchuang Yu <[email protected]> |
| License: | Artistic-2.0 |
| Version: | 2.39.0 |
| Built: | 2026-05-29 10:27:30 UTC |
| Source: | https://github.com/bioc/GOSemSim |
Addding indirect GO annotation
buildGOmap(TERM2GENE)buildGOmap(TERM2GENE)
TERM2GENE |
data.frame with two or three columns of GO TERM, GENE and ONTOLOGY (optional) |
provided by a data.frame of GO TERM (column 1), GENE (column 2) and ONTOLOGY (optional) that describes GO direct annotation, this function will add indirect GO annotation of genes.
data.frame, GO annotation with direct and indirect annotation
Yu Guangchuang
Semantic similarity between two gene clusters
clusterSim( cluster1, cluster2, semData, measure = "Wang", drop = "IEA", combine = "BMA" )clusterSim( cluster1, cluster2, semData, measure = "Wang", drop = "IEA", combine = "BMA" )
cluster1 |
A set of gene IDs |
cluster2 |
Another set of gene IDs |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
drop |
Evidence codes to drop; use |
combine |
One of "max", "avg", "rcmax", "BMA" methods, used to combine multiple term scores. |
similarity
Guangchuang Yu https://yulab-smu.top
goSim() mgoSim() geneSim() mgeneSim() clusterSim() mclusterSim()
d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE) cluster1 <- c("835", "5261", "241", "994") cluster2 <- c("307", "308", "317", "321", "506", "540", "378", "388", "396") clusterSim(cluster1, cluster2, semData = d, measure = "Wang")d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE) cluster1 <- c("835", "5261", "241", "994") cluster2 <- c("307", "308", "317", "321", "506", "540", "378", "388", "396") clusterSim(cluster1, cluster2, semData = d, measure = "Wang")
Functions for combining similarity matrix to similarity score
combineScores(SimScores, combine)combineScores(SimScores, combine)
SimScores |
similarity matrix |
combine |
combine method |
similarity value
Guangchuang Yu https://yulab-smu.top
Given two genes, calculate their semantic similarity and return the score with corresponding GO terms.
geneSim(gene1, gene2, semData, measure = "Wang", drop = "IEA", combine = "BMA")geneSim(gene1, gene2, semData, measure = "Wang", drop = "IEA", combine = "BMA")
gene1 |
Entrez gene ID |
gene2 |
Another Entrez gene ID |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
drop |
Evidence codes to drop; use |
combine |
One of "max", "avg", "rcmax", "BMA" methods, used to combine multiple term scores. |
A list containing similarity value and corresponding GO terms
Guangchuang Yu https://yulab-smu.top
goSim() mgoSim() mgeneSim() clusterSim() mclusterSim()
d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE) geneSim("241", "251", semData = d, measure = "Wang")d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE) geneSim("241", "251", semData = d, measure = "Wang")
Get organism name from OrgDb object
get_organism(object)get_organism(object)
object |
OrgDb object or OrgDb package name |
Organism name
Guangchuang Yu
These datasets are the information contents of GOterms.
Yu et al. (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products Bioinformatics (Oxford, England), 26:7 976–978, April 2010. ISSN 1367-4803 http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/7/976 PMID: 20179076
prepare GO DATA for measuring semantic similarity
godata( OrgDb = NULL, annoDb = NULL, keytype = "ENTREZID", ont, computeIC = TRUE, processTCSS = FALSE, cutoff = NULL )godata( OrgDb = NULL, annoDb = NULL, keytype = "ENTREZID", ont, computeIC = TRUE, processTCSS = FALSE, cutoff = NULL )
OrgDb |
OrgDb object (will be removed in future, please use annoDb instead) |
annoDb |
GO annotation database, can be OrgDb or a data.frame contains three columns of 'GENE', 'GO' and 'ONTOLOGY'. |
keytype |
keytype |
ont |
one of 'BP', 'MF', 'CC' |
computeIC |
logical, whether computer IC |
processTCSS |
logical, whether to process TCSS |
cutoff |
cutoff of TCSS |
GOSemSimDATA object
Guangchuang Yu
Class "GOSemSimDATA" This class stores IC and gene to go mapping for semantic similarity measurement
keysgene ID
ontontology
ICIC data
geneAnnogene to GO mapping
tcssdatatcssdata
metadatametadata
Given two GO IDs, calculate their semantic similarity.
goSim(GOID1, GOID2, semData, measure = "Wang")goSim(GOID1, GOID2, semData, measure = "Wang")
GOID1 |
GO ID 1 |
GOID2 |
GO ID 2 |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
similarity
Guangchuang Yu https://yulab-smu.top
goSim() mgoSim() geneSim() mgeneSim() clusterSim() mclusterSim()
d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE) goSim("GO:0004022", "GO:0005515", semData = d, measure = "Wang")d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE) goSim("GO:0004022", "GO:0005515", semData = d, measure = "Wang")
Information Content Based Methods for semantic similarity measuring
infoContentMethod(ID1, ID2, method, godata)infoContentMethod(ID1, ID2, method, godata)
ID1 |
Ontology Term |
ID2 |
Ontology Term |
method |
one of "Resnik", "Jiang", "Lin" and "Rel", "TCSS". |
godata |
GOSemSimDATA object |
implemented for methods proposed by Resnik, Jiang, Lin and Schlicker.
semantic similarity score
Guangchuang Yu https://yulab-smu.top
Calculate pairwise semantic similarities for a list of gene clusters.
mclusterSim(clusters, semData, measure = "Wang", drop = "IEA", combine = "BMA")mclusterSim(clusters, semData, measure = "Wang", drop = "IEA", combine = "BMA")
clusters |
A list of gene clusters |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
drop |
Evidence codes to drop; use |
combine |
One of "max", "avg", "rcmax", "BMA" methods, used to combine multiple term scores. |
similarity matrix
Guangchuang Yu https://yulab-smu.top
goSim() mgoSim() geneSim() mgeneSim() clusterSim() mclusterSim()
d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE) cluster1 <- c("835", "5261", "241") cluster2 <- c("578", "582") cluster3 <- c("307", "308", "317") clusters <- list(a = cluster1, b = cluster2, c = cluster3) mclusterSim(clusters, semData = d, measure = "Wang")d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE) cluster1 <- c("835", "5261", "241") cluster2 <- c("578", "582") cluster3 <- c("307", "308", "317") clusters <- list(a = cluster1, b = cluster2, c = cluster3) mclusterSim(clusters, semData = d, measure = "Wang")
Calculate pairwise semantic similarities for a given list of genes.
mgeneSim( genes, semData, measure = "Wang", drop = "IEA", combine = "BMA", verbose = TRUE )mgeneSim( genes, semData, measure = "Wang", drop = "IEA", combine = "BMA", verbose = TRUE )
genes |
A list of Entrez gene IDs |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
drop |
Evidence codes to drop; use |
combine |
One of "max", "avg", "rcmax", "BMA" methods, used to combine multiple term scores. |
verbose |
Whether to show a progress bar |
similarity matrix
Guangchuang Yu https://yulab-smu.top
goSim() mgoSim() geneSim() mgeneSim() clusterSim() mclusterSim()
d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE) mgeneSim(c("835", "5261", "241"), semData = d, measure = "Wang")d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE) mgeneSim(c("835", "5261", "241"), semData = d, measure = "Wang")
Given two sets of GO terms, calculate their semantic similarity.
mgoSim(GO1, GO2, semData, measure = "Wang", combine = "BMA")mgoSim(GO1, GO2, semData, measure = "Wang", combine = "BMA")
GO1 |
A set of GO terms |
GO2 |
Another set of GO terms |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
combine |
One of "max", "avg", "rcmax", "BMA" methods, used to combine multiple term scores. |
similarity
Guangchuang Yu https://yulab-smu.top
goSim() mgoSim() geneSim() mgeneSim() clusterSim() mclusterSim()
d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE) go1 <- c("GO:0004022", "GO:0004024", "GO:0004023") go2 <- c("GO:0009055", "GO:0020037") mgoSim("GO:0003824", go2, semData = d, measure = "Wang") mgoSim(go1, go2, semData = d, measure = "Wang")d <- godata('org.Hs.eg.db', ont = "MF", computeIC = FALSE) go1 <- c("GO:0004022", "GO:0004024", "GO:0004023") go2 <- c("GO:0009055", "GO:0020037") mgoSim("GO:0003824", go2, semData = d, measure = "Wang") mgoSim(go1, go2, semData = d, measure = "Wang")
given a BLAST2GO file, this function extracts the information from it and make it use for TERM2GENE.
read.blast2go(file, add_indirect_GO = FALSE)read.blast2go(file, add_indirect_GO = FALSE)
file |
BLAST2GO file |
add_indirect_GO |
whether add indirect GO annotation |
a data frame with three columns: GENE, GO and ONTOLOGY
parse GAF files
read.gaf(file, asis = FALSE, add_indirect_GO = FALSE) parse_gff(file, asis = FALSE, add_indirect_GO = FALSE)read.gaf(file, asis = FALSE, add_indirect_GO = FALSE) parse_gff(file, asis = FALSE, add_indirect_GO = FALSE)
file |
GAF file |
asis |
logical, whether output the original contains of the file and only works if 'add_indirect_GO = FALSE' |
add_indirect_GO |
whether to add indirect GO annotation |
given a GAF file, this function extracts the information from it
A data.frame. Original table if 'asis' works, otherwise contains 3 conlumns of 'GENE', 'GO' and 'ONTOLOGY'
determine the topological cutoff for TCSS method
tcss_cutoff( OrgDb = NULL, keytype = "ENTREZID", ont, combine_method = "max", ppidata )tcss_cutoff( OrgDb = NULL, keytype = "ENTREZID", ont, combine_method = "max", ppidata )
OrgDb |
OrgDb object |
keytype |
keytype |
ont |
ontology : "BP", "MF", "CC" |
combine_method |
"max", "BMA", "avg", "rcmax", "rcmax.avg" |
ppidata |
A data.frame contains positive set and negative set. Positive set is PPI pairs that already verified. ppidata has three columns, column 1 and 2 are character, column 3 must be logical value:TRUE/FALSE. |
numeric, topological cutoff for given parameters
## Not run: library(org.Hs.eg.db) library(STRINGdb) string_db <- STRINGdb$new(version = "11.0", species = 9606, score_threshold = 700) string_proteins <- string_db$get_proteins() #get relationship ppi <- string_db$get_interactions(string_proteins$protein_external_id) ppi$from <- vapply(ppi$from, function(e) strsplit(e, "9606.")[[1]][2], character(1)) ppi$to <- vapply(ppi$to, function(e) strsplit(e, "9606.")[[1]][2], character(1)) len <- nrow(ppi) #select length s_len <- 100 pos_1 <- sample(len, s_len, replace = T) #negative set pos_2 <- sample(len, s_len, replace = T) pos_3 <- sample(len, s_len, replace = T) #union as ppidata ppidata <- data.frame(pro1 = c(ppi$from[pos_1], ppi$from[pos_2]), pro2 = c(ppi$to[pos_1], ppi$to[pos_3]), label = c(rep(TRUE, s_len), rep(FALSE, s_len)), stringsAsFactors = FALSE) cutoff <- tcss_cutoff(OrgDb = org.Hs.eg.db, keytype = "ENSEMBLPROT", ont = "BP", combine_method = "max", ppidata) ## End(Not run)## Not run: library(org.Hs.eg.db) library(STRINGdb) string_db <- STRINGdb$new(version = "11.0", species = 9606, score_threshold = 700) string_proteins <- string_db$get_proteins() #get relationship ppi <- string_db$get_interactions(string_proteins$protein_external_id) ppi$from <- vapply(ppi$from, function(e) strsplit(e, "9606.")[[1]][2], character(1)) ppi$to <- vapply(ppi$to, function(e) strsplit(e, "9606.")[[1]][2], character(1)) len <- nrow(ppi) #select length s_len <- 100 pos_1 <- sample(len, s_len, replace = T) #negative set pos_2 <- sample(len, s_len, replace = T) pos_3 <- sample(len, s_len, replace = T) #union as ppidata ppidata <- data.frame(pro1 = c(ppi$from[pos_1], ppi$from[pos_2]), pro2 = c(ppi$to[pos_1], ppi$to[pos_3]), label = c(rep(TRUE, s_len), rep(FALSE, s_len)), stringsAsFactors = FALSE) cutoff <- tcss_cutoff(OrgDb = org.Hs.eg.db, keytype = "ENSEMBLPROT", ont = "BP", combine_method = "max", ppidata) ## End(Not run)
Measure similarities between two term vectors.
termSim( t1, t2, semData, method = c("Wang", "Resnik", "Rel", "Jiang", "Lin", "TCSS") )termSim( t1, t2, semData, method = c("Wang", "Resnik", "Rel", "Jiang", "Lin", "TCSS") )
t1 |
Term vector |
t2 |
Term vector |
semData |
GOSemSimDATA object |
method |
One of "Wang", "Resnik", "Rel", "Jiang", "Lin", "TCSS" |
Provide two term vectors, this function calculates their similarities.
score matrix
Guangchuang Yu https://yulab-smu.top
Method Wang for semantic similarity measuring
wangMethod_internal(ID1, ID2, ont = "BP")wangMethod_internal(ID1, ID2, ont = "BP")
ID1 |
Ontology Term |
ID2 |
Ontology Term |
ont |
Ontology |
semantic similarity score
Guangchuang Yu https://yulab-smu.top