Title: | GO-terms Semantic Similarity Measures |
---|---|
Description: | The semantic comparisons of Gene Ontology (GO) annotations provide quantitative ways to compute similarities between genes and gene groups, and have become an important basis for many bioinformatics analysis approaches. GOSemSim is an R package for semantic similarity computation among GO terms, sets of GO terms, gene products and gene clusters. GOSemSim implements five methods proposed by Resnik, Schlicker, Jiang, Lin and Wang respectively. |
Authors: | Guangchuang Yu [aut, cre], Alexey Stukalov [ctb], Pingfan Guo [ctb], Chuanle Xiao [ctb], Lluís Revilla Sancho [ctb] |
Maintainer: | Guangchuang Yu <[email protected]> |
License: | Artistic-2.0 |
Version: | 2.33.0 |
Built: | 2025-01-02 06:00:34 UTC |
Source: | https://github.com/bioc/GOSemSim |
Adding indirect GO annotation
buildGOmap(TERM2GENE)
TERM2GENE |
data.frame with two or three columns of GO TERM, GENE and ONTOLOGY (optional) |
Given a data.frame of GO TERM (column 1), GENE (column 2) and ONTOLOGY (optional) that describes direct GO annotation, this function adds the indirect GO annotation of genes.
data.frame, GO annotation with direct and indirect annotation
Yu Guangchuang
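The idea behind indirect annotation can be sketched in base R: a gene directly annotated to a term is also implicitly annotated to every ancestor of that term. The toy ontology (`parents`), the term IDs `T1` to `T4`, the gene names, and the helpers `ancestors_of` and `add_indirect` are all hypothetical. This is an illustration only, not the package's implementation, which works on the real GO graph.

```r
# Toy ontology (hypothetical): each term maps to its direct parents.
parents <- list(
  T1 = character(0),      # root term
  T2 = "T1",
  T3 = "T1",
  T4 = c("T2", "T3")
)

# Collect all ancestors of a term by walking up the parent links.
ancestors_of <- function(term, parents) {
  found <- character(0)
  queue <- parents[[term]]
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (!(current %in% found)) {
      found <- c(found, current)
      queue <- c(queue, parents[[current]])
    }
  }
  found
}

# Direct annotation in the TERM2GENE layout: GO term first, gene second.
TERM2GENE <- data.frame(
  GO   = c("T4", "T2"),
  GENE = c("geneA", "geneB"),
  stringsAsFactors = FALSE
)

# Expand each direct annotation to the term itself plus all its ancestors.
add_indirect <- function(term2gene, parents) {
  expanded <- lapply(seq_len(nrow(term2gene)), function(i) {
    terms <- c(term2gene$GO[i], ancestors_of(term2gene$GO[i], parents))
    data.frame(GO = terms, GENE = term2gene$GENE[i], stringsAsFactors = FALSE)
  })
  unique(do.call(rbind, expanded))
}

full_map <- add_indirect(TERM2GENE, parents)
print(full_map)
```

In this toy case `geneA` ends up annotated to all four terms, while `geneB` gains only the root term in addition to its direct annotation.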
Given two gene clusters, this function calculates semantic similarity between them.
clusterSim( cluster1, cluster2, semData, measure = "Wang", drop = "IEA", combine = "BMA" )
cluster1 |
A set of gene IDs. |
cluster2 |
Another set of gene IDs. |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
drop |
A set of evidence codes based on which certain annotations are dropped. Use NULL to keep all GO annotations. |
combine |
One of "max", "avg", "rcmax", "BMA" methods, for combining semantic similarity scores of multiple GO terms associated with a protein, or of multiple proteins associated with a protein cluster. |
similarity
Yu et al. (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products Bioinformatics (Oxford, England), 26:7 976–978, April 2010. ISSN 1367-4803 http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/7/976 PMID: 20179076
goSim
mgoSim
geneSim
mgeneSim
mclusterSim
d <- godata('org.Hs.eg.db', ont="MF", computeIC=FALSE)
cluster1 <- c("835", "5261","241", "994")
cluster2 <- c("307", "308", "317", "321", "506", "540", "378", "388", "396")
clusterSim(cluster1, cluster2, semData=d, measure="Wang")
Functions for combining similarity matrix to similarity score
combineScores(SimScores, combine)
SimScores |
similarity matrix |
combine |
combine method |
similarity value
Guangchuang Yu http://guangchuangyu.github.io
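The four combine strategies named throughout this manual ("max", "avg", "rcmax", "BMA") can be illustrated on a small similarity matrix. The sketch below follows the way these strategies are commonly described (BMA being the best-match average over row and column maxima); the function name `combine_sketch` and the toy matrix are hypothetical, and the package's own handling of missing values or rounding may differ.

```r
# Combine a term-by-term similarity matrix into one score.
# Rows are terms of the first entity, columns are terms of the second.
combine_sketch <- function(sim, combine = c("max", "avg", "rcmax", "BMA")) {
  combine <- match.arg(combine)
  row_max <- apply(sim, 1, max, na.rm = TRUE)  # best match for each row term
  col_max <- apply(sim, 2, max, na.rm = TRUE)  # best match for each column term
  switch(combine,
    max   = max(sim, na.rm = TRUE),
    avg   = mean(sim, na.rm = TRUE),
    rcmax = max(mean(row_max), mean(col_max)),
    BMA   = sum(row_max, col_max) / (length(row_max) + length(col_max))
  )
}

# Toy matrix: three terms against two terms.
sim <- matrix(c(0.2, 0.8,
                0.6, 0.4,
                0.1, 0.3), nrow = 3, byrow = TRUE)

scores <- vapply(c("max", "avg", "rcmax", "BMA"),
                 function(m) combine_sketch(sim, m), numeric(1))
print(scores)
```

"max" and "avg" look at the whole matrix, whereas "rcmax" and "BMA" first reduce each row and column to its best match, which makes them less sensitive to many weakly related term pairs.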
Given two genes, this function will calculate the semantic similarity between them, and return their semantic similarity and the corresponding GO terms
geneSim(gene1, gene2, semData, measure = "Wang", drop = "IEA", combine = "BMA")
gene1 |
Entrez gene id. |
gene2 |
Another entrez gene id. |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
drop |
A set of evidence codes based on which certain annotations are dropped. Use NULL to keep all GO annotations. |
combine |
One of "max", "avg", "rcmax", "BMA" methods, for combining semantic similarity scores of multiple GO terms associated with a protein, or of multiple proteins associated with a protein cluster. |
list of similarity value and corresponding GO.
Yu et al. (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products Bioinformatics (Oxford, England), 26:7 976–978, April 2010. ISSN 1367-4803 http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/7/976 PMID: 20179076
goSim
mgoSim
mgeneSim
clusterSim
mclusterSim
d <- godata('org.Hs.eg.db', ont="MF", computeIC=FALSE)
geneSim("241", "251", semData=d, measure="Wang")
These datasets contain the information content (IC) of GO terms.
Yu et al. (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products Bioinformatics (Oxford, England), 26:7 976–978, April 2010. ISSN 1367-4803 http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/7/976 PMID: 20179076
prepare GO DATA for measuring semantic similarity
godata( OrgDb = NULL, annoDb = NULL, keytype = "ENTREZID", ont, computeIC = TRUE, processTCSS = FALSE, cutoff = NULL )
OrgDb |
OrgDb object (will be removed in future, please use annoDb instead) |
annoDb |
GO annotation database; can be an OrgDb or a data.frame containing three columns: 'GENE', 'GO' and 'ONTOLOGY'. |
keytype |
keytype |
ont |
one of 'BP', 'MF', 'CC' |
computeIC |
logical, whether to compute IC |
processTCSS |
logical, whether to process TCSS |
cutoff |
cutoff of TCSS |
GOSemSimDATA object
Guangchuang Yu
Class "GOSemSimDATA". This class stores IC and the gene-to-GO mapping for semantic similarity measurement.
keys
gene ID
ont
ontology
IC
IC data
geneAnno
gene to GO mapping
tcssdata
tcssdata
metadata
metadata
Given two GO IDs, this function calculates their semantic similarity.
goSim(GOID1, GOID2, semData, measure = "Wang")
GOID1 |
GO ID 1. |
GOID2 |
GO ID 2. |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
similarity
Yu et al. (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products Bioinformatics (Oxford, England), 26:7 976–978, April 2010. ISSN 1367-4803 http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/7/976 PMID: 20179076
mgoSim
geneSim
mgeneSim
clusterSim
mclusterSim
d <- godata('org.Hs.eg.db', ont="MF", computeIC=FALSE)
goSim("GO:0004022", "GO:0005515", semData=d, measure="Wang")
Information Content Based Methods for semantic similarity measuring
infoContentMethod(ID1, ID2, method, godata)
ID1 |
Ontology Term |
ID2 |
Ontology Term |
method |
one of "Resnik", "Jiang", "Lin", "Rel" and "TCSS". |
godata |
GOSemSimDATA object |
Implements the methods proposed by Resnik, Jiang, Lin and Schlicker.
semantic similarity score
Guangchuang Yu https://guangchuangyu.github.io
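For orientation, the information-content measures attributed above to Resnik, Lin, Jiang and Schlicker ("Rel") can be written in their textbook form. The sketch assumes IC is defined as -log(p), with p the annotation probability of a term, and takes the IC of the most informative common ancestor (MICA) as given. The function name `ic_similarity`, its arguments and the toy values are hypothetical; the package may normalise IC or differ in other details, and TCSS is not covered here.

```r
# Textbook IC-based similarity between two terms, given the IC of each term
# and the IC of their most informative common ancestor (MICA).
# Assumption: IC = -log(p), so p(MICA) = exp(-ic_mica).
ic_similarity <- function(ic1, ic2, ic_mica,
                          method = c("Resnik", "Lin", "Jiang", "Rel")) {
  method <- match.arg(method)
  switch(method,
    Resnik = ic_mica,                                   # IC of the MICA itself
    Lin    = 2 * ic_mica / (ic1 + ic2),                 # shared IC relative to total IC
    Jiang  = 1 - min(1, ic1 + ic2 - 2 * ic_mica),       # distance turned into a similarity
    Rel    = 2 * ic_mica / (ic1 + ic2) * (1 - exp(-ic_mica))  # Lin weighted by 1 - p(MICA)
  )
}

# Toy IC values (hypothetical).
resnik <- ic_similarity(0.5, 0.7, 0.4, "Resnik")
lin    <- ic_similarity(0.5, 0.7, 0.4, "Lin")
jiang  <- ic_similarity(0.5, 0.7, 0.4, "Jiang")
rel    <- ic_similarity(0.5, 0.7, 0.4, "Rel")
print(c(Resnik = resnik, Lin = lin, Jiang = jiang, Rel = rel))
```

Because "Rel" multiplies the Lin score by a factor below one, it never exceeds Lin for the same pair of terms, and it penalises pairs whose common ancestor is very general.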
load OrgDb
load_OrgDb(OrgDb)
OrgDb |
OrgDb object or OrgDb name |
OrgDb object
Guangchuang Yu https://yulab-smu.top
Given a list of gene clusters, this function calculates pairwise semantic similarities.
mclusterSim(clusters, semData, measure = "Wang", drop = "IEA", combine = "BMA")
clusters |
A list of gene clusters. |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
drop |
A set of evidence codes based on which certain annotations are dropped. Use NULL to keep all GO annotations. |
combine |
One of "max", "avg", "rcmax", "BMA" methods, for combining semantic similarity scores of multiple GO terms associated with a protein, or of multiple proteins associated with a protein cluster. |
similarity matrix
Yu et al. (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products Bioinformatics (Oxford, England), 26:7 976–978, April 2010. ISSN 1367-4803 http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/7/976 PMID: 20179076
goSim
mgoSim
geneSim
mgeneSim
clusterSim
d <- godata('org.Hs.eg.db', ont="MF", computeIC=FALSE)
cluster1 <- c("835", "5261","241")
cluster2 <- c("578","582")
cluster3 <- c("307", "308", "317")
clusters <- list(a=cluster1, b=cluster2, c=cluster3)
mclusterSim(clusters, semData=d, measure="Wang")
Given a list of genes, this function calculates pairwise semantic similarities.
mgeneSim( genes, semData, measure = "Wang", drop = "IEA", combine = "BMA", verbose = TRUE )
genes |
A list of entrez gene IDs. |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
drop |
A set of evidence codes based on which certain annotations are dropped. Use NULL to keep all GO annotations. |
combine |
One of "max", "avg", "rcmax", "BMA" methods, for combining semantic similarity scores of multiple GO terms associated with a protein, or of multiple proteins associated with a protein cluster. |
verbose |
show progress bar or not. |
similarity matrix
Yu et al. (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products Bioinformatics (Oxford, England), 26:7 976–978, April 2010. ISSN 1367-4803 http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/7/976 PMID: 20179076
goSim
mgoSim
geneSim
clusterSim
mclusterSim
d <- godata('org.Hs.eg.db', ont="MF", computeIC=FALSE)
mgeneSim(c("835", "5261","241"), semData=d, measure="Wang")
Given two GO term sets, this function will calculate the semantic similarity between them, and return their semantic similarity
mgoSim(GO1, GO2, semData, measure = "Wang", combine = "BMA")
GO1 |
A set of go terms. |
GO2 |
Another set of go terms. |
semData |
GOSemSimDATA object |
measure |
One of "Resnik", "Lin", "Rel", "Jiang", "TCSS" and "Wang" methods. |
combine |
One of "max", "avg", "rcmax", "BMA" methods, for combining semantic similarity scores of multiple GO terms associated with a protein, or of multiple proteins associated with a protein cluster. |
similarity
Yu et al. (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products Bioinformatics (Oxford, England), 26:7 976–978, April 2010. ISSN 1367-4803 http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/7/976 PMID: 20179076
goSim
geneSim
mgeneSim
clusterSim
mclusterSim
d <- godata('org.Hs.eg.db', ont="MF", computeIC=FALSE)
go1 <- c("GO:0004022", "GO:0004024", "GO:0004023")
go2 <- c("GO:0009055", "GO:0020037")
mgoSim("GO:0003824", go2, semData=d, measure="Wang")
mgoSim(go1, go2, semData=d, measure="Wang")
Given a BLAST2GO file, this function extracts the information from it and makes it usable as TERM2GENE.
read.blast2go(file, add_indirect_GO = FALSE)
file |
BLAST2GO file |
add_indirect_GO |
whether to add indirect GO annotation |
a data frame with three columns: GENE, GO and ONTOLOGY
parse GAF files
read.gaf(file, asis = FALSE, add_indirect_GO = FALSE)
parse_gff(file, asis = FALSE, add_indirect_GO = FALSE)
file |
GAF file |
asis |
logical, whether to output the original contents of the file; only works if 'add_indirect_GO = FALSE' |
add_indirect_GO |
whether to add indirect GO annotation |
given a GAF file, this function extracts the information from it
A data.frame. The original table if 'asis' takes effect; otherwise it contains 3 columns: 'GENE', 'GO' and 'ONTOLOGY'.
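To make the three-column result concrete, here is a base-R sketch of reducing a GAF file to 'GENE', 'GO' and 'ONTOLOGY'. It assumes the GAF 2.x layout (comment lines starting with "!", tab-separated fields, column 2 = DB object ID, column 5 = GO ID, column 9 = aspect coded as P, F or C). The function name `read_gaf_sketch`, the choice of column 2 as the gene identifier, and the two annotation lines are hypothetical; the package's own parser may select or rename columns differently.

```r
# Minimal GAF reader sketch: keep gene ID, GO ID and ontology only.
read_gaf_sketch <- function(file) {
  raw <- read.delim(file, header = FALSE, comment.char = "!", quote = "",
                    stringsAsFactors = FALSE, colClasses = "character")
  # Map the single-letter GAF aspect codes to ontology abbreviations.
  aspect <- c(P = "BP", F = "MF", C = "CC")
  unique(data.frame(
    GENE     = raw[[2]],
    GO       = raw[[5]],
    ONTOLOGY = unname(aspect[raw[[9]]]),
    stringsAsFactors = FALSE
  ))
}

# Two hypothetical annotation lines written to a temporary file.
gaf_lines <- c(
  "!gaf-version: 2.2",
  paste(c("UniProtKB", "P12345", "GENE1", "", "GO:0005515", "PMID:1", "IPI",
          "", "F", "", "", "protein", "taxon:9606", "20200101", "UniProt",
          "", ""), collapse = "\t"),
  paste(c("UniProtKB", "P67890", "GENE2", "", "GO:0006915", "PMID:2", "IDA",
          "", "P", "", "", "protein", "taxon:9606", "20200101", "UniProt",
          "", ""), collapse = "\t")
)
gaf_file <- tempfile(fileext = ".gaf")
writeLines(gaf_lines, gaf_file)

anno <- read_gaf_sketch(gaf_file)
print(anno)
```

Reading every column as character avoids a pitfall specific to this format: an aspect column containing only "F" would otherwise be interpreted by R as logical FALSE.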
determine the topological cutoff for TCSS method
tcss_cutoff( OrgDb = NULL, keytype = "ENTREZID", ont, combine_method = "max", ppidata )
OrgDb |
OrgDb object |
keytype |
keytype |
ont |
ontology : "BP", "MF", "CC" |
combine_method |
"max", "BMA", "avg", "rcmax", "rcmax.avg" |
ppidata |
A data.frame containing a positive set and a negative set. The positive set consists of PPI pairs that have already been verified. ppidata has three columns: columns 1 and 2 are character, and column 3 must be a logical value (TRUE/FALSE). |
numeric, topological cutoff for given parameters
## Not run:
library(org.Hs.eg.db)
library(STRINGdb)
string_db <- STRINGdb$new(version = "11.0", species = 9606, score_threshold = 700)
string_proteins <- string_db$get_proteins()
# get relationship
ppi <- string_db$get_interactions(string_proteins$protein_external_id)
ppi$from <- vapply(ppi$from, function(e) strsplit(e, "9606.")[[1]][2], character(1))
ppi$to <- vapply(ppi$to, function(e) strsplit(e, "9606.")[[1]][2], character(1))
len <- nrow(ppi)
# select length
s_len <- 100
pos_1 <- sample(len, s_len, replace = TRUE)
# negative set
pos_2 <- sample(len, s_len, replace = TRUE)
pos_3 <- sample(len, s_len, replace = TRUE)
# union as ppidata
ppidata <- data.frame(pro1 = c(ppi$from[pos_1], ppi$from[pos_2]),
                      pro2 = c(ppi$to[pos_1], ppi$to[pos_3]),
                      label = c(rep(TRUE, s_len), rep(FALSE, s_len)),
                      stringsAsFactors = FALSE)
cutoff <- tcss_cutoff(OrgDb = org.Hs.eg.db, keytype = "ENSEMBLPROT", ont = "BP", combine_method = "max", ppidata)
## End(Not run)
measuring similarities between two term vectors.
termSim( t1, t2, semData, method = c("Wang", "Resnik", "Rel", "Jiang", "Lin", "TCSS") )
t1 |
term vector |
t2 |
term vector |
semData |
GOSemSimDATA object |
method |
one of "Wang", "Resnik", "Rel", "Jiang", "Lin" and "TCSS". |
Given two term vectors, this function calculates their pairwise similarities.
score matrix
Guangchuang Yu http://guangchuangyu.github.io
Method Wang for semantic similarity measuring
wangMethod_internal(ID1, ID2, ont = "BP")
ID1 |
Ontology Term |
ID2 |
Ontology Term |
ont |
Ontology |
semantic similarity score
Guangchuang Yu https://yulab-smu.top
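As background, Wang's graph-based measure can be sketched on a toy ontology in base R. Each term receives an S-value for every term in its ancestor graph: the term itself scores 1, and each ancestor keeps the largest product of edge weights along any path down to the term. The similarity is the sum of S-values over shared ancestors divided by the total S-values of both terms. The edge weights 0.8 (is_a) and 0.6 (part_of) follow the values commonly quoted for this method; the toy ontology and the function names `s_values` and `wang_sketch` are hypothetical, and this is not the code behind `wangMethod_internal`.

```r
# Toy ontology (hypothetical): named numeric vectors give each term's
# parents and the weight of the connecting edge (0.8 = is_a, 0.6 = part_of).
parents <- list(
  root = numeric(0),
  A    = c(root = 0.8),
  B    = c(root = 0.8),
  C    = c(A = 0.8, B = 0.6)
)

# S-values of `term` and all of its ancestors.
s_values <- function(term, parents) {
  s <- c(1)
  names(s) <- term
  queue <- term
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    edges <- parents[[current]]
    for (p in names(edges)) {
      candidate <- unname(s[current] * edges[p])
      if (is.na(s[p]) || candidate > s[p]) {
        s[p] <- candidate
        queue <- c(queue, p)  # revisit ancestors with the improved value
      }
    }
  }
  s
}

# Similarity: S-values summed over shared ancestors, divided by the
# summed S-values of both terms.
wang_sketch <- function(t1, t2, parents) {
  s1 <- s_values(t1, parents)
  s2 <- s_values(t2, parents)
  common <- intersect(names(s1), names(s2))
  sum(s1[common], s2[common]) / (sum(s1) + sum(s2))
}

sim_AC <- wang_sketch("A", "C", parents)
sim_AA <- wang_sketch("A", "A", parents)
print(c(A_vs_C = sim_AC, A_vs_A = sim_AA))
```

Unlike the information-content measures, this calculation needs only the graph structure, which is consistent with the examples in this manual calling `godata` with `computeIC=FALSE` when the "Wang" measure is used.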