Title: | Information Accretion-based Function Predictor Evaluation |
---|---|
Description: | This package implements methods to calculate information accretion for a given version of the gene ontology and uses this data to calculate remaining uncertainty, misinformation, and semantic similarity for given sets of predicted annotations and true annotations from a protein function predictor. |
Authors: | Ian Gonzalez and Wyatt Clark |
Maintainer: | Ian Gonzalez <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.41.0 |
Built: | 2024-12-19 04:08:29 UTC |
Source: | https://github.com/bioc/SemDist |
Calculates information accretion for each term in the specified ontology using either user -specified data or the sequence annotations for the organisms specified (note that organism-specific pacakges must be downloaded separately. See "note" section).
computeIA(ont, organism, evcodes = NULL, specify.ont = FALSE, myont = NULL, specify.annotations = FALSE, annotfile = NULL)
computeIA(ont, organism, evcodes = NULL, specify.ont = FALSE, myont = NULL, specify.annotations = FALSE, annotfile = NULL)
ont |
Character representation of ontology version to use. One of "CC", "MF", or "BP" , corresponding to Cellular Component, Molecular Function, and Biological Process. |
organism |
A character vector indicating which organism's annotation data to use. |
evcodes |
A character vector specifying which evidence codes to use in the ontology data. Default NULL value causes all codes to be used. |
specify.ont |
A boolean indicating whether the user wants to specify their own version of the ontology. |
myont |
Character object indicating what file to read in the specified ontology from. The ontology should be specified as a tab-delimited file with 2 columns (no header). Each row in the file should indicate a parent-child relationship between two GO accessions (e.g. "GO:0003674 GO:0004000") |
specify.annotations |
Boolean indicating whether the user wants to specify sequence annotations from a file. Should only be TRUE if specify.ont is TRUE. |
annotfile |
Character object indicating which file to read sequence annotations from. Should be a tab-delimited file with 2 columns. The first column is a list of sequences, the second is a list of GO accessions in the same rows as the sequences they annotate. |
Does not return a specific value. Saves the information accretion values for each term in the ontology in a .rda file that specifies the organism and the ont version. Parent count and term count objects are also saved in similarly formatted files so that IA calculations from multiple organisms can be combined.
In order to compute IA for an organism, the specific annotation data set for that organism must be installed by the user. Here is a list of supported organisms (names in the format that should be passed to computeIA) and the correspoinding packages needed:
anopheles = org.Ag.eg.db
arabidopsis = org.At.tair.db
bovine = org.Bt.eg.db
canine = org.Cf.eg.db
chicken = org.Gg.eg.db
chimp = org.Pt.eg.db
ecolik12 = org.EcK12.eg.db
fly = org.Dm.eg.db
human = org.Hs.eg.db
malaria = org.Pf.plasmo.db
mouse = org.Mm.eg.db
pig = org.Ss.eg.db
rat = org.Rn.eg.db
rhesus = org.Mmu.eg.db
worm = org.Ce.eg.db
xenopus = org.Xl.eg.db
yeast = org.Sc.sgd.db
zebrafish = org.Dr.eg.db
Ian Gonzalez and Wyatt Clark
# Calculate IA, specify ontology and annotations ontfile <- system.file("extdata", "mfo_ontology.txt", package="SemDist") annotations <- system.file("extdata", "MFO_LABELS_TEST.txt", package="SemDist") computeIA("my", "values", specify.ont=TRUE, myont=ontfile, specify.annotations=TRUE, annotfile=annotations)
# Calculate IA, specify ontology and annotations ontfile <- system.file("extdata", "mfo_ontology.txt", package="SemDist") annotations <- system.file("extdata", "MFO_LABELS_TEST.txt", package="SemDist") computeIA("my", "values", specify.ont=TRUE, myont=ontfile, specify.annotations=TRUE, annotfile=annotations)
Reads in a file containing the true terms annotating a set of sequences and a file containing the predicted terms and scores for a set of sequences and outputs a data frame containing the remaining uncertainty and misinformation values for the predictions made for each sequence.
findRUMI(ont, organism, threshold = 0.05, truefile="", predfile = "", IAccr = NULL)
findRUMI(ont, organism, threshold = 0.05, truefile="", predfile = "", IAccr = NULL)
ont |
Character representation of ontology version to use. One of "CC", "MF", or "BP" , corresponding to Cellular Component, Molecular Function, and Biological Process. |
organism |
A character vector indicating which organism(s) annotation data to use. |
threshold |
Score above which a predicted annotation should be included in the calculation. Must be a numeric value between 0 and 1, or else findRUMI throws an error. |
truefile |
Text file from which to read true annotations of sequences. Should be a tab-delineated file with 2 columns: Sequences and GO terms (accessions). |
predfile |
Text file from which to read predicted annotations of sequences. Should be a tab-delineated file with 3 columns: Sequences, GO terms (accessions), and probability score from 0 to 1 for each prediction. |
IAccr |
A variable containing a named numeric vector of IA values for all the GO terms being used that will be used for calculations instead of R packages. This argument is optional. |
A data frame containing the RU and MI values for each sequence in the file.
Ian Gonzalez and Wyatt Clark
# Using test data sets from SemDist, calculate RU and MI: truefile <- system.file("extdata", "MFO_LABELS_TEST.txt", package="SemDist") predfile <- system.file("extdata", "MFO_PREDS_TEST.txt", package="SemDist") rumiTable <- findRUMI("MF", "human", 0.75, truefile, predfile) avgRU <- mean(rumiTable$RU) avgMI <- mean(rumiTable$MI)
# Using test data sets from SemDist, calculate RU and MI: truefile <- system.file("extdata", "MFO_LABELS_TEST.txt", package="SemDist") predfile <- system.file("extdata", "MFO_PREDS_TEST.txt", package="SemDist") rumiTable <- findRUMI("MF", "human", 0.75, truefile, predfile) avgRU <- mean(rumiTable$RU) avgMI <- mean(rumiTable$MI)
This data set contains the information accretion values for each term in the requested ontology/species.
IAccr
IAccr
A named numeric vector with one value corresponding to each GO accession in the ontology.
The gene ontology data was obtained from the GO.db package and the annotation data was obtained from the following packages for each organism:
anopheles = org.Ag.eg.db
arabidopsis = org.At.tair.db
bovine = org.Bt.eg.db
canine = org.Cf.eg.db
chicken = org.Gg.eg.db
chimp = org.Pt.eg.db
ecolik12 = org.EcK12.eg.db
fly = org.Dm.eg.db
human = org.Hs.eg.db
malaria = org.Pf.plasmo.db
mouse = org.Mm.eg.db
pig = org.Ss.eg.db
rat = org.Rn.eg.db
rhesus = org.Mmu.eg.db
worm = org.Ce.eg.db
xenopus = org.Xl.eg.db
yeast = org.Sc.sgd.db
zebrafish = org.Dr.eg.db
data("Info_Accretion_mouse_CC", package = "SemDist") str(IAccr)
data("Info_Accretion_mouse_CC", package = "SemDist") str(IAccr)
This data set contains the parent count values for each term in the requested ontology/species (the number of times that the term's parents annotate a protein). This can be used along with the term count to calculate information accretion.
parentcnt
parentcnt
A named numeric vector with one value corresponding to each GO accession in the ontology.
The gene ontology data was obtained from the GO.db package and the annotation data was obtained from the following packages for each organism:
anopheles = org.Ag.eg.db
arabidopsis = org.At.tair.db
bovine = org.Bt.eg.db
canine = org.Cf.eg.db
chicken = org.Gg.eg.db
chimp = org.Pt.eg.db
ecolik12 = org.EcK12.eg.db
fly = org.Dm.eg.db
human = org.Hs.eg.db
malaria = org.Pf.plasmo.db
mouse = org.Mm.eg.db
pig = org.Ss.eg.db
rat = org.Rn.eg.db
rhesus = org.Mmu.eg.db
worm = org.Ce.eg.db
xenopus = org.Xl.eg.db
yeast = org.Sc.sgd.db
zebrafish = org.Dr.eg.db
data("Parent_Count_mouse_CC", package = "SemDist") str(parentcnt)
data("Parent_Count_mouse_CC", package = "SemDist") str(parentcnt)
Reads in a (tab-delimited) file containing the true annotations for a set of sequences, a (tab-delimited) file containing the predicted annotations and corresponding scores for the same sequences. Calculates and outputs the average remaining uncertainty, misinformation, and semantic similarity at a series of user-specified thresholds.
RUMIcurve(ont, organism, increment = 0.05, truefile, predfiles, IAccr = NULL, add.weighted = FALSE, add.prec.rec = FALSE)
RUMIcurve(ont, organism, increment = 0.05, truefile, predfiles, IAccr = NULL, add.weighted = FALSE, add.prec.rec = FALSE)
ont |
Character representation of ontology version to use. One of "CC", "MF", or "BP" , corresponding to Cellular Component, Molecular Function, and Biological Process. |
organism |
A character vector indicating which organism(s) annotation data to use. |
increment |
A numeric value between 0 and 1 indicating the distance between each threshold that should be calculated. Note that the iteration starts from a threshold of 1, so an increment value of 0.08 will result in the thresholds 0.92, 0.84, 0.76 ... being used. |
truefile |
A character vector indicating the file from which to read the true annotations for the given sequences. Should be tab-delimited, with the first column containing the sequence ids and the second containing GO accessions. |
predfiles |
A character vector containing which files to read in as the predicted annotations. Should be tab-delimited, with the first column containing sequences, the second column containing GO accessions, and the third column containing the predictors 0-1 score for that prediction. |
IAccr |
A variable containing a named numeric vector of IA values for all the GO terms being used that will be used for calculations instead of R packages. This argument is optional. |
add.weighted |
A boolean indicating whether or not to add calculation of information content weighted versions of RU, MI, and SS to the output. |
add.prec.rec |
A boolean indicating whether or not to calculate precision, recall and specificity values for the prediction at each threshold and add to the output. |
Returns a named list with the same number of elements as the input "predfiles". Each element is a data frame containing all of the user-requested values for the data at each threshold.
Ian Gonzalez and Wyatt Clark
# Using test data sets from SemDist, plot a RUMI curve: truefile <- system.file("extdata", "MFO_LABELS_TEST.txt", package="SemDist") predfile <- system.file("extdata", "MFO_PREDS_TEST.txt", package="SemDist") avgRUMIvals <- RUMIcurve("MF", "human", 0.05, truefile, predfile) firstset <- avgRUMIvals[[1]] plot(firstset$RU, firstset$MI)
# Using test data sets from SemDist, plot a RUMI curve: truefile <- system.file("extdata", "MFO_LABELS_TEST.txt", package="SemDist") predfile <- system.file("extdata", "MFO_PREDS_TEST.txt", package="SemDist") avgRUMIvals <- RUMIcurve("MF", "human", 0.05, truefile, predfile) firstset <- avgRUMIvals[[1]] plot(firstset$RU, firstset$MI)
This data set contains the term count values for each term in the requested ontology/species (the number of times that the term annotates a protein). This can be used along with the parent count to calculate information accretion.
termcnt
termcnt
A named numeric vector with one value corresponding to each GO accession in the ontology.
The gene ontology data was obtained from the GO.db package and the annotation data was obtained from the following packages for each organism:
anopheles = org.Ag.eg.db
arabidopsis = org.At.tair.db
bovine = org.Bt.eg.db
canine = org.Cf.eg.db
chicken = org.Gg.eg.db
chimp = org.Pt.eg.db
ecolik12 = org.EcK12.eg.db
fly = org.Dm.eg.db
human = org.Hs.eg.db
malaria = org.Pf.plasmo.db
mouse = org.Mm.eg.db
pig = org.Ss.eg.db
rat = org.Rn.eg.db
rhesus = org.Mmu.eg.db
worm = org.Ce.eg.db
xenopus = org.Xl.eg.db
yeast = org.Sc.sgd.db
zebrafish = org.Dr.eg.db
data("Term_Count_mouse_CC", package = "SemDist") str(termcnt)
data("Term_Count_mouse_CC", package = "SemDist") str(termcnt)