Title: | Network-based prioritization of putative metabolite IDs |
---|---|
Description: | This package uses an innovative network-based approach that will enhance our ability to determine the identities of significant ions detected by LC-MS. |
Authors: | Zhenzhi Li <[email protected]> |
Maintainer: | Zhenzhi Li <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.25.0 |
Built: | 2024-12-29 05:53:13 UTC |
Source: | https://github.com/bioc/MetID |
A dataset that can be used as input; its row names do not match the default row names.
demo1
A data frame with 20 rows and 6 variables:
Mass of compounds.
Names of putative IDs.
Formulas of putative IDs.
Exact mass of putative IDs.
PubChem IDs of putative IDs.
KEGG IDs of putative IDs.
...
A dataset that can be used as input; its row names do not match the default row names.
demo2
A data frame with 3592 rows and 6 variables:
Mass of compounds.
Names of putative IDs.
Formulas of putative IDs.
Exact mass of putative IDs.
PubChem IDs of putative IDs.
KEGG IDs of putative IDs.
...
Preprocess input file.
get_cleaned(filename, type = c("data.frame", "csv", "txt"), na, sep)
filename |
the name of the file from which the data are to be read. Its type should be chosen in the 'type' parameter. It must have columns named exactly 'metid' (IDs for peaks), 'query_m.z' (query mass of peaks), 'exact_m.z' (exact mass of putative IDs), 'kegg_id' (IDs of putative IDs from the KEGG database), 'pubchem_cid' (CIDs of putative IDs from the PubChem database); otherwise, this function will not work. |
type |
string indicating the type of the file. It can be 'data.frame' for a data frame already loaded into R, or a file type such as 'csv' or 'txt'. |
na |
a character vector of strings which are to be interpreted as NA values. |
sep |
a character value which separates multiple IDs in the kegg_id or pubchem_cid field, if there are multiple IDs. |
get_cleaned returns a list containing the following components:
df |
a data frame which is the original input data. |
clean_data |
a data frame with uninformative observations and features removed. |
mass |
a data frame with unique query peak, along with query mass. |
ID |
a data frame with unique putative IDs, along with PubChem ID, KEGG ID, and exact mass. |
index_na |
a vector of indexes of the rows that contain NA values. |
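The column requirements and the role of 'sep' can be illustrated with a small base R sketch. It does not call the package: the table `peaks`, its values, and the helpers `missing_columns` and `split_ids` are hypothetical and exist only for illustration.

```r
# Hypothetical input table using the column names required by get_cleaned.
# All values are illustrative only.
peaks <- data.frame(
  metid       = c("p1", "p1", "p2"),
  query_m.z   = c(181.0707, 181.0707, 147.0764),
  exact_m.z   = c(180.0634, 180.0634, 146.0691),
  kegg_id     = c("C00031;C00221", "C00095", NA),
  pubchem_cid = c("5793", "5984", "5961"),
  stringsAsFactors = FALSE
)

required <- c("metid", "query_m.z", "exact_m.z", "kegg_id", "pubchem_cid")

# Hypothetical helper: return the required column names absent from df.
missing_columns <- function(df, required) setdiff(required, names(df))

# An empty result means every required column is present.
missing_columns(peaks, required)

# Hypothetical helper: split a multi-ID field on the separator
# (";" here, corresponding to sep = ";"). NA entries stay NA.
split_ids <- function(x, sep = ";") strsplit(x, sep, fixed = TRUE)
kegg_ids <- split_ids(peaks$kegg_id)
```

Checking the column names before calling the package function gives a clearer failure than letting the call stop on a missing column.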
Build a network between identifications based on the KEGG network database.
get_kegg_network(kegg_id)
kegg_id |
a vector of strings indicating KEGG ID of putative ID. |
a binary matrix representing the network of KEGG IDs.
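To make the idea of a binary network matrix concrete, the following base R sketch builds a 0/1 adjacency matrix from a two-column edge list shaped like the kegg_network dataset. It is not the package's implementation: the `edges` table, the helper `build_adjacency`, and the choice to make the matrix symmetric are all assumptions.

```r
# Hypothetical edge list with the same two-column shape as kegg_network.
edges <- data.frame(
  from = c("C00031", "C00031", "C00095"),
  to   = c("C00095", "C00221", "C00267"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build a symmetric 0/1 adjacency matrix restricted
# to the queried IDs. Edges touching IDs outside the query are dropped.
build_adjacency <- function(ids, edges) {
  ids <- unique(ids)
  adj <- matrix(0L, nrow = length(ids), ncol = length(ids),
                dimnames = list(ids, ids))
  keep <- edges$from %in% ids & edges$to %in% ids
  idx <- cbind(edges$from[keep], edges$to[keep])
  adj[idx] <- 1L                           # mark from -> to
  adj[idx[, 2:1, drop = FALSE]] <- 1L      # mark to -> from (symmetry assumed)
  adj
}

adj <- build_adjacency(c("C00031", "C00095", "C00221"), edges)
```

Indexing with a two-column character matrix matches entries against the row and column names, which avoids an explicit loop over edges.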
Get scores for metabolite putative IDs by LC-MS.
get_scores_for_LC_MS(filename, type = c("data.frame", "csv", "txt"), na = "NA", sep = ";", mode = c("POS", "NEG"), Size = 2000, delta = 1, gamma_mass = 10, iterations = 500)
filename |
the name of the file which the data are to be read from. Its type should be chosen in 'type' parameter. Also, it should have columns named exactly 'metid' (IDs for peaks), 'query_m.z' (query mass of peaks), 'exact_m.z' (exact mass of putative IDs), 'kegg_id' (IDs of putative IDs from KEGG Database), 'pubchem_cid' (CIDs of putative IDs from PubChem Database). Otherwise, this function would not work. |
type |
string indicating the type of the file. It can be a 'data.frame' which is already loaded into R, or some other specified types like a csv file. |
na |
a character vector of strings which are to be interpreted as NA values. |
sep |
a character value which separates multiple IDs in the kegg_id or pubchem_cid field, if there are multiple IDs. |
mode |
string indicating the mode of metabolites. It can be positive mode (POS) or negative mode (NEG). |
Size |
an integer which indicates sample size in Gibbs sampling. |
delta |
a hyper-parameter representing the mean value of mass ratio. |
gamma_mass |
a hyper-parameter representing the accuracy of mass measurement. |
iterations |
the number of iterations; default 500. |
A data frame containing the input data with a column of scores appended at the end. In the score column, if the row contains NA values or does not have a PubChem CID, the score is '-', which stands for a missing value. Otherwise, each score is between 0 and 1.
## check if colnames of dataset meet requirement
names(demo1)
## change colnames
colnames(demo1) <- c('query_m.z','name','formula','exact_m.z','pubchem_cid','kegg_id')
## get scores
out <- get_scores_for_LC_MS(demo1, type = 'data.frame', na='-', mode='POS')
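Because the score column uses '-' for missing values, it usually needs converting before candidates can be ranked. The sketch below works on a hypothetical data frame `scored` that only mimics the described output shape (the column name `score` and all values are assumptions); it does not run the package.

```r
# Hypothetical result in the shape described above: input columns plus a
# final score column in which "-" marks a missing score.
scored <- data.frame(
  query_m.z = c(147.08, 147.08, 147.08, 181.07),
  name      = c("cand_A", "cand_B", "cand_C", "cand_D"),
  score     = c("0.62", "0.38", "-", "1"),
  stringsAsFactors = FALSE
)

# Convert the score column to numeric, mapping "-" to NA.
scored$score_num <- as.numeric(replace(scored$score, scored$score == "-", NA))

# Order candidates by query mass, then from highest to lowest score.
ranked <- scored[order(scored$query_m.z, -scored$score_num, na.last = TRUE), ]

# Keep the top-scoring candidate for each query mass.
best <- ranked[!duplicated(ranked$query_m.z), ]
```

Keeping the original character column alongside the numeric one makes it easy to see which rows were unscored.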
Build a network between identifications based on Tanimoto score.
get_tani_network(pubchem_cid)
pubchem_cid |
a vector of strings indicating PubChem CID of putative ID. |
a binary matrix representing the network based on Tanimoto scores.
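For readers unfamiliar with the Tanimoto score, the sketch below computes it for binary fingerprints and thresholds the result into a 0/1 matrix. It is only an illustration under stated assumptions: the 8-bit fingerprints, the names `cid_1` to `cid_3`, and the 0.7 cutoff are hypothetical, and the source does not say how the package obtains fingerprints or which cutoff it applies.

```r
# Tanimoto coefficient between two binary fingerprints (0/1 vectors):
# shared "on" bits divided by bits that are "on" in either vector.
tanimoto <- function(a, b) {
  both <- sum(a & b)
  either <- sum(a | b)
  if (either == 0) return(0)
  both / either
}

# Hypothetical 8-bit fingerprints; real fingerprints are much longer.
fp <- rbind(
  cid_1 = c(1, 1, 0, 1, 0, 0, 1, 0),
  cid_2 = c(1, 1, 0, 1, 0, 0, 0, 0),
  cid_3 = c(0, 0, 1, 0, 1, 1, 0, 0)
)

# Pairwise similarity matrix.
n <- nrow(fp)
sim <- matrix(0, n, n, dimnames = list(rownames(fp), rownames(fp)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim[i, j] <- tanimoto(fp[i, ], fp[j, ])
  }
}

# Threshold into a binary network; the cutoff is an assumed value.
threshold <- 0.7
net <- (sim >= threshold) * 1L
diag(net) <- 0L  # no self-connections in this sketch
```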
A dataset containing PubChem CIDs and InChIKeys from the PubChem database.
InchiKey
A data frame with 101494 rows and 2 variables:
PubChem CIDs
InChIKeys
...
A dataset containing KEGG IDs from the KEGG database together with their network connections.
kegg_network
A data frame with 57070 rows and 2 variables:
KEGG IDs
KEGG IDs that are connected to the KEGG ID in the first column
...
The MetID package provides one important function: get_scores_for_LC_MS
get_scores_for_LC_MS: Get scores for metabolite putative IDs by LC-MS.