“ginmappeR” is an R package that translates gene or protein identifiers
between state-of-the-art biological sequence databases: CARD
(https://card.mcmaster.ca/; Alcock et al. 2023), NCBI Protein,
Nucleotide and Gene (https://www.ncbi.nlm.nih.gov/), UniProt (https://www.uniprot.org/; ‘UniProt’ 2017) and KEGG (https://www.kegg.jp; Kanehisa & Goto 2000). It also offers
complementary functionality, such as retrieval of NCBI identical proteins
or UniProt similar gene clusters.
Nowadays, several biological sequence databases, such as NCBI, UniProt and KEGG, offer programmatic interfaces (APIs) to access their data. Consequently, community-developed R packages that consume these services are available: rentrez (Winter 2017), UniProt.ws (Carlson & Ramos 2022) and KEGGREST (Tenenbaum & Maintainer 2022), respectively. Other databases, like the Comprehensive Antibiotic Resistance Database (CARD), offer their data as a downloadable file.
The heterogeneity and low coupling of these tools motivated us to
conceive ginmappeR, an integral package that translates gene or protein
identifiers between the mentioned databases, making it easier for users
to work with multiple data sources in a unified and complete way.
The gene/protein identifier translation feature is bidirectional between
every pair of cited databases, which translates into a 6x6 matrix (see
figure below) of functions of the form getSource2Target. For example, to
translate from CARD to UniProt, getCARD2UniProt can be used.
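The naming scheme can be sketched in base R without calling the package. The database labels below are assumptions inferred from the getSource2Target pattern and from getCARD2UniProt; the names actually exported by ginmappeR may be spelled differently, and blanking the diagonal (no translation from a database to itself) is also an assumption.

```r
# Database labels are assumptions inferred from the getSource2Target pattern.
dbs <- c("CARD", "NCBIProtein", "NCBINucleotide", "NCBIGene", "UniProt", "KEGG")

# Build every source/target combination as a 6x6 character matrix.
fnNames <- outer(dbs, dbs, function(src, tgt) paste0("get", src, "2", tgt))
dimnames(fnNames) <- list(source = dbs, target = dbs)

# Assumption: a database is not translated to itself, so blank the diagonal.
diag(fnNames) <- NA_character_

# Look up the function name for a given source and target.
fnNames["CARD", "UniProt"]

# Count the remaining source/target pairs.
sum(!is.na(fnNames))
```

Rows are sources and columns are targets, so the candidate function name for any pair can be looked up by label.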
Additionally, features that were not available in the respective
packages (such as retrieval of UniProt similar gene clusters) or were not
easily accessible (such as retrieval of NCBI identical proteins) are part
of ginmappeR’s id translation implementation and are also offered to the
user as individual functions: getUniProtSimilarGenes and
getNCBIIdenticalProteins.
Finally, as previously mentioned, the considered databases offer APIs
and associated R packages, except for CARD, which is only available as a
downloadable zip file. To solve this, ginmappeR automatically downloads
CARD’s latest version and also lets the user update it through the
updateCARDDataBase function.
To illustrate the functionality of our package, we first show some id conversion examples, followed by examples of NCBI identical proteins and UniProt similar gene clusters.
Let us take CARD ARO identifier 3003955 and map it to the other
databases, starting with the NCBI group (Protein, Nucleotide and Gene):
## [1] "CCP45647.1"
## [1] "AL123456.3"
## [1] "888575"
Now, let’s map the id to UniProt:
## [1] "P9WJY5"
Finally, let’s map the id to the KEGG database:
## [1] "mtu:Rv2846c"
Some of the mapping functions have parameters to obtain all possible
translations (exhaustiveMapping) or to report the percentage of identity
between the source id and each obtained id (detailedMapping). More
information is available in the package documentation. Let’s see an
example employing these parameters:
# Note that when using exhaustiveMapping = TRUE, it returns a list instead
# of a character vector, to avoid mixing the result identifiers
getCARD2UniProt('3002372', exhaustiveMapping = TRUE, detailedMapping = TRUE)
## [[1]]
## [[1]]$DT
## [1] "Q6QJ79"
##
## [[1]]$`1.0`
## [1] "Q6QJ79" "A0A7G1KXU2" "D0UY02"
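To make the shape of this result easier to work with, the following sketch rebuilds the printed output as a plain R list and accesses its parts. The object is a mock typed in by hand, not fetched from any database. The element names DT and 1.0 are copied from the output above; reading 1.0 as the 100% identity level is an assumption.

```r
# Mock object mirroring the printed output above (not fetched from any database).
res <- list(
  list(DT = "Q6QJ79", `1.0` = c("Q6QJ79", "A0A7G1KXU2", "D0UY02"))
)

# One list element per input identifier; here there is a single input.
first <- res[[1]]

# Ids stored under the DT element.
first$DT

# Names such as 1.0 are not syntactic, so use [[ ]] or backticks.
first[["1.0"]]

# Flatten every group into one deduplicated character vector.
allIds <- unique(unlist(first, use.names = FALSE))
allIds
```

Flattening discards the identity information, so it only makes sense when the grouping is no longer needed.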
All the functions in ginmappeR are vectorized, that is, they can map a vector of identifiers, for example:
## [1] "CCP45647.1" NA "CAA38525.1"
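As the output shows, an identifier without a translation yields NA in the corresponding position. A minimal base R sketch for handling such results follows; the vector is typed in from the output above rather than produced by a ginmappeR call.

```r
# Result vector copied from the output above; NA marks an id with no translation.
mapped <- c("CCP45647.1", NA, "CAA38525.1")

# Positions of the input ids that could not be translated.
failed <- which(is.na(mapped))
failed

# Keep only the successful translations.
translated <- mapped[!is.na(mapped)]
translated
```

Because the result keeps one position per input, the indices in failed can be used to subset the original input vector directly.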
The R package rentrez offers access to NCBI databases, among which is
Identical Protein Groups. To make it more accessible to users,
ginmappeR includes getNCBIIdenticalProteins, which receives an NCBI
identifier and returns its identical proteins as a list of identifiers:
## [[1]]
## [1] "WP_063864654.1" "AHA80958.1" "EKD8974449.1" "EKD8979565.1"
Through the format parameter, it is possible to obtain the results as a
dataframe:
Id | Source | Nucleotide.Accession | Start | Stop | Strand | Protein | Protein.Name | Organism | Strain | Assembly |
---|---|---|---|---|---|---|---|---|---|---|
45721358 | RefSeq | NG_050043.1 | 1 | 861 | + | WP_063864654.1 | class A beta-lactamase SHV-172 | Klebsiella pneumoniae | 845332 | |
45721358 | INSDC | KF513177.1 | 1 | 861 | + | AHA80958.1 | beta-lactamase SHV-172 | Klebsiella pneumoniae | 845332 | |
45721358 | INSDC | ABJLVL010000001.1 | 124981 | 125841 | - | EKD8974449.1 | class A beta-lactamase SHV-172 | Klebsiella pneumoniae | NA | GCA_026265195.1 |
45721358 | INSDC | ABJLVL010000113.1 | 1755 | 2615 | + | EKD8979565.1 | class A beta-lactamase SHV-172 | Klebsiella pneumoniae | NA | GCA_026265195.1 |
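Once the results are in a dataframe, ordinary base R subsetting applies. The sketch below retypes a few columns of the table above by hand, so the column types (integer coordinates, character labels) are assumptions and may differ from what the function actually returns.

```r
# Subset of the table above, retyped by hand; column types are assumptions.
ipg <- data.frame(
  Source = c("RefSeq", "INSDC", "INSDC", "INSDC"),
  Nucleotide.Accession = c("NG_050043.1", "KF513177.1",
                           "ABJLVL010000001.1", "ABJLVL010000113.1"),
  Start = c(1L, 1L, 124981L, 1755L),
  Stop = c(861L, 861L, 125841L, 2615L),
  Strand = c("+", "+", "-", "+"),
  Protein = c("WP_063864654.1", "AHA80958.1",
              "EKD8974449.1", "EKD8979565.1"),
  stringsAsFactors = FALSE
)

# Keep only the RefSeq protein accessions.
refseqProteins <- ipg$Protein[ipg$Source == "RefSeq"]
refseqProteins

# Length of each coding region in nucleotides (coordinates are inclusive).
ipg$Length <- ipg$Stop - ipg$Start + 1L
ipg[, c("Protein", "Strand", "Length")]
```

All four regions span the same number of nucleotides, which is consistent with the entries being identical proteins.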
The function getUniProtSimilarGenes retrieves clusters of genes with
100%, 90% or 50% identity with the provided identifier. Let us try with
UniProt gene Q2A799 and 100% identity:
## [[1]]
## [1] "B0BL11" "A0A344X7M9" "B7VEQ9"
We can use the clusterNames argument to also retrieve the cluster names:
## [[1]]
## [1] "A0A173DQX0" "A0A1Y0BRE0" "Q8GKX3" "A0A1S5SJJ9" "D7GKY5"
## [6] "A0A0U3BEI9" "A0A2V4FMD8" "D7GKY3" "I3VI54" "A0A023SG55"
## [11] "A0A1B2F089" "A0A344X7M9" "B7VEQ9" "D6CJE1" "D7GKZ1"
## [16] "G1CSK5" "A0A1W6F5I4" "A0A844NVA2" "B0BL11" "D0EW81"
## [21] "D7GKY7" "Q1WLM9" "Q9RGC2" "A5LHV8" "Q0PRG2"
## [26] "U5NIQ3" "A0A0U1PYJ5" "C0JBE4" "C0LIL9" "H6V565"
## [31] "A0A2S5T091" "D3VX06" "D6CI36" "H2E8M2" "A4KZ69"
## [36] "A0A5Q2V4N5" "A0AAI9KXE0" "D2KHP5" "F1B1U0"