biodbUniprot is a biodb extension package that implements a connector to the UniProt database.
The UniProt Knowledge Base (Consortium 2016) can be searched using its search web service.
This vignette shows how to query this web service with the package.
Install using Bioconductor:
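A minimal sketch of the usual Bioconductor installation pattern, assuming the package is available through BiocManager under the name biodbUniprot:

```r
# Install BiocManager from CRAN first if it is not available yet.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install biodbUniprot and its dependencies from Bioconductor.
BiocManager::install("biodbUniprot")
```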
The first step in using biodbUniprot is to create an instance of the biodb class BiodbMain from the main biodb package. This is done by calling the constructor of the class:
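A minimal sketch of this step, assuming the `biodb::newInst()` helper is used as the constructor call and that the instance is stored in `mybiodb`, the variable name used by the later examples:

```r
# Create the BiodbMain instance; the variable name `mybiodb` matches the
# one used in the later examples of this vignette.
mybiodb <- biodb::newInst()
```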
During this step the configuration is set up, the cache system is initialized and extension packages are loaded.
We will see at the end of this vignette that the biodb instance needs to be terminated with a call to the terminate() method.
In biodb the connection to a database is handled by a connector instance that you can get from the factory. biodbUniprot implements a connector to a remote database. Here is the code to instantiate a connector:
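A minimal sketch of that step, assuming the BiodbMain instance is stored in `mybiodb` and that the connector is registered under the identifier 'uniprot' (the name that appears in the log messages and in the later calls of this vignette):

```r
# Create the BiodbMain instance if the session does not have one yet.
if (!exists("mybiodb")) mybiodb <- biodb::newInst()

# Ask the factory for a connector to the remote UniProt database.
conn <- mybiodb$getFactory()$createConn('uniprot')
```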
## Loading required package: biodbUniprot
To download entries, call the getEntry() method of the connector, which returns a list of BiodbEntry objects.
## INFO [04:22:47.062] Create cache folder "/github/home/.cache/R/biodb/uniprot-958d776f924f3e7a3bae586fa731b40c" for "uniprot-958d776f924f3e7a3bae586fa731b40c".
To print the information contained in the entry objects as a data frame, run the entriesToDataframe() method attached to the BiodbMain instance:
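The download and the conversion can be sketched together as follows. The accessions P01011 and P09237 are the ones that appear in the output of this section; the variable names `entries` and `x` are illustrative.

```r
# Set up the instance and the connector if the session does not have them yet.
if (!exists("mybiodb")) mybiodb <- biodb::newInst()
if (!exists("conn")) conn <- mybiodb$getFactory()$createConn('uniprot')

# Download the two entries; the result is a list of BiodbEntry objects.
entries <- conn$getEntry(c('P01011', 'P09237'))

# Convert the list of entries into a data frame, one row per entry.
x <- mybiodb$entriesToDataframe(entries)
x
```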
## accession gene.symbol kegg.genes.id molecular.mass
## 1 P01011 SERPINA3;AACT;GIG24;GIG25 hsa:12 47651
## 2 P09237 MMP7;MPSL1;PUMP1 hsa:4316 29677
## name
## 1 AACT_HUMAN;Alpha-1-antichymotrypsin;Cell growth-inhibiting gene 24/25 protein;Serpin A3;Alpha-1-antichymotrypsin His-Pro-less
## 2 MMP7_HUMAN;Matrilysin;Matrin;Matrix metalloproteinase-7;Pump-1 protease;Uterine metalloproteinase
## ncbi.gene.id
## 1 12
## 2 4316
## aa.seq
## 1 MERMLPLLALGLLAAGFCPAVLCHPNSPLDEENLTQENQDRGTHVDLGLASANVDFAFSLYKQLVLKAPDKNVIFSPLSISTALAFLSLGAHNTTLTEILKGLKFNLTETSEAEIHQSFQHLLRTLNQSSDELQLSMGNAMFVKEQLSLLDRFTEDAKRLYGSEAFATDFQDSAAAKKLINDYVKNGTRGKITDLIKDLDSQTMMVLVNYIFFKAKWEMPFDPQDTHQSRFYLSKKKWVMVPMMSLHHLTIPYFRDEELSCTVVELKYTGNASALFILPDQDKMEEVEAMLLPETLKRWRDSLEFREIGELYLPKFSISRDYNLNDILLQLGIEEAFTSKADLSGITGARNLAVSQVVHKAVLDVFEEGTEASAATAVKITLLSALVETRTIVRFNRPFLMIIVPTDTQNIFFMSKVTNPKQA
## 2 MRLTVLCAVCLLPGSLALPLPQEAGGMSELQWEQAQDYLKRFYLYDSETKNANSLEAKLKEMQKFFGLPITGMLNSRVIEIMQKPRCGVPDVAEYSLFPNSPKWTSKVVTYRIVSYTRDLPHITVDRLVSKALNMWGKEIPLHFRKVVWGTADIMIGFARGAHGDSYPFDGPGNTLAHAFAPGTGLGGDAHFDEDERWTDGSSLGINFLYAATHELGHSLGMGHSSDPNAVMYPTYGNGDPQNFKLSQDDIKGIQKLYGKRSNSRKK
## aa.seq.length uniprot.id ec expasy.enzyme.id
## 1 423 P01011 <NA> <NA>
## 2 267 P09237 3.4.24.23 3.4.24.23
The method wsSearch() (wsQuery() is now deprecated) implements the request to the search web service and the parsing of its output.
To get the raw results returned by the UniProt server, run the following code:
conn$wsSearch('reviewed:true AND organism_id:9606', fields=c('accession', 'id'),
size=2, retfmt='plain')
## [1] "Entry\tEntry Name\nA0A0C5B5G6\tMOTSC_HUMAN\nA0A1B0GTW7\tCIROP_HUMAN\n"
The first parameter is the query, as required by the web service. To learn how to write a query for UniProt, see a description of the query web service at http://www.uniprot.org/help/api_queries.
The fields parameter lists the fields you want returned for each entry found by the database.
The size parameter is the maximum number of entries the server may return.
The retfmt parameter controls the type of output desired. Here "plain" states that we want the raw output from the server.
To get the output parsed by biodb and get a data frame, run:
conn$wsSearch('reviewed:true AND organism_id:9606', fields=c('accession', 'id'),
size=2, retfmt='parsed')
## Entry Entry Name
## 1 A0A0C5B5G6 MOTSC_HUMAN
## 2 A0A1B0GTW7 CIROP_HUMAN
To get only the list of UniProt identifiers, run:
conn$wsSearch('reviewed:true AND organism_id:9606', fields=c('accession', 'id'),
size=2, retfmt='ids')
## [1] "A0A0C5B5G6" "A0A1B0GTW7"
And if you are curious to see the URL request that is sent to the server, run:
conn$wsSearch('reviewed:true AND organism_id:9606', fields=c('accession', 'id'),
size=2, retfmt='request')
## Biodb request object on https://rest.uniprot.org/uniprotkb/search?query=reviewed%3Atrue%20AND%20organism_id%3A9606&fields=accession%2Cid&format=tsv&size=2
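For intuition, the URL above can be rebuilt by hand from the same arguments with base R. This sketch is not biodb's implementation: the base URL and the parameter names query, fields, format and size are simply read off the printed request, and the helper `enc` is hypothetical.

```r
query  <- 'reviewed:true AND organism_id:9606'
fields <- c('accession', 'id')
size   <- 2

# Percent-encode reserved characters (':' -> %3A, ' ' -> %20, ',' -> %2C).
enc <- function(s) utils::URLencode(s, reserved = TRUE)

# Assemble the query string in the same order as the printed request.
url <- paste0('https://rest.uniprot.org/uniprotkb/search',
              '?query=', enc(query),
              '&fields=', enc(paste(fields, collapse = ',')),
              '&format=tsv',
              '&size=', size)
url
```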
The method geneSymbolToUniprotIds() uses wsSearch() to search for UniProt entries that reference particular gene symbols.
For instance, if you want to get the UniProt entries that have the gene symbol G-CSF, just run:
ids <- conn$geneSymbolToUniprotIds('G-CSF')
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot', fields=c('accession',
'gene.symbol'))
## accession gene.symbol
## 1 Q8MKE0 G-CSF
## 2 Q9GJU0 G-CSF
## 3 A0A679AQ73 g-csf
## 4 C0STS2 G-CSF 2
## 5 C0STS3 G-CSF 1
## 6 Q4H432 GCSF;G-CSF
To also match GCSF (without the hyphen), run:
ids <- conn$geneSymbolToUniprotIds('G-CSF', ignore.nonalphanum=TRUE)
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot', fields=c('accession',
'gene.symbol'))
## accession gene.symbol
## 1 P09919 CSF3;C17orf33;GCSF
## 2 P35833 CSF3;GCSF
## 3 B5L332 csf3b;csf3;Gcsf
## 4 A0A8M6Z8U5 csf3a;csf3;gcsf
## 5 B8ZHI7 csf3a;csf3;gcsf
## 6 Q8MKE0 G-CSF
## 7 Q9GJU0 G-CSF
## 8 Q4H432 GCSF;G-CSF
## 9 A0A679AQ73 g-csf
## 10 C0STS2 G-CSF 2
## 11 C0STS3 G-CSF 1
## 12 A0A2Z6I9R9 GCSF
If you want to match G-CSFa2 too, run:
ids <- conn$geneSymbolToUniprotIds('G-CSF', partial.match=TRUE)
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot', fields=c('accession',
'gene.symbol'))
## accession gene.symbol
## 1 Q8MKE0 G-CSF
## 2 Q9GJU0 G-CSF
## 3 A0A679AQ73 g-csf
## 4 C0STS2 G-CSF 2
## 5 C0STS3 G-CSF 1
## 6 Q4H432 GCSF;G-CSF
This method works by running wsSearch() to get a first set of entry identifiers, then downloading each entry and filtering the results. Downloading the entries may take quite long: wsSearch() can return thousands of entries, each entry is downloaded with a separate request, and the frequency limit is 3 requests per second. Entries already in the cache or in memory are not downloaded again, so running the same request a second time is faster, as is usually the case with biodb.
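As a rough order of magnitude, the limit of 3 requests per second gives a lower bound on the download time when nothing is cached. The helper below is hypothetical (it is not part of biodb) and ignores network latency and parsing time.

```r
# Lower bound, in seconds, on the time needed to download `n` uncached
# entries at `rate` requests per second (one request per entry).
min_download_seconds <- function(n, rate = 3) {
    n / rate
}

min_download_seconds(3000)       # 1000 seconds
min_download_seconds(3000) / 60  # about 16.7 minutes
```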
When you are done with your biodb instance, you have to terminate it to ensure that resources (file handles, database connections, etc.) are released:
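A minimal sketch of the termination call, assuming the instance is stored in `mybiodb` as in the earlier examples:

```r
# Create an instance if the session has none, so that the sketch runs on its own.
if (!exists("mybiodb")) mybiodb <- biodb::newInst()

# Release the resources held by the instance and delete its connectors.
mybiodb$terminate()
```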
## INFO [04:23:07.264] Closing BiodbMain instance...
## INFO [04:23:07.265] Connector "uniprot" deleted.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] biodbUniprot_1.13.0 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 sass_0.4.9 utf8_1.2.4
## [4] generics_0.1.3 bitops_1.0-9 stringi_1.8.4
## [7] RSQLite_2.3.7 hms_1.1.3 digest_0.6.37
## [10] magrittr_2.0.3 evaluate_1.0.1 fastmap_1.2.0
## [13] blob_1.2.4 plyr_1.8.9 jsonlite_1.8.9
## [16] progress_1.2.3 DBI_1.2.3 BiocManager_1.30.25
## [19] httr_1.4.7 fansi_1.0.6 XML_3.99-0.17
## [22] jquerylib_0.1.4 cli_3.6.3 rlang_1.1.4
## [25] chk_0.9.2 crayon_1.5.3 dbplyr_2.5.0
## [28] bit64_4.5.2 withr_3.0.2 cachem_1.1.0
## [31] yaml_2.3.10 tools_4.4.1 memoise_2.0.1
## [34] biodb_1.13.0 dplyr_1.1.4 filelock_1.0.3
## [37] curl_5.2.3 buildtools_1.0.0 vctrs_0.6.5
## [40] R6_2.5.1 BiocFileCache_2.13.2 lifecycle_1.0.4
## [43] stringr_1.5.1 bit_4.5.0 pkgconfig_2.0.3
## [46] pillar_1.9.0 bslib_0.8.0 glue_1.8.0
## [49] Rcpp_1.0.13 lgr_0.4.4 xfun_0.48
## [52] tibble_3.2.1 tidyselect_1.2.1 sys_3.4.3
## [55] knitr_1.48 htmltools_0.5.8.1 rmarkdown_2.28
## [58] maketools_1.3.1 compiler_4.4.1 prettyunits_1.2.0
## [61] askpass_1.2.1 RCurl_1.98-1.16 openssl_2.2.2