The UCSC.utils package

Introduction

UCSC.utils is an infrastructure package that provides a small set of low-level utilities for retrieving data from the UCSC Genome Browser. Most functions in the package access the data via the UCSC REST API, but some of them query the UCSC MySQL server directly.

Note that the primary purpose of the package is to support higher-level functionality implemented in downstream packages like GenomeInfoDb or txdbmaker.

Installation

Like any other Bioconductor package, UCSC.utils should always be installed with BiocManager::install():

if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("UCSC.utils")

However, note that UCSC.utils is typically installed automatically as a dependency of other Bioconductor packages, so explicit installation is usually not needed.

Functions defined in the package

list_UCSC_genomes()

suppressPackageStartupMessages(library(UCSC.utils))

list_UCSC_genomes("cat")
##             organism  genome common_name tax_id
## 1        Felis catus felCat3         Cat   9685
## 2        Felis catus felCat4         Cat   9685
## 3        Felis catus felCat5         Cat   9685
## 4        Felis catus felCat8         Cat   9685
## 5        Felis catus felCat9         Cat   9685
## 6 Tursiops truncatus turTru2     Dolphin   9739
##                                 description
## 1                 Mar. 2006 (Broad/felCat3)
## 2        Dec. 2008 (NHGRI/GTB V17e/felCat4)
## 3 Sep. 2011 (ICGSC Felis_catus 6.2/felCat5)
## 4 Nov. 2014 (ICGSC Felis_catus_8.0/felCat8)
## 5       Nov. 2017 (Felis_catus_9.0/felCat9)
## 6       Oct. 2011 (Baylor Ttru_1.4/turTru2)

See ?list_UCSC_genomes for more information and additional examples.
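
The dolphin row in the output above suggests that the search string is matched as a substring ("Tursiops truncatus" contains "cat"); this is an inference from the output, not a statement about the implementation. A minimal sketch of narrowing the result to one species with ordinary data frame subsetting follows. The data frame is a hand-copied, column-trimmed subset of the output above so that the snippet runs offline; in practice it would come from list_UCSC_genomes("cat").

```r
# Hand-copied subset of the output shown above (columns trimmed),
# standing in for: genomes <- list_UCSC_genomes("cat")
genomes <- data.frame(
    organism = c(rep("Felis catus", 5), "Tursiops truncatus"),
    genome = c("felCat3", "felCat4", "felCat5", "felCat8", "felCat9",
               "turTru2"),
    tax_id = c(rep(9685L, 5), 9739L),
    stringsAsFactors = FALSE
)

# Keep only the rows whose organism is exactly the domestic cat.
cat_genomes <- genomes[genomes$organism == "Felis catus", ]

# Genome names of the remaining assemblies.
cat_genomes$genome
```

Filtering on the tax_id column (9685 for Felis catus in the output above) would work equally well and avoids depending on the spelling of the organism name.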

get_UCSC_chrom_sizes()

felCat9_chrom_sizes <- get_UCSC_chrom_sizes("felCat9")
head(felCat9_chrom_sizes)
##                chrA1 chrUn_NW_019369707v1 chrUn_NW_019369340v1 
##            242100913                 1969                 2790 
## chrUn_NW_019367154v1 chrUn_NW_019366602v1 chrUn_NW_019367170v1 
##                 2807                 2835                 2909

See ?get_UCSC_chrom_sizes for more information and additional examples.
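
The result printed above looks like a named vector of sequence lengths, so standard vector operations apply. The sketch below, which assumes that structure, separates assembled chromosomes from unplaced scaffolds by name and orders sequences by size. The values are hand-copied from the head() output above so that the snippet runs offline.

```r
# Hand-copied from the head() output above, standing in for:
#   chrom_sizes <- get_UCSC_chrom_sizes("felCat9")
chrom_sizes <- c(
    chrA1 = 242100913L,
    chrUn_NW_019369707v1 = 1969L,
    chrUn_NW_019369340v1 = 2790L,
    chrUn_NW_019367154v1 = 2807L
)

# Drop unplaced scaffolds, whose names start with "chrUn_".
assembled <- chrom_sizes[!startsWith(names(chrom_sizes), "chrUn_")]

# Order all sequences from largest to smallest.
by_size <- sort(chrom_sizes, decreasing = TRUE)
names(by_size)
```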

list_UCSC_tracks()

list_UCSC_tracks("felCat9", group="varRep")
##               track     primary_table       type  group composite_track
## 1 EVA SNP Release 6           evaSnp6 bigBed 9 + varRep         EVA SNP
## 2 EVA SNP Release 5           evaSnp5 bigBed 9 + varRep         EVA SNP
## 3 EVA SNP Release 4           evaSnp4 bigBed 9 + varRep         EVA SNP
## 4 EVA SNP Release 3            evaSnp bigBed 9 + varRep         EVA SNP
## 5    Microsatellite          microsat      bed 4 varRep            <NA>
## 6  Interrupted Rpts     nestedRepeats   bed 12 + varRep            <NA>
## 7      RepeatMasker              rmsk       rmsk varRep            <NA>
## 8    Simple Repeats      simpleRepeat    bed 4 + varRep            <NA>
## 9        WM + SDust windowmaskerSdust      bed 3 varRep            <NA>

See ?list_UCSC_tracks for more information and additional examples.
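
Because the result is a data frame, tracks can be selected by any of the columns shown above. The following sketch uses a hand-copied subset of that output (three rows, four columns) so that it runs offline; in practice the data frame would come from list_UCSC_tracks("felCat9", group="varRep").

```r
# Hand-copied subset of the output shown above, standing in for:
#   tracks <- list_UCSC_tracks("felCat9", group="varRep")
tracks <- data.frame(
    track = c("EVA SNP Release 6", "Microsatellite", "RepeatMasker"),
    primary_table = c("evaSnp6", "microsat", "rmsk"),
    type = c("bigBed 9 +", "bed 4", "rmsk"),
    composite_track = c("EVA SNP", NA, NA),
    stringsAsFactors = FALSE
)

# Tracks that are not part of a composite track.
standalone <- tracks[is.na(tracks$composite_track), ]
standalone$primary_table

# Primary tables of tracks whose type starts with "bigBed".
bigbed_tables <- tracks$primary_table[startsWith(tracks$type, "bigBed")]
bigbed_tables
```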

fetch_UCSC_track_data()

mm9_cytoBandIdeo <- fetch_UCSC_track_data("mm9", "cytoBandIdeo")
head(mm9_cytoBandIdeo)
##   chrom chromStart chromEnd name gieStain
## 1  chr1          0  8918386  qA1  gpos100
## 2  chr1    8918386 12386647  qA2     gneg
## 3  chr1   12386647 20314102  qA3   gpos33
## 4  chr1   20314102 22295965  qA4     gneg
## 5  chr1   22295965 31214352  qA5  gpos100
## 6  chr1   31214352 43601000   qB     gneg

See ?fetch_UCSC_track_data for more information and additional examples.
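
The chromStart and chromEnd columns above follow the usual UCSC convention of 0-based, half-open intervals (the first band starts at 0, and each band begins where the previous one ends). Assuming the returned data frame preserves that convention unchanged, band widths and 1-based start positions can be derived as sketched below. The rows are hand-copied from the output above so that the snippet runs offline.

```r
# First three rows of the output shown above, standing in for:
#   cyto <- fetch_UCSC_track_data("mm9", "cytoBandIdeo")
cyto <- data.frame(
    chrom = rep("chr1", 3),
    chromStart = c(0L, 8918386L, 12386647L),
    chromEnd = c(8918386L, 12386647L, 20314102L),
    name = c("qA1", "qA2", "qA3"),
    gieStain = c("gpos100", "gneg", "gpos33"),
    stringsAsFactors = FALSE
)

# With 0-based, half-open intervals the width is simply end minus start.
cyto$width <- cyto$chromEnd - cyto$chromStart

# 1-based, closed coordinates (as used by many R/Bioconductor containers)
# shift the start by one and keep the end as is.
cyto$start1 <- cyto$chromStart + 1L

cyto[, c("name", "start1", "chromEnd", "width")]
```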

UCSC_dbselect()

Retrieve a full SQL table:

felCat9_refGene <- UCSC_dbselect("felCat9", "refGene")
head(felCat9_refGene)
##    bin         name chrom strand   txStart     txEnd  cdsStart    cdsEnd
## 1  863 NM_001009849 chrE2      +  36523718  36526085  36523718  36526085
## 2 1227 NM_001009828 chrD1      -  84257818  84258562  84257818  84258562
## 3  762 NM_001009827 chrE1      -  23276059  23282999  23276460  23282943
## 4 1577 NM_001009826 chrC1      - 130088795 130092596 130089335 130092527
## 5  759 NM_001309049 chrA1      +  22918805  22920344  22919057  22920092
## 6  875 NM_001042567 chrB1      -  38074338  38098460  38074369  38098430
##   exonCount   exonStarts     exonEnds score name2 cdsStartStat cdsEndStat
## 1         3 36523718.... 36523788....     0 CCL17         cmpl       cmpl
## 2         1     84257818     84258562     0  BDNF         cmpl       cmpl
## 3         4 23276059.... 23276485....     0  CCL5         cmpl       cmpl
## 4         2 13008879.... 13009038....     0 CXCR4         cmpl       cmpl
## 5         1     22918805     22920344     0 LPAR6         cmpl       cmpl
## 6        10 38074338.... 38074370....     0   LPL         cmpl       cmpl
##     exonFrames
## 1      0, 1, 2
## 2            0
## 3   2, 2, 1, 0
## 4         0, 0
## 5            0
## 6 2, 2, 2,....

Or retrieve a subset of it:

columns <- c("chrom", "strand", "txStart", "txEnd", "exonCount", "name2")
UCSC_dbselect("felCat9", "refGene", columns=columns, where="chrom='chrA1'")
##    chrom strand   txStart     txEnd exonCount      name2
## 1  chrA1      + 141539866 141571963        14       HEXB
## 2  chrA1      - 138510635 138551376        10        SMN
## 3  chrA1      - 145138360 145306338         8       ARSB
## 4  chrA1      + 175374295 175382370        14        F12
## 5  chrA1      + 191099915 191109377         6      IL12B
## 6  chrA1      + 199096975 199126348        22      CSF1R
## 7  chrA1      +  11563271  11622511        32      BRCA2
## 8  chrA1      +  22918805  22920344         1      LPAR6
## 9  chrA1      -  24436704  24481859         9      LRCH1
## 10 chrA1      -  77088366  77130144         5      EFNB2
## 11 chrA1      +  82349957  82367091         9      LAMP1
## 12 chrA1      +  91725271  91726174         1 FELCATV1R2
## 13 chrA1      + 111190386 111192290         4       CSF2
## 14 chrA1      + 111190386 111192290         4       CSF2
## 15 chrA1      - 111639220 111641240         4        IL5
## 16 chrA1      + 111778711 111787544         4        IL4
## 17 chrA1      - 111946494 111949509         2       GDF9
## 18 chrA1      - 121247112 121356809         8      NR3C1
## 19 chrA1      - 161161087 161213837         6       LIX1
## 20 chrA1      + 193037229 193057081         8     HAVCR1
## 21 chrA1      - 200182785 200184042         1      ADRB2
## 22 chrA1      + 214528332 214560411         7    SLC45A2

Note that UCSC_dbselect() is an alternative to fetch_UCSC_track_data() that is more efficient and gives the user more control over exactly what data to retrieve from the server. Its downside is that it does not work with all tracks.

See ?UCSC_dbselect for more information and additional examples.
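
Since UCSC_dbselect() does not work with all tracks, one possible pattern is to try it first and fall back to fetch_UCSC_track_data() when it fails. The sketch below assumes that a failed query signals an ordinary R error, which has not been verified against the package. The wrapper fetch_with_fallback() is hypothetical; it takes the two fetchers as arguments so that the logic can be exercised offline with stand-in functions.

```r
# Hypothetical wrapper (not part of UCSC.utils). The defaults refer to the
# two package functions but are only evaluated if actually needed.
fetch_with_fallback <- function(genome, table,
                                primary = UCSC.utils::UCSC_dbselect,
                                fallback = UCSC.utils::fetch_UCSC_track_data) {
    tryCatch(
        primary(genome, table),
        error = function(e) {
            # Assumption: an unavailable table surfaces as an R error.
            message("Direct query failed, falling back: ", conditionMessage(e))
            fallback(genome, table)
        }
    )
}

# Offline demonstration with stand-in fetchers.
failing <- function(genome, table) stop("table not available")
working <- function(genome, table) paste(genome, table, sep = ".")

result <- fetch_with_fallback("felCat9", "refGene",
                              primary = failing, fallback = working)
result
```

With the default arguments, fetch_with_fallback("felCat9", "refGene") would query the UCSC servers, so it needs network access and the UCSC.utils package to be installed.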

Session information

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] UCSC.utils_1.3.0 BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.5         httr_1.4.7          cli_3.6.3          
##  [4] knitr_1.48          rlang_1.1.4         xfun_0.48          
##  [7] DBI_1.2.3           generics_0.1.3      jsonlite_1.8.9     
## [10] bit_4.5.0           S4Vectors_0.44.0    buildtools_1.0.0   
## [13] htmltools_0.5.8.1   maketools_1.3.1     sys_3.4.3          
## [16] sass_0.4.9          stats4_4.4.1        hms_1.1.3          
## [19] rmarkdown_2.28      evaluate_1.0.1      jquerylib_0.1.4    
## [22] fastmap_1.2.0       yaml_2.3.10         lifecycle_1.0.4    
## [25] RMariaDB_1.3.2      BiocManager_1.30.25 compiler_4.4.1     
## [28] blob_1.2.4          timechange_0.3.0    pkgconfig_2.0.3    
## [31] digest_0.6.37       R6_2.5.1            curl_5.2.3         
## [34] bslib_0.8.0         bit64_4.5.2         tools_4.4.1        
## [37] lubridate_1.9.3     BiocGenerics_0.53.1 cachem_1.1.0