Title: | Low-level utilities to retrieve data from the UCSC Genome Browser |
---|---|
Description: | A set of low-level utilities to retrieve data from the UCSC Genome Browser. Most functions in the package access the data via the UCSC REST API but some of them query the UCSC MySQL server directly. Note that the primary purpose of the package is to support higher-level functionalities implemented in downstream packages like GenomeInfoDb or txdbmaker. |
Authors: | Hervé Pagès [aut, cre] |
Maintainer: | Hervé Pagès <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.3.0 |
Built: | 2024-10-31 06:23:44 UTC |
Source: | https://github.com/bioc/UCSC.utils |
Fetch the track data for a given UCSC genome/track.
fetch_UCSC_track_data(genome, primary_table, api.url=UCSC.api.url())
fetch_UCSC_track_data(genome, primary_table, api.url=UCSC.api.url())
genome |
A single string specifying the name of a UCSC genome e.g. |
primary_table |
A single string specifying the name of the primary table associated
with the track from which to fetch the data.
See |
api.url |
The URL of the UCSC API. By default, the URL returned by
|
A data frame.
list_UCSC_genomes
to get the list of UCSC genomes.
list_UCSC_tracks
to get the list of tracks and
associated primary tables for a given UCSC genome.
UCSC_dbselect
for a more efficient and more flexible
way to retrieve data directly from the UCSC MariaDB server.
gorGor6_gap_data <- fetch_UCSC_track_data("gorGor6", "gap") head(gorGor6_gap_data) ## --- Comparison with UCSC_dbselect() --- gorGor6_gap_data2 <- UCSC_dbselect("gorGor6", "gap") ## Easy sanity checks. stopifnot( identical(dim(gorGor6_gap_data), dim(gorGor6_gap_data2)), identical(colnames(gorGor6_gap_data), colnames(gorGor6_gap_data2)) ) ## But the two data frames are not identical: identical(gorGor6_gap_data, gorGor6_gap_data2) # FALSE! ## However, they have the same content. Ony reason they're not ## identical is because their rows are not in the same order. ## Let's reorder the rows in the two data frames by genomic location, ## that is, first by chromosome, then by chromStart, and finally ## by chromEnd: sort_rows <- function(gap_data) { oo <- order(gap_data$chrom, gap_data$chromStart, gap_data$chromEnd) gap_data <- gap_data[oo, ] rownames(gap_data) <- NULL gap_data } gorGor6_gap_data <- sort_rows(gorGor6_gap_data) gorGor6_gap_data2 <- sort_rows(gorGor6_gap_data2) ## Now the two data frames are identical: stopifnot(identical(gorGor6_gap_data, gorGor6_gap_data2))
gorGor6_gap_data <- fetch_UCSC_track_data("gorGor6", "gap") head(gorGor6_gap_data) ## --- Comparison with UCSC_dbselect() --- gorGor6_gap_data2 <- UCSC_dbselect("gorGor6", "gap") ## Easy sanity checks. stopifnot( identical(dim(gorGor6_gap_data), dim(gorGor6_gap_data2)), identical(colnames(gorGor6_gap_data), colnames(gorGor6_gap_data2)) ) ## But the two data frames are not identical: identical(gorGor6_gap_data, gorGor6_gap_data2) # FALSE! ## However, they have the same content. Ony reason they're not ## identical is because their rows are not in the same order. ## Let's reorder the rows in the two data frames by genomic location, ## that is, first by chromosome, then by chromStart, and finally ## by chromEnd: sort_rows <- function(gap_data) { oo <- order(gap_data$chrom, gap_data$chromStart, gap_data$chromEnd) gap_data <- gap_data[oo, ] rownames(gap_data) <- NULL gap_data } gorGor6_gap_data <- sort_rows(gorGor6_gap_data) gorGor6_gap_data2 <- sort_rows(gorGor6_gap_data2) ## Now the two data frames are identical: stopifnot(identical(gorGor6_gap_data, gorGor6_gap_data2))
Get the chromosome sizes of a given UCSC genome.
get_UCSC_chrom_sizes(genome, api.url=UCSC.api.url(), recache=FALSE)
get_UCSC_chrom_sizes(genome, api.url=UCSC.api.url(), recache=FALSE)
genome |
A single string specifying the name of a UCSC genome e.g. |
api.url |
The URL of the UCSC API. By default, the URL returned by
|
recache |
|
A named numeric vector. The names on the vector are the UCSC chromosomes/sequences. The vector values are the corresponding lengths.
Note that the vector is not sorted in any particular order. In particular there's not guarantee that the chromosomes will precede the scaffolds.
list_UCSC_genomes
to get the list of UCSC genomes.
The Seqinfo
constructor function
in the GenomeInfoDb package for an alternate (higher level)
way of retrieving the chromosome information of a given NCBI
assembly or UCSC genome.
list_UCSC_tracks
to get the list of tracks and
associated primary tables for a given UCSC genome.
UCSC.api.url
for how to use an alternative
UCSC API URL by default.
get_UCSC_chrom_sizes("ce2") get_UCSC_chrom_sizes("hg38")
get_UCSC_chrom_sizes("ce2") get_UCSC_chrom_sizes("hg38")
Get the list of UCSC genomes.
list_UCSC_genomes(organism=NA, api.url=UCSC.api.url(), recache=FALSE) get_organism_for_UCSC_genome(genome, api.url=UCSC.api.url(), recache=FALSE)
list_UCSC_genomes(organism=NA, api.url=UCSC.api.url(), recache=FALSE) get_organism_for_UCSC_genome(genome, api.url=UCSC.api.url(), recache=FALSE)
organism |
By default all UCSC genomes are returned (in a data frame).
When |
api.url |
The URL of the UCSC API. By default, the URL returned by
|
recache |
In the case of |
genome |
A character vector of valid UCSC genomes e.g. |
For list_UCSC_genomes
: A data frame with 1 row per genome
and 5 columns: organism
, genome
, common_name
,
tax_id
, description
.
For get_organism_for_UCSC_genome
: A named character of the same
length as the input containing the scientific names of the organisms
associated with the supplied UCSC genomes.
https://genome.ucsc.edu/FAQ/FAQreleases.html for the online HTML page that lists all UCSC genome releases, including archived ones.
registered_UCSC_genomes
in the
GenomeInfoDb package for a similar function that returns
only UCSC genomes registered in the GenomeInfoDb package.
get_UCSC_chrom_sizes
to get the chromosome sizes
of a given UCSC genome.
list_UCSC_tracks
to get the list of tracks and
associated primary tables for a given UCSC genome.
UCSC.api.url
for how to use an alternative
UCSC API URL by default.
list_UCSC_genomes("human") list_UCSC_genomes("pacos") get_organism_for_UCSC_genome(c("ce11", "xenTro10", "mpxvRivers"))
list_UCSC_genomes("human") list_UCSC_genomes("pacos") get_organism_for_UCSC_genome(c("ce11", "xenTro10", "mpxvRivers"))
Get the list of tracks and associated primary tables for a given UCSC genome.
list_UCSC_tracks(genome, group=NULL, api.url=UCSC.api.url(), recache=FALSE)
list_UCSC_tracks(genome, group=NULL, api.url=UCSC.api.url(), recache=FALSE)
genome |
A single string specifying the name of a UCSC genome e.g. |
group |
|
api.url |
The URL of the UCSC API. By default, the URL returned by
|
recache |
|
A data frame with 1 row per track and 5 columns:
track
, primary_table
, type
, group
,
composite_track
.
Note that columns group
and composite_track
can
contain NA
s.
list_UCSC_genomes
to get the list of UCSC genomes.
fetch_UCSC_track_data
to fetch the track data for
a given UCSC genome/track.
get_UCSC_chrom_sizes
to get the chromosome sizes
of a given UCSC genome.
UCSC.api.url
for how to use an alternative
UCSC API URL by default.
## List all tracks for ce2 genome: list_UCSC_tracks("ce2") ## List tracks in the "rna" group only: list_UCSC_tracks("ce2", group="rna") ## Note that some tracks don't belong to any group: list_UCSC_tracks("hg38", group=NA)
## List all tracks for ce2 genome: list_UCSC_tracks("ce2") ## List tracks in the "rna" group only: list_UCSC_tracks("ce2", group="rna") ## Note that some tracks don't belong to any group: list_UCSC_tracks("hg38", group=NA)
A convenience utility to retrieve data from the UCSC MariaDB server
via simple SQL SELECT
statements.
Requires the RMariaDB package!
UCSC_dbselect(dbname, from, columns=NULL, where=NULL, MoreSQL=NULL, host="genome-mysql.soe.ucsc.edu", port=3306)
UCSC_dbselect(dbname, from, columns=NULL, where=NULL, MoreSQL=NULL, host="genome-mysql.soe.ucsc.edu", port=3306)
dbname |
A single string specifying the name of the database to connect to.
This is usually the name of a valid UCSC genome e.g. |
from |
A single string specifying the input data, that is, the
|
columns |
|
where |
|
MoreSQL |
|
host |
A single string specifying the name of the server's host |
port |
The TCP/IP port to use to connect to the server. |
UCSC_dbselect
is an alternative to fetch_UCSC_track_data
that is more efficient and gives the user more control on what data to
retrieve exactly from the server.
However, the downside is that UCSC_dbselect
does not work with all
tracks! This is because, not all track data are stored in a database:
some tracks are actually file-based (e.g. bigBed tracks). See below for
an example.
A data frame.
list_UCSC_genomes
to get the list of UCSC genomes.
list_UCSC_tracks
to get the list of tracks and
associated primary tables for a given UCSC genome.
fetch_UCSC_track_data
to fetch the track data for
a given UCSC genome/track.
https://genome.ucsc.edu/goldenpath/help/mysql.html for more information about downloading data from UCSC MariaDB databases.
### Retrieve full "ncbiGene" table ("NCBI Genes" track) for eboVir3: UCSC_dbselect("eboVir3", "ncbiGene") ### Retrieve subset of "sgpGene" table ("SGP Genes" track) for hg38: UCSC_dbselect("hg38", "sgpGene", where="chrom='chrM'") ### Retrieve subset of "gap" table for gorGor6: columns <- c("chrom", "chromStart", "chromEnd", "type") where <- "chrom='chrX' AND type='contig'" UCSC_dbselect("gorGor6", "gap", columns=columns, where=where) ### With a LEFT JOIN and a LIMIT clause: from <- paste("ncbiRefSeq LEFT JOIN ncbiRefSeqLink", "ON ncbiRefSeq.name=ncbiRefSeqLink.id") res <- UCSC_dbselect("sacCer3", from, MoreSQL="LIMIT 5") dim(res) ## WARNING: UCSC_dbselect() does not work with all tracks! ## For example it does not work with bigBed tracks: subset(list_UCSC_tracks("hg38", group="genes"), grepl("bigBed 9", type)) ## Not run: UCSC_dbselect("hg38", "crisprAllTargets") # table doesn't exist! ## End(Not run)
### Retrieve full "ncbiGene" table ("NCBI Genes" track) for eboVir3: UCSC_dbselect("eboVir3", "ncbiGene") ### Retrieve subset of "sgpGene" table ("SGP Genes" track) for hg38: UCSC_dbselect("hg38", "sgpGene", where="chrom='chrM'") ### Retrieve subset of "gap" table for gorGor6: columns <- c("chrom", "chromStart", "chromEnd", "type") where <- "chrom='chrX' AND type='contig'" UCSC_dbselect("gorGor6", "gap", columns=columns, where=where) ### With a LEFT JOIN and a LIMIT clause: from <- paste("ncbiRefSeq LEFT JOIN ncbiRefSeqLink", "ON ncbiRefSeq.name=ncbiRefSeqLink.id") res <- UCSC_dbselect("sacCer3", from, MoreSQL="LIMIT 5") dim(res) ## WARNING: UCSC_dbselect() does not work with all tracks! ## For example it does not work with bigBed tracks: subset(list_UCSC_tracks("hg38", group="genes"), grepl("bigBed 9", type)) ## Not run: UCSC_dbselect("hg38", "crisprAllTargets") # table doesn't exist! ## End(Not run)
Convenience helper for getting or setting global option UCSC.api.url
.
UCSC.api.url(api.url=NULL)
UCSC.api.url(api.url=NULL)
api.url |
A single string containing the URL to the alternative UCSC API to use by default. For convenience,
|
Various functions in the UCSC.utils package query the UCSC REST API.
This is the case for example for list_UCSC_genomes
,
get_UCSC_chrom_sizes
, list_UCSC_tracks
and more.
Global option UCSC.api.url
controls the UCSC API URL
that these functions use y default. The option is set to
"https://api.genome.ucsc.edu"
(primary URL, US West Coast)
at package startup.
UCSC.api.url()
and UCSC.api.url(some_url)
are provided
as convenient ways of doing getOption("UCSC.api.url")
and options(UCSC.api.url=some_url)
, respectively.
When called with no argument, UCSC.api.url()
returns
getOption("UCSC.api.url")
.
When passed an URL, UCSC.api.url(some_url)
returns the
previous URL, that is, the UCSC API URL that was previously
used by default. Note that the previous URL is returned invisibly.
https://genome.ucsc.edu/goldenPath/help/api.html#Mirrors a list of alternative UCSC API URLs.
list_UCSC_genomes
to get the list of UCSC genomes.
UCSC.api.url() # current default value of the UCSC API URL get_UCSC_chrom_sizes("ce11", recache=TRUE) ## Temporarily use the mirror in Europe: previous_url <- UCSC.api.url("europe") UCSC.api.url() # new default value of the UCSC API URL get_UCSC_chrom_sizes("ce11", recache=TRUE) ## Restore previous default value: UCSC.api.url(previous_url) UCSC.api.url()
UCSC.api.url() # current default value of the UCSC API URL get_UCSC_chrom_sizes("ce11", recache=TRUE) ## Temporarily use the mirror in Europe: previous_url <- UCSC.api.url("europe") UCSC.api.url() # new default value of the UCSC API URL get_UCSC_chrom_sizes("ce11", recache=TRUE) ## Restore previous default value: UCSC.api.url(previous_url) UCSC.api.url()