Package: AnnotationHub
Authors: Bioconductor Package Maintainer [cre], Martin Morgan [aut], Marc Carlson [ctb], Dan Tenenbaum [ctb], Sonali Arora [ctb], Valerie Obenchain [ctb], Kayla Morrell [ctb], Lori Shepherd [aut]
Modified: Mon March 18 2024
Compiled: Tue Jun 30 22:38:10 2026

Accessing Genome-Scale Data

Non-model organism gene annotations

Bioconductor offers pre-built org.* annotation packages for model organisms; their use is described in the OrgDb section of the Annotation workflow. Here we discover the OrgDb objects available for non-model organisms:

library(AnnotationHub)
ah <- AnnotationHub()

query(ah, "OrgDb")

## AnnotationHub with 1976 records
## # snapshotDate(): 2026-06-30
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, NCBI
## # $species: Escherichia coli, Coffea arabica, greater Indian_fruit_bat, Zophobas morio, Zophobas...
## # $rdataclass: OrgDb, TxDb
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH111588"]]' 
## 
##              title                                          
##   AH111588 | OrgDb Sqlite file for Coffea arabica           
##   AH119540 | org.Pseudophryne_corroboree.eg.sqlite          
##   AH119541 | org.Triticum_aestivum.eg.sqlite                
##   AH119542 | org.Triticum_aestivum_subsp._aestivum.eg.sqlite
##   AH119543 | org.Triticum_sativum.eg.sqlite                 
##   ...        ...                                            
##   AH121957 | org.Mmu.eg.db.sqlite                           
##   AH121958 | org.Ce.eg.db.sqlite                            
##   AH121959 | org.Xl.eg.db.sqlite                            
##   AH121960 | org.Sc.sgd.db.sqlite                           
##   AH121961 | org.Dr.eg.db.sqlite

orgdb <- query(ah, c("OrgDb", "[email protected]"))[[1]]

## downloading 1 resources

## retrieving 1 resource

## loading from cache

The object returned by AnnotationHub is directly usable with the select() interface. For example, we can discover the keytypes available for querying the object, list the columns that these keytypes can map to, and finally select the SYMBOL and GENENAME corresponding to the first 6 ENTREZIDs:

library(AnnotationDbi)
AnnotationDbi::keytypes(orgdb)

##  [1] "ACCNUM"      "ALIAS"       "ENTREZID"    "EVIDENCE"    "EVIDENCEALL" "GENENAME"   
##  [7] "GID"         "GO"          "GOALL"       "ONTOLOGY"    "ONTOLOGYALL" "PMID"       
## [13] "REFSEQ"      "SYMBOL"

AnnotationDbi::columns(orgdb)

##  [1] "ACCNUM"      "ALIAS"       "CHR"         "ENTREZID"    "EVIDENCE"    "EVIDENCEALL"
##  [7] "GENENAME"    "GID"         "GO"          "GOALL"       "ONTOLOGY"    "ONTOLOGYALL"
## [13] "PMID"        "REFSEQ"      "SYMBOL"

egid <- head(keys(orgdb, "ENTREZID"))
AnnotationDbi::select(orgdb, egid, c("SYMBOL", "GENENAME"), "ENTREZID")

## 'select()' returned 1:1 mapping between keys and columns

##    ENTREZID       SYMBOL                       GENENAME
## 1 134884233         NPR2 natriuretic peptide receptor 2
## 2 134884234 LOC134884234               5S ribosomal RNA
## 3 134884235 LOC134884235               5S ribosomal RNA
## 4 134884236 LOC134884236               5S ribosomal RNA
## 5 134884237 LOC134884237               5S ribosomal RNA
## 6 134884238 LOC134884238               5S ribosomal RNA
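
As a minimal sketch of working with this result, the data.frame returned by select() can be turned into a named lookup vector. The rows below are copied by hand from the output above rather than fetched from the hub, and the names res and symbol_by_id are illustrative only; they are not part of AnnotationDbi.

```r
# Stand-in for the data.frame returned by select(); rows copied from the output above
res <- data.frame(
    ENTREZID = c("134884233", "134884234", "134884235"),
    SYMBOL   = c("NPR2", "LOC134884234", "LOC134884235"),
    GENENAME = c("natriuretic peptide receptor 2", "5S ribosomal RNA", "5S ribosomal RNA"),
    stringsAsFactors = FALSE
)

# Named character vector mapping ENTREZID -> SYMBOL
symbol_by_id <- setNames(res$SYMBOL, res$ENTREZID)
symbol_by_id[["134884233"]]

## [1] "NPR2"
```

This shortcut relies on the 1:1 mapping reported by select(). With a 1:many mapping the names would be duplicated and a lookup by name would return only the first match.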

Roadmap Epigenomics Project

All Roadmap Epigenomics files are hosted here. To download these files manually, one would browse the web interface to find the files of interest, then use R code like the following.

library(rtracklayer)  # provides import()

url <- "http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E001-H3K4me1.broadPeak.gz"
filename <- basename(url)
# Download into the working directory, then import the file if it arrived
download.file(url, destfile=filename)
if (file.exists(filename))
    data <- import(filename, format="bed")

This would have to be repeated for all files, and the onus would lie on the user to identify, download, import, and manage the local disk location of these files.
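
To illustrate the bookkeeping involved, repeating the manual approach over many files might look like the sketch below. fetch_all() and its downloader argument are hypothetical names introduced here for illustration. The function only handles downloading and a crude "skip if already on disk" cache; importing each file would still be a separate step.

```r
# Hypothetical helper: download each URL once into dest_dir and
# return the local paths, named by file name.
fetch_all <- function(urls, dest_dir = tempdir(),
                      downloader = utils::download.file) {
    dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
    paths <- file.path(dest_dir, basename(urls))
    for (i in seq_along(urls)) {
        # Skip files that already exist locally (a hand-rolled cache)
        if (!file.exists(paths[i]))
            downloader(urls[i], destfile = paths[i], mode = "wb")
    }
    stats::setNames(paths, basename(urls))
}
```

A call such as fetch_all(c(url1, url2), "roadmap_files"), where url1 and url2 are placeholder URLs, would leave the files in a local directory that the user must then track and keep up to date by hand.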

AnnotationHub reduces this task to just a few lines of R code:

library(AnnotationHub)
ah <- AnnotationHub()
epiFiles <- query(ah, "EpigenomeRoadMap")

Printing epiFiles shows that 18250 Roadmap resources are available via AnnotationHub. Additional information about the files is also available, e.g., where the files came from (dataprovider), genome, species, sourceurl, and sourcetype.

epiFiles

## AnnotationHub with 18250 records
## # snapshotDate(): 2026-06-30
## # $dataprovider: BroadInstitute, NA
## # $species: Homo sapiens
## # $rdataclass: BigWigFile, GRanges, data.frame
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH28856"]]' 
## 
##              title                                 
##   AH28856  | E001-H3K4me1.broadPeak.gz             
##   AH28857  | E001-H3K4me3.broadPeak.gz             
##   AH28858  | E001-H3K9ac.broadPeak.gz              
##   AH28859  | E001-H3K9me3.broadPeak.gz             
##   AH28860  | E001-H3K27me3.broadPeak.gz            
##   ...        ...                                   
##   AH49542  | E061_mCRF_FractionalMethylation.bigwig
##   AH49543  | E081_mCRF_FractionalMethylation.bigwig
##   AH49544  | E082_mCRF_FractionalMethylation.bigwig
##   AH116724 | TENET_consensus_enhancer_regions      
##   AH116726 | TENET_consensus_promoter_regions

A good sanity check that the smaller hub object contains only files from the Roadmap Epigenomics project is to confirm that every file comes from Homo sapiens and from the hg19 or hg38 genome build:

unique(epiFiles$species)

## [1] "Homo sapiens"

unique(epiFiles$genome)

## [1] "hg19" "hg38"
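
The same check can be made to fail loudly in a script instead of being inspected by eye. The following is a minimal sketch in base R; assert_roadmap_like() is a hypothetical helper, not an AnnotationHub function, and the example call uses stand-in vectors so that it runs without a hub connection.

```r
# Hypothetical helper: stop with an informative error if any record
# falls outside the expected species or genome builds.
assert_roadmap_like <- function(species, genome,
                                ok_species = "Homo sapiens",
                                ok_genome = c("hg19", "hg38")) {
    bad_species <- setdiff(unique(species), ok_species)
    bad_genome <- setdiff(unique(genome), ok_genome)
    if (length(bad_species) || length(bad_genome))
        stop("unexpected values: ",
             paste(c(bad_species, bad_genome), collapse = ", "))
    invisible(TRUE)
}

# Stand-in vectors; with a live hub one would pass
# epiFiles$species and epiFiles$genome instead.
assert_roadmap_like(c("Homo sapiens", "Homo sapiens"), c("hg19", "hg38"))
```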

Broadly, one can get an idea of the different files from this project by looking at the sourcetype:

table(epiFiles$sourcetype)

## 
##      BED   BigWig      GTF Multiple      tab      Zip 
##     8298     9932        3        2        1       14
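
To make the relative sizes explicit, the counts can be converted to percentages. This sketch re-enters the counts printed above by hand so that it runs without a hub connection; with a live hub one could presumably start from table(epiFiles$sourcetype) instead. The names counts and share are illustrative.

```r
# Counts copied from the table(epiFiles$sourcetype) output above
counts <- c(BED = 8298, BigWig = 9932, GTF = 3, Multiple = 2, tab = 1, Zip = 14)

# Total number of records covered by the table
sum(counts)

# Percentage of records per source type, rounded to one decimal place
share <- round(100 * counts / sum(counts), 1)
share
```

The counts add up to the 18250 records reported for epiFiles, and BED and BigWig files together account for more than 99% of them.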

To get a more descriptive idea of these different files one can use:

sort(table(epiFiles$description), decreasing=TRUE)

## 
## Bigwig File containing -log10(p-value) signal tracks from EpigenomeRoadMap Project
##   6881
## Bigwig File containing fold enrichment signal tracks from EpigenomeRoadMap Project
##   2947
## Narrow ChIP-seq peaks for consolidated epigenomes from EpigenomeRoadMap Project
##   2894
## Broad ChIP-seq peaks for consolidated epigenomes from EpigenomeRoadMap Project
##   2534
## Gapped ChIP-seq peaks for consolidated epigenomes from EpigenomeRoadMap Project
##   2534
## Narrow DNasePeaks for consolidated epigenomes from EpigenomeRoadMap Project
##   131
## 15 state chromatin segmentations from EpigenomeRoadMap Project
##   127
## Broad domains on enrichment for DNase-seq for consolidated epigenomes from EpigenomeRoadMap Project
##   78
## RRBS fractional methylation calls from EpigenomeRoadMap Project
##   51
## Whole genome bisulphite fractional methylation calls from EpigenomeRoadMap Project
##   37
## MeDIP/MRE(mCRF) fractional methylation calls from EpigenomeRoadMap Project
##   16
## GencodeV10 gene/transcript coordinates and annotations corresponding to hg19 version of the human genome
##   3
## RNA-seq read count matrix for intronic protein-coding RNA elements
##   2
## RNA-seq read counts matrix for ribosomal gene exons
##   2
## RPKM expression matrix for ribosomal gene exons
##   2
## A composite GRanges object containing regions of putative enhancer elements from a variety of sources, primarily for use in the TENET Bioconductor package. This dataset is composed of regions of strong enhancers as annotated by the Roadmap Epigenomics ChromHMM expanded 18-state model based on 98 reference epigenomes, lifted over to the hg38 genome, as well as regions of human permissive enhancers identified by the FANTOM5 project. For additional information on component datasets, see the manifest file hosted at https://github.com/rhielab/TENET.AnnotationHub/blob/devel/data-raw/TENET_consensus_datasets_manifest.tsv
##   1
## A composite GRanges object containing regions of putative promoter elements from a variety of sources, primarily for use in the TENET Bioconductor package. This dataset is composed of regions flanking transcription start sites as annotated by the Roadmap Epigenomics ChromHMM expanded 18-state model based on 98 reference epigenomes, lifted over to the hg38 genome. For additional information on component datasets, see the manifest file hosted at https://github.com/rhielab/TENET.AnnotationHub/blob/devel/data-raw/TENET_consensus_datasets_manifest.tsv
##   1
## Metadata for EpigenomeRoadMap Project
##   1
## RNA-seq read counts matrix for non-coding RNAs
##   1
## RNA-seq read counts matrix for protein coding exons
##   1
## RNA-seq read counts matrix for protein coding genes
##   1
## RNA-seq read counts matrix for ribosomal genes
##   1
## RPKM expression matrix for non-coding RNAs
##   1
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               RPKM expression matrix for protein coding exons 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             1 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               RPKM expression matrix for protein coding genes 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             1 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     RPKM expression matrix for ribosomal RNAs 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             1

The ‘metadata’ provided by the Roadmap Epigenomics Project is also available. Note that the information displayed about a hub with a single resource is quite different from the information displayed when the hub references more than one resource.

metadata.tab <- query(ah, c("EpigenomeRoadMap", "Metadata"))
metadata.tab

## AnnotationHub with 1 record
## # snapshotDate(): 2026-06-30
## # names(): AH41830
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: data.frame
## # $rdatadateadded: 2015-05-11
## # $title: EID_metadata.tab
## # $description: Metadata for EpigenomeRoadMap Project
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: tab
## # $sourceurl: http://egg2.wustl.edu/roadmap/data/byFileType/metadata/EID_metadata.tab
## # $sourcesize: 18035
## # $tags: c("EpigenomeRoadMap", "Metadata") 
## # retrieve record with 'object[["AH41830"]]'

So far we have been exploring information about resources without downloading them to a local cache and importing them into R. One can retrieve a resource using [[, as indicated at the end of the show method output

metadata.tab <- ah[["AH41830"]]

## downloading 1 resources

## retrieving 1 resource

## loading from cache

The metadata.tab file is returned as a data.frame. The first 6 rows of the first 5 columns are shown here:

metadata.tab[1:6, 1:5]

##    EID    GROUP   COLOR          MNEMONIC                                   STD_NAME
## 1 E001      ESC #924965            ESC.I3                                ES-I3 Cells
## 2 E002      ESC #924965           ESC.WA7                               ES-WA7 Cells
## 3 E003      ESC #924965            ESC.H1                                   H1 Cells
## 4 E004 ES-deriv #4178AE ESDR.H1.BMP4.MESO H1 BMP4 Derived Mesendoderm Cultured Cells
## 5 E005 ES-deriv #4178AE ESDR.H1.BMP4.TROP H1 BMP4 Derived Trophoblast Cultured Cells
## 6 E006 ES-deriv #4178AE       ESDR.H1.MSC          H1 Derived Mesenchymal Stem Cells
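Because the metadata is an ordinary data.frame, base R subsetting is enough to find the epigenome IDs of interest. The sketch below is only an illustration: it rebuilds three columns of the six rows printed above by hand (rather than downloading the resource), and the names eid_meta and esc_ids are made up for this example.

```r
# Hand-copied excerpt of the rows printed above; not the full table
eid_meta <- data.frame(
    EID = c("E001", "E002", "E003", "E004", "E005", "E006"),
    GROUP = c("ESC", "ESC", "ESC", "ES-deriv", "ES-deriv", "ES-deriv"),
    MNEMONIC = c("ESC.I3", "ESC.WA7", "ESC.H1",
                 "ESDR.H1.BMP4.MESO", "ESDR.H1.BMP4.TROP", "ESDR.H1.MSC"),
    stringsAsFactors = FALSE
)

# Number of epigenomes per group
group_counts <- table(eid_meta$GROUP)

# Epigenome IDs belonging to the embryonic stem cell group
esc_ids <- eid_meta$EID[eid_meta$GROUP == "ESC"]
esc_ids
```

An ID selected this way can then be used as one of the terms passed to query(), in the same way that "E126" is used further down.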

One can keep constructing queries with multiple terms to trim these 18250 records down to the files one wants. For example, to get the broadPeak ChIP-Seq files for consolidated epigenomes, one could use

bpChipEpi <- query(ah, c("EpigenomeRoadMap", "broadPeak", "chip", "consolidated"))

To get all the bigWig signal files, one can query the hub using

allBigWigFiles <- query(ah, c("EpigenomeRoadMap", "BigWig"))

To access the 15-state chromatin segmentations, one can use

seg <- query(ah, c("EpigenomeRoadMap", "segmentations"))

To get the files for one histone mark in one sample, here H3K4me2 in epigenome E126, one can add both terms to the query

E126 <- query(ah, c("EpigenomeRoadMap", "E126", "H3K4ME2"))
E126

## AnnotationHub with 6 records
## # snapshotDate(): 2026-06-30
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: GRanges, BigWigFile
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH29817"]]' 
## 
##             title                                  
##   AH29817 | E126-H3K4me2.broadPeak.gz              
##   AH30868 | E126-H3K4me2.narrowPeak.gz             
##   AH31801 | E126-H3K4me2.gappedPeak.gz             
##   AH32990 | E126-H3K4me2.fc.signal.bigwig          
##   AH34022 | E126-H3K4me2.pval.signal.bigwig        
##   AH40177 | E126-H3K4me2.imputed.pval.signal.bigwig

Hub resources can also be selected using $, subset(), and BiocHubsShiny(); see the main AnnotationHub vignette for additional detail.

Hub resources are imported as the appropriate Bioconductor object for use in further analysis. For example, peak files are returned as GRanges objects.

peaks <- E126[['AH29817']]

## downloading 1 resources

## retrieving 1 resource

## loading from cache

## require("rtracklayer")

seqinfo(peaks)

## Seqinfo object with 298 sequences (2 circular) from hg19 genome:
##   seqnames       seqlengths isCircular genome
##   chr1            249250621      FALSE   hg19
##   chr2            243199373      FALSE   hg19
##   chr3            198022430      FALSE   hg19
##   chr4            191154276      FALSE   hg19
##   chr5            180915260      FALSE   hg19
##   ...                   ...        ...    ...
##   chrUn_gl000245      36651      FALSE   hg19
##   chrUn_gl000246      38154      FALSE   hg19
##   chrUn_gl000247      36422      FALSE   hg19
##   chrUn_gl000248      39786      FALSE   hg19
##   chrUn_gl000249      38502      FALSE   hg19

BigWig files are returned as BigWigFile objects. A BigWigFile is a reference to a file on disk; the data in the file can be read in using rtracklayer::import(), perhaps querying these large files for particular genomic regions of interest as described on the help page ?import.bw.
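As a rough illustration of region-restricted import, the sketch below does not touch the hub at all: it writes a tiny, made-up bigWig file to a temporary path and reads back only the records overlapping a region of interest through the which argument of import(). The coordinates, scores, and variable names are invented for demonstration; with a hub resource one would use the BigWigFile returned by [[ in place of bw. Writing bigWig files may not be available on every platform (historically it was unsupported on Windows), so treat the export step as an assumption about your system.

```r
library(rtracklayer)   # also attaches GenomicRanges

# Toy signal track: three adjacent 100-bp bins on a made-up 1 kb "chr1"
signal <- GRanges(
    "chr1",
    IRanges(start = c(1, 101, 201), width = 100),
    score = c(1, 2, 3),
    seqlengths = c(chr1 = 1000L)
)

# Write the track to a temporary bigWig file
bw_path <- tempfile(fileext = ".bw")
export.bw(signal, bw_path)

# Reference the file on disk, then import only what overlaps the region
bw <- BigWigFile(bw_path)
region <- GRanges("chr1", IRanges(150, 250))
hits <- import(bw, which = region)
hits$score
```

Only the second and third bins overlap the region, so the first bin is never read into memory; this is what makes the approach practical for genome-wide signal files.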

Each record in AnnotationHub is associated with a unique identifier. Most GRanges objects returned by AnnotationHub carry the identifier of the resource from which they were derived. This comes in handy when one has been working with a GRanges object for a while and needs additional information about it, e.g., the name of the file in the cache or the original sourceurl of the underlying data.

metadata(peaks)

## $AnnotationHubName
## [1] "AH29817"
## 
## $`File Name`
## [1] "E126-H3K4me2.broadPeak.gz"
## 
## $`Data Source`
## [1] "http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E126-H3K4me2.broadPeak.gz"
## 
## $Provider
## [1] "BroadInstitute"
## 
## $Organism
## [1] "Homo sapiens"
## 
## $`Taxonomy ID`
## [1] 9606

ah[metadata(peaks)$AnnotationHubName]$sourceurl

## [1] "http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E126-H3K4me2.broadPeak.gz"

Ensembl GTF and FASTA files for TxDb gene models and sequence queries

Bioconductor represents gene models using ‘transcript’ databases. These are available via packages such as TxDb.Hsapiens.UCSC.hg38.knownGene or can be constructed using functions such as txdbmaker::makeTxDbFromBiomart().

AnnotationHub provides an easy way to work with gene models published by Ensembl. Let’s see what Ensembl’s Release-94 has in terms of data for pufferfish, Takifugu rubripes.

query(ah, c("Takifugu", "release-94"))

## AnnotationHub with 7 records
## # snapshotDate(): 2026-06-30
## # $dataprovider: Ensembl
## # $species: Takifugu rubripes
## # $rdataclass: TwoBitFile, GRanges
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH64856"]]' 
## 
##             title                                       
##   AH64856 | Takifugu_rubripes.FUGU5.94.abinitio.gtf     
##   AH64857 | Takifugu_rubripes.FUGU5.94.chr.gtf          
##   AH64858 | Takifugu_rubripes.FUGU5.94.gtf              
##   AH66114 | Takifugu_rubripes.FUGU5.cdna.all.2bit       
##   AH66115 | Takifugu_rubripes.FUGU5.dna_rm.toplevel.2bit
##   AH66116 | Takifugu_rubripes.FUGU5.dna_sm.toplevel.2bit
##   AH66117 | Takifugu_rubripes.FUGU5.ncrna.2bit

We see that there is a GTF file describing gene models, as well as various DNA sequences. Let’s retrieve the GTF and top-level DNA sequence files. The GTF file is imported as a GRanges instance, the DNA sequence as a TwoBitFile.

gtf <- ah[["AH64858"]]

## downloading 1 resources

## retrieving 1 resource

## loading from cache

## Importing File into R ..

dna <- ah[["AH66116"]]

## downloading 1 resources

## retrieving 1 resource

## loading from cache

head(gtf, 3)

## GRanges object with 3 ranges and 22 metadata columns:
##       seqnames        ranges strand |   source       type     score     phase            gene_id
##          <Rle>     <IRanges>  <Rle> | <factor>   <factor> <numeric> <integer>        <character>
##   [1]        1 217531-252954      + |  ensembl gene              NA      <NA> ENSTRUG00000009922
##   [2]        1 217531-252954      + |  ensembl transcript        NA      <NA> ENSTRUG00000009922
##   [3]        1 217531-217702      + |  ensembl exon              NA      <NA> ENSTRUG00000009922
##       gene_version   gene_name gene_source   gene_biotype      transcript_id transcript_version
##        <character> <character> <character>    <character>        <character>        <character>
##   [1]            2       sdk2b     ensembl protein_coding               <NA>               <NA>
##   [2]            2       sdk2b     ensembl protein_coding ENSTRUT00000025027                  2
##   [3]            2       sdk2b     ensembl protein_coding ENSTRUT00000025027                  2
##       transcript_name transcript_source transcript_biotype exon_number            exon_id
##           <character>       <character>        <character> <character>        <character>
##   [1]            <NA>              <NA>               <NA>        <NA>               <NA>
##   [2]       sdk2b-201           ensembl     protein_coding        <NA>               <NA>
##   [3]       sdk2b-201           ensembl     protein_coding           1 ENSTRUE00000325931
##       exon_version  protein_id protein_version projection_parent_gene projection_parent_transcript
##        <character> <character>     <character>            <character>                  <character>
##   [1]         <NA>        <NA>            <NA>                   <NA>                         <NA>
##   [2]         <NA>        <NA>            <NA>                   <NA>                         <NA>
##   [3]            1        <NA>            <NA>                   <NA>                         <NA>
##               tag
##       <character>
##   [1]        <NA>
##   [2]        <NA>
##   [3]        <NA>
##   -------
##   seqinfo: 1627 sequences (1 circular) from FUGU5 genome; no seqlengths

dna

## TwoBitFile object
## resource: /github/home/.cache/R/AnnotationHub/21394bda5b76_72862

head(seqlevels(dna))

## [1] "1" "2" "3" "4" "5" "6"

Let’s identify the 25 longest DNA sequences, and keep just the annotations on these scaffolds.

keep <- names(tail(sort(seqlengths(dna)), 25))
gtf_subset <- gtf[seqnames(gtf) %in% keep]
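The selection idiom above relies on sort() ordering a named vector of sequence lengths in increasing order and tail() taking the last entries. A minimal sketch on made-up data shows the same two steps end to end; the scaffold names, lengths, and ranges are invented for illustration and have nothing to do with the FUGU5 assembly.

```r
library(GenomicRanges)

# Made-up sequence lengths, standing in for seqlengths(dna)
toy_lengths <- c(s1 = 500L, s2 = 50L, s3 = 300L, s4 = 10L)

# Names of the two longest sequences (increasing order, so longest is last)
toy_keep <- names(tail(sort(toy_lengths), 2))

# Made-up annotations spread over all four sequences
toy_gr <- GRanges(c("s1", "s2", "s3", "s4", "s1"),
                  IRanges(start = 1:5, width = 5))

# Keep only the ranges located on the selected sequences
toy_subset <- toy_gr[seqnames(toy_gr) %in% toy_keep]
toy_subset
```

Subsetting in this way filters the ranges but leaves the sequence levels untouched, so seqlevels(toy_subset) still lists all four sequences.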

It is straightforward to make a TxDb instance from gtf_subset (or from the entire gtf)

library(txdbmaker)         # for makeTxDbFromGRanges
txdb <- makeTxDbFromGRanges(gtf_subset)

## Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for features of type stop_codon. This
##   information was ignored.

## Warning in .makeTxDb_normarg_chrominfo(chrominfo): genome version information is not available for
## this TxDb object

and to use that in conjunction with the DNA sequences, e.g., to find exon sequences of all annotated genes.

library(Rsamtools)               # for getSeq,FaFile-method
exons <- exons(txdb)
length(exons)

## [1] 178769

getSeq(dna, exons)

## DNAStringSet object of length 178769:
##          width seq
##      [1]   172 CGATACGGCGCGCTCCGTTTGCCTCCGCCCCCCCCGTGGCG...GCGTTTCTGGGCCCCGCCCCCCTCGCCTCCCTCCGTGGCAG
##      [2]    28 TTGGGATTATTCTCACACGCTGATCGGT
##      [3]   160 ACGACGTGCCCCCCTACTTCAAGACGGAGCCGGCCCGGAGC...CACAACAACACGGAGCTGACGCGCTTCTCGCTGGAGTACAG
##      [4]   107 GTACGTGATCCCGTCTTTGGACCGCTCCCACGCCGGATTCT...GGGCGCCCTGCTGCAGAGACGCACCGAAGTCCAGGTGGTCT
##      [5]   148 TTATGGGAAGCTTCGAGGAGGGCGAGCGAGCCCAGTCCGTC...TGGTACCGGGATGGACGCAAGATTCCCCCGAGCAGCCGCAT
##      ...   ... ...
## [178765]    54 ATGCCCTCAATTACACTACCGCAGAAGGAGAACGCTCTCTTCAAAAGAATATTG
## [178766]   863 CTCTTGGTGAGGGGAAGGATGAATTTATCCGATGTCCAGTG...GTGATATAAGTTTTAGGGAAGAGCCCCATAGGCTGATGTAG
## [178767]   270 TTTGTGCAATGGGTGGCACCAGCAGCACCAGCAGGTTGTTT...CCCGTCTATCCGGATCATGCAGTGGAACATACTGGCACAAG
## [178768]   982 CAGTTGTACAGAAATCGTTGGAGCAGACCTGGAGGCTGTTG...CCCGTCTATCCGGATCATGCAGTGGAACATACTGGCACAAG
## [178769]   627 GGGGGAGATTCCGATGGTGGTATATTTAAAAAGTTGAAACT...GCCAAAGTGTTCCAGTTCCACCCATCGTGGCGGCCCGCCAG

There is a one-to-one mapping between the genomic ranges contained in exons and the DNA sequences returned by getSeq().
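The one-to-one correspondence can be checked on a small scale. The sketch below is an assumption-laden toy, not the vignette's workflow: it writes a two-sequence, made-up genome to a temporary 2bit file and pulls out three ranges with rtracklayer's import() and its which argument, instead of calling getSeq() on the Takifugu data. All names and sequences are invented for illustration.

```r
library(rtracklayer)   # also attaches GenomicRanges
library(Biostrings)

# Made-up two-sequence genome written to a temporary 2bit file
toy_genome <- DNAStringSet(c(s1 = "ACGTACGTAC", s2 = "TTTTGGGGCC"))
twobit_path <- tempfile(fileext = ".2bit")
export.2bit(toy_genome, twobit_path)

# Made-up "exons": two on s1, one on s2
toy_exons <- GRanges(c("s1", "s1", "s2"),
                     IRanges(start = c(2, 7, 1), end = c(5, 10, 4)))

# One sequence is returned per requested range
toy_seqs <- import(TwoBitFile(twobit_path), which = toy_exons)

# The i-th sequence is as wide as the i-th range
data.frame(range_width = width(toy_exons), seq_width = width(toy_seqs))
```

Because the result is parallel to the input ranges, per-exon values such as width or nucleotide content can be attached back onto the ranges as metadata columns without any further matching.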

Some difficulties arise when working with this partly assembled genome, and they require more advanced GenomicRanges skills; see the GenomicRanges vignettes, especially “GenomicRanges HOWTOs” and “An Introduction to GenomicRanges”.

liftOver to map between genome builds

Suppose we wanted to lift features from one genome build to another, e.g., because annotations were generated for hg19 but our experimental analysis used hg18. We know that UCSC provides ‘liftover’ files for mapping between genome builds.

In this example, we take the broadPeak GRanges from E126, which is on the ‘hg19’ genome, and lift these features over to their ‘hg38’ coordinates.

chainfiles <- query(ah, c("hg38", "hg19", "chainfile"))
chainfiles

## AnnotationHub with 4 records
## # snapshotDate(): 2026-06-30
## # $dataprovider: UCSC, NCBI
## # $species: Homo sapiens
## # $rdataclass: ChainFile
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH14108"]]' 
## 
##             title                                        
##   AH14108 | hg38ToHg19.over.chain.gz                     
##   AH14150 | hg19ToHg38.over.chain.gz                     
##   AH78915 | Chain file for Homo sapiens rRNA hg19 to hg38
##   AH78916 | Chain file for Homo sapiens rRNA hg38 to hg19

We are interested in the file that lifts features over from hg19 to hg38, so let’s download it using

chain <- chainfiles[['AH14150']]

## downloading 1 resources

## retrieving 1 resource

## loading from cache

chain

## Chain of length 25
## names(25): chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 ... chr18 chr19 chr20 chr21 chr22 chrX chrY chrM

Perform the liftOver operation using rtracklayer::liftOver():

library(rtracklayer)
gr38 <- liftOver(peaks, chain)

This returns a GRangesList; set the genome of the result to ‘hg38’ to obtain the final object

genome(gr38) <- "hg38"
gr38

## GRangesList object of length 153266:
## [[1]]
## GRanges object with 1 range and 5 metadata columns:
##       seqnames            ranges strand |        name     score signalValue    pValue    qValue
##          <Rle>         <IRanges>  <Rle> | <character> <numeric>   <numeric> <numeric> <numeric>
##   [1]     chr1 28667912-28670147      * |      Rank_1       189     10.5585   22.0132   18.9991
##   -------
##   seqinfo: 23 sequences from hg38 genome; no seqlengths
## 
## [[2]]
## GRanges object with 1 range and 5 metadata columns:
##       seqnames            ranges strand |        name     score signalValue    pValue    qValue
##          <Rle>         <IRanges>  <Rle> | <character> <numeric>   <numeric> <numeric> <numeric>
##   [1]     chr4 54090990-54092984      * |      Rank_2       188     8.11483   21.8044   18.8066
##   -------
##   seqinfo: 23 sequences from hg38 genome; no seqlengths
## 
## [[3]]
## GRanges object with 1 range and 5 metadata columns:
##       seqnames            ranges strand |        name     score signalValue    pValue    qValue
##          <Rle>         <IRanges>  <Rle> | <character> <numeric>   <numeric> <numeric> <numeric>
##   [1]    chr14 75293392-75296621      * |      Rank_3       180     8.89834   20.9771   18.0282
##   -------
##   seqinfo: 23 sequences from hg38 genome; no seqlengths
## 
## ...
## <153263 more elements>
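The result is a list because each input range gets its own element, and that element may hold one range, none (no counterpart in the target build), or several (the range was split). A common follow-up is to count the ranges per element and flatten the unambiguous ones. The sketch below mimics this on a hand-built GRangesList; the object lifted and its coordinates are invented, and it stands in for gr38 only as an illustration.

```r
library(GenomicRanges)

# Hand-built stand-in for a liftOver() result (coordinates are made up)
lifted <- GRangesList(
    peak1 = GRanges("chr1", IRanges(100, 200)),               # maps once
    peak2 = GRanges(),                                        # no counterpart
    peak3 = GRanges("chr2", IRanges(c(10, 50), c(40, 90)))    # split in two
)

# Number of target ranges per input range
n_targets <- elementNROWS(lifted)
table(n_targets)

# Keep only inputs that map to exactly one range, as a flat GRanges
one_to_one <- unlist(lifted[n_targets == 1])
one_to_one
```

Whether to drop, keep, or merge the split ranges depends on the analysis; filtering to one-to-one matches is simply the most conservative choice.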

sessionInfo

sessionInfo()

## R version 4.6.1 (2026-06-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 26.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rtracklayer_1.73.0          VariantAnnotation_1.59.0    SummarizedExperiment_1.43.0
##  [4] MatrixGenerics_1.25.0       matrixStats_1.5.0           Rsamtools_2.29.0           
##  [7] Biostrings_2.81.3           XVector_0.53.0              txdbmaker_1.9.0            
## [10] GenomicFeatures_1.65.0      AnnotationDbi_1.75.0        Biobase_2.73.1             
## [13] GenomicRanges_1.65.0        IRanges_2.47.2              Seqinfo_1.3.0              
## [16] S4Vectors_0.51.5            AnnotationHub_4.3.2         BiocFileCache_3.3.0        
## [19] dbplyr_2.6.0                BiocGenerics_0.59.8         generics_0.1.4             
## [22] BiocStyle_2.41.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1         dplyr_1.2.1              blob_1.3.0              
##  [4] filelock_1.0.3           bitops_1.0-9             fastmap_1.2.0           
##  [7] RCurl_1.98-1.19          GenomicAlignments_1.49.0 XML_3.99-0.23           
## [10] digest_0.6.39            lifecycle_1.0.5          KEGGREST_1.53.4         
## [13] RSQLite_3.53.2           magrittr_2.0.5           compiler_4.6.1          
## [16] progress_1.2.3           rlang_1.2.0              sass_0.4.10             
## [19] tools_4.6.1              yaml_2.3.12              knitr_1.51              
## [22] prettyunits_1.2.0        S4Arrays_1.13.0          bit_4.6.0               
## [25] curl_7.1.0               DelayedArray_0.39.3      abind_1.4-8             
## [28] BiocParallel_1.47.0      withr_3.0.3              purrr_1.2.2             
## [31] sys_3.4.3                grid_4.6.1               biomaRt_2.69.0          
## [34] cli_3.6.6                rmarkdown_2.31           crayon_1.5.3            
## [37] otel_0.2.0               httr_1.4.8               rjson_0.2.23            
## [40] BiocBaseUtils_1.15.1     DBI_1.3.0                cachem_1.1.0            
## [43] stringr_1.6.0            parallel_4.6.1           BiocManager_1.30.27     
## [46] restfulr_0.0.17          vctrs_0.7.3              Matrix_1.7-5            
## [49] jsonlite_2.0.0           hms_1.1.4                bit64_4.8.2             
## [52] maketools_1.3.2          jquerylib_0.1.4          glue_1.8.1              
## [55] codetools_0.2-20         stringi_1.8.7            BiocVersion_3.24.0      
## [58] GenomeInfoDb_1.49.1      BiocIO_1.23.3            UCSC.utils_1.9.0        
## [61] tibble_3.3.1             pillar_1.11.1            rappdirs_0.3.4          
## [64] htmltools_0.5.9          BSgenome_1.81.0          R6_2.6.1                
## [67] httr2_1.2.3              evaluate_1.0.5           lattice_0.22-9          
## [70] png_0.1-9                cigarillo_1.3.0          memoise_2.0.1           
## [73] bslib_0.11.0             SparseArray_1.13.2       xfun_0.59               
## [76] buildtools_1.0.0         pkgconfig_2.0.3
