Title: | Query the gene models of a given organism/assembly |
---|---|
Description: | Extract the genomic locations of genes, transcripts, exons, introns, and CDS, for the gene models stored in a TxDb object. A TxDb object is a small database that contains the gene models of a given organism/assembly. Bioconductor provides a small collection of TxDb objects in the form of ready-to-install TxDb packages for the most commonly studied organisms. Additionally, the user can easily make a TxDb object (or package) for the organism/assembly of their choice by using the tools from the txdbmaker package. |
Authors: | M. Carlson [aut], H. Pagès [aut, cre], P. Aboyoun [aut], S. Falcon [aut], M. Morgan [aut], D. Sarkar [aut], M. Lawrence [aut], V. Obenchain [aut], S. Arora [ctb], J. MacDonald [ctb], M. Ramos [ctb], S. Saini [ctb], P. Shannon [ctb], L. Shepherd [ctb], D. Tenenbaum [ctb], D. Van Twisk [ctb] |
Maintainer: | H. Pagès <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.59.1 |
Built: | 2024-11-08 03:30:20 UTC |
Source: | https://github.com/bioc/GenomicFeatures |
These functions coerce a TxDb
object to a
GRanges
object with
metadata columns encoding transcript structures according to the
model of a standard file format. Currently, BED and GFF models are
supported. If a TxDb
is passed to
export
, when targeting a BED or GFF file,
this coercion occurs automatically.
## S4 method for signature 'TxDb' asBED(x) ## S4 method for signature 'TxDb' asGFF(x)
## S4 method for signature 'TxDb' asBED(x) ## S4 method for signature 'TxDb' asGFF(x)
x |
A |
For asBED
, a GRanges
, with the columns name
,
thickStart
, thickEnd
, blockStarts
,
blockSizes
added. The thick regions correspond to the CDS
regions, and the blocks represent the exons. The transcript IDs are
stored in the name
column. The ranges are the transcript bounds.
For asGFF
, a GRanges
, with columns type
,
Name
, ID
,, and Parent
. The gene structures are
expressed according to the conventions defined by the GFF3 spec. There
are elements of each type
of feature: “gene”,
“mRNA” “exon” and “cds”. The Name
column
contains the gene_id
for genes, tx_name
for transcripts,
and exons and cds regions are NA
. The ID
column uses
gene_id
and tx_id
, with the prefixes “GeneID” and
“TxID” to ensure uniqueness across types. The exons and cds
regions have NA
for ID
. The Parent
column
contains the ID
s of the parent features. A feature may have
multiple parents (the column is a CharacterList
). Each exon
belongs to one or more mRNAs, and mRNAs belong to a gene.
Michael Lawrence
txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) asBED(txdb) asGFF(txdb)
txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) asBED(txdb) asGFF(txdb)
coverageByTranscript
computes the transcript (or CDS) coverage
of a set of ranges.
pcoverageByTranscript
is a version of coverageByTranscript
that operates element-wise.
coverageByTranscript(x, transcripts, ignore.strand=FALSE) pcoverageByTranscript(x, transcripts, ignore.strand=FALSE, ...)
coverageByTranscript(x, transcripts, ignore.strand=FALSE) pcoverageByTranscript(x, transcripts, ignore.strand=FALSE, ...)
x |
An object representing a set of ranges (typically aligned reads). GRanges, GRangesList, GAlignments, GAlignmentPairs, and GAlignmentsList objects are supported. More generally, for More generally, for |
transcripts |
A GRangesList object representing the exons of
each transcript for which to compute coverage. For each transcript, the
exons must be ordered by ascending rank, that is, by their position
in the transcript. This means that, for a transcript located on the minus
strand, the exons should typically be ordered by descending position on
the reference genome. If Alternatively, For |
ignore.strand |
TRUE or FALSE. If FALSE (the default) then the strand of a range in
|
... |
Additional arguments passed to the internal call to
|
An RleList object parallel to transcripts
,
that is, the i-th element in it is an integer-Rle
representing the coverage of the i-th transcript in transcripts
.
Its lengths()
is guaranteed to be identical to
sum(width(transcripts))
. The names and metadata columns on
transcripts
are propagated to it.
Hervé Pagès
transcripts
, transcriptsBy
,
and transcriptsByOverlaps
, for extracting
genomic feature locations from a TxDb-like object.
transcriptLengths
for extracting the transcript
lengths (and other metrics) from a TxDb object.
extractTranscriptSeqs
for extracting transcript
(or CDS) sequences from chromosome sequences.
The RleList class defined and documented in the IRanges package.
The GRangesList class defined and documented in the GenomicRanges package.
The coverage
methods defined in the
GenomicRanges package.
The exonsBy
function for extracting exon ranges
grouped by transcript.
findCompatibleOverlaps
in the
GenomicAlignments package for finding which reads are
compatible with the splicing of which transcript.
## --------------------------------------------------------------------- ## 1. A SIMPLE ARTIFICIAL EXAMPLE WITH ONLY ONE TRANSCRIPT ## --------------------------------------------------------------------- ## Get some transcripts: library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene dm3_transcripts <- exonsBy(txdb, by="tx", use.names=TRUE) dm3_transcripts ## Let's pick up the 1st transcript: FBtr0300689. It as 2 exons and 1 ## intron: my_transcript <- dm3_transcripts["FBtr0300689"] ## Let's create 3 artificial aligned reads. We represent them as a ## GRanges object of length 3 that contains the genomic positions of ## the 3 reads. Note that these reads are simple alignments i.e. each ## of them can be represented with a single range. This would not be ## the case if they were junction reads. my_reads <- GRanges(c("chr2L:7531-7630", "chr2L:8101-8200", "chr2L:8141-8240")) ## The coverage of the 3 reads on the reference genome is: coverage(my_reads) ## As you can see, all the genomic positions in the 3 ranges participate ## to the coverage. This can be confirmed by comparing: sum(coverage(my_reads)) ## with: sum(width(my_reads)) ## They should always be the same. ## When computing the coverage on a transcript, only the part of the ## read that overlaps with the transcript participates to the coverage. ## Let's look at the individual coverage of each read on transcript ## FBtr0300689: ## The 1st read is fully contained within the 1st exon: coverageByTranscript(my_reads[1], my_transcript) ## Note that the length of the Rle (1880) is the length of the transcript. ## The 2nd and 3rd reads overlap the 2 exons and the intron. Only the ## parts that overlap the exons participate to coverage: coverageByTranscript(my_reads[2], my_transcript) coverageByTranscript(my_reads[3], my_transcript) ## The coverage of the 3 reads together is: coverageByTranscript(my_reads, my_transcript) ## Note that this is the sum of the individual coverages. This can be ## checked with: stopifnot(all( coverageByTranscript(my_reads, my_transcript) == Reduce("+", lapply(seq_along(my_reads), function(i) coverageByTranscript(my_reads[i], my_transcript)), 0L) )) ## --------------------------------------------------------------------- ## 2. COMPUTE THE FULL TRANSCRIPTOME COVERAGE OF A SET OF ALIGNED READS ## --------------------------------------------------------------------- ## Load the aligned reads: library(pasillaBamSubset) library(GenomicAlignments) reads <- readGAlignments(untreated1_chr4()) ## Compute the full transcriptome coverage by calling ## coverageByTranscript() on 'dm3_transcripts': tx_cvg <- coverageByTranscript(reads, dm3_transcripts, ignore.strand=TRUE) tx_cvg ## A sanity check: stopifnot(identical(lengths(tx_cvg), sum(width(dm3_transcripts)))) ## We can also use pcoverageByTranscript() to compute 'tx_cvg'. ## For this we first create a GAlignmentsList object "parallel" to ## 'dm3_transcripts' where the i-th list element contains the aligned ## reads that overlap with the i-th transcript: hits <- findOverlaps(reads, dm3_transcripts, ignore.strand=TRUE) tx2reads <- setNames(as(t(hits), "List"), names(dm3_transcripts)) reads_by_tx <- extractList(reads, tx2reads) # GAlignmentsList object reads_by_tx ## Call pcoverageByTranscript(): tx_cvg2 <- pcoverageByTranscript(reads_by_tx, dm3_transcripts, ignore.strand=TRUE) stopifnot(identical(tx_cvg, tx_cvg2)) ## A more meaningful coverage is obtained by counting for each ## transcript only the reads that are *compatible* with its splicing: compat_hits <- findCompatibleOverlaps(reads, dm3_transcripts) tx2reads <- setNames(as(t(compat_hits), "List"), names(dm3_transcripts)) compat_reads_by_tx <- extractList(reads, tx2reads) tx_compat_cvg <- pcoverageByTranscript(compat_reads_by_tx, dm3_transcripts, ignore.strand=TRUE) ## A sanity check: stopifnot(all(all(tx_compat_cvg <= tx_cvg))) ## --------------------------------------------------------------------- ## 3. COMPUTE CDS COVERAGE OF A SET OF ALIGNED READS ## --------------------------------------------------------------------- ## coverageByTranscript() can also be used to compute CDS coverage: cds <- cdsBy(txdb, by="tx", use.names=TRUE) cds_cvg <- coverageByTranscript(reads, cds, ignore.strand=TRUE) cds_cvg ## A sanity check: stopifnot(identical(lengths(cds_cvg), sum(width(cds)))) ## --------------------------------------------------------------------- ## 4. ALTERNATIVELY, THE CDS COVERAGE CAN BE OBTAINED FROM THE ## TRANSCRIPT COVERAGE BY TRIMMING THE 5' AND 3' UTRS ## --------------------------------------------------------------------- tx_lens <- transcriptLengths(txdb, with.utr5_len=TRUE, with.utr3_len=TRUE) stopifnot(identical(tx_lens$tx_name, names(tx_cvg))) # sanity ## Keep the rows in 'tx_lens' that correspond to a list element in ## 'cds_cvg' and put them in the same order as in 'cds_cvg': m <- match(names(cds_cvg), names(tx_cvg)) tx_lens <- tx_lens[m, ] utr5_width <- tx_lens$utr5_len utr3_width <- tx_lens$utr3_len cds_cvg2 <- windows(tx_cvg[m], start=1L+utr5_width, end=-1L-utr3_width) ## A sanity check: stopifnot(identical(cds_cvg2, cds_cvg))
## --------------------------------------------------------------------- ## 1. A SIMPLE ARTIFICIAL EXAMPLE WITH ONLY ONE TRANSCRIPT ## --------------------------------------------------------------------- ## Get some transcripts: library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene dm3_transcripts <- exonsBy(txdb, by="tx", use.names=TRUE) dm3_transcripts ## Let's pick up the 1st transcript: FBtr0300689. It as 2 exons and 1 ## intron: my_transcript <- dm3_transcripts["FBtr0300689"] ## Let's create 3 artificial aligned reads. We represent them as a ## GRanges object of length 3 that contains the genomic positions of ## the 3 reads. Note that these reads are simple alignments i.e. each ## of them can be represented with a single range. This would not be ## the case if they were junction reads. my_reads <- GRanges(c("chr2L:7531-7630", "chr2L:8101-8200", "chr2L:8141-8240")) ## The coverage of the 3 reads on the reference genome is: coverage(my_reads) ## As you can see, all the genomic positions in the 3 ranges participate ## to the coverage. This can be confirmed by comparing: sum(coverage(my_reads)) ## with: sum(width(my_reads)) ## They should always be the same. ## When computing the coverage on a transcript, only the part of the ## read that overlaps with the transcript participates to the coverage. ## Let's look at the individual coverage of each read on transcript ## FBtr0300689: ## The 1st read is fully contained within the 1st exon: coverageByTranscript(my_reads[1], my_transcript) ## Note that the length of the Rle (1880) is the length of the transcript. ## The 2nd and 3rd reads overlap the 2 exons and the intron. Only the ## parts that overlap the exons participate to coverage: coverageByTranscript(my_reads[2], my_transcript) coverageByTranscript(my_reads[3], my_transcript) ## The coverage of the 3 reads together is: coverageByTranscript(my_reads, my_transcript) ## Note that this is the sum of the individual coverages. This can be ## checked with: stopifnot(all( coverageByTranscript(my_reads, my_transcript) == Reduce("+", lapply(seq_along(my_reads), function(i) coverageByTranscript(my_reads[i], my_transcript)), 0L) )) ## --------------------------------------------------------------------- ## 2. COMPUTE THE FULL TRANSCRIPTOME COVERAGE OF A SET OF ALIGNED READS ## --------------------------------------------------------------------- ## Load the aligned reads: library(pasillaBamSubset) library(GenomicAlignments) reads <- readGAlignments(untreated1_chr4()) ## Compute the full transcriptome coverage by calling ## coverageByTranscript() on 'dm3_transcripts': tx_cvg <- coverageByTranscript(reads, dm3_transcripts, ignore.strand=TRUE) tx_cvg ## A sanity check: stopifnot(identical(lengths(tx_cvg), sum(width(dm3_transcripts)))) ## We can also use pcoverageByTranscript() to compute 'tx_cvg'. ## For this we first create a GAlignmentsList object "parallel" to ## 'dm3_transcripts' where the i-th list element contains the aligned ## reads that overlap with the i-th transcript: hits <- findOverlaps(reads, dm3_transcripts, ignore.strand=TRUE) tx2reads <- setNames(as(t(hits), "List"), names(dm3_transcripts)) reads_by_tx <- extractList(reads, tx2reads) # GAlignmentsList object reads_by_tx ## Call pcoverageByTranscript(): tx_cvg2 <- pcoverageByTranscript(reads_by_tx, dm3_transcripts, ignore.strand=TRUE) stopifnot(identical(tx_cvg, tx_cvg2)) ## A more meaningful coverage is obtained by counting for each ## transcript only the reads that are *compatible* with its splicing: compat_hits <- findCompatibleOverlaps(reads, dm3_transcripts) tx2reads <- setNames(as(t(compat_hits), "List"), names(dm3_transcripts)) compat_reads_by_tx <- extractList(reads, tx2reads) tx_compat_cvg <- pcoverageByTranscript(compat_reads_by_tx, dm3_transcripts, ignore.strand=TRUE) ## A sanity check: stopifnot(all(all(tx_compat_cvg <= tx_cvg))) ## --------------------------------------------------------------------- ## 3. COMPUTE CDS COVERAGE OF A SET OF ALIGNED READS ## --------------------------------------------------------------------- ## coverageByTranscript() can also be used to compute CDS coverage: cds <- cdsBy(txdb, by="tx", use.names=TRUE) cds_cvg <- coverageByTranscript(reads, cds, ignore.strand=TRUE) cds_cvg ## A sanity check: stopifnot(identical(lengths(cds_cvg), sum(width(cds)))) ## --------------------------------------------------------------------- ## 4. ALTERNATIVELY, THE CDS COVERAGE CAN BE OBTAINED FROM THE ## TRANSCRIPT COVERAGE BY TRIMMING THE 5' AND 3' UTRS ## --------------------------------------------------------------------- tx_lens <- transcriptLengths(txdb, with.utr5_len=TRUE, with.utr3_len=TRUE) stopifnot(identical(tx_lens$tx_name, names(tx_cvg))) # sanity ## Keep the rows in 'tx_lens' that correspond to a list element in ## 'cds_cvg' and put them in the same order as in 'cds_cvg': m <- match(names(cds_cvg), names(tx_cvg)) tx_lens <- tx_lens[m, ] utr5_width <- tx_lens$utr5_len utr3_width <- tx_lens$utr3_len cds_cvg2 <- windows(tx_cvg[m], start=1L+utr5_width, end=-1L-utr3_width) ## A sanity check: stopifnot(identical(cds_cvg2, cds_cvg))
exonicParts
and intronicParts
extract the non-overlapping
(a.k.a. disjoint) exonic or intronic parts from a TxDb-like object.
exonicParts(txdb, linked.to.single.gene.only=FALSE) intronicParts(txdb, linked.to.single.gene.only=FALSE) ## 3 helper functions used internally by exonicParts() and intronicParts(): tidyTranscripts(txdb, drop.geneless=FALSE) tidyExons(txdb, drop.geneless=FALSE) tidyIntrons(txdb, drop.geneless=FALSE)
exonicParts(txdb, linked.to.single.gene.only=FALSE) intronicParts(txdb, linked.to.single.gene.only=FALSE) ## 3 helper functions used internally by exonicParts() and intronicParts(): tidyTranscripts(txdb, drop.geneless=FALSE) tidyExons(txdb, drop.geneless=FALSE) tidyIntrons(txdb, drop.geneless=FALSE)
txdb |
A TxDb object, or any TxDb-like object that supports the
|
linked.to.single.gene.only |
If If
|
drop.geneless |
If If Note that
|
exonicParts
returns a disjoint and strictly sorted
GRanges object with 1 range per exonic part
and with metadata columns tx_id
, tx_name
, gene_id
,
exon_id
, exon_name
, and exon_rank
.
If linked.to.single.gene.only
was set to TRUE
,
an additional exonic_part
metadata column is added that
indicates the rank of each exonic part within all the exonic parts
linked to the same gene.
intronicParts
returns a disjoint and strictly sorted
GRanges object with 1 range per intronic part
and with metadata columns tx_id
, tx_name
, and gene_id
.
If linked.to.single.gene.only
was set to TRUE
,
an additional intronic_part
metadata column is added that
indicates the rank of each intronic part within all the intronic parts
linked to the same gene.
tidyTranscripts
returns a GRanges object
with 1 range per transcript and with metadata columns tx_id
,
tx_name
, and gene_id
.
tidyExons
returns a GRanges object
with 1 range per exon and with metadata columns tx_id
,
tx_name
, gene_id
, exon_id
, exon_name
,
and exon_rank
.
tidyIntrons
returns a GRanges object
with 1 range per intron and with metadata columns tx_id
,
tx_name
, and gene_id
.
Hervé Pagès
disjoin
in the IRanges package.
transcripts
, transcriptsBy
,
and transcriptsByOverlaps
, for extracting
genomic feature locations from a TxDb-like object.
transcriptLengths
for extracting the transcript
lengths (and other metrics) from a TxDb object.
extendExonsIntoIntrons
for extending exons into
their adjacent introns.
extractTranscriptSeqs
for extracting transcript
(or CDS) sequences from chromosome sequences.
coverageByTranscript
for computing coverage by
transcript (or CDS) of a set of ranges.
The TxDb class.
library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ## --------------------------------------------------------------------- ## exonicParts() ## --------------------------------------------------------------------- exonic_parts1 <- exonicParts(txdb) exonic_parts1 ## Mapping from exonic parts to genes is many-to-many: gene_id1 <- mcols(exonic_parts1)$gene_id gene_id1 # CharacterList object table(lengths(gene_id1)) ## The number of known genes a Human exonic part can be linked to ## varies from 0 to 22! exonic_parts2 <- exonicParts(txdb, linked.to.single.gene.only=TRUE) exonic_parts2 ## Mapping from exonic parts to genes now is many-to-one: gene_id2 <- mcols(exonic_parts2)$gene_id gene_id2[1:20] # character vector ## Select exonic parts for a given gene: exonic_parts2[gene_id2 %in% "643837"] ## Sanity checks: stopifnot(isDisjoint(exonic_parts1), isStrictlySorted(exonic_parts1)) stopifnot(isDisjoint(exonic_parts2), isStrictlySorted(exonic_parts2)) stopifnot(all(exonic_parts2 %within% reduce(exonic_parts1))) stopifnot(identical( lengths(gene_id1) == 1L, exonic_parts1 %within% exonic_parts2 )) ## --------------------------------------------------------------------- ## intronicParts() ## --------------------------------------------------------------------- intronic_parts1 <- intronicParts(txdb) intronic_parts1 ## Mapping from intronic parts to genes is many-to-many: mcols(intronic_parts1)$gene_id table(lengths(mcols(intronic_parts1)$gene_id)) ## A Human intronic part can be linked to 0 to 22 known genes! intronic_parts2 <- intronicParts(txdb, linked.to.single.gene.only=TRUE) intronic_parts2 ## Mapping from intronic parts to genes now is many-to-one: class(mcols(intronic_parts2)$gene_id) # character vector ## Sanity checks: stopifnot(isDisjoint(intronic_parts1), isStrictlySorted(intronic_parts1)) stopifnot(isDisjoint(intronic_parts2), isStrictlySorted(intronic_parts2)) stopifnot(all(intronic_parts2 %within% reduce(intronic_parts1))) stopifnot(identical( lengths(mcols(intronic_parts1)$gene_id) == 1L, intronic_parts1 %within% intronic_parts2 )) ## --------------------------------------------------------------------- ## Helper functions ## --------------------------------------------------------------------- tidyTranscripts(txdb) # Ordered by 'tx_id'. tidyTranscripts(txdb, drop.geneless=TRUE) # Ordered first by 'gene_id', # then by 'tx_id'. tidyExons(txdb) # Ordered first by 'tx_id', # then by 'exon_rank'. tidyExons(txdb, drop.geneless=TRUE) # Ordered first by 'gene_id', # then by 'tx_id', # then by 'exon_rank'. tidyIntrons(txdb) # Ordered by 'tx_id'. tidyIntrons(txdb, drop.geneless=TRUE) # Ordered first by 'gene_id', # then by 'tx_id'.
library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ## --------------------------------------------------------------------- ## exonicParts() ## --------------------------------------------------------------------- exonic_parts1 <- exonicParts(txdb) exonic_parts1 ## Mapping from exonic parts to genes is many-to-many: gene_id1 <- mcols(exonic_parts1)$gene_id gene_id1 # CharacterList object table(lengths(gene_id1)) ## The number of known genes a Human exonic part can be linked to ## varies from 0 to 22! exonic_parts2 <- exonicParts(txdb, linked.to.single.gene.only=TRUE) exonic_parts2 ## Mapping from exonic parts to genes now is many-to-one: gene_id2 <- mcols(exonic_parts2)$gene_id gene_id2[1:20] # character vector ## Select exonic parts for a given gene: exonic_parts2[gene_id2 %in% "643837"] ## Sanity checks: stopifnot(isDisjoint(exonic_parts1), isStrictlySorted(exonic_parts1)) stopifnot(isDisjoint(exonic_parts2), isStrictlySorted(exonic_parts2)) stopifnot(all(exonic_parts2 %within% reduce(exonic_parts1))) stopifnot(identical( lengths(gene_id1) == 1L, exonic_parts1 %within% exonic_parts2 )) ## --------------------------------------------------------------------- ## intronicParts() ## --------------------------------------------------------------------- intronic_parts1 <- intronicParts(txdb) intronic_parts1 ## Mapping from intronic parts to genes is many-to-many: mcols(intronic_parts1)$gene_id table(lengths(mcols(intronic_parts1)$gene_id)) ## A Human intronic part can be linked to 0 to 22 known genes! intronic_parts2 <- intronicParts(txdb, linked.to.single.gene.only=TRUE) intronic_parts2 ## Mapping from intronic parts to genes now is many-to-one: class(mcols(intronic_parts2)$gene_id) # character vector ## Sanity checks: stopifnot(isDisjoint(intronic_parts1), isStrictlySorted(intronic_parts1)) stopifnot(isDisjoint(intronic_parts2), isStrictlySorted(intronic_parts2)) stopifnot(all(intronic_parts2 %within% reduce(intronic_parts1))) stopifnot(identical( lengths(mcols(intronic_parts1)$gene_id) == 1L, intronic_parts1 %within% intronic_parts2 )) ## --------------------------------------------------------------------- ## Helper functions ## --------------------------------------------------------------------- tidyTranscripts(txdb) # Ordered by 'tx_id'. tidyTranscripts(txdb, drop.geneless=TRUE) # Ordered first by 'gene_id', # then by 'tx_id'. tidyExons(txdb) # Ordered first by 'tx_id', # then by 'exon_rank'. tidyExons(txdb, drop.geneless=TRUE) # Ordered first by 'gene_id', # then by 'tx_id', # then by 'exon_rank'. tidyIntrons(txdb) # Ordered by 'tx_id'. tidyIntrons(txdb, drop.geneless=TRUE) # Ordered first by 'gene_id', # then by 'tx_id'.
extendExonsIntoIntrons
extends the supplied exons by a given
number of bases into their adjacent introns.
extendExonsIntoIntrons(ex_by_tx, extent=2)
extendExonsIntoIntrons(ex_by_tx, extent=2)
ex_by_tx |
A GRangesList object containing exons grouped
by transcript. This must be an object as returned by
|
extent |
Size of the extent in number of bases. 2 by default. The first exon in a transcript will be extended by that amount on its 3' side only. The last exon in a transcript will be extended by that amount on its 5' side only. All other exons (i.e. intermediate exons) will be extended by that amount on each side. Note that exons that belong to a single-exon transcript don't get extended. The default value of 2 corresponds to inclusion of the donor/acceptor intronic regions (typically GT/AG). |
A copy of GRangesList object ex_by_tx
where the original exon ranges have been extended.
Names and metadata columns on ex_by_tx
are propagated to the
result.
Hervé Pagès
transcripts
, transcriptsBy
,
and transcriptsByOverlaps
, for extracting
genomic feature locations from a TxDb-like object.
exonicParts
and intronicParts
for
extracting non-overlapping exonic or intronic parts from a
TxDb-like object.
extractTranscriptSeqs
for extracting transcript
(or CDS) sequences from chromosome sequences.
The TxDb class.
## With toy transcripts: ex_by_tx <- GRangesList( TX1="chr1:10-20:+", TX2=c("chr1:10-20:+", "chr1:50-75:+"), TX3=c("chr1:10-20:+", "chr1:50-75:+", "chr1:100-120:+"), TX4="chr1:10-20:-", TX5=c("chr1:10-20:-", "chr1:50-75:-"), TX6=c("chr1:10-20:-", "chr1:50-75:-", "chr1:100-120:-") ) extended <- extendExonsIntoIntrons(ex_by_tx, extent=2) extended[1:3] extended[4:6] ## With real-world transcripts: library(TxDb.Celegans.UCSC.ce11.ensGene) txdb <- TxDb.Celegans.UCSC.ce11.ensGene ex_by_tx <- exonsBy(txdb, by="tx") ex_by_tx extendExonsIntoIntrons(ex_by_tx, extent=2) ## Sanity check: stopifnot(identical(extendExonsIntoIntrons(ex_by_tx, extent=0), ex_by_tx))
## With toy transcripts: ex_by_tx <- GRangesList( TX1="chr1:10-20:+", TX2=c("chr1:10-20:+", "chr1:50-75:+"), TX3=c("chr1:10-20:+", "chr1:50-75:+", "chr1:100-120:+"), TX4="chr1:10-20:-", TX5=c("chr1:10-20:-", "chr1:50-75:-"), TX6=c("chr1:10-20:-", "chr1:50-75:-", "chr1:100-120:-") ) extended <- extendExonsIntoIntrons(ex_by_tx, extent=2) extended[1:3] extended[4:6] ## With real-world transcripts: library(TxDb.Celegans.UCSC.ce11.ensGene) txdb <- TxDb.Celegans.UCSC.ce11.ensGene ex_by_tx <- exonsBy(txdb, by="tx") ex_by_tx extendExonsIntoIntrons(ex_by_tx, extent=2) ## Sanity check: stopifnot(identical(extendExonsIntoIntrons(ex_by_tx, extent=0), ex_by_tx))
extractTranscriptSeqs
extracts transcript (or CDS) sequences from
an object representing a single chromosome or a collection of chromosomes.
extractTranscriptSeqs(x, transcripts, ...) ## S4 method for signature 'DNAString' extractTranscriptSeqs(x, transcripts, strand="+") ## S4 method for signature 'ANY' extractTranscriptSeqs(x, transcripts, ...)
extractTranscriptSeqs(x, transcripts, ...) ## S4 method for signature 'DNAString' extractTranscriptSeqs(x, transcripts, strand="+") ## S4 method for signature 'ANY' extractTranscriptSeqs(x, transcripts, ...)
x |
An object representing a single chromosome or a collection of chromosomes.
More precisely, Other objects representing a collection of chromosomes are supported
(e.g. FaFile objects in the Rsamtools package)
as long as |
transcripts |
An object representing the exon ranges of each transcript to extract. More precisely:
Note that, for each transcript, the exons must be ordered by ascending rank, that is, by ascending position in the transcript (when going in the 5' to 3' direction). This generally means (but not always) that they are also ordered from 5' to 3' on the reference genome. More precisely:
If |
... |
Additional arguments, for use in specific methods. For the default method, additional arguments are allowed only when
|
strand |
Only supported when Can be an atomic vector, a factor, or an Rle object,
in which case it indicates the strand of each transcript (i.e. all the
exons in a transcript are considered to be on the same strand).
More precisely: it's turned into a factor (or factor-Rle)
that has the "standard strand levels" (this is done by calling the
|
A DNAStringSet object parallel to
transcripts
, that is, the i-th element in it is the sequence
of the i-th transcript in transcripts
.
Hervé Pagès
coverageByTranscript
for computing coverage by
transcript (or CDS) of a set of ranges.
transcriptLengths
for extracting the transcript
lengths (and other metrics) from a TxDb object.
extendExonsIntoIntrons
for extending exons
into their adjacent introns.
The transcriptLocs2refLocs
function for converting
transcript-based locations into reference-based locations.
The available.genomes
function in the
BSgenome package for checking avaibility of BSgenome
data packages (and installing the desired one).
The DNAString and DNAStringSet classes defined and documented in the Biostrings package.
The translate
function in the
Biostrings package for translating DNA or RNA sequences
into amino acid sequences.
The GRangesList class defined and documented in the GenomicRanges package.
The IntegerRangesList class defined and documented in the IRanges package.
The exonsBy
function for extracting exon ranges
grouped by transcript.
The TxDb class.
## --------------------------------------------------------------------- ## 1. A TOY EXAMPLE ## --------------------------------------------------------------------- library(Biostrings) ## A chromosome of length 30: x <- DNAString("ATTTAGGACACTCCCTGAGGACAAGACCCC") ## 2 transcripts on 'x': tx1 <- IRanges(1, 8) # 1 exon tx2 <- c(tx1, IRanges(12, 30)) # 2 exons transcripts <- IRangesList(tx1=tx1, tx2=tx2) extractTranscriptSeqs(x, transcripts) ## By default, all the exons are considered to be on the plus strand. ## We can use the 'strand' argument to tell extractTranscriptSeqs() ## to extract them from the minus strand. ## Extract all the exons from the minus strand: extractTranscriptSeqs(x, transcripts, strand="-") ## Note that, for a transcript located on the minus strand, the exons ## should typically be ordered by descending position on the reference ## genome in order to reflect their rank in the transcript: extractTranscriptSeqs(x, IRangesList(tx1=tx1, tx2=rev(tx2)), strand="-") ## Extract the exon of the 1st transcript from the minus strand: extractTranscriptSeqs(x, transcripts, strand=c("-", "+")) ## Extract the 2nd exon of the 2nd transcript from the minus strand: extractTranscriptSeqs(x, transcripts, strand=list("-", c("+", "-"))) ## --------------------------------------------------------------------- ## 2. A REAL EXAMPLE ## --------------------------------------------------------------------- ## Load a genome: library(BSgenome.Hsapiens.UCSC.hg19) genome <- BSgenome.Hsapiens.UCSC.hg19 ## Load a TxDb object: txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) ## Check that 'txdb' is based on the hg19 assembly: txdb ## Extract the exon ranges grouped by transcript from 'txdb': transcripts <- exonsBy(txdb, by="tx", use.names=TRUE) ## Extract the transcript sequences from the genome: tx_seqs <- extractTranscriptSeqs(genome, transcripts) tx_seqs ## A sanity check: stopifnot(identical(width(tx_seqs), unname(sum(width(transcripts))))) ## Note that 'tx_seqs' can also be obtained with: extractTranscriptSeqs(genome, txdb, use.names=TRUE) ## --------------------------------------------------------------------- ## 3. USING extractTranscriptSeqs() TO EXTRACT CDS SEQUENCES ## --------------------------------------------------------------------- cds <- cdsBy(txdb, by="tx", use.names=TRUE) cds_seqs <- extractTranscriptSeqs(genome, cds) cds_seqs ## A sanity check: stopifnot(identical(width(cds_seqs), unname(sum(width(cds))))) ## Note that, alternatively, the CDS sequences can be obtained from the ## transcript sequences by removing the 5' and 3' UTRs: tx_lens <- transcriptLengths(txdb, with.utr5_len=TRUE, with.utr3_len=TRUE) stopifnot(identical(tx_lens$tx_name, names(tx_seqs))) # sanity ## Keep the rows in 'tx_lens' that correspond to a sequence in 'cds_seqs' ## and put them in the same order as in 'cds_seqs': m <- match(names(cds_seqs), names(tx_seqs)) tx_lens <- tx_lens[m, ] utr5_width <- tx_lens$utr5_len utr3_width <- tx_lens$utr3_len cds_seqs2 <- narrow(tx_seqs[m], start=utr5_width+1L, end=-(utr3_width+1L)) stopifnot(identical(as.character(cds_seqs2), as.character(cds_seqs))) ## --------------------------------------------------------------------- ## 4. TRANSLATE THE CDS SEQUENCES ## --------------------------------------------------------------------- prot_seqs <- translate(cds_seqs, if.fuzzy.codon="solve") ## Note that, by default, translate() uses The Standard Genetic Code to ## translate codons into amino acids. However, depending on the organism, ## a different genetic code might be needed to translate CDS sequences ## located on the mitochodrial chromosome. For example, for vertebrates, ## the following code could be used to correct 'prot_seqs': SGC1 <- getGeneticCode("SGC1") chrM_idx <- which(all(seqnames(cds) == "chrM")) prot_seqs[chrM_idx] <- translate(cds_seqs[chrM_idx], genetic.code=SGC1, if.fuzzy.codon="solve")
## --------------------------------------------------------------------- ## 1. A TOY EXAMPLE ## --------------------------------------------------------------------- library(Biostrings) ## A chromosome of length 30: x <- DNAString("ATTTAGGACACTCCCTGAGGACAAGACCCC") ## 2 transcripts on 'x': tx1 <- IRanges(1, 8) # 1 exon tx2 <- c(tx1, IRanges(12, 30)) # 2 exons transcripts <- IRangesList(tx1=tx1, tx2=tx2) extractTranscriptSeqs(x, transcripts) ## By default, all the exons are considered to be on the plus strand. ## We can use the 'strand' argument to tell extractTranscriptSeqs() ## to extract them from the minus strand. ## Extract all the exons from the minus strand: extractTranscriptSeqs(x, transcripts, strand="-") ## Note that, for a transcript located on the minus strand, the exons ## should typically be ordered by descending position on the reference ## genome in order to reflect their rank in the transcript: extractTranscriptSeqs(x, IRangesList(tx1=tx1, tx2=rev(tx2)), strand="-") ## Extract the exon of the 1st transcript from the minus strand: extractTranscriptSeqs(x, transcripts, strand=c("-", "+")) ## Extract the 2nd exon of the 2nd transcript from the minus strand: extractTranscriptSeqs(x, transcripts, strand=list("-", c("+", "-"))) ## --------------------------------------------------------------------- ## 2. A REAL EXAMPLE ## --------------------------------------------------------------------- ## Load a genome: library(BSgenome.Hsapiens.UCSC.hg19) genome <- BSgenome.Hsapiens.UCSC.hg19 ## Load a TxDb object: txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) ## Check that 'txdb' is based on the hg19 assembly: txdb ## Extract the exon ranges grouped by transcript from 'txdb': transcripts <- exonsBy(txdb, by="tx", use.names=TRUE) ## Extract the transcript sequences from the genome: tx_seqs <- extractTranscriptSeqs(genome, transcripts) tx_seqs ## A sanity check: stopifnot(identical(width(tx_seqs), unname(sum(width(transcripts))))) ## Note that 'tx_seqs' can also be obtained with: extractTranscriptSeqs(genome, txdb, use.names=TRUE) ## --------------------------------------------------------------------- ## 3. USING extractTranscriptSeqs() TO EXTRACT CDS SEQUENCES ## --------------------------------------------------------------------- cds <- cdsBy(txdb, by="tx", use.names=TRUE) cds_seqs <- extractTranscriptSeqs(genome, cds) cds_seqs ## A sanity check: stopifnot(identical(width(cds_seqs), unname(sum(width(cds))))) ## Note that, alternatively, the CDS sequences can be obtained from the ## transcript sequences by removing the 5' and 3' UTRs: tx_lens <- transcriptLengths(txdb, with.utr5_len=TRUE, with.utr3_len=TRUE) stopifnot(identical(tx_lens$tx_name, names(tx_seqs))) # sanity ## Keep the rows in 'tx_lens' that correspond to a sequence in 'cds_seqs' ## and put them in the same order as in 'cds_seqs': m <- match(names(cds_seqs), names(tx_seqs)) tx_lens <- tx_lens[m, ] utr5_width <- tx_lens$utr5_len utr3_width <- tx_lens$utr3_len cds_seqs2 <- narrow(tx_seqs[m], start=utr5_width+1L, end=-(utr3_width+1L)) stopifnot(identical(as.character(cds_seqs2), as.character(cds_seqs))) ## --------------------------------------------------------------------- ## 4. TRANSLATE THE CDS SEQUENCES ## --------------------------------------------------------------------- prot_seqs <- translate(cds_seqs, if.fuzzy.codon="solve") ## Note that, by default, translate() uses The Standard Genetic Code to ## translate codons into amino acids. However, depending on the organism, ## a different genetic code might be needed to translate CDS sequences ## located on the mitochodrial chromosome. For example, for vertebrates, ## the following code could be used to correct 'prot_seqs': SGC1 <- getGeneticCode("SGC1") chrM_idx <- which(all(seqnames(cds) == "chrM")) prot_seqs[chrM_idx] <- translate(cds_seqs[chrM_idx], genetic.code=SGC1, if.fuzzy.codon="solve")
extractUpstreamSeqs
is a generic function for extracting
sequences upstream of a supplied set of genes or transcripts.
extractUpstreamSeqs(x, genes, width=1000, ...) ## Dispatch is on the 2nd argument! ## S4 method for signature 'GenomicRanges' extractUpstreamSeqs(x, genes, width=1000) ## S4 method for signature 'TxDb' extractUpstreamSeqs(x, genes, width=1000, exclude.seqlevels=NULL)
extractUpstreamSeqs(x, genes, width=1000, ...) ## Dispatch is on the 2nd argument! ## S4 method for signature 'GenomicRanges' extractUpstreamSeqs(x, genes, width=1000) ## S4 method for signature 'TxDb' extractUpstreamSeqs(x, genes, width=1000, exclude.seqlevels=NULL)
x |
An object containing the chromosome sequences from which to extract the
upstream sequences. It can be a BSgenome,
TwoBitFile, or FaFile object,
or any genome sequence container.
More formally, |
genes |
An object containing the locations (i.e. chromosome name, start, end, and
strand) of the genes or transcripts with respect to the reference genome.
Only GenomicRanges and TxDb objects
are supported at the moment. If the latter, the gene locations are obtained
by calling the |
width |
How many bases to extract upstream of each TSS (transcription start site). |
... |
Additional arguments, for use in specific methods. |
exclude.seqlevels |
A character vector containing the chromosome names (a.k.a. sequence levels) to exclude when the genes are obtained from a TxDb object. |
A DNAStringSet object containing one upstream sequence
per gene (or per transcript if genes
is a
GenomicRanges object containing transcript ranges).
More precisely, if genes
is a GenomicRanges
object, the returned object is parallel to it, that is, the i-th
element in the returned object is the upstream sequence corresponding to
the i-th gene (or transcript) in genes
. Also the names on the
GenomicRanges object are propagated to the returned
object.
If genes
is a TxDb object, the names on the returned
object are the gene IDs found in the TxDb object. To see the
type of gene IDs (i.e. Entrez gene ID or Ensembl gene ID or ...), you can
display genes
with show(genes)
.
In addition, the returned object has the following metadata columns
(accessible with mcols
) that provide some information about
the gene (or transcript) corresponding to each upstream sequence:
gene_seqnames
: the chromosome name of the gene (or
transcript);
gene_strand
: the strand of the gene (or transcript);
gene_TSS
: the transcription start site of the gene (or
transcript).
IMPORTANT: Always make sure to use a TxDb package (or TxDb
object) that contains a gene model compatible with the genome sequence
container x
, that is, a gene model based on the exact same reference
genome as x
.
See
http://bioconductor.org/packages/release/BiocViews.html#___TxDb
for the list of TxDb packages available in the current release of
Bioconductor.
Note that you can make your own custom TxDb object from
various annotation resources by using one of the makeTxDbFrom*()
functions listed in the "See also" section below.
Hervé Pagès
makeTxDbFromUCSC
, makeTxDbFromBiomart
,
and makeTxDbFromEnsembl
, for making a TxDb
object from online resources.
makeTxDbFromGRanges
and makeTxDbFromGFF
for making a TxDb object from a GRanges
object, or from a GFF or GTF file.
The available.genomes
function in the
BSgenome package for checking avaibility of BSgenome
data packages (and installing the desired one).
The BSgenome, TwoBitFile, and FaFile classes, defined and documented in the BSgenome, rtracklayer, and Rsamtools packages, respectively.
The TxDb class.
The genes
function for extracting gene ranges from
a TxDb object.
The GenomicRanges class defined and documented in the GenomicRanges package.
The DNAStringSet class defined and documented in the Biostrings package.
The seqinfo
getter defined and documented
in the GenomeInfoDb package.
The getSeq
function for extracting
subsequences from a sequence container.
## Load a genome: library(BSgenome.Dmelanogaster.UCSC.dm3) genome <- BSgenome.Dmelanogaster.UCSC.dm3 genome ## Use a TxDb object: library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene txdb # contains Ensembl gene IDs ## Because the chrU and chrUextra sequences are made of concatenated ## scaffolds (see https://genome.ucsc.edu/cgi-bin/hgGateway?db=dm3), ## extracting the upstream sequences for genes located on these ## scaffolds is not reliable. So we exclude them: exclude <- c("chrU", "chrUextra") up1000seqs <- extractUpstreamSeqs(genome, txdb, width=1000, exclude.seqlevels=exclude) up1000seqs # the names are Ensembl gene IDs mcols(up1000seqs) ## Upstream sequences for genes close to the chromosome bounds can be ## shorter than 1000 (note that this does not happen for circular ## chromosomes like chrM): table(width(up1000seqs)) mcols(up1000seqs)[width(up1000seqs) != 1000, ]
## Load a genome: library(BSgenome.Dmelanogaster.UCSC.dm3) genome <- BSgenome.Dmelanogaster.UCSC.dm3 genome ## Use a TxDb object: library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene txdb # contains Ensembl gene IDs ## Because the chrU and chrUextra sequences are made of concatenated ## scaffolds (see https://genome.ucsc.edu/cgi-bin/hgGateway?db=dm3), ## extracting the upstream sequences for genes located on these ## scaffolds is not reliable. So we exclude them: exclude <- c("chrU", "chrUextra") up1000seqs <- extractUpstreamSeqs(genome, txdb, width=1000, exclude.seqlevels=exclude) up1000seqs # the names are Ensembl gene IDs mcols(up1000seqs) ## Upstream sequences for genes close to the chromosome bounds can be ## shorter than 1000 (note that this does not happen for circular ## chromosomes like chrM): table(width(up1000seqs)) mcols(up1000seqs)[width(up1000seqs) != 1000, ]
WARNING: The FeatureDb/makeFeatureDbFromUCSC/features code base is
no longer actively maintained and FeatureDb-related functionalities
might get deprecated in the near future. Please use
makeFeatureDbFromUCSC
for a convenient way to
import transcript annotations from UCSC online resources into
Bioconductor.
The FeatureDb class is a generic container for storing genomic locations of an arbitrary type of genomic features.
See ?TxDb
for a container for storing transcript
annotations.
See ?makeFeatureDbFromUCSC
for a convenient way to
make FeatureDb objects from BioMart online resources.
In the code snippets below, x
is a FeatureDb object.
metadata(x)
:Return x
's metadata in a data frame.
Marc Carlson
The TxDb class for storing transcript annotations.
makeFeatureDbFromUCSC
for a convenient way to
make a FeatureDb object from UCSC online resources.
saveDb
and loadDb
for
saving and loading the database content of a FeatureDb object.
features
for how to extract genomic features
from a FeatureDb object.
fdb_file <- system.file("extdata", "FeatureDb.sqlite", package="GenomicFeatures") fdb <- loadDb(fdb_file) fdb
fdb_file <- system.file("extdata", "FeatureDb.sqlite", package="GenomicFeatures") fdb <- loadDb(fdb_file) fdb
WARNING: The FeatureDb/makeFeatureDbFromUCSC/features code base is
no longer actively maintained and FeatureDb-related functionalities
might get deprecated in the near future. Please use
makeFeatureDbFromUCSC
for a convenient way to
import transcript annotations from UCSC online resources into
Bioconductor.
Generic function to extract genomic features from a FeatureDb object.
features(x) ## S4 method for signature 'FeatureDb' features(x)
features(x) ## S4 method for signature 'FeatureDb' features(x)
x |
A FeatureDb object. |
a GRanges object
M. Carlson
fdb <- loadDb(system.file("extdata", "FeatureDb.sqlite", package="GenomicFeatures")) features(fdb)
fdb <- loadDb(system.file("extdata", "FeatureDb.sqlite", package="GenomicFeatures")) features(fdb)
Extract promoter or terminator sequences for the genes or transcripts specified in the query (aGRanges or GRangesList object) from a BSgenome or FaFile object.
## S4 method for signature 'GRanges' getPromoterSeq(query, subject, upstream=2000, downstream=200) ## S4 method for signature 'GRanges' getTerminatorSeq(query, subject, upstream=2000, downstream=200) ## S4 method for signature 'GRangesList' getPromoterSeq(query, subject, upstream=2000, downstream=200) ## S4 method for signature 'GRangesList' getTerminatorSeq(query, subject, upstream=2000, downstream=200)
## S4 method for signature 'GRanges' getPromoterSeq(query, subject, upstream=2000, downstream=200) ## S4 method for signature 'GRanges' getTerminatorSeq(query, subject, upstream=2000, downstream=200) ## S4 method for signature 'GRangesList' getPromoterSeq(query, subject, upstream=2000, downstream=200) ## S4 method for signature 'GRangesList' getTerminatorSeq(query, subject, upstream=2000, downstream=200)
query |
A GRanges or GRangesList object containing genes grouped by transcript. |
subject |
A BSgenome or FaFile object from which the sequences will be taken. |
upstream |
The number of DNA bases to include upstream of the TSS (transcription start site) |
downstream |
The number of DNA bases to include downstream of the TSS (transcription start site) |
getPromoterSeq
and getTerminatorSeq
are generic functions
dispatching on query, which is either a GRanges or a GRangesList.
They are convenience wrappers for the promoters
, terminators
,
and getSeq
functions.
The purpose is to allow sequence extraction from either a
BSgenome or FaFile object.
Default values for upstream
and downstream
were chosen based
on our current understanding of gene regulation. On average, promoter
regions in the mammalian genome are 5000 bp upstream and downstream of the
transcription start site.
A DNAStringSet or DNAStringSetList instance corresponding to the GRanges or GRangesList supplied in the query.
Paul Shannon
The promoters
man page in the
GenomicRanges package for the promoters()
and
terminators()
methods for GenomicRanges
objects.
getSeq
in the Biostrings
package for extracting a set of sequences from a sequence
container like a BSgenome or
FaFile object.
library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(BSgenome.Hsapiens.UCSC.hg19) ## A GRangesList object describing all the known Human transcripts grouped ## by gene: txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene tx_by_gene <- transcriptsBy(txdb, by="gene") e2f3 <- "1871" # entrez geneID for a cell cycle control transcription # factor, chr6 on the plus strand ## A GRanges object describing the three transcripts for gene 1871: e2f3_tx <- tx_by_gene[[e2f3]] ## Promoter sequences for gene 1871: e2f3_promoter_seqs <- getPromoterSeq(e2f3_tx, Hsapiens, upstream=40, downstream=15) e2f3_promoter_seqs mcols(e2f3_promoter_seqs) ## Terminator sequences for gene 1871: e2f3_terminator_seqs <- getTerminatorSeq(e2f3_tx, Hsapiens, upstream=25, downstream=10) e2f3_terminator_seqs mcols(e2f3_terminator_seqs) # same as 'mcols(e2f3_promoter_seqs)' ## All Human promoter sequences grouped by gene: getPromoterSeq(tx_by_gene, Hsapiens, upstream=6, downstream=4)
library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(BSgenome.Hsapiens.UCSC.hg19) ## A GRangesList object describing all the known Human transcripts grouped ## by gene: txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene tx_by_gene <- transcriptsBy(txdb, by="gene") e2f3 <- "1871" # entrez geneID for a cell cycle control transcription # factor, chr6 on the plus strand ## A GRanges object describing the three transcripts for gene 1871: e2f3_tx <- tx_by_gene[[e2f3]] ## Promoter sequences for gene 1871: e2f3_promoter_seqs <- getPromoterSeq(e2f3_tx, Hsapiens, upstream=40, downstream=15) e2f3_promoter_seqs mcols(e2f3_promoter_seqs) ## Terminator sequences for gene 1871: e2f3_terminator_seqs <- getTerminatorSeq(e2f3_tx, Hsapiens, upstream=25, downstream=10) e2f3_terminator_seqs mcols(e2f3_terminator_seqs) # same as 'mcols(e2f3_promoter_seqs)' ## All Human promoter sequences grouped by gene: getPromoterSeq(tx_by_gene, Hsapiens, upstream=6, downstream=4)
Utility function for retrieving the mapping from the internal ids to the external names of a given feature type.
id2name(txdb, feature.type=c("tx", "exon", "cds"))
id2name(txdb, feature.type=c("tx", "exon", "cds"))
txdb |
A TxDb object. |
feature.type |
The feature type for which the mapping must be retrieved. |
Transcripts, exons and CDS parts in a TxDb object are
stored in seperate tables where the primary key is an integer
called feature internal id. This id is stored in the
"tx_id"
column for transcripts, in the "exon_id"
column for exons, and in the "cds_id"
column for CDS parts.
Unlike other commonly used ids like Entrez Gene IDs or Ensembl IDs,
this internal id was generated at the time the TxDb
object was created and has no meaning outside the scope of this object.
The id2name
function can be used to translate this internal
id into a more informative id or name called feature external
name. This name is stored in the "tx_name"
column for
transcripts, in the "exon_name"
column for exons, and in
the "cds_name"
column for CDS parts.
Note that, unlike the feature internal id, the feature external
name is not guaranteed to be unique or even defined (the column
can contain NA
s).
A named character vector where the names are the internal ids and the values the external names.
Hervé Pagès
transcripts
, transcriptsBy
,
and transcriptsByOverlaps
, for how to extract
genomic features from a TxDb object.
The TxDb class.
txdb1_file <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb1 <- loadDb(txdb1_file) id2name(txdb1, feature.type="tx")[1:4] id2name(txdb1, feature.type="exon")[1:4] id2name(txdb1, feature.type="cds")[1:4] txdb2_file <- system.file("extdata", "Biomart_Ensembl_sample.sqlite", package="GenomicFeatures") txdb2 <- loadDb(txdb2_file) id2name(txdb2, feature.type="tx")[1:4] id2name(txdb2, feature.type="exon")[1:4] id2name(txdb2, feature.type="cds")[1:4]
txdb1_file <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb1 <- loadDb(txdb1_file) id2name(txdb1, feature.type="tx")[1:4] id2name(txdb1, feature.type="exon")[1:4] id2name(txdb1, feature.type="cds")[1:4] txdb2_file <- system.file("extdata", "Biomart_Ensembl_sample.sqlite", package="GenomicFeatures") txdb2 <- loadDb(txdb2_file) id2name(txdb2, feature.type="tx")[1:4] id2name(txdb2, feature.type="exon")[1:4] id2name(txdb2, feature.type="cds")[1:4]
IMPORTANT NOTE: Starting with BioC 3.19, functions
supportedUCSCFeatureDbTracks()
, supportedUCSCFeatureDbTables()
,
UCSCFeatureDbTableSchema()
, and makeFeatureDbFromUCSC()
are
defined in the txdbmaker package.
txdbmaker::supportedUCSCFeatureDbTracks
,
txdbmaker::supportedUCSCFeatureDbTables
,
txdbmaker::UCSCFeatureDbTableSchema
,
and txdbmaker::makeFeatureDbFromUCSC
in the txdbmaker package.
IMPORTANT NOTE: Starting with BioC 3.19, the makeTxDb
function
is defined in the txdbmaker package.
txdbmaker::makeTxDb
in the txdbmaker
package.
IMPORTANT NOTE: Starting with BioC 3.19, functions
makeTxDbFromBiomart()
and getChromInfoFromBiomart()
are defined in the txdbmaker package.
txdbmaker::makeTxDbFromBiomart
and
txdbmaker::getChromInfoFromBiomart
in
the txdbmaker package.
IMPORTANT NOTE: Starting with BioC 3.19, the makeTxDbFromEnsembl
function is defined in the txdbmaker package.
txdbmaker::makeTxDbFromEnsembl
in the
txdbmaker package.
IMPORTANT NOTE: Starting with BioC 3.19, the makeTxDbFromGFF
function is defined in the txdbmaker package.
txdbmaker::makeTxDbFromGFF
in the txdbmaker
package.
IMPORTANT NOTE: Starting with BioC 3.19, the makeTxDbFromGRanges
function is defined in the txdbmaker package.
txdbmaker::makeTxDbFromGRanges
in the
txdbmaker package.
IMPORTANT NOTE: Starting with BioC 3.19, functions
makeTxDbFromUCSC()
, supportedUCSCtables()
,
and browseUCSCtrack()
are defined in the txdbmaker
package.
txdbmaker::makeTxDbFromUCSC
,
txdbmaker::supportedUCSCtables
,
and txdbmaker::browseUCSCtrack
in the txdbmaker package.
IMPORTANT NOTE: Starting with BioC 3.19, functions
makeTxDbPackageFromUCSC()
, makeFDbPackageFromUCSC()
,
makeTxDbPackageFromBiomart()
, makeTxDbPackage()
supportedMiRBaseBuildValues()
and makePackageName()
are defined in the txdbmaker package.
txdbmaker::makeTxDbPackageFromUCSC
,
txdbmaker::makeFDbPackageFromUCSC
,
txdbmaker::makeTxDbPackageFromBiomart
,
txdbmaker::makeTxDbPackage
,
txdbmaker::supportedMiRBaseBuildValues
,
and txdbmaker::makePackageName
in the txdbmaker package.
Map IDs to Genomic Ranges
mapIdsToRanges(x, ...) ## S4 method for signature 'TxDb' mapIdsToRanges(x, keys, type = c("cds", "exon", "tx", "gene"), columns = NULL)
mapIdsToRanges(x, ...) ## S4 method for signature 'TxDb' mapIdsToRanges(x, keys, type = c("cds", "exon", "tx", "gene"), columns = NULL)
x |
Database to use for mapping |
keys |
Values to lookup, passed to |
type |
Types of feature to return |
columns |
Additional metadata columns to include in the output |
... |
Additional arguments passed to methods |
GRangesList
corresponding to the keys
TxDb
: TxDb method
fl <- system.file(package = "GenomicFeatures", "extdata", "sample_ranges.rds") txdb <- makeTxDbFromGRanges(readRDS(fl)) keys <- list(tx_name = c("ENST00000371582", "ENST00000371588", "ENST00000494752", "ENST00000614008", "ENST00000496771")) mapIdsToRanges(txdb, keys = keys, type = "tx")
fl <- system.file(package = "GenomicFeatures", "extdata", "sample_ranges.rds") txdb <- makeTxDbFromGRanges(readRDS(fl)) keys <- list(tx_name = c("ENST00000371582", "ENST00000371588", "ENST00000494752", "ENST00000614008", "ENST00000496771")) mapIdsToRanges(txdb, keys = keys, type = "tx")
Map Genomic Ranges to IDs
mapRangesToIds(x, ...) ## S4 method for signature 'TxDb' mapRangesToIds(x, ranges, type = c("cds", "exon", "tx", "gene"), columns = NULL, ...)
mapRangesToIds(x, ...) ## S4 method for signature 'TxDb' mapRangesToIds(x, ranges, type = c("cds", "exon", "tx", "gene"), columns = NULL, ...)
x |
Database to use for mapping |
ranges |
range object used to subset |
type |
of feature to return |
columns |
additional metadata columns to include in the output. |
... |
Additional arguments passed to
|
DataFrame
of mcols from the database.
TxDb
: TxDb method
fl <- system.file(package = "GenomicFeatures", "extdata", "sample_ranges.rds") txdb <- makeTxDbFromGRanges(readRDS(fl)) keys <- list(tx_name = c("ENST00000371582", "ENST00000371588", "ENST00000494752", "ENST00000614008", "ENST00000496771")) res <- mapIdsToRanges(txdb, keys = keys, type = "tx") mapRangesToIds(txdb, res, "tx")
fl <- system.file(package = "GenomicFeatures", "extdata", "sample_ranges.rds") txdb <- makeTxDbFromGRanges(readRDS(fl)) keys <- list(tx_name = c("ENST00000371582", "ENST00000371588", "ENST00000494752", "ENST00000614008", "ENST00000496771")) res <- mapIdsToRanges(txdb, keys = keys, type = "tx") mapRangesToIds(txdb, res, "tx")
Map range coordinates between features in the transcriptome and genome (reference) space.
See ?mapToAlignments
in the
GenomicAlignments package for mapping coordinates between
reads (local) and genome (reference) space using a CIGAR alignment.
## mapping to transcripts ## S4 method for signature 'GenomicRanges,GenomicRanges' mapToTranscripts(x, transcripts, ignore.strand = FALSE) ## S4 method for signature 'GenomicRanges,GRangesList' mapToTranscripts(x, transcripts, ignore.strand = FALSE, intronJunctions=FALSE) ## S4 method for signature 'ANY,TxDb' mapToTranscripts(x, transcripts, ignore.strand = FALSE, extractor.fun = GenomicFeatures::transcripts, ...) ## S4 method for signature 'GenomicRanges,GRangesList' pmapToTranscripts(x, transcripts, ignore.strand = FALSE) ## mapping from transcripts ## S4 method for signature 'GenomicRanges,GRangesList' mapFromTranscripts(x, transcripts, ignore.strand = FALSE) ## S4 method for signature 'GenomicRanges,GRangesList' pmapFromTranscripts(x, transcripts, ignore.strand = FALSE) ## S4 method for signature 'IntegerRanges,GRangesList' pmapFromTranscripts(x, transcripts)
## mapping to transcripts ## S4 method for signature 'GenomicRanges,GenomicRanges' mapToTranscripts(x, transcripts, ignore.strand = FALSE) ## S4 method for signature 'GenomicRanges,GRangesList' mapToTranscripts(x, transcripts, ignore.strand = FALSE, intronJunctions=FALSE) ## S4 method for signature 'ANY,TxDb' mapToTranscripts(x, transcripts, ignore.strand = FALSE, extractor.fun = GenomicFeatures::transcripts, ...) ## S4 method for signature 'GenomicRanges,GRangesList' pmapToTranscripts(x, transcripts, ignore.strand = FALSE) ## mapping from transcripts ## S4 method for signature 'GenomicRanges,GRangesList' mapFromTranscripts(x, transcripts, ignore.strand = FALSE) ## S4 method for signature 'GenomicRanges,GRangesList' pmapFromTranscripts(x, transcripts, ignore.strand = FALSE) ## S4 method for signature 'IntegerRanges,GRangesList' pmapFromTranscripts(x, transcripts)
x |
GenomicRanges object of positions to be mapped.
The seqnames of |
transcripts |
A named GenomicRanges or
GRangesList object used to map between The |
ignore.strand |
When When Mapped position is computed by counting from the transcription start site
(TSS) and is not affected by the value of |
intronJunctions |
Logical to indicate if intronic ranges in This argument is only supported in When Ranges that have either the start or end in an intron are considered "non
hits" and are never mapped. Ranges that span introns are always mapped.
Neither of these range types are controlled by the |
extractor.fun |
Function to extract genomic features from a This argument is only applicable to Valid
|
... |
Additional arguments passed to |
In GenomicFeatures >= 1.21.10, the default for ignore.strand
was
changed to FALSE
for consistency with other methods in the
GenomicRanges and GenomicAlignments packages. Additionally,
the mapped position is computed from the TSS and does not depend on the
ignore.strand
argument.
See the section on ignore.strand
for details.
mapToTranscripts
, pmapToTranscripts
:The genomic range in x
is mapped to the local position in the
transcripts
ranges. A successful mapping occurs when x
is completely within the transcripts
range, equivalent to:
findOverlaps(..., type="within")
Transcriptome-based coordinates start counting at 1 at the beginning
of the transcripts
range and return positions where x
was aligned. The seqlevels of the return object are taken from the
transcripts
object and should be transcript names. In this
direction, mapping is attempted between all elements of x
and
all elements of transcripts
.
mapToTranscripts
uses findOverlaps
to map ranges in
x
to ranges in transcripts
. This method does not return
unmapped ranges.
pmapToTranscripts
maps the i-th range in x
to the
i-th range in transcripts
. Recycling is supported for both
x
and transcripts
when either is length == 1L; otherwise
the lengths must match. Ranges in x
that do not map (out of bounds
or strand mismatch) are returned as zero-width ranges starting at 0.
These ranges are given the seqname of "UNMAPPED".
mapFromTranscripts
, pmapFromTranscripts
:The transcript-based position in x
is mapped to genomic coordinates
using the ranges in transcripts
. A successful mapping occurs when
the following is TRUE:
width(transcripts) >= start(x) + width(x)
x
is aligned to transcripts
by moving in start(x)
positions in from the beginning of the transcripts
range. The
seqlevels of the return object are chromosome names.
mapFromTranscripts
uses the seqname of x
and the names
of transcripts
to determine mapping pairs (vs attempting to match
all possible pairs). Name matching is motivated by use cases such as
differentially expressed regions where the expressed regions in x
would only be related to a subset of regions in transcripts
.
This method does not return unmapped ranges.
pmapFromTranscripts
maps the i-th range in x
to the i-th
range in transcripts
and therefore does not use name matching.
Recycling is supported in pmapFromTranscripts
when either
x
or transcripts
is length == 1L; otherwise the lengths
must match. Ranges in x
that do not map (out of bounds or strand
mismatch) are returned as zero-width ranges starting at 0. These ranges
are given the seqname of "UNMAPPED".
pmapToTranscripts
returns a GRanges
the same length as
x
.
pmapFromTranscripts
returns a GRanges
when transcripts
is a GRanges
and a GRangesList
when transcripts
is a GRangesList
. In both cases the return object is the same
length as x
. The rational for returning the GRangesList
is
to preserve exon structure; ranges in a list element that are not overlapped
by x
are returned as a zero-width range. The GRangesList
return object will have no seqlevels called "UNMAPPED"; those will only
occur when a GRanges
is returned.
mapToTranscripts
and mapFromTranscripts
return GRanges
objects that vary in length similar to a Hits
object. The result
contains mapped records only; strand mismatch and out of bound ranges are
not returned. xHits
and transcriptsHits
metadata columns
(similar to the queryHits
and subjectHits
of a Hits
object) indicate elements of x
and transcripts
used in
the mapping.
When intronJunctions
is TRUE, mapToTranscripts
returns an
extra metdata column named intronic
to identify the intron ranges.
When mapping to transcript coordinates, seqlevels of the output are the names
on the transcripts
object and most often these will be transcript
names. When mapping to the genome, seqlevels of the output are the seqlevels
of transcripts
which are usually chromosome names.
V. Obenchain, M. Lawrence and H. Pagès
?mapToAlignments
in the
GenomicAlignments package for methods mapping between
reads and genome space using a CIGAR alignment.
## --------------------------------------------------------------------- ## A. Basic Use: Conversion between CDS and Exon coordinates and the ## genome ## --------------------------------------------------------------------- ## Gene "Dgkb" has ENTREZID "217480": library(org.Mm.eg.db) Dgkb_geneid <- get("Dgkb", org.Mm.egSYMBOL2EG) ## The gene is on the positive strand, chromosome 12: library(TxDb.Mmusculus.UCSC.mm10.knownGene) txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene tx_by_gene <- transcriptsBy(txdb, by="gene") Dgkb_transcripts <- tx_by_gene[[Dgkb_geneid]] Dgkb_transcripts # all 7 Dgkb transcripts are on chr12, positive strand ## To map coordinates from local CDS or exon space to genome ## space use mapFromTranscripts(). ## When mapping CDS coordinates to genome space the 'transcripts' ## argument is the collection of CDS parts by transcript. coord <- GRanges("chr12", IRanges(4, width=1)) ## Get the names of the transcripts in the gene: Dgkb_tx_names <- mcols(Dgkb_transcripts)$tx_name Dgkb_tx_names ## Use these names to isolate the region of interest: cds_by_tx <- cdsBy(txdb, "tx", use.names=TRUE) Dgkb_cds_by_tx <- cds_by_tx[intersect(Dgkb_tx_names, names(cds_by_tx))] ## Dgkb CDS parts grouped by transcript (no-CDS transcripts omitted): Dgkb_cds_by_tx lengths(Dgkb_cds_by_tx) # nb of CDS parts per transcript ## A requirement for mapping from transcript space to genome space ## is that seqnames in 'x' match the names in 'transcripts'. names(Dgkb_cds_by_tx) <- rep(seqnames(coord), length(Dgkb_cds_by_tx)) ## There are 6 results, one for each transcript. mapFromTranscripts(coord, Dgkb_cds_by_tx) ## To map exon coordinates to genome space the 'transcripts' ## argument is the collection of exon regions by transcript. coord <- GRanges("chr12", IRanges(100, width=1)) ex_by_tx <- exonsBy(txdb, "tx", use.names=TRUE) Dgkb_ex_by_tx <- ex_by_tx[Dgkb_tx_names] names(Dgkb_ex_by_tx) <- rep(seqnames(coord), length(Dgkb_ex_by_tx)) ## Again the output has 6 results, one for each transcript. mapFromTranscripts(coord, Dgkb_ex_by_tx) ## To go the reverse direction and map from genome space to ## local CDS or exon space, use mapToTranscripts(). ## Genomic position 37981944 maps to CDS position 4: coord <- GRanges("chr12", IRanges(37981944, width=1)) mapToTranscripts(coord, Dgkb_cds_by_tx) ## Genomic position 37880273 maps to exon position 100: coord <- GRanges("chr12", IRanges(37880273, width=1)) mapToTranscripts(coord, Dgkb_ex_by_tx) ## The following examples use more than 2GB of memory, which is more ## than what 32-bit Windows can handle: is_32bit_windows <- .Platform$OS.type == "windows" && .Platform$r_arch == "i386" if (!is_32bit_windows) { ## --------------------------------------------------------------------- ## B. Map sequence locations in exons to the genome ## --------------------------------------------------------------------- ## NAGNAG alternative splicing plays an essential role in biological ## processes and represents a highly adaptable system for ## posttranslational regulation of gene function. The majority of ## NAGNAG studies largely focus on messenger RNA. A study by Sun, ## Lin, and Yan (http://www.hindawi.com/journals/bmri/2014/736798/) ## demonstrated that NAGNAG splicing is also operative in large ## intergenic noncoding RNA (lincRNA). One finding of interest was ## that linc-POLR3G-10 exhibited two NAGNAG acceptors located in two ## distinct transcripts: TCONS_00010012 and TCONS_00010010. ## Extract the exon coordinates of TCONS_00010012 and TCONS_00010010: lincrna <- c("TCONS_00010012", "TCONS_00010010") library(TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts) txdb <- TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts exons <- exonsBy(txdb, by="tx", use.names=TRUE)[lincrna] exons ## The two NAGNAG acceptors were identified in the upstream region of ## the fourth and fifth exons located in TCONS_00010012. ## Extract the sequences for transcript TCONS_00010012: library(BSgenome.Hsapiens.UCSC.hg19) genome <- BSgenome.Hsapiens.UCSC.hg19 exons_seq <- getSeq(genome, exons[[1]]) ## TCONS_00010012 has 4 exons: exons_seq ## The most common triplet among the lincRNA sequences was CAG. Identify ## the location of this pattern in all exons. cag_loc <- vmatchPattern("CAG", exons_seq) ## Convert the first occurance of CAG in each exon back to genome ## coordinates. first_loc <- do.call(c, sapply(cag_loc, "[", 1, simplify=TRUE)) pmapFromTranscripts(first_loc, exons[[1]]) ## --------------------------------------------------------------------- ## C. Map dbSNP variants to CDS or cDNA coordinates ## --------------------------------------------------------------------- ## The GIPR gene encodes a G-protein coupled receptor for gastric ## inhibitory polypeptide (GIP). Originally GIP was identified to ## inhibited gastric acid secretion and gastrin release but was later ## demonstrated to stimulate insulin release in the presence of elevated ## glucose. ## In this example 5 SNPs located in the GIPR gene are mapped to cDNA ## coordinates. A list of SNPs in GIPR can be downloaded from dbSNP or ## NCBI. rsids <- c("rs4803846", "rs139322374", "rs7250736", "rs7250754", "rs9749185") ## Extract genomic coordinates with a SNPlocs package. library(SNPlocs.Hsapiens.dbSNP144.GRCh38) snps <- snpsById(SNPlocs.Hsapiens.dbSNP144.GRCh38, rsids) ## Gene regions of GIPR can be extracted from a TxDb package of ## compatible build. The TxDb package uses Entrez gene identifiers ## and GIPR is a gene symbol. Let's first lookup its Entrez gene ID. library(org.Hs.eg.db) GIPR_geneid <- get("GIPR", org.Hs.egSYMBOL2EG) ## The transcriptsBy() extractor returns a range for each transcript that ## includes the UTR and exon regions (i.e., cDNA). library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene tx_by_gene <- transcriptsBy(txdb, "gene") GIPR_transcripts <- tx_by_gene[GIPR_geneid] GIPR_transcripts # all 8 GIPR transcripts are on chr19, positive strand ## Before mapping, the chromosome names (seqlevels) in the two ## objects must be harmonized. The style is NCBI for 'snps' and ## UCSC for 'GIPR_transcripts'. seqlevelsStyle(snps) seqlevelsStyle(GIPR_transcripts) ## Modify the style (and genome) in 'snps' to match 'GIPR_transcripts'. seqlevelsStyle(snps) <- seqlevelsStyle(GIPR_transcripts) ## The 'GIPR_transcripts' object is a GRangesList of length 1. This single ## list element contains the cDNA range for 8 different transcripts. To ## map to each transcript individually 'GIPR_transcripts' must be unlisted ## before mapping. ## Map all 5 SNPS to all 8 transcripts: mapToTranscripts(snps, unlist(GIPR_transcripts)) ## Map the first SNP to transcript "ENST00000590918.5" and the second to ## "ENST00000263281.7". pmapToTranscripts(snps[1:2], unlist(GIPR_transcripts)[1:2]) ## The cdsBy() extractor returns CDS parts by gene or by transcript. ## Extract the CDS parts for transcript "ENST00000263281.7". cds <- cdsBy(txdb, "tx", use.names=TRUE)["ENST00000263281.7"] cds ## The 'cds' object is a GRangesList of length 1 containing the ranges of ## all CDS parts for single transcript "ENST00000263281.7". ## To map to the concatenated group of ranges leave 'cds' as a GRangesList. mapToTranscripts(snps, cds) ## Only the second SNP could be mapped. Unlisting the 'cds' object maps ## the SNPs to the individual cds ranges (vs the concatenated range). mapToTranscripts(snps[2], unlist(cds)) ## The location is the same because the SNP hit the first CDS part. If ## the transcript were on the "-" strand the difference in concatenated ## vs non-concatenated position would be more obvious. ## Change strand: strand(cds) <- strand(snps) <- "-" mapToTranscripts(snps[2], unlist(cds)) }
## --------------------------------------------------------------------- ## A. Basic Use: Conversion between CDS and Exon coordinates and the ## genome ## --------------------------------------------------------------------- ## Gene "Dgkb" has ENTREZID "217480": library(org.Mm.eg.db) Dgkb_geneid <- get("Dgkb", org.Mm.egSYMBOL2EG) ## The gene is on the positive strand, chromosome 12: library(TxDb.Mmusculus.UCSC.mm10.knownGene) txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene tx_by_gene <- transcriptsBy(txdb, by="gene") Dgkb_transcripts <- tx_by_gene[[Dgkb_geneid]] Dgkb_transcripts # all 7 Dgkb transcripts are on chr12, positive strand ## To map coordinates from local CDS or exon space to genome ## space use mapFromTranscripts(). ## When mapping CDS coordinates to genome space the 'transcripts' ## argument is the collection of CDS parts by transcript. coord <- GRanges("chr12", IRanges(4, width=1)) ## Get the names of the transcripts in the gene: Dgkb_tx_names <- mcols(Dgkb_transcripts)$tx_name Dgkb_tx_names ## Use these names to isolate the region of interest: cds_by_tx <- cdsBy(txdb, "tx", use.names=TRUE) Dgkb_cds_by_tx <- cds_by_tx[intersect(Dgkb_tx_names, names(cds_by_tx))] ## Dgkb CDS parts grouped by transcript (no-CDS transcripts omitted): Dgkb_cds_by_tx lengths(Dgkb_cds_by_tx) # nb of CDS parts per transcript ## A requirement for mapping from transcript space to genome space ## is that seqnames in 'x' match the names in 'transcripts'. names(Dgkb_cds_by_tx) <- rep(seqnames(coord), length(Dgkb_cds_by_tx)) ## There are 6 results, one for each transcript. mapFromTranscripts(coord, Dgkb_cds_by_tx) ## To map exon coordinates to genome space the 'transcripts' ## argument is the collection of exon regions by transcript. coord <- GRanges("chr12", IRanges(100, width=1)) ex_by_tx <- exonsBy(txdb, "tx", use.names=TRUE) Dgkb_ex_by_tx <- ex_by_tx[Dgkb_tx_names] names(Dgkb_ex_by_tx) <- rep(seqnames(coord), length(Dgkb_ex_by_tx)) ## Again the output has 6 results, one for each transcript. mapFromTranscripts(coord, Dgkb_ex_by_tx) ## To go the reverse direction and map from genome space to ## local CDS or exon space, use mapToTranscripts(). ## Genomic position 37981944 maps to CDS position 4: coord <- GRanges("chr12", IRanges(37981944, width=1)) mapToTranscripts(coord, Dgkb_cds_by_tx) ## Genomic position 37880273 maps to exon position 100: coord <- GRanges("chr12", IRanges(37880273, width=1)) mapToTranscripts(coord, Dgkb_ex_by_tx) ## The following examples use more than 2GB of memory, which is more ## than what 32-bit Windows can handle: is_32bit_windows <- .Platform$OS.type == "windows" && .Platform$r_arch == "i386" if (!is_32bit_windows) { ## --------------------------------------------------------------------- ## B. Map sequence locations in exons to the genome ## --------------------------------------------------------------------- ## NAGNAG alternative splicing plays an essential role in biological ## processes and represents a highly adaptable system for ## posttranslational regulation of gene function. The majority of ## NAGNAG studies largely focus on messenger RNA. A study by Sun, ## Lin, and Yan (http://www.hindawi.com/journals/bmri/2014/736798/) ## demonstrated that NAGNAG splicing is also operative in large ## intergenic noncoding RNA (lincRNA). One finding of interest was ## that linc-POLR3G-10 exhibited two NAGNAG acceptors located in two ## distinct transcripts: TCONS_00010012 and TCONS_00010010. ## Extract the exon coordinates of TCONS_00010012 and TCONS_00010010: lincrna <- c("TCONS_00010012", "TCONS_00010010") library(TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts) txdb <- TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts exons <- exonsBy(txdb, by="tx", use.names=TRUE)[lincrna] exons ## The two NAGNAG acceptors were identified in the upstream region of ## the fourth and fifth exons located in TCONS_00010012. ## Extract the sequences for transcript TCONS_00010012: library(BSgenome.Hsapiens.UCSC.hg19) genome <- BSgenome.Hsapiens.UCSC.hg19 exons_seq <- getSeq(genome, exons[[1]]) ## TCONS_00010012 has 4 exons: exons_seq ## The most common triplet among the lincRNA sequences was CAG. Identify ## the location of this pattern in all exons. cag_loc <- vmatchPattern("CAG", exons_seq) ## Convert the first occurance of CAG in each exon back to genome ## coordinates. first_loc <- do.call(c, sapply(cag_loc, "[", 1, simplify=TRUE)) pmapFromTranscripts(first_loc, exons[[1]]) ## --------------------------------------------------------------------- ## C. Map dbSNP variants to CDS or cDNA coordinates ## --------------------------------------------------------------------- ## The GIPR gene encodes a G-protein coupled receptor for gastric ## inhibitory polypeptide (GIP). Originally GIP was identified to ## inhibited gastric acid secretion and gastrin release but was later ## demonstrated to stimulate insulin release in the presence of elevated ## glucose. ## In this example 5 SNPs located in the GIPR gene are mapped to cDNA ## coordinates. A list of SNPs in GIPR can be downloaded from dbSNP or ## NCBI. rsids <- c("rs4803846", "rs139322374", "rs7250736", "rs7250754", "rs9749185") ## Extract genomic coordinates with a SNPlocs package. library(SNPlocs.Hsapiens.dbSNP144.GRCh38) snps <- snpsById(SNPlocs.Hsapiens.dbSNP144.GRCh38, rsids) ## Gene regions of GIPR can be extracted from a TxDb package of ## compatible build. The TxDb package uses Entrez gene identifiers ## and GIPR is a gene symbol. Let's first lookup its Entrez gene ID. library(org.Hs.eg.db) GIPR_geneid <- get("GIPR", org.Hs.egSYMBOL2EG) ## The transcriptsBy() extractor returns a range for each transcript that ## includes the UTR and exon regions (i.e., cDNA). library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene tx_by_gene <- transcriptsBy(txdb, "gene") GIPR_transcripts <- tx_by_gene[GIPR_geneid] GIPR_transcripts # all 8 GIPR transcripts are on chr19, positive strand ## Before mapping, the chromosome names (seqlevels) in the two ## objects must be harmonized. The style is NCBI for 'snps' and ## UCSC for 'GIPR_transcripts'. seqlevelsStyle(snps) seqlevelsStyle(GIPR_transcripts) ## Modify the style (and genome) in 'snps' to match 'GIPR_transcripts'. seqlevelsStyle(snps) <- seqlevelsStyle(GIPR_transcripts) ## The 'GIPR_transcripts' object is a GRangesList of length 1. This single ## list element contains the cDNA range for 8 different transcripts. To ## map to each transcript individually 'GIPR_transcripts' must be unlisted ## before mapping. ## Map all 5 SNPS to all 8 transcripts: mapToTranscripts(snps, unlist(GIPR_transcripts)) ## Map the first SNP to transcript "ENST00000590918.5" and the second to ## "ENST00000263281.7". pmapToTranscripts(snps[1:2], unlist(GIPR_transcripts)[1:2]) ## The cdsBy() extractor returns CDS parts by gene or by transcript. ## Extract the CDS parts for transcript "ENST00000263281.7". cds <- cdsBy(txdb, "tx", use.names=TRUE)["ENST00000263281.7"] cds ## The 'cds' object is a GRangesList of length 1 containing the ranges of ## all CDS parts for single transcript "ENST00000263281.7". ## To map to the concatenated group of ranges leave 'cds' as a GRangesList. mapToTranscripts(snps, cds) ## Only the second SNP could be mapped. Unlisting the 'cds' object maps ## the SNPs to the individual cds ranges (vs the concatenated range). mapToTranscripts(snps[2], unlist(cds)) ## The location is the same because the SNP hit the first CDS part. If ## the transcript were on the "-" strand the difference in concatenated ## vs non-concatenated position would be more obvious. ## Change strand: strand(cds) <- strand(snps) <- "-" mapToTranscripts(snps[2], unlist(cds)) }
Generic functions to extract microRNA or tRNA genomic ranges from an object. This page documents the methods for TxDb objects only.
microRNAs(x) ## S4 method for signature 'TxDb' microRNAs(x) tRNAs(x) ## S4 method for signature 'TxDb' tRNAs(x)
microRNAs(x) ## S4 method for signature 'TxDb' microRNAs(x) tRNAs(x) ## S4 method for signature 'TxDb' tRNAs(x)
x |
A TxDb object. |
A GRanges object.
M. Carlson
transcripts
, transcriptsBy
, and
transcriptsByOverlaps
for the core genomic features
extractors.
The TxDb class.
## Not run: library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(mirbase.db) microRNAs(TxDb.Hsapiens.UCSC.hg19.knownGene) ## End(Not run)
## Not run: library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(mirbase.db) microRNAs(TxDb.Hsapiens.UCSC.hg19.knownGene) ## End(Not run)
A distance()
method for TxDb objects.
## S4 method for signature 'GenomicRanges,TxDb' distance(x, y, ignore.strand=FALSE, ..., id, type=c("gene", "tx", "exon", "cds"))
## S4 method for signature 'GenomicRanges,TxDb' distance(x, y, ignore.strand=FALSE, ..., id, type=c("gene", "tx", "exon", "cds"))
x |
The query GenomicRanges instance. |
y |
A TxDb object. The |
id |
A |
type |
A |
ignore.strand |
A |
... |
Additional arguments for methods. |
This distance()
method returns the distance for each range in x
to the range extracted from the TxDb object y
. Values in
id
are matched to one of ‘gene_id’, ‘tx_id’,
‘exon_id’ or ‘cds_id’ identifiers in the TxDb
and the corresponding ranges are extracted. The type
argument
specifies which identifier is represented in id
. The extracted
ranges are used in the distance calculation with the ranges in x
.
The method returns NA
values when the genomic region defined
by id
cannot be collapsed into a single range (e.g.,
when a gene spans multiple chromosomes) or if the id
is not found in y
.
The behavior of distance()
with respect to zero-width ranges
has changed in Bioconductor 2.12. See the man page ?distance
in the IRanges for details.
An integer vector of distances between the ranges in x
and y
.
Valerie Obenchain <[email protected]>
nearest-methods in the IRanges package.
nearest-methods in the GenomicRanges package.
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene gr <- GRanges(c("chr2L", "chr2R"), IRanges(c(100000, 200000), width=100)) distance(gr, txdb, id=c("FBgn0259717", "FBgn0261501"), type="gene") distance(gr, txdb, id=c("10000", "23000"), type="cds") ## The id's must be in the appropriate order with respect to 'x'. distance(gr, txdb, id=c("4", "4097"), type="tx") ## 'id' "4" is on chr2L and "4097" is on chr2R. transcripts(txdb, filter=list(tx_id=c("4", "4097"))) ## If we reverse the 'id' the chromosomes are incompatable with gr. distance(gr, txdb, id=c("4097", "4"), type="tx") ## distance() compares each 'x' to the corresponding 'y'. ## If an 'id' is not found in the TxDb 'y' will not ## be the same lenth as 'x' and an error is thrown. ## Not run: distance(gr, txdb, id=c("FBgn0000008", "INVALID"), type="gene") ## will fail ## End(Not run)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene gr <- GRanges(c("chr2L", "chr2R"), IRanges(c(100000, 200000), width=100)) distance(gr, txdb, id=c("FBgn0259717", "FBgn0261501"), type="gene") distance(gr, txdb, id=c("10000", "23000"), type="cds") ## The id's must be in the appropriate order with respect to 'x'. distance(gr, txdb, id=c("4", "4097"), type="tx") ## 'id' "4" is on chr2L and "4097" is on chr2R. transcripts(txdb, filter=list(tx_id=c("4", "4097"))) ## If we reverse the 'id' the chromosomes are incompatable with gr. distance(gr, txdb, id=c("4097", "4"), type="tx") ## distance() compares each 'x' to the corresponding 'y'. ## If an 'id' is not found in the TxDb 'y' will not ## be the same lenth as 'x' and an error is thrown. ## Not run: distance(gr, txdb, id=c("FBgn0000008", "INVALID"), type="gene") ## will fail ## End(Not run)
proteinToGenome
is a generic function for mapping
ranges of protein-relative positions to the genome.
NOTE: This man page is for the proteinToGenome
S4 generic
function and methods defined in the GenomicFeatures package,
which are (loosely) modeled on the proteinToGenome
function from the ensembldb package.
See ?ensembldb::proteinToGenome
for the latter.
## S4 generic function: proteinToGenome(x, db, ...) # dispatch is on 2nd argument 'db' ## S4 method for signature 'ANY' proteinToGenome(x, db) ## S4 method for signature 'GRangesList' proteinToGenome(x, db)
## S4 generic function: proteinToGenome(x, db, ...) # dispatch is on 2nd argument 'db' ## S4 method for signature 'ANY' proteinToGenome(x, db) ## S4 method for signature 'GRangesList' proteinToGenome(x, db)
x |
A named IRanges object (or derivative) containing ranges of protein-relative positions (protein-relative positions are positions relative to a protein sequence). The names on mcols(transcripts(db, columns="tx_name"))$tx_name And for the method for GRangesList objects,
names(db) |
db |
For the default For the method for GRangesList objects: A named GRangesList object (or derivative) where each list element is a GRanges object representing a CDS (the ranges in the GRanges object must represent the CDS parts ordered by ascending exon rank). |
... |
Further arguments to be passed to specific methods. |
The proteinToGenome()
method for GRangesList
objects is the workhorse behind the default method. Note that the latter
is a thin wrapper around the former, which simply does the following:
Use cdsBy()
to extract the CDS parts from db
.
The CDS parts are returned in a GRangesList
object that has the names of the transcript on it (one transcript
name per list element).
Call proteinToGenome()
on x
and the
GRangesList object returned by
cdsBy()
.
A named GRangesList object parallel to
x
(the transcript names on x
are propagated).
The i-th list element in the returned object is the result of mapping
the range of protein-relative positions x[i]
to the genome.
Note that a given range in x
can only be mapped to the genome
if the name on it is the name of a coding transcript. If it's
not (i.e. if it's the name of a non-coding transcript), then
an empty GRanges object is placed in the returned
object to indicate the impossible mapping, and a warning is issued.
Otherwise, if a given range in x
can be mapped to the
genome, then the result of the mapping is represented by a
non-empty GRanges object.
Note that this object represents the original CDS associated to
x
, trimmed on its 5' end or 3' end, or on both.
Furthermore, this object will have the same metadata columns as the
GRanges object representing the original CDS,
plus the 2 following ones:
protein_start
: The protein-relative start of the mapping.
protein_end
: The protein-relative end of the mapping.
Unlike ensembldb::proteinToGenome()
which
can work either with Ensembl protein IDs or Ensembl transcript IDs
on x
, the default proteinToGenome()
method described
above only accepts transcript names on x
.
This means that, if the user is in possession of protein IDs, they must first replace them with the corresponding transcript IDs (referred to as transcript names in the context of TxDb objects). How to do this exactly depends on the origin of those IDs (UCSC, Ensembl, GTF/GFF3 file, FlyBase, etc...)
H. Pagès, using ensembldb::proteinToGenome()
for
inspiration and design.
The proteinToGenome
function in the
ensembldb package, which the proteinToGenome()
generic and methods documented in this man page are (loosely)
modeled on.
TxDb objects.
transcripts
for extracting transcripts from a
TxDb-like object.
IRanges objects in the IRanges package.
GRanges and GRangesList objects in the GenomicRanges package.
## --------------------------------------------------------------------- ## USING TOY CDS ## --------------------------------------------------------------------- ## CDS1 has 2 CDS parts: CDS1 <- GRanges(c("chrX:11-60:+", "chrX:101-125:+")) ## CDS2 has 3 CDS parts: CDS2 <- GRanges(c("chrY:201-230:-", "chrY:101-125:-", "chrY:11-60:-")) ## Put them in a GRangesList object: cds_by_tx <- GRangesList(TX1=CDS1, TX2=CDS2) cds_by_tx x1 <- IRanges(start=8, end=20, names="TX1") proteinToGenome(x1, cds_by_tx) x2 <- IRanges(start=c(1, 18), end=c(25, 20), names=c("TX1", "TX1")) x2 proteinToGenome(x2, cds_by_tx) x3 <- IRanges(start=8, end=15, names="TX2") proteinToGenome(x3, cds_by_tx) x4 <- c(x3, x2) x4 proteinToGenome(x4, cds_by_tx) ## --------------------------------------------------------------------- ## USING A TxDb OBJECT ## --------------------------------------------------------------------- library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene ## The first transcript (FBtr0309810) is non-coding: x <- IRanges(c(FBtr0309810="11-55", FBtr0306539="90-300")) res <- proteinToGenome(x, txdb) res
## --------------------------------------------------------------------- ## USING TOY CDS ## --------------------------------------------------------------------- ## CDS1 has 2 CDS parts: CDS1 <- GRanges(c("chrX:11-60:+", "chrX:101-125:+")) ## CDS2 has 3 CDS parts: CDS2 <- GRanges(c("chrY:201-230:-", "chrY:101-125:-", "chrY:11-60:-")) ## Put them in a GRangesList object: cds_by_tx <- GRangesList(TX1=CDS1, TX2=CDS2) cds_by_tx x1 <- IRanges(start=8, end=20, names="TX1") proteinToGenome(x1, cds_by_tx) x2 <- IRanges(start=c(1, 18), end=c(25, 20), names=c("TX1", "TX1")) x2 proteinToGenome(x2, cds_by_tx) x3 <- IRanges(start=8, end=15, names="TX2") proteinToGenome(x3, cds_by_tx) x4 <- c(x3, x2) x4 proteinToGenome(x4, cds_by_tx) ## --------------------------------------------------------------------- ## USING A TxDb OBJECT ## --------------------------------------------------------------------- library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene ## The first transcript (FBtr0309810) is non-coding: x <- IRanges(c(FBtr0309810="11-55", FBtr0306539="90-300")) res <- proteinToGenome(x, txdb) res
select
, columns
and keys
can be used together to
extract data from a TxDb object.
In the code snippets below, x
is a TxDb object.
keytypes(x)
:allows the user to discover which keytypes can be passed in to
select
or keys
and the keytype
argument.
keys(x, keytype, pattern, column, fuzzy)
:Return keys for the database contained in the TxDb object .
The keytype
argument specifies the kind of keys that will
be returned. By default keys
will return the "GENEID" keys
for the database.
If keys
is used with pattern
, it will pattern match
on the keytype
.
But if the column
argument is also provided along with the
pattern
argument, then pattern
will be matched
against the values in column
instead.
And if keys
is called with column
and no
pattern
argument, then it will return all keys that have
corresponding values in the column
argument.
Thus, the behavior of keys
all depends on how many arguments are
specified.
Use of the fuzzy
argument will toggle fuzzy matching to
TRUE or FALSE. If pattern
is not used, fuzzy is ignored.
columns(x)
:Show which kinds of data can be returned for the TxDb object.
select(x, keys, columns, keytype)
:When all the appropriate arguments are specified select
will retrieve the matching data as a data.frame based on
parameters for selected keys
and columns
and
keytype
arguments.
Marc Carlson
AnnotationDb-class for more descriptsion
of methods select
,keytypes
,keys
and columns
.
transcripts
, transcriptsBy
,
and transcriptsByOverlaps
, for other ways to
extract genomic features from a TxDb object.
The TxDb class.
txdb_file <- system.file("extdata", "Biomart_Ensembl_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) txdb ## find key types keytypes(txdb) ## list IDs that can be used to filter head(keys(txdb, "GENEID")) head(keys(txdb, "TXID")) head(keys(txdb, "TXNAME")) ## list columns that can be returned by select columns(txdb) ## call select res <- select(txdb, head(keys(txdb, "GENEID")), columns=c("GENEID","TXNAME"), keytype="GENEID") head(res)
txdb_file <- system.file("extdata", "Biomart_Ensembl_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) txdb ## find key types keytypes(txdb) ## list IDs that can be used to filter head(keys(txdb, "GENEID")) head(keys(txdb, "TXID")) head(keys(txdb, "TXNAME")) ## list columns that can be returned by select columns(txdb) ## call select res <- select(txdb, head(keys(txdb, "GENEID")), columns=c("GENEID","TXNAME"), keytype="GENEID") head(res)
The transcriptLengths
function extracts the transcript lengths from
a TxDb object. It also returns the CDS and UTR lengths for each
transcript if the user requested them.
transcriptLengths(txdb, with.cds_len=FALSE, with.utr5_len=FALSE, with.utr3_len=FALSE, ...)
transcriptLengths(txdb, with.cds_len=FALSE, with.utr5_len=FALSE, with.utr3_len=FALSE, ...)
txdb |
A TxDb object. |
with.cds_len , with.utr5_len , with.utr3_len
|
|
... |
Additional arguments used by |
All the lengths are counted in number of nucleotides.
The length of a processed transcript is just the sum of the lengths of its
exons. This should not be confounded with the length of the stretch of DNA
transcribed into RNA (a.k.a. transcription unit), which can be obtained
with width(transcripts(txdb))
.
A data frame with 1 row per transcript. The rows are guaranteed to be in
the same order as the elements of the GRanges object
returned by transcripts(txdb)
.
The data frame has between 5 and 8 columns, depending on what the user
requested via the with.cds_len
, with.utr5_len
, and
with.utr3_len
arguments.
The first 3 columns are the same as the metadata columns of the object returned by
transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
that is:
tx_id
: The internal transcript ID. This ID is unique within
the scope of the TxDb object. It is not an official or public
ID (like an Ensembl or FlyBase ID) or an Accession number, so it
cannot be used to lookup the transcript in public data bases or in
other TxDb objects. Furthermore, this ID could change when
re-running the code that was used to make the TxDb object.
tx_name
: An official/public transcript name or ID that can
be used to lookup the transcript in public data bases or in other
TxDb objects. This column is not guaranteed to contain unique
values and it can contain NAs.
gene_id
: The official/public ID of the gene that the
transcript belongs to. Can be NA if the gene is unknown or if the
transcript is not considered to belong to a gene.
The other columns are quantitative:
nexon
: The number of exons in the transcript.
tx_len
: The length of the processed transcript.
cds_len
: [optional] The length of the CDS region of the
processed transcript.
utr5_len
: [optional] The length of the 5' UTR region of
the processed transcript.
utr3_len
: [optional] The length of the 3' UTR region of
the processed transcript.
Hervé Pagès
transcripts
, transcriptsBy
,
and transcriptsByOverlaps
, for extracting
genomic feature locations from a TxDb-like object.
exonicParts
and intronicParts
for
extracting non-overlapping exonic or intronic parts from a
TxDb-like object.
extractTranscriptSeqs
for extracting transcript
(or CDS) sequences from chromosome sequences.
coverageByTranscript
for computing coverage by
transcript (or CDS) of a set of ranges.
makeTxDbFromUCSC
, makeTxDbFromBiomart
,
and makeTxDbFromEnsembl
, for making a TxDb
object from online resources.
makeTxDbFromGRanges
and makeTxDbFromGFF
for making a TxDb object from a GRanges
object, or from a GFF or GTF file.
The TxDb class.
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene dm3_txlens <- transcriptLengths(txdb) head(dm3_txlens) dm3_txlens <- transcriptLengths(txdb, with.cds_len=TRUE, with.utr5_len=TRUE, with.utr3_len=TRUE) head(dm3_txlens) ## When cds_len is 0 (non-coding transcript), utr5_len and utr3_len ## must also be 0: non_coding <- dm3_txlens[dm3_txlens$cds_len == 0, ] stopifnot(all(non_coding[6:8] == 0)) ## When cds_len is not 0 (coding transcript), cds_len + utr5_len + ## utr3_len must be equal to tx_len: coding <- dm3_txlens[dm3_txlens$cds_len != 0, ] stopifnot(all(rowSums(coding[6:8]) == coding[[5]])) ## A sanity check: stopifnot(identical(dm3_txlens$tx_id, mcols(transcripts(txdb))$tx_id))
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene dm3_txlens <- transcriptLengths(txdb) head(dm3_txlens) dm3_txlens <- transcriptLengths(txdb, with.cds_len=TRUE, with.utr5_len=TRUE, with.utr3_len=TRUE) head(dm3_txlens) ## When cds_len is 0 (non-coding transcript), utr5_len and utr3_len ## must also be 0: non_coding <- dm3_txlens[dm3_txlens$cds_len == 0, ] stopifnot(all(non_coding[6:8] == 0)) ## When cds_len is not 0 (coding transcript), cds_len + utr5_len + ## utr3_len must be equal to tx_len: coding <- dm3_txlens[dm3_txlens$cds_len != 0, ] stopifnot(all(rowSums(coding[6:8]) == coding[[5]])) ## A sanity check: stopifnot(identical(dm3_txlens$tx_id, mcols(transcripts(txdb))$tx_id))
transcriptLocs2refLocs
converts transcript-based
locations into reference-based (aka chromosome-based or genomic)
locations.
transcriptWidths
computes the lengths of the transcripts
(called the "widths" in this context) based on the boundaries
of their exons.
transcriptLocs2refLocs(tlocs, exonStarts=list(), exonEnds=list(), strand=character(0), decreasing.rank.on.minus.strand=FALSE, error.if.out.of.bounds=TRUE) transcriptWidths(exonStarts=list(), exonEnds=list())
transcriptLocs2refLocs(tlocs, exonStarts=list(), exonEnds=list(), strand=character(0), decreasing.rank.on.minus.strand=FALSE, error.if.out.of.bounds=TRUE) transcriptWidths(exonStarts=list(), exonEnds=list())
tlocs |
A list of integer vectors of the same length as |
exonStarts , exonEnds
|
The starts and ends of the exons, respectively. Each argument can be a list of integer vectors,
an IntegerList object,
or a character vector where each element is a
comma-separated list of integers.
In addition, the lists represented by |
strand |
A character vector of the same length as |
decreasing.rank.on.minus.strand |
|
error.if.out.of.bounds |
|
For transcriptLocs2refLocs
: A list of integer vectors of the same
shape as tlocs
.
For transcriptWidths
: An integer vector with one element per
transcript.
Hervé Pagès
extractTranscriptSeqs
for extracting transcript
(or CDS) sequences from chromosomes.
coverageByTranscript
for computing coverage by
transcript (or CDS) of a set of ranges.
## --------------------------------------------------------------------- ## WITH A SMALL SET OF HUMAN TRANSCRIPTS ## --------------------------------------------------------------------- txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) ex_by_tx <- exonsBy(txdb, by="tx", use.names=TRUE) genome <- BSgenome::getBSgenome("hg19") # load the hg19 genome tx_seqs <- extractTranscriptSeqs(genome, ex_by_tx) ## Get the reference-based locations of the first 4 (5' end) ## and last 4 (3' end) nucleotides in each transcript: tlocs <- lapply(width(tx_seqs), function(w) c(1:4, (w-3):w)) tx_strand <- sapply(strand(ex_by_tx), runValue) ## Note that, because of how we made them, 'tlocs', 'start(ex_by_tx)', ## 'end(ex_by_tx)' and 'tx_strand' are "parallel" objects i.e. they ## have the same length, and, for any valid positional index, elements ## at this position are corresponding to each other. This is how ## transcriptLocs2refLocs() expects them to be! rlocs <- transcriptLocs2refLocs(tlocs, start(ex_by_tx), end(ex_by_tx), tx_strand, decreasing.rank.on.minus.strand=TRUE) ## --------------------------------------------------------------------- ## WITH TWO WORM TRANSCRIPTS: ZC101.3.1 AND F37B1.1.1 ## --------------------------------------------------------------------- library(TxDb.Celegans.UCSC.ce11.ensGene) txdb <- TxDb.Celegans.UCSC.ce11.ensGene my_tx_names <- c("ZC101.3.1", "F37B1.1.1") ## Both transcripts are on chromosome II, the first one on its positive ## strand and the second one on its negative strand: my_tx <- transcripts(txdb, filter=list(tx_name=my_tx_names)) my_tx ## Using transcripts stored in a GRangesList object: ex_by_tx <- exonsBy(txdb, use.names=TRUE)[my_tx_names] genome <- getBSgenome("ce11") # load the ce11 genome tx_seqs <- extractTranscriptSeqs(genome, ex_by_tx) tx_seqs ## Since the 2 transcripts are on the same chromosome, an alternative ## is to store them in an IRangesList object and use that object with ## extractTranscriptSeqs(): ex_by_tx2 <- ranges(ex_by_tx) tx_seqs2 <- extractTranscriptSeqs(genome$chrII, ex_by_tx2, strand=strand(my_tx)) stopifnot(identical(as.character(tx_seqs), as.character(tx_seqs2))) ## Store exon starts and ends in two IntegerList objects for use with ## transcriptWidths() and transcriptLocs2refLocs(): exon_starts <- start(ex_by_tx) exon_ends <- end(ex_by_tx) ## Same as 'width(tx_seqs)': transcriptWidths(exonStarts=exon_starts, exonEnds=exon_ends) transcriptLocs2refLocs(list(c(1:2, 202:205, 1687:1688), c(1:2, 193:196, 721:722)), exonStarts=exon_starts, exonEnds=exon_ends, strand=c("+","-")) ## A sanity check: ref_locs <- transcriptLocs2refLocs(list(1:1688, 1:722), exonStarts=exon_starts, exonEnds=exon_ends, strand=c("+","-")) stopifnot(genome$chrII[ref_locs[[1]]] == tx_seqs[[1]]) stopifnot(complement(genome$chrII)[ref_locs[[2]]] == tx_seqs[[2]])
## --------------------------------------------------------------------- ## WITH A SMALL SET OF HUMAN TRANSCRIPTS ## --------------------------------------------------------------------- txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) ex_by_tx <- exonsBy(txdb, by="tx", use.names=TRUE) genome <- BSgenome::getBSgenome("hg19") # load the hg19 genome tx_seqs <- extractTranscriptSeqs(genome, ex_by_tx) ## Get the reference-based locations of the first 4 (5' end) ## and last 4 (3' end) nucleotides in each transcript: tlocs <- lapply(width(tx_seqs), function(w) c(1:4, (w-3):w)) tx_strand <- sapply(strand(ex_by_tx), runValue) ## Note that, because of how we made them, 'tlocs', 'start(ex_by_tx)', ## 'end(ex_by_tx)' and 'tx_strand' are "parallel" objects i.e. they ## have the same length, and, for any valid positional index, elements ## at this position are corresponding to each other. This is how ## transcriptLocs2refLocs() expects them to be! rlocs <- transcriptLocs2refLocs(tlocs, start(ex_by_tx), end(ex_by_tx), tx_strand, decreasing.rank.on.minus.strand=TRUE) ## --------------------------------------------------------------------- ## WITH TWO WORM TRANSCRIPTS: ZC101.3.1 AND F37B1.1.1 ## --------------------------------------------------------------------- library(TxDb.Celegans.UCSC.ce11.ensGene) txdb <- TxDb.Celegans.UCSC.ce11.ensGene my_tx_names <- c("ZC101.3.1", "F37B1.1.1") ## Both transcripts are on chromosome II, the first one on its positive ## strand and the second one on its negative strand: my_tx <- transcripts(txdb, filter=list(tx_name=my_tx_names)) my_tx ## Using transcripts stored in a GRangesList object: ex_by_tx <- exonsBy(txdb, use.names=TRUE)[my_tx_names] genome <- getBSgenome("ce11") # load the ce11 genome tx_seqs <- extractTranscriptSeqs(genome, ex_by_tx) tx_seqs ## Since the 2 transcripts are on the same chromosome, an alternative ## is to store them in an IRangesList object and use that object with ## extractTranscriptSeqs(): ex_by_tx2 <- ranges(ex_by_tx) tx_seqs2 <- extractTranscriptSeqs(genome$chrII, ex_by_tx2, strand=strand(my_tx)) stopifnot(identical(as.character(tx_seqs), as.character(tx_seqs2))) ## Store exon starts and ends in two IntegerList objects for use with ## transcriptWidths() and transcriptLocs2refLocs(): exon_starts <- start(ex_by_tx) exon_ends <- end(ex_by_tx) ## Same as 'width(tx_seqs)': transcriptWidths(exonStarts=exon_starts, exonEnds=exon_ends) transcriptLocs2refLocs(list(c(1:2, 202:205, 1687:1688), c(1:2, 193:196, 721:722)), exonStarts=exon_starts, exonEnds=exon_ends, strand=c("+","-")) ## A sanity check: ref_locs <- transcriptLocs2refLocs(list(1:1688, 1:722), exonStarts=exon_starts, exonEnds=exon_ends, strand=c("+","-")) stopifnot(genome$chrII[ref_locs[[1]]] == tx_seqs[[1]]) stopifnot(complement(genome$chrII)[ref_locs[[2]]] == tx_seqs[[2]])
Generic functions to extract genomic features from a TxDb-like object. This page documents the methods for TxDb objects only.
transcripts(x, ...) ## S4 method for signature 'TxDb' transcripts(x, columns=c("tx_id", "tx_name"), filter=NULL, use.names=FALSE) exons(x, ...) ## S4 method for signature 'TxDb' exons(x, columns="exon_id", filter=NULL, use.names=FALSE) cds(x, ...) ## S4 method for signature 'TxDb' cds(x, columns="cds_id", filter=NULL, use.names=FALSE) genes(x, ...) ## S4 method for signature 'TxDb' genes(x, columns="gene_id", filter=NULL, single.strand.genes.only=TRUE) ## S4 method for signature 'TxDb' promoters(x, upstream=2000, downstream=200, use.names=TRUE, ...) ## S4 method for signature 'TxDb' terminators(x, upstream=2000, downstream=200, use.names=TRUE, ...)
transcripts(x, ...) ## S4 method for signature 'TxDb' transcripts(x, columns=c("tx_id", "tx_name"), filter=NULL, use.names=FALSE) exons(x, ...) ## S4 method for signature 'TxDb' exons(x, columns="exon_id", filter=NULL, use.names=FALSE) cds(x, ...) ## S4 method for signature 'TxDb' cds(x, columns="cds_id", filter=NULL, use.names=FALSE) genes(x, ...) ## S4 method for signature 'TxDb' genes(x, columns="gene_id", filter=NULL, single.strand.genes.only=TRUE) ## S4 method for signature 'TxDb' promoters(x, upstream=2000, downstream=200, use.names=TRUE, ...) ## S4 method for signature 'TxDb' terminators(x, upstream=2000, downstream=200, use.names=TRUE, ...)
x |
A TxDb object. |
... |
For the For the |
columns |
Columns to include in the output.
Must be
If the vector is named, those names are used for the corresponding column in the element metadata of the returned object. |
filter |
Either |
use.names |
|
single.strand.genes.only |
If |
upstream , downstream
|
For For For additional details see
|
These are the main functions for extracting features from a
TxDb-like object. Note that cds()
extracts the
bulk CDS parts stored in x
, that is, the CDS regions
associated with exons. It is often more useful to extract them grouped
by transcript with cdsBy()
.
To restrict the output based on interval information, use the
transcriptsByOverlaps()
, exonsByOverlaps()
,
or cdsByOverlaps()
function.
The promoters()
and terminators()
functions compute
user-defined promoter or terminator regions for the transcripts
in a TxDb-like object. The returned object is a
GRanges with one range per transcript
in the TxDb-like object.
Each range represents the promoter or terminator region associated
with a transcript, that is, the region around the TSS
(transcription start site) or TES (transcription end site) the
span of which is defined by upstream
and downstream
.
For additional details on how the promoter and terminator ranges
are computed and the handling of +
and -
strands see
?GenomicRanges::promoters
in the
GenomicRanges package.
A GRanges object. The only exception being
when genes()
is used with single.strand.genes.only=FALSE
,
in which case a GRangesList object is returned.
M. Carlson, P. Aboyoun and H. Pagès
transcriptsBy
and transcriptsByOverlaps
for more ways to extract genomic features
from a TxDb-like object.
transcriptLengths
for extracting the transcript
lengths (and other metrics) from a TxDb object.
exonicParts
and intronicParts
for
extracting non-overlapping exonic or intronic parts from a
TxDb-like object.
extendExonsIntoIntrons
for extending exons
into their adjacent introns.
extractTranscriptSeqs
for extracting transcript
(or CDS) sequences from reference sequences.
getPromoterSeq
for extracting gene promoter or
terminator sequences.
coverageByTranscript
for computing coverage by
transcript (or CDS) of a set of ranges.
select-methods for how to use the simple "select" interface to extract information from a TxDb object.
microRNAs
and tRNAs
for extracting
microRNA or tRNA genomic ranges from a TxDb object.
id2name
for mapping TxDb internal ids
to external names for a given feature type.
The TxDb class.
txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) ## --------------------------------------------------------------------- ## transcripts() ## --------------------------------------------------------------------- tx1 <- transcripts(txdb) tx1 transcripts(txdb, use.names=TRUE) transcripts(txdb, columns=NULL, use.names=TRUE) filter <- list(tx_chrom = c("chr3", "chr5"), tx_strand = "+") tx2 <- transcripts(txdb, filter=filter) tx2 ## Sanity checks: stopifnot( identical(mcols(tx1)$tx_id, seq_along(tx1)), identical(tx2, tx1[seqnames(tx1) == "chr3" & strand(tx1) == "+"]) ) ## --------------------------------------------------------------------- ## exons() ## --------------------------------------------------------------------- exons(txdb, columns=c("EXONID", "TXNAME"), filter=list(exon_id=1)) exons(txdb, columns=c("EXONID", "TXNAME"), filter=list(tx_name="uc009vip.1")) ## --------------------------------------------------------------------- ## genes() ## --------------------------------------------------------------------- genes(txdb) # a GRanges object cols <- c("tx_id", "tx_chrom", "tx_strand", "exon_id", "exon_chrom", "exon_strand") ## By default, genes are returned in a GRanges object and those that ## cannot be represented by a single genomic range (because they have ## exons located on both strands of the same reference sequence or on ## more than one reference sequence) are dropped with a message: single_strand_genes <- genes(txdb, columns=cols) ## Because we've returned single strand genes only, the "tx_chrom" ## and "exon_chrom" metadata columns are guaranteed to match ## 'seqnames(single_strand_genes)': stopifnot(identical(as.character(seqnames(single_strand_genes)), as.character(mcols(single_strand_genes)$tx_chrom))) stopifnot(identical(as.character(seqnames(single_strand_genes)), as.character(mcols(single_strand_genes)$exon_chrom))) ## and also the "tx_strand" and "exon_strand" metadata columns are ## guaranteed to match 'strand(single_strand_genes)': stopifnot(identical(as.character(strand(single_strand_genes)), as.character(mcols(single_strand_genes)$tx_strand))) stopifnot(identical(as.character(strand(single_strand_genes)), as.character(mcols(single_strand_genes)$exon_strand))) all_genes <- genes(txdb, columns=cols, single.strand.genes.only=FALSE) all_genes # a GRangesList object multiple_strand_genes <- all_genes[elementNROWS(all_genes) >= 2] multiple_strand_genes mcols(multiple_strand_genes) ## --------------------------------------------------------------------- ## promoters() and terminators() ## --------------------------------------------------------------------- ## This: promoters(txdb, upstream=100, downstream=50) ## is equivalent to: tx <- transcripts(txdb, use.names=TRUE) promoters(tx, upstream=100, downstream=50) ## And this: terminators(txdb, upstream=100, downstream=50) ## is equivalent to: terminators(tx, upstream=100, downstream=50) ## Extra arguments are passed to transcripts(). So this: columns <- c("tx_name", "gene_id") promoters(txdb, upstream=100, downstream=50, columns=columns) ## is equivalent to: promoters(transcripts(txdb, columns=columns, use.names=TRUE), upstream=100, downstream=50)
txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) ## --------------------------------------------------------------------- ## transcripts() ## --------------------------------------------------------------------- tx1 <- transcripts(txdb) tx1 transcripts(txdb, use.names=TRUE) transcripts(txdb, columns=NULL, use.names=TRUE) filter <- list(tx_chrom = c("chr3", "chr5"), tx_strand = "+") tx2 <- transcripts(txdb, filter=filter) tx2 ## Sanity checks: stopifnot( identical(mcols(tx1)$tx_id, seq_along(tx1)), identical(tx2, tx1[seqnames(tx1) == "chr3" & strand(tx1) == "+"]) ) ## --------------------------------------------------------------------- ## exons() ## --------------------------------------------------------------------- exons(txdb, columns=c("EXONID", "TXNAME"), filter=list(exon_id=1)) exons(txdb, columns=c("EXONID", "TXNAME"), filter=list(tx_name="uc009vip.1")) ## --------------------------------------------------------------------- ## genes() ## --------------------------------------------------------------------- genes(txdb) # a GRanges object cols <- c("tx_id", "tx_chrom", "tx_strand", "exon_id", "exon_chrom", "exon_strand") ## By default, genes are returned in a GRanges object and those that ## cannot be represented by a single genomic range (because they have ## exons located on both strands of the same reference sequence or on ## more than one reference sequence) are dropped with a message: single_strand_genes <- genes(txdb, columns=cols) ## Because we've returned single strand genes only, the "tx_chrom" ## and "exon_chrom" metadata columns are guaranteed to match ## 'seqnames(single_strand_genes)': stopifnot(identical(as.character(seqnames(single_strand_genes)), as.character(mcols(single_strand_genes)$tx_chrom))) stopifnot(identical(as.character(seqnames(single_strand_genes)), as.character(mcols(single_strand_genes)$exon_chrom))) ## and also the "tx_strand" and "exon_strand" metadata columns are ## guaranteed to match 'strand(single_strand_genes)': stopifnot(identical(as.character(strand(single_strand_genes)), as.character(mcols(single_strand_genes)$tx_strand))) stopifnot(identical(as.character(strand(single_strand_genes)), as.character(mcols(single_strand_genes)$exon_strand))) all_genes <- genes(txdb, columns=cols, single.strand.genes.only=FALSE) all_genes # a GRangesList object multiple_strand_genes <- all_genes[elementNROWS(all_genes) >= 2] multiple_strand_genes mcols(multiple_strand_genes) ## --------------------------------------------------------------------- ## promoters() and terminators() ## --------------------------------------------------------------------- ## This: promoters(txdb, upstream=100, downstream=50) ## is equivalent to: tx <- transcripts(txdb, use.names=TRUE) promoters(tx, upstream=100, downstream=50) ## And this: terminators(txdb, upstream=100, downstream=50) ## is equivalent to: terminators(tx, upstream=100, downstream=50) ## Extra arguments are passed to transcripts(). So this: columns <- c("tx_name", "gene_id") promoters(txdb, upstream=100, downstream=50, columns=columns) ## is equivalent to: promoters(transcripts(txdb, columns=columns, use.names=TRUE), upstream=100, downstream=50)
Generic functions to extract genomic features of a given type grouped based on another type of genomic feature. This page documents the methods for TxDb objects only.
transcriptsBy(x, by=c("gene", "exon", "cds"), ...) ## S4 method for signature 'TxDb' transcriptsBy(x, by=c("gene", "exon", "cds"), use.names=FALSE) exonsBy(x, by=c("tx", "gene"), ...) ## S4 method for signature 'TxDb' exonsBy(x, by=c("tx", "gene"), use.names=FALSE) cdsBy(x, by=c("tx", "gene"), ...) ## S4 method for signature 'TxDb' cdsBy(x, by=c("tx", "gene"), use.names=FALSE) intronsByTranscript(x, ...) ## S4 method for signature 'TxDb' intronsByTranscript(x, use.names=FALSE) fiveUTRsByTranscript(x, ...) ## S4 method for signature 'TxDb' fiveUTRsByTranscript(x, use.names=FALSE) threeUTRsByTranscript(x, ...) ## S4 method for signature 'TxDb' threeUTRsByTranscript(x, use.names=FALSE)
transcriptsBy(x, by=c("gene", "exon", "cds"), ...) ## S4 method for signature 'TxDb' transcriptsBy(x, by=c("gene", "exon", "cds"), use.names=FALSE) exonsBy(x, by=c("tx", "gene"), ...) ## S4 method for signature 'TxDb' exonsBy(x, by=c("tx", "gene"), use.names=FALSE) cdsBy(x, by=c("tx", "gene"), ...) ## S4 method for signature 'TxDb' cdsBy(x, by=c("tx", "gene"), use.names=FALSE) intronsByTranscript(x, ...) ## S4 method for signature 'TxDb' intronsByTranscript(x, use.names=FALSE) fiveUTRsByTranscript(x, ...) ## S4 method for signature 'TxDb' fiveUTRsByTranscript(x, use.names=FALSE) threeUTRsByTranscript(x, ...) ## S4 method for signature 'TxDb' threeUTRsByTranscript(x, use.names=FALSE)
x |
A TxDb object. |
... |
Arguments to be passed to or from methods. |
by |
One of |
use.names |
Controls how to set the names of the returned
GRangesList object.
These functions return all the features of a given type (e.g.
all the exons) grouped by another feature type (e.g. grouped by
transcript) in a GRangesList object.
By default (i.e. if Finally, |
These functions return a GRangesList object where the ranges within each of the elements are ordered according to the following rule:
When using exonsBy
or cdsBy
with by="tx"
,
the returned exons or CDS parts are ordered by ascending rank for
each transcript, that is, by their position in the transcript.
In all other cases, the ranges will be ordered by chromosome, strand,
start, and end values.
A GRangesList object.
M. Carlson, P. Aboyoun and H. Pagès
transcripts
and transcriptsByOverlaps
for more ways to extract genomic features
from a TxDb-like object.
transcriptLengths
for extracting the transcript
lengths (and other metrics) from a TxDb object.
exonicParts
and intronicParts
for
extracting non-overlapping exonic or intronic parts from a
TxDb-like object.
extendExonsIntoIntrons
for extending exons
into their adjacent introns.
extractTranscriptSeqs
for extracting transcript
(or CDS) sequences from chromosome sequences.
coverageByTranscript
for computing coverage by
transcript (or CDS) of a set of ranges.
select-methods for how to use the simple "select" interface to extract information from a TxDb object.
id2name
for mapping TxDb internal ids
to external names for a given feature type.
The TxDb class.
txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) ## Extract the transcripts grouped by gene: transcriptsBy(txdb, "gene") ## Extract the exons grouped by gene: exonsBy(txdb, "gene") ## Extract the CDS parts grouped by transcript: cds_by_tx0 <- cdsBy(txdb, "tx") ## With more informative group names: cds_by_tx1 <- cdsBy(txdb, "tx", use.names=TRUE) ## Note that 'cds_by_tx1' can also be obtained with: names(cds_by_tx0) <- id2name(txdb, feature.type="tx")[names(cds_by_tx0)] stopifnot(identical(cds_by_tx0, cds_by_tx1)) ## Extract the introns grouped by transcript: intronsByTranscript(txdb) ## Extract the 5' UTRs grouped by transcript: fiveUTRsByTranscript(txdb) fiveUTRsByTranscript(txdb, use.names=TRUE) # more informative group names
txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) ## Extract the transcripts grouped by gene: transcriptsBy(txdb, "gene") ## Extract the exons grouped by gene: exonsBy(txdb, "gene") ## Extract the CDS parts grouped by transcript: cds_by_tx0 <- cdsBy(txdb, "tx") ## With more informative group names: cds_by_tx1 <- cdsBy(txdb, "tx", use.names=TRUE) ## Note that 'cds_by_tx1' can also be obtained with: names(cds_by_tx0) <- id2name(txdb, feature.type="tx")[names(cds_by_tx0)] stopifnot(identical(cds_by_tx0, cds_by_tx1)) ## Extract the introns grouped by transcript: intronsByTranscript(txdb) ## Extract the 5' UTRs grouped by transcript: fiveUTRsByTranscript(txdb) fiveUTRsByTranscript(txdb, use.names=TRUE) # more informative group names
Generic functions to extract genomic features for specified genomic locations. This page documents the methods for TxDb objects only.
transcriptsByOverlaps(x, ranges, maxgap = -1L, minoverlap = 0L, type = c("any", "start", "end"), ...) ## S4 method for signature 'TxDb' transcriptsByOverlaps(x, ranges, maxgap = -1L, minoverlap = 0L, type = c("any", "start", "end"), columns = c("tx_id", "tx_name")) exonsByOverlaps(x, ranges, maxgap = -1L, minoverlap = 0L, type = c("any", "start", "end"), ...) ## S4 method for signature 'TxDb' exonsByOverlaps(x, ranges, maxgap = -1L, minoverlap = 0L, type = c("any", "start", "end"), columns = "exon_id") cdsByOverlaps(x, ranges, maxgap = -1L, minoverlap = 0L, type = c("any", "start", "end"), ...) ## S4 method for signature 'TxDb' cdsByOverlaps(x, ranges, maxgap = -1L, minoverlap = 0L, type = c("any", "start", "end"), columns = "cds_id")
transcriptsByOverlaps(x, ranges, maxgap = -1L, minoverlap = 0L, type = c("any", "start", "end"), ...) ## S4 method for signature 'TxDb' transcriptsByOverlaps(x, ranges, maxgap = -1L, minoverlap = 0L, type = c("any", "start", "end"), columns = c("tx_id", "tx_name")) exonsByOverlaps(x, ranges, maxgap = -1L, minoverlap = 0L, type = c("any", "start", "end"), ...) ## S4 method for signature 'TxDb' exonsByOverlaps(x, ranges, maxgap = -1L, minoverlap = 0L, type = c("any", "start", "end"), columns = "exon_id") cdsByOverlaps(x, ranges, maxgap = -1L, minoverlap = 0L, type = c("any", "start", "end"), ...) ## S4 method for signature 'TxDb' cdsByOverlaps(x, ranges, maxgap = -1L, minoverlap = 0L, type = c("any", "start", "end"), columns = "cds_id")
x |
A TxDb object. |
ranges |
A GRanges object to restrict the output. |
maxgap , minoverlap , type
|
Used in the internal call to |
... |
Arguments to be passed to or from methods. |
columns |
Columns to include in the output.
See |
These functions subset the results of transcripts
,
exons
, and cds
function calls with
using the results of findOverlaps
calls based on the specified ranges
.
a GRanges object
P. Aboyoun
transcripts
and transcriptsBy
for more ways to extract genomic features
from a TxDb-like object.
transcriptLengths
for extracting the transcript
lengths (and other metrics) from a TxDb object.
exonicParts
and intronicParts
for
extracting non-overlapping exonic or intronic parts from a
TxDb-like object.
extractTranscriptSeqs
for extracting transcript
(or CDS) sequences from chromosome sequences.
coverageByTranscript
for computing coverage by
transcript (or CDS) of a set of ranges.
select-methods for how to use the simple "select" interface to extract information from a TxDb object.
id2name
for mapping TxDb internal ids
to external names for a given feature type.
The TxDb class.
txdb <- loadDb(system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures")) gr <- GRanges(Rle("chr1", 2), IRanges(c(500,10500), c(10000,30000)), strand = Rle("-", 2)) transcriptsByOverlaps(txdb, gr)
txdb <- loadDb(system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures")) gr <- GRanges(Rle("chr1", 2), IRanges(c(500,10500), c(10000,30000)), strand = Rle("-", 2)) transcriptsByOverlaps(txdb, gr)
The TxDb class is a container for storing transcript annotations.
In the code snippets below, x
is a TxDb object.
metadata(x)
:Return x
's metadata in a data frame.
seqlevels0(x)
:Get the sequence levels originally in x
. This ignores any
change the user might have made to the sequence levels with the
seqlevels
setter.
seqlevels(x)
, seqlevels(x) <- value
:Get or set the sequence levels in x
.
seqinfo(x)
, seqinfo(x) <- value
:Get or set the information about the underlying sequences.
Note that, for now, the setter only supports replacement of the
sequence names, i.e., except for their sequence names (accessed with
seqnames(value)
and seqnames(seqinfo(x))
, respectively),
Seqinfo objects value
(supplied) and
seqinfo(x)
(current) must be identical.
isActiveSeq(x)
:Return the currently active sequences for this txdb object as a named logical vector. Only active sequences will be tapped when using the supplied accessor methods. Inactive sequences will be ignored. By default, all available sequences will be active.
isActiveSeq(x) <- value
:Allows the user to change which sequences will be actively accessed by the accessor methods by altering the contents of this named logical vector.
seqlevelsStyle(x)
, seqlevelsStyle(x) <- value
:Get or set the seqname style for x
.
See the seqlevelsStyle generic getter and setter
in the GenomeInfoDb package for more information.
as.list(x)
:Dump the entire db into a list of data frames, say txdb_dump
,
that can then be used to recreate the original db with
do.call(makeTxDb, txdb_dump)
with no loss of information
(except possibly for some of the metadata).
Note that the transcripts are dumped in the same order in all the
data frames.
Hervé Pagès, Marc Carlson
makeTxDbFromUCSC
, makeTxDbFromBiomart
,
and makeTxDbFromEnsembl
, for making a TxDb
object from online resources.
makeTxDbFromGRanges
and makeTxDbFromGFF
for making a TxDb object from a GRanges
object, or from a GFF or GTF file.
saveDb
and
loadDb
in the AnnotationDbi
package for saving and loading a TxDb object as an SQLite file.
transcripts
, transcriptsBy
,
and transcriptsByOverlaps
, for extracting
genomic feature locations from a TxDb-like object.
transcriptLengths
for extracting the transcript
lengths (and other metrics) from a TxDb object.
select-methods for how to use the simple "select" interface to extract information from a TxDb object.
The Seqinfo class in the GenomeInfoDb package.
txdb_file <- system.file("extdata", "Biomart_Ensembl_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) txdb ## Use of seqinfo(): seqlevelsStyle(txdb) seqinfo(txdb) seqlevels(txdb) seqlengths(txdb) # shortcut for 'seqlengths(seqinfo(txdb))' isCircular(txdb) # shortcut for 'isCircular(seqinfo(txdb))' names(which(isCircular(txdb))) ## You can set user-supplied seqlevels on 'txdb' to restrict any further ## operations to a subset of chromosomes: seqlevels(txdb) <- c("Y", "6") ## Then you can restore the seqlevels stored in the db: seqlevels(txdb) <- seqlevels0(txdb) ## Use of as.list(): txdb_dump <- as.list(txdb) txdb_dump txdb1 <- do.call(makeTxDb, txdb_dump) stopifnot(identical(as.list(txdb1), txdb_dump))
txdb_file <- system.file("extdata", "Biomart_Ensembl_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) txdb ## Use of seqinfo(): seqlevelsStyle(txdb) seqinfo(txdb) seqlevels(txdb) seqlengths(txdb) # shortcut for 'seqlengths(seqinfo(txdb))' isCircular(txdb) # shortcut for 'isCircular(seqinfo(txdb))' names(which(isCircular(txdb))) ## You can set user-supplied seqlevels on 'txdb' to restrict any further ## operations to a subset of chromosomes: seqlevels(txdb) <- c("Y", "6") ## Then you can restore the seqlevels stored in the db: seqlevels(txdb) <- seqlevels0(txdb) ## Use of as.list(): txdb_dump <- as.list(txdb) txdb_dump txdb1 <- do.call(makeTxDb, txdb_dump) stopifnot(identical(as.list(txdb1), txdb_dump))