Title: | Annotation of Genomic Regions to Genomic Annotations |
---|---|
Description: | Given a set of genomic sites/regions (e.g. ChIP-seq peaks, CpGs, differentially methylated CpGs or regions, SNPs, etc.) it is often of interest to investigate the intersecting genomic annotations. Such annotations include those relating to gene models (promoters, 5'UTRs, exons, introns, and 3'UTRs), CpGs (CpG islands, CpG shores, CpG shelves), or regulatory sequences such as enhancers. The annotatr package provides an easy way to summarize and visualize the intersection of genomic sites/regions with genomic annotations. |
Authors: | Raymond G. Cavalcante [aut, cre], Maureen A. Sartor [ths] |
Maintainer: | Raymond G. Cavalcante <[email protected]> |
License: | GPL-3 |
Version: | 1.33.0 |
Built: | 2024-12-21 06:20:22 UTC |
Source: | https://github.com/bioc/annotatr |
Annotate genomic regions to selected genomic annotations while preserving the data associated with the genomic regions.
annotate_regions( regions, annotations, minoverlap = 1L, ignore.strand = TRUE, quiet = FALSE )
annotate_regions( regions, annotations, minoverlap = 1L, ignore.strand = TRUE, quiet = FALSE )
regions |
The GRanges object returned by |
annotations |
A character vector of annotations to build. Valid annotation codes are listed with |
minoverlap |
A scalar, positive integer, indicating the minimum required overlap of regions with annotations. |
ignore.strand |
Logical indicating whether strandedness should be respected in findOverlaps(). Default FALSE. |
quiet |
Print progress messages (FALSE) or not (TRUE). |
A GRanges
where the granges
are from the regions, and the mcols
include the mcols
from the regions and a column with the annotation GRanges
.
r_file = system.file('extdata', 'test_read_multiple_data_nohead.bed', package='annotatr') extraCols = c(pval = 'numeric', mu1 = 'integer', mu0 = 'integer', diff_exp = 'character') r = read_regions(con = r_file, extraCols = extraCols, rename_score = 'coverage') # Get premade CpG annotations data('annotations', package = 'annotatr') a = annotate_regions( regions = r, annotations = annotations, ignore.strand = TRUE)
r_file = system.file('extdata', 'test_read_multiple_data_nohead.bed', package='annotatr') extraCols = c(pval = 'numeric', mu1 = 'integer', mu0 = 'integer', diff_exp = 'character') r = read_regions(con = r_file, extraCols = extraCols, rename_score = 'coverage') # Get premade CpG annotations data('annotations', package = 'annotatr') a = annotate_regions( regions = r, annotations = annotations, ignore.strand = TRUE)
A GRanges
of precomputed annotations for CpG features. Created by doing
build_annotations(genome='hg19', annotations = 'hg19_cpgs')
.
annotations
annotations
A GRanges
object with the CpG feature annotations for hg19
and containing mcols
:
The internal ID for the annotation
All NA, since these are not associated with tx_ids
All NA, since there are not associated Entrez IDs
All NA, since there are not associated gene symbols
A character indicating the type of annotation. Including: 'hg19_cpg_islands', 'hg19_cpg_shores', 'hg19_cpg_shelves', and 'hg19_cpg_inter'.
The AnnotationHub resource for hg19 CpG features.
Given a set of genomic sites/regions (e.g. ChIP-seq peaks, CpGs, differentially methylated CpGs or regions, SNPs, etc.) it is often of interest to investigate the intersecting functional annotations. Such annotations include those relating to gene models (promoters, 5'UTRs, exons, introns, and 3'UTRs), CpGs (CpG islands, CpG shores, CpG shelves), the non-coding genome, and enhancers. The annotatr package provides an easy way to summarize and visualize the intersection of genomic sites/regions with the above functional annotations.
Code thanks to Martin Morgan. This is a global variable that will store custom annotations that a user reads in during a session in which annotatr is loaded.
annotatr_cache
annotatr_cache
An object of class list
of length 3.
An environment to contain custom annotations from read_annotations
.
# Example usage annotatr_cache$set("foo", 1:10) annotatr_cache$get("foo") # Read in a BED3 file as a custom annotation file = system.file('extdata', 'test_annotations_3.bed', package='annotatr') # The custom annotation is added to the annotatr_cache environment in this function read_annotations(con = file, name = 'test', genome = 'hg19') # The result of read_annotations() is not visible in .GlobalEnv, instead # need to use the get method print(annotatr_cache$get('hg19_custom_test')) # See what is in the annotatr_cache annotatr_cache$list_env()
# Example usage annotatr_cache$set("foo", 1:10) annotatr_cache$get("foo") # Read in a BED3 file as a custom annotation file = system.file('extdata', 'test_annotations_3.bed', package='annotatr') # The custom annotation is added to the annotatr_cache environment in this function read_annotations(con = file, name = 'test', genome = 'hg19') # The result of read_annotations() is not visible in .GlobalEnv, instead # need to use the get method print(annotatr_cache$get('hg19_custom_test')) # See what is in the annotatr_cache annotatr_cache$list_env()
A helper function to build arbitrary annotatinos from AnnotationHub
build_ah_annots(genome, ah_codes, annotation_class)
build_ah_annots(genome, ah_codes, annotation_class)
genome |
The genome assembly. |
ah_codes |
A named character vector giving the AnnotationHub accession number (e.g. AH23256), and whose name describes what the annotation is (e.g. Gm12878_H3K4me3). |
annotation_class |
A string to name the group of annotations in |
A GRanges
object stored in annotatr_cache
. To view an annotation built with this function, do annotatr_cache$get(name)
. To add these annotations to a set of annotations, include '[genome]_[annotation_class]_[name]'
in the call to build_annotations()
. See example below.
# Create a named vector for the AnnotationHub accession codes with desired names h3k4me3_code = c('Gm12878' = 'AH23256') # Fetch ah_codes from AnnotationHub and create annotations annotatr understands build_ah_annots(genome = 'hg19', ah_codes = h3k4me3_code, annotation_class = 'H3K4me3') # The annotations as they appear in annotatr_cache annot_name = c('hg19_H3K4me3_Gm12878') # Build the annotations right before annotating any regions annotations = build_annotations(genome = 'hg19', annotations = annot_name)
# Create a named vector for the AnnotationHub accession codes with desired names h3k4me3_code = c('Gm12878' = 'AH23256') # Fetch ah_codes from AnnotationHub and create annotations annotatr understands build_ah_annots(genome = 'hg19', ah_codes = h3k4me3_code, annotation_class = 'H3K4me3') # The annotations as they appear in annotatr_cache annot_name = c('hg19_H3K4me3_Gm12878') # Build the annotations right before annotating any regions annotations = build_annotations(genome = 'hg19', annotations = annot_name)
Create a GRanges
object consisting of all the desired annotations
. Supported annotation codes are listed by builtin_annotations()
. The basis for enhancer annotations are FANTOM5 data, the basis for CpG related annotations are CpG island tracks from AnnotationHub
, and the basis for genic annotations are from the TxDb.*
and org.db
group of packages.
build_annotations(genome, annotations)
build_annotations(genome, annotations)
genome |
The genome assembly. |
annotations |
A character vector of annotations to build. Valid annotation codes are listed with |
A GRanges
object of all the annotations
combined. The mcols
are id, tx_id, gene_id, symbol, type
. The id
column is a unique name, the tx_id
column is either a UCSC knownGene transcript ID (genic annotations) or a Ensembl transcript ID (lncRNA annotations), the gene_id
is the Entrez ID, the symbol
is the gene symbol from the org.*.eg.db
mapping from the Entrez ID, and the type
is of the form [genome]_[type]_[name]
.
# Example with hg19 gene promoters annots = c('hg19_genes_promoters') annots_gr = build_annotations(genome = 'hg19', annotations = annots) # See vignette for an example with custom annotation
# Example with hg19 gene promoters annots = c('hg19_genes_promoters') annots_gr = build_annotations(genome = 'hg19', annotations = annots) # See vignette for an example with custom annotation
Using the AnnotationHub
package, extract CpG island track for the appropriate genome
and construct the shores, shelves, and interCGI annotations as desired.
build_cpg_annots( genome = annotatr::builtin_genomes(), annotations = annotatr::builtin_annotations() )
build_cpg_annots( genome = annotatr::builtin_genomes(), annotations = annotatr::builtin_annotations() )
genome |
The genome assembly. |
annotations |
A character vector with entries of the form |
A list of GRanges
objects.
A helper function to build enhancer annotations for hg19 and mm10 from FANTOM5.
build_enhancer_annots(genome = c("hg19", "hg38", "mm9", "mm10"))
build_enhancer_annots(genome = c("hg19", "hg38", "mm9", "mm10"))
genome |
The genome assembly. |
A GRanges
object.
Using the TxDb.*
group of packages, construct genic annotations consisting of any combination of 1-5kb upstream of a TSS, promoters (< 1kb from TSS), 5UTRs, CDS, exons, first exons, introns, intron/exon and exon/intron boundaries, 3UTRs, and intergenic.
build_gene_annots( genome = annotatr::builtin_genomes(), annotations = annotatr::builtin_annotations() )
build_gene_annots( genome = annotatr::builtin_genomes(), annotations = annotatr::builtin_annotations() )
genome |
The genome assembly. |
annotations |
A character vector with entries of the form |
A list of GRanges
objects with unique id
of the form [type]:i
, tx_id
being the UCSC knownGene transcript name, gene_id
being the Entrez Gene ID, symbol
being the gene symbol from the Entrez ID to symbol mapping in org.db
for that species, and type
being the annotation type.
A helper function to build chromHMM annotations for hg19 from UCSC Genome Browser.
build_hmm_annots( genome = c("hg19"), annotations = annotatr::builtin_annotations() )
build_hmm_annots( genome = c("hg19"), annotations = annotatr::builtin_annotations() )
genome |
The genome assembly. |
annotations |
A character vector of valid chromatin state annotatin codes. |
A GRanges
object.
Using the AnnotationHub
package, retrieve transcript level lncRNA annotations for either human (GRCh38) or mouse (GRCm38). If the genome is 'hg19', use the permalink from GENCODE and rtracklayer::import()
to download and process.
build_lncrna_annots(genome = c("hg19", "hg38", "mm10"))
build_lncrna_annots(genome = c("hg19", "hg38", "mm10"))
genome |
The genome assembly. |
A GRanges
object with id
giving the transcript_type
from the GENCODE file, tx_id
being the Ensembl transcript ID, gene_id
being the Entrez ID coming from a mapping of gene symbol to Entrez ID, symbol
being the gene_name from the GENCODE file, and the type
being [genome]_lncrna_gencode
.
This includes the shortcuts. The expand_annotations()
function helps
handle the shortcuts.
builtin_annotations()
builtin_annotations()
A character vector of available annotations.
builtin_annotations()
builtin_annotations()
Function returning supported TxDb.* genomes
builtin_genomes()
builtin_genomes()
A character vector of genomes for supported TxDb.* packages
builtin_genomes()
builtin_genomes()
Gives errors if any annotations are not in builtin_annotations() (and they are not in the required custom format), basicgenes are used, or the genome prefixes are not the same for all annotations.
check_annotations(annotations)
check_annotations(annotations)
annotations |
A character vector of annotations possibly using the shortcuts |
If all the checks on the annotations pass, returns NULL to allow code to move forward.
Function to expand annotation shortcuts
expand_annotations(annotations)
expand_annotations(annotations)
annotations |
A character vector of annotations, possibly using the shortcut accessors |
A vector of data accession-ized names that are ordered from upstream to downstream in the case of knownGenes and islands to interCGI in the case of cpgs.
Function to return cell line from chromatin annotation code
get_cellline_from_code(code)
get_cellline_from_code(code)
code |
The annotation code, used in |
A string of the cell line used in a chromatin annotation code
Function to return cell line from chromatin annotation shortcut
get_cellline_from_shortcut(shortcut)
get_cellline_from_shortcut(shortcut)
shortcut |
The annotation shortcut, used in |
A string of the cell line used in a chromatin annotation shortcut
Function to get correct org.* package name based on genome
get_orgdb_name(genome = annotatr::builtin_genomes())
get_orgdb_name(genome = annotatr::builtin_genomes())
genome |
A string giving the genome assembly. |
A string giving the correct org for org.db packages. e.g. hg19 -> Hs.
Function to get correct TxDb.* package name based on genome
get_txdb_name(genome = annotatr::builtin_genomes())
get_txdb_name(genome = annotatr::builtin_genomes())
genome |
A string giving the genome assembly. |
A string giving the name of the correct TxDb.* package name based on genome
.
Given a GRanges
of annotated regions, plot the number of regions with the corresponding genomic annotations used in annotation_order
. If a region is annotated to multiple annotations of the same annot.type
, the region will only be counted once in the corresponding bar plot. For example, if a region were annotated to multiple exons, it would only count once toward the exon bar in the plot, but if it were annotated to an exon and an intron, it would count towards both.
plot_annotation( annotated_regions, annotated_random, annotation_order = NULL, plot_title, x_label, y_label, quiet = FALSE )
plot_annotation( annotated_regions, annotated_random, annotation_order = NULL, plot_title, x_label, y_label, quiet = FALSE )
annotated_regions |
The |
annotated_random |
The |
annotation_order |
A character vector which doubles as the subset of annotations desired for the plot as well as the ordering. If |
plot_title |
A string used for the title of the plot. If missing, no title is displayed. |
x_label |
A string used for the x-axis label. If missing, no x-axis label is displayed. |
y_label |
A string used for the y-axis label. If missing, no y-axis label is displayed. |
quiet |
Print progress messages (FALSE) or not (TRUE). |
A ggplot
object which can be viewed by calling it, saved with ggplot2::ggsave
, or edited.
######################################################################## # An example of ChIP-seq peaks with signalValue used for score # Get premade CpG annotations data('annotations', package = 'annotatr') chip_bed = system.file('extdata', 'Gm12878_Stat3_chr2.bed.gz', package = 'annotatr') chip_regions = read_regions(con = chip_bed, genome = 'hg19') chip_rnd = randomize_regions(regions = chip_regions) chip_annots = annotate_regions( regions = chip_regions, annotations = annotations, ignore.strand = TRUE) chip_rnd_annots = annotate_regions( regions = chip_rnd, annotations = annotations, ignore.strand = TRUE) annots_order = c( 'hg19_cpg_islands', 'hg19_cpg_shores') p_annots = plot_annotation(annotated_regions = chip_annots, annotation_order = annots_order) p_annots_rnd = plot_annotation(annotated_regions = chip_annots, annotated_random = chip_rnd_annots, annotation_order = annots_order)
######################################################################## # An example of ChIP-seq peaks with signalValue used for score # Get premade CpG annotations data('annotations', package = 'annotatr') chip_bed = system.file('extdata', 'Gm12878_Stat3_chr2.bed.gz', package = 'annotatr') chip_regions = read_regions(con = chip_bed, genome = 'hg19') chip_rnd = randomize_regions(regions = chip_regions) chip_annots = annotate_regions( regions = chip_regions, annotations = annotations, ignore.strand = TRUE) chip_rnd_annots = annotate_regions( regions = chip_rnd, annotations = annotations, ignore.strand = TRUE) annots_order = c( 'hg19_cpg_islands', 'hg19_cpg_shores') p_annots = plot_annotation(annotated_regions = chip_annots, annotation_order = annots_order) p_annots_rnd = plot_annotation(annotated_regions = chip_annots, annotated_random = chip_rnd_annots, annotation_order = annots_order)
Given a GRanges
of annotated regions from annotate_regions()
, visualize the the distribution of categorical data fill
in categorical data x
. A bar representing the distribution of all fill
in x
will be added according to the contents of fill
. This is the distribution over all values of x
. Additionally, when annotated_random
is not missing, a "Random Regions" bar shows the distribution of random regions over fill
.
plot_categorical( annotated_regions, annotated_random, x, fill = NULL, x_order = NULL, fill_order = NULL, position = "stack", plot_title, legend_title, x_label, y_label, quiet = FALSE )
plot_categorical( annotated_regions, annotated_random, x, fill = NULL, x_order = NULL, fill_order = NULL, position = "stack", plot_title, legend_title, x_label, y_label, quiet = FALSE )
annotated_regions |
The |
annotated_random |
The |
x |
One of 'annot.type' or a categorical data column, indicating whether annotation classes or data classes will appear on the x-axis. |
fill |
One of 'annot.type', a categorical data column, or |
x_order |
A character vector that subsets and orders the x classes. Default |
fill_order |
A character vector that subsets and orders the fill classes. Default |
position |
A string which has the same possible values as in |
plot_title |
A string used for the title of the plot. If missing, no title is displayed. |
legend_title |
A string used for the legend title to describe fills (if fill is not |
x_label |
A string used for the x-axis label. If missing, corresponding variable name used. |
y_label |
A string used for the y-axis label. If missing, corresponding variable name used. |
quiet |
Print progress messages (FALSE) or not (TRUE). |
For example, if a differentially methylated region has the categorical label hyper, and is annotated to a promoter, a 5UTR, two exons, and an intron. Each annotation will appear in the All bar once. Likewise for the hyper bar if the differential methylation status is chosen as x
with annot.type
chosen as fill
.
A ggplot
object which can be viewed by calling it, or saved with ggplot2::ggsave
.
# Get premade CpG annotations data('annotations', package = 'annotatr') dm_file = system.file('extdata', 'IDH2mut_v_NBM_multi_data_chr9.txt.gz', package = 'annotatr') extraCols = c(diff_meth = 'numeric', mu1 = 'numeric', mu0 = 'numeric') dm_regions = read_regions(con = dm_file, extraCols = extraCols, genome = 'hg19', rename_score = 'pval', rename_name = 'DM_status', format = 'bed') dm_regions = dm_regions[1:1000] dm_annots = annotate_regions( regions = dm_regions, annotations = annotations, ignore.strand = TRUE) dm_order = c( 'hyper', 'hypo') cpg_order = c( 'hg19_cpg_islands', 'hg19_cpg_shores', 'hg19_cpg_shelves', 'hg19_cpg_inter') dm_vn = plot_categorical( annotated_regions = dm_annots, x = 'DM_status', fill = 'annot.type', x_order = dm_order, fill_order = cpg_order, position = 'fill', legend_title = 'knownGene Annotations', x_label = 'DM status', y_label = 'Proportion') # Create randomized regions dm_rnd_regions = randomize_regions(regions = dm_regions) dm_rnd_annots = annotate_regions( regions = dm_rnd_regions, annotations = annotations, ignore.strand = TRUE) dm_vn_rnd = plot_categorical( annotated_regions = dm_annots, annotated_random = dm_rnd_annots, x = 'DM_status', fill = 'annot.type', x_order = dm_order, fill_order = cpg_order, position = 'fill', legend_title = 'knownGene Annotations', x_label = 'DM status', y_label = 'Proportion')
# Get premade CpG annotations data('annotations', package = 'annotatr') dm_file = system.file('extdata', 'IDH2mut_v_NBM_multi_data_chr9.txt.gz', package = 'annotatr') extraCols = c(diff_meth = 'numeric', mu1 = 'numeric', mu0 = 'numeric') dm_regions = read_regions(con = dm_file, extraCols = extraCols, genome = 'hg19', rename_score = 'pval', rename_name = 'DM_status', format = 'bed') dm_regions = dm_regions[1:1000] dm_annots = annotate_regions( regions = dm_regions, annotations = annotations, ignore.strand = TRUE) dm_order = c( 'hyper', 'hypo') cpg_order = c( 'hg19_cpg_islands', 'hg19_cpg_shores', 'hg19_cpg_shelves', 'hg19_cpg_inter') dm_vn = plot_categorical( annotated_regions = dm_annots, x = 'DM_status', fill = 'annot.type', x_order = dm_order, fill_order = cpg_order, position = 'fill', legend_title = 'knownGene Annotations', x_label = 'DM status', y_label = 'Proportion') # Create randomized regions dm_rnd_regions = randomize_regions(regions = dm_regions) dm_rnd_annots = annotate_regions( regions = dm_rnd_regions, annotations = annotations, ignore.strand = TRUE) dm_vn_rnd = plot_categorical( annotated_regions = dm_annots, annotated_random = dm_rnd_annots, x = 'DM_status', fill = 'annot.type', x_order = dm_order, fill_order = cpg_order, position = 'fill', legend_title = 'knownGene Annotations', x_label = 'DM status', y_label = 'Proportion')
All co-occurring annotations associated with a region are computed and displayed as a heatmap.
plot_coannotations( annotated_regions, annotation_order = NULL, plot_title, axes_label, quiet = FALSE )
plot_coannotations( annotated_regions, annotation_order = NULL, plot_title, axes_label, quiet = FALSE )
annotated_regions |
The |
annotation_order |
A character vector which doubles as the subset of annotations desired for plot as well as the ordering. If |
plot_title |
A string used for the title of the plot. If missing, no plot title label is displayed. |
axes_label |
A string used for the axis labels. If missing, corresponding variable name used. |
quiet |
Print progress messages (FALSE) or not (TRUE). |
As with plot_annotation()
, the number in each cell is the number of unique regions annotated to the pair of annotations.
For example, if a region is annotated to both a CpG shore and to two different exons simultaneously, the region will only be counted once in the CpG shore / exon cell. NOTE, this same region will count once in both the CpG shore and exon cells on the diagonal.
A ggplot
object which can be viewed by calling it, saved with ggplot2::ggsave
, or edited.
# Get premade CpG annotations data('annotations', package = 'annotatr') dm_file = system.file('extdata', 'IDH2mut_v_NBM_multi_data_chr9.txt.gz', package = 'annotatr') extraCols = c(diff_meth = 'numeric', mu1 = 'numeric', mu0 = 'numeric') dm_regions = read_regions(con = dm_file, extraCols = extraCols, rename_score = 'pval', rename_name = 'DM_status', format = 'bed') dm_regions = dm_regions[1:1000] dm_annots = annotate_regions( regions = dm_regions, annotations = annotations, ignore.strand = TRUE) all_order = c( 'hg19_cpg_islands', 'hg19_cpg_shores', 'hg19_cpg_shelves', 'hg19_cpg_inter') dm_vs_ca = plot_coannotations( annotated_regions = dm_annots, annotation_order = all_order, axes_label = 'Annotations', plot_title = 'Co-occurrence of Annotations')
# Get premade CpG annotations data('annotations', package = 'annotatr') dm_file = system.file('extdata', 'IDH2mut_v_NBM_multi_data_chr9.txt.gz', package = 'annotatr') extraCols = c(diff_meth = 'numeric', mu1 = 'numeric', mu0 = 'numeric') dm_regions = read_regions(con = dm_file, extraCols = extraCols, rename_score = 'pval', rename_name = 'DM_status', format = 'bed') dm_regions = dm_regions[1:1000] dm_annots = annotate_regions( regions = dm_regions, annotations = annotations, ignore.strand = TRUE) all_order = c( 'hg19_cpg_islands', 'hg19_cpg_shores', 'hg19_cpg_shelves', 'hg19_cpg_inter') dm_vs_ca = plot_coannotations( annotated_regions = dm_annots, annotation_order = all_order, axes_label = 'Annotations', plot_title = 'Co-occurrence of Annotations')
This function produces either histograms over facet
, or x-y scatterplots over facet
. In the case of histograms over facets, the All distribution (hollow histogram with red outline) is the distribution of x
over all the regions in the data. The facet specific distributions (solid gray) are the distribution of x
over the regions in each facet. For example, a CpG with associated percent methylation annotated to a CpG island and a promoter will count once in the All distribution, but will count once each in the CpG island and promoter facet distributions.
plot_numerical( annotated_regions, x, y, facet, facet_order, bin_width = 10, plot_title, x_label, y_label, legend_facet_label, legend_cum_label, quiet = FALSE )
plot_numerical( annotated_regions, x, y, facet, facet_order, bin_width = 10, plot_title, x_label, y_label, legend_facet_label, legend_cum_label, quiet = FALSE )
annotated_regions |
A |
x |
A string indicating the column of the |
y |
A string indicating the column of the |
facet |
A string, or character vector of two strings, indicating indicating which categorical variable(s) in the |
facet_order |
A character vector, or list of character vectors if |
bin_width |
An integer indicating the bin width of the histogram used for score. Default 10. Select something appropriate for the data. NOTE: This is only used if |
plot_title |
A string used for the title of the plot. If missing, no title is displayed. |
x_label |
A string used for the x-axis label. If missing, no x-axis label is displayed. |
y_label |
A string used for the y-axis label. If missing, no y-axis label is displayed. |
legend_facet_label |
A string used to label the gray bar portion of the legend. Defaults to "x in facet". |
legend_cum_label |
A string used to label the red outline portion of the legend. Defaults to "All in x". |
quiet |
Print progress messages (FALSE) or not (TRUE). |
A ggplot
object which can be viewed by calling it, or saved with ggplot2::ggsave
.
# An example with multi-columned data # Get premade CpG annotations data('annotations', package = 'annotatr') dm_file = system.file('extdata', 'IDH2mut_v_NBM_multi_data_chr9.txt.gz', package = 'annotatr') extraCols = c(diff_meth = 'numeric', mu1 = 'numeric', mu0 = 'numeric') dm_regions = read_regions(con = dm_file, extraCols = extraCols, rename_score = 'pval', rename_name = 'DM_status', format = 'bed') dm_regions = dm_regions[1:1000] # Annotate the regions dm_annots = annotate_regions( regions = dm_regions, annotations = annotations, ignore.strand = TRUE) # Plot histogram of group 1 methylation rates across the CpG annotations. # NOTE: Overall distribution (everything in \code{facet_order}) # is plotted in each facet for comparison. dm_vs_regions_mu1 = plot_numerical( annotated_regions = dm_annots, x = 'mu1', facet = 'annot.type', facet_order = c('hg19_cpg_islands','hg19_cpg_shores', 'hg19_cpg_shelves','hg19_cpg_inter'), bin_width = 5, plot_title = 'Group 1 Methylation over CpG Annotations', x_label = 'Group 1 Methylation') # Plot histogram of group 1 methylation rates across the CpG annotations # crossed with DM_status dm_vs_regions_diffmeth = plot_numerical( annotated_regions = dm_annots, x = 'diff_meth', facet = c('annot.type','DM_status'), facet_order = list( c('hg19_genes_promoters','hg19_genes_5UTRs','hg19_cpg_islands'), c('hyper','hypo','none')), bin_width = 5, plot_title = 'Group 0 Region Methylation In Genes', x_label = 'Methylation Difference') # Can also use the result of annotate_regions() to plot two numerical # data columns against each other for each region, and facet by annotations. dm_vs_regions_annot = plot_numerical( annotated_regions = dm_annots, x = 'mu0', y = 'mu1', facet = 'annot.type', facet_order = c('hg19_cpg_islands','hg19_cpg_shores', 'hg19_cpg_shelves','hg19_cpg_inter'), plot_title = 'Region Methylation: Group 0 vs Group 1', x_label = 'Group 0', y_label = 'Group 1') # Another example, but using differential methylation status as the facets. dm_vs_regions_name = plot_numerical( annotated_regions = dm_annots, x = 'mu0', y = 'mu1', facet = 'DM_status', facet_order = c('hyper','hypo','none'), plot_title = 'Region Methylation: Group 0 vs Group 1', x_label = 'Group 0', y_label = 'Group 1')
# An example with multi-columned data # Get premade CpG annotations data('annotations', package = 'annotatr') dm_file = system.file('extdata', 'IDH2mut_v_NBM_multi_data_chr9.txt.gz', package = 'annotatr') extraCols = c(diff_meth = 'numeric', mu1 = 'numeric', mu0 = 'numeric') dm_regions = read_regions(con = dm_file, extraCols = extraCols, rename_score = 'pval', rename_name = 'DM_status', format = 'bed') dm_regions = dm_regions[1:1000] # Annotate the regions dm_annots = annotate_regions( regions = dm_regions, annotations = annotations, ignore.strand = TRUE) # Plot histogram of group 1 methylation rates across the CpG annotations. # NOTE: Overall distribution (everything in \code{facet_order}) # is plotted in each facet for comparison. dm_vs_regions_mu1 = plot_numerical( annotated_regions = dm_annots, x = 'mu1', facet = 'annot.type', facet_order = c('hg19_cpg_islands','hg19_cpg_shores', 'hg19_cpg_shelves','hg19_cpg_inter'), bin_width = 5, plot_title = 'Group 1 Methylation over CpG Annotations', x_label = 'Group 1 Methylation') # Plot histogram of group 1 methylation rates across the CpG annotations # crossed with DM_status dm_vs_regions_diffmeth = plot_numerical( annotated_regions = dm_annots, x = 'diff_meth', facet = c('annot.type','DM_status'), facet_order = list( c('hg19_genes_promoters','hg19_genes_5UTRs','hg19_cpg_islands'), c('hyper','hypo','none')), bin_width = 5, plot_title = 'Group 0 Region Methylation In Genes', x_label = 'Methylation Difference') # Can also use the result of annotate_regions() to plot two numerical # data columns against each other for each region, and facet by annotations. dm_vs_regions_annot = plot_numerical( annotated_regions = dm_annots, x = 'mu0', y = 'mu1', facet = 'annot.type', facet_order = c('hg19_cpg_islands','hg19_cpg_shores', 'hg19_cpg_shelves','hg19_cpg_inter'), plot_title = 'Region Methylation: Group 0 vs Group 1', x_label = 'Group 0', y_label = 'Group 1') # Another example, but using differential methylation status as the facets. dm_vs_regions_name = plot_numerical( annotated_regions = dm_annots, x = 'mu0', y = 'mu1', facet = 'DM_status', facet_order = c('hyper','hypo','none'), plot_title = 'Region Methylation: Group 0 vs Group 1', x_label = 'Group 0', y_label = 'Group 1')
Plot numerical data associated with regions occurring in annot1
, annot2
and in both. As with plot_numerical()
, the result is a plot of histograms or x-y scatterplots.
plot_numerical_coannotations( annotated_regions, x, y, annot1, annot2, bin_width = 10, plot_title, x_label, y_label, legend_facet_label, legend_cum_label, quiet = FALSE )
plot_numerical_coannotations( annotated_regions, x, y, annot1, annot2, bin_width = 10, plot_title, x_label, y_label, legend_facet_label, legend_cum_label, quiet = FALSE )
annotated_regions |
A |
x |
A string indicating the column of the |
y |
A string indicating the column of the |
annot1 |
A string indicating the first annotation type. |
annot2 |
A string indicating the second annotation type. |
bin_width |
An integer indicating the bin width of the histogram used for score. Default 10. Select something appropriate for the data. NOTE: This is only used if |
plot_title |
A string used for the title of the plot. If missing, no title is displayed. |
x_label |
A string used for the x-axis label. If missing, no x-axis label is displayed. |
y_label |
A string used for the y-axis label. If missing, no y-axis label is displayed. |
legend_facet_label |
A string used to label the gray bar portion of the legend. Defaults to "x in annot pair". |
legend_cum_label |
A string used to label the red outline portion of the legend. Defaults to "All x". |
quiet |
Print progress messages (FALSE) or not (TRUE). |
For example, a CpG with associated percent methylation annotated to a CpG island and a promoter will count once in the All distribution and once in the CpG island / promoter facet distribution. However, a CpG associated only with a promoter will count once in the All distribution and once in the promoter / promoter distribution.
A ggplot
object which can be viewed by calling it, or saved with ggplot2::ggsave
.
# Get premade CpG annotations data('annotations', package = 'annotatr') dm_file = system.file('extdata', 'IDH2mut_v_NBM_multi_data_chr9.txt.gz', package = 'annotatr') extraCols = c(diff_meth = 'numeric', mu1 = 'numeric', mu0 = 'numeric') dm_regions = read_regions(con = dm_file, extraCols = extraCols, rename_score = 'pval', rename_name = 'DM_status', format = 'bed') dm_regions = dm_regions[1:1000] dm_annots = annotate_regions( regions = dm_regions, annotations = annotations, ignore.strand = TRUE) dm_vs_num_co = plot_numerical_coannotations( annotated_regions = dm_annots, x = 'mu0', annot1 = 'hg19_cpg_islands', annot2 = 'hg19_cpg_shelves', bin_width = 5, plot_title = 'Group 0 Perc. Meth. in CpG Islands and Promoters', x_label = 'Percent Methylation')
# Get premade CpG annotations data('annotations', package = 'annotatr') dm_file = system.file('extdata', 'IDH2mut_v_NBM_multi_data_chr9.txt.gz', package = 'annotatr') extraCols = c(diff_meth = 'numeric', mu1 = 'numeric', mu0 = 'numeric') dm_regions = read_regions(con = dm_file, extraCols = extraCols, rename_score = 'pval', rename_name = 'DM_status', format = 'bed') dm_regions = dm_regions[1:1000] dm_annots = annotate_regions( regions = dm_regions, annotations = annotations, ignore.strand = TRUE) dm_vs_num_co = plot_numerical_coannotations( annotated_regions = dm_annots, x = 'mu0', annot1 = 'hg19_cpg_islands', annot2 = 'hg19_cpg_shelves', bin_width = 5, plot_title = 'Group 0 Perc. Meth. in CpG Islands and Promoters', x_label = 'Percent Methylation')
randomize_regions
is a wrapper function for regioneR::randomizeRegions()
that simplifies the creation of randomized regions for an input set of regions read with read_regions()
. It relies on the seqlengths
of regions
in order to build the appropriate genome
object for regioneR::randomizeRegions()
.
randomize_regions( regions, allow.overlaps = TRUE, per.chromosome = TRUE, quiet = FALSE )
randomize_regions( regions, allow.overlaps = TRUE, per.chromosome = TRUE, quiet = FALSE )
regions |
A |
allow.overlaps |
A logical stating whether random regions can overlap input regions (TRUE) or not (FALSE). Default TRUE. |
per.chromosome |
A logical stating whether the random regions should remain on the same chromosome (TRUE) or not (FALSE). Default TRUE. |
quiet |
Print progress messages (FALSE) or not (TRUE). |
NOTE: The data associated with the input regions
are not passed on to the random regions.
A GRanges
object of randomized regions based on regions
from read_regions()
. NOTE: Data associated with the original regions is not attached to the randomized regions.
# Create random region set based on ENCODE ChIP-seq data file = system.file('extdata', 'Gm12878_Stat3_chr2.bed.gz', package = 'annotatr') r = read_regions(con = file, genome = 'hg19') random_r = randomize_regions(regions = r)
# Create random region set based on ENCODE ChIP-seq data file = system.file('extdata', 'Gm12878_Stat3_chr2.bed.gz', package = 'annotatr') r = read_regions(con = file, genome = 'hg19') random_r = randomize_regions(regions = r)
read_annotations()
is a wrapper for the rtracklayer::import()
function that creates a GRanges
object matching the structure of annotations built with build_annotations()
. The structure is defined by GRanges
, with the mcols()
with names c('id','gene_id','symbol','type')
.
read_annotations(con, name, genome = NA, format, extraCols = character(), ...)
read_annotations(con, name, genome = NA, format, extraCols = character(), ...)
con |
A path, URL, connection or BEDFile object. See |
name |
A string for the name of the annotations to be used in the name of the object, [genome]_custom_[name] |
genome |
From |
format |
From |
extraCols |
From |
... |
Parameters to pass onto the format-specific method of |
A GRanges
object stored in annotatr_cache
. To view a custom annotation, do annotatr_cache$get(name)
. To add a custom annotation to the set of annotations, include '[genome]_custom_[name]'
in the call to build_annotations()
. See example below.
# Read in a BED3 file as a custom annotation file = system.file('extdata', 'test_annotations_3.bed', package='annotatr') read_annotations(con = file, name = 'test', genome = 'hg19') build_annotations(genome = 'hg19', annotations = 'hg19_custom_test') print(annotatr_cache$get('hg19_custom_test'))
# Read in a BED3 file as a custom annotation file = system.file('extdata', 'test_annotations_3.bed', package='annotatr') read_annotations(con = file, name = 'test', genome = 'hg19') build_annotations(genome = 'hg19', annotations = 'hg19_custom_test') print(annotatr_cache$get('hg19_custom_test'))
read_regions()
reads genomic regions by calling the rtracklayer::import()
function. This function can automatically deal with BEDX files from BED3 to BED6. For BED6+Y, the extraCols
argument should be used to correctly interpret the extra columns.
read_regions( con, genome = NA, format, extraCols = character(), rename_name, rename_score, ... )
read_regions( con, genome = NA, format, extraCols = character(), rename_name, rename_score, ... )
con |
A path, URL, connection or BEDFile object. See |
genome |
From |
format |
From |
extraCols |
From |
rename_name |
A string to rename the name column of the BED file. For example, if the name column actually contains a categorical variable. |
rename_score |
A string to rename the score column of the BED file. For example, if the score column represents a quantity about the data besides the score in the BED specification. |
... |
Parameters to pass onto the format-specific method of |
NOTE: The name
(4th) and score
(5th) columns are so named. If these columns have a particular meaning for your data, they should be renamed with the rename_name
and/or rename_score
parameters.
A GRanges
object.
# Example of reading a BED6+3 file where the last 3 columns are non-standard file = system.file('extdata', 'IDH2mut_v_NBM_multi_data_chr9.txt.gz', package = 'annotatr') extraCols = c(diff_meth = 'numeric', mu0 = 'numeric', mu1 = 'numeric') gr = read_regions(con = file, genome = 'hg19', extraCols = extraCols, format = 'bed', rename_name = 'DM_status', rename_score = 'pval')
# Example of reading a BED6+3 file where the last 3 columns are non-standard file = system.file('extdata', 'IDH2mut_v_NBM_multi_data_chr9.txt.gz', package = 'annotatr') extraCols = c(diff_meth = 'numeric', mu0 = 'numeric', mu1 = 'numeric') gr = read_regions(con = file, genome = 'hg19', extraCols = extraCols, format = 'bed', rename_name = 'DM_status', rename_score = 'pval')
Function to recode classes from chromHMM type column
reformat_hmm_codes(hmm_codes)
reformat_hmm_codes(hmm_codes)
hmm_codes |
in the original form from UCSC Genome Browser track. |
A character vector of chromHMM classes with numbers and underscores removed.
Function to subset a tbl_df or grouped_df by a column
subset_order_tbl(tbl, col, col_order)
subset_order_tbl(tbl, col, col_order)
tbl |
A |
col |
A string indicating which column of of |
col_order |
A character vector indicating the order of |
A modified version of summary
with col
subsetted by col_order
.
Given a GRanges
of annotated regions, count the number of regions in each annotation type. If annotated_random
is not NULL
, then the same is computed for the random regions.
summarize_annotations(annotated_regions, annotated_random, quiet = FALSE)
summarize_annotations(annotated_regions, annotated_random, quiet = FALSE)
annotated_regions |
The |
annotated_random |
The |
quiet |
Print progress messages (FALSE) or not (TRUE). |
If a region is annotated to multiple annotations of the same annot.type
, the region will only be counted once. For example, if a region were annotated to multiple exons, it would only count once toward the exons, but if it were annotated to an exon and an intron, it would count towards both.
A tbl_df
of the number of regions per annotation type.
### An example of ChIP-seq peaks with signalValue # Get premade CpG annotations data('annotations', package = 'annotatr') file = system.file('extdata', 'Gm12878_Stat3_chr2.bed.gz', package = 'annotatr') r = read_regions(con = file, genome = 'hg19') a = annotate_regions( regions = r, annotations = annotations, ignore.strand = TRUE, quiet = FALSE) rnd = randomize_regions(regions = r) rnd_annots = annotate_regions( regions = rnd, annotations = annotations, ignore.strand = TRUE, quiet = FALSE) # Summarize the annotated regions without randomized regions s = summarize_annotations(annotated_regions = a) # Summarize the annotated regions with randomized regions s_rnd = summarize_annotations( annotated_regions = a, annotated_random = rnd_annots)
### An example of ChIP-seq peaks with signalValue # Get premade CpG annotations data('annotations', package = 'annotatr') file = system.file('extdata', 'Gm12878_Stat3_chr2.bed.gz', package = 'annotatr') r = read_regions(con = file, genome = 'hg19') a = annotate_regions( regions = r, annotations = annotations, ignore.strand = TRUE, quiet = FALSE) rnd = randomize_regions(regions = r) rnd_annots = annotate_regions( regions = rnd, annotations = annotations, ignore.strand = TRUE, quiet = FALSE) # Summarize the annotated regions without randomized regions s = summarize_annotations(annotated_regions = a) # Summarize the annotated regions with randomized regions s_rnd = summarize_annotations( annotated_regions = a, annotated_random = rnd_annots)
Given a GRanges
of annotated regions, count the number of regions when the annotations are grouped by
categorical columns.
summarize_categorical( annotated_regions, by = c("annot.type", "annot.id"), quiet = FALSE )
summarize_categorical( annotated_regions, by = c("annot.type", "annot.id"), quiet = FALSE )
annotated_regions |
The |
by |
A character vector to group the data in |
quiet |
Print progress messages (FALSE) or not (TRUE). |
If a region is annotated to multiple annotations of the same annot.type
, the region will only be counted once. For example, if a region were annotated to multiple exons, it would only count once toward the exons, but if it were annotated to an exon and an intron, it would count towards both.
A grouped dplyr::tbl_df
of the counts of groupings according to the by
vector.
# Get premade CpG annotations data('annotations', package = 'annotatr') r_file = system.file('extdata', 'test_read_multiple_data_nohead.bed', package='annotatr') extraCols = c(pval = 'numeric', mu1 = 'integer', mu0 = 'integer', diff_exp = 'character') r = read_regions(con = r_file, genome = 'hg19', extraCols = extraCols, rename_score = 'coverage') a = annotate_regions( regions = r, annotations = annotations, ignore.strand = TRUE) sc = summarize_categorical( annotated_regions = a, by = c('annot.type', 'name'), quiet = FALSE)
# Get premade CpG annotations data('annotations', package = 'annotatr') r_file = system.file('extdata', 'test_read_multiple_data_nohead.bed', package='annotatr') extraCols = c(pval = 'numeric', mu1 = 'integer', mu0 = 'integer', diff_exp = 'character') r = read_regions(con = r_file, genome = 'hg19', extraCols = extraCols, rename_score = 'coverage') a = annotate_regions( regions = r, annotations = annotations, ignore.strand = TRUE) sc = summarize_categorical( annotated_regions = a, by = c('annot.type', 'name'), quiet = FALSE)
Given a GRanges
of annotated regions, summarize numerical data columns based on a grouping.
summarize_numerical( annotated_regions, by = c("annot.type", "annot.id"), over, quiet = FALSE )
summarize_numerical( annotated_regions, by = c("annot.type", "annot.id"), over, quiet = FALSE )
annotated_regions |
The |
by |
A character vector of the columns of |
over |
A character vector of the numerical columns in |
quiet |
Print progress messages (FALSE) or not (TRUE). |
NOTE: We do not take the distinct values of seqnames
, start
, end
, annot.type
as in the other summarize_*()
functions because in the case of a region that intersected two distinct exons, using distinct()
would destroy the information of the mean of the numerical column over one of the exons, which is not desirable.
A grouped dplyr::tbl_df
, and the count
, mean
, and sd
of the cols
by
the groupings.
### Test on a very simple bed file to demonstrate different options # Get premade CpG annotations data('annotations', package = 'annotatr') r_file = system.file('extdata', 'test_read_multiple_data_nohead.bed', package='annotatr') extraCols = c(pval = 'numeric', mu1 = 'integer', mu0 = 'integer', diff_exp = 'character') r = read_regions(con = r_file, genome = 'hg19', extraCols = extraCols, rename_score = 'coverage') a = annotate_regions( regions = r, annotations = annotations, ignore.strand = TRUE) # Testing over normal by sn1 = summarize_numerical( annotated_regions = a, by = c('annot.type', 'annot.id'), over = c('coverage', 'mu1', 'mu0'), quiet = FALSE) # Testing over a different by sn2 = summarize_numerical( annotated_regions = a, by = c('diff_exp'), over = c('coverage', 'mu1', 'mu0'))
### Test on a very simple bed file to demonstrate different options # Get premade CpG annotations data('annotations', package = 'annotatr') r_file = system.file('extdata', 'test_read_multiple_data_nohead.bed', package='annotatr') extraCols = c(pval = 'numeric', mu1 = 'integer', mu0 = 'integer', diff_exp = 'character') r = read_regions(con = r_file, genome = 'hg19', extraCols = extraCols, rename_score = 'coverage') a = annotate_regions( regions = r, annotations = annotations, ignore.strand = TRUE) # Testing over normal by sn1 = summarize_numerical( annotated_regions = a, by = c('annot.type', 'annot.id'), over = c('coverage', 'mu1', 'mu0'), quiet = FALSE) # Testing over a different by sn2 = summarize_numerical( annotated_regions = a, by = c('diff_exp'), over = c('coverage', 'mu1', 'mu0'))
Function to tidy up annotation accessors for visualization
tidy_annotations(annotations)
tidy_annotations(annotations)
annotations |
A character vector of annotations, in the order they are to appear in the visualization. |
A list of mappings from original annotation names to names ready for visualization.