Title: | GenomicDistributions: fast analysis of genomic intervals with Bioconductor |
---|---|
Description: | If you have a set of genomic ranges, this package can help you with visualization and comparison. It produces several kinds of plots, for example: Chromosome distribution plots, which visualize how your regions are distributed over chromosomes; feature distance distribution plots, which visualizes how your regions are distributed relative to a feature of interest, like Transcription Start Sites (TSSs); genomic partition plots, which visualize how your regions overlap given genomic features such as promoters, introns, exons, or intergenic regions. It also makes it easy to compare one set of ranges to another. |
Authors: | Kristyna Kupkova [aut, cre], Jose Verdezoto [aut], Tessa Danehy [aut], John Lawson [aut], Jose Verdezoto [aut], Michal Stolarczyk [aut], Jason Smith [aut], Bingjie Xue [aut], Sophia Rogers [aut], John Stubbs [aut], Nathan C. Sheffield [aut] |
Maintainer: | Kristyna Kupkova <[email protected]> |
License: | BSD_2_clause + file LICENSE |
Version: | 1.15.0 |
Built: | 2024-11-05 06:01:21 UTC |
Source: | https://github.com/bioc/GenomicDistributions |
Checks to make sure a package object is installed, and if so, returns it. If the library is not installed, it issues a warning and returns NULL.
.requireAndReturn(BSgenomeString)
.requireAndReturn(BSgenomeString)
BSgenomeString |
A BSgenome compatible genome string. |
A BSgenome object if installed.
Checks class of the list of variables. To be used in functions
.validateInputs(checkList)
.validateInputs(checkList)
checkList |
list of object to check, e.g. list(varname=c("data.frame", "numeric")). Multiuple strings in the vector are treated as OR. |
A warning if the wrong input class is provided.
x = function(var1) { cl = list(var1=c("numeric","character")) .validateInputs(cl) return(var1^2) }
x = function(var1) { cl = list(var1=c("numeric","character")) .validateInputs(cl) return(var1^2) }
Given a BSgenome object (to be loaded via loadBSgenome
), and a number
of bins, this will bin that genome. It is a simple wrapper of the
binChroms
function
binBSGenome(genome, binCount)
binBSGenome(genome, binCount)
genome |
A UCSC-style string denoting reference assembly (e.g. 'hg38') |
binCount |
number of bins per chromosome |
A data.table object showing the region and bin IDs of the reference genome.
## Not run: binCount = 1000 refGenomeBins = binBSGenome("hg19", binCount) ## End(Not run)
## Not run: binCount = 1000 refGenomeBins = binBSGenome("hg19", binCount) ## End(Not run)
Given a list of chromosomes with corresponding sizes, this script will produce (roughly) evenly-sized bins across the chromosomes. It does not account for assembly gaps or the like.
binChroms(binCount, chromSizes)
binChroms(binCount, chromSizes)
binCount |
number of bins (total; *not* per chromosome) |
chromSizes |
a named list of size (length) for each chromosome. |
A data.table object assigning a bin ID to each chromosome region.
chromSizes = c(chr1=249250621, chr2=243199373, chr3=198022430) cBins = binChroms(1000, chromSizes)
chromSizes = c(chr1=249250621, chr2=243199373, chr3=198022430) cBins = binChroms(1000, chromSizes)
Given a start coordinate, end coordinate, and number of bins to divide, this function will split the regions into that many bins. Bins will be only approximately the same size, due to rounding. (they should not be more than 1 different).
binRegion(start, end, binSize = NULL, binCount = NULL, indicator = NULL)
binRegion(start, end, binSize = NULL, binCount = NULL, indicator = NULL)
start |
The starting coordinate |
end |
The ending coordinate |
binSize |
The size of bin to divide the genome into. You must supply either binSize (priority) or binCount. |
binCount |
The number of bins to divide. If you do not supply binSize, you must supply binCount, which will be used to calculate the binSize. |
indicator |
A vector with identifiers to keep with your bins, in case you are doing this on a long table with multiple segments concatenated |
Use case: take a set of regions, like CG islands, and bin them; now you can aggregate signal scores across the bins, giving you an aggregate signal in bins across many regions of the same type.
In theory, this just runs on 3 values, but you can run it inside a data.table j expression to divide a bunch of regions in the same way.
A data.table, expanded to nrow = number of bins, with these id columns: id: region ID binID: repeating ID (this is the value to aggregate across) ubinID: unique bin IDs
Rbins = binRegion(1, 3000, 100, 1000)
Rbins = binRegion(1, 3000, 100, 1000)
Converts a list of data.tables (From BSreadbeds) into GRanges.
BSdtToGRanges(dtList)
BSdtToGRanges(dtList)
dtList |
A list of data.tables |
A GRangesList object.
Returns a data.table showing counts of regions from the query that overlap with each bin. In other words, where on which chromosomes are the ranges distributed? You must provide binned regions. Only the midpoint of each query region is used to test for overlap with the bin regions.
calcChromBins(query, bins)
calcChromBins(query, bins)
query |
A GenomicRanges or GenomicRangesList object with query regions |
bins |
Pre-computed bins (as a GRangesList object) to aggregate over; for example, these could be genome bins |
A data.table showing where on which chromosomes ranges are distributed.
chromSizes = getChromSizes("hg19") genomeBins = getGenomeBins(chromSizes) chromDistribution = calcChromBins(vistaEnhancers, genomeBins) vistaSftd = GenomicRanges::shift(vistaEnhancers, 100000) vistaSftd2 = GenomicRanges::shift(vistaEnhancers, 200000) calcChromBins(vistaEnhancers, GRangesList(vistaSftd, vistaSftd2))
chromSizes = getChromSizes("hg19") genomeBins = getGenomeBins(chromSizes) chromDistribution = calcChromBins(vistaEnhancers, genomeBins) vistaSftd = GenomicRanges::shift(vistaEnhancers, 100000) vistaSftd2 = GenomicRanges::shift(vistaEnhancers, 200000) calcChromBins(vistaEnhancers, GRangesList(vistaSftd, vistaSftd2))
calcChromBins
, which is more general.Returns the distribution of query over a reference assembly
Given a query set of elements (a GRanges object) and a reference assembly
(*e.g. 'hg38'), this will aggregate and count the distribution of the query
elements across bins of the reference genome. This is a helper function to
create features for common genomes. It is a wrapper of
calcChromBins
, which is more general.
calcChromBinsRef(query, refAssembly, binCount = 3000)
calcChromBinsRef(query, refAssembly, binCount = 3000)
query |
A GenomicRanges or GenomicRangesList object with query regions |
refAssembly |
A character vector that will be used to grab chromosome
sizes with |
binCount |
Number of bins to divide the chromosomes into |
A data.table showing the distribution of regions across bins of the reference genome.
ChromBins = calcChromBinsRef(vistaEnhancers, "hg19")
ChromBins = calcChromBinsRef(vistaEnhancers, "hg19")
calcChromBins
, which is more general.Returns the distribution of query over a reference assembly
Given a query set of elements (a GRanges object) and a reference assembly
(*e.g. 'hg38'), this will aggregate and count the distribution of the query
elements across bins of the reference genome. This is a helper function to
create features for common genomes. It is a wrapper of
calcChromBins
, which is more general.
calcChromBinsRefSlow(query, refAssembly, binCount = 3000)
calcChromBinsRefSlow(query, refAssembly, binCount = 3000)
query |
A GenomicRanges or GenomicRangesList object with query regions |
refAssembly |
A character vector that will be used to grab chromosome
sizes with |
binCount |
Number of bins to divide the chromosomes into |
A data.table showing the distribution of regions across bins of the reference genome.
ChromBins = calcChromBinsRef(vistaEnhancers, "hg19")
ChromBins = calcChromBinsRef(vistaEnhancers, "hg19")
Takes a GRanges object, then assigns each element to a partition from the provided partitionList, and then tallies the number of regions assigned to each partition. A typical example of partitions is promoter, exon, intron, etc; this function will yield the number of each for a query GRanges object There will be a priority order to these, to account for regions that may overlap multiple genomic partitions.
calcCumulativePartitions(query, partitionList, remainder = "intergenic")
calcCumulativePartitions(query, partitionList, remainder = "intergenic")
query |
GRanges or GRangesList with regions to classify. |
partitionList |
An ORDERED and NAMED list of genomic partitions GRanges. This list must be in priority order; the input will be assigned to the first partition it overlaps. |
remainder |
Which partition do you want to account for 'everything else'? |
A data.frame assigning each element of a GRanges object to a partition from a previously provided partitionList.
partitionList = genomePartitionList(geneModels_hg19$genesGR, geneModels_hg19$exonsGR, geneModels_hg19$threeUTRGR, geneModels_hg19$fiveUTRGR) calcCumulativePartitions(vistaEnhancers, partitionList)
partitionList = genomePartitionList(geneModels_hg19$genesGR, geneModels_hg19$exonsGR, geneModels_hg19$threeUTRGR, geneModels_hg19$fiveUTRGR) calcCumulativePartitions(vistaEnhancers, partitionList)
This function is a wrapper for calcCumulativePartitions
that uses
built-in partitions for a given reference genome assembly.
calcCumulativePartitionsRef(query, refAssembly)
calcCumulativePartitionsRef(query, refAssembly)
query |
A GenomicRanges or GenomicRangesList object with query regions |
refAssembly |
A character vector specifying the reference genome
assembly (*e.g.* 'hg19'). This will be used to grab chromosome sizes
with |
A data.frame indicating the number of query region overlaps in several genomic partitions.
calcCumulativePartitionsRef(vistaEnhancers, "hg19")
calcCumulativePartitionsRef(vistaEnhancers, "hg19")
Given a reference genome (BSgenome object) and ranges on the reference, this function returns a data.table with counts of dinucleotides within the GRanges object.
calcDinuclFreq(query, ref, rawCounts = FALSE)
calcDinuclFreq(query, ref, rawCounts = FALSE)
query |
A GRanges object with query sets |
ref |
Reference genome BSgenome object |
rawCounts |
a logical indicating whether the raw numbers should be displayed, rather than percentages (optional). |
A data.table with counts of dinucleotides across the GRanges object
## Not run: bsg = loadBSgenome('hg19') DNF = calcDinuclFreq(vistaEnhancers, bsg) ## End(Not run)
## Not run: bsg = loadBSgenome('hg19') DNF = calcDinuclFreq(vistaEnhancers, bsg) ## End(Not run)
Given a reference genome (BSgenome object) and ranges on the reference, this function returns a data.table with counts of dinucleotides within the GRanges object.
calcDinuclFreqRef(query, refAssembly, rawCounts = FALSE)
calcDinuclFreqRef(query, refAssembly, rawCounts = FALSE)
query |
A GRanges object with query sets |
refAssembly |
A character vector specifying the reference genome
assembly (*e.g.* 'hg19'). This will be used to grab chromosome sizes with
|
rawCounts |
a logical indicating whether the raw numbers should be displayed, rather than percentages (optional). |
A numeric vector or list of vectors with the GC percentage of the query regions.
## Not run: query = system.file("extdata", "vistaEnhancers.bed.gz", package="GenomicDistributions") GRquery = rtracklayer::import(query) refAssembly = 'hg19' DNF = calcDinuclFreqRef(GRquery, refAssembly) ## End(Not run)
## Not run: query = system.file("extdata", "vistaEnhancers.bed.gz", package="GenomicDistributions") GRquery = rtracklayer::import(query) refAssembly = 'hg19' DNF = calcDinuclFreqRef(GRquery, refAssembly) ## End(Not run)
Calculates expected partiton overlap based on contribution of each feature (partition) to genome size. Expected and observed overlaps are then compared.
calcExpectedPartitions( query, partitionList, genomeSize = NULL, remainder = "intergenic", bpProportion = FALSE )
calcExpectedPartitions( query, partitionList, genomeSize = NULL, remainder = "intergenic", bpProportion = FALSE )
query |
GRanges or GRangesList with regions to classify. |
partitionList |
An ORDERED (if bpProportion=FALSE) and NAMED list of genomic partitions GRanges. This list must be in priority order; the input will be assigned to the first partition it overlaps. However, if bpProportion=TRUE, the list does not need ordering. |
genomeSize |
The number of bases in the query genome. In other words, the sum of all chromosome sizes. |
remainder |
Which partition do you want to account for 'everything else'? |
bpProportion |
logical indicating if overlaps should be calculated based on number of base pairs overlapping with each partition. bpProportion=FALSE does overlaps in priority order, bpProportion=TRUE counts number of overlapping base pairs between query and each partition. |
A data.frame assigning each element of a GRanges object to a partition from a previously provided partitionList.The data.frame also contains Chi-square p-values calculated for observed/expected overlaps on each individual partition.
partitionList = genomePartitionList(geneModels_hg19$genesGR, geneModels_hg19$exonsGR, geneModels_hg19$threeUTRGR, geneModels_hg19$fiveUTRGR) chromSizes = getChromSizes('hg19') genomeSize = sum(chromSizes) calcExpectedPartitions(vistaEnhancers, partitionList, genomeSize)
partitionList = genomePartitionList(geneModels_hg19$genesGR, geneModels_hg19$exonsGR, geneModels_hg19$threeUTRGR, geneModels_hg19$fiveUTRGR) chromSizes = getChromSizes('hg19') genomeSize = sum(chromSizes) calcExpectedPartitions(vistaEnhancers, partitionList, genomeSize)
This function is a wrapper for calcExpectedPartitions
that uses
built-in partitions for a given reference genome assembly.
calcExpectedPartitionsRef(query, refAssembly, bpProportion = FALSE)
calcExpectedPartitionsRef(query, refAssembly, bpProportion = FALSE)
query |
A GenomicRanges or GenomicRangesList object with query regions |
refAssembly |
A character vector specifying the reference genome
assembly (*e.g.* 'hg19'). This will be used to grab annotation
models with |
bpProportion |
logical indicating if overlaps should be calculated based on number of base pairs overlapping with each partition. bpProportion=FALSE does overlaps in priority order, bpProportion=TRUE counts number of overlapping base pairs between query and each partition. |
A data.frame indicating the number of query region overlaps in several genomic partitions.
calcExpectedPartitionsRef(vistaEnhancers, "hg19")
calcExpectedPartitionsRef(vistaEnhancers, "hg19")
For a given query set of genomic regions, and a given feature set of regions, this function will return the distance for each query region to its closest feature. It ignores strand and returns the distance as positive or negative, depending on whether the feature is upstream or downstream
calcFeatureDist(query, features)
calcFeatureDist(query, features)
query |
A GRanges or GRangesList object with query sets |
features |
A GRanges object with features to test distance to |
This function is similar to the bioconductor distanceToNearest function, but returns negative values for downstream distances instead of absolute values. This allows you to assess the relative location.
A vector of genomic distances for each query region relative to its closest feature.
vistaSftd = GenomicRanges::shift(vistaEnhancers, 100000) calcFeatureDist(vistaEnhancers, vistaSftd)
vistaSftd = GenomicRanges::shift(vistaEnhancers, 100000) calcFeatureDist(vistaEnhancers, vistaSftd)
Given a query GRanges object and an assembly string, this function will grab
the TSS list for the given reference assembly and then calculate the distance
from each query feature to the closest TSS. It is a wrapper of
calcFeatureDist
that uses built-in TSS features for a reference
assembly
calcFeatureDistRefTSS(query, refAssembly)
calcFeatureDistRefTSS(query, refAssembly)
query |
A GenomicRanges or GenomicRangesList object with query regions |
refAssembly |
A character vector specifying the reference genome
assembly (*e.g.* 'hg19'). This will be used to grab chromosome sizes with
|
A vector of distances for each query region relative to TSSs.
calcFeatureDistRefTSS(vistaEnhancers, "hg19")
calcFeatureDistRefTSS(vistaEnhancers, "hg19")
Given a reference genome as a BSgenome object and some ranges on that reference, this function will return a vector of the same length as the granges object, with percent of Cs and Gs.
calcGCContent(query, ref)
calcGCContent(query, ref)
query |
A GenomicRanges or GenomicRangesList object with query regions. |
ref |
Reference genome BSgenome object. |
A numeric vector of list of vectors with the GC percentage of the query regions.
## Not run: bsg = loadBSgenome('hg19') gcvec = calcGCContent(vistaEnhancers, bsg) ## End(Not run)
## Not run: bsg = loadBSgenome('hg19') gcvec = calcGCContent(vistaEnhancers, bsg) ## End(Not run)
Given a reference genome as a BSgenome object and some ranges on that reference, this function will return a vector of the same length as the granges object, with percent of Cs and Gs.
calcGCContentRef(query, refAssembly)
calcGCContentRef(query, refAssembly)
query |
A GenomicRanges or GenomicRangesList object with query regions |
refAssembly |
A character vector specifying the reference genome
assembly (*e.g.* 'hg19'). This will be used to grab chromosome sizes with
|
A numeric vector or list of vectors with the GC percentage of the query regions.
## Not run: refAssembly = 'hg19' GCcontent = calcGCContentRef(vistaEnhancers, refAssembly) ## End(Not run)
## Not run: refAssembly = 'hg19' GCcontent = calcGCContentRef(vistaEnhancers, refAssembly) ## End(Not run)
Group regions from the same chromosome together and compute the distance of a region to its nearest neighbor. Distances are then lumped into a numeric vector.
calcNearestNeighbors(query, correctRef = "None")
calcNearestNeighbors(query, correctRef = "None")
query |
A GRanges or GRangesList object. |
correctRef |
A string indicating the reference genome to use if Nearest neighbor distances are corrected for the number of regions in a regionSet. |
A numeric vector or list of vectors containing the distance of regions to their nearest neighbors.
Nneighbors = calcNearestNeighbors(vistaEnhancers)
Nneighbors = calcNearestNeighbors(vistaEnhancers)
Group regions from the same chromosome together and calculate the distances of a region to its upstream and downstream neighboring regions. Distances are then lumped into a numeric vector.
calcNeighborDist(query, correctRef = "None")
calcNeighborDist(query, correctRef = "None")
query |
A GRanges or GRangesList object. |
correctRef |
A string indicating the reference genome to use if distances are corrected for the number of regions in a regionSet. |
A numeric vector or list with different vectors containing the distances of regions to their upstream/downstream neighbors.
dist = calcNeighborDist(vistaEnhancers)
dist = calcNeighborDist(vistaEnhancers)
Takes a GRanges object, then assigns each element to a partition from the provided partitionList, and then tallies the number of regions assigned to each partition. A typical example of partitions is promoter, exon, intron, etc; this function will yield the number of each for a query GRanges object There will be a priority order to these, to account for regions that may overlap multiple genomic partitions.
calcPartitions( query, partitionList, remainder = "intergenic", bpProportion = FALSE )
calcPartitions( query, partitionList, remainder = "intergenic", bpProportion = FALSE )
query |
GRanges or GRangesList with regions to classify |
partitionList |
an ORDERED (if bpProportion=FALSE) and NAMED list of genomic partitions GRanges. This list must be in priority order; the input will be assigned to the first partition it overlaps. bpProportion=TRUE, the list does not need ordering. |
remainder |
A character vector to assign any query regions that do not overlap with anything in the partitionList. Defaults to "intergenic" |
bpProportion |
logical indicating if overlaps should be calculated based on number of base pairs overlapping with each partition. bpProportion=FALSE does overlaps in priority order, bpProportion=TRUE counts number of overlapping base pairs between query and each partition. |
A data.frame assigning each element of a GRanges object to a partition from a previously provided partitionList.
partitionList = genomePartitionList(geneModels_hg19$genesGR, geneModels_hg19$exonsGR, geneModels_hg19$threeUTRGR, geneModels_hg19$fiveUTRGR) calcPartitions(vistaEnhancers, partitionList)
partitionList = genomePartitionList(geneModels_hg19$genesGR, geneModels_hg19$exonsGR, geneModels_hg19$threeUTRGR, geneModels_hg19$fiveUTRGR) calcPartitions(vistaEnhancers, partitionList)
This function is a wrapper for calcPartitions
and calcPartitionPercents
that uses built-in
partitions for a given reference genome assembly.
calcPartitionsRef(query, refAssembly, bpProportion = FALSE)
calcPartitionsRef(query, refAssembly, bpProportion = FALSE)
query |
A GenomicRanges or GenomicRangesList object with query regions |
refAssembly |
A character vector specifying the reference genome
assembly (*e.g.* 'hg19'). This will be used to grab annotation
models with |
bpProportion |
logical indicating if overlaps should be calculated based on number of base pairs overlapping with each partition. bpProportion=FALSE does overlaps in priority order, bpProportion=TRUE counts number of overlapping base pairs between query and each partition. |
A data.frame indicating the number of query region overlaps in several genomic partitions.
calcPartitionsRef(vistaEnhancers, "hg19")
calcPartitionsRef(vistaEnhancers, "hg19")
The function calcSummarySignal takes the input BED file(s) in form of GRanges or GRangesList object, overlaps it with all defined open chromatin regions across conditions (e.g. cell types) and returns a matrix, where each row is the input genomic region (if overlap was found), each column is a condition, and the value is a meam signal from regions where overlap was found.
calcSummarySignal(query, signalMatrix)
calcSummarySignal(query, signalMatrix)
query |
Genomic regions to be analyzed. Can be GRanges or GRangesList object. |
signalMatrix |
Matrix with signal values in predfined regions, where rows are predefined genomic regions, columns are conditions (e.g. cell types in which the signal was measured). First column contains information about the genomic region in following form: chr_start_end. Can be either data.frame or data.table object. |
A list with named components: signalSummaryMatrix - data.table with cell specific open chromatin signal values for query regions matrixStats - data.frame containing boxplot stats for individual cell type
signalSummaryList = calcSummarySignal(vistaEnhancers, exampleOpenSignalMatrix_hg19)
signalSummaryList = calcSummarySignal(vistaEnhancers, exampleOpenSignalMatrix_hg19)
The length of a genomic region (the distance between the start and end) is called the width When given a query set of genomic regions, this function returns the width
calcWidth(query)
calcWidth(query)
query |
A GRanges or GRangesList object with query sets |
A vector of the widths (end-start coordinates) of GRanges objects.
regWidths = calcWidth(vistaEnhancers)
regWidths = calcWidth(vistaEnhancers)
Table the maps cell types to tissues and groups
data(cellTypeMetadata)
data(cellTypeMetadata)
data.table with 3 columns (cellType, tissue and group) and 74 rows (one per cellType)
self-curated dataset
A dataset containing chromosome sizes for Homo Sapiens hg38 genome assembly
data(chromSizes_hg19)
data(chromSizes_hg19)
A named vectors of lengths with one item per chromosome
BSgenome.Hsapiens.UCSC.hg19 package
Converts a data.table (DT) object to a GenomicRanges (GR) object. Tries to be intelligent, guessing chr and start, but you have to supply end or other columns if you want them to be carried into the GR.
dtToGr( DT, chr = "chr", start = "start", end = NA, strand = NA, name = NA, splitFactor = NA, metaCols = NA )
dtToGr( DT, chr = "chr", start = "start", end = NA, strand = NA, name = NA, splitFactor = NA, metaCols = NA )
DT |
A data.table representing genomic regions. |
chr |
A string representing the chromosome column. |
start |
A string representing the name of the start column. |
end |
A string representing the name of the end column. |
strand |
A string representing the name of the strand column. |
name |
A string representing the name of the name column. |
splitFactor |
A string representing the name of the column to use to split the data.table into multiple data.tables. |
metaCols |
A string representing the name of the metadata column(s) to include in the returned GRanges object. |
A GRanges object.
start1 = c(seq(from=1, to = 2001, by = 1000), 800) chrString1 = c(rep("chr1", 3), "chr2") dt = data.table::data.table(chr=chrString1, start=start1, end=start1 + 250) newGR = dtToGr(dt)
start1 = c(seq(from=1, to = 2001, by = 1000), 800) chrString1 = c(rep("chr1", 3), "chr2") dt = data.table::data.table(chr=chrString1, start=start1, end=start1 + 250) newGR = dtToGr(dt)
Two utility functions for converting data.tables into GRanges objects
dtToGrInternal(DT, chr, start, end = NA, strand = NA, name = NA, metaCols = NA)
dtToGrInternal(DT, chr, start, end = NA, strand = NA, name = NA, metaCols = NA)
DT |
A data.table representing genomic regions. |
chr |
A string representing the chromosome column. |
start |
A string representing the name of the start column. |
end |
A string representing the name of the end column. |
strand |
A string representing the name of the strand column. |
name |
A string representing the name of the name column. |
metaCols |
A string representing the name of the metadata column(s) to include in the returned GRanges object. |
A GRanges object.
Preparation steps:
made a universe of regions by merging regions across cell types defined as opened in ENCODE
took bigwig files from ENCODE for individual cell types, merged replicates, filtered out blacklisted sites
evaluated the signal above regions defined by previous step
performed quantile normalization
subsetted it
data(exampleOpenSignalMatrix_hg19)
data(exampleOpenSignalMatrix_hg19)
data.frame, rows represent whole selection of open chromatin regions across all cell types defined by ENCODE, columns are individual cell types and values are normalized open chromatin signal values.
http://big.databio.org/open_chromatin_matrix/openSignalMatrix_hg19_quantileNormalized_round4.txt.gz
A dataset containing gene models for Homo Sapiens hg38 genome assembly.
data(geneModels_hg19)
data(geneModels_hg19)
A list of two GRanges objects, with genes and exons locations
EnsDb.Hsapiens.v75 package
Given GRanges for genes, and a GRanges for exons, returns a list of GRanges corresponding to various breakdown of the genome, based on the given annotations; it gives you proximal and core promoters, exons, and introns.
genomePartitionList( genesGR, exonsGR, threeUTRGR = NULL, fiveUTRGR = NULL, getCorePromoter = TRUE, getProxPromoter = TRUE, corePromSize = 100, proxPromSize = 2000 )
genomePartitionList( genesGR, exonsGR, threeUTRGR = NULL, fiveUTRGR = NULL, getCorePromoter = TRUE, getProxPromoter = TRUE, corePromSize = 100, proxPromSize = 2000 )
genesGR |
a GRanges object of gene coordinates |
exonsGR |
a GRanges object of exons coordinates |
threeUTRGR |
a GRanges object of 3' UTRs |
fiveUTRGR |
a GRanges object of 5' UTRs |
getCorePromoter |
option specifying if core promoters should be extracted defeaults to TRUE |
getProxPromoter |
option specifying if proximal promoters should be extracted defeaults to TRUE |
corePromSize |
size of core promoter (in bp) upstrem from TSS default value = 100 |
proxPromSize |
size of proximal promoter (in bp) upstrem from TSS default value = 2000 |
To be used as a partitionList for calcPartitions
.
A list of GRanges objects, each corresponding to a partition of the genome. Partitions include proximal and core promoters, exons and introns.
partitionList = genomePartitionList(geneModels_hg19$genesGR, geneModels_hg19$exonsGR, geneModels_hg19$threeUTRGR, geneModels_hg19$fiveUTRGR)
partitionList = genomePartitionList(geneModels_hg19$genesGR, geneModels_hg19$exonsGR, geneModels_hg19$threeUTRGR, geneModels_hg19$fiveUTRGR)
If you have a set of genomic ranges, the GenomicDistributions R package can help you with some simple visualizations. Currently, it can produce two kinds of plots: First, the chromosome distribution plot, which visualizes how your regions are distributed over chromosomes; and second, the feature distribution plot, which visualizes how your regions are distributed relative to a feature of interest, like Transcription Start Sites (TSSs).
Nathan C. Sheffield
http://github.com/databio/GenomicDistributions
Returns built-in chrom sizes for a given reference assembly
getChromSizes(refAssembly)
getChromSizes(refAssembly)
refAssembly |
A string identifier for the reference assembly |
A vector with the chromosome sizes corresponding to a specific genome assembly.
getChromSizes("hg19")
getChromSizes("hg19")
Get gene models from a remote or local FASTA file
getChromSizesFromFasta(source, destDir = NULL, convertEnsemblUCSC = FALSE)
getChromSizesFromFasta(source, destDir = NULL, convertEnsemblUCSC = FALSE)
source |
a string that is either a path to a local or remote FASTA |
destDir |
a string that indicates the path to the directory where the downloaded FASTA file should be stored |
convertEnsemblUCSC |
a logical indicating whether Ensembl style chromosome annotation should be changed to UCSC style (add chr) |
a named vector of sequence lengths
CElegansFasteCropped = system.file("extdata", "C_elegans_cropped_example.fa.gz", package="GenomicDistributions") CElegansChromSizes = getChromSizesFromFasta(CElegansFasteCropped)
CElegansFasteCropped = system.file("extdata", "C_elegans_cropped_example.fa.gz", package="GenomicDistributions") CElegansChromSizes = getChromSizesFromFasta(CElegansFasteCropped)
Some functions require gene models, which can obtained from any source. This function allows you to retrieve a few common built-in ones.
getGeneModels(refAssembly)
getGeneModels(refAssembly)
refAssembly |
A string identifier for the reference assembly |
A list containing the gene models corresponding to a specific reference assembly.
getGeneModels("hg19")
getGeneModels("hg19")
Get gene models from a remote or local GTF file
getGeneModelsFromGTF( source, features, convertEnsemblUCSC = FALSE, destDir = NULL, filterProteinCoding = TRUE )
getGeneModelsFromGTF( source, features, convertEnsemblUCSC = FALSE, destDir = NULL, filterProteinCoding = TRUE )
source |
a string that is either a path to a local or remote GTF |
features |
a vector of strings with feature identifiers that to include in the result list |
convertEnsemblUCSC |
a logical indicating whether Ensembl style chromosome annotation should be changed to UCSC style |
destDir |
a string that indicates the path to the directory where the downloaded GTF file should be stored |
filterProteinCoding |
a logical indicating if TSSs should be only protein-coding genes (default = TRUE) |
a list of GRanges objects
CElegansGtfCropped = system.file("extdata", "C_elegans_cropped_example.gtf.gz", package="GenomicDistributions") features = c("gene", "exon", "three_prime_utr", "five_prime_utr") CElegansGeneModels = getGeneModelsFromGTF(CElegansGtfCropped, features, TRUE)
CElegansGtfCropped = system.file("extdata", "C_elegans_cropped_example.gtf.gz", package="GenomicDistributions") features = c("gene", "exon", "three_prime_utr", "five_prime_utr") CElegansGeneModels = getGeneModelsFromGTF(CElegansGtfCropped, features, TRUE)
Returns bins used in 'calcChromBins' function Given a named vector of chromosome sizes, the function returns GRangesList object with bins for each chromosome.
getGenomeBins(chromSizes, binCount = 10000)
getGenomeBins(chromSizes, binCount = 10000)
chromSizes |
a named list of size (length) for each chromosome. |
binCount |
number of bins (total; *not* per chromosome), defaults to 10,000 |
A GRangesList object with bins that separate chromosomes into equal parts.
chromSizes = getChromSizes("hg19") chromBins = getGenomeBins(chromSizes)
chromSizes = getChromSizes("hg19") chromBins = getGenomeBins(chromSizes)
This is a generic getter function that will return a data object requested, if it is included in the built-in data with the GenomicDistributions package or GenomicDistributionsData package (if installed). Data objects can be requested for different reference assemblies and data types (specified by a tagline, which is a unique string identifying the data type).
getReferenceData(refAssembly, tagline)
getReferenceData(refAssembly, tagline)
refAssembly |
Reference assembly string (e.g. 'hg38') |
tagline |
The string that was used to identify data of a given type in the data building step. It's used for the filename so we know what to load, and is what makes this function generic (so it can load different data types). |
A requested and included package data object.
Get transcription start sites (TSSs) from a remote or local GTF file
getTssFromGTF( source, convertEnsemblUCSC = FALSE, destDir = NULL, filterProteinCoding = TRUE )
getTssFromGTF( source, convertEnsemblUCSC = FALSE, destDir = NULL, filterProteinCoding = TRUE )
source |
a string that is either a path to a local or remote GTF |
convertEnsemblUCSC |
a logical indicating whether Ensembl style chromosome annotation should be changed to UCSC style |
destDir |
a string that indicates the path to the directory where the downloaded GTF file should be stored |
filterProteinCoding |
a logical indicating if TSSs should be only protein-coding genes (default = TRUE) |
a list of GRanges objects
CElegansGtfCropped = system.file("extdata", "C_elegans_cropped_example.gtf.gz", package="GenomicDistributions") CElegansTss = getTssFromGTF(CElegansGtfCropped, TRUE)
CElegansGtfCropped = system.file("extdata", "C_elegans_cropped_example.gtf.gz", package="GenomicDistributions") CElegansTss = getTssFromGTF(CElegansGtfCropped, TRUE)
Convert a GenomicRanges into a data.table.
grToDt(GR)
grToDt(GR)
GR |
A Granges object |
A data.table object.
If you are building a histogram of binned values, you want to have labels for your bins that correspond to the ranges you used to bin. This function takes the breakpoints that define your bins and produces nice-looking labels for your histogram plot.
labelCuts( breakPoints, round_digits = 1, signif_digits = 3, collapse = "-", infBins = FALSE )
labelCuts( breakPoints, round_digits = 1, signif_digits = 3, collapse = "-", infBins = FALSE )
breakPoints |
The exact values you want as boundaries for your bins |
round_digits |
Number of digits to cut round labels to. |
signif_digits |
Number of significant digits to specify. |
collapse |
Character to separate the labels |
infBins |
use >/< as labels on the edge bins |
labelCuts
will take a cut group, (e.g., a quantile division of
some signal), and give you clean labels (similar to the cut method).
A vector of histogram axis labels.
This function will let you use a simple character vector (e.g. 'hg19') to load and then return BSgenome objects. This lets you avoid having to use the more complex annotation for a complete BSgenome object (e.g. BSgenome.Hsapiens.UCSC.hg38.masked)
loadBSgenome(genomeBuild, masked = TRUE)
loadBSgenome(genomeBuild, masked = TRUE)
genomeBuild |
One of 'hg19', 'hg38', 'mm10', 'mm9', or 'grch38' |
masked |
Should we used the masked version? Default:TRUE |
A BSgenome object corresponding to the provided genome build.
## Not run: bsg = loadBSgenome('hg19') ## End(Not run)
## Not run: bsg = loadBSgenome('hg19') ## End(Not run)
Load selected EnsDb library
loadEnsDb(genomeBuild)
loadEnsDb(genomeBuild)
genomeBuild |
string, genome identifier |
loaded library
## Not run: loadEnsDb("hg19") ## End(Not run)
## Not run: loadEnsDb("hg19") ## End(Not run)
Internal helper function to calculate distance between neighboring regions.
neighbordt(querydt)
neighbordt(querydt)
querydt |
A data table with regions grouped according to chromosome. |
A numeric vector with the distances in bp
Nathan's magical named list function. This function is a drop-in replacement for the base list() function, which automatically names your list according to the names of the variables used to construct it. It seamlessly handles lists with some names and others absent, not overwriting specified names while naming any unnamed parameters. Took me awhile to figure this out.
nlist(...)
nlist(...)
... |
arguments passed to list() |
A named list object.
x=5 y=10 nlist(x,y) # returns list(x=5, y=10) list(x,y) # returns unnamed list(5, 10)
x=5 y=10 nlist(x,y) # returns list(x=5, y=10) list(x,y) # returns unnamed list(5, 10)
Plots result from genomicDistribution
calculation
plotChromBins( genomeAggregate, plotTitle = "Distribution over chromosomes", ylim = "max" )
plotChromBins( genomeAggregate, plotTitle = "Distribution over chromosomes", ylim = "max" )
genomeAggregate |
The output from the genomicDistribution function |
plotTitle |
Title for plot. |
ylim |
Limit of y-axes. Default "max" sets limit to N of biggest bin. |
A ggplot object showing the distribution of the query regions over bins of the reference genome.
agg = data.frame("regionID"=1:5, "chr"=rep(c("chr1"), 5), "withinGroupID"=1:5, "N"=c(1,3,5,7,9)) ChromBins = plotChromBins(agg)
agg = data.frame("regionID"=1:5, "chr"=rep(c("chr1"), 5), "withinGroupID"=1:5, "N"=c(1,3,5,7,9)) ChromBins = plotChromBins(agg)
This function plots the cumulative distribution of regions across a feature set.
plotCumulativePartitions(assignedPartitions, feature_names = NULL)
plotCumulativePartitions(assignedPartitions, feature_names = NULL)
assignedPartitions |
Results from |
feature_names |
An optional character vector of feature names, in the same order as the GenomicRanges or GenomicRangesList object. |
A ggplot object of the cumulative distribution of regions in features.
p = calcCumulativePartitionsRef(vistaEnhancers, "hg19") cumuPlot = plotCumulativePartitions(p)
p = calcCumulativePartitionsRef(vistaEnhancers, "hg19") cumuPlot = plotCumulativePartitions(p)
Given calcDinuclFreq
or calcDinuclFreqRef
results, this function
generates a violin plot of dinucleotide frequency
plotDinuclFreq(DNFDataTable)
plotDinuclFreq(DNFDataTable)
DNFDataTable |
A data.table, data.frame, or a list of dinucleotide counts -
results from |
A ggplot object plotting distribution of dinucleotide content in query regions
DNFDataTable = data.table::data.table(GC = rnorm(400, mean=0.5, sd=0.1), CG = rnorm(400, mean=0.5, sd=0.5), AT = rnorm(400, mean=0.5, sd=1), TA = rnorm(400, mean=0.5, sd=1.5)) DNFPlot = plotDinuclFreq(DNFDataTable) ## Not run: query = system.file("extdata", "vistaEnhancers.bed.gz", package="GenomicDistributions") GRquery = rtracklayer::import(query) refAssembly = 'hg19' DNF = calcDinuclFreqRef(GRquery, refAssembly) DNFPlot2 = plotDinuclFreq(DNF) ## End(Not run)
DNFDataTable = data.table::data.table(GC = rnorm(400, mean=0.5, sd=0.1), CG = rnorm(400, mean=0.5, sd=0.5), AT = rnorm(400, mean=0.5, sd=1), TA = rnorm(400, mean=0.5, sd=1.5)) DNFPlot = plotDinuclFreq(DNFDataTable) ## Not run: query = system.file("extdata", "vistaEnhancers.bed.gz", package="GenomicDistributions") GRquery = rtracklayer::import(query) refAssembly = 'hg19' DNF = calcDinuclFreqRef(GRquery, refAssembly) DNFPlot2 = plotDinuclFreq(DNF) ## End(Not run)
Produces a barplot showing how query regions of interest are distributed relative to the expected distribution across a given partition list
plotExpectedPartitions(expectedPartitions, feature_names = NULL, pval = FALSE)
plotExpectedPartitions(expectedPartitions, feature_names = NULL, pval = FALSE)
expectedPartitions |
A data.frame holding the frequency of assignment
to each of the partitions, the expected number of each partition, and
the log10 of the observed over expected. Produced by
|
feature_names |
Character vector with labels for the partitions (optional). By default it will use the names from the first argument. |
pval |
Logical indicating whether Chi-square p-values should be added for each partition. |
A ggplot object using a barplot to show the distribution of the query regions across a given partition list.
p = calcExpectedPartitionsRef(vistaEnhancers, "hg19") expectedPlot = plotExpectedPartitions(p)
p = calcExpectedPartitionsRef(vistaEnhancers, "hg19") expectedPlot = plotExpectedPartitions(p)
Given the results from featureDistribution
, plots a histogram of
distances surrounding the features of interest
plotFeatureDist( dists, bgdists = NULL, featureName = "features", numbers = FALSE, nbins = 50, size = 1e+05, infBins = FALSE, tile = FALSE, labelOrder = "default" )
plotFeatureDist( dists, bgdists = NULL, featureName = "features", numbers = FALSE, nbins = 50, size = 1e+05, infBins = FALSE, tile = FALSE, labelOrder = "default" )
dists |
Results from |
bgdists |
Background distances. If provided, will plot a background distribution of expected distances |
featureName |
Character vector for plot labels (optional). |
numbers |
a logical indicating whether the raw numbers should be displayed, rather than percentages (optional). |
nbins |
Number of bins on each side of the center point. |
size |
Number of bases to include in plot on each side of the center point. |
infBins |
Include catch-all bins on the sides? |
tile |
Turn on a tile mode, which plots a tiled figure instead of a histogram. |
labelOrder |
– Enter "default" to order by order of user input (default); Enter "center" to order by value in tile in the closest proximity to the center of features (in case TSS is used - center is TSS) (center). |
A ggplot2 plot object
TSSdist = calcFeatureDistRefTSS(vistaEnhancers, "hg19") f = plotFeatureDist(TSSdist, featureName="TSS")
TSSdist = calcFeatureDistRefTSS(vistaEnhancers, "hg19") f = plotFeatureDist(TSSdist, featureName="TSS")
calcGCContent
function, this will produce a
density plotPlots a density distribution of GC vectors
Give results from the calcGCContent
function, this will produce a
density plot
plotGCContent(gcvectors)
plotGCContent(gcvectors)
gcvectors |
A numeric vector or list of numeric vectors of GC contents. |
A ggplot object plotting distribution of GC content in query regions.
numVector = rnorm(400, mean=0.5, sd=0.1) GCplot = plotGCContent(numVector) vecs = list(example1 = rnorm(400, mean=0.5, sd=0.1), example2 = rnorm(600, mean=0.5, sd=0.1)) GCplot = plotGCContent(vecs)
numVector = rnorm(400, mean=0.5, sd=0.1) GCplot = plotGCContent(numVector) vecs = list(example1 = rnorm(400, mean=0.5, sd=0.1), example2 = rnorm(600, mean=0.5, sd=0.1)) GCplot = plotGCContent(vecs)
Plot the distances from regions to their upstream/downstream neighbors or nearest neighbors. Distances can be passed as either raw bp or corrected for the number of regions (log10(obs/exp)), but this has to be specified in the function parameters.
plotNeighborDist(dcvec, correctedDist = FALSE, Nneighbors = FALSE)
plotNeighborDist(dcvec, correctedDist = FALSE, Nneighbors = FALSE)
dcvec |
A numeric vector or list of vectors containing distances
to upstream/downstream neighboring regions or to nearest neighbors.
Produced by |
correctedDist |
A logical indicating if the plot axis should be adjusted to show distances corrected for the number of regions in a regionset. |
Nneighbors |
A logical indicating whether legend should be adjusted if Nearest neighbors are being plotted. Default legend shows distances to upstream/downstream neighbors. |
A ggplot density object showing the distribution of raw or corrected distances.
numVector = rnorm(400, mean=5, sd=0.1) d = plotNeighborDist(numVector)
numVector = rnorm(400, mean=5, sd=0.1) d = plotNeighborDist(numVector)
This function can be used to test a GRanges object against any arbitrary list of genome partitions. The partition list is a priority-ordered list of GRanges objects. Each region in the query will be assigned to a given partition that it overlaps with the highest priority.
plotPartitions(assignedPartitions, numbers = FALSE, stacked = FALSE)
plotPartitions(assignedPartitions, numbers = FALSE, stacked = FALSE)
assignedPartitions |
A table holding the frequency of assignment to
each of the partitions. Produced by |
numbers |
logical indicating whether raw overlaps should be plotted instead of the default percentages |
stacked |
logical indicating that data should be plotted as stacked bar plot |
A ggplot object using a barplot to show the distribution of the query regions across a given partition list.
p = calcPartitionsRef(vistaEnhancers, "hg19") partPlot = plotPartitions(p) partCounts = plotPartitions(p, numbers=TRUE) partPlot = plotPartitions(p, stacked=TRUE)
p = calcPartitionsRef(vistaEnhancers, "hg19") partPlot = plotPartitions(p) partCounts = plotPartitions(p, numbers=TRUE) partPlot = plotPartitions(p, stacked=TRUE)
Given the results from calcWidth
, plots a histogram with
outliers trimmed.
plotQTHist( x, EndBarColor = "gray57", MiddleBarColor = "gray27", quantThresh = NULL, bins = NULL, indep = FALSE, numbers = FALSE )
plotQTHist( x, EndBarColor = "gray57", MiddleBarColor = "gray27", quantThresh = NULL, bins = NULL, indep = FALSE, numbers = FALSE )
x |
Data values to plot - vector or list of vectors |
EndBarColor |
Color for the quantile bars on both ends of the graph (optional) |
MiddleBarColor |
Color for the bars in the middle of the graph (optional) |
quantThresh |
Quantile of data to be contained in each end bar (optional) quantThresh values must be under .2, optimal size is under .1 |
bins |
The number of bins for the histogram to allocate data to. (optional) |
indep |
logical value which returns a list of plots that have had their bins calculated independently; the normal version will plot them on the same x and y axis. |
numbers |
a logical indicating whether the raw numbers should be displayed, rather than percentages (optional). |
x-axis breaks for the frequency calculations are based on the "divisions"
results from helper function calcDivisions
.
A ggplot2 plot object
regWidths = calcWidth(vistaEnhancers) qtHist = plotQTHist(regWidths) qtHist2 = plotQTHist(regWidths, quantThresh=0.1)
regWidths = calcWidth(vistaEnhancers) qtHist = plotQTHist(regWidths) qtHist2 = plotQTHist(regWidths, quantThresh=0.1)
calcSummarySignal
.The function plotSummarySignal visualizes the signalSummaryMatrix obtained from
calcSummarySignal
.
plotSummarySignal( signalSummaryList, plotType = "barPlot", metadata = NULL, colorColumn = NULL, filterGroupColumn = NULL, filterGroup = NULL )
plotSummarySignal( signalSummaryList, plotType = "barPlot", metadata = NULL, colorColumn = NULL, filterGroupColumn = NULL, filterGroup = NULL )
signalSummaryList |
Output list from |
plotType |
Options are: "jitter" - jitter plot with box plot on top, "boxPlot" - box plot without individual points and outliers, "barPlot" (default) - bar height represents the median signal value for a given cell type, "violinPlot" - violin plot with medians. |
metadata |
(optional) data.table used for grouping columns from 'signalMatrix' into categories, that are then plotted with different colors. Must contain variable 'colName' that contains all the condition column names from 'signaMatrix'. |
colorColumn |
(optional only if metadata provided) columns name from 'metadata' table that will be used as grouping variable for coloring. |
filterGroupColumn |
(optional only if metadata provided and 'filterGroup' specified) allows user to plot specified subgroups only. String specifying the column name in 'metadata' from which groups will be filtered (groups are specified in as 'filterGroups) |
filterGroup |
(optional only if 'metadata' and 'filterGroupColumn' provided) - string (or vector of strings) of groups from 'filterGroupColumn' to be plottted. |
A ggplot object.
signalSummaryList = calcSummarySignal(vistaEnhancers, exampleOpenSignalMatrix_hg19) metadata = cellTypeMetadata plotSignal = plotSummarySignal(signalSummaryList) plotSignalTissueColor = plotSummarySignal(signalSummaryList = signalSummaryList, plotType = "jitter", metadata = metadata, colorColumn = "tissueType") plotSignalFiltered = plotSummarySignal(signalSummaryList = signalSummaryList, plotType = "violinPlot", metadata = metadata, colorColumn = "tissueType", filterGroupColumn = "tissueType", filterGroup = c("skin", "blood"))
signalSummaryList = calcSummarySignal(vistaEnhancers, exampleOpenSignalMatrix_hg19) metadata = cellTypeMetadata plotSignal = plotSummarySignal(signalSummaryList) plotSignalTissueColor = plotSummarySignal(signalSummaryList = signalSummaryList, plotType = "jitter", metadata = metadata, colorColumn = "tissueType") plotSignalFiltered = plotSummarySignal(signalSummaryList = signalSummaryList, plotType = "violinPlot", metadata = metadata, colorColumn = "tissueType", filterGroupColumn = "tissueType", filterGroup = c("skin", "blood"))
Read local or remote file
retrieveFile(source, destDir = NULL)
retrieveFile(source, destDir = NULL)
source |
a string that is either a path to a local or remote GTF |
destDir |
a string that indicates the path to the directory where the downloaded GTF file should be stored. If not provided, a temporary directory will be used. |
data.frame retrieved file path
CElegansGtfCropped = system.file("extdata", "C_elegans_cropped_example.gtf.gz", package="GenomicDistributions") CElegansGtf = retrieveFile(CElegansGtfCropped)
CElegansGtfCropped = system.file("extdata", "C_elegans_cropped_example.gtf.gz", package="GenomicDistributions") CElegansGtf = retrieveFile(CElegansGtfCropped)
Example BED file read with rtracklayer::import
data(setB_100)
data(setB_100)
GenomicRanges::GRanges
Efficiently split a data.table by a column in the table
splitDataTable(DT, split_factor)
splitDataTable(DT, split_factor)
DT |
Data.table to split |
split_factor |
Column to split, which can be a character vector or an integer. |
List of data.table objects, split by column
Usually ggplot2 facets are labeled with boxes surrounding the label. This function removes the box, so it's a simple label for each facet.
theme_blank_facet_label()
theme_blank_facet_label()
A ggplot theme
A dataset containing chromosome sizes for Homo Sapiens hg38 genome assembly
data(TSS_hg19)
data(TSS_hg19)
A named vectors of lengths with one item per chromosome
EnsDb.Hsapiens.v75 package
Example BED file read with rtracklayer::import
data(vistaEnhancers)
data(vistaEnhancers)
GenomicRanges::GRanges