Title: | Gene Set Enrichment For ChIP-seq Peak Data |
---|---|
Description: | ChIP-Enrich and Poly-Enrich perform gene set enrichment testing using peaks called from a ChIP-seq experiment. The method empirically corrects for confounding factors such as the length of genes, and the mappability of the sequence surrounding genes. |
Authors: | Ryan P. Welch [aut, cph], Chee Lee [aut], Raymond G. Cavalcante [aut], Kai Wang [cre], Chris Lee [aut], Laura J. Scott [ths], Maureen A. Sartor [ths] |
Maintainer: | Kai Wang <[email protected]> |
License: | GPL-3 |
Version: | 2.31.0 |
Built: | 2024-12-18 06:17:37 UTC |
Source: | https://github.com/bioc/chipenrich |
Determine all overlaps between the set of input regions peaks
and the given locus definition locusdef
. In addition, report where each overlap begins and ends, as well as the length of the overlap.
assign_peak_segments(peaks, locusdef)
assign_peak_segments(peaks, locusdef)
peaks |
A |
locusdef |
A locus definition object from |
Typically, this function will not be used alone, but inside chipenrich()
with method = 'broadenrich'
.
A data.frame
with columns for peak_id, chr, peak_start, peak_end, gene_locus_start, gene_locus_end, gene_id, overlap_start, overlap_end, peak_overlap
. The result is used in num_peaks_per_gene()
.
data('locusdef.hg19.nearest_tss', package = 'chipenrich.data') data('tss.hg19', package = 'chipenrich.data') file = system.file('extdata', 'test_assign.bed', package = 'chipenrich') peaks = read_bed(file) assigned_peaks = assign_peak_segments( peaks = peaks, locusdef = locusdef.hg19.nearest_tss)
data('locusdef.hg19.nearest_tss', package = 'chipenrich.data') data('tss.hg19', package = 'chipenrich.data') file = system.file('extdata', 'test_assign.bed', package = 'chipenrich') peaks = read_bed(file) assigned_peaks = assign_peak_segments( peaks = peaks, locusdef = locusdef.hg19.nearest_tss)
Determine the midpoints of a set of input regions peaks
and the overlap of the midpoints with a given locus definition locusdef
. Also report the TSS that is nearest each region (peak) overlapping a defined locus and its distance.
assign_peaks(peaks, locusdef, tss, weighting = NULL)
assign_peaks(peaks, locusdef, tss, weighting = NULL)
peaks |
A |
locusdef |
A locus definition object from |
tss |
A |
weighting |
A string defining what weighting option they want. Current options are 'multiAssign', 'signalValue', and 'logSignal Value'. Default is NULL. |
Typically, this function will not be used alone, but inside chipenrich()
.
A data.frame
with columns for peak_id, chr, peak_start, peak_end, gene_locus_start, gene_locus_end, gene_id, nearest_tss, nearest_tss_gene, dist_to_tss, nearest_tss_gene_strand
. The result is used in num_peaks_per_gene()
.
data('locusdef.hg19.nearest_tss', package = 'chipenrich.data') data('tss.hg19', package = 'chipenrich.data') file = system.file('extdata', 'test_assign.bed', package = 'chipenrich') peaks = read_bed(file) assigned_peaks = assign_peaks( peaks = peaks, locusdef = locusdef.hg19.nearest_tss, tss = tss.hg19)
data('locusdef.hg19.nearest_tss', package = 'chipenrich.data') data('tss.hg19', package = 'chipenrich.data') file = system.file('extdata', 'test_assign.bed', package = 'chipenrich') peaks = read_bed(file) assigned_peaks = assign_peaks( peaks = peaks, locusdef = locusdef.hg19.nearest_tss, tss = tss.hg19)
Broad-Enrich is designed for use with broad peaks that may intersect multiple gene loci, and cumulatively cover greater than 5% of the genome. For example, ChIP-seq experiments for histone modifications. For more details, see the 'Broad-Enrich Method' section below. For help choosing a method, see the 'Choosing A Method' section below, or see the vignette.
broadenrich( peaks, out_name = "broadenrich", out_path = getwd(), genome = supported_genomes(), genesets = c("GOBP", "GOCC", "GOMF"), locusdef = "nearest_tss", mappability = NULL, qc_plots = TRUE, min_geneset_size = 15, max_geneset_size = 2000, randomization = NULL, n_cores = 1 )
broadenrich( peaks, out_name = "broadenrich", out_path = getwd(), genome = supported_genomes(), genesets = c("GOBP", "GOCC", "GOMF"), locusdef = "nearest_tss", mappability = NULL, qc_plots = TRUE, min_geneset_size = 15, max_geneset_size = 2000, randomization = NULL, n_cores = 1 )
peaks |
Either a file path or a |
out_name |
Prefix string to use for naming output files. This should not
contain any characters that would be illegal for the system being used (Unix,
Windows, etc.) The default value is "broadenrich", and a file "broadenrich_results.tab"
is produced. If |
out_path |
Directory to which results files will be written out. Defaults
to the current working directory as returned by |
genome |
One of the |
genesets |
A character vector of geneset databases to be tested for
enrichment. See |
locusdef |
One of: 'nearest_tss', 'nearest_gene', 'exon', 'intron', '1kb',
'1kb_outside', '1kb_outside_upstream', '5kb', '5kb_outside', '5kb_outside_upstream',
'10kb', '10kb_outside', '10kb_outside_upstream'. For a description of each,
see the vignette or |
mappability |
One of |
qc_plots |
A logical variable that enables the automatic generation of plots for quality control. |
min_geneset_size |
Sets the minimum number of genes a gene set may have to be considered for enrichment testing. |
max_geneset_size |
Sets the maximum number of genes a gene set may have to be considered for enrichment testing. |
randomization |
One of |
n_cores |
The number of cores to use for enrichment testing. We recommend using only up to the maximum number of physical cores present, as virtual cores do not significantly decrease runtime. Default number of cores is set to 1. NOTE: Windows does not support multicore enrichment. |
A list, containing the following items:
opts |
A data frame containing the arguments/values passed to |
peaks |
A data frame containing peak assignments to genes. Peaks which do not overlap a gene locus are not included. Each peak that was assigned to a gene is listed, along with the peak midpoint or peak interval coordinates (depending on which was used), the gene to which the peak was assigned, the locus start and end position of the gene, and the distance from the peak to the TSS. The columns are:
|
peaks_per_gene |
A data frame of the count of peaks per gene. The columns are:
|
results |
A data frame of the results from performing the gene set enrichment test on each geneset that was requested (all genesets are merged into one final data frame.) The columns are:
|
The Broad-Enrich method uses the cumulative peak coverage of genes in its model
for enrichment: GO ~ ratio + s(log10_length)
. Here, GO
is a
binary vector indicating whether a gene is in the gene set being tested,
ratio
is a numeric vector indicating the ratio of the gene covered by
peaks, and s(log10_length)
is a binomial cubic smoothing spline which
adjusts for the relationship between gene coverage and locus length.
The following guidelines are intended to help select an enrichment function:
is designed for use with broad peaks that may intersect multiple gene loci, and cumulatively cover greater than 5% of the genome. For example, ChIP-seq experiments for histone modifications.
is designed for use with 1,000s or 10,000s of narrow peaks which results in fewer gene loci containing a peak overall. For example, ChIP-seq experiments for transcription factors.
is also designed for narrow peaks, for experiments with 100,000s of peaks, or in cases where the number of binding sites per gene affects its regulation. If unsure whether to use chipenrich or polyenrich, then we recommend hybridenrich.
is a combination of chipenrich and polyenrich, to be used when one is unsure which is the optimal method.
Randomization of locus definitions allows for the assessment of Type I Error under the null hypothesis. The randomization codes are:
NULL
:No randomizations, the default.
Shuffle the gene_id
and symbol
columns of the
locusdef
together, without regard for the chromosome location, or locus length.
The null hypothesis is that there is no true gene set enrichment.
Shuffle the gene_id
and symbol
columns of the
locusdef
together within bins of 100 genes sorted by locus length. The null
hypothesis is that there is no true gene set enrichment, but with preserved locus
length relationship.
Shuffle the gene_id
and symbol
columns of the
locusdef
together within bins of 50 genes sorted by genomic location. The null
hypothesis is that there is no true gene set enrichment, but with preserved
genomic location.
The return value with a selected randomization is the same list as without.
To assess the Type I error, the alpha
level for the particular data set
can be calculated by dividing the total number of gene sets with p-value < alpha
by the total number of tests. Users may want to perform multiple randomizations
for a set of peaks and take the median of the alpha
values.
Other enrichment functions:
chipenrich()
,
polyenrich()
# Run Broad-Enrich using an example dataset, assigning peaks to the nearest TSS, # and on a small custom geneset data(peaks_H3K4me3_GM12878, package = 'chipenrich.data') peaks_H3K4me3_GM12878 = subset(peaks_H3K4me3_GM12878, peaks_H3K4me3_GM12878$chrom == 'chr1') gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = broadenrich(peaks_H3K4me3_GM12878, locusdef='nearest_tss', genome = 'hg19', genesets=gs_path, out_name=NULL) # Get the list of peaks that were assigned to genes. assigned_peaks = results$peaks # Get the results of enrichment testing. enrich = results$results
# Run Broad-Enrich using an example dataset, assigning peaks to the nearest TSS, # and on a small custom geneset data(peaks_H3K4me3_GM12878, package = 'chipenrich.data') peaks_H3K4me3_GM12878 = subset(peaks_H3K4me3_GM12878, peaks_H3K4me3_GM12878$chrom == 'chr1') gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = broadenrich(peaks_H3K4me3_GM12878, locusdef='nearest_tss', genome = 'hg19', genesets=gs_path, out_name=NULL) # Get the list of peaks that were assigned to genes. assigned_peaks = results$peaks # Get the results of enrichment testing. enrich = results$results
num_peaks_per_gene()
In particular, for method = 'broadenrich'
in chipenrich()
, when using assign_peak_segments()
. This function will add aggregated peak_overlap
(in base pairs) and ratio
(relative to length
) columns to the result of num_peaks_per_gene()
so the right data is present for the method = 'broadenrich'
model.
calc_peak_gene_overlap(assigned_peaks, ppg)
calc_peak_gene_overlap(assigned_peaks, ppg)
assigned_peaks |
A |
ppg |
The aggregated peak assignments over |
Typically, this function will not be used alone, but inside chipenrich()
with method = 'broadenrich'
.
A data.frame
with columns gene_id, length, log10_length, num_peaks, peak, peak_overlap, ratio
. The result is used directly in the gene set enrichment tests in chipenrich()
when method = 'broadenrich'
.
data('locusdef.hg19.nearest_tss', package = 'chipenrich.data') data('tss.hg19', package = 'chipenrich.data') file = system.file('extdata', 'test_assign.bed', package = 'chipenrich') peaks = read_bed(file) assigned_peaks = assign_peak_segments( peaks = peaks, locusdef = locusdef.hg19.nearest_tss) ppg = num_peaks_per_gene( assigned_peaks = assigned_peaks, locusdef = locusdef.hg19.nearest_tss, mappa = NULL) ppg = calc_peak_gene_overlap( assigned_peaks = assigned_peaks, ppg = ppg)
data('locusdef.hg19.nearest_tss', package = 'chipenrich.data') data('tss.hg19', package = 'chipenrich.data') file = system.file('extdata', 'test_assign.bed', package = 'chipenrich') peaks = read_bed(file) assigned_peaks = assign_peak_segments( peaks = peaks, locusdef = locusdef.hg19.nearest_tss) ppg = num_peaks_per_gene( assigned_peaks = assigned_peaks, locusdef = locusdef.hg19.nearest_tss, mappa = NULL) ppg = calc_peak_gene_overlap( assigned_peaks = assigned_peaks, ppg = ppg)
ChIP-Enrich is designed for use with 1,000s or 10,000s of narrow peaks which results in fewer gene loci containing a peak overall. For example, ChIP-seq experiments for transcription factors. For more details, see the 'ChIP-Enrich Method' section below. For help choosing a method, see the 'Choosing A Method' section below, or see the vignette.
chipenrich( peaks, out_name = "chipenrich", out_path = getwd(), genome = supported_genomes(), genesets = c("GOBP", "GOCC", "GOMF"), locusdef = "nearest_tss", method = "chipenrich", mappability = NULL, fisher_alt = "two.sided", qc_plots = TRUE, min_geneset_size = 15, max_geneset_size = 2000, num_peak_threshold = 1, randomization = NULL, n_cores = 1 )
chipenrich( peaks, out_name = "chipenrich", out_path = getwd(), genome = supported_genomes(), genesets = c("GOBP", "GOCC", "GOMF"), locusdef = "nearest_tss", method = "chipenrich", mappability = NULL, fisher_alt = "two.sided", qc_plots = TRUE, min_geneset_size = 15, max_geneset_size = 2000, num_peak_threshold = 1, randomization = NULL, n_cores = 1 )
peaks |
Either a file path or a |
out_name |
Prefix string to use for naming output files. This should not
contain any characters that would be illegal for the system being used (Unix,
Windows, etc.) The default value is "chipenrich", and a file "chipenrich_results.tab"
is produced. If |
out_path |
Directory to which results files will be written out. Defaults
to the current working directory as returned by |
genome |
One of the |
genesets |
A character vector of geneset databases to be tested for
enrichment. See |
locusdef |
One of: 'nearest_tss', 'nearest_gene', 'exon', 'intron', '1kb',
'1kb_outside', '1kb_outside_upstream', '5kb', '5kb_outside', '5kb_outside_upstream',
'10kb', '10kb_outside', '10kb_outside_upstream'. For a description of each,
see the vignette or |
method |
A character string specifying the method to use for enrichment testing. Must be one of ChIP-Enrich ('chipenrich') (default), or Fisher's exact test ('fet'). |
mappability |
One of |
fisher_alt |
If method is 'fet', this option indicates the alternative for Fisher's exact test, and must be one of 'two-sided' (default), 'greater', or 'less'. |
qc_plots |
A logical variable that enables the automatic generation of plots for quality control. |
min_geneset_size |
Sets the minimum number of genes a gene set may have to be considered for enrichment testing. |
max_geneset_size |
Sets the maximum number of genes a gene set may have to be considered for enrichment testing. |
num_peak_threshold |
Sets the threshold for how many peaks a gene must have to be considered as having a peak. Defaults to 1. Only relevant for Fisher's exact test and ChIP-Enrich methods. |
randomization |
One of |
n_cores |
The number of cores to use for enrichment testing. We recommend using only up to the maximum number of physical cores present, as virtual cores do not significantly decrease runtime. Default number of cores is set to 1. NOTE: Windows does not support multicore enrichment. |
A list, containing the following items:
opts |
A data frame containing the arguments/values passed to |
peaks |
A data frame containing peak assignments to genes. Peaks which do not overlap a gene locus are not included. Each peak that was assigned to a gene is listed, along with the peak midpoint or peak interval coordinates (depending on which was used), the gene to which the peak was assigned, the locus start and end position of the gene, and the distance from the peak to the TSS. The columns are:
|
peaks_per_gene |
A data frame of the count of peaks per gene. The columns are:
|
results |
A data frame of the results from performing the gene set enrichment test on each geneset that was requested (all genesets are merged into one final data frame.) The columns are:
|
The ChIP-Enrich method uses the presence of a peak in its model for enrichment:
peak ~ GO + s(log10_length)
. Here, GO
is a binary vector indicating
whether a gene is in the gene set being tested, peak
is a binary vector
indicating the presence of a peak in a gene, and s(log10_length)
is a
binomial cubic smoothing spline which adjusts for the relationship between the
presence of a peak and locus length.
The following guidelines are intended to help select an enrichment function:
is designed for use with broad peaks that may intersect multiple gene loci, and cumulatively cover greater than 5% of the genome. For example, ChIP-seq experiments for histone modifications.
is designed for use with 1,000s or 10,000s of narrow peaks which results in fewer gene loci containing a peak overall. For example, ChIP-seq experiments for transcription factors.
is also designed for narrow peaks, for experiments with 100,000s of peaks, or in cases where the number of binding sites per gene affects its regulation. If unsure whether to use chipenrich or polyenrich, then we recommend hybridenrich.
is a combination of chipenrich and polyenrich, to be used when one is unsure which is the optimal method.
Randomization of locus definitions allows for the assessment of Type I Error under the null hypothesis. The randomization codes are:
NULL
:No randomizations, the default.
Shuffle the gene_id
and symbol
columns of the
locusdef
together, without regard for the chromosome location, or locus length.
The null hypothesis is that there is no true gene set enrichment.
Shuffle the gene_id
and symbol
columns of the
locusdef
together within bins of 100 genes sorted by locus length. The null
hypothesis is that there is no true gene set enrichment, but with preserved locus
length relationship.
Shuffle the gene_id
and symbol
columns of the
locusdef
together within bins of 50 genes sorted by genomic location. The null
hypothesis is that there is no true gene set enrichment, but with preserved
genomic location.
The return value with a selected randomization is the same list as without.
To assess the Type I error, the alpha
level for the particular data set
can be calculated by dividing the total number of gene sets with p-value < alpha
by the total number of tests. Users may want to perform multiple randomizations
for a set of peaks and take the median of the alpha
values.
Other enrichment functions:
broadenrich()
,
polyenrich()
# Run ChipEnrich using an example dataset, assigning peaks to the nearest TSS, # and on a small custom geneset data(peaks_E2F4, package = 'chipenrich.data') peaks_E2F4 = subset(peaks_E2F4, peaks_E2F4$chrom == 'chr1') gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = chipenrich(peaks_E2F4, method='chipenrich', locusdef='nearest_tss', genome = 'hg19', genesets=gs_path, out_name=NULL) # Get the list of peaks that were assigned to genes. assigned_peaks = results$peaks # Get the results of enrichment testing. enrich = results$results
# Run ChipEnrich using an example dataset, assigning peaks to the nearest TSS, # and on a small custom geneset data(peaks_E2F4, package = 'chipenrich.data') peaks_E2F4 = subset(peaks_E2F4, peaks_E2F4$chrom == 'chr1') gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = chipenrich(peaks_E2F4, method='chipenrich', locusdef='nearest_tss', genome = 'hg19', genesets=gs_path, out_name=NULL) # Get the list of peaks that were assigned to genes. assigned_peaks = results$peaks # Get the results of enrichment testing. enrich = results$results
The chipenrich
package includes three classes of methods that adjust for potential confounders of gene set enrichment testing (locus length and mappability of the sequence reads). The first, chipenrich
, is designed for use with transcription-factor (TF) based ChIP-seq experiments and other DNA sequencing experiments with narrow genomic regions. The second, polyenrich
, is similarly designed for TF based ChIP-seq, but where the number of peaks present in gene loci may be important. The third, broadenrich
, is designed for use with histone modification based ChIP-seq experiments and other DNA sequencing experiments with broad genomic regions.
This function filters gene sets based on the genes that are present in a particular
locus definition. After determining which genes are present in both the GeneSet,
gs_obj
, and the LocusDefinition ldef_obj
, gene sets are filtered
by size with min_geneset_size
and max_geneset_size
.
filter_genesets( gs_obj, ldef_obj, min_geneset_size = 15, max_geneset_size = 2000 )
filter_genesets( gs_obj, ldef_obj, min_geneset_size = 15, max_geneset_size = 2000 )
gs_obj |
A valid GeneSet object |
ldef_obj |
A valid LocusDefinition object |
min_geneset_size |
An integer indicating the floor for genes in a geneset. Default 15. |
max_geneset_size |
An integer indicating the ceiling for genes in a geneset. Default 2000. |
An altered gs_obj
with changed set.gene
and all.genes
slots reflecting min_geneset_size
and max_geneset_size
after intersecting
with the genes present in the particular locus definition.
Data from chipenrich.data
uses three letter organism codes for the
GeneSet
objects. This function ensures the correct objects are loaded.
genome_to_organism(genome = supported_genomes())
genome_to_organism(genome = supported_genomes())
genome |
One of the |
A string for the three letter organism code. Convention is first letter of the first word in the binomial name, and first two letters of the second word in the binomial name. 'Homo sapiens' is then 'hsa', for example.
If a custom locus definition is one of the supported_genomes()
, then
the gene symbol column of the custom locus definition is populated using the
appropriate orgDb package.
genome_to_orgdb(genome = supported_genomes())
genome_to_orgdb(genome = supported_genomes())
genome |
One of the |
A data.frame
with gene_id
and symbol
columns.
The method
comes from what is used in chipenrich()
or in
polyenrich()
.
get_test_method(method)
get_test_method(method)
method |
A character for the method used. One of the |
A singleton named character vector with value of the test function and name of the method.
Hybrid test is designed for people unsure of which test between ChIP-Enrich and Poly-Enrich to use, so it takes information of both and gives adjusted P-values. For more about ChIP- and Poly-Enrich, consult their corresponding documentation.
hybridenrich( peaks, out_name = "hybridenrich", out_path = getwd(), genome = supported_genomes(), genesets = c("GOBP", "GOCC", "GOMF"), locusdef = "nearest_tss", methods = c("chipenrich", "polyenrich"), weighting = NULL, mappability = NULL, qc_plots = TRUE, min_geneset_size = 15, max_geneset_size = 2000, num_peak_threshold = 1, randomization = NULL, n_cores = 1 )
hybridenrich( peaks, out_name = "hybridenrich", out_path = getwd(), genome = supported_genomes(), genesets = c("GOBP", "GOCC", "GOMF"), locusdef = "nearest_tss", methods = c("chipenrich", "polyenrich"), weighting = NULL, mappability = NULL, qc_plots = TRUE, min_geneset_size = 15, max_geneset_size = 2000, num_peak_threshold = 1, randomization = NULL, n_cores = 1 )
peaks |
Either a file path or a |
out_name |
Prefix string to use for naming output files. This should not
contain any characters that would be illegal for the system being used (Unix,
Windows, etc.) The default value is "chipenrich", and a file "chipenrich_results.tab"
is produced. If |
out_path |
Directory to which results files will be written out. Defaults
to the current working directory as returned by |
genome |
One of the |
genesets |
A character vector of geneset databases to be tested for
enrichment. See |
locusdef |
One of: 'nearest_tss', 'nearest_gene', 'exon', 'intron', '1kb',
'1kb_outside', '1kb_outside_upstream', '5kb', '5kb_outside', '5kb_outside_upstream',
'10kb', '10kb_outside', '10kb_outside_upstream'. For a description of each,
see the vignette or |
methods |
A character string array specifying the method to use for enrichment testing. Currently actually unused as the methods are forced to be one chipenrich and one polyenrich. |
weighting |
A character string specifying the weighting method. Method name will automatically be "polyenrich_weighted" if given weight options. Current options are: 'signalValue', 'logsignalValue', and 'multiAssign'. |
mappability |
One of |
qc_plots |
A logical variable that enables the automatic generation of plots for quality control. |
min_geneset_size |
Sets the minimum number of genes a gene set may have to be considered for enrichment testing. |
max_geneset_size |
Sets the maximum number of genes a gene set may have to be considered for enrichment testing. |
num_peak_threshold |
Sets the threshold for how many peaks a gene must have to be considered as having a peak. Defaults to 1. Only relevant for Fisher's exact test and ChIP-Enrich methods. |
randomization |
One of |
n_cores |
The number of cores to use for enrichment testing. We recommend using only up to the maximum number of physical cores present, as virtual cores do not significantly decrease runtime. Default number of cores is set to 1. NOTE: Windows does not support multicore enrichment. |
A data.frame containing:
results |
A data frame of the results from performing the gene set enrichment test on each geneset that was requested (all genesets are merged into one final data frame.) The columns are:
Other variables given will also be included, see the corresponding methods' documentation for their details. |
Given n tests that test for the same hypothesis, same Type I error rate, and
converted to p-values: p_1, ..., p_n
, the Hybrid p-value is computed as:
n*min(p_1, ..., p_n)
. This hybrid test will have at most the same
Type I error as any individual test, and if any of the tests have 100% power as
sample size goes to infinity, then so will the hybrid test.
Every input in hybridenrich is the same as in chipenrich and polyenrich. Inputs unique to chipenrich are: num_peak_threshold; and inputs unique to polyenrich are: weighting. Currently the test only supports running chipenrich and polyenrich, but future plans will allow you to run any number of different support tests.
Combines two existing results files and returns one results file with hybrid
p-values and FDR included. Current allowed inputs are objects from any of
the supplied enrichment tests or a dataframe with at least the following columns:
P.value, Geneset.ID
. Optional columns include: Status
. Currently
we only allow for joining two results files, but future plans will allow you to join
any number of results files.
Given a data.frame
in BEDX+Y format, use the built-in function
GenomicRanges::makeGRangesFromDataFrame()
to convert to GRanges
.
load_peaks(dframe, keep.extra.columns = TRUE)
load_peaks(dframe, keep.extra.columns = TRUE)
dframe |
A BEDX+Y style |
keep.extra.columns |
Keep extra columns parameter from |
Typically, this function will not be used alone, but inside chipenrich()
.
A GRanges
that may or may not keep.extra.columns
, and
that may or may not be stranded, depending on whether there is strand column
in the dframe
.
# Example with just chr, start, end peaks_df = data.frame( chr = c('chr1','chr2','chr3'), start = c(35,74,235), end = c(46,83,421), stringsAsFactors = FALSE) peaks = load_peaks(peaks_df) # Example with extra columns peaks_df = data.frame( chr = c('chr1','chr2','chr3'), start = c(35,74,235), end = c(46,83,421), strand = c('+','-','+'), score = c(36, 747, 13), stringsAsFactors = FALSE) peaks = load_peaks(peaks_df, keep.extra.columns = TRUE)
# Example with just chr, start, end peaks_df = data.frame( chr = c('chr1','chr2','chr3'), start = c(35,74,235), end = c(46,83,421), stringsAsFactors = FALSE) peaks = load_peaks(peaks_df) # Example with extra columns peaks_df = data.frame( chr = c('chr1','chr2','chr3'), start = c(35,74,235), end = c(46,83,421), strand = c('+','-','+'), score = c(36, 747, 13), stringsAsFactors = FALSE) peaks = load_peaks(peaks_df, keep.extra.columns = TRUE)
gene_id
columnFor each gene_id
, determine the locus length and the number of peaks.
num_peaks_per_gene(assigned_peaks, locusdef, mappa = NULL)
num_peaks_per_gene(assigned_peaks, locusdef, mappa = NULL)
assigned_peaks |
A |
locusdef |
A locus definition object from |
mappa |
A mappability object from |
Typically, this function will not be used alone, but inside chipenrich()
.
A data.frame
with columns gene_id, length, log10_length, num_peaks, peak
. The result is used directly in the gene set enrichment tests in chipenrich()
.
data('locusdef.hg19.nearest_tss', package = 'chipenrich.data') data('tss.hg19', package = 'chipenrich.data') file = system.file('extdata', 'test_assign.bed', package = 'chipenrich') peaks = read_bed(file) assigned_peaks = assign_peaks( peaks = peaks, locusdef = locusdef.hg19.nearest_tss, tss = tss.hg19) ppg = num_peaks_per_gene( assigned_peaks = assigned_peaks, locusdef = locusdef.hg19.nearest_tss, mappa = NULL)
data('locusdef.hg19.nearest_tss', package = 'chipenrich.data') data('tss.hg19', package = 'chipenrich.data') file = system.file('extdata', 'test_assign.bed', package = 'chipenrich') peaks = read_bed(file) assigned_peaks = assign_peaks( peaks = peaks, locusdef = locusdef.hg19.nearest_tss, tss = tss.hg19) ppg = num_peaks_per_gene( assigned_peaks = assigned_peaks, locusdef = locusdef.hg19.nearest_tss, mappa = NULL)
This function is used to create the *_peaks and *_peaks-per-gene files This way one does not need to remake these files whenever one just wants to test enrichment methods.
peaks2genes( peaks, out_name = "readyToEnrich", out_path = getwd(), genome = supported_genomes(), locusdef = "nearest_tss", weighting = NULL, mappability = NULL, qc_plots = TRUE, num_peak_threshold = 1, randomization = NULL )
peaks2genes( peaks, out_name = "readyToEnrich", out_path = getwd(), genome = supported_genomes(), locusdef = "nearest_tss", weighting = NULL, mappability = NULL, qc_plots = TRUE, num_peak_threshold = 1, randomization = NULL )
peaks |
Either a file path or a |
out_name |
Prefix string to use for naming output files. This should not
contain any characters that would be illegal for the system being used (Unix,
Windows, etc.) The default value is "polyenrich", and a file "polyenrich_results.tab"
is produced. If |
out_path |
Directory to which results files will be written out. Defaults
to the current working directory as returned by |
genome |
One of the |
locusdef |
One of: 'nearest_tss', 'nearest_gene', 'exon', 'intron', '1kb',
'1kb_outside', '1kb_outside_upstream', '5kb', '5kb_outside', '5kb_outside_upstream',
'10kb', '10kb_outside', '10kb_outside_upstream'. For a description of each,
see the vignette or |
weighting |
(Poly-Enrich only) character string specifying the weighting method if method is chosen to be 'polyenrich_weighted'. Current options are: 'signalValue', 'logsignalValue', and 'multiAssign'. |
mappability |
One of |
qc_plots |
A logical variable that enables the automatic generation of plots for quality control. |
num_peak_threshold |
(ChIP-Enrich only) Sets the threshold for how many peaks a gene must have to be considered as having a peak. Defaults to 1. Only relevant for Fisher's exact test and ChIP-Enrich methods. |
randomization |
One of |
A list, containing the following items:
opts |
A data frame containing the arguments/values passed to |
peaks |
A data frame containing peak assignments to genes. Peaks which do not overlap a gene locus are not included. Each peak that was assigned to a gene is listed, along with the peak midpoint or peak interval coordinates (depending on which was used), the gene to which the peak was assigned, the locus start and end position of the gene, and the distance from the peak to the TSS. The columns are:
|
peaks_per_gene |
A data frame of the count of peaks per gene. The columns are:
|
Randomization of locus definitions allows for the assessment of Type I Error under the null hypothesis. The randomization codes are:
NULL
:No randomizations, the default.
Shuffle the gene_id
and symbol
columns of the
locusdef
together, without regard for the chromosome location, or locus length.
The null hypothesis is that there is no true gene set enrichment.
Shuffle the gene_id
and symbol
columns of the
locusdef
together within bins of 100 genes sorted by locus length. The null
hypothesis is that there is no true gene set enrichment, but with preserved locus
length relationship.
Shuffle the gene_id
and symbol
columns of the
locusdef
together within bins of 50 genes sorted by genomic location. The null
hypothesis is that there is no true gene set enrichment, but with preserved
genomic location.
The return value with a selected randomization is the same list as without.
To assess the Type I error, the alpha
level for the particular data set
can be calculated by dividing the total number of gene sets with p-value < alpha
by the total number of tests. Users may want to perform multiple randomizations
for a set of peaks and take the median of the alpha
values.
Poly-Enrich also allows weighting of individual peaks. Currently the options are:
weighs each peak based on the Signal Value given in the narrowPeak format or a user-supplied column, normalized to have mean 1.
weighs each peak based on the log Signal Value given in the narrowPeak format or a user-supplied column, normalized to have mean 1.
weighs each peak by the inverse of the number of genes it is assigned to.
# Run peaks2genes using an example dataset, assigning peaks to the nearest TSS data(peaks_E2F4, package = 'chipenrich.data') peaks_E2F4 = subset(peaks_E2F4, peaks_E2F4$chrom == 'chr1') gs_path = system.file('extdata', package='chipenrich') results = peaks2genes(peaks_E2F4, locusdef='nearest_tss', genome = 'hg19', out_name=NULL) # Get the list of peaks that were assigned to genes. assigned_peaks = results$peaks
# Run peaks2genes using an example dataset, assigning peaks to the nearest TSS data(peaks_E2F4, package = 'chipenrich.data') peaks_E2F4 = subset(peaks_E2F4, peaks_E2F4$chrom == 'chr1') gs_path = system.file('extdata', package='chipenrich') results = peaks2genes(peaks_E2F4, locusdef='nearest_tss', genome = 'hg19', out_name=NULL) # Get the list of peaks that were assigned to genes. assigned_peaks = results$peaks
A plot showing an approximation of the empirical spline used to model the relationship between a gene having a peak and the locus length. For visual clarity, genes are binned into groups of 25 after sorting by locus length. Expected fits assuming independence of locus length and presence of a peak, and assuming proportionality of locus length and presence of a peak are given to demonstrate deviation from either for the dataset.
plot_chipenrich_spline( peaks, locusdef = "nearest_tss", genome = supported_genomes(), mappability = NULL, legend = TRUE, xlim = NULL )
plot_chipenrich_spline( peaks, locusdef = "nearest_tss", genome = supported_genomes(), mappability = NULL, legend = TRUE, xlim = NULL )
peaks |
Either a file path or a |
locusdef |
One of: 'nearest_tss', 'nearest_gene', 'exon', 'intron', '1kb',
'1kb_outside', '1kb_outside_upstream', '5kb', '5kb_outside', '5kb_outside_upstream',
'10kb', '10kb_outside', '10kb_outside_upstream'. For a description of each,
see the vignette or |
genome |
One of the |
mappability |
One of |
legend |
If true, a legend will be drawn on the plot. |
xlim |
Set the x-axis limit. NULL means select x-lim automatically. |
A trellis plot object.
# Spline plot for E2F4 example peak dataset. data(peaks_E2F4, package = 'chipenrich.data') # Create the plot for a different locus definition # to compare the effect. plot_chipenrich_spline(peaks_E2F4, locusdef = 'nearest_gene', genome = 'hg19')
# Spline plot for E2F4 example peak dataset. data(peaks_E2F4, package = 'chipenrich.data') # Create the plot for a different locus definition # to compare the effect. plot_chipenrich_spline(peaks_E2F4, locusdef = 'nearest_gene', genome = 'hg19')
Create a histogram of the distance from each peak to the nearest transcription start site (TSS) of any gene.
plot_dist_to_tss(peaks, genome = supported_genomes())
plot_dist_to_tss(peaks, genome = supported_genomes())
peaks |
Either a file path or a |
genome |
One of the |
A trellis plot object.
# Create histogram of distance from peaks to nearest TSS. data(peaks_E2F4, package = 'chipenrich.data') peaks_E2F4 = subset(peaks_E2F4, peaks_E2F4$chrom == 'chr1') plot_dist_to_tss(peaks_E2F4, genome = 'hg19')
# Create histogram of distance from peaks to nearest TSS. data(peaks_E2F4, package = 'chipenrich.data') peaks_E2F4 = subset(peaks_E2F4, peaks_E2F4$chrom == 'chr1') plot_dist_to_tss(peaks_E2F4, genome = 'hg19')
Create a plot showing the relationship between locus length and the proportion of gene loci covered by peaks.
plot_gene_coverage( peaks, locusdef = "nearest_tss", genome = supported_genomes(), mappability = NULL, legend = TRUE, xlim = NULL )
plot_gene_coverage( peaks, locusdef = "nearest_tss", genome = supported_genomes(), mappability = NULL, legend = TRUE, xlim = NULL )
peaks |
Either a file path or a |
locusdef |
One of: 'nearest_tss', 'nearest_gene', 'exon', 'intron', '1kb',
'1kb_outside', '1kb_outside_upstream', '5kb', '5kb_outside', '5kb_outside_upstream',
'10kb', '10kb_outside', '10kb_outside_upstream'. For a description of each,
see the vignette or |
genome |
One of the |
mappability |
One of |
legend |
If true, a legend will be drawn on the plot. |
xlim |
Set the x-axis limit. NULL means select x-lim automatically. |
A trellis plot object.
# Spline plot for E2F4 example peak dataset. data(peaks_H3K4me3_GM12878, package = 'chipenrich.data') # Create the plot for a different locus definition # to compare the effect. plot_gene_coverage(peaks_H3K4me3_GM12878, locusdef = 'nearest_gene', genome = 'hg19')
# Spline plot for E2F4 example peak dataset. data(peaks_H3K4me3_GM12878, package = 'chipenrich.data') # Create the plot for a different locus definition # to compare the effect. plot_gene_coverage(peaks_H3K4me3_GM12878, locusdef = 'nearest_gene', genome = 'hg19')
Create a plot the relationship between number of peaks assigned to a gene and locus length. The plot shows an empirical fit to the data using a binomial smoothing spline.
plot_polyenrich_spline( peaks, locusdef = "nearest_tss", genome = supported_genomes(), mappability = NULL, legend = TRUE, xlim = NULL, ylim = NULL )
plot_polyenrich_spline( peaks, locusdef = "nearest_tss", genome = supported_genomes(), mappability = NULL, legend = TRUE, xlim = NULL, ylim = NULL )
peaks |
Either a file path or a |
locusdef |
One of: 'nearest_tss', 'nearest_gene', 'exon', 'intron', '1kb',
'1kb_outside', '1kb_outside_upstream', '5kb', '5kb_outside', '5kb_outside_upstream',
'10kb', '10kb_outside', '10kb_outside_upstream'. For a description of each,
see the vignette or |
genome |
One of the |
mappability |
One of |
legend |
If true, a legend will be drawn on the plot. |
xlim |
Set the x-axis limit. NULL means select x-lim automatically. |
ylim |
Set the y-axis limit. NULL means select y-lim automatically. |
A trellis plot object.
# Spline plot for E2F4 example peak dataset. data(peaks_E2F4, package = 'chipenrich.data') # Create the plot for a different locus definition # to compare the effect. plot_polyenrich_spline(peaks_E2F4, locusdef = 'nearest_gene', genome = 'hg19')
# Spline plot for E2F4 example peak dataset. data(peaks_E2F4, package = 'chipenrich.data') # Create the plot for a different locus definition # to compare the effect. plot_polyenrich_spline(peaks_E2F4, locusdef = 'nearest_gene', genome = 'hg19')
Poly-Enrich is also designed for narrow peaks, for experiments with 100,000s of peaks, or in cases where the number of binding sites per gene affects its regulation. If unsure whether to use chipenrich or polyenrich, then we recommend hybridenrich. For more details, see the 'Poly-Enrich Method' section below. For help choosing a method, see the 'Choosing A Method' section below, or see the vignette.
polyenrich( peaks, out_name = "polyenrich", out_path = getwd(), genome = supported_genomes(), genesets = c("GOBP", "GOCC", "GOMF"), locusdef = "nearest_tss", method = "polyenrich", multiAssign = NULL, weighting = NULL, mappability = NULL, qc_plots = TRUE, min_geneset_size = 15, max_geneset_size = 2000, randomization = NULL, n_cores = 1 )
polyenrich( peaks, out_name = "polyenrich", out_path = getwd(), genome = supported_genomes(), genesets = c("GOBP", "GOCC", "GOMF"), locusdef = "nearest_tss", method = "polyenrich", multiAssign = NULL, weighting = NULL, mappability = NULL, qc_plots = TRUE, min_geneset_size = 15, max_geneset_size = 2000, randomization = NULL, n_cores = 1 )
peaks |
Either a file path or a |
out_name |
Prefix string to use for naming output files. This should not
contain any characters that would be illegal for the system being used (Unix,
Windows, etc.) The default value is "polyenrich", and a file "polyenrich_results.tab"
is produced. If |
out_path |
Directory to which results files will be written out. Defaults
to the current working directory as returned by |
genome |
One of the |
genesets |
A character vector of geneset databases to be tested for
enrichment. See |
locusdef |
One of: 'nearest_tss', 'nearest_gene', 'exon', 'intron', '1kb',
'1kb_outside', '1kb_outside_upstream', '5kb', '5kb_outside', '5kb_outside_upstream',
'10kb', '10kb_outside', '10kb_outside_upstream'. For a description of each,
see the vignette or |
method |
A character string specifying the method to use for enrichment
testing. Current options are |
multiAssign |
A boolean value for performing a multiassignment of peaks.
Default is |
weighting |
A character string specifying the weighting method if method is chosen to be 'polyenrich_weighted'. Current options are: 'signalValue' and 'logsignalValue'. |
mappability |
One of |
qc_plots |
A logical variable that enables the automatic generation of plots for quality control. |
min_geneset_size |
Sets the minimum number of genes a gene set may have to be considered for enrichment testing. |
max_geneset_size |
Sets the maximum number of genes a gene set may have to be considered for enrichment testing. |
randomization |
One of |
n_cores |
The number of cores to use for enrichment testing. We recommend using only up to the maximum number of physical cores present, as virtual cores do not significantly decrease runtime. Default number of cores is set to 1. NOTE: Windows does not support multicore enrichment. |
A list, containing the following items:
opts |
A data frame containing the arguments/values passed to |
peaks |
A data frame containing peak assignments to genes. Peaks which do not overlap a gene locus are not included. Each peak that was assigned to a gene is listed, along with the peak midpoint or peak interval coordinates (depending on which was used), the gene to which the peak was assigned, the locus start and end position of the gene, and the distance from the peak to the TSS. The columns are:
|
peaks_per_gene |
A data frame of the count of peaks per gene. The columns are:
|
results |
A data frame of the results from performing the gene set enrichment test on each geneset that was requested (all genesets are merged into one final data frame.) The columns are:
|
The Poly-Enrich method uses the number of peaks in genes in its model for
enrichment: num_peaks ~ GO + s(log10_length)
. Here, GO
is a
binary vector indicating whether a gene is in the gene set being tested,
num_peaks
is a numeric vector indicating the number of peaks in each
gene, and s(log10_length)
is a negative binomial cubic smoothing spline
which adjusts for the relationship between the number of peaks in a gene and
locus length.
Poly-Enrich also allows weighting of individual peaks. Currently the options are:
weighs each peak based on the Signal Value given in the narrowPeak format or a user-supplied column, normalized to have mean 1.
weighs each peak based on the log Signal Value given in the narrowPeak format or a user-supplied column, normalized to have mean 1.
The following guidelines are intended to help select an enrichment function:
is designed for use with broad peaks that may intersect multiple gene loci, and cumulatively cover greater than 5% of the genome. For example, ChIP-seq experiments for histone modifications.
is designed for use with 1,000s or 10,000s of narrow peaks which results in fewer gene loci containing a peak overall. For example, ChIP-seq experiments for transcription factors.
is also designed for narrow peaks, for experiments with 100,000s of peaks, or in cases where the number of binding sites per gene affects its regulation. If unsure whether to use chipenrich or polyenrich, then we recommend hybridenrich.
is a combination of chipenrich and polyenrich, to be used when one is unsure which is the optimal method.
Randomization of locus definitions allows for the assessment of Type I Error under the null hypothesis. The randomization codes are:
NULL
:No randomizations, the default.
Shuffle the gene_id
and symbol
columns of the
locusdef
together, without regard for the chromosome location, or locus length.
The null hypothesis is that there is no true gene set enrichment.
Shuffle the gene_id
and symbol
columns of the
locusdef
together within bins of 100 genes sorted by locus length. The null
hypothesis is that there is no true gene set enrichment, but with preserved locus
length relationship.
Shuffle the gene_id
and symbol
columns of the
locusdef
together within bins of 50 genes sorted by genomic location. The null
hypothesis is that there is no true gene set enrichment, but with preserved
genomic location.
The return value with a selected randomization is the same list as without.
To assess the Type I error, the alpha
level for the particular data set
can be calculated by dividing the total number of gene sets with p-value < alpha
by the total number of tests. Users may want to perform multiple randomizations
for a set of peaks and take the median of the alpha
values.
Other enrichment functions:
broadenrich()
,
chipenrich()
# Run Poly-Enrich using an example dataset, assigning peaks to the nearest TSS, # and on a small custom geneset data(peaks_E2F4, package = 'chipenrich.data') peaks_E2F4 = subset(peaks_E2F4, peaks_E2F4$chrom == 'chr1') gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = polyenrich(peaks_E2F4, method='polyenrich', locusdef='nearest_tss', genome = 'hg19', genesets=gs_path, out_name=NULL) # Get the list of peaks that were assigned to genes. assigned_peaks = results$peaks # Get the results of enrichment testing. enrich = results$results
# Run Poly-Enrich using an example dataset, assigning peaks to the nearest TSS, # and on a small custom geneset data(peaks_E2F4, package = 'chipenrich.data') peaks_E2F4 = subset(peaks_E2F4, peaks_E2F4$chrom == 'chr1') gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = polyenrich(peaks_E2F4, method='polyenrich', locusdef='nearest_tss', genome = 'hg19', genesets=gs_path, out_name=NULL) # Get the list of peaks that were assigned to genes. assigned_peaks = results$peaks # Get the results of enrichment testing. enrich = results$results
data.frame
of enrichment resultsPost process the data.frame
of enrichment results
post_process_enrichments(enrich)
post_process_enrichments(enrich)
enrich |
A |
A reformatted data.frame
with columns in a specific order, filtered
of enrichment tests that failed, and ordered first by enrichment 'Status' (if
present) and then 'P.value'.
Check for overlapping input regions, sort peaks, and force peak names
postprocess_peak_grs(gr)
postprocess_peak_grs(gr)
gr |
A |
A GRanges
that is sorted if the seqinfo
is set, and has named peaks.
This method is designed for a set of narrow genomic regions (e.g. TF peaks) and is used to test whether the genomic regions assigned to genes in a gene set are closer to regulatory locations (i.e. promoters or enhancers) than by chance.
proxReg( peaks, out_name = "proxReg", out_path = getwd(), genome = supported_genomes(), reglocation = "tss", genesets = c("GOBP", "GOCC", "GOMF"), randomization = NULL, qc_plots = TRUE, min_geneset_size = 15, max_geneset_size = 2000, n_cores = 1 )
proxReg( peaks, out_name = "proxReg", out_path = getwd(), genome = supported_genomes(), reglocation = "tss", genesets = c("GOBP", "GOCC", "GOMF"), randomization = NULL, qc_plots = TRUE, min_geneset_size = 15, max_geneset_size = 2000, n_cores = 1 )
peaks |
Either a file path or a |
out_name |
Prefix string to use for naming output files. This should not
contain any characters that would be illegal for the system being used (Unix,
Windows, etc.) The default value is "proxReg", and a file "proxReg_results.tab"
is produced. If |
out_path |
Directory to which results files will be written out. Defaults
to the current working directory as returned by |
genome |
One of the |
reglocation |
One of: 'tss', 'enhancer'. Details in the "Regulatory locations" section |
genesets |
A character vector of geneset databases to be tested for
enrichment. See |
randomization |
One of: 'shuffle', 'unif', 'bylength', 'byenh'. These were used to test for Type I error under the null hypothesis. A general user will never have to use these. |
qc_plots |
A logical variable that enables the automatic generation of plots for quality control. |
min_geneset_size |
Sets the minimum number of genes a gene set may have to be considered for testing. |
max_geneset_size |
Sets the maximum number of genes a gene set may have to be considered for testing. |
n_cores |
The number of cores to use for testing. We recommend using only up to the maximum number of physical cores present, as virtual cores do not significantly decrease runtime. Default number of cores is set to 1. NOTE: Windows does not support multicore testing. |
A list, containing the following items:
opts |
A data frame containing the arguments/values passed to |
peaks |
A data frame containing peak assignments to genes. Peaks which do not overlap a gene locus are not included. Each peak that was assigned to a gene is listed, along with the peak midpoint or peak interval coordinates (depending on which was used), the gene to which the peak was assigned, the locus start and end position of the gene, and the distance from the peak to the TSS. The columns are:
|
results |
A data frame of the results from performing the proxReg test on each geneset that was requested (all genesets are merged into one final data frame.) The columns are:
|
Current supported regulatory locations are gene transcription start sites (tss) or enhancer locations (hg19 only)
ProxReg first calculates the distance between each peak midpoint and regulatory location in base pairs. For gene transcription start sites, since parts of the chromosome are more sparse than others, there is an association with gene locus length that needs to be adjusted for. When using tss as the regulatory location, the peak distances are adjusted for this confounding variable based on an average of 90 ENCODE ChIP-seq experiments (details in citation pending). Similarly, for enhancers, distances depend on the density of enhancers within a gene locus, so distance to enhancer is adjusted using an empirical average of 90 ChIP-seq ENCODE experiments.
For each gene set of interest, the genomic regions are divided into two groups indicating the gene with the nearest tss is in the gene set or not. A Wilcoxon Rank-Sum test is then done to test for a difference in the adjusted distances (either to tss or enhancer).
# Run proxReg using an example dataset, assigning peaks to the nearest TSS, # and on a small custom geneset data(peaks_E2F4, package = 'chipenrich.data') peaks_E2F4 = subset(peaks_E2F4, peaks_E2F4$chrom == 'chr1') gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = proxReg(peaks_E2F4, reglocation = 'tss', genome = 'hg19', genesets=gs_path, out_name=NULL) # Get the list of peaks that were assigned to genes and their distances to # regulatory regions. assigned_peaks = results$peaks # Get the results of enrichment testing. enrich = results$results
# Run proxReg using an example dataset, assigning peaks to the nearest TSS, # and on a small custom geneset data(peaks_E2F4, package = 'chipenrich.data') peaks_E2F4 = subset(peaks_E2F4, peaks_E2F4$chrom == 'chr1') gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = proxReg(peaks_E2F4, reglocation = 'tss', genome = 'hg19', genesets=gs_path, out_name=NULL) # Get the list of peaks that were assigned to genes and their distances to # regulatory regions. assigned_peaks = results$peaks # Get the results of enrichment testing. enrich = results$results
The following formats are fully supported via their file extensions: .bed,
.broadPeak, .narrowPeak, .gff3, .gff2, .gff, and .bedGraph or .bdg. BED3 through
BED6 files are supported under the .bed extension. Files without these extensions
are supported under the conditions that the first 3 columns correspond to
chr, start, and end and that there is either no header column, or it is
commented out. Files may be compressed with gzip, and so might end in .narrowPeak.gz,
for example. For files with extension support, the rtracklayer::import()
function is used to read peaks, so adherence to the mentioned file formats is
necessary.
read_bed(file_path)
read_bed(file_path)
file_path |
A path to a file with input peaks/regions. See extended description above for details about file support. |
NOTE: Header rows must be commented with #
to be ignored. Otherwise,
an error may result.
NOTE: A warning is given if any input regions overlap. In the case of enrichment
testing with method = 'broadenrich'
, regions should be disjoint.
Typically, this function will not be used alone, but inside chipenrich()
.
A GRanges
with mcols
matching any extra columns.
# Example of generic .txt file with peaks file = system.file('extdata', 'test_header.txt', package = 'chipenrich') peaks = read_bed(file) # Example of BED3 file = system.file('extdata', 'test_assign.bed', package = 'chipenrich') peaks = read_bed(file) # Example of narrowPeak file = system.file('extdata', 'test.narrowPeak', package = 'chipenrich') peaks = read_bed(file) # Example of gzipped broadPeak file = system.file('extdata', 'test.broadPeak.gz', package = 'chipenrich') peaks = read_bed(file) # Example of gzipped gff3 Fly peaks file = system.file('extdata', 'test.gff3.gz', package = 'chipenrich') peaks = read_bed(file)
# Example of generic .txt file with peaks file = system.file('extdata', 'test_header.txt', package = 'chipenrich') peaks = read_bed(file) # Example of BED3 file = system.file('extdata', 'test_assign.bed', package = 'chipenrich') peaks = read_bed(file) # Example of narrowPeak file = system.file('extdata', 'test.narrowPeak', package = 'chipenrich') peaks = read_bed(file) # Example of gzipped broadPeak file = system.file('extdata', 'test.broadPeak.gz', package = 'chipenrich') peaks = read_bed(file) # Example of gzipped gff3 Fly peaks file = system.file('extdata', 'test.gff3.gz', package = 'chipenrich') peaks = read_bed(file)
This function reads a two-columned tab-delimited text file (with header). Column names are ignored, but the first column should be geneset names or IDs and the second column should be Entrez Gene IDs.
read_geneset(file_path)
read_geneset(file_path)
file_path |
A file path for the custom gene set. |
A GeneSet
class object.
This function reads a tab-delimited text (with a header) file that should have
columns 'chr', 'start', 'end', and a column named 'gene_id' (or 'geneid') with the
Entrez Gene ID. If a supported_genomes()
is given, then a column
of gene symbols named 'symbol', will be added. If an unsupported genome is
used there are two options: 1) Have a column named 'symbols' with the gene
symbols in the custom locus definition, and leave genome = NA
, or 2)
leave genome = NA
, do not provide gene symbols, and NAs will be used.
read_ldef(file_path, genome = NA)
read_ldef(file_path, genome = NA)
file_path |
A file path for the custom locus definition. |
genome |
A genome from |
A LocusDefinition
class object with slots dframe
,
granges
, genome.build
, and organism
.
This function reads a two-columned tab-delimited text file (with header). Expected column names are 'mappa' and 'gene_id'. Each line is for a unique 'gene_id' and contains the mappability (between 0 and 1) for that gene.
read_mappa(file_path)
read_mappa(file_path)
file_path |
A file path for the custom mappability. |
A data.frame
containing gene_id
and mappa
columns.
Recode a vector of number of peaks to binary based on threshold
recode_peaks(num_peaks, threshold = 1)
recode_peaks(num_peaks, threshold = 1)
num_peaks |
An |
threshold |
An |
An binary vector where an entry is 1 if the corresponding entry of
num_peaks
is >= threshold
and is otherwise 0.
We use parallel::mclapply
for multicore geneset enrichment testing, but
this function supports more than one core if the OS is not Windows. If the OS
is windows, the number of cores (mc.cores
) must be set to 1.
reset_ncores_for_windows(n_cores)
reset_ncores_for_windows(n_cores)
n_cores |
An |
Either the original n_cores
if the OS is not Windows, or 1 if
the OS is Windows.
Function to setup genesets
setup_genesets(gs_codes, ldef_obj, genome, min_geneset_size, max_geneset_size)
setup_genesets(gs_codes, ldef_obj, genome, min_geneset_size, max_geneset_size)
gs_codes |
A character vector of geneset databases to be tested for
enrichment. See |
ldef_obj |
A |
genome |
One of the |
min_geneset_size |
Sets the minimum number of genes a gene set may have to be considered for enrichment testing. |
max_geneset_size |
Sets the maximum number of genes a gene set may have to be considered for enrichment testing. |
A list with components consisting of GeneSet
objects for each
of the elements of genesets
. NOTE: Custom genesets must be run separately
from built in gene sets.
Function to setup locus definitions
setup_locusdef(ldef_code, genome, randomization = NULL)
setup_locusdef(ldef_code, genome, randomization = NULL)
ldef_code |
One of 'nearest_tss', 'nearest_gene', 'exon', 'intron', '1kb',
'1kb_outside', '1kb_outside_upstream', '5kb', '5kb_outside', '5kb_outside_upstream',
'10kb', '10kb_outside', '10kb_outside_upstream'. Alternately, a file path for
a custom locus definition. NOTE: Must be for a |
genome |
One of the |
randomization |
One of |
A list with components ldef
and tss
.
Function to setup mappability
setup_mappa(mappa_code, genome, ldef_code, ldef_obj)
setup_mappa(mappa_code, genome, ldef_code, ldef_obj)
mappa_code |
One of |
genome |
One of the |
ldef_code |
One of 'nearest_tss', 'nearest_gene', 'exon', 'intron', '1kb',
'1kb_outside', '1kb_outside_upstream', '5kb', '5kb_outside', '5kb_outside_upstream',
'10kb', '10kb_outside', '10kb_outside_upstream'. Alternately, a file path for
a custom locus definition. NOTE: Must be for a |
ldef_obj |
A |
A data.frame
with columns gene_id
and mappa
.
Display supported genesets for gene set enrichment.
supported_genesets()
supported_genesets()
A data.frame
with columns geneset, organism
.
supported_genesets()
supported_genesets()
Display supported genomes.
supported_genomes()
supported_genomes()
A vector indicating supported genomes.
supported_genomes()
supported_genomes()
The locus definitions are defined as below. For advice on selecting a locus definition, see the 'Selecting A Locus Definition' section below.
The locus is the region spanning the midpoints between the TSSs of adjacent genes.
The locus is the region spanning the midpoints between the boundaries of each gene, where a gene is defined as the region between the furthest upstream TSS and furthest downstream TES for that gene. If two gene loci overlap each other, we take the midpoint of the overlap as the boundary between the two loci. When a gene locus is completely nested within another, we create a disjoint set of 3 intervals, where the outermost gene is separated into 2 intervals broken apart at the endpoints of the nested gene.
The locus is the region within 1kb of any of the TSSs belonging to a gene. If TSSs from two adjacent genes are within 2 kb of each other, we use the midpoint between the two TSSs as the boundary for the locus for each gene.
The locus is the region more than 1kb upstream from a TSS to the midpoint between the adjacent TSS.
The locus is the region more than 1kb upstream or downstream from a TSS to the midpoint between the adjacent TSS.
The locus is the region within 5kb of any of the TSSs belonging to a gene. If TSSs from two adjacent genes are within 10 kb of each other, we use the midpoint between the two TSSs as the boundary for the locus for each gene.
The locus is the region more than 5kb upstream from a TSS to the midpoint between the adjacent TSS.
The locus is the region more than 5kb upstream or downstream from a TSS to the midpoint between the adjacent TSS.
The locus is the region within 10kb of any of the TSSs belonging to a gene. If TSSs from two adjacent genes are within 20 kb of each other, we use the midpoint between the two TSSs as the boundary for the locus for each gene.
The locus is the region more than 10kb upstream from a TSS to the midpoint between the adjacent TSS.
The locus is the region more than 10kb upstream or downstream from a TSS to the midpoint between the adjacent TSS.
Each gene has multiple loci corresponding to its exons. Overlaps between different genes are allowed.
Each gene has multiple loci corresponding to its introns. Overlaps between different genes are allowed.
supported_locusdefs()
supported_locusdefs()
A data.frame
with columns genome, locusdef
.
For a transcription factor ChIP-seq experiment, selecting a particular locus definition for use in enrichment testing can have implications relating to how the TF regulates genes. For example, selecting the '1kb' locus definition will imply that the biological processes found enriched are a result of TF regulation near the promoter. In contrast, selecting the '5kb_outside' locus definition will imply that the biological processes found enriched are a result of TF regulation distal from the promoter.
Selecting a locus definition can also help reduce the noise in the enrichment tests. For example, if a TF is known to primarily regulate genes by binding around the promoter, then selecting the '1kb' locus definition can help to reduce the noise from TSS-distal peaks in the enrichment testing.
The plot_dist_to_tss
QC plot displays
where genomic regions fall relative to TSSs genome-wide, and can help inform
the choice of locus definition. For example, if many peaks fall far from the
TSS, the 'nearest_tss' locus definition may be a good choice because it will
capture all input genomic regions, whereas the '1kb' locus definition may
not capture many of the input genomic regions and adversely affect the
enrichment testing.
supported_locusdefs()
supported_locusdefs()
Display supported gene set enrichment methods.
supported_methods()
supported_methods()
A vector indicating supported methods for gene set enrichment.
supported_methods()
supported_methods()
Display supported read lengths for mappability
supported_read_lengths()
supported_read_lengths()
A data.frame
with columns genome, locusdef, read_length
.
supported_read_lengths()
supported_read_lengths()