Title: | Grey Lists -- Mask Artefact Regions Based on ChIP Inputs |
---|---|
Description: | Identify regions of ChIP experiments with high signal in the input, that lead to spurious peaks during peak calling. Remove reads aligning to these regions prior to peak calling, for cleaner ChIP analysis. |
Authors: | Matt Eldridge [cre], Gord Brown [aut] |
Maintainer: | Matt Eldridge <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.39.0 |
Built: | 2024-12-19 03:39:19 UTC |
Source: | https://github.com/bioc/GreyListChIP |
Based on the counts from countReads
, sample counts from the set
several times, estimate the parameters of the negative binomial distribution
for each sample, then calculate the mean of the parameters (size and
mu). Use these values to calculate the read count threshold, given
the specified p-value threshold.
calcThreshold(obj, reps=100, sampleSize=30000, p=0.99,cores=1)
calcThreshold(obj, reps=100, sampleSize=30000, p=0.99,cores=1)
obj |
A |
reps |
The number of times to sample bins and estimate the parameters of the negative binomial distribution. |
sampleSize |
The number of bins to sample on each repetition. |
p |
The p-value threshold for marking bins as “grey”. |
cores |
The number of CPU cores (parallel threads) to use when sampling repeatedly from the set of counts |
This method samples from the set of counts generated during the
countReads
step. Each sample is fitted to the negative binomial
distribution, and the parameters estimated. The means of the mu
and size
parameters is calculated, then used to choose a read count
threshold, given the p-value cutoff provided. If cores
is given,
the process will use that many cores to parallelize the parameter estimation.
The modified GreyList
object, with the threshold added.
Gord Brown
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
# Load a pre-built R object with counts. data(greyList) # Calculate the threshold: gl <- calcThreshold(greyList,reps=10,sampleSize=1000,p=0.99,cores=1)
# Load a pre-built R object with counts. data(greyList) # Calculate the threshold: gl <- calcThreshold(greyList,reps=10,sampleSize=1000,p=0.99,cores=1)
GRanges
object representing ENCODE signal artefact
or "black list" regions.These objects are made from the so-called black list or signal artefact regions defined by Anshul Kundaje and Alan Boyle as part of the ENCODE and modENCODE projects. They are regions which show a signal in essentially any ChIP or similar experiment. For this reason it is useful to remove reads aligning to these regions before carrying out any sort of functional genomics analysis.
These objects can be merged with GreyList
objects to
perform grey list and black list filtering in one step. They are included
in the package as a convenience, with the permission of the creators.
The package includes black lists for worm (ce10, ce11), fly (dm3, dm6), human (hg19, grch37, hg38, grch38), and mouse (mm9, mm10). (Note that hg19 and grch37 differ only in chromosome naming, and likewise for hg38 and grch38.)
data(ce10.blacklist,package=GreyListChIP) data(ce11.blacklist,package=GreyListChIP) data(dm3.blacklist,package=GreyListChIP) data(dm6.blacklist,package=GreyListChIP) data(grch37.blacklist,package=GreyListChIP) data(hg19.blacklist,package=GreyListChIP) data(grch38.blacklist,package=GreyListChIP) data(hg38.blacklist,package=GreyListChIP) data(mm9.blacklist,package=GreyListChIP) data(mm10.blacklist,package=GreyListChIP)
data(ce10.blacklist,package=GreyListChIP) data(ce11.blacklist,package=GreyListChIP) data(dm3.blacklist,package=GreyListChIP) data(dm6.blacklist,package=GreyListChIP) data(grch37.blacklist,package=GreyListChIP) data(hg19.blacklist,package=GreyListChIP) data(grch38.blacklist,package=GreyListChIP) data(hg38.blacklist,package=GreyListChIP) data(mm9.blacklist,package=GreyListChIP) data(mm10.blacklist,package=GreyListChIP)
An S4 GRanges
object.
A set of intervals defining the black list regions for the named genome.
https://sites.google.com/site/anshulkundaje/projects/blacklists
Amemiya HM, Kundaje A, Boyle AP. The ENCODE blacklist: identification of problematic regions of the genome. Sci Rep. 2019 Dec; 9(1) 9354 DOI: 10.1038/s41598-019-45839-z.
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012 Sep 6;489(7414):57-74. doi: 10.1038/nature11247.
BamFile
Given the tiling of the genome created when the GreyList
object was created (or replaced via getKaryotype
), count reads
overlapping the bins, in preparation for estimating the threshold for
grey-listing bins.
countReads(obj,bamFile,yieldSize=NA_integer_)
countReads(obj,bamFile,yieldSize=NA_integer_)
obj |
A |
bamFile |
A |
yieldSize |
Number of records to yield each time the BAM file is read. |
This method counts reads contained within the bins that make up the genome tiling. Bins are overlapping (by default 1Kb bins at 512b intervals) so reads are counted once for each bin that wholly contains them.
Setting the yieldSize
can help control the memory usage. If unset, a
value of 1,000,000 is used by the summarizeOverlaps
function
from the GenomicAlignments
package that this method calls. Setting the
yieldSize
to a lower value will reduce the memory requirement at the
expense of longer run times.
The modified GreyList
object, with added counts.
Gord Brown
# Load a pre-built GreyList object. data(greyList) path <- system.file("extra", package="GreyListChIP") ## Not run: fn <- file.path(path,"sample_chr21.bam") ## Not run: gl <- countReads(greyList,fn)
# Load a pre-built GreyList object. data(greyList) path <- system.file("extra", package="GreyListChIP") ## Not run: fn <- file.path(path,"sample_chr21.bam") ## Not run: gl <- countReads(greyList,fn)
GreyList
objectThough a BSgenome
object (or a karyotype file) is
supplied when the
GreyList
object is created, it is conceivable that
the user might want to replace it. This method allows that.
getKaryotype(obj, genome, tileSize=1024)
getKaryotype(obj, genome, tileSize=1024)
obj |
A |
genome |
A |
tileSize |
The size in nucleotides of each tile. Overlapping tiles will be generated, spaced at 1/2 the width of the tiles. |
Returns the GreyList
object with a new genome and
tiling.
Gord Brown
# Load a pre-built GreyList objct. data(greyList) library(BSgenome.Hsapiens.UCSC.hg19) # Replace the karyotype, updating the genome tiling. ## Not run: gl <- getKaryotype(greyList,BSgenome.Hsapiens.UCSC.hg19)
# Load a pre-built GreyList objct. data(greyList) library(BSgenome.Hsapiens.UCSC.hg19) # Replace the karyotype, updating the genome tiling. ## Not run: gl <- getKaryotype(greyList,BSgenome.Hsapiens.UCSC.hg19)
GreyList
object for use in examples.This is a sample GreyList
object, covering
only human chromosome 21 (from genome version hg19). The input library
used to generate this grey list can be found in the European Nucleotide
Archive, under accession number ERR336953.
The library was made from a culture of the MCF-7 cell line, bought from ATCC.
The library was sequenced to a depth of 35,716,191 reads on an Illumina
Genome Analyzer IIx. The reads were aligned to human reference genome hg19
(GRCh37) using BWA version 0.7.5a with default parameters. Approximately
96% of reads aligned to the reference genome.
Reads aligning to chromosome 21 were extracted using Samtools
.
The chromosome 21-only karyotype file was created by deleting all lines
except chromosome 21, in a file generated by fetchChromSizes
as
described in the vignette. This package was then used to create the
GreyList
sample object.
When printed, the object displays several important slots in the object (if they have been filled with calculated values). For example, this object has all its slots filled, indicating that the analysis is complete:
GreyList on karyotype file karyotype_chr21.txt tiles: 94004 files: jc899_chr21.bam size (mean): 0.370332362145541 mu (mean): 10.2330008719269 params: reps=10, sample size=1000, p-value=0.99 threshold: 81 regions: 118 coverage: 4.45%
The fields are described in the class's documentation, but briefly, we can see:
the name of the karyotype file (or BSgenome
object),
the number of tiles (overlapping by 1/2 the tile width),
the BAM file(s) used for read counting (currently only 1 is allowed),
the two estimated parameters of the negative binomial distribution,
size
and mu
,
the input parameters,
the calculated read depth cutoff (over 1kb tiles),
the number of distinct regions, and
the percentage coverage of the reference genome.
The fact that all the fields are present indicates that the regions have been generated; otherwise fields still without values would be omitted. Of course any stage can be re-run with different parameters.
data(greyList)
data(greyList)
An S4 GreyList
object.
A GreyList
object named greyList
.
"GreyList"
ObjectsRegions of high signal in the input samples of a ChIP experiment can lead to artefacts in peak calling. This class generates "grey lists" of such regions, for use in filtering reads before peak calling (or filtering peaks after peak calling, though it is generally safer to filter first).
Objects can be created by calls of the
form new("GreyList", genome, ...)
,
where genome
is a "BSgenome"
object describing a genome, such as
BSgenome.Hsapiens.UCSC.hg19
. Alternatively, a karyotype
file can be provided explictly: new("GreyList", karyoFile=fn, ...)
.
Either genome
or karyoFile
must be provided; if both are present,
the BSgenome
object takes precedence. Alternatively,
an explicit list of regions may be provided as a GRanges
object.
genome
:The BSgenome
object
corresponding with the genome the reads are aligned to
karyotype
:The Seqinfo
object from
the BSgenome
object, or made from the
karyo_file
karyo_file
:The name of a file containing chromosome sizes for the reference genome of interest, one per line, as "chromName chromLength" pairs.
genomeRegions
:a GRanges
object
that defines which regions of the genome should be used to build
the grey list. This is to allow the list to be built on just part
of the genome.
tiles
:A GRanges
object with an
overlapping tiling of the genome (by default 1Kb tiles every 512b).
counts
:A numeric vector holding the counts corresponding to the tiling and the BAM file provided.
files
:A vector of BAM filenames that were used to generate the counts (currently only accepts one).
size_param
:The computed estimates of the "size" parameter
of the negative binomial distribution, estimated by
MASS::fitdistr
from repeated sampling from the counts.
size_stderr
:The standard errors of the "size" parameters, as
estimated by MASS::fitdistr
.
size_mean
:The mean of the "size" estimates.
mu_param
:Computed estimates of the "mu" parameter of the
negative binomial distribution, estimated by MASS::fitdistr
from repeated sampling from the counts.
mu_stderr
:The standard errors of the "mu" parameter.
mu_mean
:The mean of the "mu" estimates.
reps
:How many samples from the counts were taken.
sample_size
:How many values were sampled from the counts, for each estimate of "size" and "mu".
pvalue
:The requested p-value threshold.
threshold
:The calculated threshold, based on the p-value.
max_gap
:The largest gap to consider when merging nearby regions (i.e. if there are "grey" regions up to this many nucleotides apart, merge them into one long region).
regions
:A GRanges
object defining the
final grey list regions.
coverage
:The percentage of the genome covered by the grey list regions.
signature(obj = "GreyList")
: Calculate the
cutoff for reads in bins, based on fitting the counts to a negative
binomial distribution.
signature(obj = "GreyList")
: Count reads in bins
across the genome.
signature(object = "GreyList",
con = "character", format = "missing")
: Write the grey list
to a file.
signature(.Object = "GreyList")
: Create an initial
object (invoked automatically by new("GreyList",...)
).
signature(obj = "GreyList")
: Load a genome
description from a file. The file format is one line per chromosome,
with the name of the chromosome followed by white space followed by
an integer indicating the length of the chromosome.
signature(obj = "GreyList")
: Get the karyotype
of a genome from a BSgenome
object.
signature(obj = "GreyList")
: Set the region(s)
of a genome to use in making the GreyList
object.
signature(obj = "GreyList")
: Compute the actual
grey list, after calculating the threshold.
signature(object = "GreyList")
: Display the grey list.
Gord Brown ([email protected])
showClass("GreyList") # Load a karyotype file: path <- system.file("extra", package="GreyListChIP") fn <- file.path(path,"karyotype_chr21.txt") # Create a GreyList object: gl <- new("GreyList",karyoFile=fn)
showClass("GreyList") # Load a karyotype file: path <- system.file("extra", package="GreyListChIP") fn <- file.path(path,"karyotype_chr21.txt") # Create a GreyList object: gl <- new("GreyList",karyoFile=fn)
This function is a convenience function, wrapping several steps of grey list
construction into one step. If you are content to accept the package's
defaults, and are using a BSgenome
object to supply
the karyotype, this function might be of use to you.
greyListBS(genome, bam)
greyListBS(genome, bam)
genome |
a BSgenome object, for the relevant genome. |
bam |
a BAM file to use for making the grey list. |
An object of class GreyList
.
Gord Brown
# If you want to accept the defaults for everything, you can create the # GreyList in one step using a BSgenome object: library(BSgenome.Hsapiens.UCSC.hg19) path <- system.file("extra", package="GreyListChIP") ## Not run: fn <- file.path(path,"sample_chr21.bam") ## Not run: gl <- greyListBS(BSgenome.Hsapiens.UCSC.hg19,fn)
# If you want to accept the defaults for everything, you can create the # GreyList in one step using a BSgenome object: library(BSgenome.Hsapiens.UCSC.hg19) path <- system.file("extra", package="GreyListChIP") ## Not run: fn <- file.path(path,"sample_chr21.bam") ## Not run: gl <- greyListBS(BSgenome.Hsapiens.UCSC.hg19,fn)
Load a karyotype from a file, (re)generate the tiling, for a
GreyList
object.
loadKaryotype(obj, karyoFile, tileSize=1024)
loadKaryotype(obj, karyoFile, tileSize=1024)
obj |
A |
karyoFile |
A text file describing a genome's karyotype. The format is one line per chromosome (or contig or whatever), with the name of the chromosome, some white space, and an integer giving the length of the chromosome in nucleotides. |
tileSize |
The width of tiles on which to count. Tiles will be placed every tileSize/2 nucleotides, to catch regions of high signal that might otherwise be split across (non-overlapping) tiles and hence missed. |
Returns the GreyList
object with a new genome and
tiling, loaded from the provided file.
Gord Brown
# load a pre-built GreyList object: data(greyList) # Get a karyotype file: path <- system.file("extra", package="GreyListChIP") fn <- file.path(path,"karyotype_chr21.txt") # Replace the karyotype in the GreyList: gl <- loadKaryotype(greyList,fn)
# load a pre-built GreyList object: data(greyList) # Get a karyotype file: path <- system.file("extra", package="GreyListChIP") fn <- file.path(path,"karyotype_chr21.txt") # Replace the karyotype in the GreyList: gl <- loadKaryotype(greyList,fn)
GreyList
objectCreate the actual grey list, based on the threshold calculated by
calcThreshold
and the read counts from countReads
.
makeGreyList(obj,maxGap=16384)
makeGreyList(obj,maxGap=16384)
obj |
The |
maxGap |
If the distance between neighbouring grey regions is less than
or equal to |
Create the grey list as a GRanges
object. Merge grey
regions if they are separated by up to maxGap
bp.
The modified GreyList
, with the regions added.
Gord Brown
# load a pre-built GreyList object: data(greyList) # calculate the actual regions: gl <- makeGreyList(greyList)
# load a pre-built GreyList object: data(greyList) # calculate the actual regions: gl <- makeGreyList(greyList)
GreyList
objectThough a BSgenome
object (or a karyotype file, or a
GRanges
) is
supplied when the
GreyList
object is created, it is conceivable that
the user might want to replace it. This method allows that.
setRegions(obj, regions, tileSize=1024)
setRegions(obj, regions, tileSize=1024)
obj |
A |
regions |
A |
tileSize |
The size in nucleotides of each tile. Overlapping tiles will be generated, spaced at 1/2 the width of the tiles. |
Returns the GreyList
object with a new
tiling.
Gord Brown
# Load a pre-built GreyList objct. data(greyList) # Replace the karyotype, updating the genome tiling. regions=GRanges(seqnames=Rle(c('chr21','chr21','chr22')), ranges=IRanges(c(1,20000,30000),end=c(10000,30000,40000))) gl <- setRegions(greyList,regions)
# Load a pre-built GreyList objct. data(greyList) # Replace the karyotype, updating the genome tiling. regions=GRanges(seqnames=Rle(c('chr21','chr21','chr22')), ranges=IRanges(c(1,20000,30000),end=c(10000,30000,40000))) gl <- setRegions(greyList,regions)