Title: | Copy number prediction with correction for GC and mappability bias for HTS data |
---|---|
Description: | Corrects GC and mappability biases for readcounts (i.e. coverage) in non-overlapping windows of fixed length for single whole genome samples, yielding a rough estimate of copy number for furthur analysis. Designed for rapid correction of high coverage whole genome tumour and normal samples. |
Authors: | Daniel Lai, Gavin Ha, Sohrab Shah |
Maintainer: | Daniel Lai <[email protected]> |
License: | GPL-3 |
Version: | 1.49.0 |
Built: | 2024-12-29 05:20:26 UTC |
Source: | https://github.com/bioc/HMMcopy |
HMMcopy is a package for making bias-free copy number estimation by correcting for GC-content and mappability bias in HTS readcounts. It also contains an implementation of the Hidden Markov Model to robustly segment a copy number profile into non-overlapping segments predicted to be of the same copy number state, and attributes a biological copy number aberration events to the segments.
HMMcopy takes as input WIG format files generated by fast C++ tools distributed as part of the HMMcopy Suite, namely readcount, GC-content and mappability values for non-overlapping fixed width “bins” across the reference genome of interest. It then uses a filtering and LOESS model to correct the GC-content and mappability biases observed in the readcounts (Benjamini and Speed, 2012), and uses the corrected readcounts as a proxy of copy number. The resultant copy number profile is then segmented with a six state Hidden Markov Model, with a handful of quick visualization functions for quick viewing.
Package: | HMMcopy |
Type: | Package |
Version: | 0.1.0 |
Date: | 2011-09-06 |
License: | GPL-3 |
example("HMMcopy-package")
for quick tour of functionality and
visualization
vignette("HMMcopy")
for detailed example
Daniel Lai, Gavin Ha, Sohrab Shah
Maintainer: Daniel Lai <[email protected]> and Gavin Ha <[email protected]>
Yuval Benjamini and Terence P Speed. Summarizing and correcting the gc content bias in high-throughput sequencing. Nucleic Acids Res, 40(10):e72, May 2012.
# Read WIG file input rfile <- system.file("extdata", "tumour.wig", package = "HMMcopy") gfile <- system.file("extdata", "gc.wig", package = "HMMcopy") mfile <- system.file("extdata", "map.wig", package = "HMMcopy") uncorrected_reads <- wigsToRangedData(rfile, gfile, mfile) # Correct reads into copy number corrected_copy <- correctReadcount(uncorrected_reads) # Segment copy number profile segmented_copy <- HMMsegment(corrected_copy) # Visualize one at a time par(ask = TRUE) plotBias(corrected_copy) plotCorrection(corrected_copy) plotSegments(corrected_copy, segmented_copy)
# Read WIG file input rfile <- system.file("extdata", "tumour.wig", package = "HMMcopy") gfile <- system.file("extdata", "gc.wig", package = "HMMcopy") mfile <- system.file("extdata", "map.wig", package = "HMMcopy") uncorrected_reads <- wigsToRangedData(rfile, gfile, mfile) # Correct reads into copy number corrected_copy <- correctReadcount(uncorrected_reads) # Segment copy number profile segmented_copy <- HMMsegment(corrected_copy) # Visualize one at a time par(ask = TRUE) plotBias(corrected_copy) plotCorrection(corrected_copy) plotSegments(corrected_copy, segmented_copy)
Corrects readcounts for GC and mappability bias using the binning/loess method optimized for speed.
correctReadcount(x, mappability = 0.9, samplesize = 50000, verbose = TRUE)
correctReadcount(x, mappability = 0.9, samplesize = 50000, verbose = TRUE)
x |
|
mappability |
Mappability threshold [0, 1] below which points are ignored during creating the correction curve. |
samplesize |
The number of points sampled during LOESS fitting, decreasing reduces runtime and memory usage, at the expense of robustness to data randomness. |
verbose |
Set to FALSE it messages are not desired. |
Input read counts are contained in the IRanges object, where number of reads within bins (or sometimes called windows) of defined genomic size are specified. GC content should also be computed using the exact same boundaries for each bin.
Ensure that the GC content and mappability files have been mapped to the same genome build (e.g. hg18) as the tumour and normal read libraries.
Here is a summary of the correction procedure.
Filter out bins with 0 reads and 0 GC content
Filter out bins with reads within the top and bottom 1% quantile (assumed to be outliers)
Filter out bins with GC content within the top and bottom 1% quantile
Filter out bins with a mappability score of greater than 0.9 ('mappability' argument).
Randomly sample up to 50000 ('samplesize' argument) of the remaining high-quality bins for the purposes keeping a good runtime.
The first loess (on the reads-by-GC curve) with a small span (smoothing window) is performed, obtaining typically a highly sensitive curve (follows low density tails of distribution, but gets jagged in high density center).
A second loess (on the first loess results) with a larger span is performed, recapitulating the curve in the low density tails and smoothing out the jagged regions in the high density center.
'cor.gc' is obtained by correcting each bin for GC content. The number of observed reads is divided by the number of reads predicted by the loess curve given an observed GC proportion.
Filter out just the top 1% quantile of the cor.gc bins, then _randomly_ sample up to 50000 ('samplesize' argument) bins.
A separate lowess curve is computed for mappability-by-GC (cor.gc).
'cor.map' is obtained by correcting each bin for mappability. The cor.gc value is divided by the cor.gc value predicted by the mappability lowess curve generated in the previous step.
'copy' is obtained by setting all cor.map values <= to NA (i.e. NaN), then apply log2
The original A RangedData
object
appended with 5 new columns:
Valid bins, which have valid read, gc, and mappability values
Ideal bins of high mappability and no outliers
GC-corrected readcounts
Mappability corrected GC-corrected readcounts
cor.map values after log base 2
Daniel Lai
Yuval Benjamini and Terence P Speed. Summarizing and correcting the gc content bias in high-throughput sequencing. Nucleic Acids Res, 40(10):e72, May 2012.
wigsToRangedData
to easily generate the proper input.
data(tumour) # Load tumour_reads tumour_copy <- correctReadcount(tumour_reads)
data(tumour) # Load tumour_reads tumour_copy <- correctReadcount(tumour_reads)
A set of data of chromosome 6 of matched tumour normal pair. The dataset that is described here belongs to a female triple-negative breast cancer patient from the dataset published in Shah2012,Ha2012. This genome library was sequenced on the ABI/Life SOLiD platform, generating hybrid lengths of 25bp and 50bp paired-end reads. The reads were aligned using BioScope https://products.appliedbiosystems.com/ where reads mapped to multiple sites were ignored.
The number of short reads in fixed width windows across
the chromosome, generated with wigsToRangedData
from WIG files
Tumour copy number profile generated by correcting
‘tumour_reads’ with correctReadcount
Normal copy number profile generated via
correctReadcount
Parameters for segmenting ‘tumour_copy’ in
HMMsegment
Segmented output of ‘tumour_copy’ HMMsegment
using ‘tumour_param’
data(tumour)
data(tumour)
‘tumour_reads’, ‘tumour_copy’, and ‘normal_copy’ are
RangedData
objects.
‘tumour_param’ is a numeric matrix.
‘tumour_segments’ is a list.
Gavin Ha, Andrew Roth, Daniel Lai, Ali Bashashati, Jiarui Ding, Rodrigo Goya, Ryan Giuliany, Jamie Rosner, Arusha Oloumi, Karey Shumansky, Suet-Feung Chin, Gulisa Turashvili, Martin Hirst, Carlos Caldas, Marco A Marra, Samuel Aparicio, and Sohrab P Shah. Integrative analysis of genome-wide loss of heterozygosity and mono-allelic expression at nucleotide resolution reveals disrupted pathways in triple negative breast cancer. Genome Research, (advanced online publication), May 2012.
S P Shah, A Roth, R Goya, A Oloumi, G Ha, Y Zhao, G Turashvili, J Ding, K Tse, G Hafari, A Bashashati, L M Prentice, J Khattra, A Burleigh, D Yap, V Bernard, A McPherson, K Shuman- sky, A Crisan, R Giuliany, A Heravi-Moussavi, J Rosner, D Lai, I Birol, R Varhol, A Tam, N Dhalla, T Zeng, K Ma, S K Chan, M Griffth, A Moradian, S W Cheng, G B Morin, P Watson, K Gelmon, S Chia, S F Chin, C Curtis, O M Rueda, P D Pharoah, S Damaraju, J Mackey, K Hoon, T Harkins, V Tadigotla, M Sigaroudinia, P Gascard, T Tlsty, J F Costello, I M Meyer, C J Eaves, W W Wasserman, S Jones, D Huntsman, M Hirst, C Caldas, M A Marra, and S Aparicio. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature, 486(7403):395-399, Jun 2012.
Takes in a copy number profile and segments it into predicted regions of equal copy number, and assigns a biologically motivated copy number state to each region using a Hidden Markov Model (HMM). This is an extension to the HMM described in Shah et al., 2006.
HMMsegment(correctOut, param = NULL, autosomes = NULL, maxiter = 50, getparam = FALSE, verbose = TRUE)
HMMsegment(correctOut, param = NULL, autosomes = NULL, maxiter = 50, getparam = FALSE, verbose = TRUE)
correctOut |
Output value from |
param |
If none is provided, will generate a reasonable set of parameters based on the input data, which can optionally be returned for inspection and manual adjustment by setting 'getparam' to TRUE. See Details for more information on parameters. A matrix with parameters values in columns for each state in rows:
|
autosomes |
Array of LOGICAL values corresponding to the 'chr' argument where an element is TRUE if the chromosome is an autosome, otherwise FALSE. If not provided, will automatically set the following chromosomes to false: "X", "Y", "23", "24", "chrX", chrY", "M", "MT", "chrM". |
maxiter |
The maximum number of iterations allows for the Maximum-Expectation algorithm, reduce to decrease running time at the expense of robustness. |
getparam |
If TRUE, generates and returns parameters without running segmentation. |
verbose |
Set to FALSE if messages are not desired |
HMMsegment
is a two stage algorithm that first runs an
Expectation-Maximization algorithm to find the optimal set of parameters
based on suggested parameter inputs, and allowed flexibilities. After
iteratively finding the optimal parameters, the actual segmentation of the
data is conducted with the Viterbi algorithm, finally output segmented
states. This is an extension to the hidden Markov model described in Shah
et al., 2006.
Parameters are divided into two main categories:
Initial parameters: e, mu, lambda, nu, kappa
Flexibility parameters: strength, m, eta, gamma, S
Where initial parameters are treated as starting suggestions for the parameter optimization algorithm, and flexibility parameters (hyperparameters) define how much the initial parameters are allowed to deviate during the search for the optimal parameters.
With a good copy number dataset, in theory, given enough flexibility, the algorithm should always find a similar set of optimal parameters regardless of initial parameters (although running times will vary).
If for some reason you wish to manually set the parameters for the final segmentation process, one should tune all flexibility parameters to minimal values. For example, if you wish to increase the length of segments, you could set:
param$e <- 0.9999999999999999 param$strength <- 1e30
Which suggests that segments should be very long, and gives minimal to non-existant flexibility to your suggestion.
See vignette for diagrammed example:
vignette("HMMcopy")
A list object containing multiple values, although in practice only the state assigned to each copy number value in 'states' and the segments of non-overlapping states in 'segs' are of interest.
By default, there are 6 states, which in a diploid sample corresponds to the following chromosomal copies and biological state:
<=0 copies, homozogous deletion
1 copy, heterozogous deletion
2 copies, neutral
3 copies, gain
4 copies, amplification
>=5 copies, high level amplification
The full list of output is as follows:
The state assigned to each copy number value
Non-overlapping segments and medians of each segment
Optimal median of of copy numbers in state
Optimal precision of copy numbers in state
Optimal state distribution
The likelihood values of each EM algorithm iteration
Posterior marginals (responsibilities) for each position and state
Daniel Lai, Gavin Ha, Sohrab Shah
Sohrab P Shah, Xiang Xuan, Ron J DeLeeuw, Mehrnoush Khojasteh, Wan L Lam, Raymond Ng, and Kevin P Murphy. Integrating copy number polymorphisms into array cgh analysis using a robust hmm. Bioinformatics, 22(14):e431-9, Jul 2006.
correctReadcount
, to correct the readcounts prior to
segmentation and classification for better results.
data(tumour) # Load tumour_copy tumour_segments <- HMMsegment(tumour_copy)
data(tumour) # Load tumour_copy tumour_segments <- HMMsegment(tumour_copy)
Convinience functions for creating plots for the analysis of the readcount
correction process by correctReadcount
plotBias(correctOutput, points = 10000, ...) plotCorrection(correctOutput, chr = correctOutput$chr[1], ...) plotSegments(correctOutput, segmentOutput, chr = correctOutput$chr[1], ...) plotParam(segmentOutput, param, ...) stateCols()
plotBias(correctOutput, points = 10000, ...) plotCorrection(correctOutput, chr = correctOutput$chr[1], ...) plotSegments(correctOutput, segmentOutput, chr = correctOutput$chr[1], ...) plotParam(segmentOutput, param, ...) stateCols()
correctOutput |
Output value from |
segmentOutput |
Output value from |
points |
Number of random sampled points to plot, decreasing reduces runtime |
chr |
Chromosome name to plot. A single value for |
param |
Input parameters to call that produced segmentOutput |
... |
Furthur arguments are passed to |
plotBias
shows the effects of the correction process on
GC bias and mappability bias in HTS readcounts.
plotCorrection
shows the effects of the correction on the copy
number profile. Defaultly plotting the entire first chromosome found in the
list.
plotSegments
shows the resultant segments and states assigned
to each segment.
plotParam
shows the initial suggested distribution of copy
number in each state (dashed), and the optimal distribution of copy number
in each state (solid)
stateCols
returns a vector of six colours used in
plotSegments
and plotParam
Daniel Lai
correctReadcount
and HMMsegment
for generating
intended output.
data(tumour) # Visualize one at a time par(ask = TRUE) plotBias(normal_copy) plotCorrection(tumour_copy) par(mfrow = c(1, 1)) plotSegments(tumour_copy, tumour_segments) plotParam(tumour_segments, tumour_param)
data(tumour) # Visualize one at a time par(ask = TRUE) plotBias(normal_copy) plotCorrection(tumour_copy) par(mfrow = c(1, 1)) plotSegments(tumour_copy, tumour_segments) plotParam(tumour_segments, tumour_param)
Fast fixedStep WIG file reading and parsing
wigToRangedData(wigfile, verbose = TRUE) wigToArray(wigfile, verbose = TRUE)
wigToRangedData(wigfile, verbose = TRUE) wigToArray(wigfile, verbose = TRUE)
wigfile |
Filepath to fixedStep WIG format file |
verbose |
Set to FALSE to suppress messages |
Reads the entire file into memory, then processes the file in rapid fashion, thus performance will be limited by memory capacity.
The WIG file is expected to conform to the minimal fixedStep WIG format (see References), where each chromsome is started by a “fixedStep” declaration line. The function assumes only a single track in the WIG file, and will ignore any lines before the first line starting with “fixedStep”.
RangedData
for
wigToRangedData
with chromosome and position information, sorted
in decreasing chromosal size and increasing position.
Numeric array
for wigToArray
sorted in decreasing chromosal size and increasing position.
Daniel Lai
wigsToRangedData
is a wrapper around these functions for easy
WIG file loading and structure formatting.
wigfile <- system.file("extdata", "tumour.wig", package = "HMMcopy") posAndValues <- wigToRangedData(wigfile) justValues <- wigToArray(wigfile)
wigfile <- system.file("extdata", "tumour.wig", package = "HMMcopy") posAndValues <- wigToRangedData(wigfile) justValues <- wigToArray(wigfile)
Fast fixedStep WIG file formatting and output
rangedDataToWig(correctOutput, file, column = "copy", sample = "R", verbose = TRUE) rangedDataToSeg(correctOutput, file, column = "copy", sample = "R", verbose = TRUE)
rangedDataToWig(correctOutput, file, column = "copy", sample = "R", verbose = TRUE) rangedDataToSeg(correctOutput, file, column = "copy", sample = "R", verbose = TRUE)
correctOutput |
|
file |
Filepath to write output to. |
column |
Column in input object to export. Defaults to corrected copy number. |
sample |
Sample name of the exported dataset, defaults to “R” |
verbose |
Set to FALSE to suppress messages. |
Assumes that all ranges in data set are non-overlapping windows of fixed width covering the entire genome. Note that positions in WIG files are 1-based while those in SEG files are 0-based.
Daniel Lai
correctReadcount
output is the intended input
data(tumour) # Load tumour_copy rangedDataToWig(tumour_copy, file = "test.wig") rangedDataToSeg(tumour_copy, file = "test.seg")
data(tumour) # Load tumour_copy rangedDataToWig(tumour_copy, file = "test.wig") rangedDataToSeg(tumour_copy, file = "test.seg")
Loads WIG files for readcount, GC, and mappability data for non-overlapping windows of fixed length (i.e. bins), and returns a structure ready to used for readcount correction. See Details for specifics about file assumptions.
wigsToRangedData(readfile, gcfile, mapfile, verbose = FALSE)
wigsToRangedData(readfile, gcfile, mapfile, verbose = FALSE)
readfile |
Pathname to WIG file containing readcounts per bin. |
gcfile |
Pathname to WIG file containing GC content per bin. |
mapfile |
Pathname to WIG file containing average mappability per bin. |
verbose |
Set to TRUE if messages are desired |
The number of lines in the three input files are expected to be identical, although the order and names of chromsomes in the file need not be identical. Chromosome lengths are required to be identical and unique, and if the latter is not true, the order of the chromosomes must then be identical.
At present, these three WIG files are expected to be generated by external programs, namely those from the HMMcopy suite (see See Also), rather than by existing R packages out of space and memory considerations when working with high coverage full genome samples.
A RangedData
object, where each row
entry represents a bin, with the three values from the input as columns named
reads, gc, and map.
Daniel Lai
correctReadcount
, to correct the readcounts in the resultant
value.
rfile <- system.file("extdata", "tumour.wig", package = "HMMcopy") gfile <- system.file("extdata", "gc.wig", package = "HMMcopy") mfile <- system.file("extdata", "map.wig", package = "HMMcopy") uncorrected_reads <- wigsToRangedData(rfile, gfile, mfile)
rfile <- system.file("extdata", "tumour.wig", package = "HMMcopy") gfile <- system.file("extdata", "gc.wig", package = "HMMcopy") mfile <- system.file("extdata", "map.wig", package = "HMMcopy") uncorrected_reads <- wigsToRangedData(rfile, gfile, mfile)