Package 'HMMcopy' reference manual

Title:	Copy number prediction with correction for GC and mappability bias for HTS data
Description:	Corrects GC and mappability biases for readcounts (i.e. coverage) in non-overlapping windows of fixed length for single whole genome samples, yielding a rough estimate of copy number for furthur analysis. Designed for rapid correction of high coverage whole genome tumour and normal samples.
Authors:	Daniel Lai, Gavin Ha, Sohrab Shah
Maintainer:	Daniel Lai <[email protected]>
License:	GPL-3
Version:	1.49.0
Built:	2025-03-29 04:15:37 UTC
Source:	https://github.com/bioc/HMMcopy

Bias-free copy number estimation and robust CNA detection in tumour samples from WGS HTS data

Description

HMMcopy is a package for making bias-free copy number estimation by correcting for GC-content and mappability bias in HTS readcounts. It also contains an implementation of the Hidden Markov Model to robustly segment a copy number profile into non-overlapping segments predicted to be of the same copy number state, and attributes a biological copy number aberration events to the segments.

Details

HMMcopy takes as input WIG format files generated by fast C++ tools distributed as part of the HMMcopy Suite, namely readcount, GC-content and mappability values for non-overlapping fixed width “bins” across the reference genome of interest. It then uses a filtering and LOESS model to correct the GC-content and mappability biases observed in the readcounts (Benjamini and Speed, 2012), and uses the corrected readcounts as a proxy of copy number. The resultant copy number profile is then segmented with a six state Hidden Markov Model, with a handful of quick visualization functions for quick viewing.

Package:	HMMcopy
Type:	Package
Version:	0.1.0
Date:	2011-09-06
License:	GPL-3

example("HMMcopy-package") for quick tour of functionality and visualization

vignette("HMMcopy") for detailed example

Author(s)

Daniel Lai, Gavin Ha, Sohrab Shah

Maintainer: Daniel Lai <[email protected]> and Gavin Ha <[email protected]>

References

Yuval Benjamini and Terence P Speed. Summarizing and correcting the gc content bias in high-throughput sequencing. Nucleic Acids Res, 40(10):e72, May 2012.

Examples


# Read WIG file input
rfile <- system.file("extdata", "tumour.wig", package = "HMMcopy")
gfile <- system.file("extdata", "gc.wig", package = "HMMcopy")
mfile <- system.file("extdata", "map.wig", package = "HMMcopy")
uncorrected_reads <- wigsToRangedData(rfile, gfile, mfile)

# Correct reads into copy number
corrected_copy <- correctReadcount(uncorrected_reads)

# Segment copy number profile
segmented_copy <- HMMsegment(corrected_copy)

# Visualize one at a time
par(ask = TRUE)
plotBias(corrected_copy)
plotCorrection(corrected_copy)
plotSegments(corrected_copy, segmented_copy)

# Read WIG file input
rfile <- system.file("extdata", "tumour.wig", package = "HMMcopy")
gfile <- system.file("extdata", "gc.wig", package = "HMMcopy")
mfile <- system.file("extdata", "map.wig", package = "HMMcopy")
uncorrected_reads <- wigsToRangedData(rfile, gfile, mfile)

# Correct reads into copy number
corrected_copy <- correctReadcount(uncorrected_reads)

# Segment copy number profile
segmented_copy <- HMMsegment(corrected_copy)

# Visualize one at a time
par(ask = TRUE)
plotBias(corrected_copy)
plotCorrection(corrected_copy)
plotSegments(corrected_copy, segmented_copy)

Readcount correction for GC and mappability bias

Description

Corrects readcounts for GC and mappability bias using the binning/loess method optimized for speed.

Usage

correctReadcount(x, mappability = 0.9, samplesize = 50000, verbose = TRUE)
correctReadcount(x, mappability = 0.9, samplesize = 50000, verbose = TRUE)

Arguments

`x`	`RangedData` object returned by `wigsToRangedData`
`mappability`	Mappability threshold [0, 1] below which points are ignored during creating the correction curve.
`samplesize`	The number of points sampled during LOESS fitting, decreasing reduces runtime and memory usage, at the expense of robustness to data randomness.
`verbose`	Set to FALSE it messages are not desired.

Details

Input read counts are contained in the IRanges object, where number of reads within bins (or sometimes called windows) of defined genomic size are specified. GC content should also be computed using the exact same boundaries for each bin.

Ensure that the GC content and mappability files have been mapped to the same genome build (e.g. hg18) as the tumour and normal read libraries.

Here is a summary of the correction procedure.

Filter out bins with 0 reads and 0 GC content
Filter out bins with reads within the top and bottom 1% quantile (assumed to be outliers)
Filter out bins with GC content within the top and bottom 1% quantile
Filter out bins with a mappability score of greater than 0.9 ('mappability' argument).
Randomly sample up to 50000 ('samplesize' argument) of the remaining high-quality bins for the purposes keeping a good runtime.
The first loess (on the reads-by-GC curve) with a small span (smoothing window) is performed, obtaining typically a highly sensitive curve (follows low density tails of distribution, but gets jagged in high density center).
A second loess (on the first loess results) with a larger span is performed, recapitulating the curve in the low density tails and smoothing out the jagged regions in the high density center.
'cor.gc' is obtained by correcting each bin for GC content. The number of observed reads is divided by the number of reads predicted by the loess curve given an observed GC proportion.
Filter out just the top 1% quantile of the cor.gc bins, then _randomly_ sample up to 50000 ('samplesize' argument) bins.
A separate lowess curve is computed for mappability-by-GC (cor.gc).
'cor.map' is obtained by correcting each bin for mappability. The cor.gc value is divided by the cor.gc value predicted by the mappability lowess curve generated in the previous step.
'copy' is obtained by setting all cor.map values <= to NA (i.e. NaN), then apply log2

Value

The original A RangedData object appended with 5 new columns:

valid: Valid bins, which have valid read, gc, and mappability values
ideal: Ideal bins of high mappability and no outliers
cor.gc: GC-corrected readcounts
cor.map: Mappability corrected GC-corrected readcounts
copy: cor.map values after log base 2

Author(s)

Daniel Lai

References

Yuval Benjamini and Terence P Speed. Summarizing and correcting the gc content bias in high-throughput sequencing. Nucleic Acids Res, 40(10):e72, May 2012.

Examples

	data(tumour) # Load tumour_reads
	tumour_copy <- correctReadcount(tumour_reads)
data(tumour) # Load tumour_reads
	tumour_copy <- correctReadcount(tumour_reads)

HMMcopy example dataset

Description

A set of data of chromosome 6 of matched tumour normal pair. The dataset that is described here belongs to a female triple-negative breast cancer patient from the dataset published in Shah2012,Ha2012. This genome library was sequenced on the ABI/Life SOLiD platform, generating hybrid lengths of 25bp and 50bp paired-end reads. The reads were aligned using BioScope https://products.appliedbiosystems.com/ where reads mapped to multiple sites were ignored.

tumour_reads: The number of short reads in fixed width windows across the chromosome, generated with wigsToRangedData from WIG files
tumour_copy: Tumour copy number profile generated by correcting ‘tumour_reads’ with correctReadcount
normal_copy: Normal copy number profile generated via correctReadcount
tumour_param: Parameters for segmenting ‘tumour_copy’ in HMMsegment
tumour_segments: Segmented output of ‘tumour_copy’ HMMsegment

using ‘tumour_param’

Usage

  data(tumour)
data(tumour)

Format

‘tumour_reads’, ‘tumour_copy’, and ‘normal_copy’ are RangedData objects.

‘tumour_param’ is a numeric matrix.

‘tumour_segments’ is a list.

References

Gavin Ha, Andrew Roth, Daniel Lai, Ali Bashashati, Jiarui Ding, Rodrigo Goya, Ryan Giuliany, Jamie Rosner, Arusha Oloumi, Karey Shumansky, Suet-Feung Chin, Gulisa Turashvili, Martin Hirst, Carlos Caldas, Marco A Marra, Samuel Aparicio, and Sohrab P Shah. Integrative analysis of genome-wide loss of heterozygosity and mono-allelic expression at nucleotide resolution reveals disrupted pathways in triple negative breast cancer. Genome Research, (advanced online publication), May 2012.

S P Shah, A Roth, R Goya, A Oloumi, G Ha, Y Zhao, G Turashvili, J Ding, K Tse, G Hafari, A Bashashati, L M Prentice, J Khattra, A Burleigh, D Yap, V Bernard, A McPherson, K Shuman- sky, A Crisan, R Giuliany, A Heravi-Moussavi, J Rosner, D Lai, I Birol, R Varhol, A Tam, N Dhalla, T Zeng, K Ma, S K Chan, M Griffth, A Moradian, S W Cheng, G B Morin, P Watson, K Gelmon, S Chia, S F Chin, C Curtis, O M Rueda, P D Pharoah, S Damaraju, J Mackey, K Hoon, T Harkins, V Tadigotla, M Sigaroudinia, P Gascard, T Tlsty, J F Costello, I M Meyer, C J Eaves, W W Wasserman, S Jones, D Huntsman, M Hirst, C Caldas, M A Marra, and S Aparicio. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature, 486(7403):395-399, Jun 2012.

Segmentation and classification of copy number profiles

Description

Takes in a copy number profile and segments it into predicted regions of equal copy number, and assigns a biologically motivated copy number state to each region using a Hidden Markov Model (HMM). This is an extension to the HMM described in Shah et al., 2006.

Usage

HMMsegment(correctOut, param = NULL, autosomes = NULL,
    maxiter = 50, getparam = FALSE, verbose = TRUE)
HMMsegment(correctOut, param = NULL, autosomes = NULL,
    maxiter = 50, getparam = FALSE, verbose = TRUE)

Arguments

`correctOut`	Output value from `correctReadcount`
`param`	If none is provided, will generate a reasonable set of parameters based on the input data, which can optionally be returned for inspection and manual adjustment by setting 'getparam' to TRUE. See Details for more information on parameters. A matrix with parameters values in columns for each state in rows: e Probability of extending a segment, increase to lengthen segments, decrase to shorten segments. Range: (0, 1) strength Strength of initial e suggestion, reducing allows e to change, increasing makes e undefiable. Range: [0, Inf) mu Suggested median for copy numbers in state, change to readjust classification of states. Range: (-Inf, Inf) lambda Suggested precision (inversed variance) for copy numbers in state, increase to reduce overlap between states. Range: [0, Inf) nu Suggested degree of freedom between states, increase to reduce overlap between states. Range: [0, Inf) kappa Suggested distribution of states. Should sum to 1. m Optimal value for mu, difference from corresponding mu value determines elasticity of the mu value. i.e. Set to identical value as mu if you don't want mu to move much. eta Mobility of mu, increase to allow more movement. Range: [0, Inf) gamma Prior shape on lambda, gamma distribution. Effects flexibility of lambda. S Prior scale on lambda, gamma distribution. Effects flexibility of lambda.
`autosomes`	Array of LOGICAL values corresponding to the 'chr' argument where an element is TRUE if the chromosome is an autosome, otherwise FALSE. If not provided, will automatically set the following chromosomes to false: "X", "Y", "23", "24", "chrX", chrY", "M", "MT", "chrM".
`maxiter`	The maximum number of iterations allows for the Maximum-Expectation algorithm, reduce to decrease running time at the expense of robustness.
`getparam`	If TRUE, generates and returns parameters without running segmentation.
`verbose`	Set to FALSE if messages are not desired

Details

HMMsegment is a two stage algorithm that first runs an Expectation-Maximization algorithm to find the optimal set of parameters based on suggested parameter inputs, and allowed flexibilities. After iteratively finding the optimal parameters, the actual segmentation of the data is conducted with the Viterbi algorithm, finally output segmented states. This is an extension to the hidden Markov model described in Shah et al., 2006.

Parameters are divided into two main categories:

Initial parameters: e, mu, lambda, nu, kappa
Flexibility parameters: strength, m, eta, gamma, S

Where initial parameters are treated as starting suggestions for the parameter optimization algorithm, and flexibility parameters (hyperparameters) define how much the initial parameters are allowed to deviate during the search for the optimal parameters.

With a good copy number dataset, in theory, given enough flexibility, the algorithm should always find a similar set of optimal parameters regardless of initial parameters (although running times will vary).

If for some reason you wish to manually set the parameters for the final segmentation process, one should tune all flexibility parameters to minimal values. For example, if you wish to increase the length of segments, you could set:

    param$e <- 0.9999999999999999
    param$strength <- 1e30

Which suggests that segments should be very long, and gives minimal to non-existant flexibility to your suggestion.

See vignette for diagrammed example:

    vignette("HMMcopy")

Value

A list object containing multiple values, although in practice only the state assigned to each copy number value in 'states' and the segments of non-overlapping states in 'segs' are of interest.

By default, there are 6 states, which in a diploid sample corresponds to the following chromosomal copies and biological state:

1: <=0 copies, homozogous deletion
2: 1 copy, heterozogous deletion
3: 2 copies, neutral
4: 3 copies, gain
5: 4 copies, amplification
6: >=5 copies, high level amplification

The full list of output is as follows:

states: The state assigned to each copy number value
segs: Non-overlapping segments and medians of each segment
mus: Optimal median of of copy numbers in state
lambda: Optimal precision of copy numbers in state
pi: Optimal state distribution
loglik: The likelihood values of each EM algorithm iteration
rho: Posterior marginals (responsibilities) for each position and state

Author(s)

Daniel Lai, Gavin Ha, Sohrab Shah

References

Sohrab P Shah, Xiang Xuan, Ron J DeLeeuw, Mehrnoush Khojasteh, Wan L Lam, Raymond Ng, and Kevin P Murphy. Integrating copy number polymorphisms into array cgh analysis using a robust hmm. Bioinformatics, 22(14):e431-9, Jul 2006.

Examples


data(tumour) # Load tumour_copy
tumour_segments <- HMMsegment(tumour_copy)

data(tumour) # Load tumour_copy
tumour_segments <- HMMsegment(tumour_copy)

Visualization functions for correctReadcount results

Description

Convinience functions for creating plots for the analysis of the readcount correction process by correctReadcount

Usage

plotBias(correctOutput, points = 10000, ...)
plotCorrection(correctOutput, chr = correctOutput$chr[1], ...)
plotSegments(correctOutput, segmentOutput, chr = correctOutput$chr[1], ...)
plotParam(segmentOutput, param, ...)
stateCols()
plotBias(correctOutput, points = 10000, ...)
plotCorrection(correctOutput, chr = correctOutput$chr[1], ...)
plotSegments(correctOutput, segmentOutput, chr = correctOutput$chr[1], ...)
plotParam(segmentOutput, param, ...)
stateCols()

Arguments

`correctOutput`	Output value from `correctReadcount`
`segmentOutput`	Output value from `HMMsegment`
`points`	Number of random sampled points to plot, decreasing reduces runtime
`chr`	Chromosome name to plot. A single value for `plotCorrection` and a vector for `plotSegments`.
`param`	Input parameters to call that produced segmentOutput
`...`	Furthur arguments are passed to `plot`.

Details

plotBias shows the effects of the correction process on GC bias and mappability bias in HTS readcounts.

plotCorrection shows the effects of the correction on the copy number profile. Defaultly plotting the entire first chromosome found in the list.

plotSegments shows the resultant segments and states assigned to each segment.

plotParam shows the initial suggested distribution of copy number in each state (dashed), and the optimal distribution of copy number in each state (solid)

stateCols returns a vector of six colours used in plotSegments and plotParam

Author(s)

Daniel Lai

Examples


data(tumour)

# Visualize one at a time
par(ask = TRUE)
plotBias(normal_copy)
plotCorrection(tumour_copy)
par(mfrow = c(1, 1))
plotSegments(tumour_copy, tumour_segments)
plotParam(tumour_segments, tumour_param)

data(tumour)

# Visualize one at a time
par(ask = TRUE)
plotBias(normal_copy)
plotCorrection(tumour_copy)
par(mfrow = c(1, 1))
plotSegments(tumour_copy, tumour_segments)
plotParam(tumour_segments, tumour_param)

WIG Import Functions

Description

Fast fixedStep WIG file reading and parsing

Usage

wigToRangedData(wigfile, verbose = TRUE)
wigToArray(wigfile, verbose = TRUE)
wigToRangedData(wigfile, verbose = TRUE)
wigToArray(wigfile, verbose = TRUE)

Arguments

`wigfile`	Filepath to fixedStep WIG format file
`verbose`	Set to FALSE to suppress messages

Details

Reads the entire file into memory, then processes the file in rapid fashion, thus performance will be limited by memory capacity.

The WIG file is expected to conform to the minimal fixedStep WIG format (see References), where each chromsome is started by a “fixedStep” declaration line. The function assumes only a single track in the WIG file, and will ignore any lines before the first line starting with “fixedStep”.

Value

RangedData for wigToRangedData with chromosome and position information, sorted in decreasing chromosal size and increasing position.

Numeric array for wigToArray sorted in decreasing chromosal size and increasing position.

Author(s)

Daniel Lai

References

WIG: http://genome.ucsc.edu/goldenPath/help/wiggle.html

Examples

  wigfile <- system.file("extdata", "tumour.wig", package = "HMMcopy")
  posAndValues <- wigToRangedData(wigfile)
  justValues <- wigToArray(wigfile)
wigfile <- system.file("extdata", "tumour.wig", package = "HMMcopy")
  posAndValues <- wigToRangedData(wigfile)
  justValues <- wigToArray(wigfile)

WIG Output Functions

Description

Fast fixedStep WIG file formatting and output

Usage

rangedDataToWig(correctOutput, file, column = "copy", sample = "R",
  verbose = TRUE)
rangedDataToSeg(correctOutput, file, column = "copy", sample = "R",
  verbose = TRUE)
rangedDataToWig(correctOutput, file, column = "copy", sample = "R",
  verbose = TRUE)
rangedDataToSeg(correctOutput, file, column = "copy", sample = "R",
  verbose = TRUE)

Arguments

`correctOutput`	`RangedData` object for output, default options expect output from `correctReadcount`.
`file`	Filepath to write output to.
`column`	Column in input object to export. Defaults to corrected copy number.
`sample`	Sample name of the exported dataset, defaults to “R”
`verbose`	Set to FALSE to suppress messages.

Details

Assumes that all ranges in data set are non-overlapping windows of fixed width covering the entire genome. Note that positions in WIG files are 1-based while those in SEG files are 0-based.

Author(s)

Daniel Lai

References

WIG: http://genome.ucsc.edu/goldenPath/help/wiggle.html
SEG: http://www.broadinstitute.org/igv/SEG

Examples

  data(tumour) # Load tumour_copy
  rangedDataToWig(tumour_copy, file = "test.wig")
  rangedDataToSeg(tumour_copy, file = "test.seg")
data(tumour) # Load tumour_copy
  rangedDataToWig(tumour_copy, file = "test.wig")
  rangedDataToSeg(tumour_copy, file = "test.seg")

Parsing and sorting of uncorrected read and sequence information files

Description

Loads WIG files for readcount, GC, and mappability data for non-overlapping windows of fixed length (i.e. bins), and returns a structure ready to used for readcount correction. See Details for specifics about file assumptions.

Usage

wigsToRangedData(readfile, gcfile, mapfile, verbose = FALSE)
wigsToRangedData(readfile, gcfile, mapfile, verbose = FALSE)

Arguments

`readfile`	Pathname to WIG file containing readcounts per bin.
`gcfile`	Pathname to WIG file containing GC content per bin.
`mapfile`	Pathname to WIG file containing average mappability per bin.
`verbose`	Set to TRUE if messages are desired

Details

The number of lines in the three input files are expected to be identical, although the order and names of chromsomes in the file need not be identical. Chromosome lengths are required to be identical and unique, and if the latter is not true, the order of the chromosomes must then be identical.

At present, these three WIG files are expected to be generated by external programs, namely those from the HMMcopy suite (see See Also), rather than by existing R packages out of space and memory considerations when working with high coverage full genome samples.

Value

A RangedData object, where each row entry represents a bin, with the three values from the input as columns named reads, gc, and map.

Author(s)

Daniel Lai

References

correctedReadcount Suite: TBA
WIG: http://genome.ucsc.edu/goldenPath/help/wiggle.html

Examples


rfile <- system.file("extdata", "tumour.wig", package = "HMMcopy")
gfile <- system.file("extdata", "gc.wig", package = "HMMcopy")
mfile <- system.file("extdata", "map.wig", package = "HMMcopy")

uncorrected_reads <- wigsToRangedData(rfile, gfile, mfile)

rfile <- system.file("extdata", "tumour.wig", package = "HMMcopy")
gfile <- system.file("extdata", "gc.wig", package = "HMMcopy")
mfile <- system.file("extdata", "map.wig", package = "HMMcopy")

uncorrected_reads <- wigsToRangedData(rfile, gfile, mfile)

Package 'HMMcopy'

Help Index

Bias-free copy number estimation and robust CNA detection in tumour samples from WGS HTS data

Description

Details

Author(s)

References

Examples

Readcount correction for GC and mappability bias

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

HMMcopy example dataset

Description

Usage

Format

References

Segmentation and classification of copy number profiles

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Visualization functions for correctReadcount results

Description

Usage

Arguments

Details

Author(s)

See Also

Examples

WIG Import Functions

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

WIG Output Functions

Description

Usage

Arguments

Details

Author(s)

References

See Also

Examples

Parsing and sorting of uncorrected read and sequence information files

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples