Title: | Subclonal copy number and LOH prediction from whole genome sequencing of tumours |
---|---|
Description: | Hidden Markov model to segment and predict regions of subclonal copy number alterations (CNA) and loss of heterozygosity (LOH), and estimate cellular prevalence of clonal clusters in tumour whole genome sequencing data. |
Authors: | Gavin Ha |
Maintainer: | Gavin Ha <[email protected]> |
License: | GPL-3 |
Version: | 1.45.0 |
Built: | 2024-11-19 06:28:33 UTC |
Source: | https://github.com/bioc/TitanCNA |
TITAN is a software tool for inferring subclonal copy number alterations (CNA) and loss of heterozygosity (LOH). The algorithm also infers clonal group cluster membership for each event and the tumour proportion, or cellular prevalence, for each event.
Package: | TitanCNA |
Type: | Package |
Version: | 1.15.0 |
Date: | 2017-05-13 |
License: | GPL-3 |
example("TitanCNA-package")
for quick tour of functionality and visualization
vignette("TitanCNA")
for detailed example
Gavin Ha, Sohrab P Shah Maintainer: Gavin Ha <[email protected]>
Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)
message('Running TITAN ...') #### LOAD DATA #### infile <- system.file("extdata", "test_alleleCounts_chr2.txt", package = "TitanCNA") data <- loadAlleleCounts(infile) #### LOAD PARAMETERS #### message('titan: Loading default parameters') numClusters <- 2 params <- loadDefaultParameters(copyNumber = 5, numberClonalClusters = numClusters, skew = 0.1) #### READ COPY NUMBER FROM HMMCOPY FILE #### message('titan: Correcting GC content and mappability biases...') tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA") normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA") gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA") map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA") cnData <- correctReadDepth(tumWig, normWig, gc, map) logR <- getPositionOverlap(data$chr, data$posn, cnData) data$logR <- log(2^logR) #transform to natural log #### FILTER DATA FOR DEPTH, MAPPABILITY, NA, etc #### data <- filterData(data, c(1:22,"X"), minDepth = 10, maxDepth = 200, map = NULL) #### EM (FWD-BACK) TO TRAIN PARAMETERS #### #### Can use parallelization packages #### K <- length(params$genotypeParams$alphaKHyper) params$genotypeParams$alphaKHyper <- rep(500, K) params$ploidyParams$phi_0 <- 1.5 convergeParams <- runEMclonalCN(data, params, maxiter = 3, maxiterUpdate = 500, txnExpLen = 1e9, txnZstrength = 1e9, useOutlierState = FALSE, normalEstimateMethod = "map", estimateS = TRUE, estimatePloidy = TRUE) #### COMPUTE OPTIMAL STATE PATH USING VITERBI #### optimalPath <- viterbiClonalCN(data, convergeParams) #### FORMAT RESULTS #### results <- outputTitanResults(data, convergeParams, optimalPath, filename = NULL, posteriorProbs = FALSE, subcloneProfiles = TRUE, is.haplotypeData = FALSE, correctResults = TRUE, proportionThreshold = 0.05, proportionThresholdClonal = 0.05) convergeParams <- results$convergeParams results <- results$corrResults #### GET SEGMENT RESULTS #### segs <- outputTitanSegments(results, id = "test", convergeParams, filename = NULL, igvfilename = NULL) #### PLOT RESULTS #### norm <- tail(convergeParams$n, 1) ploidy <- tail(convergeParams$phi, 1) par(mfrow=c(4, 1)) plotCNlogRByChr(results, chr = 2, segs = segs, ploidy = ploidy, normal = norm, geneAnnot = NULL, ylim = c(-2, 2), cex = 0.5, xlab = "", main = "Chr 2") plotAllelicRatio(results, chr = 2, geneAnnot = NULL, ylim = c(0, 1), cex = 0.5, xlab = "", main = "Chr 2") plotClonalFrequency(results, chr = 2, normal = norm, geneAnnot = NULL, ylim = c(0, 1), cex = 0.5, xlab = "", main = "Chr 2") plotSubcloneProfiles(results, chr = 2, cex = 2, main = "Chr 2") plotSegmentMedians(segs, chr=2, resultType = "LogRatio", plotType = "CopyNumber", plot.new = TRUE, ylim = c(0, 4), main = "Chr 2")
message('Running TITAN ...') #### LOAD DATA #### infile <- system.file("extdata", "test_alleleCounts_chr2.txt", package = "TitanCNA") data <- loadAlleleCounts(infile) #### LOAD PARAMETERS #### message('titan: Loading default parameters') numClusters <- 2 params <- loadDefaultParameters(copyNumber = 5, numberClonalClusters = numClusters, skew = 0.1) #### READ COPY NUMBER FROM HMMCOPY FILE #### message('titan: Correcting GC content and mappability biases...') tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA") normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA") gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA") map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA") cnData <- correctReadDepth(tumWig, normWig, gc, map) logR <- getPositionOverlap(data$chr, data$posn, cnData) data$logR <- log(2^logR) #transform to natural log #### FILTER DATA FOR DEPTH, MAPPABILITY, NA, etc #### data <- filterData(data, c(1:22,"X"), minDepth = 10, maxDepth = 200, map = NULL) #### EM (FWD-BACK) TO TRAIN PARAMETERS #### #### Can use parallelization packages #### K <- length(params$genotypeParams$alphaKHyper) params$genotypeParams$alphaKHyper <- rep(500, K) params$ploidyParams$phi_0 <- 1.5 convergeParams <- runEMclonalCN(data, params, maxiter = 3, maxiterUpdate = 500, txnExpLen = 1e9, txnZstrength = 1e9, useOutlierState = FALSE, normalEstimateMethod = "map", estimateS = TRUE, estimatePloidy = TRUE) #### COMPUTE OPTIMAL STATE PATH USING VITERBI #### optimalPath <- viterbiClonalCN(data, convergeParams) #### FORMAT RESULTS #### results <- outputTitanResults(data, convergeParams, optimalPath, filename = NULL, posteriorProbs = FALSE, subcloneProfiles = TRUE, is.haplotypeData = FALSE, correctResults = TRUE, proportionThreshold = 0.05, proportionThresholdClonal = 0.05) convergeParams <- results$convergeParams results <- results$corrResults #### GET SEGMENT RESULTS #### segs <- outputTitanSegments(results, id = "test", convergeParams, filename = NULL, igvfilename = NULL) #### PLOT RESULTS #### norm <- tail(convergeParams$n, 1) ploidy <- tail(convergeParams$phi, 1) par(mfrow=c(4, 1)) plotCNlogRByChr(results, chr = 2, segs = segs, ploidy = ploidy, normal = norm, geneAnnot = NULL, ylim = c(-2, 2), cex = 0.5, xlab = "", main = "Chr 2") plotAllelicRatio(results, chr = 2, geneAnnot = NULL, ylim = c(0, 1), cex = 0.5, xlab = "", main = "Chr 2") plotClonalFrequency(results, chr = 2, normal = norm, geneAnnot = NULL, ylim = c(0, 1), cex = 0.5, xlab = "", main = "Chr 2") plotSubcloneProfiles(results, chr = 2, cex = 2, main = "Chr 2") plotSegmentMedians(segs, chr=2, resultType = "LogRatio", plotType = "CopyNumber", plot.new = TRUE, ylim = c(0, 4), main = "Chr 2")
Compute the S_Dbw Validity Index internal cluster validation from the TitanCNA results to use for model selection.
computeSDbwIndex(x, centroid.method = "median", data.type = "LogRatio", use.corrected.cn =TRUE, S_Dbw.method = "Halkidi", symmetric = TRUE)
computeSDbwIndex(x, centroid.method = "median", data.type = "LogRatio", use.corrected.cn =TRUE, S_Dbw.method = "Halkidi", symmetric = TRUE)
x |
Formatted TitanCNA results output from |
centroid.method |
|
data.type |
Compute S_Dbw validity index based on copy number (use ‘ |
symmetric |
|
S_Dbw.method |
Compute S_Dbw validity index using |
use.corrected.cn |
|
S_Dbw Validity Index is an internal clustering evaluation that is used for model selection (Halkidi et al. 2002). It attempts to choose the model that minimizes within cluster variances (scat) and maximizes density-based cluster separation (Dens). Then, S_Dbw(|c_T|x z)=Dens(|c_T|x z)+scat(|c_T|x z).
In the context of TitanCNA, if data.type
=‘LogRatio
’, then the S_Dbw internal data consists of copy number log ratios, and the resulting joint states of copy number (c_T, forall c_T in {0 : max.copy.number}) and clonal cluster (z) make up the clusters in the internal evaluation. If data.type
=‘AllelicRatio
’, then the S_Dbw internal data consists of the allelic ratios. The optimal TitanCNA run is chosen as the run with the minimum S_Dbw. If data.type
=‘Both
’, then the sum of the S_Dbw for ‘LogRatio
’ and ‘AllelicRatio
’ are added together. This helps account for both data types when choosing the optimal solution.
Note that for S_Dbw.method
, the Tong
method has an incorrect formulation of the scat(c)
function. The function should be a weighted sum, but that is not the formulation shown in the publication. computeSDbwIndex
uses (ni/N)
instead of (N-ni)/N
in the scat
formula, where ni
is the number of datapoints in cluster i
and N
is the total number of datapoints.
list
with components:
dens.bw |
density component of S_Dbw index |
scat |
scatter component of S_Dbw index |
S_DbwIndex |
Sum of dens.bw and scat. |
Gavin Ha <[email protected]>
Halkidi, M., Batistakis, Y., and Vazirgiannis, M. (2002). Clustering validity checking methods: part ii. SIGMOD Rec., 31(3):19–27.
Tong, J. and Tan, H. Clustering validity based on the improved S_Dbw* index. (2009). Journal of Electronics (China), Volume 26, Issue 2, pp 258-264.
Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)
outputModelParameters
, loadAlleleCounts
data(EMresults) #### COMPUTE OPTIMAL STATE PATH USING VITERBI #### #options(cores=1) optimalPath <- viterbiClonalCN(data, convergeParams) #### FORMAT RESULTS #### results <- outputTitanResults(data, convergeParams, optimalPath, filename = NULL, posteriorProbs = FALSE, correctResults = TRUE, proportionThreshold = 0.05, proportionThresholdClonal = 0.05) results <- results$corrResults ## use corrected results #### COMPUTE S_Dbw Validity Index FOR MODEL SELECTION #### s_dbw <- computeSDbwIndex(results, data.type = "LogRatio", centroid.method = "median", S_Dbw.method = "Tong")
data(EMresults) #### COMPUTE OPTIMAL STATE PATH USING VITERBI #### #options(cores=1) optimalPath <- viterbiClonalCN(data, convergeParams) #### FORMAT RESULTS #### results <- outputTitanResults(data, convergeParams, optimalPath, filename = NULL, posteriorProbs = FALSE, correctResults = TRUE, proportionThreshold = 0.05, proportionThresholdClonal = 0.05) results <- results$corrResults ## use corrected results #### COMPUTE S_Dbw Validity Index FOR MODEL SELECTION #### s_dbw <- computeSDbwIndex(results, data.type = "LogRatio", centroid.method = "median", S_Dbw.method = "Tong")
TitanCNA uses a finite state space that defines a maximum number of copies to model. High-level amplifications that exceed this defined maximum need to be corrected and reported as the likely copy number based on the observed data. correctIntegerCN
performs two tasks: (1) correct log ratio based on purity and ploidy, and then convert to decimal CN value; (2) Correct bins (from cn
) and segments (from segs
) in which the original predicted integer copy number was assigned the maximum CN state; bins and segments for all of chromosome X are also corrected, if provided in the input.
correctIntegerCN(cn, segs, purity, ploidy, maxCNtoCorrect.autosomes = NULL, maxCNtoCorrect.X = NULL, correctHOMD = TRUE, minPurityToCorrect = 0.2, gender = "male", chrs = c(1:22, "X"))
correctIntegerCN(cn, segs, purity, ploidy, maxCNtoCorrect.autosomes = NULL, maxCNtoCorrect.X = NULL, correctHOMD = TRUE, minPurityToCorrect = 0.2, gender = "male", chrs = c(1:22, "X"))
cn |
data.table object output from the function outputTitanResults |
segs |
data.table object output from the function outputTitanSegments |
purity |
Float type of the 1 minus the normal contamination estimate from TitanCNA |
ploidy |
Float type of the average tumor ploidy estimate from TitanCNA |
maxCNtoCorrect.autosomes |
Bins and segments in autosomes with this copy number value or higher will be corrected. If |
maxCNtoCorrect.X |
Bins and segments in chromosome X, if provided, with this copy number value or higher will be corrected. If |
minPurityToCorrect |
If |
correctHOMD |
If |
gender |
data.frame containing list of centromere regions. This should contain 3 columns: chr, start, and end. If this argument is used, then data at and flanking the centromeres will be removed. |
chrs |
Chromosomes to consider for copy number correction. |
cn
: data.table
object that contains the same columns as the input object but also includes new columns logR_Copy_Number
, Corrected_Copy_Number
, Corrected_Call
.
segs
: data.table
object that contains the same columns as the input object but also includes new columns logR_Copy_Number
, Corrected_Copy_Number
, Corrected_Call
, Corrected_MajorCN
, Corrected_MinorCN
.
Column definitions:
logR_Copy_Number |
Purity and ploidy corrected log ratios that have been converted to a decimal-based copy number value. |
Corrected_Copy_Number |
|
Corrected_Call |
String representation of |
Corrected_MajorCN |
Purity and ploidy corrected integer (rounded) major copy number value. |
Corrected_MinorCN |
Purity and ploidy corrected integer (rounded) minor copy number value. |
Gavin Ha <[email protected]>
Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)
outputTitanResults
, outputTitanSegments
data(EMresults) #### COMPUTE OPTIMAL STATE PATH USING VITERBI #### optimalPath <- viterbiClonalCN(data, convergeParams) #### FORMAT RESULTS #### results <- outputTitanResults(data, convergeParams, optimalPath, filename = NULL, posteriorProbs = FALSE, subcloneProfiles = TRUE, correctResults = TRUE, proportionThreshold = 0.05, recomputeLogLik = FALSE, proportionThresholdClonal = 0.05, is.haplotypeData = FALSE) ## use corrected parameters convergeParams <- results$convergeParam ## use corrected results results <- results$corrResults ## get normal contamination and ploidy estimates norm <- tail(convergeParams$n,1) ploidy <- tail(convergeParams$phi,1) #### OUTPUT SEGMENTS #### segs <- outputTitanSegments(results, id = "test", convergeParams, filename = NULL, igvfilename = NULL) corrIntCN.results <- correctIntegerCN(results, segs, 1 - norm, ploidy, maxCNtoCorrect.autosomes = NULL, maxCNtoCorrect.X = NULL, correctHOMD = TRUE, minPurityToCorrect = 0.2, gender = "female", chrs = 2)
data(EMresults) #### COMPUTE OPTIMAL STATE PATH USING VITERBI #### optimalPath <- viterbiClonalCN(data, convergeParams) #### FORMAT RESULTS #### results <- outputTitanResults(data, convergeParams, optimalPath, filename = NULL, posteriorProbs = FALSE, subcloneProfiles = TRUE, correctResults = TRUE, proportionThreshold = 0.05, recomputeLogLik = FALSE, proportionThresholdClonal = 0.05, is.haplotypeData = FALSE) ## use corrected parameters convergeParams <- results$convergeParam ## use corrected results results <- results$corrResults ## get normal contamination and ploidy estimates norm <- tail(convergeParams$n,1) ploidy <- tail(convergeParams$phi,1) #### OUTPUT SEGMENTS #### segs <- outputTitanSegments(results, id = "test", convergeParams, filename = NULL, igvfilename = NULL) corrIntCN.results <- correctIntegerCN(results, segs, 1 - norm, ploidy, maxCNtoCorrect.autosomes = NULL, maxCNtoCorrect.X = NULL, correctHOMD = TRUE, minPurityToCorrect = 0.2, gender = "female", chrs = 2)
Correct GC content and mappability biases in tumour sequence read counts using Loess curve fitting. Wrapper for function in HMMcopy.
correctReadDepth(tumWig, normWig, gcWig, mapWig, genomeStyle = "NCBI", targetedSequence = NULL)
correctReadDepth(tumWig, normWig, gcWig, mapWig, genomeStyle = "NCBI", targetedSequence = NULL)
tumWig |
File path to fixedStep WIG format file for the tumour sample. See |
normWig |
File path to fixedStep WIG format file for the normal sample. |
gcWig |
File path to fixedStep WIG format file for the GC content based on the specific reference genome sequence used. |
mapWig |
File path to fixedStep WIG format file for the mappability scores computed on the specific reference genome used. |
genomeStyle |
The genome style to use for chromosomes by TitanCNA. Use one of ‘NCBI’ or ‘UCSC’. It does not matter what style is found in |
targetedSequence |
data.frame with 3 columns: chr, start position, stop position. Use this argument for exome capture sequencing or targeted deep sequencing data. This is experimental and may not work as desired. |
Wrapper for correctReadcount
in HMMcopy package. It uses a sampling of 50000 bins to find the Loess fit. Then, the log ratio for every bin is returned as the log base 2 of the ratio between the corrected tumour read count and the corrected normal read count.
data.frame
containing columns:
chr |
Chromosome; uses 'X' and 'Y' for sex chromosomes |
start |
Start genomic coordinate for bin in which read count is corrected |
end |
End genomic coordinate for bin in which read count is corrected |
logR |
Log ratio, log2(tumour:normal), for bin in which read count is corrected |
Gavin Ha <[email protected]>, Daniel Lai <[email protected]>, Yikan Wang <[email protected]>
Ha, G., Roth, A., Lai, D., Bashashati, A., Ding, J., Goya, R., Giuliany, R., Rosner, J., Oloumi, A., Shumansky, K., Chin, S.F., Turashvili, G., Hirst, M., Caldas, C., Marra, M. A., Aparicio, S., and Shah, S. P. (2012). Integrative analysis of genome wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple negative breast cancer. Genome Research, 22(10):1995,2007. (PMID: 22637570)
Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)
correctReadcount
and wigToRangedData
in the HMMcopy package. WIG: http://genome.ucsc.edu/goldenPath/help/wiggle.html
tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA") normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA") gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA") map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA") #### GC AND MAPPABILITY CORRECTION #### cnData <- correctReadDepth(tumWig, normWig, gc, map)
tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA") normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA") gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA") map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA") #### GC AND MAPPABILITY CORRECTION #### cnData <- correctReadDepth(tumWig, normWig, gc, map)
Filters all vectors in list based on specified chromosome(s) of interest, minimum and maximum read depths, missing data, mappability score threshold
filterData(data ,chrs = NULL, minDepth = 10, maxDepth = 200, positionList = NULL, map = NULL, mapThres = 0.9, centromeres = NULL, centromere.flankLength = 0)
filterData(data ,chrs = NULL, minDepth = 10, maxDepth = 200, positionList = NULL, map = NULL, mapThres = 0.9, centromeres = NULL, centromere.flankLength = 0)
data |
data.table object that contains an arbitrary number of components. Should include ‘chr’, ‘tumDepth’. All vector elements must have the same number of rows where each row corresponds to information pertaining to a chromosomal position. |
chrs |
|
minDepth |
|
maxDepth |
|
positionList |
|
map |
|
mapThres |
|
centromeres |
data.frame containing list of centromere regions. This should contain 3 columns: chr, start, and end. If this argument is used, then data at and flanking the centromeres will be removed. |
centromere.flankLength |
Integer indicating the length (in base pairs) to the left and to the right of the centromere designated for removal of data. |
All vectors in the input data.table object, and map
, must all have the same number of rows.
The same data.table
object containing filtered components.
Gavin Ha <[email protected]>
Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)
infile <- system.file("extdata", "test_alleleCounts_chr2.txt", package = "TitanCNA") tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA") normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA") gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA") map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA") #### LOAD DATA #### data <- loadAlleleCounts(infile, genomeStyle = "NCBI") #### GC AND MAPPABILITY CORRECTION #### cnData <- correctReadDepth(tumWig, normWig, gc, map) #### READ COPY NUMBER FROM HMMCOPY FILE #### logR <- getPositionOverlap(data$chr, data$posn, cnData) data$logR <- log(2^logR) #use natural logs #### FILTER DATA FOR DEPTH, MAPPABILITY, NA, etc #### filtereData <- filterData(data, as.character(1:24), minDepth = 10, maxDepth = 200, map = NULL, mapThres=0.9)
infile <- system.file("extdata", "test_alleleCounts_chr2.txt", package = "TitanCNA") tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA") normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA") gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA") map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA") #### LOAD DATA #### data <- loadAlleleCounts(infile, genomeStyle = "NCBI") #### GC AND MAPPABILITY CORRECTION #### cnData <- correctReadDepth(tumWig, normWig, gc, map) #### READ COPY NUMBER FROM HMMCOPY FILE #### logR <- getPositionOverlap(data$chr, data$posn, cnData) data$logR <- log(2^logR) #use natural logs #### FILTER DATA FOR DEPTH, MAPPABILITY, NA, etc #### filtereData <- filterData(data, as.character(1:24), minDepth = 10, maxDepth = 200, map = NULL, mapThres=0.9)
Given a list of chromosomes and positions, uses a C-based function that searches a list of segments to find the overlapping segment. Then, takes the value (4th column in segment data.frame) of the overlapping segment and assigns to the given chromosome and position.
getPositionOverlap(chr, posn, dataVal)
getPositionOverlap(chr, posn, dataVal)
chr |
|
posn |
|
dataVal |
|
Numeric array
of values from the 4th column of data.frame cnData
. Each element corresponds to a genomic location from chr
and posn
that overlapped the segment in cnData
.
Gavin Ha <[email protected]>
Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)
loadAlleleCounts
, correctReadDepth
infile <- system.file("extdata", "test_alleleCounts_chr2.txt", package = "TitanCNA") tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA") normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA") gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA") map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA") #### LOAD DATA #### data <- loadAlleleCounts(infile) #### GC AND MAPPABILITY CORRECTION #### cnData <- correctReadDepth(tumWig, normWig, gc, map) #### READ COPY NUMBER FROM HMMCOPY FILE #### logR <- getPositionOverlap(data$chr, data$posn, cnData)
infile <- system.file("extdata", "test_alleleCounts_chr2.txt", package = "TitanCNA") tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA") normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA") gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA") map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA") #### LOAD DATA #### data <- loadAlleleCounts(infile) #### GC AND MAPPABILITY CORRECTION #### cnData <- correctReadDepth(tumWig, normWig, gc, map) #### READ COPY NUMBER FROM HMMCOPY FILE #### logR <- getPositionOverlap(data$chr, data$posn, cnData)
loadHaplotypeAlleleCounts
).
Function to load phased heterozygous sites from a VCF file (getHaplotypesFromVCF
)
Function to load in the allele counts from tumour sequencing data from a delimited text file or data.frame object.
loadHaplotypeAlleleCounts(inCounts, cnfile, fun = "sum", haplotypeBinSize = 1e5, minSNPsInBin = 3, chrs = c(1:22, "X"), minNormQual = 200, genomeStyle = "NCBI", sep = "\t", header = TRUE, seqinfo = NULL, mapWig = NULL, mapThres = 0.9, centromere = NULL, minDepth = 10, maxDepth = 1000) getHaplotypesFromVCF(vcfFile, chrs = c(1:22, "X"), build = "hg19", genomeStyle = "NCBI", filterFlags = c("PASS", "10X_RESCUED_MOLECULE_HIGH_DIVERSITY"), minQUAL = 100, minDepth = 10, minVAF = 0.25, altCountField = "AD", keepGenotypes = c("1|0", "0|1", "0/1"), snpDB = NULL) loadBXcountsFromBEDDir(bxDir, chrs = c(1:22, "X", "Y"), minReads = 2)
loadHaplotypeAlleleCounts(inCounts, cnfile, fun = "sum", haplotypeBinSize = 1e5, minSNPsInBin = 3, chrs = c(1:22, "X"), minNormQual = 200, genomeStyle = "NCBI", sep = "\t", header = TRUE, seqinfo = NULL, mapWig = NULL, mapThres = 0.9, centromere = NULL, minDepth = 10, maxDepth = 1000) getHaplotypesFromVCF(vcfFile, chrs = c(1:22, "X"), build = "hg19", genomeStyle = "NCBI", filterFlags = c("PASS", "10X_RESCUED_MOLECULE_HIGH_DIVERSITY"), minQUAL = 100, minDepth = 10, minVAF = 0.25, altCountField = "AD", keepGenotypes = c("1|0", "0|1", "0/1"), snpDB = NULL) loadBXcountsFromBEDDir(bxDir, chrs = c(1:22, "X", "Y"), minReads = 2)
inCounts |
Path to text file or data.frame containing tumour allele count data. |
cnfile |
Path to file containing GC-bias and maappability corrected molecule coverage for given bin size. |
vcfFile |
Path to phased variant VCF file from LongRanger 2.1. The file name must have the suffix |
bxDir |
Path to directory containing tumor bed files for each chromosome containing BX tags. |
fun |
The function (‘SNP’, ‘sum’, ‘mean’) to use to summarize within each user defined bin using |
haplotypeBinSize |
Bin size used to summarize SNPs based on phased haplotypes. See |
minSNPsInBin |
The minimum number of SNPs required in each |
chrs |
Vector containing list of chromosomes to include in output. |
minNormQual |
Quality threshold to use for filtering; SNPs with lower than this value are excluded. This quality is any metric that provides the confidence of the locus being a true germline heterozygous SNP. |
minReads |
Minimum number of reads per barcode. |
genomeStyle |
The genome style to use for chromosomes. Use one of ‘NCBI’ or ‘UCSC’. It does not matter what style is found in |
build |
Human genome reference build. Default: hg19. |
snpDB |
Path to SNP VCF file to use for specifying sites to retain. |
minQUAL |
Variants with quality (QUAL field) greater or equal to this value will be retained. |
minDepth |
Variants with read depth greater than or equal to this value will be retained. |
maxDepth |
Variants with read depth lower than or equal to this value will be retained. |
minVAF |
Variants with a variant/reference allele fraction of greater than or equal to this value will be retained. |
altCountField |
Specify the alternate count field name. Defaulat is "AD". |
keepGenotypes |
Genotypes to retain. Default is to keep these genotypes strings: 1|0, 0|1, 0/1 |
filterFlags |
Specify the FILTER flags to retain. |
sep |
Character indicating the delimiter used for the columns for |
header |
|
seqinfo |
|
mapWig |
Mappability score WIG file for binned data. |
mapThres |
Minimum mappability score of region/sequence overlapping variants to retain. |
centromere |
File containing reference genome gap file representing centromere locations. Usually obtained from UCSC. |
loadHaplotypeAlleleCounts
returns a data.table containing components for
chr |
Chromosome; character, |
posn |
Position; integer |
phaseSet |
Phase block identifier, numeric or character |
refOriginal |
Reference allele read count at SNP; numeric |
tumDepthOriginal |
Coverage at SNP; numeric |
ref |
Phased allele count values of higher coverage haplotype based on approach used (SNP, sum, mean); numeric |
nonRef |
Phased allele count values of lower coverage haplotype; tumDepth minus ref; numeric |
tumDepth |
Mean or sum of SNP read coverage; numeric |
HapltypeRatio |
Sum of read coverage of phased alleles of higher coverage haplotype normalized by |
haplotypeCount |
Phased allele read count; numeric |
getHaplotypesFromVCF
returns a list containing 2 components
vcf.filtered |
VCF object containing the list of heterozygous variants after filtering. |
geno.gr |
GRanges object containing the genotype information of the VCF |
Gavin Ha <[email protected]>
Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)
loadDefaultParameters
, plotHaplotypeFraction
## Not run: infile <- "test_alleleCounts_chr2_with_phaseInfo.txt" haplotypeBinSize <- 1e5 phaseSummarizeFun <- "sum" ## will load seqinfo_hg19 provided by TitanCNA package data <- loadHaplotypeAlleleCounts(infile, fun = phaseSummarizeFun, haplotypeBinSize = haplotypeBinSize, minSNPsInBin = 3, chrs = c(1:22, "X"), minNormQual = 200, genomeStyle = "NCBI", seqinfo = NULL) ## End(Not run) ## Not run: vcfFile <- "test.vcf" hap <- getHaplotypesFromVCF(vcfFile, chrs = c(1:22,"X"), build = "hg19", filterFlags = c("PASS", "10X_RESCUED_MOLECULE_HIGH_DIVERSITY"), minQUAL = 100, minDepth = 10, minVAF = 0.25, keepGenotypes = ("1|0", "0|1", "0/1")) ## End(Not run)
## Not run: infile <- "test_alleleCounts_chr2_with_phaseInfo.txt" haplotypeBinSize <- 1e5 phaseSummarizeFun <- "sum" ## will load seqinfo_hg19 provided by TitanCNA package data <- loadHaplotypeAlleleCounts(infile, fun = phaseSummarizeFun, haplotypeBinSize = haplotypeBinSize, minSNPsInBin = 3, chrs = c(1:22, "X"), minNormQual = 200, genomeStyle = "NCBI", seqinfo = NULL) ## End(Not run) ## Not run: vcfFile <- "test.vcf" hap <- getHaplotypesFromVCF(vcfFile, chrs = c(1:22,"X"), build = "hg19", filterFlags = c("PASS", "10X_RESCUED_MOLECULE_HIGH_DIVERSITY"), minQUAL = 100, minDepth = 10, minVAF = 0.25, keepGenotypes = ("1|0", "0|1", "0/1")) ## End(Not run)
Function to load in the allele counts from tumour sequencing data from a delimited text file or data.frame object.
loadAlleleCounts(inCounts, symmetric = TRUE, genomeStyle = "NCBI", sep = "\t", header = TRUE) setGenomeStyle(x, genomeStyle = "NCBI", species = "Homo_sapiens", filterExtraChr = TRUE)
loadAlleleCounts(inCounts, symmetric = TRUE, genomeStyle = "NCBI", sep = "\t", header = TRUE) setGenomeStyle(x, genomeStyle = "NCBI", species = "Homo_sapiens", filterExtraChr = TRUE)
inCounts |
Full file path to text file or data.frame containing tumour allele count data. |
symmetric |
|
genomeStyle |
The genome style to use for chromosomes by TitanCNA. Use one of ‘NCBI’ or ‘UCSC’. It does not matter what style is found in |
sep |
Character indicating the delimiter used for the columns for |
header |
|
x |
|
species |
|
filterExtraChr |
|
loadAlleleCounts
returns a data.table containing components for
chr |
Chromosome; character, NCBI or UCSC genome style format |
posn |
Position; integer |
ref |
Reference counts; numeric |
nonRef |
Non-reference counts; numeric |
tumDepth |
Tumour depth; numeric |
Gavin Ha <[email protected]>
Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)
infile <- system.file("extdata", "test_alleleCounts_chr2.txt", package = "TitanCNA") #### LOAD DATA FROM TEXT FILE #### data <- loadAlleleCounts(infile, symmetric = TRUE, genomeStyle = "NCBI", header = TRUE) ## use the UCSC chromosome naming convention instead ## data$chr <- setGenomeStyle(data$chr, genomeStyle = "UCSC") ## Not run: data <- loadAlleleCounts(countsDF, symmetric = TRUE, genomeStyle = "NCBI") ## End(Not run)
infile <- system.file("extdata", "test_alleleCounts_chr2.txt", package = "TitanCNA") #### LOAD DATA FROM TEXT FILE #### data <- loadAlleleCounts(infile, symmetric = TRUE, genomeStyle = "NCBI", header = TRUE) ## use the UCSC chromosome naming convention instead ## data$chr <- setGenomeStyle(data$chr, genomeStyle = "UCSC") ## Not run: data <- loadAlleleCounts(countsDF, symmetric = TRUE, genomeStyle = "NCBI") ## End(Not run)
Load TITAN model parameters based on maximum copy number and number of clonal clusters.
loadDefaultParameters(copyNumber = 5, numberClonalClusters = 1, skew = 0, hetBaselineSkew = NULL, alleleEmissionModel = "binomial", symmetric = TRUE, data = NULL)
loadDefaultParameters(copyNumber = 5, numberClonalClusters = 1, skew = 0, hetBaselineSkew = NULL, alleleEmissionModel = "binomial", symmetric = TRUE, data = NULL)
copyNumber |
Maximum number of absolute copies to account for in the model. Default (and recommended) is 5. |
numberClonalClusters |
Number of clonal clusters to use in the analysis. Each cluster represents a potential clone. Using ‘1’ treats the sample as being clonal (no subclonality). ‘2’ or higher treats the tumour data as being subclonal. |
skew |
|
hetBaselineSkew |
Allelic reference base skew for heterozygous states (e.g. 1:1, 2:2, 3:3). Value is additive to baseline allelic ratios (which may already be adjusted by |
alleleEmissionModel |
Specify the emission model to use for the allelic input data. |
symmetric |
|
data |
|
Generally, TitanCNA should be run once for each number of clonal clusters in the range of 1 to 5. Then, use model selection to choose the run with the optimal number of clusters.
If the allelic ratio data is skewed towards one allele, then use skew
to help define the baseline. For example, if the data is skewed towards the reference, then use 0.1 so that the heterozygous baseline is at 0.6. The allelic ratio baseline is normally at 0.5.
sParams
, which represents the parameters for estimation of subclonality, always contains values for one cluster that represents the clonally dominant cluster (events present in nearly all tumour cells) with an initial value of sParams$s_0[1] = 0.001
.
Setting symmetric
to TRUE
will treat reference and non-reference alleles the same. For example, genotypes AA
(homozygous for reference allele) and BB
(homozygous for non-reference allele) as being equivalent. This will reduce the state space substantially.
list
containing 4 sets of parameters, each as a component:
genotypeParams |
Parameters for copy number and allelic ratios geneotype states |
normalParams |
Parameters for normal contamination |
ploidyParams |
Parameters for average tumour ploidy |
sParams |
Parameters for modeling subclonality: clonal clusters and cellular prevalence |
Gavin Ha <[email protected]>
Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)
#### LOAD PARAMETERS #### numClusters <- 2 params <- loadDefaultParameters(copyNumber = 5, numberClonalClusters = numClusters)
#### LOAD PARAMETERS #### numClusters <- 2 params <- loadDefaultParameters(copyNumber = 5, numberClonalClusters = numClusters)
Function to format TitanCNA results in to a data.frame and output the results to a tab-delimited file.
outputTitanResults(data, convergeParams, optimalPath, filename = NULL, is.haplotypeData = FALSE, posteriorProbs = FALSE, subcloneProfiles = TRUE, correctResults = TRUE, proportionThreshold = 0.05, proportionThresholdClonal = 0.05, recomputeLogLik = TRUE, rerunViterbi = FALSE, verbose = TRUE) outputModelParameters(convergeParams, results, filename, S_Dbw.scale = 1, S_Dbw.method = "Tong", S_Dbw.useCorrectedCN = TRUE) outputTitanSegments(results, id, convergeParams, filename = NULL, igvfilename = NULL)
outputTitanResults(data, convergeParams, optimalPath, filename = NULL, is.haplotypeData = FALSE, posteriorProbs = FALSE, subcloneProfiles = TRUE, correctResults = TRUE, proportionThreshold = 0.05, proportionThresholdClonal = 0.05, recomputeLogLik = TRUE, rerunViterbi = FALSE, verbose = TRUE) outputModelParameters(convergeParams, results, filename, S_Dbw.scale = 1, S_Dbw.method = "Tong", S_Dbw.useCorrectedCN = TRUE) outputTitanSegments(results, id, convergeParams, filename = NULL, igvfilename = NULL)
id |
Character string identifier for sample |
data |
|
convergeParams |
|
optimalPath |
|
results |
Formatted TitanCNA results output from |
filename |
Path of the file to write the TitanCNA results. |
igvfilename |
Path of the file to write the IGV seg file. |
posteriorProbs |
|
is.haplotypeData |
|
subcloneProfiles |
|
correctResults |
|
recomputeLogLik |
|
rerunViterbi |
|
proportionThreshold |
Minimum proportion of the genome altered (by SNPs) for a cluster to be retained. Clonal clusters having lower proportion of alteration are removed. |
proportionThresholdClonal |
Minimum proportion of genome altered by clonal events (by SNPs) for the highest cellular prevalence cluster. If the highest prevalence cluster contains lower proportion of events than this threshold, this cluster will be removed and the next highest (subclonal) cluster will be readjusted to be the clonal cluster. |
S_Dbw.scale |
The S_Dbw validity index can be adjusted to account for differences between datasets. |
S_Dbw.method |
Compute S_Dbw validity index using |
S_Dbw.useCorrectedCN |
|
verbose |
Print status messages. |
outputModelParameters
outputs to a file with the estimated TITAN model parameters and model selection index. Each row contains information regarding different parameters:
1) Normal contamination estimate - proportion of normal content in the sample; tumour content is 1 minus this number
2) Average tumour ploidy estimate - average number of estimated copies in the genome; 2 represents diploid
3) Clonal cluster cellular prevalence - Z denotes the number of clonal clusters; each value (space-delimited) following are the cellular prevalence estimates for each cluster. Cellular prevalence here is defined as the proportion of tumour sample that does contain the aberrant genotype.
4) Genotype binomial means for clonal cluster Z - set of 21 binomial estimated parameters for each specified cluster
5) Genotype Gaussian means for clonal cluster Z - set of 21 Gaussian estimated means for each specified cluster
6) Genotype Gaussian variance - set of 21 Gaussian estimated variances; variances are shared for across all clusters
7) Number of iterations - number of EM iterations needed for convergence
8) Log likelihood - complete data log-likelihood for current cluster run
9) S_Dbw dens.bw - density component of S_Dbw index; see computeSDbwIndex
10) S_Dbw scat - scatter component of S_Dbw index; see computeSDbwIndex
11) S_Dbw validity index - used for model selection where the run with optimal number of clusters based on lowest S_Dbw index. This value is slightly modified from that computed from computeSDbwIndex
. It is computed as S_Dbw= S_Dbw.scale * dens.bw + scat
12) S_Dbw dens.bw, scat, validity index is computed for LogRatio
and AllelicRatio
datatypes, as well as the combination of Both
. For Both
, the values are summed for both datatypes.
outputTitanResults
outputs a file that has the similar format described in ‘Value’ section.
outputTitanResults
also returns a list containing the following:
results |
TITAN results, uncorrected for cluster number and parameters |
corrResults |
TITAN results, corrected by removing empty clusters and parameters adjusted accordingly. |
convergeParams |
Corrected parameter object |
The results
and corrResults
are data.table objects, where each row corresponds to a position in the analysis, and with the following columns:
Chr |
character denoting chromosome number. ChrX and ChrY uses ‘X’ and ‘Y’. |
Position |
genomic coordinate |
RefCount |
number of reads matching the reference base |
NRefCount |
number of reads matching the non-reference base |
Depth |
total read depth at the position |
AllelicRatio |
RefCount/Depth |
LogRatio |
log2 ratio between normalized tumour and normal read depths |
CopyNumber |
predicted TitanCNA copy number |
TITANstate |
internal state number used by TitanCNA; see Reference |
TITANcall |
interpretable TitanCNA state; string (HOMD,DLOH,HET,NLOH,ALOH,ASCNA,BCNA,UBCNA); See Reference |
ClonalCluster |
predicted TitanCNA clonal cluster; lower cluster numbers represent clusters with higher cellular prevalence |
CellularPrevalence |
proportion of tumour cells containing event; not to be mistaken as proportion of sample (including normal) |
If subcloneProfiles
is set to TRUE
, then the subclone profiles will be appended to the output data.frame
.
Subclone1.CopyNumber |
Integer copy number for Subclone 1. |
Subclone1.TITANcall |
States for Subclone 1 |
Subclone1.Prevalence |
The cellular prevalence of Subclone 1, or sometimes referred to as the subclone fraction. |
outputModelParameters
returns a list
containing the S_Dbw model selection:
dens.bw |
|
scat |
|
S_Dbw |
S_Dbw.scale * dens.bw + scat |
Gavin Ha <[email protected]>
Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)
runEMclonalCN
, viterbiClonalCN
, computeSDbwIndex
data(EMresults) #### COMPUTE OPTIMAL STATE PATH USING VITERBI #### optimalPath <- viterbiClonalCN(data, convergeParams) #### FORMAT RESULTS #### results <- outputTitanResults(data, convergeParams, optimalPath, filename = NULL, posteriorProbs = FALSE, subcloneProfiles = TRUE, correctResults = TRUE, proportionThreshold = 0.05, recomputeLogLik = FALSE, proportionThresholdClonal = 0.05, is.haplotypeData = FALSE) ## use corrected parameters convergeParams <- results$convergeParam ## use corrected results results <- results$corrResults #### OUTPUT RESULTS TO FILE #### outparam <- paste0("cluster2_params.txt") outputModelParameters(convergeParams, results, outparam) #### OUTPUT SEGMENTS TO FILE #### outseg <- paste0("cluster2_segs.txt") outigv <- paste0("cluster2.seg") segs <- outputTitanSegments(results, id = "test", convergeParams, filename = outseg, igvfilename = outigv) # segment results also stored in data.frame "segs"
data(EMresults) #### COMPUTE OPTIMAL STATE PATH USING VITERBI #### optimalPath <- viterbiClonalCN(data, convergeParams) #### FORMAT RESULTS #### results <- outputTitanResults(data, convergeParams, optimalPath, filename = NULL, posteriorProbs = FALSE, subcloneProfiles = TRUE, correctResults = TRUE, proportionThreshold = 0.05, recomputeLogLik = FALSE, proportionThresholdClonal = 0.05, is.haplotypeData = FALSE) ## use corrected parameters convergeParams <- results$convergeParam ## use corrected results results <- results$corrResults #### OUTPUT RESULTS TO FILE #### outparam <- paste0("cluster2_params.txt") outputModelParameters(convergeParams, results, outparam) #### OUTPUT SEGMENTS TO FILE #### outseg <- paste0("cluster2_segs.txt") outigv <- paste0("cluster2.seg") segs <- outputTitanSegments(results, id = "test", convergeParams, filename = outseg, igvfilename = outigv) # segment results also stored in data.frame "segs"
Three plotting functions for TitanCNA results. plotCNlogRByChr
plots the copy number results from log ratio data. plotAllelicRatio
plots the allelic imbalance and loss of heterozygosity (LOH) from allelic ratio data. plotClonalFrequency
plots the clonal cluster and cellular prevalence results for each data point.
plotAllelicRatio(dataIn, chr = c(1:22), geneAnnot = NULL, spacing = 4, xlim = NULL, ...) plotClonalFrequency(dataIn, chr = c(1:22), normal = NULL, geneAnnot = NULL, spacing = 4, xlim = NULL, ...) plotCNlogRByChr(dataIn, chr = c(1:22), segs = NULL, plotCorrectedCN = TRUE, geneAnnot = NULL, ploidy = NULL, normal = NULL, spacing = 4, alphaVal = 1, xlim = NULL, ...) plotSubcloneProfiles(dataIn, chr = c(1:22), geneAnnot = NULL, spacing = 4, xlim = NULL, ...) plotSegmentMedians(dataIn, resultType = "LogRatio", plotType = "CopyNumber", plotCorrectedCN = TRUE, chr = c(1:22), geneAnnot = NULL, ploidy = NULL, spacing = 4, alphaVal = 1, xlim = NULL, plot.new = FALSE, lwd = 8, ...) plotHaplotypeFraction(dataIn, chr = c(1:22), resultType = "HaplotypeRatio", colType = "Haplotypes", phaseBlockCol = c("#9ad0f3", "#CC79A7"), geneAnnot = NULL, spacing = 4, xlim = NULL, ...)
plotAllelicRatio(dataIn, chr = c(1:22), geneAnnot = NULL, spacing = 4, xlim = NULL, ...) plotClonalFrequency(dataIn, chr = c(1:22), normal = NULL, geneAnnot = NULL, spacing = 4, xlim = NULL, ...) plotCNlogRByChr(dataIn, chr = c(1:22), segs = NULL, plotCorrectedCN = TRUE, geneAnnot = NULL, ploidy = NULL, normal = NULL, spacing = 4, alphaVal = 1, xlim = NULL, ...) plotSubcloneProfiles(dataIn, chr = c(1:22), geneAnnot = NULL, spacing = 4, xlim = NULL, ...) plotSegmentMedians(dataIn, resultType = "LogRatio", plotType = "CopyNumber", plotCorrectedCN = TRUE, chr = c(1:22), geneAnnot = NULL, ploidy = NULL, spacing = 4, alphaVal = 1, xlim = NULL, plot.new = FALSE, lwd = 8, ...) plotHaplotypeFraction(dataIn, chr = c(1:22), resultType = "HaplotypeRatio", colType = "Haplotypes", phaseBlockCol = c("#9ad0f3", "#CC79A7"), geneAnnot = NULL, spacing = 4, xlim = NULL, ...)
dataIn |
Formatted TitanCNA results output from |
chr |
Plot results for specified |
segs |
|
resultType |
For |
plotType |
Specify whether to plot the ‘CopyNumber’ or ‘Ratio’ values for the |
colType |
Specify the color scheme to use: ‘Haplotypes’ or ‘CopyNumber’. For ‘Haplotypes’, alternating blue and red used to illustrated the data within phased haplotype blocks. For ‘CopyNumber’, the same colors as |
plotCorrectedCN |
|
geneAnnot |
|
normal |
|
ploidy |
|
phaseBlockCol |
Two-element vector specifying the color to plot for alternating haplotype phase blocks. |
spacing |
Number of lines of spacing for the margin spacing at the bottom of the plot. Useful if an idiogram/karogram is plot underneath. |
alphaVal |
Set an alpha value between 0 and 1 to allow transparency in the points being plot. |
xlim |
Two element vector to specify the xlim for the plot. If |
lwd |
Explicitly specify the line width for segments being plot. |
plot.new |
|
... |
Additional arguments used in the |
plotCNlogRByChr
plots the copy number alterations from log ratio data. The Y-axis is based on log ratios. Log ratios are computed ratios between normalized tumour and normal read depths. Data points close to 0 represent diploid, above 0 are copy gains, below 0 are deletions. ploidy
argument adjusts the baseline of the data points. Colours represent the copy number state.
Bright Green - Homozygous deletion (HOMD)
Green - Hemizygous deletion (DLOH)
Blue - Diploid heterozygous (HET), Copy-neutral LOH (NLOH)
Dark Red - GAIN
Red - Allele-specific CNA (ASCNA), Unbalanced CNA (UBCNA), Balanced CNA (BCNA)
plotAllelicRatio
plots the allelic imbalance and loss of heterozygosity from allelic ratio data. The Y-axis is based on allelic ratios. Allelic ratios are computed as RefCount/Depth. Data points close to 1 represent homozygous reference base, close to 0 represent homozygous non-reference base, and close to 0.5 represent heterozygous. Normal contamination influences the divergence away from 0.5 for LOH events. No adjustments are made to the plot as the original data from dataIn
are shown. Colours represent the allelic imbalance and LOH state.
Grey - HET, BCNA
Bright Green - HOMD
Green - DLOH, ALOH
Blue - NLOH
Dark Red - GAIN
Red - ASCNA, UBCNA
plotClonalFrequency
plots the cellular prevalence and clonal clusters from the results. The Y-axis is the cellular prevalence that includes the normal proportion. Therefore, the cellular prevalence here refers to the proportion in the sample (including normal). Lines are drawn for each data point indicating the cellular prevalence. Heterozygous diploid are not shown because it is a normal genotype and is not categorized as being subclonal (this means 100% of cells are normal). The black horizontal line represents the tumour content labeled as ‘T’. Each horizontal grey line represents the cellular prevalence of the clonal clusters labeled as Z1, Z2, etc. Colours are the sames for allelic ratio plots.
plotSubcloneProfiles
plots the predicted copy number profile for individual subclones inferred by TITAN. Currently, this only works for solutions having 1 or 2 clonal clusters. The colours are the same as for plotAllelicRatio
.
plotSegmentMedians
plots the segment means for ‘LogRatio’ or ‘AllelicRatio’ data. There are also two types of plots for each of the datatypes: ‘Ratio’ or ‘CopyNumber’. For ‘Ratio’, the logRatio or allelic ratios in the output results files are plot. For ‘CopyNumber’, the y-axis is converted to the exponentiated absolute copy number levels for the easier interpretability of the results.
plotHaplotypeFraction
plots the phased SNP read count of the higher coverage haplotype, normalized by the total coverage of the haplotype. For ‘Haplotypes’, alternating colors of blue and red are used to illustrate the phased haplotype blocks provided from the input data (see loadHaplotypeAlleleCounts
).
Gavin Ha <[email protected]>
Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)
outputTitanResults
, runEMclonalCN
, computeSDbwIndex
data(EMresults) #### COMPUTE OPTIMAL STATE PATH USING VITERBI #### optimalPath <- viterbiClonalCN(data, convergeParams) #### FORMAT RESULTS #### results <- outputTitanResults(data, convergeParams, optimalPath, filename = NULL, posteriorProbs = FALSE, correctResults = TRUE, proportionThreshold = 0.05, proportionThresholdClonal = 0.05) convergeParams <- results$convergeParams results <- results$corrResults # use corrected results #### PLOT RESULTS #### norm <- tail(convergeParams$n, 1) ploidy <- tail(convergeParams$phi, 1) par(mfrow=c(4, 1)) plotCNlogRByChr(results, chr = 2, segs = NULL, ploidy = ploidy, normal = norm, geneAnnot = NULL, ylim = c(-2, 2), cex = 0.5, xlab = "", main = "Chr 2") plotAllelicRatio(results, chr = 2, geneAnnot = NULL, ylim = c(0, 1), cex = 0.5, xlab = "", main = "Chr 2") plotClonalFrequency(results, chr = 2, normal = norm, geneAnnot = NULL, ylim = c(0, 1), cex = 0.5, xlab = "", main = "Chr 2") plotSubcloneProfiles(results, chr = 2, cex = 2, main = "Chr 2") segs <- outputTitanSegments(results, id = "test", convergeParams) plotSegmentMedians(segs, chr=2, resultType = "LogRatio", plotType = "CopyNumber", plot.new = TRUE)
data(EMresults) #### COMPUTE OPTIMAL STATE PATH USING VITERBI #### optimalPath <- viterbiClonalCN(data, convergeParams) #### FORMAT RESULTS #### results <- outputTitanResults(data, convergeParams, optimalPath, filename = NULL, posteriorProbs = FALSE, correctResults = TRUE, proportionThreshold = 0.05, proportionThresholdClonal = 0.05) convergeParams <- results$convergeParams results <- results$corrResults # use corrected results #### PLOT RESULTS #### norm <- tail(convergeParams$n, 1) ploidy <- tail(convergeParams$phi, 1) par(mfrow=c(4, 1)) plotCNlogRByChr(results, chr = 2, segs = NULL, ploidy = ploidy, normal = norm, geneAnnot = NULL, ylim = c(-2, 2), cex = 0.5, xlab = "", main = "Chr 2") plotAllelicRatio(results, chr = 2, geneAnnot = NULL, ylim = c(0, 1), cex = 0.5, xlab = "", main = "Chr 2") plotClonalFrequency(results, chr = 2, normal = norm, geneAnnot = NULL, ylim = c(0, 1), cex = 0.5, xlab = "", main = "Chr 2") plotSubcloneProfiles(results, chr = 2, cex = 2, main = "Chr 2") segs <- outputTitanSegments(results, id = "test", convergeParams) plotSegmentMedians(segs, chr=2, resultType = "LogRatio", plotType = "CopyNumber", plot.new = TRUE)
Function to run the Expectation Maximization Algorithm for inference of model parameters: cellular prevalence, normal proportion, tumour ploidy. This is a key function in the TitanCNA package and is the most computationally intense. This function makes calls to a C subroutine that allows the algorithm to be run more efficiently.
runEMclonalCN(data, params, txnExpLen = 1e15, txnZstrength = 5e05, maxiter = 15, maxiterUpdate = 1500, pseudoCounts = 1e-300, normalEstimateMethod = "map", estimateS = TRUE, estimatePloidy = TRUE, useOutlierState = FALSE, likChangeThreshold = 0.001, verbose = TRUE)
runEMclonalCN(data, params, txnExpLen = 1e15, txnZstrength = 5e05, maxiter = 15, maxiterUpdate = 1500, pseudoCounts = 1e-300, normalEstimateMethod = "map", estimateS = TRUE, estimatePloidy = TRUE, useOutlierState = FALSE, likChangeThreshold = 0.001, verbose = TRUE)
data |
|
params |
|
txnExpLen |
Influences prior probability of genotype transitions in the HMM. Smaller value have lower tendency to change state; however, too small and it produces underflow problems. |
txnZstrength |
Influences prior probability of clonal cluster transitions in the HMM. Smaller value have lower tendency to change clonal cluster state. |
pseudoCounts |
Small, machine precision values to add to probabilities to avoid underflow. For example, |
maxiter |
Maximum number of expectation-maximization iterations allowed. In practice, for TitanCNA, it will usually not exceed 20. |
maxiterUpdate |
Maximum number of coordinate descent iterations during the M-step (of EM algorithm) when parameters are estimated. |
normalEstimateMethod |
Specifies how to handle normal proportion estimation. Using |
estimateS |
Logical indicating whether to account for clonality and estimate subclonal events. See Details. |
estimatePloidy |
Logical indicating whether to estimate and account for tumour ploidy. |
useOutlierState |
Logical indicating whether an additional outlier state should be used. In practice, this is usually not necessary. |
likChangeThreshold |
EM convergence criteria - stop EM when complete log likelihood changes less than the proportion specified by this argument. |
verbose |
Set to FALSE to suppress program messages. |
This function is implemented with the "foreach"
package and therefore supports parallelization. See "doMC"
or "doMPI"
for some parallelization packages.
The forwards-backwards algorithm is used for the E-step in the EM algorithm. This is done using a call to a C subroutine for each chromosome. The maximization step uses maximum a posteriori (MAP) for estimation of parameters.
If the sample has absolutely no normal contamination, then assign nParams$n_0 <- 0
and use argument normalEstimateMethod="fixed"
.
estimateS
should always be set to TRUE
. If no subclonality is expected, then use loadDefaultParameters(numberClonalClusters=1)
. Using estimateS=FALSE
and loadDefaultParameters(numberClonalClusters=0)
is gives more or less the same results.
list
with components for results returned from the EM algorithm, including converged parameters, posterior marginal responsibilities, log likelihood, and original parameter settings.
n |
Converged estimate for normal contamination parameter. |
s |
Converged estimate(s) for cellular prevalence parameter(s). This value is defined as the proportion of tumour sample that does not contain the aberrant genotype. This will contrast what is output in |
var |
Converged estimates for variance parameter of the Gaussian mixtures used to model the log ratio data. |
phi |
Converged estimate for tumour ploidy parameter. |
piG |
Converged estimate for initial genotype state distribution. |
piZ |
Converged estimate for initial clonal cluster state distribution. |
muR |
Mean of binomial mixtures computed as a function of |
muC |
Mean of Gaussian mixtures computed as a function of |
loglik |
Posterior Log-likelihood that includes data likelihood and the priors. |
rhoG |
Posterior marginal probabilities for the genotype states computed during the E-step. Only the final iteration is returned as a |
rhoZ |
Posterior marginal probabilities for the clonal cluster states computed during the E-step. Only the final iteration is returned as a |
genotypeParams |
Original genotype parameters. See |
ploidyParams |
Original tumour ploidy parameters. See |
normalParams |
Original normal contamination parameters. See |
clonalParams |
Original subclonal parameters. See |
txnExpLen |
Original genotype transition expected length. See |
txnZstrength |
Original clonal cluster transition expected length. See |
useOutlierState |
Original setting indicating usage of outlier state. See |
Gavin Ha <[email protected]>
Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)
"foreach"
, "doMC"
, "doMPI"
, loadAlleleCounts
, loadDefaultParameters
, viterbiClonalCN
message('Running TITAN ...') #### LOAD DATA #### infile <- system.file("extdata", "test_alleleCounts_chr2.txt", package = "TitanCNA") data <- loadAlleleCounts(infile) #### LOAD PARAMETERS #### message('titan: Loading default parameters') numClusters <- 2 params <- loadDefaultParameters(copyNumber = 5, numberClonalClusters = numClusters, skew = 0.1) #### READ COPY NUMBER FROM HMMCOPY FILE #### message('titan: Correcting GC content and mappability biases...') tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA") normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA") gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA") map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA") cnData <- correctReadDepth(tumWig, normWig, gc, map) logR <- getPositionOverlap(data$chr, data$posn, cnData) data$logR <- log(2^logR) #transform to natural log #### FILTER DATA FOR DEPTH, MAPPABILITY, NA, etc #### data <- filterData(data, 1:24, minDepth = 10, maxDepth = 200, map = NULL) #### EM (FWD-BACK) TO TRAIN PARAMETERS #### #### Can use parallelization packages #### K <- length(params$genotypeParams$alphaKHyper) params$genotypeParams$alphaKHyper <- rep(500, K) params$ploidyParams$phi_0 <- 1.5 convergeParams <- runEMclonalCN(data, params, maxiter = 3, maxiterUpdate = 500, txnExpLen = 1e15, txnZstrength = 5e5, useOutlierState = FALSE, normalEstimateMethod = "map", estimateS = TRUE, estimatePloidy = TRUE)
message('Running TITAN ...') #### LOAD DATA #### infile <- system.file("extdata", "test_alleleCounts_chr2.txt", package = "TitanCNA") data <- loadAlleleCounts(infile) #### LOAD PARAMETERS #### message('titan: Loading default parameters') numClusters <- 2 params <- loadDefaultParameters(copyNumber = 5, numberClonalClusters = numClusters, skew = 0.1) #### READ COPY NUMBER FROM HMMCOPY FILE #### message('titan: Correcting GC content and mappability biases...') tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA") normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA") gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA") map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA") cnData <- correctReadDepth(tumWig, normWig, gc, map) logR <- getPositionOverlap(data$chr, data$posn, cnData) data$logR <- log(2^logR) #transform to natural log #### FILTER DATA FOR DEPTH, MAPPABILITY, NA, etc #### data <- filterData(data, 1:24, minDepth = 10, maxDepth = 200, map = NULL) #### EM (FWD-BACK) TO TRAIN PARAMETERS #### #### Can use parallelization packages #### K <- length(params$genotypeParams$alphaKHyper) params$genotypeParams$alphaKHyper <- rep(500, K) params$ploidyParams$phi_0 <- 1.5 convergeParams <- runEMclonalCN(data, params, maxiter = 3, maxiterUpdate = 500, txnExpLen = 1e15, txnZstrength = 5e5, useOutlierState = FALSE, normalEstimateMethod = "map", estimateS = TRUE, estimatePloidy = TRUE)
Data for chromosome 2 for a triple-negative breast cancer dataset and the expectation-maximization (EM) trained results. Only 20,000 datapoints are included and the data has been scrambled to anonymize patient SNPs.
Processed input data that is first generated by
loadAlleleCounts
, and includes log ratios that
have been GC content and mappability corrected using
correctReadDepth
.
EM results that are generated by
runEMclonalCN
data(EMresults)
data(EMresults)
‘data’ is a list. ‘convergeParams’ is a list.
Shah SP et al. (2012). The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature, 486(7403): 395-399. (PMID: 22495314)
Ha, G., Roth, A., Lai, D., Bashashati, A., Ding, J., Goya, R., Giuliany, R., Rosner, J., Oloumi, A., Shumansky, K., Chin, S.F., Turashvili, G., Hirst, M., Caldas, C., Marra, M. A., Aparicio, S., and Shah, S. P. (2012). Integrative analysis of genome wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple negative breast cancer. Genome Research, 22(10):1995,2007. (PMID: 22637570)
Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)
Function to run the Viterbi algorithm to find the optimal state path in the TitanCNA hidden Markov model (HMM). The states returned will indicate the optimal copy number and LOH state as well as the most likely clonal cluster for each data point. After running EM, use the converge parameters and the input data to infer the optimal state for each position. This function makes calls to a C subroutine that allows the algorithm to be run more efficiently.
viterbiClonalCN(data, convergeParams, genotypeParams = NULL)
viterbiClonalCN(data, convergeParams, genotypeParams = NULL)
data |
|
convergeParams |
|
genotypeParams |
If |
It is difficult to interpret the output of this function directly. The user should use the function outputTitanResults
after.
numeric array
containing the integer states corresponding to each data point in data
.
Gavin Ha <[email protected]>
Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)
outputTitanResults
, loadAlleleCounts
data(EMresults) #### COMPUTE OPTIMAL STATE PATH USING VITERBI #### optimalPath <- viterbiClonalCN(data, convergeParams)
data(EMresults) #### COMPUTE OPTIMAL STATE PATH USING VITERBI #### optimalPath <- viterbiClonalCN(data, convergeParams)
Fast fixedStep WIG file reading and parsing
wigToGRanges(wigfile, verbose = TRUE) wigToRangedData(wigfile, verbose = TRUE)
wigToGRanges(wigfile, verbose = TRUE) wigToRangedData(wigfile, verbose = TRUE)
wigfile |
Filepath to fixedStep WIG format file |
verbose |
Set to FALSE to suppress messages |
Reads the entire file into memory, then processes the file in rapid fashion, thus performance will be limited by memory capacity.
The WIG file is expected to conform to the minimal fixedStep WIG format (see References), where each chromsome is started by a “fixedStep” declaration line. The function assumes only a single track in the WIG file, and will ignore any lines before the first line starting with “fixedStep”.
GRanges
object with chromosome and position information, sorted
in decreasing chromosal size and increasing position.
Gavin Ha, Daniel Lai
wigToGRanges
is a wrapper around these functions for easy
WIG file loading and structure formatting. It is modified from the HMMcopy package.
map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA") mScore <- as.data.frame(wigToGRanges(map))
map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA") mScore <- as.data.frame(wigToGRanges(map))