Title: | A Normalization and Copy Number Variation Detection Method for Whole Exome Sequencing |
---|---|
Description: | A normalization and copy number variation calling procedure for whole exome DNA sequencing data. CODEX relies on the availability of multiple samples processed using the same sequencing pipeline for normalization, and does not require matched controls. The normalization model in CODEX includes terms that specifically remove biases due to GC content, exon length and targeting and amplification efficiency, and latent systemic artifacts. CODEX also includes a Poisson likelihood-based recursive segmentation procedure that explicitly models the count-based exome sequencing data. |
Authors: | Yuchao Jiang, Nancy R. Zhang |
Maintainer: | Yuchao Jiang <[email protected]> |
License: | GPL-2 |
Version: | 1.39.0 |
Built: | 2024-10-30 05:19:35 UTC |
Source: | https://github.com/bioc/CODEX |
CODEX is a normalization and copy number variation calling procedure for whole exome DNA sequencing data. CODEX relies on the availability of multiple samples processed using the same sequencing pipeline for normalization, and does not require matched controls. The normalization model in CODEX includes terms that specifically remove biases due to GC content, exon length and targeting and amplification efficiency, and latent systemic artifacts. CODEX also includes a Poisson likelihood-based recursive segmentation procedure that explicitly models the count-based exome sequencing data.
Package: | CODEX |
Type: | Package |
Version: | 0.99.0 |
Date: | 2015-01-13 |
License: | GPL-2 |
CODEX takes as input the bam files/directories for whole exome sequencing datasets and bed files for exonic positions, returns raw and normalized coverage for each exon, and calls copy number variations with genotyping results.
Yuchao Jiang <[email protected]>, Nancy R. Zhang
Pre-stored bambedObj data for demonstration purposes.
data(bambedObjDemo)
data(bambedObjDemo)
Pre-computed using whole exome sequencing data of 46 HapMap samples.
bambedObj demo data (list) pre-computed.
Yuchao Jiang [email protected]
bamdir <- bambedObjDemo$bamdir sampname <- bambedObjDemo$sampname ref <- bambedObjDemo$ref projectname <- bambedObjDemo$projectname chr <- bambedObjDemo$chr
bamdir <- bambedObjDemo$bamdir sampname <- bambedObjDemo$sampname ref <- bambedObjDemo$ref projectname <- bambedObjDemo$projectname chr <- bambedObjDemo$chr
Determines the number of latent variables K via AIC, BIC, and deviance reduction. A pdf file containing all three plots is generated.
choiceofK(AIC, BIC, RSS, K, filename)
choiceofK(AIC, BIC, RSS, K, filename)
AIC |
vector of AIC for each K returned from |
BIC |
vector of BIC for each K returned from |
RSS |
vector of RSS for each K returned from |
K |
vector of K returned from |
filename |
Filename of the output plot of AIC and RSS |
AIC: Akaike information criterion, used for model selection; BIC: Bayesian information criterion, used for model selection; RSS: residue sum of squares, used to plot scree plot and determine the 'elbow'.
pdf file with three plots: AIC, BIC, and percentage of variance explained versus the number of latent factors.
Yuchao Jiang [email protected]
AIC <- normObjDemo$AIC BIC <- normObjDemo$BIC RSS <- normObjDemo$RSS K <- normObjDemo$K projectname <- bambedObjDemo$projectname chr <- bambedObjDemo$chr choiceofK(AIC, BIC, RSS, K, filename = paste(projectname, "_", chr, "_choiceofK", ".pdf", sep = ""))
AIC <- normObjDemo$AIC BIC <- normObjDemo$BIC RSS <- normObjDemo$RSS K <- normObjDemo$K projectname <- bambedObjDemo$projectname chr <- bambedObjDemo$chr choiceofK(AIC, BIC, RSS, K, filename = paste(projectname, "_", chr, "_choiceofK", ".pdf", sep = ""))
Pre-stored coverageObj data for demonstration purposes.
data(coverageObjDemo)
data(coverageObjDemo)
Pre-computed using whole exome sequencing data of 46 HapMap samples.
coverageObj demo data (list) pre-computed.
Yuchao Jiang [email protected]
Y <- coverageObjDemo$Y readlength <- coverageObjDemo$readlength
Y <- coverageObjDemo$Y readlength <- coverageObjDemo$readlength
Pre-stored GC content data for demonstration purposes.
data(gcDemo)
data(gcDemo)
Pre-computed using whole exome sequencing data of 46 HapMap samples.
gc demo data (vector) pre-computed.
Yuchao Jiang [email protected]
head(round(gcDemo, 2))
head(round(gcDemo, 2))
Gets bam file directories, sample names from .txt file, and exonic positions from .bed file.
getbambed(bamdir,bedFile,sampname,projectname,chr)
getbambed(bamdir,bedFile,sampname,projectname,chr)
bamdir |
Column vector. Each line specifies directory of a bam file. Should be in same order as sample names in sampname. |
bedFile |
Path to bed file specifying exonic targets. Is of type character. |
sampname |
Column vector. Each line specifies name of a sample corresponding to the bam file. Should be in same order as bam directories in bamdir. |
projectname |
String specifying the name of the project. Data will be saved using this as prefix. |
chr |
Chromosome. |
bamdir |
Bam directories |
sampname |
Sample names |
ref |
IRanges object specifying exonic positions |
projectname |
String specifying the name of the project. |
chr |
Chromosome |
Yuchao Jiang [email protected]
Lawrence M, Huber W, Pages H, Aboyoun P, Carlson M, Gentleman R, Morgan M and Carey V (2013). "Software for Computing and Annotating Genomic Ranges." PLoS Computational Biology, 9.
library(WES.1KG.WUGSC) dirPath <- system.file("extdata", package = "WES.1KG.WUGSC") bamFile <- list.files(dirPath, pattern = '*.bam$') bamdir <- file.path(dirPath, bamFile) sampnameFile <- file.path(dirPath, "sampname") sampname <- as.matrix(read.table(sampnameFile)) chr <- 22 bambedObj <- getbambed(bamdir = bamdir, bedFile = file.path(dirPath, "chr22_400_to_500.bed"), sampname = sampname, projectname = "CODEX_demo", chr) bamdir <- bambedObj$bamdir sampname <- bambedObj$sampname ref <- bambedObj$ref projectname <- bambedObj$projectname chr <- bambedObj$chr
library(WES.1KG.WUGSC) dirPath <- system.file("extdata", package = "WES.1KG.WUGSC") bamFile <- list.files(dirPath, pattern = '*.bam$') bamdir <- file.path(dirPath, bamFile) sampnameFile <- file.path(dirPath, "sampname") sampname <- as.matrix(read.table(sampnameFile)) chr <- 22 bambedObj <- getbambed(bamdir = bamdir, bedFile = file.path(dirPath, "chr22_400_to_500.bed"), sampname = sampname, projectname = "CODEX_demo", chr) bamdir <- bambedObj$bamdir sampname <- bambedObj$sampname ref <- bambedObj$ref projectname <- bambedObj$projectname chr <- bambedObj$chr
Gets depth of coverage for each exon across all samples from whole exome sequencing files.
getcoverage(bambedObj, mapqthres)
getcoverage(bambedObj, mapqthres)
bambedObj |
Object returned from |
mapqthres |
Mapping quality threshold hold of reads. |
Y |
Read depth matrix |
readlength |
Vector of read length for each sample |
Yuchao Jiang [email protected]
library(WES.1KG.WUGSC) dirPath <- system.file("extdata", package = "WES.1KG.WUGSC") bamFile <- list.files(dirPath, pattern = '*.bam$') bamdir <- file.path(dirPath, bamFile) sampnameFile <- file.path(dirPath, "sampname") sampname <- as.matrix(read.table(sampnameFile)) chr <- 22 bambedObj <- getbambed(bamdir = bamdir, bedFile = file.path(dirPath, "chr22_400_to_500.bed"), sampname = sampname, projectname = "CODEX_demo", chr) bamdir <- bambedObj$bamdir sampname <- bambedObj$sampname ref <- bambedObj$ref projectname <- bambedObj$projectname chr <- bambedObj$chr coverageObj <- getcoverage(bambedObj, mapqthres = 20) Y <- coverageObj$Y readlength <- coverageObj$readlength
library(WES.1KG.WUGSC) dirPath <- system.file("extdata", package = "WES.1KG.WUGSC") bamFile <- list.files(dirPath, pattern = '*.bam$') bamdir <- file.path(dirPath, bamFile) sampnameFile <- file.path(dirPath, "sampname") sampname <- as.matrix(read.table(sampnameFile)) chr <- 22 bambedObj <- getbambed(bamdir = bamdir, bedFile = file.path(dirPath, "chr22_400_to_500.bed"), sampname = sampname, projectname = "CODEX_demo", chr) bamdir <- bambedObj$bamdir sampname <- bambedObj$sampname ref <- bambedObj$ref projectname <- bambedObj$projectname chr <- bambedObj$chr coverageObj <- getcoverage(bambedObj, mapqthres = 20) Y <- coverageObj$Y readlength <- coverageObj$readlength
Computes GC content for each exon. Will be later used in QC procedure and normalization.
getgc(chr, ref)
getgc(chr, ref)
chr |
Chromosome returned from |
ref |
IRanges object returned from |
Vector of GC content for each exon.
Yuchao Jiang [email protected]
Team TBD. BSgenome.Hsapiens.UCSC.hg19: Full genome sequences for Homo sapiens (UCSC version hg19). R package version 1.3.99.
ref <- IRanges(st = 51207851, end = 51207982) gc <- getgc(chr = 22, ref)
ref <- IRanges(st = 51207851, end = 51207982) gc <- getgc(chr = 22, ref)
Computes mappability for each exon. To save running time, take values from pre-computed results. Will be later used in QC procedure.
getmapp(chr, ref)
getmapp(chr, ref)
chr |
Chromosome returned from |
ref |
IRanges object returned from |
To calculate the exonic mappability, we first construct consecutive reads of length 90 that are one base pair apart along the exon. The sequences are taken from the hg19 reference. We then find possible positions across the genome that the reads can map to allowing for two mismatches. We compute the mean of the probabilities that the overlapped reads map to the target places where they are generated and use this as the mappability of the exon.
Vector of mappability for each exon.
Yuchao Jiang [email protected]
ref <- IRanges(st = 51207851, end = 51207982) mapp <- getmapp(chr = 22, ref)
ref <- IRanges(st = 51207851, end = 51207982) mapp <- getmapp(chr = 22, ref)
List consisting of IRanges objects specifying exonic positions whose mappabilities are pre-computed across the genome.
data(mapp_ref)
data(mapp_ref)
Genomic positions for pre-computed mappabilities. Method used is detailed in
getmapp
.
List of length 24 with genomic positions of pre-computed mappability of each chromosome.
Yuchao Jiang [email protected]
# mappability exon reference of chromosome 1 mapp_ref[[1]]
# mappability exon reference of chromosome 1 mapp_ref[[1]]
The results of pre-computed mappabilities to save running time.
data(mappability)
data(mappability)
Pre-computed mappabilities. Method used is detailed in getmapp
.
List of length 24 with pre-computed mappability of each chromosome.
Yuchao Jiang [email protected]
# mappability of chromosome 1 head(round(mappability[[1]], 2))
# mappability of chromosome 1 head(round(mappability[[1]], 2))
Pre-stored mappability data for demonstration purposes.
data(mappDemo)
data(mappDemo)
Pre-computed using whole exome sequencing data of 46 HapMap samples.
mapp demo data (vector) pre-computed.
Yuchao Jiang [email protected]
head(round(mappDemo, 2))
head(round(mappDemo, 2))
Fits a Poisson log-linear model that normalizes the read depth data from whole exome sequencing. Includes terms that specifically remove biases due to GC content, exon capture and amplification efficiency, and latent systemic artifacts.
normalize(Y_qc, gc_qc, K)
normalize(Y_qc, gc_qc, K)
Y_qc |
Read depth matrix after quality control procedure returned from
|
gc_qc |
Vector of GC content for each exon after quality control procedure returned
from |
K |
Number of latent Poisson factors. Can be an integer if optimal solution has been chosen or a vector of integers so that AIC, BIC, and RSS are computed for choice of optimal k. |
Yhat |
Normalized read depth matrix |
AIC |
AIC for model selection |
BIC |
BIC for model selection |
RSS |
RSS for model selection |
K |
Number of latent Poisson factors |
Yuchao Jiang [email protected]
Y_qc <- qcObjDemo$Y_qc gc_qc <- qcObjDemo$gc_qc normObj <- normalize(Y_qc, gc_qc, K = 1:5) Yhat <- normObj$Yhat AIC <- normObj$AIC BIC <- normObj$BIC RSS <- normObj$RSS K <- normObj$K
Y_qc <- qcObjDemo$Y_qc gc_qc <- qcObjDemo$gc_qc normObj <- normalize(Y_qc, gc_qc, K = 1:5) Yhat <- normObj$Yhat AIC <- normObj$AIC BIC <- normObj$BIC RSS <- normObj$RSS K <- normObj$K
Fits a Poisson log-linear model that normalizes the read depth data from whole exome sequencing. Includes terms that specifically remove biases due to GC content, exon capture and amplification efficiency, and latent systemic artifacts. If the WES is designed under case-control setting, CODEX estimates the exon-wise Poisson latent factor using only the read depths in the control cohort, and then computes the sample-wise latent factor terms for the case samples by regression.
normalize2(Y_qc, gc_qc, K, normal_index)
normalize2(Y_qc, gc_qc, K, normal_index)
Y_qc |
Read depth matrix after quality control procedure returned from
|
gc_qc |
Vector of GC content for each exon after quality control procedure returned
from |
K |
Number of latent Poisson factors. Can be an integer if optimal solution has been chosen or a vector of integers so that AIC, BIC, and RSS are computed for choice of optimal k. |
normal_index |
Indices of control samples. |
Yhat |
Normalized read depth matrix |
AIC |
AIC for model selection |
BIC |
BIC for model selection |
RSS |
RSS for model selection |
K |
Number of latent Poisson factors |
Yuchao Jiang [email protected]
Y_qc <- qcObjDemo$Y_qc gc_qc <- qcObjDemo$gc_qc normObj <- normalize2(Y_qc, gc_qc, K = 1:5, normal_index = seq(1, 45, 2)) Yhat <- normObj$Yhat AIC <- normObj$AIC BIC <- normObj$BIC RSS <- normObj$RSS K <- normObj$K
Y_qc <- qcObjDemo$Y_qc gc_qc <- qcObjDemo$gc_qc normObj <- normalize2(Y_qc, gc_qc, K = 1:5, normal_index = seq(1, 45, 2)) Yhat <- normObj$Yhat AIC <- normObj$AIC BIC <- normObj$BIC RSS <- normObj$RSS K <- normObj$K
Pre-stored normObj data for demonstration purposes.
data(normObjDemo)
data(normObjDemo)
Pre-computed using whole exome sequencing data of 46 HapMap samples.
normObj demo data (list) pre-computed.
Yuchao Jiang [email protected]
Yhat <- normObjDemo$Yhat AIC <- normObjDemo$AIC BIC <- normObjDemo$BIC RSS <- normObjDemo$RSS K <- normObjDemo$K
Yhat <- normObjDemo$Yhat AIC <- normObjDemo$AIC BIC <- normObjDemo$BIC RSS <- normObjDemo$RSS K <- normObjDemo$K
Applies a quality control procedure to the depth of coverage matrix both sample-wise and exon-wise before normalization.
qc(Y, sampname, chr, ref, mapp, gc,cov_thresh,length_thresh,mapp_thresh, gc_thresh)
qc(Y, sampname, chr, ref, mapp, gc,cov_thresh,length_thresh,mapp_thresh, gc_thresh)
Y |
Original read depth matrix returned from |
sampname |
Vector of sample names returned from |
chr |
Chromosome. |
ref |
IRanges object specifying exonic positions returned from
|
mapp |
Vector of mappability for each exon returned from |
gc |
Vector of GC content for each exon returned from |
cov_thresh |
Vector specifying the upper and lower bound of exonic median coverage threshold for QC. 20-4000 recommended. |
length_thresh |
Vector specifying the upper and lower bound of exonic length threshold for QC. 20-2000 recommended. |
mapp_thresh |
Scalar variable specifying exonic mappability threshold for QC. 0.9 recommended. |
gc_thresh |
Vector specifying the upper and lower bound of exonic GC content threshold for QC. 20-80 recommended. |
It is suggested that analysis by CODEX be carried out in a batch-wise fashion if multiple batches exist. CODEX further filters out exons that: have extremely low coverage–median read depth across all samples less than 20 or greater than 4000; are extremely short–less than 20 bp; are extremely hard to map– mappability less than 0.9; have extreme GC content–less than 20 or greater than 80. The above filtering thresholds are recommended and can be user-defined to be adapted to different sequencing protocols.
Y_qc |
Updated |
sampname_qc |
Updated |
gc_qc |
Updated |
mapp_qc |
Updated |
ref_qc |
Updated |
qcmat |
Matrix specifying results of exon-wise QC procedures |
Yuchao Jiang [email protected]
Y <- coverageObjDemo$Y sampname <- bambedObjDemo$sampname chr <- bambedObjDemo$chr ref <- bambedObjDemo$ref gc <- gcDemo mapp <- mappDemo cov_thresh <- c(20, 4000) length_thresh <- c(20, 2000) mapp_thresh <- 0.9 gc_thresh <- c(20, 80) qcObj <- qc(Y, sampname, chr, ref, mapp, gc, cov_thresh, length_thresh, mapp_thresh, gc_thresh) Y_qc <- qcObj$Y_qc sampname_qc <- qcObj$sampname_qc gc_qc <- qcObj$gc_qc mapp_qc <- qcObj$mapp_qc ref_qc <- qcObj$ref_qc qcmat <- qcObj$qcmat
Y <- coverageObjDemo$Y sampname <- bambedObjDemo$sampname chr <- bambedObjDemo$chr ref <- bambedObjDemo$ref gc <- gcDemo mapp <- mappDemo cov_thresh <- c(20, 4000) length_thresh <- c(20, 2000) mapp_thresh <- 0.9 gc_thresh <- c(20, 80) qcObj <- qc(Y, sampname, chr, ref, mapp, gc, cov_thresh, length_thresh, mapp_thresh, gc_thresh) Y_qc <- qcObj$Y_qc sampname_qc <- qcObj$sampname_qc gc_qc <- qcObj$gc_qc mapp_qc <- qcObj$mapp_qc ref_qc <- qcObj$ref_qc qcmat <- qcObj$qcmat
Pre-stored qcObj data for demonstration purposes.
data(qcObjDemo)
data(qcObjDemo)
Pre-computed using whole exome sequencing data of 46 HapMap samples.
qcObj demo data (list) pre-computed.
Yuchao Jiang [email protected]
Y_qc <- qcObjDemo$Y_qc sampname_qc <- qcObjDemo$sampname_qc gc_qc <- qcObjDemo$gc_qc mapp_qc <- qcObjDemo$mapp_qc ref_qc <- qcObjDemo$ref_qc
Y_qc <- qcObjDemo$Y_qc sampname_qc <- qcObjDemo$sampname_qc gc_qc <- qcObjDemo$gc_qc mapp_qc <- qcObjDemo$mapp_qc ref_qc <- qcObjDemo$ref_qc
Recursive segmentation algorithm for CNV detection and genotyping, using normalized read depth from whole exome sequencing.
segment(Y_qc, Yhat, optK, K, sampname_qc, ref_qc, chr, lmax, mode)
segment(Y_qc, Yhat, optK, K, sampname_qc, ref_qc, chr, lmax, mode)
Y_qc |
Raw read depth matrix after quality control procedure returned from
|
Yhat |
Normalized read depth matrix returned from
|
optK |
Optimal value |
K |
Number of latent Poisson factors. Can be an integer if optimal solution has been chosen or a vector of integers so that AIC, BIC, and RSS are computed for choice of optimal k. |
sampname_qc |
Vector of sample names after quality control procedure returned from
|
ref_qc |
IRanges object of genomic positions of each exon after quality control
procedure returned from |
chr |
Chromosome number returned from |
lmax |
Maximum CNV length in number of exons returned. |
mode |
Can be either "integer" or "fraction", which respectively correspond to format of the returned copy numbers. |
Final callset of CNVs with genotyping results.
Yuchao Jiang [email protected]
Y_qc <- qcObjDemo$Y_qc Yhat <- normObjDemo$Yhat BIC <- normObjDemo$BIC K <- normObjDemo$K sampname_qc <- qcObjDemo$sampname_qc ref_qc <- qcObjDemo$ref_qc chr <- bambedObjDemo$chr finalcall <- segment(Y_qc, Yhat, optK = K[which.max(BIC)], K = K, sampname_qc, ref_qc, chr, lmax = 200, mode = "integer") finalcall
Y_qc <- qcObjDemo$Y_qc Yhat <- normObjDemo$Yhat BIC <- normObjDemo$BIC K <- normObjDemo$K sampname_qc <- qcObjDemo$sampname_qc ref_qc <- qcObjDemo$ref_qc chr <- bambedObjDemo$chr finalcall <- segment(Y_qc, Yhat, optK = K[which.max(BIC)], K = K, sampname_qc, ref_qc, chr, lmax = 200, mode = "integer") finalcall