Title: | A normalization and copy number estimation method for single-cell DNA sequencing |
---|---|
Description: | Whole genome single-cell DNA sequencing (scDNA-seq) enables characterization of copy number profiles at the cellular level. This circumvents the averaging effects associated with bulk-tissue sequencing and has increased resolution yet decreased ambiguity in deconvolving cancer subclones and elucidating cancer evolutionary history. ScDNA-seq data is, however, sparse, noisy, and highly variable even within a homogeneous cell population, due to the biases and artifacts that are introduced during the library preparation and sequencing procedure. Here, we propose SCOPE, a normalization and copy number estimation method for scDNA-seq data. The distinguishing features of SCOPE include: (i) utilization of cell-specific Gini coefficients for quality controls and for identification of normal/diploid cells, which are further used as negative control samples in a Poisson latent factor model for normalization; (ii) modeling of GC content bias using an expectation-maximization algorithm embedded in the Poisson generalized linear models, which accounts for the different copy number states along the genome; (iii) a cross-sample iterative segmentation procedure to identify breakpoints that are shared across cells from the same genetic background. |
Authors: | Rujin Wang, Danyu Lin, Yuchao Jiang |
Maintainer: | Rujin Wang <[email protected]> |
License: | GPL-2 |
Version: | 1.19.0 |
Built: | 2024-10-31 04:55:00 UTC |
Source: | https://github.com/bioc/SCOPE |
Pre-stored coverageObj.scope data for demonstration purposes
coverageObj.scopeDemo
coverageObj.scopeDemo
Pre-computed using whole genome sequencing data of three single cells from 10X Genomics Single-Cell CNV solution
Get bam file directories, sample names, and whole genomic bins from .bed file
get_bam_bed(bamdir, sampname, hgref = "hg19", resolution = 500, sex = FALSE)
get_bam_bed(bamdir, sampname, hgref = "hg19", resolution = 500, sex = FALSE)
bamdir |
vector of the directory of a bam file. Should be in the same
order as sample names in |
sampname |
vector of sample names. Should be in the same order as bam
directories in |
hgref |
reference genome. This should be 'hg19', 'hg38' or 'mm10'.
Default is human genome |
resolution |
numeric value of fixed bin-length. Default is |
sex |
logical, whether to include sex chromosomes.
Default is |
A list with components
bamdir |
A vector of bam directories |
sampname |
A vector of sample names |
ref |
A GRanges object specifying whole genomic bin positions |
Rujin Wang [email protected]
library(WGSmapp) library(BSgenome.Hsapiens.UCSC.hg38) bamfolder <- system.file('extdata', package = 'WGSmapp') bamFile <- list.files(bamfolder, pattern = '*.dedup.bam$') bamdir <- file.path(bamfolder, bamFile) sampname_raw <- sapply(strsplit(bamFile, '.', fixed = TRUE), '[', 1) bambedObj <- get_bam_bed(bamdir = bamdir, sampname = sampname_raw, hgref = "hg38") bamdir <- bambedObj$bamdir sampname_raw <- bambedObj$sampname ref_raw <- bambedObj$ref
library(WGSmapp) library(BSgenome.Hsapiens.UCSC.hg38) bamfolder <- system.file('extdata', package = 'WGSmapp') bamFile <- list.files(bamfolder, pattern = '*.dedup.bam$') bamdir <- file.path(bamfolder, bamFile) sampname_raw <- sapply(strsplit(bamFile, '.', fixed = TRUE), '[', 1) bambedObj <- get_bam_bed(bamdir = bamdir, sampname = sampname_raw, hgref = "hg38") bamdir <- bambedObj$bamdir sampname_raw <- bambedObj$sampname ref_raw <- bambedObj$ref
Get read coverage for each genomic bin across all single cells from scDNA-seq. Blacklist regions, such as segmental duplication regions and gaps near telomeres/centromeres will be masked prior to getting coverage.
get_coverage_scDNA(bambedObj, mapqthres, seq, hgref = "hg19")
get_coverage_scDNA(bambedObj, mapqthres, seq, hgref = "hg19")
bambedObj |
object returned from |
mapqthres |
mapping quality threshold of reads |
seq |
the sequencing method to be used. This should be either 'paired-end' or 'single-end' |
hgref |
reference genome. This should be 'hg19', 'hg38' or 'mm10'.
Default is human genome |
Y |
Read depth matrix |
Rujin Wang [email protected]
library(WGSmapp) library(BSgenome.Hsapiens.UCSC.hg38) bamfolder <- system.file('extdata', package = 'WGSmapp') bamFile <- list.files(bamfolder, pattern = '*.dedup.bam$') bamdir <- file.path(bamfolder, bamFile) sampname_raw <- sapply(strsplit(bamFile, '.', fixed = TRUE), '[', 1) bambedObj <- get_bam_bed(bamdir = bamdir, sampname = sampname_raw, hgref = "hg38") # Getting raw read depth coverageObj <- get_coverage_scDNA(bambedObj, mapqthres = 40, seq = 'paired-end', hgref = "hg38") Y_raw <- coverageObj$Y
library(WGSmapp) library(BSgenome.Hsapiens.UCSC.hg38) bamfolder <- system.file('extdata', package = 'WGSmapp') bamFile <- list.files(bamfolder, pattern = '*.dedup.bam$') bamdir <- file.path(bamfolder, bamFile) sampname_raw <- sapply(strsplit(bamFile, '.', fixed = TRUE), '[', 1) bambedObj <- get_bam_bed(bamdir = bamdir, sampname = sampname_raw, hgref = "hg38") # Getting raw read depth coverageObj <- get_coverage_scDNA(bambedObj, mapqthres = 40, seq = 'paired-end', hgref = "hg38") Y_raw <- coverageObj$Y
Compute GC content for each bin
get_gc(ref, hgref = "hg19")
get_gc(ref, hgref = "hg19")
ref |
GRanges object returned from |
hgref |
reference genome. This should be 'hg19', 'hg38' or 'mm10'.
Default is human genome |
gc |
Vector of GC content for each bin/target |
Rujin Wang [email protected]
## Not run: library(WGSmapp) library(BSgenome.Hsapiens.UCSC.hg38) bamfolder <- system.file('extdata', package = 'WGSmapp') bamFile <- list.files(bamfolder, pattern = '*.dedup.bam$') bamdir <- file.path(bamfolder, bamFile) sampname_raw <- sapply(strsplit(bamFile, '.', fixed = TRUE), '[', 1) bambedObj <- get_bam_bed(bamdir = bamdir, sampname = sampname_raw, hgref = "hg38") bamdir <- bambedObj$bamdir sampname_raw <- bambedObj$sampname ref_raw <- bambedObj$ref gc <- get_gc(ref_raw, hgref = "hg38") ## End(Not run)
## Not run: library(WGSmapp) library(BSgenome.Hsapiens.UCSC.hg38) bamfolder <- system.file('extdata', package = 'WGSmapp') bamFile <- list.files(bamfolder, pattern = '*.dedup.bam$') bamdir <- file.path(bamfolder, bamFile) sampname_raw <- sapply(strsplit(bamFile, '.', fixed = TRUE), '[', 1) bambedObj <- get_bam_bed(bamdir = bamdir, sampname = sampname_raw, hgref = "hg38") bamdir <- bambedObj$bamdir sampname_raw <- bambedObj$sampname ref_raw <- bambedObj$ref gc <- get_gc(ref_raw, hgref = "hg38") ## End(Not run)
Gini index is defined as two times the area between the Lorenz curve and the diagonal.
get_gini(Y)
get_gini(Y)
Y |
raw read depth matrix after quality control procedure |
Gini |
Vector of Gini coefficients for single cells from scDNA-seq |
Rujin Wang [email protected]
Gini <- get_gini(Y_sim)
Gini <- get_gini(Y_sim)
Compute mappability for each bin. Note that scDNA sequencing is whole-genome amplification and the mappability score is essential to determine variable binning method. Mappability track for 100-mers on the GRCh37/hg19 human reference genome from ENCODE is pre-saved. Compute the mean of mappability scores that overlapped reads map to bins, weighted by the width of mappability tracks on the genome reference. Use liftOver utility to calculate mappability for hg38, which is pre-saved as well. For mm10, there are two workarounds: 1) set all mappability to 1 to avoid extensive computation; 2) adopt QC procedures based on annotation results, e.g., filter out bins within black list regions, which generally have low mappability.
get_mapp(ref, hgref = "hg19")
get_mapp(ref, hgref = "hg19")
ref |
GRanges object returned from |
hgref |
reference genome. This should be 'hg19', 'hg38' or 'mm10'.
Default is human genome |
mapp |
Vector of mappability for each bin/target |
Rujin Wang [email protected]
## Not run: library(WGSmapp) library(BSgenome.Hsapiens.UCSC.hg38) bamfolder <- system.file('extdata', package = 'WGSmapp') bamFile <- list.files(bamfolder, pattern = '*.dedup.bam$') bamdir <- file.path(bamfolder, bamFile) sampname_raw <- sapply(strsplit(bamFile, '.', fixed = TRUE), '[', 1) bambedObj <- get_bam_bed(bamdir = bamdir, sampname = sampname_raw, hgref = "hg38") bamdir <- bambedObj$bamdir sampname_raw <- bambedObj$sampname ref_raw <- bambedObj$ref mapp <- get_mapp(ref_raw, hgref = "hg38") ## End(Not run)
## Not run: library(WGSmapp) library(BSgenome.Hsapiens.UCSC.hg38) bamfolder <- system.file('extdata', package = 'WGSmapp') bamFile <- list.files(bamfolder, pattern = '*.dedup.bam$') bamdir <- file.path(bamfolder, bamFile) sampname_raw <- sapply(strsplit(bamFile, '.', fixed = TRUE), '[', 1) bambedObj <- get_bam_bed(bamdir = bamdir, sampname = sampname_raw, hgref = "hg38") bamdir <- bambedObj$bamdir sampname_raw <- bambedObj$sampname ref_raw <- bambedObj$ref mapp <- get_mapp(ref_raw, hgref = "hg38") ## End(Not run)
Perform QC step on single cells.
get_samp_QC(bambedObj)
get_samp_QC(bambedObj)
bambedObj |
object returned from |
QCmetric |
A matrix containing total number/proportion of reads, total number/proportion of mapped reads, total number/proportion of mapped non-duplicate reads, and number/proportion of reads with mapping quality greater than 20 |
Rujin Wang [email protected]
library(WGSmapp) library(BSgenome.Hsapiens.UCSC.hg38) bamfolder <- system.file('extdata', package = 'WGSmapp') bamFile <- list.files(bamfolder, pattern = '*.dedup.bam$') bamdir <- file.path(bamfolder, bamFile) sampname_raw <- sapply(strsplit(bamFile, '.', fixed = TRUE), '[', 1) bambedObj <- get_bam_bed(bamdir = bamdir, sampname = sampname_raw, hgref = "hg38") QCmetric_raw = get_samp_QC(bambedObj)
library(WGSmapp) library(BSgenome.Hsapiens.UCSC.hg38) bamfolder <- system.file('extdata', package = 'WGSmapp') bamFile <- list.files(bamfolder, pattern = '*.dedup.bam$') bamdir <- file.path(bamfolder, bamFile) sampname_raw <- sapply(strsplit(bamFile, '.', fixed = TRUE), '[', 1) bambedObj <- get_bam_bed(bamdir = bamdir, sampname = sampname_raw, hgref = "hg38") QCmetric_raw = get_samp_QC(bambedObj)
A post cross-sample segmentation integer copy number matrix returned by SCOPE in the demo
iCN_sim
iCN_sim
A post cross-sample segmentation integer copy number matrix of five toy cells returned by SCOPE
Pre-estimate ploidies across all cells
initialize_ploidy(Y, Yhat, ref, maxPloidy = 6, minPloidy = 1.5, minBinWidth = 5, SoS.plot = FALSE)
initialize_ploidy(Y, Yhat, ref, maxPloidy = 6, minPloidy = 1.5, minBinWidth = 5, SoS.plot = FALSE)
Y |
raw read depth matrix after quality control procedure |
Yhat |
normalized read depth matrix |
ref |
GRanges object after quality control procedure |
maxPloidy |
maximum ploidy candidate. Defalut is |
minPloidy |
minimum ploidy candidate. Defalut is |
minBinWidth |
the minimum number of bins for a changed segment.
Defalut is |
SoS.plot |
logical, whether to generate ploidy pre-estimation
plots. Default is |
ploidy.SoS |
Vector of pre-estimated ploidies for each cell |
Rujin Wang [email protected]
Gini <- get_gini(Y_sim) # first-pass CODEX2 run with no latent factors normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12)) Yhat.noK.sim <- normObj.sim$Yhat beta.hat.noK.sim <- normObj.sim$beta.hat fGC.hat.noK.sim <- normObj.sim$fGC.hat N.sim <- normObj.sim$N # Ploidy initialization ploidy.sim <- initialize_ploidy(Y = Y_sim, Yhat = Yhat.noK.sim, ref = ref_sim) ploidy.sim
Gini <- get_gini(Y_sim) # first-pass CODEX2 run with no latent factors normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12)) Yhat.noK.sim <- normObj.sim$Yhat beta.hat.noK.sim <- normObj.sim$beta.hat fGC.hat.noK.sim <- normObj.sim$fGC.hat N.sim <- normObj.sim$N # Ploidy initialization ploidy.sim <- initialize_ploidy(Y = Y_sim, Yhat = Yhat.noK.sim, ref = ref_sim) ploidy.sim
Pre-estimate ploidies across cells with shared clonal memberships
initialize_ploidy_group(Y, Yhat, ref, groups, maxPloidy = 6, minPloidy = 1.5, minBinWidth = 5, SoS.plot = FALSE)
initialize_ploidy_group(Y, Yhat, ref, groups, maxPloidy = 6, minPloidy = 1.5, minBinWidth = 5, SoS.plot = FALSE)
Y |
raw read depth matrix after quality control procedure |
Yhat |
normalized read depth matrix |
ref |
GRanges object after quality control procedure |
groups |
clonal membership labels for each cell |
maxPloidy |
maximum ploidy candidate. Defalut is |
minPloidy |
minimum ploidy candidate. Defalut is |
minBinWidth |
the minimum number of bins for a changed segment.
Defalut is |
SoS.plot |
logical, whether to generate ploidy pre-estimation
plots. Default is |
ploidy.SoS |
Vector of group-wise pre-estimated ploidies for each cell |
Rujin Wang [email protected]
Gini <- get_gini(Y_sim) # first-pass CODEX2 run with no latent factors normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12)) Yhat.noK.sim <- normObj.sim$Yhat beta.hat.noK.sim <- normObj.sim$beta.hat fGC.hat.noK.sim <- normObj.sim$fGC.hat N.sim <- normObj.sim$N # Group-wise ploidy initialization clones <- c("normal", "tumor1", "normal", "tumor1", "tumor1") ploidy.sim.group <- initialize_ploidy_group(Y = Y_sim, Yhat = Yhat.noK.sim, ref = ref_sim, groups = clones) ploidy.sim.group
Gini <- get_gini(Y_sim) # first-pass CODEX2 run with no latent factors normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12)) Yhat.noK.sim <- normObj.sim$Yhat beta.hat.noK.sim <- normObj.sim$beta.hat fGC.hat.noK.sim <- normObj.sim$fGC.hat N.sim <- normObj.sim$N # Group-wise ploidy initialization clones <- c("normal", "tumor1", "normal", "tumor1", "tumor1") ploidy.sim.group <- initialize_ploidy_group(Y = Y_sim, Yhat = Yhat.noK.sim, ref = ref_sim, groups = clones) ploidy.sim.group
Assuming that all reads are from diploid regions, fit a Poisson generalized linear model to normalize the raw read depth data from single-cell DNA sequencing, without latent factors under the case-control setting.
normalize_codex2_ns_noK(Y_qc, gc_qc, norm_index)
normalize_codex2_ns_noK(Y_qc, gc_qc, norm_index)
Y_qc |
read depth matrix after quality control |
gc_qc |
vector of GC content for each bin after quality control |
norm_index |
indices of normal/diploid cells |
A list with components
Yhat |
A list of normalized read depth matrix |
fGC.hat |
A list of estimated GC content bias matrix |
beta.hat |
A list of estimated bin-specific bias vector |
N |
A vector of cell-specific library size factor, which is computed from the genome-wide read depth data |
Rujin Wang [email protected]
Gini <- get_gini(Y_sim) # first-pass CODEX2 run with no latent factors normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12))
Gini <- get_gini(Y_sim) # first-pass CODEX2 run with no latent factors normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12))
Fit a Poisson generalized linear model to normalize the raw read depth data from single-cell DNA sequencing, with latent factors under the case-control setting. Model GC content bias using an expectation-maximization algorithm, which accounts for the different copy number states.
normalize_scope(Y_qc, gc_qc, K, norm_index, T, ploidyInt, beta0, minCountQC = 20)
normalize_scope(Y_qc, gc_qc, K, norm_index, T, ploidyInt, beta0, minCountQC = 20)
Y_qc |
read depth matrix after quality control |
gc_qc |
vector of GC content for each bin after quality control |
K |
Number of latent Poisson factors |
norm_index |
indices of normal/diploid cells |
T |
a vector of integers indicating number of CNV groups.
Use BIC to select optimal number of CNV groups. If |
ploidyInt |
a vector of initialized ploidy return
from |
beta0 |
a vector of initialized bin-specific biases returned from CODEX2 without latent factors |
minCountQC |
the minimum read coverage required for
normalization and EM fitting. Defalut is |
A list with components
Yhat |
A list of normalized read depth matrix with EM |
alpha.hat |
A list of absolute copy number matrix |
fGC.hat |
A list of EM estimated GC content bias matrix |
beta.hat |
A list of EM estimated bin-specific bias vector |
g.hat |
A list of estimated Poisson latent factor |
h.hat |
A list of estimated Poisson latent factor |
AIC |
AIC for model selection |
BIC |
BIC for model selection |
RSS |
RSS for model selection |
K |
Number of latent Poisson factors |
Rujin Wang [email protected]
Gini <- get_gini(Y_sim) # first-pass CODEX2 run with no latent factors normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12)) Yhat.noK.sim <- normObj.sim$Yhat beta.hat.noK.sim <- normObj.sim$beta.hat fGC.hat.noK.sim <- normObj.sim$fGC.hat N.sim <- normObj.sim$N # Ploidy initialization ploidy.sim <- initialize_ploidy(Y = Y_sim, Yhat = Yhat.noK.sim, ref = ref_sim) ploidy.sim normObj.scope.sim <- normalize_scope(Y_qc = Y_sim, gc_qc = ref_sim$gc, K = 1, ploidyInt = ploidy.sim, norm_index = which(Gini<=0.12), T = 1:5, beta0 = beta.hat.noK.sim) Yhat.sim <- normObj.scope.sim$Yhat[[which.max(normObj.scope.sim$BIC)]] fGC.hat.sim <- normObj.scope.sim$fGC.hat[[which.max(normObj.scope.sim$BIC)]]
Gini <- get_gini(Y_sim) # first-pass CODEX2 run with no latent factors normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12)) Yhat.noK.sim <- normObj.sim$Yhat beta.hat.noK.sim <- normObj.sim$beta.hat fGC.hat.noK.sim <- normObj.sim$fGC.hat N.sim <- normObj.sim$N # Ploidy initialization ploidy.sim <- initialize_ploidy(Y = Y_sim, Yhat = Yhat.noK.sim, ref = ref_sim) ploidy.sim normObj.scope.sim <- normalize_scope(Y_qc = Y_sim, gc_qc = ref_sim$gc, K = 1, ploidyInt = ploidy.sim, norm_index = which(Gini<=0.12), T = 1:5, beta0 = beta.hat.noK.sim) Yhat.sim <- normObj.scope.sim$Yhat[[which.max(normObj.scope.sim$BIC)]] fGC.hat.sim <- normObj.scope.sim$fGC.hat[[which.max(normObj.scope.sim$BIC)]]
Fit a Poisson generalized linear model to normalize the raw read depth data from single-cell DNA sequencing, with latent factors under the case-control setting. Model GC content bias using an expectation-maximization algorithm, which accounts for the different copy number states.
normalize_scope_foreach(Y_qc, gc_qc, K, norm_index, T, ploidyInt, beta0, minCountQC = 20, nCores = NULL)
normalize_scope_foreach(Y_qc, gc_qc, K, norm_index, T, ploidyInt, beta0, minCountQC = 20, nCores = NULL)
Y_qc |
read depth matrix after quality control |
gc_qc |
vector of GC content for each bin after quality control |
K |
Number of latent Poisson factors |
norm_index |
indices of normal/diploid cells |
T |
a vector of integers indicating number of CNV groups.
Use BIC to select optimal number of CNV groups. If |
ploidyInt |
a vector of initialized ploidy return
from |
beta0 |
a vector of initialized bin-specific biases returned from CODEX2 without latent factors |
minCountQC |
the minimum read coverage required for normalization
and EM fitting. Defalut is |
nCores |
number of cores to use. If |
A list with components
Yhat |
A list of normalized read depth matrix with EM |
alpha.hat |
A list of absolute copy number matrix |
fGC.hat |
A list of EM estimated GC content bias matrix |
beta.hat |
A list of EM estimated bin-specific bias vector |
g.hat |
A list of estimated Poisson latent factor |
h.hat |
A list of estimated Poisson latent factor |
AIC |
AIC for model selection |
BIC |
BIC for model selection |
RSS |
RSS for model selection |
K |
Number of latent Poisson factors |
Rujin Wang [email protected]
Gini <- get_gini(Y_sim) # first-pass CODEX2 run with no latent factors normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12)) Yhat.noK.sim <- normObj.sim$Yhat beta.hat.noK.sim <- normObj.sim$beta.hat fGC.hat.noK.sim <- normObj.sim$fGC.hat N.sim <- normObj.sim$N # Ploidy initialization ploidy.sim <- initialize_ploidy(Y = Y_sim, Yhat = Yhat.noK.sim, ref = ref_sim) ploidy.sim # Specify nCores = 2 only for checking examples normObj.scope.sim <- normalize_scope_foreach(Y_qc = Y_sim, gc_qc = ref_sim$gc, K = 1, ploidyInt = ploidy.sim, norm_index = which(Gini<=0.12), T = 1:5, beta0 = beta.hat.noK.sim, nCores = 2) Yhat.sim <- normObj.scope.sim$Yhat[[which.max(normObj.scope.sim$BIC)]] fGC.hat.sim <- normObj.scope.sim$fGC.hat[[which.max(normObj.scope.sim$BIC)]]
Gini <- get_gini(Y_sim) # first-pass CODEX2 run with no latent factors normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12)) Yhat.noK.sim <- normObj.sim$Yhat beta.hat.noK.sim <- normObj.sim$beta.hat fGC.hat.noK.sim <- normObj.sim$fGC.hat N.sim <- normObj.sim$N # Ploidy initialization ploidy.sim <- initialize_ploidy(Y = Y_sim, Yhat = Yhat.noK.sim, ref = ref_sim) ploidy.sim # Specify nCores = 2 only for checking examples normObj.scope.sim <- normalize_scope_foreach(Y_qc = Y_sim, gc_qc = ref_sim$gc, K = 1, ploidyInt = ploidy.sim, norm_index = which(Gini<=0.12), T = 1:5, beta0 = beta.hat.noK.sim, nCores = 2) Yhat.sim <- normObj.scope.sim$Yhat[[which.max(normObj.scope.sim$BIC)]] fGC.hat.sim <- normObj.scope.sim$fGC.hat[[which.max(normObj.scope.sim$BIC)]]
Fit a Poisson generalized linear model to normalize the raw read depth data from single-cell DNA sequencing, with latent factors and shared clonal memberships. Model GC content bias using an expectation-maximization algorithm, which accounts for clonal specific copy number states.
normalize_scope_group(Y_qc, gc_qc, K, norm_index, groups, T, ploidyInt, beta0, minCountQC = 20)
normalize_scope_group(Y_qc, gc_qc, K, norm_index, groups, T, ploidyInt, beta0, minCountQC = 20)
Y_qc |
read depth matrix after quality control |
gc_qc |
vector of GC content for each bin after quality control |
K |
Number of latent Poisson factors |
norm_index |
indices of normal/diploid cells using group/clone labels |
groups |
clonal membership labels for each cell |
T |
a vector of integers indicating number of CNV groups.
Use BIC to select optimal number of CNV groups. If |
ploidyInt |
a vector of group-wise initialized ploidy return
from |
beta0 |
a vector of initialized bin-specific biases returned from CODEX2 without latent factors |
minCountQC |
the minimum read coverage required for
normalization and EM fitting. Defalut is |
A list with components
Yhat |
A list of normalized read depth matrix with EM |
alpha.hat |
A list of absolute copy number matrix |
fGC.hat |
A list of EM estimated GC content bias matrix |
beta.hat |
A list of EM estimated bin-specific bias vector |
g.hat |
A list of estimated Poisson latent factor |
h.hat |
A list of estimated Poisson latent factor |
AIC |
AIC for model selection |
BIC |
BIC for model selection |
RSS |
RSS for model selection |
K |
Number of latent Poisson factors |
Rujin Wang [email protected]
Gini <- get_gini(Y_sim) # first-pass CODEX2 run with no latent factors normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12)) Yhat.noK.sim <- normObj.sim$Yhat beta.hat.noK.sim <- normObj.sim$beta.hat fGC.hat.noK.sim <- normObj.sim$fGC.hat N.sim <- normObj.sim$N # Group-wise ploidy initialization clones <- c("normal", "tumor1", "normal", "tumor1", "tumor1") ploidy.sim.group <- initialize_ploidy_group(Y = Y_sim, Yhat = Yhat.noK.sim, ref = ref_sim, groups = clones) ploidy.sim.group normObj.scope.sim.group <- normalize_scope_group(Y_qc = Y_sim, gc_qc = ref_sim$gc, K = 1, ploidyInt = ploidy.sim.group, norm_index = which(clones=="normal"), groups = clones, T = 1:5, beta0 = beta.hat.noK.sim) Yhat.sim.group <- normObj.scope.sim.group$Yhat[[which.max( normObj.scope.sim.group$BIC)]] fGC.hat.sim.group <- normObj.scope.sim.group$fGC.hat[[which.max( normObj.scope.sim.group$BIC)]]
Gini <- get_gini(Y_sim) # first-pass CODEX2 run with no latent factors normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12)) Yhat.noK.sim <- normObj.sim$Yhat beta.hat.noK.sim <- normObj.sim$beta.hat fGC.hat.noK.sim <- normObj.sim$fGC.hat N.sim <- normObj.sim$N # Group-wise ploidy initialization clones <- c("normal", "tumor1", "normal", "tumor1", "tumor1") ploidy.sim.group <- initialize_ploidy_group(Y = Y_sim, Yhat = Yhat.noK.sim, ref = ref_sim, groups = clones) ploidy.sim.group normObj.scope.sim.group <- normalize_scope_group(Y_qc = Y_sim, gc_qc = ref_sim$gc, K = 1, ploidyInt = ploidy.sim.group, norm_index = which(clones=="normal"), groups = clones, T = 1:5, beta0 = beta.hat.noK.sim) Yhat.sim.group <- normObj.scope.sim.group$Yhat[[which.max( normObj.scope.sim.group$BIC)]] fGC.hat.sim.group <- normObj.scope.sim.group$fGC.hat[[which.max( normObj.scope.sim.group$BIC)]]
Pre-stored normObj.scope data for demonstration purposes
normObj.scopeDemo
normObj.scopeDemo
Pre-computed by SCOPE using pre-stored data Y_sim
Perform QC step on single cells and bins.
perform_qc(Y_raw, sampname_raw, ref_raw, QCmetric_raw, cov_thresh = 0, minCountQC = 20, mapq20_thresh = 0.3, mapp_thresh = 0.9, gc_thresh = c(20, 80), nMAD = 3)
perform_qc(Y_raw, sampname_raw, ref_raw, QCmetric_raw, cov_thresh = 0, minCountQC = 20, mapq20_thresh = 0.3, mapp_thresh = 0.9, gc_thresh = c(20, 80), nMAD = 3)
Y_raw |
raw read count matrix returned
from |
sampname_raw |
sample names for quality control returned
from |
ref_raw |
raw GRanges object with corresponding GC content
and mappability for quality control returned from
|
QCmetric_raw |
a QC metric for single cells returned from
|
cov_thresh |
scalar variable specifying the lower bound of read count
summation of each cell. Default is |
minCountQC |
the minimum read coverage required for
normalization and EM fitting. Defalut is |
mapq20_thresh |
scalar variable specifying the lower threshold
of proportion of reads with mapping quality greater than 20.
Default is |
mapp_thresh |
scalar variable specifying mappability of
each genomic bin. Default is |
gc_thresh |
vector specifying the lower and upper bound of
GC content threshold for quality control. Default is |
nMAD |
scalar variable specifying the number of MAD from the median
of total read counts adjusted by library size for each cell.
Default is |
A list with components
Y |
read depth matrix after quality control |
sampname |
sample names after quality control |
ref |
A GRanges object specifying whole genomic bin positions after quality control |
QCmetric |
A data frame of QC metric for single cells after quality control |
Rujin Wang [email protected]
Y_raw <- coverageObj.scopeDemo$Y sampname_raw <- rownames(QCmetric.scopeDemo) ref_raw <- ref.scopeDemo QCmetric_raw <- QCmetric.scopeDemo qcObj <- perform_qc(Y_raw = Y_raw, sampname_raw = sampname_raw, ref_raw = ref_raw, QCmetric_raw = QCmetric_raw)
Y_raw <- coverageObj.scopeDemo$Y sampname_raw <- rownames(QCmetric.scopeDemo) ref_raw <- ref.scopeDemo QCmetric_raw <- QCmetric.scopeDemo qcObj <- perform_qc(Y_raw = Y_raw, sampname_raw = sampname_raw, ref_raw = ref_raw, QCmetric_raw = QCmetric_raw)
A pdf file containing EM fitting results and plots is generated.
plot_EM_fit(Y_qc, gc_qc, norm_index, T, ploidyInt, beta0, minCountQC = 20, filename)
plot_EM_fit(Y_qc, gc_qc, norm_index, T, ploidyInt, beta0, minCountQC = 20, filename)
Y_qc |
read depth matrix across all cells after quality control |
gc_qc |
vector of GC content for each bin after quality control |
norm_index |
indices of normal/diploid cells |
T |
a vector of integers indicating number of CNV groups.
Use BIC to select optimal number of CNV groups.
If |
ploidyInt |
a vector of initialized ploidy return from
|
beta0 |
a vector of initialized bin-specific biases returned from CODEX2 without latent factors |
minCountQC |
the minimum read coverage required for EM fitting.
Defalut is |
filename |
the name of output pdf file |
pdf file with EM fitting results and two plots: log likelihood, and BIC versus the number of CNV groups.
Rujin Wang [email protected]
Gini <- get_gini(Y_sim) # first-pass CODEX2 run with no latent factors normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12)) Yhat.noK.sim <- normObj.sim$Yhat beta.hat.noK.sim <- normObj.sim$beta.hat fGC.hat.noK.sim <- normObj.sim$fGC.hat N.sim <- normObj.sim$N # Ploidy initialization ploidy.sim <- initialize_ploidy(Y = Y_sim, Yhat = Yhat.noK.sim, ref = ref_sim) ploidy.sim plot_EM_fit(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12), T = 1:7, ploidyInt = ploidy.sim, beta0 = beta.hat.noK.sim, filename = 'plot_EM_fit_demo.pdf')
Gini <- get_gini(Y_sim) # first-pass CODEX2 run with no latent factors normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12)) Yhat.noK.sim <- normObj.sim$Yhat beta.hat.noK.sim <- normObj.sim$beta.hat fGC.hat.noK.sim <- normObj.sim$fGC.hat N.sim <- normObj.sim$N # Ploidy initialization ploidy.sim <- initialize_ploidy(Y = Y_sim, Yhat = Yhat.noK.sim, ref = ref_sim) ploidy.sim plot_EM_fit(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12), T = 1:7, ploidyInt = ploidy.sim, beta0 = beta.hat.noK.sim, filename = 'plot_EM_fit_demo.pdf')
Show heatmap of inferred integer copy-number profiles by SCOPE with cells clustered by hierarchical clustering
plot_iCN(iCNmat, ref, Gini, annotation = NULL, plot.dendrogram = TRUE, show.names = FALSE, filename)
plot_iCN(iCNmat, ref, Gini, annotation = NULL, plot.dendrogram = TRUE, show.names = FALSE, filename)
iCNmat |
inferred integer copy-number matrix by SCOPE, with each column being a cell and each row being a genomic bin |
ref |
GRanges object after quality control procedure |
Gini |
vector of Gini coefficients for each cell,
with the same order as that of cells in columns of |
annotation |
vector of annotation for each cell,
with the same order as that of cells in columns of |
plot.dendrogram |
logical, whether to plot the dendrogram.
Default is |
show.names |
logical, whether to show cell names by y axis.
Default is |
filename |
name of the output png file |
png file with integer copy-number profiles across single cells with specified annotations
Rujin Wang [email protected]
Gini <- get_gini(Y_sim) plot_iCN(iCNmat = iCN_sim, ref = ref_sim, Gini = Gini, filename = 'plot_iCN_demo')
Gini <- get_gini(Y_sim) plot_iCN(iCNmat = iCN_sim, ref = ref_sim, Gini = Gini, filename = 'plot_iCN_demo')
Pre-stored QCmetric data for demonstration purposes
QCmetric.scopeDemo
QCmetric.scopeDemo
Pre-computed using whole genome sequencing data of three single cells from 10X Genomics Single-Cell CNV solution
A reference genome in the toy dataset
ref_sim
ref_sim
A GRanges object with 1544 bins and 1 metadata column of GC content
Pre-stored 500kb-size reference genome for demonstration purposes
ref.scopeDemo
ref.scopeDemo
Pre-computed using whole genome sequencing data with GC content and mappability scores
SCOPE offers a cross-sample Poisson likelihood-based recursive segmentation, enabling shared breakpoints across cells from the same genetic background.
segment_CBScs(Y, Yhat, sampname, ref, chr, mode = "integer", max.ns)
segment_CBScs(Y, Yhat, sampname, ref, chr, mode = "integer", max.ns)
Y |
raw read depth matrix after quality control procedure |
Yhat |
normalized read depth matrix |
sampname |
vector of sample names |
ref |
GRanges object after quality control procedure |
chr |
chromosome name. Make sure it is consistent with the reference genome. |
mode |
format of returned copy numbers. Only integer mode is supported for scDNA-seq data. |
max.ns |
a number specifying how many rounds of nested structure
searching would be performed. Defalut is |
A list with components
poolcall |
Cross-sample CNV callings indicating shared breakpoints |
finalcall |
Final cross-sample segmented callset of CNVs with genotyping results |
image.orig |
A matrix giving logarithm of normalized z-scores |
image.seg |
A matrix of logarithm of estimated copy number over 2 |
iCN |
A matrix of inferred integer copy number profiles |
Rujin Wang [email protected]
Yhat.sim <- normObj.scopeDemo$Yhat[[which.max(normObj.scopeDemo$BIC)]] segment_cs_chr1 <- segment_CBScs(Y = Y_sim, Yhat = Yhat.sim, sampname = colnames(Y_sim), ref = ref_sim, chr = 'chr1', max.ns = 1)
Yhat.sim <- normObj.scopeDemo$Yhat[[which.max(normObj.scopeDemo$BIC)]] segment_cs_chr1 <- segment_CBScs(Y = Y_sim, Yhat = Yhat.sim, sampname = colnames(Y_sim), ref = ref_sim, chr = 'chr1', max.ns = 1)
A read count matrix in the toy dataset
Y_sim
Y_sim
A read count matrix with 1544 bins and 39 cells