Package 'cardelino' reference manual

Title:	Clone Identification from Single Cell Data
Description:	Methods to infer clonal tree configuration for a population of cells using single-cell RNA-seq data (scRNA-seq), and possibly other data modalities. Methods are also provided to assign cells to inferred clones and explore differences in gene expression between clones. These methods can flexibly integrate information from imperfect clonal trees inferred based on bulk exome-seq data, and sparse variant alleles expressed in scRNA-seq data. A flexible beta-binomial error model that accounts for stochastic dropout events as well as systematic allelic imbalance is used.
Authors:	Jeffrey Pullin [aut], Yuanhua Huang [aut], Davis McCarthy [aut, cre]
Maintainer:	Davis McCarthy <[email protected]>
License:	GPL-3
Version:	1.9.0
Built:	2025-03-25 06:09:01 UTC
Source:	https://github.com/bioc/cardelino

A matrix of read numbers of alternative alleles for clone ID

Description

This matrix contains read numbers of alternative alleles for 34 somatic variants across 428 cells, from one example scRNA-seq sample

Usage

example_donor
example_donor

Format

a matrix of float

Value

NULL, but makes available a matrix

Author(s)

Yuanhua Huang, Davis McCarthy, 2018-06-25

Source

A fibroblast sample from HipSci project

A matrix of read numbers of alternative alleles

Description

This matrix contains read numbers of alternative alleles for 34 germline variants (near the somatic variants) across 428 cells, from one example scRNA-seq sample

Usage

example_donor
example_donor

Format

a matrix of float

Value

NULL, but makes available a matrix

Author(s)

Yuanhua Huang, Davis McCarthy, 2018-06-25

Source

A fibroblast sample from HipSci project

Assign cells to clones from cardelino results

Description

Assign cells to clones from cardelino results

Usage

assign_cells_to_clones(prob_mat, threshold = 0.5)
assign_cells_to_clones(prob_mat, threshold = 0.5)

Arguments

`prob_mat`	numeric matrix (cells x clones) of clone posterior probabilities as output by `clone_id`
`threshold`	numeric(1), posterior probability threshold for cell-clone assignment: if posterior probability is above threshold, assign cell to clone, otherwise leave cell "unassigned"

Value

a data.frame with cell ID, assigned clone label and maximum posterior probability across clones.

Author(s)

Davis McCarthy

Examples

data(example_donor)
assignments <- clone_id(A_clone, D_clone, Config = tree$Z, inference = "EM")
df <- assign_cells_to_clones(assignments$prob)
head(df)
table(df$clone)
data(example_donor)
assignments <- clone_id(A_clone, D_clone, Config = tree$Z, inference = "EM")
df <- assign_cells_to_clones(assignments$prob)
head(df)
table(df$clone)

Scoring the simulation in assignment of singlets and doublets

Description

Scoring the simulation in assignment of singlets and doublets

Usage

assign_scores(prob, I_sim, cutoff = seq(0, 1, 0.001))
assign_scores(prob, I_sim, cutoff = seq(0, 1, 0.001))

Arguments

`prob`	Probability matrix for each cell to each component
`I_sim`	The true identity of assignment from simulation
`cutoff`	A list of cutoffs from 0 to 1

Value

A list with components: df_sg, the recall/precision data.frame calculated by multiPRC(), AUC_sg, the AUC calculated by multiPRC(), df_db, the recall/precision data.frame calculated by binaryPRC() and AUC_db the AUC calculated by binaryPRC(). Note that multiPRC() is run on a multiclass version of the problem and binaryPRC is run on a binarised version of the problem.

Precision-recall curve for binary label prediction

Description

Precision-recall curve for binary label prediction

Usage

binaryPRC(
  scores,
  labels,
  cutoff = NULL,
  cut_direction = ">=",
  add_cut1 = FALSE,
  empty_precision = 1
)
binaryPRC(
  scores,
  labels,
  cutoff = NULL,
  cut_direction = ">=",
  add_cut1 = FALSE,
  empty_precision = 1
)

Arguments

`scores`	Prediction score for each sample
`labels`	True labels for each sample, e.g., from simulation
`cutoff`	A vector of cutoffs; if NULL use all unique scores
`cut_direction`	A string to compare with cutoff: >=, >, <=, <
`add_cut1`	Logical value; if True, manually add a cutoff of 1
`empty_precision`	Float value for default precision if no any recall

Value

A data.frame containing recall and precision values at various cutoffs.

Examples

scores <- 1:10
labels <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
binaryPRC(scores, labels)

# Extra arguments. 
binaryPRC(scores, labels, cutoff = seq(1, 10, by = 2))
binaryPRC(scores, labels, cut_direction = ">")
binaryPRC(scores, labels, add_cut1 = TRUE)

scores <- 1:10
labels <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
binaryPRC(scores, labels)

# Extra arguments. 
binaryPRC(scores, labels, cutoff = seq(1, 10, by = 2))
binaryPRC(scores, labels, cut_direction = ">")
binaryPRC(scores, labels, add_cut1 = TRUE)

ROC curve for binary label prediction

Description

ROC curve for binary label prediction

Usage

binaryROC(
  scores,
  labels,
  cutoff = NULL,
  cut_direction = ">=",
  add_cut1 = TRUE,
  cutoff_point = 0.9
)
binaryROC(
  scores,
  labels,
  cutoff = NULL,
  cut_direction = ">=",
  add_cut1 = TRUE,
  cutoff_point = 0.9
)

Arguments

`scores`	Prediction score for each sample
`labels`	True labels for each sample, e.g., from simulation
`cutoff`	A vector of cutoffs; if NULL use all unique scores
`cut_direction`	A string to compare with cutoff: >=, >, <=, <
`add_cut1`	Logical value; if True, manually add a cutoff of 1
`cutoff_point`	Numeric value; additional cutoff value

Value

A data.frame containing AUC and AUPRC at various cutoffs.

Examples

scores <- 1:10
labels <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
binaryROC(scores, labels)

# Extra arguments. 
binaryROC(scores, labels, cutoff = seq(1, 10, by = 2))
binaryROC(scores, labels, cut_direction = ">")
binaryROC(scores, labels, add_cut1 = TRUE)

scores <- 1:10
labels <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
binaryROC(scores, labels)

# Extra arguments. 
binaryROC(scores, labels, cutoff = seq(1, 10, by = 2))
binaryROC(scores, labels, cut_direction = ">")
binaryROC(scores, labels, add_cut1 = TRUE)

Infer clonal identity of single cells

Description

Infer clonal identity of single cells

Assign cells to clones using an EM algorithm

Assign cells to clones using a Gibbs sampling algorithm

Usage

clone_id(
  A,
  D,
  Config = NULL,
  n_clone = NULL,
  Psi = NULL,
  relax_Config = TRUE,
  relax_rate_fixed = NULL,
  inference = "sampling",
  n_chain = 1,
  n_proc = 1,
  verbose = TRUE,
  ...
)

clone_id_EM(
  A,
  D,
  Config,
  Psi = NULL,
  min_iter = 10,
  max_iter = 1000,
  logLik_threshold = 1e-05,
  verbose = TRUE
)

clone_id_Gibbs(
  A,
  D,
  Config,
  Psi = NULL,
  relax_Config = TRUE,
  relax_rate_fixed = NULL,
  relax_rate_prior = c(1, 9),
  keep_base_clone = TRUE,
  prior0 = c(0.2, 99.8),
  prior1 = c(0.45, 0.55),
  min_iter = 5000,
  max_iter = 20000,
  buin_frac = 0.5,
  wise = "variant",
  relabel = FALSE,
  verbose = TRUE
)
clone_id(
  A,
  D,
  Config = NULL,
  n_clone = NULL,
  Psi = NULL,
  relax_Config = TRUE,
  relax_rate_fixed = NULL,
  inference = "sampling",
  n_chain = 1,
  n_proc = 1,
  verbose = TRUE,
  ...
)

clone_id_EM(
  A,
  D,
  Config,
  Psi = NULL,
  min_iter = 10,
  max_iter = 1000,
  logLik_threshold = 1e-05,
  verbose = TRUE
)

clone_id_Gibbs(
  A,
  D,
  Config,
  Psi = NULL,
  relax_Config = TRUE,
  relax_rate_fixed = NULL,
  relax_rate_prior = c(1, 9),
  keep_base_clone = TRUE,
  prior0 = c(0.2, 99.8),
  prior1 = c(0.45, 0.55),
  min_iter = 5000,
  max_iter = 20000,
  buin_frac = 0.5,
  wise = "variant",
  relabel = FALSE,
  verbose = TRUE
)

Arguments

`A`	variant x cell matrix of integers; number of alternative allele reads in variant i cell j
`D`	variant x cell matrix of integers; number of total reads covering variant i cell j
`Config`	variant x clone matrix of binary values. The clone-variant configuration, which encodes the phylogenetic tree structure. This is the output Z of Canopy
`n_clone`	integer(1), the number of clone to reconstruct. This is in use only if Config is NULL
`Psi`	A vector of float. The fractions of each clone, output P of Canopy
`relax_Config`	logical(1), If TRUE, relaxing the Clone Configuration by changing it from fixed value to act as a prior Config with a relax rate.
`relax_rate_fixed`	numeric(1), If the value is between 0 to 1, the relax rate will be set as a fix value during updating clone Config. If NULL, the relax rate will be learned automatically with relax_rate_prior.
`inference`	character(1), the method to use for inference, either "sampling" to use Gibbs sampling (default) or "EM" to use expectation-maximization (faster)
`n_chain`	integer(1), the number of chains to run, which will be averaged as an output result
`n_proc`	integer(1), the number of processors to use. This parallel computing can largely reduce time when using multiple chains
`verbose`	logical(1), should the function output verbose information as it runs?
`...`	arguments passed to `clone_id_Gibbs` or `clone_id_EM` (as appropriate)
`min_iter`	A integer. The minimum number of iterations in the Gibbs sampling. The real iteration may be longer until the convergence.
`max_iter`	A integer. The maximum number of iterations in the Gibbs sampling, even haven't passed the convergence diagnosis
`logLik_threshold`	A float. The threshold of logLikelihood increase for detecting convergence.
`relax_rate_prior`	numeric(2), the two parameters of beta prior distribution of the relax rate for relaxing the clone Configuration. This mode is used when relax_relax is NULL.
`keep_base_clone`	bool(1), if TRUE, keep the base clone of Config to its input values when relax mode is used.
`prior0`	numeric(2), alpha and beta parameters for the Beta prior distribution on the inferred false positive rate.
`prior1`	numeric(2), alpha and beta parameters for the Beta prior distribution on the inferred (1 - false negative) rate.
`buin_frac`	numeric(1), the fraction of chain as burn-in period
`wise`	A string, the wise of parameters for theta1: global, variant, element.
`relabel`	bool(1), if TRUE, relabel the samples of both Config and prob during the Gibbs sampling.

Details

The two Bernoulli components correspond to false positive and false negative rates. The two binomial components correspond to the read distributions with and without the mutation present.

Value

If inference method is "EM", a list containing theta, a vector of two floats denoting the parameters of the two components of the base model, i.e., mean of Bernoulli or binomial model given variant exists or not, prob, the matrix of posterior probabilities of each cell belonging to each clone with fitted parameters, and logLik, the log likelihood of the final parameters.

If inference method is "sampling", a list containing: theta0, the mean of sampled false positive parameter values; theta1 the mean of sampled (1 - false negative rate) parameter values; theta0_all, all sampled false positive parameter values; theta1_all, all sampled (1 - false negative rate) parameter values; element; logLik_all, log-likelihood for model for all sampled parameter sets; prob_all; prob, matrix with mean of sampled cell-clone assignment posterior probabilities (the key output of the model); prob_variant.

a list containing theta, a vector of two floats denoting the binomial rates given variant exists or not, prob, the matrix of posterior probabilities of each cell belonging to each clone with fitted parameters, and logLik, the log likelihood of the final parameters.

Author(s)

Yuanhua Huang and Davis McCarthy

Yuanhua Huang

Examples

data(example_donor)
assignments <- clone_id(A_clone, D_clone,
    Config = tree$Z,
    min_iter = 800, max_iter = 1200
)
prob_heatmap(assignments$prob)

assignments_EM <- clone_id(A_clone, D_clone,
    Config = tree$Z,
    inference = "EM"
)
prob_heatmap(assignments_EM$prob)
data(example_donor)
assignments <- clone_id(A_clone, D_clone,
    Config = tree$Z,
    min_iter = 800, max_iter = 1200
)
prob_heatmap(assignments$prob)

assignments_EM <- clone_id(A_clone, D_clone,
    Config = tree$Z,
    inference = "EM"
)
prob_heatmap(assignments_EM$prob)

Column match between two matrices by minimum mean absolute difference

Description

Column match between two matrices by minimum mean absolute difference

Usage

colMatch(A, B, force = FALSE)
colMatch(A, B, force = FALSE)

Arguments

`A`	The first matrix which will be matched
`B`	The second matrix, the return index will be used on
`force`	bool(1), If TRUE, force traversing all permutations of B to find the optimised match to A with computing cost of O(n!). Otherwise, use greedy search with computing cost of O(n^2).

Value

idx, the column index of B to be matched to A

Examples

matA <- matrix(sample(seq(12)), nrow = 3)
col_idx <- sample(4)
matB <- matA[, col_idx]
colMatch(matB, matA)
matA <- matrix(sample(seq(12)), nrow = 3)
col_idx <- sample(4)
matB <- matA[, col_idx]
colMatch(matB, matA)

A list of tree configuration

Description

This list of tree configuration between 3 clones to 10 clones, each element is a list with all possible tree matrix

Usage

config_all
config_all

Format

a list of list of matrix

Value

NULL, but makes available a list

Author(s)

Yuanhua Huang, Davis McCarthy, 2018-06-25

Source

PASTRI Python package

A matrix of sequencing depths for clone ID

Description

This matrix contains sequencing depths for 34 somatic variants across 428 cells, from one example scRNA-seq sample

Usage

example_donor
example_donor

Format

a matrix of float

Value

NULL, but makes available a matrix

Author(s)

Yuanhua Huang, Davis McCarthy, 2018-06-25

Source

A fibroblast sample from HipSci project

A matrix of sequencing depths

Description

This matrix contains sequencing depths for 34 germline variants (near the somatic variants) across 428 cells, from one example scRNA-seq sample

Usage

example_donor
example_donor

Format

a matrix of float

Value

NULL, but makes available a matrix

Author(s)

Yuanhua Huang, Davis McCarthy, 2018-06-25

Source

A fibroblast sample from HipSci project

A matrix of sequencing depths

Description

This matrix contains sequencing depths for 439 somatic variants across 151 cells, from one particular scRNA-seq sample, can be used to generate sequencing depths

Usage

simulation_input
simulation_input

Format

a matrix of float

Value

NULL, but makes available a matrix

Author(s)

Yuanhua Huang, Davis McCarthy, 2018-06-25

Source

A fibroblast sample from HipSci project

Deviance Information Criterion for cardelino model

Description

Deviance Information Criterion for cardelino model

Usage

devianceIC(logLik_all, logLik_post)
devianceIC(logLik_all, logLik_post)

Arguments

`logLik_all`	A vector of numeric; the log likelihood of posterior sample, i.e., posterior samples of deviance
`logLik_post`	numeric(1); the log likelihood of mean posterior parameters, i.e., deviance of posterior means

Value

DIC, a float of deviance information criterion

Author(s)

Yuanhua Huang

Reads simulator for donor identification

Description

Reads simulator for donor identification

Usage

donor_read_simulator(
  GT,
  D_seed,
  sample_variants = FALSE,
  donor_size = NULL,
  beta_shapes = NULL,
  n_cell = 5000,
  doublet_rate = NULL
)
donor_read_simulator(
  GT,
  D_seed,
  sample_variants = FALSE,
  donor_size = NULL,
  beta_shapes = NULL,
  n_cell = 5000,
  doublet_rate = NULL
)

Arguments

`GT`	Variant-by-donor matrix for genotypes
`D_seed`	Variant-by-cell matrix for read coverage for generating depth, which be row sample and column sample both with replacement
`sample_variants`	logical(1), if TRUE, sample variants with replacement to the same size, otherwise not
`donor_size`	Vector of float for the fractions of each donor; default NULL means uniform
`beta_shapes`	A 3-by-2 matrix of beta parameters for genotypes: 0, 1, and 2; default NULL means matrix(c(0.2, 0.5, 99.8, 99.8, 0.5, 0.2), nrow = 3)
`n_cell`	An integer for number of total cells
`doublet_rate`	A float from 0 to 1 for doublet rate; default NULL means rate n_cell / 100000

Value

A list of various components of the simulated dataset.

Log likelihood of clone_id model It returns P(A, D | C, I, theta0, theta1)

Description

Log likelihood of clone_id model It returns P(A, D | C, I, theta0, theta1)

Usage

get_logLik(A1, B1, Config, Assign, theta0, theta1)
get_logLik(A1, B1, Config, Assign, theta0, theta1)

Arguments

`A1`	variant x cell matrix of integers; number of alternative allele reads in variant i cell j
`B1`	variant x cell matrix of integers; number of reference allele reads in variant i cell j
`Config`	variant x clone matrix of float values. The clone-variant configuration probability, averaged by posterior samples
`Assign`	cells x clone matrix of float values. The cell-clone assignment probability, averaged by posterior samples
`theta0`	the binomial rate for alternative allele from config = 0
`theta1`	the binomial rate for alternative allele from config = 1

Value

logLik, a float of log likelihood

Author(s)

Yuanhua Huang

Get SNP data matrices from VCF object(s)

Description

Get SNP data matrices from VCF object(s)

Usage

get_snp_matrices(vcf_cell, vcf_donor = NULL, verbose = TRUE, donors = NULL)
get_snp_matrices(vcf_cell, vcf_donor = NULL, verbose = TRUE, donors = NULL)

Arguments

`vcf_cell`	a `CollapsedVCF` object containing variant data for cells
`vcf_donor`	an optional `CollapsedVCF` object containing genotype data for donors
`verbose`	logical(1), should the function output verbose information as it runs?
`donors`	optional character vector providing a set of donors to use, by subsetting the donors present in the `donor_vcf_file`; if `NULL` (default) then all donors present in VCF will be used.

Value

a list containing A, a matrix of integers. Number of alteration reads in SNP i cell j. D, a matrix of integers. Number of reads depth in SNP i cell j. R, a matrix of integers. Number of reference reads in SNP i cell j. GT_cells, a matrix of integers for genotypes. The cell-SNP configuration. GT_donors, a matrix of integers for genotypes. The donor-SNP configuration.

Examples

vcf_cell <- read_vcf(system.file("extdata", "cells.donorid.vcf.gz",
                     package = "cardelino"))
vcf_donor <-  read_vcf(system.file("extdata", "donors.donorid.vcf.gz",
                       package = "cardelino"))
snp_data <- get_snp_matrices(vcf_cell, vcf_donor)

vcf_cell <- read_vcf(system.file("extdata", "cells.donorid.vcf.gz",
                     package = "cardelino"))
vcf_donor <-  read_vcf(system.file("extdata", "donors.donorid.vcf.gz",
                       package = "cardelino"))
snp_data <- get_snp_matrices(vcf_cell, vcf_donor)

Get a clonal tree from a configuration matrix

Description

Get a clonal tree from a configuration matrix

Usage

get_tree(Config, P = NULL, strictness = "lax")
get_tree(Config, P = NULL, strictness = "lax")

Arguments

`Config`	variant x clone matrix of binary values. The clone-variant configuration, which encodes the phylogenetic tree structure. This is the output Z of Canopy
`P`	a one-column numeric matrix encoding the (observed or estimated) prevalence (or frequency) of each clone
`strictness`	character(1), a character string defining the strictness of the function if there are all-zero rows in the Config matrix. If `"lax"` then the function silently drops all-zero rows and proceeds. If `"warn"` then the function warns of dropping all-zero rows and proceeds. If `"error"` then the function throws an error is all-zero rows are detected.

Details

Output tree may be nonsensical if the input Config matrix does not define a coherent tree structure.

Value

An object of class "phylo" describing the tree structure. The output object also contains an element "sna" defining the clustering of variants onto the branches of the tree, and if P is non-null it also contains VAF (variant allele frequency), CCF (cell clone fraction) and clone prevalence values (computed from the supplied P argument).

Author(s)

Davis McCarthy

Examples

Configk3 <- matrix(c(
    rep(0, 15), rep(1, 8), rep(0, 7), rep(1, 5), rep(0, 3),
    rep(1, 7)
), ncol = 3)
tree_k3 <- get_tree(Config = Configk3, P = matrix(rep(1 / 3, 3), ncol = 1))
plot_tree(tree_k3)
Configk3 <- matrix(c(
    rep(0, 15), rep(1, 8), rep(0, 7), rep(1, 5), rep(0, 3),
    rep(1, 7)
), ncol = 3)
tree_k3 <- get_tree(Config = Configk3, P = matrix(rep(1 / 3, 3), ncol = 1))
plot_tree(tree_k3)

Geweke diagnostic for MCMC sampling.

Description

Geweke diagnostic for MCMC sampling.

Usage

Geweke_Z(X, first = 0.1, last = 0.5)
Geweke_Z(X, first = 0.1, last = 0.5)

Arguments

`X`	A matrix of MCMC samples for N samples per K variables
`first`	A float between 0 and 1. The initial region of MCMC chain.
`last`	A float between 0 and 1. The final region of MCMC chain.

Value

Z, a vector of absolute value of Z scores for each variable. When |Z| <= 2, the sampling could be taken as converged.

Author(s)

Yuanhua Huang

Plot heatmap from a matrix

Description

Plot heatmap from a matrix

Usage

heat_matrix(mat, base_size = 12, digits = 2, show_value = FALSE)
heat_matrix(mat, base_size = 12, digits = 2, show_value = FALSE)

Arguments

`mat`	A matrix to show, column by x-axis and row by y-axis
`base_size`	Numeric value for the base size in theme_bw
`digits`	Integer value for the number of digits to show
`show_value`	Logical value for showing the value for each element or not

Value

A ggplot heatmap visualization of the passed matrix.

Examples

mat <- matrix(rnorm(9), ncol = 3, nrow = 3) + diag(rnorm(3, 2, 0.1))
rownames(mat) <- paste0("sample_", letters[1:3])
colnames(mat) <- paste0("var_", 1:3)
heat_matrix(mat)

# Additional arguments. 
heat_matrix(mat, base_size = 6)
heat_matrix(mat, show_value = TRUE)
heat_matrix(mat, show_value = TRUE, digits = 4)
mat <- matrix(rnorm(9), ncol = 3, nrow = 3) + diag(rnorm(3, 2, 0.1))
rownames(mat) <- paste0("sample_", letters[1:3])
colnames(mat) <- paste0("var_", 1:3)
heat_matrix(mat)

# Additional arguments. 
heat_matrix(mat, base_size = 6)
heat_matrix(mat, show_value = TRUE)
heat_matrix(mat, show_value = TRUE, digits = 4)

The theme of heatmaps for prob_heatmap and sites_heatmap

Description

The theme of heatmaps for prob_heatmap and sites_heatmap

Usage

heatmap.theme(legend.position = "bottom", size = 12)
heatmap.theme(legend.position = "bottom", size = 12)

Arguments

`legend.position`	character, describes where to place legend on plot (passed to `theme_gray`)
`size`	numeric, base font size for plot (passed to `theme_gray`)

Value

a ggplot theme based on theme_gray

Load sparse matrices A and D from cellSNP VCF file with filtering SNPs

Description

Load sparse matrices A and D from cellSNP VCF file with filtering SNPs

Usage

load_cellSNP_vcf(
  vcf_file,
  min_count = 0,
  min_MAF = 0,
  max_other_allele = NULL,
  rowname_format = "full",
  keep_GL = FALSE
)
load_cellSNP_vcf(
  vcf_file,
  min_count = 0,
  min_MAF = 0,
  max_other_allele = NULL,
  rowname_format = "full",
  keep_GL = FALSE
)

Arguments

`vcf_file`	character(1), path to VCF file generated from cellSNP
`min_count`	minimum count across all cells, e.g., 20
`min_MAF`	minimum minor allele fraction, e.g., 0.1
`max_other_allele`	maximum ratio of other alleles comparing to REF and ALT alleles; for cellSNP vcf, we recommend 0.05
`rowname_format`	the format of rowname: NULL is the default from vcfR, short is CHROM_POS, and full is CHROM_POS_REF_ALT
`keep_GL`	logical(1), if TRUE, check if GL (genotype probability) exists it will be returned

Value

A list with elements the matrices A and D and GL, the genotype probability. If keep_GL is false the GL element will be an empty list.

Examples

vcf_file <- system.file("extdata", "cellSNP.cells.vcf.gz",
    package = "cardelino"
)
input_data <- load_cellSNP_vcf(vcf_file)
vcf_file <- system.file("extdata", "cellSNP.cells.vcf.gz",
    package = "cardelino"
)
input_data <- load_cellSNP_vcf(vcf_file)

Load genotype VCF into numeric values: 0, 1, or 2

Description

Note, the genotype VCF can be very big for whole genome. It would be more efficient to only keep the wanted variants and samples. bcftools does such jobs nicely.

Usage

load_GT_vcf(vcf_file, rowname_format = "full", na.rm = TRUE, keep_GP = TRUE)
load_GT_vcf(vcf_file, rowname_format = "full", na.rm = TRUE, keep_GP = TRUE)

Arguments

`vcf_file`	character(1), path to VCF file for donor genotypes
`rowname_format`	the format of rowname: NULL is the default from vcfR, short is CHROM_POS, and full is CHROM_POS_REF_ALT
`na.rm`	logical(1), if TRUE, remove the variants with NA values
`keep_GP`	logical(1), if TRUE, check if GP (genotype probability) exists it will be returned

Value

A list representing the loaded genotype information with two components: GT, the usual numeric representation of genotype and GP the genotype probabilities. Note that if keep_GP is false the GP component will be NULL.

Examples

vcf_file <- system.file("extdata", "cellSNP.cells.vcf.gz",
    package = "cardelino"
)
GT_dat <- load_GT_vcf(vcf_file, na.rm = FALSE)
vcf_file <- system.file("extdata", "cellSNP.cells.vcf.gz",
    package = "cardelino"
)
GT_dat <- load_GT_vcf(vcf_file, na.rm = FALSE)

EM algorithm for estimating binomial mixture model

Description

EM algorithm for estimating binomial mixture model

Usage

mixBinom(
  k,
  n,
  n_components = 2,
  p_init = NULL,
  learn_p = TRUE,
  min_iter = 10,
  max_iter = 1000,
  logLik_threshold = 1e-05
)
mixBinom(
  k,
  n,
  n_components = 2,
  p_init = NULL,
  learn_p = TRUE,
  min_iter = 10,
  max_iter = 1000,
  logLik_threshold = 1e-05
)

Arguments

`k`	A vector of integers. number of success
`n`	A vector of integers. number of trials
`n_components`	A number. number of components
`p_init`	A vector of floats with length n_components, the initial value of p
`learn_p`	bool(1) or a vector of bool, whether learn each p
`min_iter`	integer(1). number of minimum iterations
`max_iter`	integer(1). number of maximum iterations
`logLik_threshold`	A float. The threshold of logLikelihood increase for detecting convergence

Value

a list containing p, a vector of floats between 0 and 1 giving the estimated success probability for each component, psi, estimated fraction of each component in the mixture, and prob, the matrix of fitted probabilities of each observation belonging to each component.

Examples

n1 <- array(sample(1:30, 50, replace = TRUE))
n2 <- array(sample(1:30, 200, replace = TRUE))
k1 <- apply(n1, 1, rbinom, n = 1, p = 0.5)
k2 <- apply(n2, 1, rbinom, n = 1, p = 0.01)
RV <- mixBinom(c(k1, k2), c(n1, n2))
n1 <- array(sample(1:30, 50, replace = TRUE))
n2 <- array(sample(1:30, 200, replace = TRUE))
k1 <- apply(n1, 1, rbinom, n = 1, p = 0.5)
k2 <- apply(n2, 1, rbinom, n = 1, p = 0.01)
RV <- mixBinom(c(k1, k2), c(n1, n2))

Convert a matrix to data frame

Description

Convert a matrix to data frame

Usage

mtx_to_df(X)
mtx_to_df(X)

Arguments

`X`	A matrix of values

Value

A data.frame version of the passed matrix.

Examples

mtx_to_df(matrix(seq(12), nrow = 3))
mtx_to_df(matrix(seq(12), nrow = 3))

Precision-recall curve for multi-class prediction

Description

Precision-recall curve for multi-class prediction

Usage

multiPRC(
  prob_mat,
  simu_mat,
  marginal_mode = "best",
  cutoff = NULL,
  multiLabel.rm = TRUE,
  add_cut1 = FALSE
)
multiPRC(
  prob_mat,
  simu_mat,
  marginal_mode = "best",
  cutoff = NULL,
  multiLabel.rm = TRUE,
  add_cut1 = FALSE
)

Arguments

`prob_mat`	Probability matrix for each cell to each component
`simu_mat`	The true identity of assignment from simulation
`marginal_mode`	A string for the mode to marginalize the column: best, second, or delta
`cutoff`	A list of cutoff; if NULL use all unique scores
`multiLabel.rm`	Logical value; if True, remove the samples with multiple labels
`add_cut1`	Logical value; if True, manually add a cutoff of 1

Value

A list with two components: df, a data.frame containing precision and recall values at various cutoffs and AUC, the overall AUC.

Define a publication-style plot theme

Description

Define a publication-style plot theme

Usage

plot_config_diffs(Config1, Config2, show_variant_names = FALSE)
plot_config_diffs(Config1, Config2, show_variant_names = FALSE)

Arguments

`Config1`	variant by clone matrix defining the first clonal structure
`Config2`	variant by clone matrix defining the second clonal structure
`show_variant_names`	logical(1), should the variant names (rownames of Config matrices) be shown on the plot? Default is `FALSE`.

Value

a ggplot heatmap style plot showing the differences between the two Config matrices, specifically the differences Config1 - Config2.

Examples

Config1 <- matrix(c(
    rep(0, 15), rep(1, 8), rep(0, 7),
    rep(1, 5), rep(0, 3), rep(1, 7)
), ncol = 3)
Config2 <- matrix(c(
    rep(0, 15), rep(1, 8), rep(1, 7),
    rep(0, 5), rep(1, 3), rep(1, 7)
), ncol = 3)
rownames(Config1) <- rownames(Config2) <- paste0("var", 1:nrow(Config1))
colnames(Config1) <- colnames(Config2) <- paste0("clone", 1:ncol(Config1))
plot_config_diffs(Config1, Config2)
Config1 <- matrix(c(
    rep(0, 15), rep(1, 8), rep(0, 7),
    rep(1, 5), rep(0, 3), rep(1, 7)
), ncol = 3)
Config2 <- matrix(c(
    rep(0, 15), rep(1, 8), rep(1, 7),
    rep(0, 5), rep(1, 3), rep(1, 7)
), ncol = 3)
rownames(Config1) <- rownames(Config2) <- paste0("var", 1:nrow(Config1))
colnames(Config1) <- colnames(Config2) <- paste0("clone", 1:ncol(Config1))
plot_config_diffs(Config1, Config2)

Plot a phylogenetic tree

Description

Plot a phylogenetic tree

Usage

plot_tree(tree, orient = "h")
plot_tree(tree, orient = "h")

Arguments

`tree`	A phylgenetic tee object of class "phylo"
`orient`	A string for the orientation of the tree: "v" (vertical) or "h" (horizontal)

Details

This function plots a phylogenetic tree from an object of class "phylo", as produced, for example, by the Canopy package.

Value

a ggtree object

Author(s)

Davis McCarthy and Yuanhua Huang

References

This function makes use of the ggtree package:

Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36, doi:10.1111/2041-210X.12628

Examples

data(example_donor)
plot_tree(tree, orient = "v")
data(example_donor)
plot_tree(tree, orient = "v")

Predicted probability from learned binomial mixture model

Description

Predicted probability from learned binomial mixture model

Usage

predMixBinom(k, n, p, psi)
predMixBinom(k, n, p, psi)

Arguments

`k`	A vector of integers. number of success
`n`	A vector of integers. number of trials
`p`	a vector of binomial success probabilities
`psi`	A float between 0 and 1. fraction of each component

Value

A list with two components: prob, a matrix representing the probability of each of the passed values coming from each component of the mixture and logLik, the total log-likelihood of the new samples.

Examples

n1 <- array(sample(1:30, 50, replace = TRUE))
n2 <- array(sample(1:30, 200, replace = TRUE))
k1 <- apply(n1, 1, rbinom, n = 1, p = 0.5)
k2 <- apply(n2, 1, rbinom, n = 1, p = 0.01)
RV <- mixBinom(c(k1, k2), c(n1, n2))
RV_pred <- predMixBinom(3, 10, RV$p, RV$psi)
n1 <- array(sample(1:30, 50, replace = TRUE))
n2 <- array(sample(1:30, 200, replace = TRUE))
k1 <- apply(n1, 1, rbinom, n = 1, p = 0.5)
k2 <- apply(n2, 1, rbinom, n = 1, p = 0.01)
RV <- mixBinom(c(k1, k2), c(n1, n2))
RV_pred <- predMixBinom(3, 10, RV$p, RV$psi)

Plot a heatmap for probability of clone assignment

Description

Plot a heatmap for probability of clone assignment

Usage

prob_heatmap(prob_mat, threshold = 0.5, mode = "best", cell_idx = NULL)
prob_heatmap(prob_mat, threshold = 0.5, mode = "best", cell_idx = NULL)

Arguments

`prob_mat`	A matrix (M x K), the probability of cell j to clone k
`threshold`	A float value, the threshold for assignable cells
`mode`	A string, the method for defining scores for filtering cells: best and delta. best: highest probability of a cell to K clones, delta: the difference between the best and second.
`cell_idx`	A vector the indices of the input cells. If NULL, order by the probability of each clone

Value

a ggplot object

Examples

data(example_donor)
assignments <- clone_id(A_clone, D_clone, Config = tree$Z, inference = "EM")
fig <- prob_heatmap(assignments$prob)
data(example_donor)
assignments <- clone_id(A_clone, D_clone, Config = tree$Z, inference = "EM")
fig <- prob_heatmap(assignments$prob)

Define a publication-style plot theme

Description

Define a publication-style plot theme

Usage

pub.theme(size = 12)
pub.theme(size = 12)

Arguments

size

numeric, base font size for adapted ggplot2 theme

Details

This theme modifies the theme_classic theme in ggplot2.

Value

a ggplot theme based on theme_classic

Examples

library(ggplot2)
x <- sample(10)
y <- x + runif(10) - 0.5
df <- data.frame(x = x, y = y)
fig <- ggplot(df, aes(x = x, y = y)) +
    geom_point() +
    pub.theme()
library(ggplot2)
x <- sample(10)
y <- x + runif(10) - 0.5
df <- data.frame(x = x, y = y)
fig <- ggplot(df, aes(x = x, y = y)) +
    geom_point() +
    pub.theme()

Read a VCF file into R session

Description

Read a VCF file into R session

Usage

read_vcf(
  vcf_file,
  genome = "GRCh37",
  seq_levels_style = "Ensembl",
  verbose = TRUE
)
read_vcf(
  vcf_file,
  genome = "GRCh37",
  seq_levels_style = "Ensembl",
  verbose = TRUE
)

Arguments

`vcf_file`	character(1), path to VCF file to read into R session as a `CollapsedVCF` object
`genome`	character(1), string indicating the genome build used in the VCF file(s) (default: "GRCh37")
`seq_levels_style`	character(1), string passed to `seqlevelsStyle` the style to use for chromosome/contig names (default: "Ensembl")
`verbose`	logical(1), should messages be printed as function runs?

Value

a vcf object

Examples

vcf <- read_vcf(system.file("extdata", "cells.donorid.vcf.gz",
               package = "cardelino"))

vcf <- read_vcf(system.file("extdata", "cells.donorid.vcf.gz",
               package = "cardelino"))

Column index of the maximum value for each row in a matrix

Description

Column index of the maximum value for each row in a matrix

Usage

rowArgmax(X)
rowArgmax(X)

Arguments

`X`	A matrix of floats.

Value

a vector of the index of column for each row. Note, when multiple columns have the same value, only the earliest column will be returned.

Examples

matA <- matrix(sample(seq(12)), nrow = 3)
rowArgmax(matA)
matA <- matrix(sample(seq(12)), nrow = 3)
rowArgmax(matA)

Maximum value for each row in a matrix

Description

Maximum value for each row in a matrix

Usage

rowMax(X, mode = "best")
rowMax(X, mode = "best")

Arguments

`X`	A matrix of floats.
`mode`	A string, the method for defining scores for filtering cells: best, second and delta. `best`: highest value for each row, similarly for the `second`. `delta` is the difference between the best and the second.

Value

a vector of the collapsed value for each row, depending on the mode used.

Examples

matA <- matrix(sample(seq(12)), nrow = 3)
rowMax(matA)
matA <- matrix(sample(seq(12)), nrow = 3)
rowMax(matA)

Update matrix D with manually selected missing rate

Description

Given missing rate, the NA will be generated first. For none NA element, sequencing depth with uniformly sampled from D, row wisely. Namely, the depth is variant specific.

Usage

sample_seq_depth(D, n_cells = NULL, n_sites = NULL, missing_rate = NULL)
sample_seq_depth(D, n_cells = NULL, n_sites = NULL, missing_rate = NULL)

Arguments

`D`	A matrix (N variants x M cells), the original sequencing coverage, NA means missing
`n_cells`	A integer, the number of the cells to generate
`n_sites`	A integer, the number of variants to generate
`missing_rate`	A float value, if NULL, use the same missing rate as D

Value

a n_sites by n_cells matrix sampled from input D.

Examples

data(simulation_input)
D1 <- sample_seq_depth(D_input,
    n_cells = 500, n_sites = 50,
    missing_rate = 0.85
)
data(simulation_input)
D1 <- sample_seq_depth(D_input,
    n_cells = 500, n_sites = 50,
    missing_rate = 0.85
)

Down sample number of SNVs in the tree

Description

Down sample number of SNVs in the tree

Usage

sample_tree_SNV(tree, n_SNV = NULL)
sample_tree_SNV(tree, n_SNV = NULL)

Arguments

`tree`	A tree object from Canopy
`n_SNV`	A integer, the number of SNVs to keep in the output tree

Value

a phylo tree with down sampled variants

Examples

data(simulation_input)
tree_lite <- sample_tree_SNV(tree_4clone, n_SNV = 10)
data(simulation_input)
tree_lite <- sample_tree_SNV(tree_4clone, n_SNV = 10)

Synthetic reads generator for genetic variants

Description

There are following steps to generate the simulated reads counts for variants in single cells: 1) given the clonal genotype and the clonal prevalence, the genotypes (i.e, the clone) of cells will be generated following a multinomial distribution. Note, one cell may contain variants from two clones when it is a doublet. 2) given the distribution of reads coverage, e.g., a matrix of read coverage from real data, (variant specific), the total reads of each variant will be generated by random sampling. Note, the missing rate is governed by this matrix. 3) the allelic frequency of each variant will be generated by following a beta distribution with parameters of mean and variance. 4) Given the genotype of a cell, if the mutation exists in a cell, the alteration read counts will be generated by a binomial distribution, parameterized the allelic frequency, sampled from step 3. 5) Given the genotype of a cell, if the mutation does not exist in a cell, the alteration read counts will be generated by a binomial distribution, parameterized by the technical error rate.

Usage

sim_read_count(
  Config,
  D,
  Psi = NULL,
  means = c(0.002, 0.45),
  vars = c(100, 1),
  wise0 = "element",
  wise1 = "variant",
  cell_num = 300,
  permute_D = FALSE,
  sample_cell = TRUE,
  doublet = 0
)
sim_read_count(
  Config,
  D,
  Psi = NULL,
  means = c(0.002, 0.45),
  vars = c(100, 1),
  wise0 = "element",
  wise1 = "variant",
  cell_num = 300,
  permute_D = FALSE,
  sample_cell = TRUE,
  doublet = 0
)

Arguments

`Config`	A matrix of binary values. The clone-variant configuration, which encodes the phylogenetic tree structure, and the genotype of each clone
`D`	A matrix of integers. Sequencing depth for N variants across x cells (ideally >100 cells). NA means 0 here.
`Psi`	A vector of float. The fractions of each clone. If NULL, set a uniform distribution.
`means`	A vector of two floats. The mean theta_1 (false positive rate) and the mean theta_2 (true positive rate).
`vars`	A vector of two floats. The variance of theta_1 and theta_2.
`wise0`	A string, the beta-binomial parameter specificity for theta0: global, variant, element.
`wise1`	A string, the beta-binomial parameter specificity for theta1: global, variant, element.
`cell_num`	A integer. The number of cells to generate.
`permute_D`	A Boolean value. If True permute variants in D.
`sample_cell`	A Boolean value. If True and M > ncol(D), sample cells.
`doublet`	A float between 0 and 1, the rate of doublets

Value

a list containing A_sim, a matrix for alteration reads, A_sim, a matrix for total reads, I_sim, a matrix for clonal label, H_sim, a matrix for genotype, theta0, a matrix of expected false positive rate, theta1, a matrix of expected true positive rate, theta0_binom, theta0 as binomial parameter, theta1_binom, theta0 as binomial parameter, and is_doublet, a vector of Boolean value if a cell is a doublet

Examples

data(simulation_input)
D2 <- sample_seq_depth(D_input, n_cells = 500, n_sites = nrow(tree_4clone$Z))
simu <- sim_read_count(tree_4clone$Z, D2, Psi = NULL, cell_num = 500)
data(simulation_input)
D2 <- sample_seq_depth(D_input, n_cells = 500, n_sites = nrow(tree_4clone$Z))
simu <- sim_read_count(tree_4clone$Z, D2, Psi = NULL, cell_num = 500)

A tree object

Description

This tree object contains clonal tree information, inferred from bulk exome-seq data

Usage

example_donor
example_donor

Format

a tree object

Value

NULL, but makes available a tree object

Author(s)

Yuanhua Huang, Davis McCarthy, 2018-06-25

Source

A fibroblast sample from HipSci project

A tree object

Description

This tree object with 3 clones contains clonal tree information, inferred from bulk exome-seq data

Usage

simulation_input
simulation_input

Format

a tree object

Value

NULL, but makes available a tree object

Author(s)

Yuanhua Huang, Davis McCarthy, 2018-06-25

Source

A fibroblast sample from HipSci project

A tree object

Description

This tree object with 4 clones contains clonal tree information, inferred from bulk exome-seq data

Usage

simulation_input
simulation_input

Format

a tree object

Value

NULL, but makes available a tree object

Author(s)

Yuanhua Huang, Davis McCarthy, 2018-06-25

Source

A fibroblast sample from HipSci project

A tree object

Description

This tree object with 5 clones contains clonal tree information, inferred from bulk exome-seq data

Usage

simulation_input
simulation_input

Format

a tree object

Value

NULL, but makes available a tree object

Author(s)

Yuanhua Huang, Davis McCarthy, 2018-06-25

Source

A fibroblast sample from HipSci project

Plot a variant-cell heatmap for cell clonal assignment

Description

Plot a variant-cell heatmap for cell clonal assignment

Usage

vc_heatmap(mat, prob, Config, show_legend = FALSE)
vc_heatmap(mat, prob, Config, show_legend = FALSE)

Arguments

`mat`	A matrix for heatmap: N variants x M cells. row and column will be sorted automatically.
`prob`	A matrix of probability of clonal assignment: M cells x K clones
`Config`	A binary matrix of clonal Configuration: N variants x K clones
`show_legend`	A bool value: if TRUE, show the legend

Value

a pheatmap object

a ggplot object

References

This function makes use of the pheatmap packages

Examples

data(example_donor)
assignments <- clone_id(A_clone, D_clone, Config = tree$Z)
fig <- vc_heatmap(assignments$prob_variant, assignments$prob, tree$Z)
data(example_donor)
assignments <- clone_id(A_clone, D_clone, Config = tree$Z)
fig <- vc_heatmap(assignments$prob_variant, assignments$prob, tree$Z)

Package 'cardelino'

Help Index

A matrix of read numbers of alternative alleles for clone ID

Description

Usage

Format

Value

Author(s)

Source

A matrix of read numbers of alternative alleles

Description

Usage

Format

Value

Author(s)

Source

Assign cells to clones from cardelino results

Description

Usage

Arguments

Value

Author(s)

Examples

Scoring the simulation in assignment of singlets and doublets

Description

Usage

Arguments

Value

Precision-recall curve for binary label prediction

Description

Usage

Arguments

Value

Examples

ROC curve for binary label prediction

Description

Usage

Arguments

Value

Examples

Infer clonal identity of single cells

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Column match between two matrices by minimum mean absolute difference

Description

Usage

Arguments

Value

Examples

A list of tree configuration

Description

Usage

Format

Value

Author(s)

Source

A matrix of sequencing depths for clone ID

Description

Usage

Format

Value

Author(s)

Source

A matrix of sequencing depths

Description

Usage

Format

Value

Author(s)

Source

A matrix of sequencing depths

Description

Usage

Format

Value