Title: | Differential Composition Analysis Transformed by a Similarity matrix |
---|---|
Description: | Methods to detect the differential composition abundances between conditions in singel-cell RNA-seq experiments, with or without replicates. It aims to correct bias introduced by missclaisification and enable controlling of confounding covariates. To avoid the influence of proportion change from big cell types, DCATS can use either total cell number or specific reference group as normalization term. |
Authors: | Xinyi Lin [aut, cre] , Chuen Chau [aut], Yuanhua Huang [aut], Joshua W.K. Ho [aut] |
Maintainer: | Xinyi Lin <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.5.0 |
Built: | 2024-11-29 05:10:44 UTC |
Source: | https://github.com/bioc/DCATS |
Create a similarity matrix assuming the misclassification error distribute uniformly in all clusters
create_simMat(K, confuse_rate)
create_simMat(K, confuse_rate)
K |
A integer for number of cluster |
confuse_rate |
A float for confusion rate, uniformly to none-self clusters |
a similarity matrix with uniform confusion with other cluster
create_simMat(4, 0.1)
create_simMat(4, 0.1)
GLM supports both beta-binomial and negative binomial from aod package.
dcats_GLM( count_mat, design_mat, similarity_mat = NULL, pseudo_count = NULL, base_model = "NULL", fix_phi = NULL, reference = NULL )
dcats_GLM( count_mat, design_mat, similarity_mat = NULL, pseudo_count = NULL, base_model = "NULL", fix_phi = NULL, reference = NULL )
count_mat |
A matrix of composition sizes (n_sample, n_cluster) for each cluster in each sample |
design_mat |
A matrix or a data frame of testing candidate factors (n_sample, n_factor) with same sample order as count_mat. All factors should be continous and categorical with only two levels. |
similarity_mat |
A matrix of floats (n_cluster, n_cluster) for the similarity matrix between cluster group pair. The order of cluster should be consistent with those in 'count_mat'. |
pseudo_count |
A pseudo count to add for counts in all cell types Default NULL means 0 except if a cell type is empty in one condition, otherwise pseudo_count will be: 0.01 * rowMeans for each condition |
base_model |
A string value: 'NULL' for 1 factor vs NULL factor testing; 'FULL' for FULL factors vs n-1 factors testing. |
fix_phi |
A numeric used to provided a fixed phi value for the GLM for all cell types |
reference |
A vector of characters indicating which cell types are used as reference for normalization. 'NULL' indicates using total count for normalization. |
a list of significance p values for each cluster
K <- 3 totals1 = c(100, 800, 1300, 600) totals2 = c(250, 700, 1100) diri_s1 = rep(1, K) * 20 diri_s2 = rep(1, K) * 20 simil_mat = DCATS::create_simMat(K, confuse_rate=0.2) sim_dat <- DCATS::simulator_base(totals1, totals2, diri_s1, diri_s2, simil_mat) sim_count = rbind(sim_dat$numb_cond1, sim_dat$numb_cond2) sim_design = data.frame(condition = c("g1", "g1", "g1", "g1", "g2", "g2", "g2"), gender = sample(c("Female", "Male"), 7, replace = TRUE)) ## Using 1 factor vs NULL factor testing dcats_GLM(sim_count, sim_design, similarity_mat = simil_mat) ## Using full factors vs n-1 factors testing with intercept term dcats_GLM(sim_count, sim_design, similarity_mat = simil_mat, base_model='FULL') ## Fix phi dcats_GLM(sim_count, sim_design, similarity_mat = simil_mat, fix_phi = 1/61) ## Specify reference cell type colnames(sim_count) <- c("celltypeA", "celltypeB", "celltypeC")
K <- 3 totals1 = c(100, 800, 1300, 600) totals2 = c(250, 700, 1100) diri_s1 = rep(1, K) * 20 diri_s2 = rep(1, K) * 20 simil_mat = DCATS::create_simMat(K, confuse_rate=0.2) sim_dat <- DCATS::simulator_base(totals1, totals2, diri_s1, diri_s2, simil_mat) sim_count = rbind(sim_dat$numb_cond1, sim_dat$numb_cond2) sim_design = data.frame(condition = c("g1", "g1", "g1", "g1", "g2", "g2", "g2"), gender = sample(c("Female", "Male"), 7, replace = TRUE)) ## Using 1 factor vs NULL factor testing dcats_GLM(sim_count, sim_design, similarity_mat = simil_mat) ## Using full factors vs n-1 factors testing with intercept term dcats_GLM(sim_count, sim_design, similarity_mat = simil_mat, base_model='FULL') ## Fix phi dcats_GLM(sim_count, sim_design, similarity_mat = simil_mat, fix_phi = 1/61) ## Specify reference cell type colnames(sim_count) <- c("celltypeA", "celltypeB", "celltypeC")
Assuming all cell types share the same phi. This global phi can be calculate by pooling all cell types together to fit a beta binomial distribution.
detect_reference(count_mat, design_mat, similarity_mat = NULL, fix_phi = NULL)
detect_reference(count_mat, design_mat, similarity_mat = NULL, fix_phi = NULL)
count_mat |
A matrix of composition sizes (n_sample, n_cluster) for each cluster in each sample. |
design_mat |
A matrix or a data frame of testing candidate factors (n_sample, n_factor) with same sample order as count_mat. All factors should be continous and categorical with only two levels. |
similarity_mat |
A matrix of floats (n_cluster, n_cluster) for the similarity matrix between cluster group pair. The order of cluster should be consistent with those in 'count_mat'. |
fix_phi |
A numeric used to provided a fixed phi value for the GLM for all cell types. |
A data frame with ordered cell types and their p-value. Cell types are ordered by their p-values. The order indicating how they are recommended to be selected as reference cell types.
K <- 3 totals1 = c(100, 800, 1300, 600) totals2 = c(250, 700, 1100) diri_s1 = rep(1, K) * 20 diri_s2 = rep(1, K) * 20 simil_mat = DCATS::create_simMat(K, confuse_rate=0.2) sim_dat <- DCATS::simulator_base(totals1, totals2, diri_s1, diri_s2, simil_mat) sim_count = rbind(sim_dat$numb_cond1, sim_dat$numb_cond2) sim_design = data.frame(condition = c("g1", "g1", "g1", "g1", "g2", "g2", "g2"), gender = sample(c("Female", "Male"), 7, replace = TRUE)) ## Using 1 factor vs NULL factor testing detect_reference(sim_count, sim_design)
K <- 3 totals1 = c(100, 800, 1300, 600) totals2 = c(250, 700, 1100) diri_s1 = rep(1, K) * 20 diri_s2 = rep(1, K) * 20 simil_mat = DCATS::create_simMat(K, confuse_rate=0.2) sim_dat <- DCATS::simulator_base(totals1, totals2, diri_s1, diri_s2, simil_mat) sim_count = rbind(sim_dat$numb_cond1, sim_dat$numb_cond2) sim_design = data.frame(condition = c("g1", "g1", "g1", "g1", "g2", "g2", "g2"), gender = sample(c("Female", "Male"), 7, replace = TRUE)) ## Using 1 factor vs NULL factor testing detect_reference(sim_count, sim_design)
Assuming all cell types share the same phi. This global phi can be calculate by pooling all cell types together to fit a beta binomial distribution.
getPhi(count_mat, design_mat)
getPhi(count_mat, design_mat)
count_mat |
A matrix of composition sizes (n_sample, n_cluster) for each cluster in each sample |
design_mat |
A matrix of testing candidate factors (n_sample, n_factor) with same sample order as count_mat |
A number indicating a global phi for all cell types
K <- 3 totals1 = c(100, 800, 1300, 600) totals2 = c(250, 700, 1100) diri_s1 = rep(1, K) * 20 diri_s2 = rep(1, K) * 20 simil_mat = DCATS::create_simMat(K, confuse_rate=0.2) sim_dat <- DCATS::simulator_base(totals1, totals2, diri_s1, diri_s2, simil_mat) sim_count = rbind(sim_dat$numb_cond1, sim_dat$numb_cond2) sim_design = data.frame(condition = c("g1", "g1", "g1", "g1", "g2", "g2", "g2")) phi = DCATS::getPhi(sim_count, sim_design)
K <- 3 totals1 = c(100, 800, 1300, 600) totals2 = c(250, 700, 1100) diri_s1 = rep(1, K) * 20 diri_s2 = rep(1, K) * 20 simil_mat = DCATS::create_simMat(K, confuse_rate=0.2) sim_dat <- DCATS::simulator_base(totals1, totals2, diri_s1, diri_s2, simil_mat) sim_count = rbind(sim_dat$numb_cond1, sim_dat$numb_cond2) sim_design = data.frame(condition = c("g1", "g1", "g1", "g1", "g2", "g2", "g2")) phi = DCATS::getPhi(sim_count, sim_design)
A data containing the count matrices, the similarity matrix and other variables used to generate the similarity matrix from intestinal epithelial single cell RNA sequencing data with three condition. Count matrices are calculated based on the number of cells in each cell type. The similarity matrix is calculated by support vector machine classifiers using 5-fold cross validation. Top 30 PCs are used as predictors.
Haber2017
Haber2017
A list with 7 items:
the count matrix for the control group
the count matrix for three days after H.polygyrus infection
the count matrix for ten days after H.polygyrus infection
the count matrix for two days after Salmonella infection
the similarity matrix
the source of this dataset
https://www.nature.com/articles/nature24489
library(DCATS) data(Haber2017)
library(DCATS) data(Haber2017)
A data containing the count matrices, the similarity matrix and other variables used to generate the similarity matrix from single cell RNA sequencing data of 8 pooled lupus patient samples within two conditions. Count matrices are calculated based on the number of cells in each cell type. The similarity matrix is calculated by support vector machine classifiers using 5-fold cross validation. Top 30 PCs are used as predictors. The svmDF contains 30 PCs generated by standard Seurat pipeline and the condition, cell type information collected from the original paper.
Kang2017
Kang2017
A list with 5 items:
the count matrix for three days after H.polygyrus infection
the count matrix for ten days after H.polygyrus infection
the simularity matrix
the data frame used to calculate the similarity matrix.
the source of this dataset
https://www.nature.com/articles/nbt.4042
The transition probability from cluster i to j is the fraction of neighbours of all samples in cluster i that belongs to cluster j. Note, this matrix is asymmetric, so as the input KNN connection matrix.
knn_simMat(KNN_matrix, clusters)
knn_simMat(KNN_matrix, clusters)
KNN_matrix |
a sparse binary matrix with size (n_sample, n_sample). x_ij=1 means sample j is a neighbour of sample i. As definition, we expect sum(KNN_matrix) = n_sample * K, where K is the number neighbours. |
clusters |
a (n_sample, ) vector of cluster id for each sample. |
a similarity matrix calculated based on the knn graph.
data(simulation) knn_mat = knn_simMat(simulation$knnGraphs, simulation$labels)
data(simulation) knn_mat = knn_simMat(simulation$knnGraphs, simulation$labels)
An EM algorithm to fit a multinomial with maximum likelihood
multinom_EM(X, simMM, min_iter = 10, max_iter = 1000, logLik_threshold = 0.01)
multinom_EM(X, simMM, min_iter = 10, max_iter = 1000, logLik_threshold = 0.01)
X |
A vector of commopent sizes |
simMM |
A matrix of floats (n_cluster, n_cluster) for the similarity matrix between clusters. simMM[i,j] means the proportion of cluster i will be assigned to cluster j, hence colSums(simMM) are ones. |
min_iter |
integer(1). number of minimum iterations |
max_iter |
integer(1). number of maximum iterations |
logLik_threshold |
A float. The threshold of logLikelihood increase for detecting convergence |
a list containing mu
, a vector for estimated latent proportion
of each cluster, logLik
, a float for the estimated log likelihood,
simMM
, the input of simMM, codeX, the input of X, X_prop
,
the proportion of clusters in the input X, predict_X_prop
, and the
predicted proportion of clusters based on mu and simMM.
X = c(100, 300, 1500, 500, 1000) simMM = create_simMat(5, confuse_rate=0.2) multinom_EM(X, simMM)
X = c(100, 300, 1500, 500, 1000) simMM = create_simMat(5, confuse_rate=0.2) multinom_EM(X, simMM)
A data containing the count matrix, metadata from a large COVID-19 cohort. Count matrices are calculated based on the number of cells in each cell type. The information in the design matrix is collected in the original paper.
Ren2021
Ren2021
A list with 7 items:
the count matrix for all samples comming from different condition
the corresponding metadata related to each sample
the source of this dataset
https://www.nature.com/articles/nature24489
A data containing the count matrices, the similarity matrix and other variables used to generate the similarity matrix from a simulated single cell RNA sequencing data with two conditions. Dirichlet distribution was used to generate a proportion vector for cell types based on the defined true proportions. Multinomial distribution was used to generate simulated cell numbers. Cells were selected from cell pools based on cell numbers and gene expression matrices were processed by Seurat. The count matrix were calculated by the clustering results, and the similarity matrix was calculated using knn graph. The labels are the groud true annotation of each cell.
simulation
simulation
A list with 5 items:
the count matrix of condition 1
the count matrix of condition 2
the similariy matrix
the knn graphs information used to calculate the similarity matrix.
the clusters' label for each simulated single cell
Directly simulating the composition size from a Dirichlet-Multinomial distribution with replicates for two conditions.
simulator_base( totals_cond1, totals_cond2, dirichlet_s1, dirichlet_s2, similarity_mat = NULL )
simulator_base( totals_cond1, totals_cond2, dirichlet_s1, dirichlet_s2, similarity_mat = NULL )
totals_cond1 |
A vector of integers (n_rep1, ) for the total samples in each replicate in codition 1 |
totals_cond2 |
A vector of integers (n_rep2, ) for the total samples in each replicate in codition 2 |
dirichlet_s1 |
A vector of floats (n_cluster, ) for the composition concentration in condition 1 |
dirichlet_s2 |
A vector of floats (n_cluster, ) for the composition concentration in condition 2 |
similarity_mat |
A matrix of floats (n_cluster, n_cluster) for the similarity matrix between each cluster pair |
a list of two matrices for composition sizes in each replicate and each cluster in both conditions.
K <- 2 totals1 = c(100, 800, 1300, 600) totals2 = c(250, 700, 1100) diri_s1 = rep(1, K) * 20 diri_s2 = rep(1, K) * 20 confuse_rate = 0.2 simil_mat = create_simMat(2, 0.2) sim_dat <- simulator_base(totals1, totals2, diri_s1, diri_s2, simil_mat)
K <- 2 totals1 = c(100, 800, 1300, 600) totals2 = c(250, 700, 1100) diri_s1 = rep(1, K) * 20 diri_s2 = rep(1, K) * 20 confuse_rate = 0.2 simil_mat = create_simMat(2, 0.2) sim_dat <- simulator_base(totals1, totals2, diri_s1, diri_s2, simil_mat)
The transition probability from cluster i to j is calculated based on the information used to cluster cells. It is estimated by the misclassification rate from cluster i to j comparing the original labels with the labels predicted by support vector machine with 5-fold cross validation.
svm_simMat(dataframe)
svm_simMat(dataframe)
dataframe |
a data frame contains the information used for clustering and the original label of each cell. The original labels should have the column name 'clusterRes'. |
a similarity matrix estimated by 5-fold cross validation support vector machine.
data(Kang2017) svm_mat = svm_simMat(Kang2017$svmDF)
data(Kang2017) svm_mat = svm_simMat(Kang2017$svmDF)