Title: | Simulation, power evaluation, and sample size recommendation for single cell RNA-seq |
---|---|
Description: | Determining the sample size for adequate power to detect statistical significance is a crucial step at the design stage for high-throughput experiments. Even though a number of methods and tools are available for sample size calculation for microarray and RNA-seq in the context of differential expression (DE), this topic in the field of single-cell RNA sequencing is understudied. Moreover, the unique data characteristics present in scRNA-seq such as sparsity and heterogeneity increase the challenge. We propose POWSC, a simulation-based method, to provide power evaluation and sample size recommendation for single-cell RNA sequencing DE analysis. POWSC consists of a data simulator that creates realistic expression data, and a power assessor that provides a comprehensive evaluation and visualization of the power and sample size relationship. |
Authors: | Kenong Su [aut, cre], Hao Wu [aut] |
Maintainer: | Kenong Su <[email protected]> |
License: | GPL-2 |
Version: | 1.15.0 |
Built: | 2024-11-08 06:00:52 UTC |
Source: | https://github.com/bioc/POWSC |
This dataset is obtained from GEO under accession number GSE29087. It is generated to profile the transcriptomes for 92 single cells consisting of mouse embryonic fibroblast (MEF) and embryonic stem (ES) cells (Islam et al., 2011). The average sequencing depth for the dataset is around half a million.
es_mef_sce
es_mef_sce
A singlecellexperiment object contain the expressin data with two cell types
Islam, S., Kjällquist, U., Moliner, A., Zajac, P., Fan, J.-B., Lönnerberg, P., and Linnarsson, S. (2011). Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome research, 21(7), 1160–1167.
These parameters include four gene-wise parameters and two cell-wise parameters.
Est2Phase(sce, low.prob = 0.99)
Est2Phase(sce, low.prob = 0.99)
sce |
SingleCellExperiment object with assays(sce)[[1]] is the count matrix or input directly |
low.prob |
lower bound probability for phase I |
a list of needed estimated parameters
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(123) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] estParas = Est2Phase(sce)
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(123) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] estParas = Est2Phase(sce)
plot the result use visualization.
plot_POWSC(POWSCobj, Form = c("I", "II"), Cell_Type = c("PW", "Multi"))
plot_POWSC(POWSCobj, Form = c("I", "II"), Cell_Type = c("PW", "Multi"))
POWSCobj |
a POWSC object from |
Form |
choose from "I" or "II". |
Cell_Type |
choose from "PW" or "Multi" |
for multiple comparison cases, return the pheatmap; for the pairwise comparison, return the ggplot object.
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(12) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] est_Paras = Est2Phase(sce) sim_size = c(100, 200) # A numeric vector pow_rslt = runPOWSC(sim_size = sim_size, est_Paras = est_Paras,per_DE=0.05, DE_Method = "MAST", Cell_Type = "PW") # Note, using our previous developed tool SC2P is faster. plot_POWSC(pow_rslt, Form="II", Cell_Type = "PW") # Alternatively, we can use Form="I")
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(12) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] est_Paras = Est2Phase(sce) sim_size = c(100, 200) # A numeric vector pow_rslt = runPOWSC(sim_size = sim_size, est_Paras = est_Paras,per_DE=0.05, DE_Method = "MAST", Cell_Type = "PW") # Note, using our previous developed tool SC2P is faster. plot_POWSC(pow_rslt, Form="II", Cell_Type = "PW") # Alternatively, we can use Form="I")
Run DE analysis by using MAST. Here we output two result tables corresponding to two forms of DE genes. These parameters include four gene-wise parameters and two cell-wise parameters.
Power_Cont( DErslt, simData, alpha = 0.1, delta = 0.5, strata = c(0, 10, 2^(seq_len(4)) * 10, Inf) )
Power_Cont( DErslt, simData, alpha = 0.1, delta = 0.5, strata = c(0, 10, 2^(seq_len(4)) * 10, Inf) )
DErslt |
is from the DE analysis by MAST |
simData |
is the corresponding simulated scRNA-seq dataset (SingCellExperiment) |
alpha |
is the cutoff for the fdr which can be modified |
delta |
or the lfc is the cutoff (=0.5) used to determined the high DE genes for Form II |
strata |
can be modified by the user. By default, it is (0, 10], (10, 20], (20, 40], (40, 80], (80, Inf] |
a list of metrics for power analysis such as: stratified targeted power and marginal power.
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(123) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] estParas = Est2Phase(sce) simData = Simulate2SCE(n=500, estParas1 = estParas, estParas2 = estParas) DErslt = runDE(simData$sce) Cont_pow = Power_Cont(DErslt, simData)
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(123) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] estParas = Est2Phase(sce) simData = Simulate2SCE(n=500, estParas1 = estParas, estParas2 = estParas) DErslt = runDE(simData$sce) Cont_pow = Power_Cont(DErslt, simData)
Run DE analysis by using MAST. Here we output two result tables corresponding to two forms of DE genes. These parameters include four gene-wise parameters and two cell-wise parameters.
Power_Disc( DErslt, simData, alpha = 0.1, delta = 0.1, strata = seq(0, 1, by = 0.2) )
Power_Disc( DErslt, simData, alpha = 0.1, delta = 0.1, strata = seq(0, 1, by = 0.2) )
DErslt |
is from the DE analysis by MAST |
simData |
is the corresponding simulated scRNA-seq dataset (SingCellExperiment) |
alpha |
is the cutoff for the fdr which can be modified |
delta |
or the zero ratio change is the cutoff (=0.1) used to determined the high DE genes for Form II |
strata |
can be modified by the user. By default, it is (0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1] |
a list of metrics for power analysis such as: stratified targeted power and marginal power.
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(123) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] estParas = Est2Phase(sce) simData = Simulate2SCE(n=1000, estParas1 = estParas, estParas2 = estParas) DErslt = runDE(simData$sce) Disc_pow = Power_Disc(DErslt, simData)
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(123) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] estParas = Est2Phase(sce) simData = Simulate2SCE(n=1000, estParas1 = estParas, estParas2 = estParas) DErslt = runDE(simData$sce) Disc_pow = Power_Disc(DErslt, simData)
A wrapper function for calling DE genes. This contains two methods: MAST and SC2P
runDE(sce, DE_Method = c("MAST", "SC2P"))
runDE(sce, DE_Method = c("MAST", "SC2P"))
sce |
is a simulated scRNA-seq dataset with two-group conditions, e.g., treatment vs control. |
DE_Method |
is a string chosen from "MAST" or "SC2P". |
a list of three tables: the first table summaries the DE result for both forms of DE genes. cont table represents the result for continous case. disc table shows the result for discontinous case.
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(123) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] estParas = Est2Phase(sce) simData = Simulate2SCE(n=100, estParas1 = estParas, estParas2 = estParas) sim_sce = simData$sce DErslt = runDE(sim_sce)
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(123) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] estParas = Est2Phase(sce) simData = Simulate2SCE(n=100, estParas1 = estParas, estParas2 = estParas) sim_sce = simData$sce DErslt = runDE(sim_sce)
Run DE analysis by using MAST. Here we output two result tables corresponding to two forms of DE genes. These parameters include four gene-wise parameters and two cell-wise parameters.
runMAST(sce)
runMAST(sce)
sce |
is a simulated scRNA-seq dataset with two-group conditions, e.g., treatment vs control. |
a list of three tables: the first table summaries the DE result for both forms of DE genes. cont table represents the result for continous case. disc table shows the result for discontinous case.
These parameters include four gene-wise parameters and two cell-wise parameters.
runPOWSC( sim_size = c(50, 100, 200, 800, 1000), per_DE = 0.05, est_Paras, DE_Method = c("MAST", "SC2P"), Cell_Type = c("PW", "Multi"), multi_Prob = NULL, alpha = 0.1, disc_delta = 0.1, cont_delta = 0.5 )
runPOWSC( sim_size = c(50, 100, 200, 800, 1000), per_DE = 0.05, est_Paras, DE_Method = c("MAST", "SC2P"), Cell_Type = c("PW", "Multi"), multi_Prob = NULL, alpha = 0.1, disc_delta = 0.1, cont_delta = 0.5 )
sim_size |
a list of numbers |
per_DE |
the percentage of the DE genes. |
est_Paras |
the template parameter estimated from one cell type. |
DE_Method |
is a string chosen from "MAST" or "SC2P". |
Cell_Type |
is a string corresponding to the 1st scenario: same cell type comparison, and 2nd scenario: multiple cell types. |
multi_Prob |
is the mixture cell proportions which sum up to 1. If not summing up to 1, then the package will internally do the normalization procedure. |
alpha |
is the cutoff for the fdr which can be modified |
disc_delta |
or the zero ratio change is the cutoff (=0.1) used to determined the high DE genes for Form II. |
cont_delta |
or the lfc is the cutoff (=0.5) used to determined the high DE genes for Form II. |
POWSC object
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(12) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] est_Paras = Est2Phase(sce) sim_size = c(100, 200) # A numeric vector pow_rslt = runPOWSC(sim_size = sim_size, est_Paras = est_Paras,per_DE=0.05, DE_Method = "MAST", Cell_Type = "PW") # Note, using our previous developed tool SC2P is faster.
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(12) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] est_Paras = Est2Phase(sce) sim_size = c(100, 200) # A numeric vector pow_rslt = runPOWSC(sim_size = sim_size, est_Paras = est_Paras,per_DE=0.05, DE_Method = "MAST", Cell_Type = "PW") # Note, using our previous developed tool SC2P is faster.
Run DE analysis by using SC2P. Here we output two result tables corresponding to two forms of DE genes.
runSC2P(sce)
runSC2P(sce)
sce |
is a simulated scRNA-seq dataset with two-group conditions, e.g., treatment vs control. |
a list of three tables: the first table summaries the DE result for both forms of DE genes. cont table represents the result for continous case. disc table shows the result for discontinous case.
This dataset is obtained from GEO under accession number GSE67835. It includes 7 patients about 466 cells to capture the cellular complexity of the adult and fetal human brain at a whole transcriptome level. Healthy adult temporal lobe tissue was obtained from epileptic patients during temporal lobectomy for medically refractory seizures.
sce
sce
A singlecellexperiment object contain the expressions data with three cell types about patient AS_7
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67835
Darmanis, Spyros, et al. "A survey of human brain transcriptome diversity at the single cell level." Proceedings of the National Academy of Sciences 112.23 (2015): 7285-7290.
Simulate the data for two-group comparison; e.g., treatment v.s. control It simulates the DE changes in two forms corresponding two types of DE genes
Simulate2SCE(n = 100, perDE = 0.05, estParas1, estParas2)
Simulate2SCE(n = 100, perDE = 0.05, estParas1, estParas2)
n |
the number of total cells for two groups |
perDE |
percentage of DE genes |
estParas1 |
the set of parameters corresponding to cell type I |
estParas2 |
the set of parameters corresponding to cell type II |
a list of metrics recording the changes in the generated data: such as the DE gene indices for Form I and II DE genes, and simulated expression data in singlecellexperiment format.
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(123) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] estParas = Est2Phase(sce) simData = Simulate2SCE(n=100, estParas1 = estParas, estParas2 = estParas)
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(123) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] estParas = Est2Phase(sce) simData = Simulate2SCE(n=100, estParas1 = estParas, estParas2 = estParas)
Simulate the data for multiple-group comparisons; e.g., different cell types in blood It simulates the DE changes in two forms corresponding two types of DE genes
SimulateMultiSCEs( n = 1000, estParas_set, multiProb, delta1 = 0.1, delta2 = 0.5 )
SimulateMultiSCEs( n = 1000, estParas_set, multiProb, delta1 = 0.1, delta2 = 0.5 )
n |
the number of total cells for multiple groups; e.g., 1000, 2000, and etc. |
estParas_set |
a set of parameters corresponding to different cell types. |
multiProb |
a vector of probilities correponding to each cell type. It is not necessary to sum up to 1 because POWSC will normalize internally. |
delta1 |
the minimum of expression change used to determine the Form I DE. |
delta2 |
the minimum of log fold change used to determine the Form II DE. |
a list of simulated datasets. Each dataset corresponds to a pair-wise comparison including a series of metrics such as the DE gene indices for Form I and II DE genes, and simulated expression data in singlecellexperiment format.
data("es_mef_sce") set.seed(123) rix = sample(1:nrow(es_mef_sce), 500) es_mef_sce = es_mef_sce[rix, ] sce1 = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] estParas1 = Est2Phase(sce1) sce2 = es_mef_sce[, colData(es_mef_sce)$cellTypes == "stemCell"] estParas2 = Est2Phase(sce2) estParas_set = list(celltype1 = estParas1, celltype2 = estParas1, celltype3 =estParas2) multiProb = c(0.2, 0.3, 0.5) simData = SimulateMultiSCEs(n=200, estParas_set = estParas_set, multiProb = multiProb)
data("es_mef_sce") set.seed(123) rix = sample(1:nrow(es_mef_sce), 500) es_mef_sce = es_mef_sce[rix, ] sce1 = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] estParas1 = Est2Phase(sce1) sce2 = es_mef_sce[, colData(es_mef_sce)$cellTypes == "stemCell"] estParas2 = Est2Phase(sce2) estParas_set = list(celltype1 = estParas1, celltype2 = estParas1, celltype3 =estParas2) multiProb = c(0.2, 0.3, 0.5) simData = SimulateMultiSCEs(n=200, estParas_set = estParas_set, multiProb = multiProb)
summary of the result
summary_POWSC(POWSCobj, Form = c("I", "II"), Cell_Type = c("PW", "Multi"))
summary_POWSC(POWSCobj, Form = c("I", "II"), Cell_Type = c("PW", "Multi"))
POWSCobj |
a POWSC object from |
Form |
choose from "I" or "II". |
Cell_Type |
choose from "PW" or "Multi". |
return the summary of the power including stratified, marginal, and overall power.
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(12) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] est_Paras = Est2Phase(sce) sim_size = c(100, 200) # A numeric vector pow_rslt = runPOWSC(sim_size = sim_size, est_Paras = est_Paras,per_DE=0.05, DE_Method = "MAST", Cell_Type = "PW") # Note, using our previous developed tool SC2P is faster. summary_POWSC(pow_rslt, Form="II", Cell_Type = "PW")
data("es_mef_sce") sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"] set.seed(12) rix = sample(1:nrow(sce), 500) sce = sce[rix, ] est_Paras = Est2Phase(sce) sim_size = c(100, 200) # A numeric vector pow_rslt = runPOWSC(sim_size = sim_size, est_Paras = est_Paras,per_DE=0.05, DE_Method = "MAST", Cell_Type = "PW") # Note, using our previous developed tool SC2P is faster. summary_POWSC(pow_rslt, Form="II", Cell_Type = "PW")