The censcyt
package provides functionalities for
differential abundance analysis in cytometry when a covariate is subject
to right censoring. It extends the differential discovery capabilities
of the diffcyt
package (Weber et al. 2019) by including
association testing between the cell population abundance and a censored
variable such as survival time. The general workflow is the same as
described in the diffcyt
package vignette and it is advisable to be familiar with it before
continuing.
As a short summary, the diffcyt
workflow consists of
preprocessing of raw cytometry data, cell type assignment through
automated clustering and differential abundance analysis (or
differential state analysis) (Weber et al.
2019). If a time-to-event variable (e.g. survival time) connected
to a cytometry sample is available, one might be interested in the
association between the cell population abundance and this time-to-event
variable. In practice, a frequent problem with time-to-even variables is
that not all events are observed. Instead it is possible that only a
minimum time is known, with other words those values are (right)
censored. If standard analysis would be used (such as
testDA_GLMM
or testDA_edgeR
from
diffcyt
) a bias is introduced in the parameter estimation.
The simplest workaround is to exclude all incomplete samples, which is
known under the name of complete case analysis or listwise deletion. But
there are cases, such as high censoring rates or non random censoring,
where this approach performs unfavorably, since it can greatly reduce
statistical power or introduce a bias.
Normally, when a variable is censored, association testing can be
done using classical survival analysis methods (e.g. Cox proportional
hazard model). But in the context of differential abundance analysis in
diffcyt
the censored variable is not the response but a
covariate (in contrary to classical survival analyis). The approach used
by censcyt
to overcome this hurdle is multiple imputation
(Rubin 1988) which allows the use of the
standard differential analysis methods at the cost of increased
runtimes. Multiple imputation consists of the three steps
Imputation, Analysis and Pooling which are
shortly explained in the context of censcyt
.
In the first step (Imputation) multiple complete data sets are generated by replacing the missing (or censored) values by a random draw from a predictive distribution (See Table @ref(tab:methods)). Then in the second step (Analysis) a GLMM is fitted on each imputed dataset where the cell population counts are modeled as the response and the censored (or rather imputed) variable (e.g. survival time) as a covariate. In the last step (Pooling) the results from the second step are combined using Rubins’s rules (Rubin 1988) which take into account the additional uncertainties from the imputation.
The three main available imputation methods in the function
testDA_censoredGLMM
are listed in Table
@ref(tab:methods).
Method | Description | Abbr. |
---|---|---|
Complete case analysis | Incomplete samples are removed. | cc |
Risk set imputation (Taylor et al. 2002) | Censored values are replace by a random value from its risk set. | rs |
Kaplan-Meier imputation (Taylor et al. 2002) | Censored values are replaced by a random draw from a survival function that has been fit on the risk set of the respective censored value. | km |
Mean residual life imputation (Conditional multiple imputation, Atem (2017)) | Bootstrapping of the incomplete data and replacement of the censored values by adding the mean residual life (the expected time until event given the censoring time). | mrl |
If the largest value is censored, the survival function does not reach its theoretical minimum of 0, which would leave some replacements undefined. Multiple options to handle this problem are implemented for use in Kaplan-Meier imputation (Table @ref(tab:tailmethods)).
Description | Abbr. |
---|---|
Set the largest value as observed. (default) | km |
Model the tail of the survival function as an exponential distribution. (Moeschberger and Klein 1985) | km_exp |
Model the tail of the survival function as a weibull distribution. (Moeschberger and Klein 1985) | km_wei |
Replace all censored values larger than the largest observed value with expected values according to order statistics. (Moeschberger and Klein 1985) | km_os |
First, load the necessary packages. Important to note is that loading
censcyt
will also load diffcyt
.
suppressPackageStartupMessages({
library(censcyt)
library(ggplot2)
library(SummarizedExperiment)
library(tidyr)
})
The main input for DA analysis consists of a cluster times sample matrix of cell population abundances. How to obtain those counts is explained in the diffcyt vignette and will not be further considered here. Instead, for illustrative purposes, a dataset is simulated. In this case, the censored covariate is modeled according to an exponential distribution. To obtain censoring, an additional variable, the censoring time, is simulated as well and the actual measured value is the minimum of those two variables.
set.seed(1234)
nr_samples <- 50
nr_clusters <- 6
# the covariate is simulated from an exponential distribution:
X_true <- rexp(nr_samples)
# the censoring time is also sampled from an exponential distribution:
C <- rexp(nr_samples)
# the actual observed value is the minimum of the two:
X_obs <- pmin(X_true,C)
# additionally, we have the event indicator:
Event_ind <- ifelse(X_true>C,0,1)
# proportion censored:
(length(Event_ind)-sum(Event_ind))/length(Event_ind)
## [1] 0.52
Next, the cell counts are modeled according to a
dirichlet-multinomial distribution where some clusters have an
association with a covariate. The function
simulate_multicluster
can be used for this.
# number of cells per sample
sizes <- round(runif(nr_samples,1e3,1e4))
# alpha parameters of the dirichlet-multinomial distribution.
alphas <- runif(nr_clusters,1,10)
# main simulation function
simulation_output <- simulate_multicluster(alphas = alphas,
sizes = sizes,
covariate = X_obs,
nr_diff = 2,
return_summarized_experiment = TRUE,
slope = list(0.9))
# counts as a SummarizedExperiment object
d_counts <- simulation_output$counts
To get a sense of how this simulated data looks like we can plot the relative cell population abundance for each sample vs. the censored covariate per cell population.
# vector indicating if a cluster has a modeled association or not
is_diff_cluster <- ifelse(is.na(simulation_output$row_data$paired),FALSE,TRUE)
# first convert to proportions:
proportion <- as.data.frame(t(apply(assay(d_counts),2,function(x)x/sum(x))))
names(proportion) <- paste0("Nr: ",names(proportion), " - ",
ifelse(is_diff_cluster,"DA","non DA"))
# then to long format for plotting
counts_long <-
pivot_longer(proportion,
cols= seq_len(nr_clusters),
names_to = "cluster_id",
values_to = "proportion")
# add observed (partly censored) variable
counts_long$X_obs <- rep(X_obs,each=nr_clusters)
# add event indicator
counts_long$Event_ind <- factor(rep(Event_ind,each=nr_clusters),
levels = c(0,1),
labels=c("censored","observed"))
ggplot(counts_long) +
geom_point(aes(x=X_obs,y=proportion,color=Event_ind)) +
facet_wrap(~cluster_id)
Next step is to set up the meta data, i.e. the covariates that are
used in the testing. This is done the same way as in
diffcyt
but additionally the event indicator for the
censored covariate has to be given, since censored variables are
represented by two values, the measured value itself and an indicator if
the measured value is observed (1) or censored (0).
# all covariates and sample ids
experiment_info <- data.frame(sample_id = seq_len(nr_samples),
X_obs = X_obs,
Event_ind = Event_ind)
The createFormula
function is similar to the one used in
diffcyt
but additionally the argument
event_indicator
has to be specified by giving the
respective column name in experiment_info
. If multiple
covariates are given in argument cols_fixed
then
event_indicator
is associated with the first element given
to cols_fixed
.
The main part is the differential testing which is done using the
function testDA_censoredGLMM
. It consists of the same
arguments as the non-censored version testDA_GLMM
(from
diffcyt
) with two additional arguments specific to multiple
imputation. The first additional argument is mi_reps
(multiple imputation repetitions) which is the number of multiple
imputation steps performed. Unfortunately there is no clear rule as to
how many imputations are needed and only rules of thumb are available
(Buuren 2018). In general, more
imputations are needed if the censoring rate is high and in applications
where high statistical power is needed (Buuren
2018). One ‘rule’ is the quadratic rule of Hippel (2018) which can be shown if
verbose
is set to TRUE
. The second argument is
imputation_method
which is used to specify the imputation
method. See Table @ref(tab:methods) and Table @ref(tab:tailmethods).
# test with 50 repetitions with method risk set imputation (rs)
da_out <- testDA_censoredGLMM(d_counts = d_counts,
formula = da_formula,
contrast = contrast,
mi_reps = 30,
imputation_method = "km",
verbose = TRUE,
BPPARAM=BiocParallel::MulticoreParam(workers = 1))
## 26 of 50 values are censored
## suggested number of imputations: 54
To look at the results we can use diffcyt
’s
topTable
function:
## DataFrame with 6 rows and 3 columns
## cluster_id p_val p_adj
## <character> <numeric> <numeric>
## 3 3 0.00148338 0.00890028
## 6 6 0.00987323 0.02961970
## 1 1 0.57299042 0.89086778
## 5 5 0.59391185 0.89086778
## 4 4 0.77492552 0.92991063
## 2 2 0.98172528 0.98172528
## [1] 3 6
Instead of using the single functions as shown above, it is possible
use the wrapper function censcyt
which is the analog to the
diffcyt
wrapper. First, some expression data is simulated,
i.e. a more realistic starting position than in the example above.
# Function to create expression data (one sample)
fcs_sim <- function(n = 2000, mean = 0, sd = 1, ncol = 10, cof = 5) {
d <- matrix(sinh(rnorm(n*ncol, mean, sd)) * cof,ncol=ncol)
# each marker is heavily expressed in one population, i.e. this should result
# in 'ncol' clusters
for(i in seq_len(ncol)){
d[seq(n/ncol*(i-1)+1,n/ncol*(i)),i] <-
sinh(rnorm(n/ncol, mean+2, sd)) * cof
}
colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol))
d
}
# create multiple expression data samples with DA signal
comb_sim <- function(X, nr_samples = 10, mean = 0, sd = 1, ncol = 10, cof = 5){
# Create random data (without differential signal)
d_input <- lapply(seq_len(nr_samples),
function(i) fcs_sim(mean = mean,
sd = sd,
ncol = ncol,
cof = cof))
# Add differential abundance (DA) signal
for(i in seq_len(nr_samples)){
# number of cells in cluster 1
n_da <- round((plogis(X_true[i]))*300)
# set to random expression
d_input[[i]][seq_len(n_da), ] <-
matrix(sinh(rnorm(n_da*ncol, mean, sd)) * cof, ncol=ncol)
# increase expression for cluster 1
d_input[[i]][seq_len(n_da) ,1] <-
sinh(rnorm(n_da, mean + 2, sd)) * cof
}
d_input
}
# set parameter and simulat
d_input <- comb_sim(X_true, nr_samples = 50)
The objects experiment_info
, da_formula
and
contrast
are the same as specified above, but additionally
information about the measured markers has to be supplied.
marker_info <- data.frame(
channel_name = paste0("channel", sprintf("%03d", seq_len(10))),
marker_name = paste0("marker", sprintf("%02d", seq_len(10))),
marker_class = factor(c(rep("type", 10)),
levels = c("type", "state", "none")),
stringsAsFactors = FALSE
)
Finally the wrapper function can be run. Argument
mi_reps
is to specify the number of repetitions in the
multiple imputation and imputation_method
allows to set the
imputation method. To speed up the computation, a
BiocParallelParam
object (e.g. MulticoreParam
,
from package BiocParallel)
can be given to the argument BPPARAM
to parallelize parts
of the computation.
# Run wrapper function
out_DA <- censcyt(d_input,
experiment_info,
marker_info,
formula = da_formula,
contrast = contrast,
meta_clustering = TRUE,
meta_k = 10,
seed_clustering = 123,
verbose = FALSE,
mi_reps = 10,
imputation_method = "km",
BPPARAM=BiocParallel::MulticoreParam(workers = 1))
## FlowSOM clustering completed in 3.5 seconds
## DataFrame with 10 rows and 3 columns
## cluster_id p_val p_adj
## <character> <numeric> <numeric>
## 8 8 0.000377 0.00377
## 3 3 0.001770 0.00883
## 1 1 0.833000 0.96600
## 2 2 0.557000 0.96600
## 4 4 0.869000 0.96600
## 5 5 0.616000 0.96600
## 6 6 0.835000 0.96600
## 7 7 0.535000 0.96600
## 10 10 0.313000 0.96600
## 9 9 0.973000 0.97300
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] tidyr_1.3.1 SummarizedExperiment_1.35.5
## [3] Biobase_2.65.1 GenomicRanges_1.57.2
## [5] GenomeInfoDb_1.41.2 IRanges_2.39.2
## [7] S4Vectors_0.43.2 BiocGenerics_0.51.3
## [9] MatrixGenerics_1.17.1 matrixStats_1.4.1
## [11] ggplot2_3.5.1 censcyt_1.15.0
## [13] diffcyt_1.25.0 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] jsonlite_1.8.9 shape_1.4.6.1
## [5] magrittr_2.0.3 jomo_2.7-6
## [7] TH.data_1.1-2 farver_2.1.2
## [9] nloptr_2.1.1 rmarkdown_2.28
## [11] GlobalOptions_0.1.2 zlibbioc_1.51.2
## [13] vctrs_0.6.5 minqa_1.2.8
## [15] rstatix_0.7.2 htmltools_0.5.8.1
## [17] S4Arrays_1.5.11 forcats_1.0.0
## [19] broom_1.0.7 SparseArray_1.5.45
## [21] Formula_1.2-5 mitml_0.4-5
## [23] parallelly_1.38.0 sass_0.4.9
## [25] bslib_0.8.0 plyr_1.8.9
## [27] sandwich_3.1-1 zoo_1.8-12
## [29] cachem_1.1.0 buildtools_1.0.0
## [31] igraph_2.1.1 lifecycle_1.0.4
## [33] iterators_1.0.14 pkgconfig_2.0.3
## [35] Matrix_1.7-1 R6_2.5.1
## [37] fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [39] future_1.34.0 clue_0.3-65
## [41] digest_0.6.37 colorspace_2.1-1
## [43] ggnewscale_0.5.0 furrr_0.3.1
## [45] ggpubr_0.6.0 labeling_0.4.3
## [47] cytolib_2.17.0 fansi_1.0.6
## [49] colorRamps_2.3.4 httr_1.4.7
## [51] polyclip_1.10-7 abind_1.4-8
## [53] compiler_4.4.1 withr_3.0.2
## [55] doParallel_1.0.17 ConsensusClusterPlus_1.69.0
## [57] backports_1.5.0 BiocParallel_1.39.0
## [59] carData_3.0-5 highr_0.11
## [61] ggforce_0.4.2 pan_1.9
## [63] broom.mixed_0.2.9.6 ggsignif_0.6.4
## [65] MASS_7.3-61 DelayedArray_0.31.14
## [67] rjson_0.2.23 FlowSOM_2.13.0
## [69] tools_4.4.1 nnet_7.3-19
## [71] glue_1.8.0 nlme_3.1-166
## [73] grid_4.4.1 Rtsne_0.17
## [75] cluster_2.1.6 reshape2_1.4.4
## [77] generics_0.1.3 gtable_0.3.6
## [79] car_3.1-3 utf8_1.2.4
## [81] XVector_0.45.0 foreach_1.5.2
## [83] pillar_1.9.0 stringr_1.5.1
## [85] limma_3.61.12 circlize_0.4.16
## [87] splines_4.4.1 flowCore_2.17.0
## [89] dplyr_1.1.4 tweenr_2.0.3
## [91] lattice_0.22-6 survival_3.7-0
## [93] dirmult_0.1.3-5 RProtoBufLib_2.17.0
## [95] tidyselect_1.2.1 ComplexHeatmap_2.21.1
## [97] locfit_1.5-9.10 maketools_1.3.1
## [99] knitr_1.48 edgeR_4.3.21
## [101] xfun_0.48 statmod_1.5.0
## [103] stringi_1.8.4 UCSC.utils_1.1.0
## [105] yaml_2.3.10 boot_1.3-31
## [107] evaluate_1.0.1 codetools_0.2-20
## [109] tibble_3.2.1 BiocManager_1.30.25
## [111] cli_3.6.3 rpart_4.1.23
## [113] munsell_0.5.1 jquerylib_0.1.4
## [115] Rcpp_1.0.13 globals_0.16.3
## [117] png_0.1-8 XML_3.99-0.17
## [119] parallel_4.4.1 glmnet_4.1-8
## [121] listenv_0.9.1 lme4_1.1-35.5
## [123] mvtnorm_1.3-1 scales_1.3.0
## [125] purrr_1.0.2 crayon_1.5.3
## [127] GetoptLong_1.0.5 rlang_1.1.4
## [129] multcomp_1.4-26 mice_3.16.0