Flow and mass cytometry are important modern immunology tools for
measuring expression levels of multiple proteins on single cells. The
goal is to better understand the mechanisms of responses on a single
cell basis by studying differential expression of proteins. Most current
data analysis tools compare expressions across many computationally
discovered cell types. Our goal is to focus on just one cell type.
Differential analysis of marker expressions can be difficult due to
marker correlations and inter-subject heterogeneity, particularly for
studies of human immunology. We address these challenges with two
multiple regression strategies: A bootstrapped generalized linear model
(GLM) and a generalized linear mixed model (GLMM). Here, we illustrate
the CytoGLMM
R package and workflow for simulated
mass cytometry data.
We construct our simulated datasets by sampling from a Poisson GLM. We confirmed—with predictive posterior checks—that Poisson GLMs with mixed effects provide a good fit to mass cytometry data (Seiler et al. 2019). We consider one underlying data generating mechanisms described by a hierarchical model for the ith cell and jth donor:
$$ \begin{aligned} \boldsymbol{X}_{ij} & \sim \text{Poisson}(\boldsymbol{\lambda}_{ij}) \\ \log(\boldsymbol{\lambda}_{ij}) & = \boldsymbol{B}_{ij} + \boldsymbol{U}_j \\ \boldsymbol{B}_{ij} & \sim \begin{cases} \text{Normal}(\boldsymbol{\delta}^{(0)}, \boldsymbol{\Sigma}_B) & \text{if } Y_{ij} = 0, \text{ cell unstimulated} \\ \text{Normal}(\boldsymbol{\delta}^{(1)}, \boldsymbol{\Sigma}_B) & \text{if } Y_{ij} = 1, \text{ cell stimulated} \end{cases} \\ \boldsymbol{U}_j & \sim \text{Normal}(\boldsymbol{0}, \boldsymbol{\Sigma}_U). \end{aligned} $$
The following graphic shows a representation of the hierarchical model.
The stimulus activates proteins and induces a difference in marker expression. We define the effect size to be the difference between expected expression levels of stimulated versus unstimulated cells on the log -scale. All markers that belong to the active set , have a non-zero effect size, whereas, all markers that are not, have a zero effect size:
$$ \begin{cases} \delta^{(1)}_p - \delta^{(0)}_p > 0 & \text{if protein } p \text{ is in activation set } p \in C \\ \delta^{(1)}_{p'} - \delta^{(0)}_{p'} = 0 & \text{if protein } p' \text{ is not in activation set } p' \notin C. \end{cases} $$
Both covariance matrices have an autoregressive structure,
$$ \begin{aligned} \Omega_{rs} & = \rho^{|r-s|} \\ \boldsymbol{\Sigma} & = \operatorname{diag}(\boldsymbol{\sigma}) \, \boldsymbol{\Omega} \, \operatorname{diag}(\boldsymbol{\sigma}), \end{aligned} $$
where Ωrs is the rth row and sth column of the correlation matrix Ω. We regulate two separate correlation parameters: a cell-level ρB and a donor-level ρU coefficient. Non-zero ρB or ρU induce a correlation between condition and marker expression even for markers with a zero effect size.
## # A tibble: 5 × 5
## donor condition m01 m02 m03
## <int> <fct> <int> <int> <int>
## 1 1 treatment 2 198 81
## 2 1 treatment 2 106 78
## 3 1 treatment 43 109 53
## 4 1 treatment 4 58 12
## 5 1 treatment 101 51 26
We define the marker names that we will focus on in our analysis by extracting them from the simulated data frame.
We recommend that marker expressions be corrected for batch effects (Nowicka et al. 2017; Chevrier et al. 2018; Schuyler et al. 2019; Van Gassen et al. 2020; Trussart et al. 2020) and transformed using variance stabilizing transformations to account for heteroskedasticity, for instance with an inverse hyperbolic sine transformation with the cofactor set to 150 for flow cytometry, and 5 for mass cytometry (Bendall et al. 2011). This transformation assumes a two-component model for the measurement error (Rocke and Lorenzato 1995; Huber et al. 2003) where small counts are less noisy than large counts. Intuitively, this corresponds to a noise model with additive and multiplicative noise depending on the magnitude of the marker expression; see (Holmes and Huber 2019) for details.
The goal of the CytoGLMM::cytoglm
function is to find
protein expression patterns that are associated with the condition of
interest, such as a response to a stimulus. We set up the GLM to predict
the experimental condition (condition
) from protein marker
expressions (protein_names
), thus our experimental
conditions are response variables and marker expressions are explanatory
variables.
glm_fit = CytoGLMM::cytoglm(df,
protein_names = protein_names,
condition = "condition",
group = "donor",
num_cores = 1,
num_boot = 1000)
glm_fit
##
## #######################
## ## paired analysis ####
## #######################
##
## number of bootstrap samples: 1000
##
## number of cells per group and condition:
## control treatment
## 1 1000 1000
## 2 1000 1000
## 3 1000 1000
## 4 1000 1000
## 5 1000 1000
## 6 1000 1000
## 7 1000 1000
## 8 1000 1000
##
## proteins included in the analysis:
## m01 m02 m03 m04 m05 m06 m07 m08 m09 m10
##
## condition compared: condition
## grouping variable: donor
We plot the maximum likelihood estimates with 95% confidence intervals for the fixed effects β. The estimates are on the log -odds scale. We see that markers m1, m2, and m3 are strong predictors of the treatment. This means that one unit increase in the transformed marker expression makes it more likely to be a cell from the treatment group, while holding the other markers constant.
The summary
function returns a table about the model fit
with unadjusted and Benjamini-Hochberg (BH) adjusted p-values.
## # A tibble: 10 × 3
## protein_name pvalues_unadj pvalues_adj
## <chr> <dbl> <dbl>
## 1 m02 0.002 0.01
## 2 m03 0.002 0.01
## 3 m01 0.018 0.06
## 4 m06 0.324 0.81
## 5 m05 0.734 0.926
## 6 m08 0.766 0.926
## 7 m10 0.838 0.926
## 8 m07 0.856 0.926
## 9 m04 0.878 0.926
## 10 m09 0.926 0.926
We can extract the proteins below an False Discovery Rate (FDR) of 0.05 from the p-value table by filtering the table.
## # A tibble: 2 × 3
## protein_name pvalues_unadj pvalues_adj
## <chr> <dbl> <dbl>
## 1 m02 0.002 0.01
## 2 m03 0.002 0.01
In this simulated dataset, the markers m2 and m3 are below an FDR of 0.05.
In the CytoGLMM::cytoglmm
function, we make additional
modeling assumptions by adding a random effect term in the standard
logistic regression model to account for the subject effect
(group
). In paired experimental design—when the same donor
provides two samples, one for each
condition—CytoGLMM::cytoglmm
is statistically more
powerful.
glmm_fit = CytoGLMM::cytoglmm(df,
protein_names = protein_names,
condition = "condition",
group = "donor",
num_cores = 1)
glmm_fit
## number of cells per group and condition:
## control treatment
## 1 1000 1000
## 2 1000 1000
## 3 1000 1000
## 4 1000 1000
## 5 1000 1000
## 6 1000 1000
## 7 1000 1000
## 8 1000 1000
##
## proteins included in the analysis:
## m01 m02 m03 m04 m05 m06 m07 m08 m09 m10
##
## condition compared: condition
## grouping variable: donor
We plot the method of moments estimates with 95% confidence intervals for the fixed effects β.
The summary
function returns a table about the model fit
with unadjusted and BH adjusted p-values.
## # A tibble: 10 × 3
## protein_name pvalues_unadj pvalues_adj
## <chr> <dbl> <dbl>
## 1 m02 0.00000418 0.0000418
## 2 m03 0.000264 0.00132
## 3 m01 0.00133 0.00443
## 4 m06 0.294 0.735
## 5 m07 0.541 0.840
## 6 m10 0.548 0.840
## 7 m08 0.588 0.840
## 8 m09 0.684 0.855
## 9 m05 0.814 0.898
## 10 m04 0.898 0.898
We can extract the proteins below an FDR of 0.05 from the p-value table by filtering the table.
## # A tibble: 3 × 3
## protein_name pvalues_unadj pvalues_adj
## <chr> <dbl> <dbl>
## 1 m02 0.00000418 0.0000418
## 2 m03 0.000264 0.00132
## 3 m01 0.00133 0.00443
In this simulated dataset, the markers m1, m2 and m3 are below an FDR
of 0.05. This illustrates the improved power for paired samples as
CytoGLMM::cytoglmm
correctly detects all three
differentially expressed proteins. We performed extensive power
simulations and comparisons to other methods in Seiler et al. (2021).
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] magrittr_2.0.3 CytoGLMM_1.15.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] pROC_1.18.5 mbest_0.6 sandwich_3.1-1
## [4] rlang_1.1.4 compiler_4.4.1 flexmix_2.3-19
## [7] vctrs_0.6.5 reshape2_1.4.4 stringr_1.5.1
## [10] pkgconfig_2.0.3 fastmap_1.2.0 labeling_0.4.3
## [13] utf8_1.2.4 rmarkdown_2.28 prodlim_2024.06.25
## [16] nloptr_2.1.1 purrr_1.0.2 xfun_0.48
## [19] modeltools_0.2-23 cachem_1.1.0 jsonlite_1.8.9
## [22] recipes_1.1.0 highr_0.11 uuid_1.2-1
## [25] BiocParallel_1.39.0 parallel_4.4.1 R6_2.5.1
## [28] bslib_0.8.0 stringi_1.8.4 RColorBrewer_1.1-3
## [31] parallelly_1.38.0 boot_1.3-31 rpart_4.1.23
## [34] lubridate_1.9.3 jquerylib_0.1.4 Rcpp_1.0.13
## [37] iterators_1.0.14 knitr_1.48 future.apply_1.11.3
## [40] zoo_1.8-12 Matrix_1.7-1 splines_4.4.1
## [43] nnet_7.3-19 timechange_0.3.0 tidyselect_1.2.1
## [46] abind_1.4-8 yaml_2.3.10 timeDate_4041.110
## [49] doParallel_1.0.17 codetools_0.2-20 listenv_0.9.1
## [52] lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
## [55] withr_3.0.2 evaluate_1.0.1 future_1.34.0
## [58] survival_3.7-0 pillar_1.9.0 BiocManager_1.30.25
## [61] foreach_1.5.2 stats4_4.4.1 generics_0.1.3
## [64] ggplot2_3.5.1 munsell_0.5.1 scales_1.3.0
## [67] minqa_1.2.8 globals_0.16.3 class_7.3-22
## [70] glue_1.8.0 pheatmap_1.0.12 maketools_1.3.1
## [73] tools_4.4.1 sys_3.4.3 data.table_1.16.2
## [76] lme4_1.1-35.5 ModelMetrics_1.2.2.2 gower_1.0.1
## [79] buildtools_1.0.0 cowplot_1.1.3 grid_4.4.1
## [82] bigmemory_4.6.4 tidyr_1.3.1 ipred_0.9-15
## [85] colorspace_2.1-1 nlme_3.1-166 cli_3.6.3
## [88] bigmemory.sri_0.1.8 fansi_1.0.6 lava_1.8.0
## [91] dplyr_1.1.4 strucchange_1.5-4 gtable_0.3.6
## [94] logging_0.10-108 sass_0.4.9 digest_0.6.37
## [97] caret_6.0-94 ggrepel_0.9.6 farver_2.1.2
## [100] htmltools_0.5.8.1 lifecycle_1.0.4 factoextra_1.0.7
## [103] hardhat_1.4.0 MASS_7.3-61