# Load required packages
suppressPackageStartupMessages({
  library(treekoR)
  library(SingleCellExperiment)
  library(ggtree)
})

Installation

# Install the development version from GitHub:
# install.packages("devtools")
devtools::install_github("adam2o1o/treekoR")
library(treekoR)

Overview

treekoR is a novel framework that aims to utilise the hierarchical nature of single cell cytometry data, to find robust and interpretable associations between cell subsets and patient clinical end points. This is achieved by deriving the tree structure of cell clusters, followed by measuring the %parent (proportions of each node in the tree relative to the number of cells belonging to the immediate parent node), in addition to the %total (proportion of cells in each node relative to all cells). These proportions are then used in significance testing and classification models to determine which cell subpopulation proportions most correlated with the patient clinical outcome of interest. treekoR then provides an interactive visualisation which helps to highlight these results.

Usage

Example Data

DeBiasi_COVID_CD8_samp: A SingleCellExperiment containing samples of flow cytometry expression data from 39 patients. This data represents a subset of a dataset that was originally used by De Biasi et al. (2020) for the characterisation of CD8+ T cells, comparing between COVID-19 patients and healthy controls.

data(COVIDSampleData)

sce <- DeBiasi_COVID_CD8_samp

Set up

treekoR requires the following information in the variables:

exprs: Single cell expression data (n × p), where p is the number of markers, and n is the number of cells
clusters: a vector of length n representing the cell type or cluster of each cell (can be character or numeric)
classes: a vector of length n containing the patient outcome/class each cell belongs to
samples: a vector of length n identifying the patient each cell belongs to

In this example: the clusters contain 100 clusters generated by FlowSOM; classes identify whether the cell belongs to a COVID-19 or healthy patient; and samples identify which cell the patient comes from.

exprs <- t(assay(sce, "exprs"))
clusters <- colData(sce)$cluster_id
classes <- colData(sce)$condition
samples <- colData(sce)$sample_id

Construction of Hierarchy of Clusters

The scaled median marker expression for each cluster is calculated which is used to construct a hierarchical tree.

In this step, the choice of hierarchical aggregation method (which determines the structure of the tree) is determined. By default the framework chooses HOPACH to construct the tree via the hierarchy_method argument, however any of the methods in hclust can be used (see @ref(hc-methods)).

clust_tree <- getClusterTree(exprs,
                             clusters,
                             hierarchy_method="hopach")

Significance testing of Cell Subpopulations

Proportions of each cell cluster in the tree are calculated - both the proportion relative to all and proportion relative to the hierarchical parent. These proportions are used in a two sample t-test, testing for equal means between the patient clinical outcome using both types of proportions.

tested_tree <- testTree(phylo=clust_tree$clust_tree,
                      clusters=clusters,
                      samples=samples,
                      classes=classes,
                      pos_class_name=NULL)

Extracting significance results

node: unique identifier for each node in the hierarchical tree
parent: the node of the parent
isTip: whether the node is a leaf node in the tree
clusters: the clusters belonging to the corresponding node
stat_all: test statistic obtained from testing between conditions using the proportion of the node relative to all cells (%total) in each sample. pval_total is the corresponding p-value (unadjusted)
stat_parent: test statistic obtained from testing between conditions using the proportion of the node relative to cells in the parent node (%parent) in each sample. pval_parent is the corresponding p-value (unadjusted)

res_df <- getTreeResults(tested_tree)

head(res_df, 10)
#>     parent node isTip     clusters stat_total stat_parent  pval_total
#> 51     139   51  TRUE           77  1.0823881    4.439944 0.290116635
#> 52     139   52  TRUE           66 -1.9247385   -4.439944 0.069760334
#> 63     147   63  TRUE           41  1.8107720    3.337019 0.080505219
#> 67     150   67  TRUE           32  1.7508315    3.229075 0.092603817
#> 136    135  136 FALSE   58, 67, 57  1.4290582   -3.198607 0.163963670
#> 71     150   71  TRUE           54 -2.3514750   -2.829766 0.033743471
#> 125    100  125 FALSE 85, 84, ....  2.7073249    2.707325 0.011833319
#> 20     115   20  TRUE           89 -0.7764417   -2.897438 0.446585533
#> 116    115  116 FALSE 59, 88, ....  3.0537776    2.897438 0.005462897
#> 140    100  140 FALSE 35, 43, .... -2.5102007   -2.510201 0.017809639
#>      pval_parent
#> 51  0.0003193122
#> 52  0.0003193122
#> 63  0.0022734266
#> 67  0.0033277959
#> 136 0.0041906737
#> 71  0.0082310590
#> 125 0.0118333189
#> 20  0.0124821272
#> 116 0.0124821272
#> 140 0.0178096392

Interactive Visualisation

The results of the previous steps are visualised by a coloured tree with a corresponding heatmap. The heatmap displays the median scaled marker expressions of each cluster to help understand what cell type each cluster may represent, and the tree not only reveals how clusters have been hierarchically aggregated, but is coloured on each node by the test statistic obtained when testing using the proportions relative to all of that node, with the branch connecting the child to the parent coloured by the test statistic obtained when testing using the proportions relative to parent of the child node.

plotInteractiveHeatmap(tested_tree,
                       clust_med_df = clust_tree$median_freq,
                       clusters=clusters)
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_interactive_point()`).

Using different hierarchical aggregations

Below we change the hierarchical aggregation technique to average-linkage hierarchical clustering. The available options include any of the available ethods in the hclust() function, ie, one of "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid"

clust_tree <- getClusterTree(exprs,
                             clusters,
                             hierarchy_method="average")

tested_tree <- testTree(clust_tree$clust_tree,
                      clusters=clusters,
                      samples=samples,
                      classes=classes,
                      pos_class_name=NULL)

plotInteractiveHeatmap(tested_tree,
                       clust_med_df = clust_tree$median_freq,
                       clusters=clusters)
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_interactive_point()`).

Using counts models for significance testing

Below we change the significance test to use count models instead of a t-test/Wilcoxon test. The available methods are "GLMM" or "edgeR"

clust_tree <- getClusterTree(exprs,
                             clusters,
                             hierarchy_method="hopach")

tested_tree_edgeR <- testTree(clust_tree$clust_tree,
                      clusters=clusters,
                      samples=samples,
                      classes=classes,
                      sig_test="edgeR",
                      pos_class_name="COV")
#> Using classic mode.

plotInteractiveHeatmap(tested_tree_edgeR,
                       clust_med_df = clust_tree$median_freq,
                       clusters=clusters)

You can also extract edgeR FDR values as so:

head(tested_tree_edgeR$data[,c("parent", "node", "isTip", "clusters", "FDR_parent", "FDR_total")])
#> # A tibble: 6 × 6
#>   parent  node isTip clusters  FDR_parent FDR_total
#>    <int> <dbl> <lgl> <list>         <dbl>     <dbl>
#> 1    102     1 TRUE  <chr [1]>      0.890     0.158
#> 2    102     2 TRUE  <chr [1]>      0.808     0.157
#> 3    104     3 TRUE  <chr [1]>      0.794     0.920
#> 4    104     4 TRUE  <chr [1]>      0.278     0.107
#> 5    105     5 TRUE  <chr [1]>      0.794     0.331
#> 6    105     6 TRUE  <chr [1]>      0.943     0.636

Feature extraction

Extract cell type proportions (both %total and %parent), as well as the geometric mean of each marker per cell type. These can then be used for further investigation via visualisation or machine learning.

Extracting Proportions

The function below returns a dataframe containing the absolute proportions and proportions to parent for each cell type for each sample.

perc_parent_n_celltype denotes the proportion of the cell type relative to its parent (%parent) n generations away, ie. perc_parent_1_2 is the proportion of cluster 2 relative to its direct parent in the hierarchical tree.

perc_total_celltype denotes the proportion of the cell type relative all cells (%total)

prop_df <- getCellProp(phylo=clust_tree$clust_tree,
                       clusters=clusters,
                       samples=samples,
                       classes=classes)

head(prop_df[,1:8])
#>   sample_id class perc_total_92 perc_total_91 perc_total_7 perc_total_50
#> 1      COV1   COV   0.029411765   0.005882353   0.01176471   0.000000000
#> 2     COV10   COV   0.066666667   0.044444444   0.00000000   0.005555556
#> 3     COV12   COV   0.051428571   0.000000000   0.03428571   0.000000000
#> 4     COV13   COV   0.006410256   0.000000000   0.01923077   0.012820513
#> 5     COV14   COV   0.019867550   0.019867550   0.00000000   0.006622517
#> 6     COV15   COV   0.000000000   0.000000000   0.00000000   0.000000000
#>   perc_total_5 perc_total_39
#> 1  0.005882353   0.000000000
#> 2  0.016666667   0.005555556
#> 3  0.005714286   0.000000000
#> 4  0.038461538   0.000000000
#> 5  0.006622517   0.000000000
#> 6  0.006993007   0.000000000

Extracting Expression Geometric Means

The function below returns a dataframe containing the geometric mean for each marker for each cell type for each sample.

m_gmean_celltype denotes the geometric mean of marker m in the cell type per sample

means_df <- getCellGMeans(clust_tree$clust_tree,
                          exprs=exprs,
                          clusters=clusters,
                          samples=samples,
                          classes=classes)

head(means_df[,1:8])
#>   sample_id class CD45RA B525-FITC-H_gmean_92 CD45RA B525-FITC-H_gmean_91
#> 1      COV1   COV                    8.026009                   10.271191
#> 2     COV10   COV                    8.804902                    8.432155
#> 3     COV12   COV                    8.268734                         NaN
#> 4     COV13   COV                   10.221825                         NaN
#> 5     COV14   COV                    9.768303                   10.470112
#> 6     COV15   COV                         NaN                         NaN
#>   CD45RA B525-FITC-H_gmean_7 CD45RA B525-FITC-H_gmean_50
#> 1                   7.551528                         NaN
#> 2                        NaN                    6.360800
#> 3                   6.611482                         NaN
#> 4                   8.369937                   10.629523
#> 5                        NaN                    8.597188
#> 6                        NaN                         NaN
#>   CD45RA B525-FITC-H_gmean_5 CD45RA B525-FITC-H_gmean_39
#> 1                   7.126569                         NaN
#> 2                   8.496825                    7.683541
#> 3                   9.265149                         NaN
#> 4                   8.178744                         NaN
#> 5                   8.543981                         NaN
#> 6                  11.036683                         NaN

sessionInfo()
#> R version 4.4.3 (2025-02-28)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] ggtree_3.15.0               SingleCellExperiment_1.29.2
#>  [3] SummarizedExperiment_1.37.0 Biobase_2.67.0             
#>  [5] GenomicRanges_1.59.1        GenomeInfoDb_1.43.4        
#>  [7] IRanges_2.41.3              S4Vectors_0.45.4           
#>  [9] BiocGenerics_0.53.6         generics_0.1.3             
#> [11] MatrixGenerics_1.19.1       matrixStats_1.5.0          
#> [13] treekoR_1.15.0              BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.3               ggplotify_0.1.2            
#>   [3] tibble_3.2.1                polyclip_1.10-7            
#>   [5] XML_3.99-0.18               lifecycle_1.0.4            
#>   [7] Rdpack_2.6.3                rstatix_0.7.2              
#>   [9] edgeR_4.5.9                 doParallel_1.0.17          
#>  [11] lattice_0.22-6              MASS_7.3-65                
#>  [13] backports_1.5.0             magrittr_2.0.3             
#>  [15] limma_3.63.10               sass_0.4.9                 
#>  [17] rmarkdown_2.29              jquerylib_0.1.4            
#>  [19] yaml_2.3.10                 plotrix_3.8-4              
#>  [21] cowplot_1.1.3               buildtools_1.0.0           
#>  [23] minqa_1.2.8                 RColorBrewer_1.1-3         
#>  [25] ConsensusClusterPlus_1.71.0 multcomp_1.4-28            
#>  [27] abind_1.4-8                 Rtsne_0.17                 
#>  [29] purrr_1.0.4                 yulab.utils_0.2.0          
#>  [31] TH.data_1.1-3               tweenr_2.0.3               
#>  [33] sandwich_3.1-1              circlize_0.4.16            
#>  [35] GenomeInfoDbData_1.2.13     ggrepel_0.9.6              
#>  [37] irlba_2.3.5.1               CATALYST_1.31.2            
#>  [39] tidytree_0.4.6              maketools_1.3.2            
#>  [41] codetools_0.2-20            DelayedArray_0.33.6        
#>  [43] scuttle_1.17.0              ggforce_0.4.2              
#>  [45] tidyselect_1.2.1            shape_1.4.6.1              
#>  [47] aplot_0.2.5                 UCSC.utils_1.3.1           
#>  [49] farver_2.1.2                viridis_0.6.5              
#>  [51] ScaledMatrix_1.15.0         lme4_1.1-36                
#>  [53] jsonlite_1.9.1              hopach_2.67.0              
#>  [55] BiocNeighbors_2.1.3         GetoptLong_1.0.5           
#>  [57] Formula_1.2-5               ggridges_0.5.6             
#>  [59] survival_3.8-3              scater_1.35.4              
#>  [61] iterators_1.0.14            systemfonts_1.2.1          
#>  [63] foreach_1.5.2               tools_4.4.3                
#>  [65] ggnewscale_0.5.1            treeio_1.31.0              
#>  [67] Rcpp_1.0.14                 glue_1.8.0                 
#>  [69] gridExtra_2.3               SparseArray_1.7.7          
#>  [71] xfun_0.51                   dplyr_1.1.4                
#>  [73] withr_3.0.2                 BiocManager_1.30.25        
#>  [75] fastmap_1.2.0               boot_1.3-31                
#>  [77] rsvd_1.0.5                  digest_0.6.37              
#>  [79] R6_2.6.1                    gridGraphics_0.5-1         
#>  [81] colorspace_2.1-1            gtools_3.9.5               
#>  [83] tidyr_1.3.1                 data.table_1.17.0          
#>  [85] httr_1.4.7                  htmlwidgets_1.6.4          
#>  [87] S4Arrays_1.7.3              pkgconfig_2.0.3            
#>  [89] gtable_0.3.6                ComplexHeatmap_2.23.0      
#>  [91] RProtoBufLib_2.19.0         XVector_0.47.2             
#>  [93] diffcyt_1.27.3              sys_3.4.3                  
#>  [95] htmltools_0.5.8.1           carData_3.0-5              
#>  [97] clue_0.3-66                 scales_1.3.0               
#>  [99] png_0.1-8                   reformulas_0.4.0           
#> [101] ggfun_0.1.8                 colorRamps_2.3.4           
#> [103] knitr_1.50                  reshape2_1.4.4             
#> [105] rjson_0.2.23                uuid_1.2-1                 
#> [107] nlme_3.1-167                nloptr_2.2.1               
#> [109] cachem_1.1.0                zoo_1.8-13                 
#> [111] GlobalOptions_0.1.2         stringr_1.5.1              
#> [113] vipor_0.4.7                 parallel_4.4.3             
#> [115] pillar_1.10.1               grid_4.4.3                 
#> [117] vctrs_0.6.5                 ggpubr_0.6.0               
#> [119] BiocSingular_1.23.0         car_3.1-3                  
#> [121] cytolib_2.19.3              beachmat_2.23.7            
#> [123] cluster_2.1.8.1             beeswarm_0.4.0             
#> [125] evaluate_1.0.3              mvtnorm_1.3-3              
#> [127] cli_3.6.4                   locfit_1.5-9.12            
#> [129] compiler_4.4.3              rlang_1.1.5                
#> [131] crayon_1.5.3                ggsignif_0.6.4             
#> [133] labeling_0.4.3              FlowSOM_2.15.0             
#> [135] ggbeeswarm_0.7.2            flowCore_2.19.0            
#> [137] plyr_1.8.9                  fs_1.6.5                   
#> [139] ggiraph_0.8.12              stringi_1.8.4              
#> [141] viridisLite_0.4.2           BiocParallel_1.41.2        
#> [143] nnls_1.6                    munsell_0.5.1              
#> [145] lazyeval_0.2.2              Matrix_1.7-3               
#> [147] patchwork_1.3.0             ggplot2_3.5.1              
#> [149] statmod_1.5.0               rbibutils_2.3              
#> [151] drc_3.0-1                   igraph_2.1.4               
#> [153] broom_1.0.7                 bslib_0.9.0                
#> [155] ape_5.8-1