

library(tidytof)
library(dplyr)
library(ggplot2)

count <- dplyr::count

High-dimensional cytometry experiments often collect hundreds of thousands or millions of cells in total, and it can be useful to downsample to a smaller, more computationally tractable number of cells - either for a final analysis or while developing code.

To do this, {tidytof} implements the tof_downsample() verb, which allows downsampling using 3 methods: downsampling to an integer number of cells, downsampling to a fixed proportion of the total number of input cells, or downsampling to a fixed cellular density in phenotypic space.

Downsampling with tof_downsample()

Using {tidytof}’s built-in dataset phenograph_data, we can see that the original size of the dataset is 1000 cells per cluster, or 3000 cells in total:


phenograph_data |>
    count(phenograph_cluster)
#> # A tibble: 3 × 2
#>   phenograph_cluster     n
#>   <chr>              <int>
#> 1 cluster1            1000
#> 2 cluster2            1000
#> 3 cluster3            1000

To randomly sample 200 cells per cluster, we can call tof_downsample() with the “constant” method:

phenograph_data |>
    # downsample
    tof_downsample(
        group_cols = phenograph_cluster,
        method = "constant",
        num_cells = 200
    ) |>
    # count the number of downsampled cells in each cluster
    count(phenograph_cluster)
#> # A tibble: 3 × 2
#>   phenograph_cluster     n
#>   <chr>              <int>
#> 1 cluster1             200
#> 2 cluster2             200
#> 3 cluster3             200

Alternatively, if we wanted to sample 50% of the cells in each cluster, we could use the “prop” method:

phenograph_data |>
    # downsample
    tof_downsample(
        group_cols = phenograph_cluster,
        method = "prop",
        prop_cells = 0.5
    ) |>
    # count the number of downsampled cells in each cluster
    count(phenograph_cluster)
#> # A tibble: 3 × 2
#>   phenograph_cluster     n
#>   <chr>              <int>
#> 1 cluster1             500
#> 2 cluster2             500
#> 3 cluster3             500
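
For intuition, the per-group behavior of the “constant” and “prop” methods resembles grouped random sampling with dplyr::slice_sample(). The sketch below is only an analogy on a made-up table (toy_cells and its columns are hypothetical stand-ins for phenograph_data); it is not a description of how tof_downsample() is implemented.

```r
library(dplyr)

set.seed(2024)

# Hypothetical stand-in for phenograph_data: 3 clusters of 1000 cells each,
# with a single made-up marker column.
toy_cells <- tibble(
    cluster = rep(c("cluster1", "cluster2", "cluster3"), each = 1000),
    marker = rnorm(3000)
)

# "constant"-style sampling: a fixed number of cells from each group
constant_sample <- toy_cells |>
    group_by(cluster) |>
    slice_sample(n = 200) |>
    ungroup()

# "prop"-style sampling: a fixed proportion of the cells in each group
prop_sample <- toy_cells |>
    group_by(cluster) |>
    slice_sample(prop = 0.5) |>
    ungroup()

# cells retained per cluster under each scheme
dplyr::count(constant_sample, cluster)
dplyr::count(prop_sample, cluster)
```

Because the sampling is random, calling set.seed() beforehand is what makes a sketch like this repeatable from run to run.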

Finally, we might want a different approach to downsampling, one that reduces the number of cells not to a fixed count or proportion but to a fixed density in phenotypic space. For example, the following plot of binned cell counts shows that some regions of phenotypic space in phenograph_data contain more cells than others along the cd34/cd38 axes:

# rescale values so that the maximum of `from` maps to the maximum of `to`
rescale_max <-
    function(x, to = c(0, 1), from = range(x, na.rm = TRUE)) {
        x / from[2] * to[2]
    }

phenograph_data |>
    # preprocess all numeric columns in the dataset
    tof_preprocess(undo_noise = FALSE) |>
    # plot
    ggplot(aes(x = cd34, y = cd38)) +
    geom_hex() +
    coord_fixed(ratio = 0.4) +
    scale_x_continuous(limits = c(NA, 1.5)) +
    scale_y_continuous(limits = c(NA, 4)) +
    scale_fill_viridis_c(
        labels = function(x) round(rescale_max(x), 2)
    ) +
    labs(
        fill = "relative density"
    )

To reduce the number of cells in our dataset until the local density around each cell in our dataset is relatively constant, we can use the “density” method of tof_downsample:

phenograph_data |>
    tof_preprocess(undo_noise = FALSE) |>
    tof_downsample(method = "density", density_cols = c(cd34, cd38)) |>
    # plot
    ggplot(aes(x = cd34, y = cd38)) +
    geom_hex() +
    coord_fixed(ratio = 0.4) +
    scale_x_continuous(limits = c(NA, 1.5)) +
    scale_y_continuous(limits = c(NA, 4)) +
    scale_fill_viridis_c(
        labels = function(x) round(rescale_max(x), 2)
    ) +
    labs(
        fill = "relative density"
    )

Thus, we can see that the density after downsampling is more uniform (though not exactly uniform) across the range of cd34/cd38 values in phenograph_data.
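
The general idea behind density-dependent downsampling can be sketched in a few lines of base R: estimate a local density for each cell, pick a target density, and keep each cell with a probability inversely proportional to its local density. The code below is a conceptual illustration on simulated data only. The neighbor-counting density estimate, the radius, and the 10th-percentile target are all assumptions chosen for the example, and tof_downsample() may estimate density and choose its target differently.

```r
set.seed(1)

# Simulated 2D "phenotypic space": a dense blob of 1000 cells and a
# sparse blob of 100 cells (made-up data, not phenograph_data).
cells <- rbind(
    cbind(x = rnorm(1000, mean = 0, sd = 0.3), y = rnorm(1000, mean = 0, sd = 0.3)),
    cbind(x = rnorm(100, mean = 3, sd = 0.3), y = rnorm(100, mean = 3, sd = 0.3))
)
blob <- rep(c("dense", "sparse"), times = c(1000, 100))

# Local density: number of cells within `radius` of each cell.
# Each cell counts itself, so the density is always at least 1.
radius <- 0.2
pairwise_distances <- as.matrix(dist(cells))
local_density <- rowSums(pairwise_distances <= radius)

# Target density: a low percentile of the observed local densities
# (an arbitrary choice for this sketch).
target_density <- unname(quantile(local_density, probs = 0.1))

# Keep each cell with probability target / local density, capped at 1,
# so cells in crowded regions are discarded more often.
keep_probability <- pmin(1, target_density / local_density)
keep <- runif(nrow(cells)) < keep_probability

# cells per blob before and after downsampling
table(blob)
table(blob[keep])
```

Cells in the crowded blob are retained at a much lower rate than cells in the sparse one, which is why the post-downsampling density is flatter without being exactly uniform.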

Additional documentation

For more details, check out the documentation for the 3 underlying members of the tof_downsample_* function family (which are wrapped by tof_downsample):

  • tof_downsample_constant
  • tof_downsample_prop
  • tof_downsample_density

Session info

#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/ 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/;  LAPACK version 3.12.0
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> other attached packages:
#>  [1] tidyr_1.3.1                 stringr_1.5.1              
#>  [3] HDCytoData_1.26.0           flowCore_2.19.0            
#>  [5] SummarizedExperiment_1.37.0 Biobase_2.67.0             
#>  [7] GenomicRanges_1.59.1        GenomeInfoDb_1.43.4        
#>  [9] IRanges_2.41.3              S4Vectors_0.45.4           
#> [11] MatrixGenerics_1.19.1       matrixStats_1.5.0          
#> [13] ExperimentHub_2.15.0        AnnotationHub_3.15.0       
#> [15] BiocFileCache_2.15.1        dbplyr_2.5.0               
#> [17] BiocGenerics_0.53.6         generics_0.1.3             
#> [19] forcats_1.0.0               ggplot2_3.5.1              
#> [21] dplyr_1.1.4                 tidytof_1.1.0              
#> [23] rmarkdown_2.29             
#> loaded via a namespace (and not attached):
#>   [1] sys_3.4.3               jsonlite_1.8.9          shape_1.4.6.1          
#>   [4] magrittr_2.0.3          farver_2.1.2            vctrs_0.6.5            
#>   [7] memoise_2.0.1           htmltools_0.5.8.1       S4Arrays_1.7.3         
#>  [10] curl_6.2.0              SparseArray_1.7.5       sass_0.4.9             
#>  [13] parallelly_1.42.0       bslib_0.9.0             lubridate_1.9.4        
#>  [16] cachem_1.1.0            buildtools_1.0.0        igraph_2.1.4           
#>  [19] mime_0.12               lifecycle_1.0.4         iterators_1.0.14       
#>  [22] pkgconfig_2.0.3         Matrix_1.7-2            R6_2.6.1               
#>  [25] fastmap_1.2.0           GenomeInfoDbData_1.2.13 future_1.34.0          
#>  [28] digest_0.6.37           colorspace_2.1-1        AnnotationDbi_1.69.0   
#>  [31] irlba_2.3.5.1           RSQLite_2.3.9           labeling_0.4.3         
#>  [34] filelock_1.0.3          cytolib_2.19.3          yardstick_1.3.2        
#>  [37] timechange_0.3.0        httr_1.4.7              polyclip_1.10-7        
#>  [40] abind_1.4-8             compiler_4.4.2          bit64_4.6.0-1          
#>  [43] withr_3.0.2             doParallel_1.0.17       viridis_0.6.5          
#>  [46] DBI_1.2.3               ggforce_0.4.2           MASS_7.3-64            
#>  [49] lava_1.8.1              embed_1.1.5             rappdirs_0.3.3         
#>  [52] DelayedArray_0.33.6     tools_4.4.2             future.apply_1.11.3    
#>  [55] nnet_7.3-20             glue_1.8.0              grid_4.4.2             
#>  [58] Rtsne_0.17              recipes_1.1.1           gtable_0.3.6           
#>  [61] tzdb_0.4.0              class_7.3-23            data.table_1.16.4      
#>  [64] hms_1.1.3               utf8_1.2.4              tidygraph_1.3.1        
#>  [67] XVector_0.47.2          RcppAnnoy_0.0.22        ggrepel_0.9.6          
#>  [70] BiocVersion_3.21.1      foreach_1.5.2           pillar_1.10.1          
#>  [73] RcppHNSW_0.6.0          splines_4.4.2           tweenr_2.0.3           
#>  [76] lattice_0.22-6          survival_3.8-3          bit_4.5.0.1            
#>  [79] RProtoBufLib_2.19.0     tidyselect_1.2.1        Biostrings_2.75.3      
#>  [82] maketools_1.3.2         knitr_1.49              gridExtra_2.3          
#>  [85] xfun_0.50               graphlayouts_1.2.2      hardhat_1.4.1          
#>  [88] timeDate_4041.110       stringi_1.8.4           UCSC.utils_1.3.1       
#>  [91] yaml_2.3.10             evaluate_1.0.3          codetools_0.2-20       
#>  [94] ggraph_2.2.1            tibble_3.2.1            BiocManager_1.30.25    
#>  [97] cli_3.6.4               uwot_0.2.2              rpart_4.1.24           
#> [100] munsell_0.5.1           jquerylib_0.1.4         Rcpp_1.0.14            
#> [103] globals_0.16.3          png_0.1-8               parallel_4.4.2         
#> [106] gower_1.0.2             readr_2.1.5             blob_1.2.4             
#> [109] listenv_0.9.1           glmnet_4.1-8            viridisLite_0.4.2      
#> [112] ipred_0.9-15            ggridges_0.5.6          scales_1.3.0           
#> [115] prodlim_2024.06.25      purrr_1.0.4             crayon_1.5.3           
#> [118] rlang_1.1.5             KEGGREST_1.47.0