Analyzing single-cell data can be surprisingly complicated. This is partially because single-cell data analysis is an incredibly active area of research, with new methods being published on a weekly - or even daily! - basis. As new tools are published, they often require researchers to learn unique, method-specific application programming interfaces (APIs) with distinct requirements for input data formatting, function syntax, and output data structure. In addition, analyzing single-cell data can be challenging because it often involves simultaneously asking questions at multiple levels of biological scope - the single-cell level, the cell subpopulation (i.e. cluster) level, and the whole-sample or whole-patient level - each of which has distinct data processing needs.
To address both of these challenges for high-dimensional cytometry,
{tidytof}
(“tidy” as in “tidy data”; “tof” as
in “CyTOF”,
a flagship high-dimensional cytometry technology) implements a concise,
integrated “grammar” of single-cell data analysis capable of answering a
variety of biological questions. Available as an open-source R package,
{tidytof}
provides an easy-to-use pipeline for analyzing
high-dimensional cytometry data by automating many common
data-processing tasks under a common “tidy data” interface.
This vignette introduces you to {tidytof}’s high-level API and shows
quick examples of how it can be applied to high-dimensional cytometry
datasets.
{tidytof} makes heavy use of two concepts that may be unfamiliar to R beginners. The first is the pipe (|>), which you can read about here. The second is “grouping” data in a data.frame or tibble using dplyr::group_by(), which you can read about here. Most {tidytof} users will also benefit from a relatively in-depth understanding of the dplyr package, which has a wonderful introductory vignette.
Everything else should be self-explanatory for both beginner and advanced R users, though if you have zero background in running R code, you should read this chapter of R for Data Science by Hadley Wickham.
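As a quick illustration of both ideas, here is a small dplyr-only example using a toy tibble (not a {tidytof} dataset):

library(dplyr)

# A toy tibble: four cells from two samples, with one marker measurement each
cells <-
    tibble(
        sample_id = c("a", "a", "b", "b"),
        cd45 = c(1.2, 3.4, 2.2, 0.8)
    )

# The pipe (|>) passes the result of one function to the next, and
# group_by() makes downstream verbs operate within each sample
cells |>
    group_by(sample_id) |>
    summarize(mean_cd45 = mean(cd45))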
Broadly speaking, {tidytof}’s functionality is organized to support the 3 levels of analysis inherent to single-cell data described above: the single-cell level, the cluster level, and the sample level. {tidytof} provides functions (or “verbs”) that operate at each of these levels of analysis:
Cell-level data:

- tof_read_data() reads single-cell data from FCS or CSV files on disk into a tidy data frame called a tof_tbl. tof_tbls represent each cell as a row and each protein measurement (or other piece of information associated with a given cell) as a column.
- tof_preprocess() transforms protein expression values using a user-provided function (e.g. log-transformation, centering, scaling).
- tof_downsample() reduces the number of cells in a tof_tibble via subsampling.
- tof_reduce_dimensions() performs dimensionality reduction (across columns).
- tof_write_data() writes single-cell data in a tof_tibble back to disk in the form of an FCS or CSV file.

Cluster-level data:

- tof_cluster() clusters cells using one of several algorithms commonly applied to high-dimensional cytometry data.
- tof_metacluster() agglomerates clusters into a smaller number of metaclusters.
- tof_analyze_abundance() performs differential abundance analysis (DAA) for clusters or metaclusters across experimental groups.
- tof_analyze_expression() performs differential expression analysis (DEA) for clusters’ or metaclusters’ marker expression levels across experimental groups.
- tof_extract_features() computes summary statistics (such as mean marker expression) for each cluster. It also (optionally) pivots these summary statistics into a sample-level tidy data frame in which each row represents a sample and each column represents a cluster-level summary statistic.

Sample-level data:

- tof_split_data() splits sample-level data into a training and test set for predictive modeling.
- tof_create_grid() creates an elastic net hyperparameter search grid for model tuning.
- tof_train_model() trains a sample-level elastic net model and saves it as a tof_model object.
- tof_predict() applies a trained tof_model to new data to predict sample-level outcomes.
- tof_assess_model() calculates performance metrics for a trained tof_model.
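For instance, a minimal cell-level sketch (not taken from this vignette) might look like the following; the directory path is hypothetical, and the default behaviors of tof_read_data(), tof_preprocess(), and tof_downsample() are assumed:

library(tidytof)

# Hypothetical folder of .fcs files; tof_read_data() is assumed to accept the
# path as its first argument (see ?tof_read_data for details)
fcs_directory <- "path/to/fcs_files"

fcs_directory |>
    tof_read_data() |>                       # one row per cell, one column per measurement
    tof_preprocess() |>                      # arcsinh-transform protein values (default)
    tof_downsample(method = "constant", num_cells = 1000) |> # subsample cells
    tof_reduce_dimensions(method = "tsne")   # append tSNE embedding columns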
With very few exceptions, {tidytof}
functions follow a
specific, shared syntax that involves 3 types of arguments that always
occur in the same order. These argument types are as follows:
1. Input data. For nearly all {tidytof} functions, the first argument is a data frame (or tibble). This enables the use of the pipe (|>) for multi-step calculations, which means that your first argument for most functions will be implicit (passed from the previous function using the pipe). This also means that most {tidytof} functions are so-called “single-table verbs,” with the exception of tof_cluster_ddpr, which is a “two-table verb” (for details about how to use tof_cluster_ddpr, see the “clustering-and-metaclustering” vignette).
2. Column specifications. Arguments whose names end in the suffix _col or _cols. Column specifications are unquoted column names that tell a {tidytof} verb which columns to compute over for a particular operation. For example, the cluster_cols argument in tof_cluster() allows the user to specify which columns in the input data frame should be used to perform the clustering. Regardless of which verb requires them, column specifications support tidyselect helpers and follow the same rules for tidyselection as tidyverse verbs like dplyr::select() and tidyr::pivot_longer().
3. Method specifications. All other arguments to a {tidytof} verb are called method specifications, and they comprise every argument that isn’t an input data frame or a column specification. Whereas column specifications represent which columns should be used to perform an operation, method specifications represent the details of how that operation should be performed. For example, the tof_cluster_phenograph() function requires the method specification num_neighbors, which specifies how many nearest neighbors should be used to construct the PhenoGraph algorithm’s k-nearest-neighbor graph. In most cases, {tidytof} sets reasonable defaults for each verb’s method specifications, but your workflows can also be customized by experimenting with non-default values.

The following code demonstrates how {tidytof} verb syntax looks in practice, with column and method specifications explicitly pointed out:
data(ddpr_data)
set.seed(777L)

ddpr_data |>
    tof_preprocess() |>
    tof_cluster(
        cluster_cols = starts_with("cd"), # column specification
        method = "phenograph" # method specification
    ) |>
    tof_metacluster(
        cluster_col = .phenograph_cluster, # column specification
        num_metaclusters = 4, # method specification
        method = "kmeans" # method specification
    ) |>
    tof_downsample(
        group_cols = .kmeans_metacluster, # column specification
        num_cells = 200, # method specification
        method = "constant" # method specification
    ) |>
    tof_plot_cells_layout(
        knn_cols = starts_with("cd"), # column specification
        color_col = .kmeans_metacluster, # column specification
        num_neighbors = 7L, # method specification
        node_size = 2L # method specification
    )
{tidytof}
verbs can be used on their own or in
combination with one another using the pipe (|>
)
operator. For example, here is a multistep “pipeline” that takes a
built-in {tidytof}
dataset and performs the following
analytical steps:
1. Arcsinh-transform each column of protein measurements (the default behavior of the tof_preprocess() verb)
2. Cluster our cells based on the surface markers in our panel
3. Downsample the dataset such that 400 random cells are picked from each cluster
4. Perform dimensionality reduction on the downsampled dataset using tSNE
5. Visualize the clusters using a low-dimensional tSNE embedding
ddpr_data |>
    # step 1
    tof_preprocess() |>
    # step 2
    tof_cluster(
        cluster_cols = starts_with("cd"),
        method = "phenograph",
        # num_metaclusters = 4L,
        seed = 2020L
    ) |>
    # step 3
    tof_downsample(
        group_cols = .phenograph_cluster,
        method = "constant",
        num_cells = 400
    ) |>
    # step 4
    tof_reduce_dimensions(method = "tsne") |>
    # step 5
    tof_plot_cells_embedding(
        embedding_cols = contains("tsne"),
        color_col = .phenograph_cluster
    ) +
    ggplot2::theme(legend.position = "none")
{tidytof}
was designed by a multidisciplinary team of
wet-lab biologists, bioinformaticians, and physician-scientists who
analyze high-dimensional cytometry and other kinds of single-cell data
to solve a variety of problems. As a result, {tidytof}
’s
high-level API was designed with great care to mirror that of the
{tidyverse}
itself - that is, to be human-centered,
consistent, composable, and inclusive for a wide userbase.
Practically speaking, this means a few things about using
{tidytof}
.
First, it means that {tidytof}
was designed with a few
quality-of-life features in mind. For example, you may notice that most
{tidytof}
functions begin with the prefix
tof_
. This is intentional, as it will allow you to use your
development environment’s code-completion tools to search for
{tidytof}
functions easily (even if you can’t remember a
specific function name). For this reason, we recommend using
{tidytof}
within the RStudio development environment;
however, many code editors have predictive text functionality that
serves a similar function. In general, {tidytof}
verbs are
organized in such a way that your IDE’s code-completion tools should
also allow you to search for (and compare) related functions with
relative ease. (For instance, the tof_cluster_
prefix is
used for all clustering functions, and the tof_downsample_
prefix is used for all downsampling functions).
Second, it means that {tidytof}
functions
should be relatively intuitive to use due to their shared logic
- in other words, if you understand how to use one
{tidytof}
function, you should understand how to use most
of the others. An example of shared logic across {tidytof}
functions is the argument group_cols
, which shows up in
multiple verbs (tof_downsample
, tof_cluster
,
tof_daa
, tof_dea
,
tof_extract_features
, and tof_write_data
). In
each case, group_cols
works the same way: it accepts an
unquoted vector of column names (specified manually or using tidyselection)
that should be used to group cells before an operation is performed.
This idea generalizes throughout {tidytof}
: if you see an
argument in one place, it will behave identically (or at least very
similarly) wherever else you encounter it.
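For instance, here is a short sketch of group_cols in tof_downsample(), reusing the built-in ddpr_data dataset and the .phenograph_cluster column created by tof_cluster() (the same calls used in the pipeline example above):

# group_cols behaves the same way here as it does in the other verbs:
# cells are grouped (by PhenoGraph cluster) before the operation, so
# 100 cells are sampled from each cluster rather than from the whole dataset
ddpr_data |>
    tof_preprocess() |>
    tof_cluster(cluster_cols = starts_with("cd"), method = "phenograph") |>
    tof_downsample(
        group_cols = .phenograph_cluster,
        method = "constant",
        num_cells = 100
    )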
Finally, it means that {tidytof}
is optimized first for
ease-of-use, then for performance. Because humans and computers interact
with data differently, there is always a trade-off between choosing a
data representation that is intuitive to a human user vs. choosing a
data representation optimized for computational speed and memory
efficiency. When these design choices conflict with one another, our
team tends to err on the side of choosing a representation that is
easy-to-understand for users even at the expense of small performance
costs. Ultimately, this means that {tidytof}
may not be the
optimal tool for every high-dimensional cytometry analysis, though
hopefully its general framework will provide most users with some useful
functionality.
{tidytof}
includes multiple vignettes that cover
different components of the prototypical high-dimensional cytometry data
analysis pipeline. You can access these vignettes by running the
following:
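One standard way to do this is with base R’s vignette utilities:

# List the vignettes installed with {tidytof} and open them in a browser
browseVignettes(package = "tidytof")

# Or print the list of available vignettes to the console
vignette(package = "tidytof")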
To learn the basics, we recommend visiting the vignettes in the following order to start with smaller (cell-level) operations and work your way up to larger (cluster- and sample-level) operations:
You can also read the academic papers describing {tidytof} and the larger tidyomics initiative of which {tidytof} is a part, or visit the {tidytof} website.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] tidyr_1.3.1 stringr_1.5.1
#> [3] HDCytoData_1.26.0 flowCore_2.19.0
#> [5] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [7] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
#> [9] IRanges_2.41.1 S4Vectors_0.45.2
#> [11] MatrixGenerics_1.19.0 matrixStats_1.4.1
#> [13] ExperimentHub_2.15.0 AnnotationHub_3.15.0
#> [15] BiocFileCache_2.15.0 dbplyr_2.5.0
#> [17] BiocGenerics_0.53.3 generics_0.1.3
#> [19] forcats_1.0.0 ggplot2_3.5.1
#> [21] dplyr_1.1.4 tidytof_1.1.0
#> [23] rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] sys_3.4.3 jsonlite_1.8.9 shape_1.4.6.1
#> [4] magrittr_2.0.3 farver_2.1.2 zlibbioc_1.52.0
#> [7] vctrs_0.6.5 memoise_2.0.1 htmltools_0.5.8.1
#> [10] S4Arrays_1.7.1 curl_6.0.1 SparseArray_1.7.2
#> [13] sass_0.4.9 parallelly_1.39.0 bslib_0.8.0
#> [16] lubridate_1.9.3 cachem_1.1.0 buildtools_1.0.0
#> [19] igraph_2.1.1 mime_0.12 lifecycle_1.0.4
#> [22] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.7-1
#> [25] R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
#> [28] future_1.34.0 digest_0.6.37 colorspace_2.1-1
#> [31] furrr_0.3.1 AnnotationDbi_1.69.0 irlba_2.3.5.1
#> [34] RSQLite_2.3.7 philentropy_0.9.0 labeling_0.4.3
#> [37] filelock_1.0.3 cytolib_2.19.0 fansi_1.0.6
#> [40] yardstick_1.3.1 timechange_0.3.0 httr_1.4.7
#> [43] polyclip_1.10-7 abind_1.4-8 compiler_4.4.2
#> [46] bit64_4.5.2 withr_3.0.2 doParallel_1.0.17
#> [49] viridis_0.6.5 DBI_1.2.3 ggforce_0.4.2
#> [52] MASS_7.3-61 lava_1.8.0 embed_1.1.4
#> [55] rappdirs_0.3.3 DelayedArray_0.33.2 tools_4.4.2
#> [58] future.apply_1.11.3 nnet_7.3-19 glue_1.8.0
#> [61] grid_4.4.2 Rtsne_0.17 recipes_1.1.0
#> [64] gtable_0.3.6 tzdb_0.4.0 class_7.3-22
#> [67] rsample_1.2.1 data.table_1.16.2 hms_1.1.3
#> [70] tidygraph_1.3.1 utf8_1.2.4 XVector_0.47.0
#> [73] RcppAnnoy_0.0.22 ggrepel_0.9.6 BiocVersion_3.21.1
#> [76] foreach_1.5.2 pillar_1.9.0 vroom_1.6.5
#> [79] RcppHNSW_0.6.0 splines_4.4.2 tweenr_2.0.3
#> [82] lattice_0.22-6 survival_3.7-0 bit_4.5.0
#> [85] emdist_0.3-3 RProtoBufLib_2.19.0 tidyselect_1.2.1
#> [88] Biostrings_2.75.1 maketools_1.3.1 knitr_1.49
#> [91] gridExtra_2.3 xfun_0.49 graphlayouts_1.2.0
#> [94] hardhat_1.4.0 timeDate_4041.110 stringi_1.8.4
#> [97] UCSC.utils_1.3.0 yaml_2.3.10 evaluate_1.0.1
#> [100] codetools_0.2-20 ggraph_2.2.1 tibble_3.2.1
#> [103] BiocManager_1.30.25 cli_3.6.3 uwot_0.2.2
#> [106] rpart_4.1.23 munsell_0.5.1 jquerylib_0.1.4
#> [109] Rcpp_1.0.13-1 globals_0.16.3 png_0.1-8
#> [112] parallel_4.4.2 gower_1.0.1 readr_2.1.5
#> [115] blob_1.2.4 listenv_0.9.1 glmnet_4.1-8
#> [118] viridisLite_0.4.2 ipred_0.9-15 ggridges_0.5.6
#> [121] scales_1.3.0 prodlim_2024.06.25 purrr_1.0.2
#> [124] crayon_1.5.3 rlang_1.1.4 KEGGREST_1.47.0