In addition to implementing its own built-in functions, {tidytof}
proposes a general framework for analyzing single-cell data using a tidy
interface. This framework centers on the use of “verbs,” i.e. modular
function families that represent specific data operations. Users may
wish to extend {tidytof}’s existing functionality by writing functions
that implement additional tidy interfaces to new algorithms or data
analysis methods not currently included in {tidytof}.
If you’re interested in contributing new functions to {tidytof}, this
vignette provides some details about how to do so.
To extend {tidytof} to include a new algorithm - for example, one that
you’ve just developed - you can take one of two general strategies (and
in some cases, you may take both!). The first is to write a
{tidytof}-style verb for your algorithm that can be included in your own
standalone package. The benefit of this approach is that following
{tidytof}’s design schema makes your algorithm easy for users to access
without learning much (if any) new syntax, while still allowing you to
maintain your code base independently of our team.
The second approach is to write a {tidytof}-style function that you’d
like our team to add to {tidytof} itself in its next release. In this
case, the code review process will take a bit of time, but it will also
allow our teams to collaborate and provide a greater degree of critical
feedback to one another as well as to share the burden of code
maintenance in the future.
In either case, you’re welcome to contact the {tidytof} team to review
your code via a pull request and/or an issue on the {tidytof} GitHub
page. This tutorial may be helpful if you don’t have a lot of experience
collaborating with other programmers via GitHub.
After you open your request, you can submit code to our team to be
reviewed. Whether you want your method to be incorporated into {tidytof}
or are simply looking for external code review/feedback from our team,
please mention this in your request.
{tidytof} uses the tidyverse style guide. Adhering to tidyverse style is
something our team will expect for any code being incorporated into
{tidytof}, and it’s also something we encourage for any functions you
write for your own analysis packages. In our experience, the best code
is written not just to be executed, but also to be read by other humans!
There are also many tools you can use to lint or automatically style
your R code, such as the {lintr} and {styler} packages.
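As a rough illustration of those tools, the snippet below restyles and
lints one deliberately untidy line of code. It assumes that the {styler}
and {lintr} packages are installed; the `messy_code` string is a made-up
example and not code from {tidytof}.

```r
# Assumes {styler} and {lintr} are installed.
# `messy_code` is a made-up example string, not code from {tidytof}.
messy_code <- "x=c(1,2)"

# style_text() returns a restyled copy of the code without touching files
styled <- styler::style_text(messy_code)
print(styled)

# lint() reports style problems rather than fixing them
lints <- lintr::lint(text = messy_code)
print(lints)
```

For a whole package, `styler::style_pkg()` and `lintr::lint_package()`
apply the same checks to every source file.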
In addition to writing well-styled code, we encourage you to write unit
tests for every function you write. This is common practice in the
software engineering world, but not as common as it probably should
be(!) in the bioinformatics community. The {tidytof} team uses the
{testthat} package for all of its unit tests, and there’s a great
tutorial for doing so here.
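For readers who have not used {testthat} before, a unit test can be as
small as the sketch below. The function `tof_example_rescale()` is a
hypothetical helper invented for this illustration (it is not part of
{tidytof}); the test simply checks that it returns the values we expect.

```r
library(testthat)

# Hypothetical helper used only for this illustration (not in {tidytof}):
# rescales a numeric vector onto the range [0, 1].
tof_example_rescale <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

test_that("tof_example_rescale maps values onto [0, 1]", {
  result <- tof_example_rescale(c(2, 4, 6))

  # The smallest value becomes 0, the largest 1, the midpoint 0.5
  expect_equal(result, c(0, 0.5, 1))
  # The output has one value per input value
  expect_length(result, 3)
})
```

In a package, tests like this one usually live in files under
`tests/testthat/` so that they run automatically during package checks.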
The most important part of writing a function that extends {tidytof} is
to adhere to {tidytof} verb syntax. With very few exceptions, {tidytof}
functions follow a specific, shared syntax that involves 3 types of
arguments that always occur in the same order. These argument types are
as follows:
1. For most {tidytof} functions, the first argument is a data frame (or
tibble). This enables the use of the pipe (|>) for multi-step
calculations, which means that your first argument for most functions
will be implicit (passed from the previous function using the pipe).
2. The second group of arguments are column specifications, whose names
end in the suffix _col or _cols. Column specifications are unquoted
column names that tell a {tidytof} verb which columns to compute over
for a particular operation. For example, the cluster_cols argument in
tof_cluster allows the user to specify which columns in the input data
frame should be used to perform the clustering. Regardless of which verb
requires them, column specifications support tidyselect helpers and
follow the same rules for tidyselection as tidyverse verbs like
dplyr::select() and tidyr::pivot_longer().
3. The remaining arguments to a {tidytof} verb are called method
specifications, and they consist of every argument that isn’t an input
data frame or a column specification. Whereas column specifications
represent which columns should be used to perform an operation, method
specifications represent the details of how that operation should be
performed. For example, the tof_cluster_phenograph() function requires
the method specification num_neighbors, which specifies how many nearest
neighbors should be used to construct the PhenoGraph algorithm’s
k-nearest-neighbor graph.
With few exceptions, any {tidytof} extension should include the same 3
argument types (in the same order).
In addition, any functions that extend {tidytof} should have a name that
starts with the prefix tof_. This will make it easier for users to find
{tidytof} functions using the text completion functionality included in
most development environments.
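The argument conventions described above can be sketched with a small,
self-contained function. Everything in this block is hypothetical and
for illustration only: `tof_example_center()`, its `center_cols` column
specification, its `center_fun` method specification, and the toy
`cells` tibble are not part of {tidytof}. The sketch assumes {dplyr} is
installed.

```r
library(dplyr)

# Hypothetical verb for illustration only; it is not part of {tidytof}.
# Argument order: data frame, column specification, method specification.
tof_example_center <- function(tof_tibble, center_cols, center_fun = mean) {
  tof_tibble |>
    mutate(
      # {{ }} forwards the unquoted column specification, so tidyselect
      # helpers such as starts_with() work here as well
      across({{ center_cols }}, \(x) x - center_fun(x))
    )
}

# Toy data standing in for a cytometry dataset
cells <- tibble(
  cd4 = c(1, 2, 3),
  cd8 = c(10, 20, 30),
  sample = c("a", "a", "b")
)

# The data frame is passed implicitly through the pipe
centered <-
  cells |>
  tof_example_center(center_cols = starts_with("cd"))

centered
```

Because the data frame comes first and is returned with the same rows,
a function written this way can be chained with other verbs using the
pipe.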
Extending an existing {tidytof} verb
{tidytof} includes multiple verbs that perform fundamental single-cell
data manipulation tasks. Currently, {tidytof}’s extensible verbs are the
following:
tof_analyze_abundance: Perform differential cluster abundance analysis
tof_analyze_expression: Perform differential marker expression analysis
tof_annotate_clusters: Annotate clusters with manual IDs
tof_batch_correct: Perform batch correction
tof_cluster: Cluster cells into subpopulations
tof_downsample: Subsample a dataset into a smaller number of cells
tof_extract: Calculate sample-level summary statistics
tof_metacluster: Metacluster clusters into a smaller number of subpopulations
tof_plot_cells: Plot cell-level data
tof_plot_clusters: Plot cluster-level data
tof_plot_model: Plot the results of a sample-level model
tof_read_data: Read data into memory from disk
tof_reduce_dimensions: Perform dimensionality reduction
tof_transform: Transform marker expression values in a vectorized fashion
tof_upsample: Assign new cells to existing clusters (defined on a downsampled dataset)
tof_write_data: Write data from memory to disk
Each {tidytof} verb wraps a family of related functions that all perform
the same basic task. For example, the tof_cluster verb is a wrapper for
the following functions: tof_cluster_ddpr, tof_cluster_flowsom,
tof_cluster_kmeans, and tof_cluster_phenograph. All of these functions
implement a different clustering algorithm, but they share an underlying
logic that is standardized under the tof_cluster abstraction. In
practice, this means that users can apply the DDPR, FlowSOM, K-means,
and PhenoGraph clustering algorithms to their datasets either by calling
one of the tof_cluster_* functions directly, or by calling tof_cluster
with the method argument set to the appropriate value (“ddpr”,
“flowsom”, “kmeans”, and “phenograph”, respectively).
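To make the verb/family relationship concrete, here is a minimal sketch
of how a verb can forward to the family member named by its method
argument. This is not {tidytof}’s actual implementation: the
`tof_example_summarize*()` functions, the `summary_cols` argument, and
the toy `cells` tibble are all hypothetical, and the sketch assumes
{dplyr} is installed.

```r
library(dplyr)

# Hypothetical family members (not part of {tidytof}); both share the
# same data frame and column specification arguments.
tof_example_summarize_mean <- function(tof_tibble, summary_cols) {
  summarize(tof_tibble, across({{ summary_cols }}, mean))
}

tof_example_summarize_median <- function(tof_tibble, summary_cols) {
  summarize(tof_tibble, across({{ summary_cols }}, median))
}

# Hypothetical verb: validates `method`, then forwards the other
# arguments to the matching family member.
tof_example_summarize <-
  function(tof_tibble, summary_cols, method = c("mean", "median")) {
    method <- match.arg(method)
    switch(
      method,
      mean = tof_example_summarize_mean(
        tof_tibble,
        summary_cols = {{ summary_cols }}
      ),
      median = tof_example_summarize_median(
        tof_tibble,
        summary_cols = {{ summary_cols }}
      )
    )
  }

cells <- tibble(cd4 = c(1, 2, 10), cd8 = c(5, 5, 5))

# Calling the family member directly...
direct <- tof_example_summarize_median(cells, summary_cols = starts_with("cd"))

# ...or going through the verb should give the same result
via_verb <-
  tof_example_summarize(cells, summary_cols = starts_with("cd"), method = "median")
```

With a design like this, supporting a new algorithm means writing one
new family member and adding one branch to the verb.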
To extend an existing {tidytof} verb, write a function whose name fits
the pattern tof_{verb name}_*, where “*” represents the name of the
algorithm being used to perform the computation. In the function
definition, try to share as many arguments as possible with the
{tidytof} verb you’re extending, and return the same output object as
that described in the “Value” heading of the help file for the verb
being extended.
For example, suppose I wanted to write a {tidytof}-style interface for
my new clustering algorithm “supercluster”, which performs k-means
clustering on a dataset twice and then outputs a final cluster
assignment equal to the two k-means cluster assignments spliced
together. To add the supercluster algorithm to {tidytof}, I might write
a function like this:
#' Perform superclustering on high-dimensional cytometry data.
#'
#' This function applies the silly, hypothetical clustering algorithm
#' "supercluster" to high-dimensional cytometry data using user-specified
#' input variables/cytometry measurements.
#'
#' @param tof_tibble A `tof_tbl` or `tibble`.
#'
#' @param cluster_cols Unquoted column names indicating which columns in
#' `tof_tibble` to use in computing the supercluster clusters.
#' Supports tidyselect helpers.
#'
#' @param num_kmeans_clusters An integer indicating how many clusters should be
#' used for the two k-means clustering steps.
#'
#' @param sep A string to use when splicing the 2 k-means clustering assignments
#' to one another.
#'
#' @param ... Optional additional parameters to pass to
#' \code{\link[tidytof]{tof_cluster_kmeans}}
#'
#' @return A tibble with one column named `.supercluster_cluster` containing
#' a character vector of length `nrow(tof_tibble)` indicating the id of the
#' supercluster cluster to which each cell (i.e. each row) in `tof_tibble` was
#' assigned.
#'
#' @importFrom dplyr tibble
#'
tof_cluster_supercluster <-
    function(tof_tibble, cluster_cols, num_kmeans_clusters = 10L, sep = "_", ...) {
        kmeans_1 <-
            tof_tibble |>
            tof_cluster_kmeans(
                cluster_cols = {{ cluster_cols }},
                num_clusters = num_kmeans_clusters,
                ...
            )
        kmeans_2 <-
            tof_tibble |>
            tof_cluster_kmeans(
                cluster_cols = {{ cluster_cols }},
                num_clusters = num_kmeans_clusters,
                ...
            )
        final_result <-
            dplyr::tibble(
                .supercluster_cluster =
                    paste(kmeans_1$.kmeans_cluster, kmeans_2$.kmeans_cluster, sep = sep)
            )
        return(final_result)
    }
In the example above, note that tof_cluster_supercluster is named using
the tof_{verb name}_* style, that the function definition uses the same
tof_tibble and cluster_cols arguments as tof_cluster, and that the
returned output object is a tibble with a single column encoding the
cluster ids for each of the cells in tof_tibble.
Creating a new {tidytof} verb
If you want to contribute a function to {tidytof} that represents a new
operation not encompassed by any of the existing verbs above, you should
include the suggestion to create a new verb in your pull request to the
{tidytof} team. In this case, you’ll have considerably more flexibility
to define the interface {tidytof} will use to implement your new verb,
and the {tidytof} team is happy to work with you to figure out what
makes the most sense (or at least to brainstorm together).
At this point in its development, we don’t recommend extending
{tidytof}’s modeling functionality, as it is likely to be abstracted
into its own standalone package (with an emphasis on interoperability
with the tidymodels ecosystem) at some point in the future.
For general questions/comments/concerns about {tidytof}, feel free to
reach out to our team on GitHub here.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] HDCytoData_1.26.0 flowCore_2.19.0
#> [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [5] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
#> [7] IRanges_2.41.1 S4Vectors_0.45.2
#> [9] MatrixGenerics_1.19.0 matrixStats_1.4.1
#> [11] ExperimentHub_2.15.0 AnnotationHub_3.15.0
#> [13] BiocFileCache_2.15.0 dbplyr_2.5.0
#> [15] BiocGenerics_0.53.3 generics_0.1.3
#> [17] forcats_1.0.0 ggplot2_3.5.1
#> [19] dplyr_1.1.4 tidytof_1.1.0
#> [21] rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] sys_3.4.3 jsonlite_1.8.9 shape_1.4.6.1
#> [4] magrittr_2.0.3 farver_2.1.2 zlibbioc_1.52.0
#> [7] vctrs_0.6.5 memoise_2.0.1 htmltools_0.5.8.1
#> [10] S4Arrays_1.7.1 curl_6.0.1 SparseArray_1.7.2
#> [13] sass_0.4.9 parallelly_1.39.0 bslib_0.8.0
#> [16] lubridate_1.9.3 cachem_1.1.0 buildtools_1.0.0
#> [19] igraph_2.1.1 mime_0.12 lifecycle_1.0.4
#> [22] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.7-1
#> [25] R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
#> [28] future_1.34.0 digest_0.6.37 colorspace_2.1-1
#> [31] AnnotationDbi_1.69.0 RSQLite_2.3.7 labeling_0.4.3
#> [34] filelock_1.0.3 cytolib_2.19.0 fansi_1.0.6
#> [37] yardstick_1.3.1 timechange_0.3.0 httr_1.4.7
#> [40] polyclip_1.10-7 abind_1.4-8 compiler_4.4.2
#> [43] bit64_4.5.2 withr_3.0.2 doParallel_1.0.17
#> [46] viridis_0.6.5 DBI_1.2.3 ggforce_0.4.2
#> [49] MASS_7.3-61 lava_1.8.0 rappdirs_0.3.3
#> [52] DelayedArray_0.33.2 tools_4.4.2 future.apply_1.11.3
#> [55] nnet_7.3-19 glue_1.8.0 grid_4.4.2
#> [58] recipes_1.1.0 gtable_0.3.6 tzdb_0.4.0
#> [61] class_7.3-22 tidyr_1.3.1 data.table_1.16.2
#> [64] hms_1.1.3 tidygraph_1.3.1 utf8_1.2.4
#> [67] XVector_0.47.0 ggrepel_0.9.6 BiocVersion_3.21.1
#> [70] foreach_1.5.2 pillar_1.9.0 stringr_1.5.1
#> [73] RcppHNSW_0.6.0 splines_4.4.2 tweenr_2.0.3
#> [76] lattice_0.22-6 survival_3.7-0 bit_4.5.0
#> [79] RProtoBufLib_2.19.0 tidyselect_1.2.1 Biostrings_2.75.1
#> [82] maketools_1.3.1 knitr_1.49 gridExtra_2.3
#> [85] xfun_0.49 graphlayouts_1.2.0 hardhat_1.4.0
#> [88] timeDate_4041.110 stringi_1.8.4 UCSC.utils_1.3.0
#> [91] yaml_2.3.10 evaluate_1.0.1 codetools_0.2-20
#> [94] ggraph_2.2.1 tibble_3.2.1 BiocManager_1.30.25
#> [97] cli_3.6.3 rpart_4.1.23 munsell_0.5.1
#> [100] jquerylib_0.1.4 Rcpp_1.0.13-1 globals_0.16.3
#> [103] png_0.1-8 parallel_4.4.2 gower_1.0.1
#> [106] readr_2.1.5 blob_1.2.4 listenv_0.9.1
#> [109] glmnet_4.1-8 viridisLite_0.4.2 ipred_0.9-15
#> [112] ggridges_0.5.6 scales_1.3.0 prodlim_2024.06.25
#> [115] purrr_1.0.2 crayon_1.5.3 rlang_1.1.4
#> [118] KEGGREST_1.47.0