How to contribute code

In addition to implementing its own built-in functions, {tidytof} proposes a general framework for analyzing single-cell data using a tidy interface. This framework centers on the use of “verbs,” i.e. modular function families that represent specific data operations. Users may wish to extend {tidytof}’s existing functionality by writing functions that implement additional tidy interfaces to new algorithms or data analysis methods not currently included in {tidytof}.

If you’re interested in contributing new functions to {tidytof}, this vignette provides some details about how to do so.

General Guidelines

To extend {tidytof} to include a new algorithm (for example, one that you’ve just developed), you can take one of two general strategies (and in some cases, you may take both!). The first is to write a {tidytof}-style verb for your algorithm that can be included in your own standalone package. The benefit of this approach is that {tidytof}’s design schema makes your algorithm easy for users to access without learning much (if any) new syntax, while still allowing you to maintain your code base independently of our team.

The second approach is to write a {tidytof}-style function that you’d like our team to add to {tidytof} itself in its next release. In this case, the code review process will take a bit of time, but it will also allow our teams to collaborate and provide a greater degree of critical feedback to one another as well as to share the burden of code maintenance in the future.

In either case, you’re welcome to contact the {tidytof} team to review your code via a pull request and/or an issue on the {tidytof} GitHub page. This tutorial may be helpful if you don’t have a lot of experience collaborating with other programmers via GitHub.

After you open your request, you can submit code to our team for review. Please mention in your request whether you want your method to be incorporated into {tidytof} or are simply looking for external code review or feedback from our team.

Code style

{tidytof} uses the tidyverse style guide. Adhering to tidyverse style is something our team will expect for any code being incorporated into {tidytof}, and it’s also something we encourage for any functions you write for your own analysis packages. In our experience, the best code is written not just to be executed, but also to be read by other humans! There are also many tools you can use to lint or automatically style your R code, such as the {lintr} and {styler} packages.
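As a small illustration of automatic styling, the sketch below passes a deliberately messy line of code to `styler::style_text()`, which reformats it according to the tidyverse style guide by default. The snippet being styled is made up for this example. {lintr} plays a complementary role: it reports style problems rather than rewriting the code.

```r
library(styler)

# A deliberately messy line: `=` used for assignment, no spaces after commas
messy_code <- "x=c(1,2)"

# style_text() returns the restyled code as a character vector
# (tidyverse style is the default)
styled_code <- styler::style_text(messy_code)
styled_code
```

For whole files or packages, {styler} also provides `style_file()` and `style_pkg()`, which apply the same rules on disk.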

Testing

In addition to writing well-styled code, we encourage you to write unit tests for every function you write. This is common practice in the software engineering world, but not as common as it probably should be(!) in the bioinformatics community. The {tidytof} team uses the {testthat} package for all of its unit tests, and there’s a great tutorial for doing so here.
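As a minimal sketch of what such a unit test might look like, the block below defines a small helper and checks its output with {testthat}. The helper `tof_example_center()` and the toy data are hypothetical and exist only for this illustration; they are not part of {tidytof}.

```r
library(testthat)

# Hypothetical helper used only for illustration; it is not part of {tidytof}.
# It subtracts the column mean from every numeric column of a data frame.
tof_example_center <- function(tof_tibble) {
    is_num <- vapply(tof_tibble, is.numeric, logical(1))
    tof_tibble[is_num] <- lapply(tof_tibble[is_num], function(x) x - mean(x))
    tof_tibble
}

test_that("tof_example_center() preserves shape and centers numeric columns", {
    input <- data.frame(
        cd45 = c(1, 2, 3),
        cd19 = c(10, 20, 30),
        id = c("a", "b", "c")
    )
    result <- tof_example_center(input)

    # Same number of rows and the same column names as the input
    expect_equal(nrow(result), nrow(input))
    expect_named(result, names(input))

    # Numeric columns now have mean zero; non-numeric columns are untouched
    expect_equal(mean(result$cd45), 0)
    expect_equal(mean(result$cd19), 0)
    expect_identical(result$id, input$id)
})
```

In a package, a test like this would typically live in a file under `tests/testthat/` rather than being run interactively.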

How to contribute

General principles

The most important part of writing a function that extends {tidytof} is to adhere to {tidytof} verb syntax. With very few exceptions, {tidytof} functions follow a specific, shared syntax that involves 3 types of arguments that always occur in the same order. These argument types are as follows:

  1. For almost all {tidytof} functions, the first argument is a data frame (or tibble). This enables the use of the pipe (|>) for multi-step calculations, which means that your first argument for most functions will be implicit (passed from the previous function using the pipe).
  2. The second group of arguments are called column specifications, and they end in the suffix _col or _cols. Column specifications are unquoted column names that tell a {tidytof} verb which columns to compute over for a particular operation. For example, the cluster_cols argument in tof_cluster allows the user to specify which columns in the input data frame should be used to perform the clustering. Regardless of which verb requires them, column specifications support tidyselect helpers and follow the same rules for tidyselection as tidyverse verbs like dplyr::select() and tidyr::pivot_longer().
  3. Finally, the third group of arguments for each {tidytof} verb are called method specifications, and they comprise every argument that isn’t an input data frame or a column specification. Whereas column specifications represent which columns should be used to perform an operation, method specifications represent the details of how that operation should be performed. For example, the tof_cluster_phenograph() function requires the method specification num_neighbors, which specifies how many nearest neighbors should be used to construct the PhenoGraph algorithm’s k-nearest-neighbor graph.

With few exceptions, any {tidytof} extension should include the same 3 argument types (in the same order).

In addition, any functions that extend {tidytof} should have a name that starts with the prefix tof_. This will make it easier for users to find {tidytof} functions using the text completion functionality included in most development environments.
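The three argument types and the naming convention described above can be sketched with a toy function. Everything in this block is hypothetical: `tof_example_rescale()`, its `rescale_cols`, `min_value`, and `max_value` arguments, and the toy data are invented for illustration and are not part of {tidytof}. The sketch assumes {dplyr} is installed.

```r
library(dplyr)

# Hypothetical verb for illustration only; it is not part of {tidytof}.
tof_example_rescale <-
    function(tof_tibble, rescale_cols, min_value = 0, max_value = 1) {
        # 1. `tof_tibble`: the input data frame, placed first so that it can
        #    be supplied by the pipe.
        # 2. `rescale_cols`: a column specification, forwarded with {{ }} so
        #    that unquoted names and tidyselect helpers both work.
        # 3. `min_value` and `max_value`: method specifications that control
        #    how the operation is performed.
        tof_tibble |>
            dplyr::mutate(
                dplyr::across(
                    {{ rescale_cols }},
                    \(x) min_value +
                        (x - min(x)) / (max(x) - min(x)) * (max_value - min_value)
                )
            )
    }

# Toy data: two marker columns and one metadata column
cells <-
    dplyr::tibble(
        cd45 = c(0, 5, 10),
        cd19 = c(2, 4, 6),
        sample = c("a", "a", "b")
    )

# The data frame is passed implicitly by the pipe; only the marker columns
# selected by the tidyselect helper are rescaled
cells |>
    tof_example_rescale(rescale_cols = starts_with("cd"), max_value = 100)
```

Because the column specification is forwarded with `{{ }}`, callers can use bare column names or any tidyselect helper, exactly as they would with `dplyr::select()`.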

Contributing a new method to an existing {tidytof} verb

{tidytof} currently includes multiple verbs that perform fundamental single-cell data manipulation tasks. Currently, {tidytof}’s extensible verbs are the following:

  • tof_analyze_abundance: Perform differential cluster abundance analysis
  • tof_analyze_expression: Perform differential marker expression analysis
  • tof_annotate_clusters: Annotate clusters with manual IDs
  • tof_batch_correct: Perform batch correction
  • tof_cluster: Cluster cells into subpopulations
  • tof_downsample: Subsample a dataset into a smaller number of cells
  • tof_extract: Calculate sample-level summary statistics
  • tof_metacluster: Metacluster clusters into a smaller number of subpopulations
  • tof_plot_cells: Plot cell-level data
  • tof_plot_clusters: Plot cluster-level data
  • tof_plot_model: Plot the results of a sample-level model
  • tof_read_data: Read data into memory from disk
  • tof_reduce_dimensions: Perform dimensionality reduction
  • tof_transform: Transform marker expression values in a vectorized fashion
  • tof_upsample: Assign new cells to existing clusters (defined on a downsampled dataset)
  • tof_write_data: Write data from memory to disk

Each {tidytof} verb wraps a family of related functions that all perform the same basic task. For example, the tof_cluster verb is a wrapper for the following functions: tof_cluster_ddpr, tof_cluster_flowsom, tof_cluster_kmeans, and tof_cluster_phenograph. All of these functions implement a different clustering algorithm, but they share an underlying logic that is standardized under the tof_cluster abstraction. In practice, this means that users can apply the DDPR, FlowSOM, K-means, and PhenoGraph clustering algorithms to their datasets either by calling one of the tof_cluster_* functions directly, or by calling tof_cluster with the method argument set to the appropriate value (“ddpr”, “flowsom”, “kmeans”, and “phenograph”, respectively).
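To make the verb/method relationship concrete, here is a minimal sketch of one way a verb could dispatch to a family of method functions. It is not {tidytof}’s internal implementation: the `tof_example_summarize*` functions, the `summary_cols` argument, and the toy data are all hypothetical, and the sketch assumes {dplyr} is installed.

```r
library(dplyr)

# Hypothetical method functions (not part of {tidytof}): each implements a
# different way of performing the same basic task, summarizing columns.
tof_example_summarize_mean <- function(tof_tibble, summary_cols) {
    tof_tibble |>
        dplyr::summarize(dplyr::across({{ summary_cols }}, mean))
}

tof_example_summarize_median <- function(tof_tibble, summary_cols) {
    tof_tibble |>
        dplyr::summarize(dplyr::across({{ summary_cols }}, stats::median))
}

# Hypothetical verb: selects one member of the family based on `method`
tof_example_summarize <-
    function(tof_tibble, summary_cols, method = c("mean", "median")) {
        method <- match.arg(method)
        switch(method,
            mean = tof_example_summarize_mean(tof_tibble, {{ summary_cols }}),
            median = tof_example_summarize_median(tof_tibble, {{ summary_cols }})
        )
    }

cells <- dplyr::tibble(cd45 = c(1, 2, 9), cd19 = c(3, 4, 8))

# Calling the verb with `method = "median"` ...
via_verb <-
    cells |>
    tof_example_summarize(summary_cols = starts_with("cd"), method = "median")

# ... is intended to give the same result as calling the method directly
via_method <-
    cells |>
    tof_example_summarize_median(summary_cols = starts_with("cd"))
```

The design choice illustrated here is that the verb adds no logic of its own beyond choosing a method, so the two calling styles stay interchangeable.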

To extend an existing {tidytof} verb, write a function whose name fits the pattern tof_{verb name}_*, where “*” represents the name of the algorithm being used to perform the computation. In the function definition, try to share as many arguments as possible with the {tidytof} verb you’re extending, and return the same output object as that described in the “Value” heading of the help file for the verb being extended.

For example, suppose I wanted to write a {tidytof}-style interface for my new clustering algorithm “supercluster”, which performs k-means clustering on a dataset twice and then outputs a final cluster assignment equal to the two k-means cluster assignments spliced together. To add the supercluster algorithm to {tidytof}, I might write a function like this:

#' Perform superclustering on high-dimensional cytometry data.
#'
#' This function applies the silly, hypothetical clustering algorithm
#' "supercluster" to high-dimensional cytometry data using user-specified
#' input variables/cytometry measurements.
#'
#' @param tof_tibble A `tof_tbl` or `tibble`.
#'
#' @param cluster_cols Unquoted column names indicating which columns in
#' `tof_tibble` to use in computing the supercluster clusters.
#' Supports tidyselect helpers.
#'
#' @param num_kmeans_clusters An integer indicating how many clusters should be
#' used for the two k-means clustering steps.
#'
#' @param sep A string to use when splicing the 2 k-means clustering assignments
#' to one another.
#'
#' @param ... Optional additional parameters to pass to
#' \code{\link[tidytof]{tof_cluster_kmeans}}
#'
#' @return A tibble with one column named `.supercluster_cluster` containing
#' a character vector of length `nrow(tof_tibble)` indicating the id of the
#' supercluster cluster to which each cell (i.e. each row) in `tof_tibble` was
#' assigned.
#'
#' @importFrom dplyr tibble
#'
tof_cluster_supercluster <-
    function(tof_tibble, cluster_cols, num_kmeans_clusters = 10L, sep = "_", ...) {
        kmeans_1 <-
            tof_tibble |>
            tof_cluster_kmeans(
                cluster_cols = {{ cluster_cols }},
                num_clusters = num_kmeans_clusters,
                ...
            )

        kmeans_2 <-
            tof_tibble |>
            tof_cluster_kmeans(
                cluster_cols = {{ cluster_cols }},
                num_clusters = num_kmeans_clusters,
                ...
            )

        final_result <-
            dplyr::tibble(
                .supercluster_cluster =
                    paste(kmeans_1$.kmeans_cluster, kmeans_2$.kmeans_cluster, sep = sep)
            )

        return(final_result)
    }

In the example above, note that tof_cluster_supercluster is named using the tof_{verb name}_* style, that the function definition uses the same tof_tibble and cluster_cols arguments as tof_cluster, and that the returned output object is a tibble with a single column encoding the cluster ids for each of the cells in tof_tibble.

Creating a new {tidytof} verb

If you want to contribute a function to {tidytof} that represents a new operation not encompassed by any of the existing verbs above, you should include the suggestion to create a new verb in your pull request to the {tidytof} team. In this case, you’ll have considerably more flexibility to define the interface {tidytof} will use to implement your new verb, and the {tidytof} team is happy to work with you to figure out what makes the most sense (or at least to brainstorm together).

A note about modeling functions

At this point in {tidytof}’s development, we don’t recommend extending its modeling functionality, as that functionality is likely to be abstracted into its own standalone package (with an emphasis on interoperability with the tidymodels ecosystem) at some point in the future.

Contact us

For general questions/comments/concerns about {tidytof}, feel free to reach out to our team on GitHub here.

Session info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] HDCytoData_1.26.0           flowCore_2.19.0            
#>  [3] SummarizedExperiment_1.37.0 Biobase_2.67.0             
#>  [5] GenomicRanges_1.59.0        GenomeInfoDb_1.43.0        
#>  [7] IRanges_2.41.1              S4Vectors_0.45.2           
#>  [9] MatrixGenerics_1.19.0       matrixStats_1.4.1          
#> [11] ExperimentHub_2.15.0        AnnotationHub_3.15.0       
#> [13] BiocFileCache_2.15.0        dbplyr_2.5.0               
#> [15] BiocGenerics_0.53.3         generics_0.1.3             
#> [17] forcats_1.0.0               ggplot2_3.5.1              
#> [19] dplyr_1.1.4                 tidytof_1.1.0              
#> [21] rmarkdown_2.29             
#> 
#> loaded via a namespace (and not attached):
#>   [1] sys_3.4.3               jsonlite_1.8.9          shape_1.4.6.1          
#>   [4] magrittr_2.0.3          farver_2.1.2            zlibbioc_1.52.0        
#>   [7] vctrs_0.6.5             memoise_2.0.1           htmltools_0.5.8.1      
#>  [10] S4Arrays_1.7.1          curl_6.0.1              SparseArray_1.7.2      
#>  [13] sass_0.4.9              parallelly_1.39.0       bslib_0.8.0            
#>  [16] lubridate_1.9.3         cachem_1.1.0            buildtools_1.0.0       
#>  [19] igraph_2.1.1            mime_0.12               lifecycle_1.0.4        
#>  [22] iterators_1.0.14        pkgconfig_2.0.3         Matrix_1.7-1           
#>  [25] R6_2.5.1                fastmap_1.2.0           GenomeInfoDbData_1.2.13
#>  [28] future_1.34.0           digest_0.6.37           colorspace_2.1-1       
#>  [31] AnnotationDbi_1.69.0    RSQLite_2.3.7           labeling_0.4.3         
#>  [34] filelock_1.0.3          cytolib_2.19.0          fansi_1.0.6            
#>  [37] yardstick_1.3.1         timechange_0.3.0        httr_1.4.7             
#>  [40] polyclip_1.10-7         abind_1.4-8             compiler_4.4.2         
#>  [43] bit64_4.5.2             withr_3.0.2             doParallel_1.0.17      
#>  [46] viridis_0.6.5           DBI_1.2.3               ggforce_0.4.2          
#>  [49] MASS_7.3-61             lava_1.8.0              rappdirs_0.3.3         
#>  [52] DelayedArray_0.33.2     tools_4.4.2             future.apply_1.11.3    
#>  [55] nnet_7.3-19             glue_1.8.0              grid_4.4.2             
#>  [58] recipes_1.1.0           gtable_0.3.6            tzdb_0.4.0             
#>  [61] class_7.3-22            tidyr_1.3.1             data.table_1.16.2      
#>  [64] hms_1.1.3               tidygraph_1.3.1         utf8_1.2.4             
#>  [67] XVector_0.47.0          ggrepel_0.9.6           BiocVersion_3.21.1     
#>  [70] foreach_1.5.2           pillar_1.9.0            stringr_1.5.1          
#>  [73] RcppHNSW_0.6.0          splines_4.4.2           tweenr_2.0.3           
#>  [76] lattice_0.22-6          survival_3.7-0          bit_4.5.0              
#>  [79] RProtoBufLib_2.19.0     tidyselect_1.2.1        Biostrings_2.75.1      
#>  [82] maketools_1.3.1         knitr_1.49              gridExtra_2.3          
#>  [85] xfun_0.49               graphlayouts_1.2.0      hardhat_1.4.0          
#>  [88] timeDate_4041.110       stringi_1.8.4           UCSC.utils_1.3.0       
#>  [91] yaml_2.3.10             evaluate_1.0.1          codetools_0.2-20       
#>  [94] ggraph_2.2.1            tibble_3.2.1            BiocManager_1.30.25    
#>  [97] cli_3.6.3               rpart_4.1.23            munsell_0.5.1          
#> [100] jquerylib_0.1.4         Rcpp_1.0.13-1           globals_0.16.3         
#> [103] png_0.1-8               parallel_4.4.2          gower_1.0.1            
#> [106] readr_2.1.5             blob_1.2.4              listenv_0.9.1          
#> [109] glmnet_4.1-8            viridisLite_0.4.2       ipred_0.9-15           
#> [112] ggridges_0.5.6          scales_1.3.0            prodlim_2024.06.25     
#> [115] purrr_1.0.2             crayon_1.5.3            rlang_1.1.4            
#> [118] KEGGREST_1.47.0