--- title: "Clustering and metaclustering" author: "Timothy Keyes" date: "`r Sys.Date()`" output: rmarkdown::html_vignette description: > Read this vignette to learn how to identify clusters of cells with shared phenotypic characteristics using {tidytof}. vignette: > %\VignetteIndexEntry{07. Clustering and metaclustering} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options( rmarkdown.html_vignette.check_title = FALSE ) ``` ```{r setup, message = FALSE, warning = FALSE} library(tidytof) library(dplyr) ``` Often, clustering single-cell data to identify communities of cells with shared characteristics is a major goal of high-dimensional cytometry data analysis. To do this, `{tidytof}` provides the `tof_cluster()` verb. Several clustering methods are implemented in `{tidytof}`, including the following: - [FlowSOM](https://pubmed.ncbi.nlm.nih.gov/25573116/) - [k-means](https://www.jstor.org/stable/2346830?origin=crossref&seq=1#metadata_info_tab_contents) - [PhenoGraph](https://pubmed.ncbi.nlm.nih.gov/26095251/) - [Supervised distance-based clustering](https://pubmed.ncbi.nlm.nih.gov/29505032/) - [X-shift](https://pubmed.ncbi.nlm.nih.gov/27183440/) Each of these methods are wrapped by `tof_cluster()`. ## Clustering with `tof_cluster()` To demonstrate, we can apply the PhenoGraph clustering algorithm to `{tidytof}`'s built-in `phenograph_data`. Note that `phenograph_data` contains 3000 total cells (1000 each from 3 clusters identified in the [original PhenoGraph publication](https://pubmed.ncbi.nlm.nih.gov/26095251/)). For demonstration purposes, we also metacluster our PhenoGraph clusters using k-means clustering. ```{r} data(phenograph_data) set.seed(203L) phenograph_clusters <- phenograph_data |> tof_preprocess() |> tof_cluster( cluster_cols = starts_with("cd"), num_neighbors = 50L, distance_function = "cosine", method = "phenograph" ) |> tof_metacluster( cluster_col = .phenograph_cluster, metacluster_cols = starts_with("cd"), num_metaclusters = 3L, method = "kmeans" ) phenograph_clusters |> dplyr::select(sample_name, .phenograph_cluster, .kmeans_metacluster) |> head() ``` The outputs of both `tof_cluster()` and `tof_metacluster()` are a `tof_tbl` identical to the input tibble, but now with the addition of an additional column (in this case, ".phenograph_cluster" and ".kmeans_metacluster") that encodes the cluster id for each cell in the input `tof_tbl`. Note that all output columns added to a tibble or `tof_tbl` by `{tidytof}` begin with a full-stop (".") to reduce the likelihood of collisions with existing column names. Because the output of `tof_cluster` is a `tof_tbl`, we can use `dplyr`'s `count` method to assess the accuracy of our clustering procedure compared to the original clustering from the PhenoGraph paper. ```{r} phenograph_clusters |> dplyr::count(phenograph_cluster, .kmeans_metacluster, sort = TRUE) ``` Here, we can see that our clustering procedure groups most cells from the same PhenoGraph cluster with one another (with a small number of mistakes). To change which clustering algorithm `tof_cluster` uses, alter the `method` flag. ```{r, eval = FALSE} # use the kmeans algorithm phenograph_data |> tof_preprocess() |> tof_cluster( cluster_cols = contains("cd"), method = "kmeans" ) # use the flowsom algorithm phenograph_data |> tof_preprocess() |> tof_cluster( cluster_cols = contains("cd"), method = "flowsom" ) ``` To change the columns used to compute the clusters, change the `cluster_cols` flag. And finally, if you want to return a one-column `tibble` that only includes the cluster labels (as opposed to the cluster labels added as a new column to the input `tof_tbl`), set `augment` to `FALSE`. ```{r} # will result in a tibble with only 1 column (the cluster labels) phenograph_data |> tof_preprocess() |> tof_cluster( cluster_cols = contains("cd"), method = "kmeans", augment = FALSE ) |> head() ``` # Session info ```{r} sessionInfo() ```