Reading and writing data

library(tidytof)
library(dplyr)

This vignette teaches you how to read CyTOF data into an R session from two common file formats in which CyTOF data is typically stored: Flow Cytometry Standard (FCS) and Comma-Separated Value (CSV) files.

Accessing the data for this vignette

{tidytof} comes bundled with several example mass cytometry datasets. To access the raw FCS and CSV files containing these data, use the tidytof_example_data function. When called with no arguments, tidytof_example_data will return a character vector naming the datasets contained in {tidytof}:

tidytof_example_data()
#>  [1] "aml"                  "ddpr"                 "ddpr_metadata.csv"   
#>  [4] "mix"                  "mix2"                 "phenograph"          
#>  [7] "phenograph_csv"       "scaffold"             "statistical_scaffold"
#> [10] "surgery"

The details of the datasets contained in each of these directories aren’t particularly important, but some basic information is as follows:

  • aml - one FCS file containing myeloid cells from a healthy bone marrow and one FCS file containing myeloid cells from an AML patient bone marrow
  • ddpr - two FCS files containing B-cell lineage cells from this paper
  • mix - two FCS files with different CyTOF antigen panels (one FCS file from the “aml” directory and one from the “phenograph” directory)
  • mix2 - three files with different CyTOF antigen panels and different file extensions (one FCS file from the “aml” directory and two CSV files from the “phenograph_csv” directory)
  • phenograph - three FCS files containing AML cells from this paper
  • phenograph_csv - the same cells as in the “phenograph” directory, but stored in CSV files
  • scaffold - three FCS files from this paper
  • statistical_scaffold - three FCS files from this paper
  • surgery - three FCS files from this paper

To obtain the file path for the directory containing each dataset, call tidytof_example_data with one of these dataset names as its argument. For example, to obtain the directory for the phenograph data, we would use the following command:

tidytof_example_data("phenograph")
#> [1] "/tmp/RtmpgsFa8r/Rinst2ece69d18297/tidytof/extdata/phenograph"
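Because the returned value is an ordinary file path, base R tools can be used to inspect what the directory contains. The sketch below uses list.files for this; the variable names phenograph_dir and phenograph_files are chosen here for illustration only.

```r
library(tidytof)

# Directory holding the bundled "phenograph" example files
phenograph_dir <- tidytof_example_data("phenograph")

# List the raw files stored in that directory
phenograph_files <- list.files(phenograph_dir)
phenograph_files
```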

Reading data with tof_read_data

Using one of these directories (or any other directory containing CyTOF data on your local machine), we can use tof_read_data to read CyTOF data from raw files. Acceptable formats include FCS files and CSV files. Importantly, tof_read_data is smart enough to read single FCS/CSV files or multiple FCS/CSV files depending on whether its first argument (path) leads to a single file or to a directory of files.
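As a rough illustration of the single-file versus directory behavior described above, the sketch below reads one FCS file on its own and then the whole directory. The names fcs_paths, one_file, and all_files are hypothetical, and the example assumes the bundled files use a lowercase .fcs extension.

```r
library(tidytof)

# Full paths to the FCS files in the bundled "phenograph" directory
fcs_paths <-
    list.files(
        tidytof_example_data("phenograph"),
        pattern = "\\.fcs$",
        full.names = TRUE
    )

# A path to a single file reads only that file
one_file <- tof_read_data(fcs_paths[[1]])

# A path to a directory reads every file it contains
all_files <- tof_read_data(tidytof_example_data("phenograph"))

# The directory read should hold at least as many cells as the single file
nrow(one_file) <= nrow(all_files)
```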

Here, we use tof_read_data to read in all of the FCS files in the “phenograph” example dataset bundled into {tidytof} and store the result in the phenograph variable.

phenograph <-
    tidytof_example_data("phenograph") %>%
    tof_read_data()

phenograph %>%
    class()
#> [1] "tof_tbl"    "tbl_df"     "tbl"        "data.frame"

Regardless of the input data file type, {tidytof} reads data into an extended tibble class called a tof_tbl (pronounced “tof tibble”).

A tof_tbl is an S3 class that extends tbl_df with one additional attribute (“panel”). {tidytof} stores this additional attribute in tof_tbls because, in addition to analyzing CyTOF data from individual experiments, CyTOF users often want to compare panels between experiments to find common markers or to compare which metals are associated with particular markers across panels. To retrieve this panel information from a tof_tbl, use tof_get_panel:

phenograph %>%
    tof_get_panel()
#> # A tibble: 44 × 2
#>   metals      antigens   
#>   <chr>       <chr>      
#> 1 Time        Time       
#> 2 Cell_length Cell_length
#> 3 Ir191       DNA1       
#> 4 Ir193       DNA2       
#> # ℹ 40 more rows
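Since a panel is returned as a regular tibble with metals and antigens columns, panels from two experiments can be compared with ordinary {dplyr} joins. The following is a minimal sketch of that idea using the bundled “aml” and “phenograph” datasets; the object names (aml_panel, shared_channels, reassigned_metals) are ours, not part of {tidytof}.

```r
library(tidytof)
library(dplyr)

aml_panel <-
    tidytof_example_data("aml") %>%
    tof_read_data() %>%
    tof_get_panel()

phenograph_panel <-
    tidytof_example_data("phenograph") %>%
    tof_read_data() %>%
    tof_get_panel()

# channels that pair the same metal with the same antigen in both panels
shared_channels <-
    inner_join(aml_panel, phenograph_panel, by = c("metals", "antigens"))

# metals that appear in both panels but are paired with different antigens
reassigned_metals <-
    inner_join(
        aml_panel,
        phenograph_panel,
        by = "metals",
        suffix = c("_aml", "_phenograph")
    ) %>%
    filter(antigens_aml != antigens_phenograph)

shared_channels
reassigned_metals
```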

A few additional notes about tof_tbls:

  • tof_tbls contain one cell per row and one CyTOF channel per column (to provide the data in its “tidy” format).
  • tof_read_data adds an additional column to the output tof_tbl encoding the name of the file from which each cell was read (the “file_name” column).
  • Because tof_tbls inherit from the tbl_df class, all methods available to tibbles are also available to tof_tbls.
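To make these notes concrete, the short sketch below uses the “file_name” column together with an ordinary {dplyr} verb to count how many cells were read from each file. The name cells_per_file is introduced here purely for illustration.

```r
library(tidytof)
library(dplyr)

phenograph <-
    tidytof_example_data("phenograph") %>%
    tof_read_data()

# one row per input file, with the number of cells read from it
cells_per_file <-
    phenograph %>%
    count(file_name)

cells_per_file
```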

Using tibble methods with {tidytof} tibbles

As an extension of the tbl_df class, tof_tbls get access to all {dplyr} and {tidyr} verbs for free. These can be useful for performing a variety of common operations.

For example, the phenograph object above has two columns - PhenoGraph and Condition - that encode categorical variables as numeric codes. We might be interested in converting the types of these columns into strings to make sure that we don’t accidentally perform any quantitative operations on them later. Thus, {dplyr}’s useful mutate method can be applied to phenograph to convert those two columns into character vectors.

phenograph <-
    phenograph %>%
    # mutate the input tof_tbl
    mutate(
        PhenoGraph = as.character(PhenoGraph),
        Condition = as.character(Condition)
    )

phenograph %>%
    # use dplyr's select method to show
    # that the columns have been changed
    select(where(is.character))
#> # A tibble: 300 × 3
#>   file_name                  PhenoGraph Condition
#>   <chr>                      <chr>      <chr>    
#> 1 H1_PhenoGraph_cluster1.fcs 7          7        
#> 2 H1_PhenoGraph_cluster1.fcs 6          6        
#> 3 H1_PhenoGraph_cluster1.fcs 9          9        
#> 4 H1_PhenoGraph_cluster1.fcs 2          2        
#> # ℹ 296 more rows

And note that the tof_tbl class is preserved even after these transformations.

phenograph %>%
    class()
#> [1] "tof_tbl"    "tbl_df"     "tbl"        "data.frame"

Importantly, tof_read_data uses an opinionated heuristic to mine different keyword slots of the input FCS file(s) and guess which metals and antigens were used during data acquisition. CSV files, unlike FCS files, do not provide built-in metadata about the columns they contain, so this heuristic has nothing to work with. When CSV files are read using tof_read_data, it is therefore recommended to use the panel_info argument to provide the panel manually.

# when csv files are read, the tof_tbl's "panel"
# attribute will be empty by default
tidytof_example_data("phenograph_csv") %>%
    tof_read_data() %>%
    tof_get_panel()
#> # A tibble: 0 × 0

# to add a panel manually, provide it as a tibble
# to tof_read_data
phenograph_panel <-
    phenograph %>%
    tof_get_panel()

tidytof_example_data("phenograph_csv") %>%
    tof_read_data(panel_info = phenograph_panel) %>%
    tof_get_panel()
#> # A tibble: 44 × 2
#>   antigens    metals     
#>   <chr>       <chr>      
#> 1 Time        Time       
#> 2 Cell_length Cell_length
#> 3 DNA1        Ir191      
#> 4 DNA2        Ir193      
#> # ℹ 40 more rows
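When no FCS version of the data is available to borrow a panel from, the panel can also be typed out by hand. The sketch below assumes that panel_info only needs a tibble with character columns named metals and antigens, mirroring the output of tof_get_panel; the two-row manual_panel is a deliberately shortened, hypothetical panel, and a real one would list every channel.

```r
library(tidytof)
library(dplyr)

# Hypothetical, shortened panel written by hand (a real panel would
# contain one row per channel in the CSV files)
manual_panel <-
    tibble(
        metals = c("Ir191", "Ir193"),
        antigens = c("DNA1", "DNA2")
    )

csv_data <-
    tidytof_example_data("phenograph_csv") %>%
    tof_read_data(panel_info = manual_panel)

# inspect the panel now attached to the tof_tbl
tof_get_panel(csv_data)
```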

Writing data from a tof_tbl to disk

Users may wish to store CyTOF data as FCS or CSV files after transformation, concatenation, filtering, or other data processing. To write single-cell data from a tof_tbl into FCS or CSV files, use tof_write_data. To illustrate how to use this verb, we use {tidytof}’s built-in phenograph_data dataset.

data(phenograph_data)

print(phenograph_data)
#> # A tibble: 3,000 × 25
#>   sample_name  phenograph_cluster    cd19 cd11b   cd34  cd45  cd123   cd33  cd47
#>   <chr>        <chr>                <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl> <dbl>
#> 1 H1_PhenoGra… cluster1           -0.168  29.0   3.23   131. -0.609  1.21   13.0
#> 2 H1_PhenoGra… cluster1            1.65    4.83 -0.582  230.  2.53  -0.507  12.9
#> 3 H1_PhenoGra… cluster1            2.79   36.1   5.20   293. -0.265  3.67   27.1
#> 4 H1_PhenoGra… cluster1            0.0816 48.8   0.363  431.  2.04   9.40   41.0
#> # ℹ 2,996 more rows
#> # ℹ 16 more variables: cd7 <dbl>, cd44 <dbl>, cd38 <dbl>, cd3 <dbl>,
#> #   cd117 <dbl>, cd64 <dbl>, cd41 <dbl>, pstat3 <dbl>, pstat5 <dbl>,
#> #   pampk <dbl>, p4ebp1 <dbl>, ps6 <dbl>, pcreb <dbl>, `pzap70-syk` <dbl>,
#> #   prb <dbl>, `perk1-2` <dbl>

# when copying and pasting this code, feel free to change this path
# to wherever you'd like to save your output files
my_path <- file.path("~", "Desktop", "tidytof_vignette_files")

phenograph_data %>%
    tof_write_data(
        group_cols = phenograph_cluster,
        out_path = my_path,
        format = "fcs"
    )

tof_write_data’s trickiest argument is group_cols, the argument used to specify which columns in tof_tibble should be used to group cells (the rows of tof_tibble) into separate FCS or CSV files. Simply put, this argument allows tof_write_data to create a single FCS or CSV file for each unique combination of values in the group_cols columns specified by the user. In the example above, cells are grouped into 3 output FCS files - one for each of the 3 clusters encoded by the phenograph_cluster column in phenograph_data. These files should have the following names (derived from the values in the phenograph_cluster column):

  • cluster1.fcs
  • cluster2.fcs
  • cluster3.fcs

Note that these file names match the distinct values in our group_cols column (phenograph_cluster):

phenograph_data %>%
    distinct(phenograph_cluster)
#> # A tibble: 3 × 1
#>   phenograph_cluster
#>   <chr>             
#> 1 cluster1          
#> 2 cluster2          
#> 3 cluster3
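One way to confirm this on your own machine is to write the files into a scratch directory and list what was created. The sketch below does so under tempdir() rather than the Desktop path used above; out_dir, written_files, and expected_files are names introduced only for this example.

```r
library(tidytof)
library(dplyr)

data(phenograph_data)

# scratch directory used only for this check
out_dir <- file.path(tempdir(), "tidytof_write_check")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

phenograph_data %>%
    tof_write_data(
        group_cols = phenograph_cluster,
        out_path = out_dir,
        format = "fcs"
    )

# files that were actually written
written_files <- list.files(out_dir)

# file names we would expect from the values of the grouping column
expected_files <-
    paste0(sort(unique(phenograph_data$phenograph_cluster)), ".fcs")

written_files
setequal(written_files, expected_files)
```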

However, suppose we wanted to write multiple files for each cluster by breaking cells into two groups: those that express high levels of pstat5 and those that express low levels of pstat5. We can use dplyr::mutate to create a new column in phenograph_data that breaks cells into high- and low-pstat5 expression groups, then add this column to our group_cols specification:

phenograph_data %>%
    # create a variable representing if a cell is above or below
    # the median expression level of pstat5
    mutate(
        expression_group = if_else(pstat5 > median(pstat5), "high", "low")
    ) %>%
    tof_write_data(
        group_cols = c(phenograph_cluster, expression_group),
        out_path = my_path,
        format = "fcs"
    )

This will write 6 files with the following names (derived from the values in phenograph_cluster and expression_group):

  • cluster1_low.fcs
  • cluster1_high.fcs
  • cluster2_low.fcs
  • cluster2_high.fcs
  • cluster3_low.fcs
  • cluster3_high.fcs

As above, note that these file names match the distinct values in our group_cols columns (phenograph_cluster and expression_group):

phenograph_data %>%
    mutate(
        expression_group = if_else(pstat5 > median(pstat5), "high", "low")
    ) %>%
    distinct(phenograph_cluster, expression_group)
#> # A tibble: 6 × 2
#>   phenograph_cluster expression_group
#>   <chr>              <chr>           
#> 1 cluster1           low             
#> 2 cluster1           high            
#> 3 cluster2           low             
#> 4 cluster2           high            
#> # ℹ 2 more rows
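Before writing, it can be helpful to check how many cells will end up in each output file. The sketch below counts cells per combination of the two grouping columns; the column name num_cells is our own choice. Note that the median is computed across all cells rather than within each cluster, so the high and low groups need not be the same size within a given cluster.

```r
library(tidytof)
library(dplyr)

data(phenograph_data)

cells_per_output_file <-
    phenograph_data %>%
    mutate(
        expression_group = if_else(pstat5 > median(pstat5), "high", "low")
    ) %>%
    # one row per (cluster, expression group) pair, i.e. per output file
    count(phenograph_cluster, expression_group, name = "num_cells")

cells_per_output_file
```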

A useful feature of tof_write_data is that it will automatically concatenate cells into single FCS or CSV files based on the specified group_cols regardless of how many unique files those cells came from. This allows for easy concatenation of FCS or CSV files containing data from a single sample acquired over multiple CyTOF runs, for example.
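As a minimal sketch of this concatenation behavior, the example below pools every cell in phenograph_data, which come from several samples, into a single CSV file by grouping on a column that holds one constant value. The pooled column, its "all_samples" value, and the scratch directory are all hypothetical choices made for illustration.

```r
library(tidytof)
library(dplyr)

data(phenograph_data)

# scratch directory used only for this example
pooled_dir <- file.path(tempdir(), "tidytof_pooled")
dir.create(pooled_dir, showWarnings = FALSE, recursive = TRUE)

phenograph_data %>%
    # a constant grouping value places every cell in the same group
    mutate(pooled = "all_samples") %>%
    tof_write_data(
        group_cols = pooled,
        out_path = pooled_dir,
        format = "csv"
    )

# cells from every sample_name should now sit in a single file
pooled_files <- list.files(pooled_dir)
pooled_files
```

Grouping on a real sample identifier instead of a constant would give one concatenated file per sample, regardless of how many acquisition files each sample was originally split across.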

Session info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] tidyr_1.3.1                 stringr_1.5.1              
#>  [3] HDCytoData_1.26.0           flowCore_2.19.0            
#>  [5] SummarizedExperiment_1.37.0 Biobase_2.67.0             
#>  [7] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
#>  [9] IRanges_2.41.2              S4Vectors_0.45.2           
#> [11] MatrixGenerics_1.19.1       matrixStats_1.5.0          
#> [13] ExperimentHub_2.15.0        AnnotationHub_3.15.0       
#> [15] BiocFileCache_2.15.0        dbplyr_2.5.0               
#> [17] BiocGenerics_0.53.3         generics_0.1.3             
#> [19] forcats_1.0.0               ggplot2_3.5.1              
#> [21] dplyr_1.1.4                 tidytof_1.1.0              
#> [23] rmarkdown_2.29             
#> 
#> loaded via a namespace (and not attached):
#>   [1] sys_3.4.3               jsonlite_1.8.9          shape_1.4.6.1          
#>   [4] magrittr_2.0.3          farver_2.1.2            vctrs_0.6.5            
#>   [7] memoise_2.0.1           htmltools_0.5.8.1       S4Arrays_1.7.1         
#>  [10] curl_6.1.0              SparseArray_1.7.3       sass_0.4.9             
#>  [13] parallelly_1.41.0       bslib_0.8.0             lubridate_1.9.4        
#>  [16] cachem_1.1.0            buildtools_1.0.0        igraph_2.1.3           
#>  [19] mime_0.12               lifecycle_1.0.4         iterators_1.0.14       
#>  [22] pkgconfig_2.0.3         Matrix_1.7-1            R6_2.5.1               
#>  [25] fastmap_1.2.0           GenomeInfoDbData_1.2.13 future_1.34.0          
#>  [28] digest_0.6.37           colorspace_2.1-1        furrr_0.3.1            
#>  [31] AnnotationDbi_1.69.0    irlba_2.3.5.1           RSQLite_2.3.9          
#>  [34] philentropy_0.9.0       labeling_0.4.3          filelock_1.0.3         
#>  [37] cytolib_2.19.1          yardstick_1.3.1         timechange_0.3.0       
#>  [40] httr_1.4.7              polyclip_1.10-7         abind_1.4-8            
#>  [43] compiler_4.4.2          bit64_4.5.2             withr_3.0.2            
#>  [46] doParallel_1.0.17       viridis_0.6.5           DBI_1.2.3              
#>  [49] ggforce_0.4.2           MASS_7.3-64             lava_1.8.1             
#>  [52] embed_1.1.4             rappdirs_0.3.3          DelayedArray_0.33.3    
#>  [55] tools_4.4.2             future.apply_1.11.3     nnet_7.3-20            
#>  [58] glue_1.8.0              grid_4.4.2              Rtsne_0.17             
#>  [61] recipes_1.1.0           gtable_0.3.6            tzdb_0.4.0             
#>  [64] class_7.3-23            rsample_1.2.1           data.table_1.16.4      
#>  [67] hms_1.1.3               utf8_1.2.4              tidygraph_1.3.1        
#>  [70] XVector_0.47.2          RcppAnnoy_0.0.22        ggrepel_0.9.6          
#>  [73] BiocVersion_3.21.1      foreach_1.5.2           pillar_1.10.1          
#>  [76] vroom_1.6.5             RcppHNSW_0.6.0          splines_4.4.2          
#>  [79] tweenr_2.0.3            lattice_0.22-6          survival_3.8-3         
#>  [82] bit_4.5.0.1             emdist_0.3-3            RProtoBufLib_2.19.0    
#>  [85] tidyselect_1.2.1        Biostrings_2.75.3       maketools_1.3.1        
#>  [88] knitr_1.49              gridExtra_2.3           xfun_0.50              
#>  [91] graphlayouts_1.2.1      hardhat_1.4.0           timeDate_4041.110      
#>  [94] stringi_1.8.4           UCSC.utils_1.3.1        yaml_2.3.10            
#>  [97] evaluate_1.0.3          codetools_0.2-20        ggraph_2.2.1           
#> [100] tibble_3.2.1            BiocManager_1.30.25     cli_3.6.3              
#> [103] uwot_0.2.2              rpart_4.1.24            munsell_0.5.1          
#> [106] jquerylib_0.1.4         Rcpp_1.0.14             globals_0.16.3         
#> [109] png_0.1-8               parallel_4.4.2          gower_1.0.2            
#> [112] readr_2.1.5             blob_1.2.4              listenv_0.9.1          
#> [115] glmnet_4.1-8            viridisLite_0.4.2       ipred_0.9-15           
#> [118] ggridges_0.5.6          scales_1.3.0            prodlim_2024.06.25     
#> [121] purrr_1.0.2             crayon_1.5.3            rlang_1.1.4            
#> [124] KEGGREST_1.47.0