Complete Guide to cytofQC

Introduction

This is a description of the workflow for cytofQC, which applies QC labels to each observation in a CyTOF dataset. Although other packages on Bioconductor can work with CyTOF data, cytofQC is the only package that uses a model based and labeling approach to cleaning CyTOF data. CATALYST does data normalization using bead normalization and can remove identified beads from the data. It also has many other functions, but it does not clean out debris, doublets, or dead cells. Other Bioconductor packages that analyze CyTOF data assume that data have been cleaned prior to using their package. Our package does not replicate the features of any Bioconductor packages. It uses a statistical learning approach to label each event in a CyTOF dataset as a cell, gdpZero (has a 0 value for at least one Gaussian parameter) bead, debris, doublet, or dead event. Data are stored as a SingleCellExperiment along with the labels and other types of information that help data users interpret the labels. Our method is able to distinguish between doublets and large cells, something that other cleaning methods struggle to do.

The Quick Start section shows how to read in data from FCS files and use the general function labelQC to obtain labels for all of the observations. We then demonstrate the inner workings of the function to illustrate how it works for each observation type and show how to use the individual modeling functions that make up labelQC. We end with a demonstration of how the labeled data cluster using UMAP.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("cytofQC")
library(cytofQC)
library(CATALYST)
library(SingleCellExperiment)
library(ggplot2)
library(gridExtra)

Quick Start

Read in data and create initial dataset

The function readCytof read in the data and returns a SingleCellExperiment that contains all of the QC variables and place holders for the labels, event scores, initial classification, and event probabilities. The original data is stored in the SingleCellExperiment using the format used in CATALYST. The function for reading in the file requires that the names of the QC variables are identified. The function prepData from CATALYST can be used to identify the names of the QC variables.

f <- system.file("extdata", "raw_cytof.fcs", package = "cytofQC")
x <- prepData(f)
names(int_colData(x))
#> [1] "reducedDims"  "altExps"      "colPairs"     "Time"         "Event_length"
#> [6] "Center"       "Offset"       "Width"        "Residual"
rownames(x)
#>  [1] "CD45"         "Live_Dead"    "120Sn"        "127I"         "131Xe"       
#>  [6] "133Cs"        "138Ba"        "Bead"         "CD196_CCR6"   "Bead"        
#> [11] "CD123_IL-3R"  "CD19"         "CD4"          "CD8a"         "CD11c"       
#> [16] "CD16"         "CD45RO"       "CD45RA"       "CD161"        "CD194_CCR4"  
#> [21] "CD25_IL-2Ra"  "CD27"         "CD57"         "CD183_CXCR3"  "CD185_CXCR5" 
#> [26] "CD28"         "CD38"         "CD56_NCAM"    "TCRgd"        "Bead"        
#> [31] "CD294"        "CD197_CCR7"   "CD14"         "CD33"         "CD3"         
#> [36] "CD20"         "CD66b"        "HLA-DR"       "IgD"          "Bead"        
#> [41] "CD127_IL-7Ra" "190BCKG"      "DNA1"         "DNA2"         "194Pt"       
#> [46] "195Pt"        "198Pt"        "208Pb"        "CD11b"

x <- readCytof(f, 
                 beads = c("Bead"),
                 dna = c("DNA1", "DNA2"),
                 event_length = "Event_length",
                 viability = "Live_Dead",
                 gaussian = c("Center", "Offset", "Width", "Residual"))
x
#> class: SingleCellExperiment 
#> dim: 49 6450 
#> metadata(2): experiment_info chs_by_fcs
#> assays(2): counts exprs
#> rownames(49): CD45 Live_Dead ... 208Pb CD11b
#> rowData names(4): channel_name marker_name marker_class use_channel
#> colnames: NULL
#> colData names(6): sample_id label ... scores initial
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

The function readCytof adds four DataFrames that are place holders for the cytofQC functions. The four DataFrames are:

tech: Contains the data used to create the event scores and fit the models that label the data. The Gaussian parameters and the event length are transformed with the log1p function.

label: A character vector that contains the labels for the event type: bead, debris, doublet, dead (a measure of viability), cell for events that appear to be legitimate cells, and GDPzero for events where at least one Gaussian parameter or the Event_length has a value of 0. The call to readCytof sets all labels to “cell” or to “GDPzero”. Later function calls will change these values.

scores: Initially filled with NA’s, but will eventually be filled with scores for each event type that are used to select a training dataset that is used to create a model that labels the data.

initial: Initially filled in with zeros. It is filled with the initial classification for each even type. A value of -1 means that the even is not of that event type, a value of 1 means it is very likely that even type, and a value of 0 means it is unclear. Only events with values of -1 or 1 are included in the training dataset.

probs: Initially filled with NA. The predicted probability from the model classifying each event type is recorded in this DataFrame. The labels are assigned in order. That is, once an event is classified as a bead, it cannot be classified as a different event type. The probabilities in this DataFrame can be used to identify events that look similar to another event type.

head(tech(x))
#> DataFrame with 6 rows and 12 columns
#>       Bead1     Bead2     Bead3     Bead4      DNA1      DNA2 Viability
#>   <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> 1   0.00000 0.0000000  0.038878         0   6.71406   7.37525 0.4747735
#> 2   1.09948 1.1223978  0.469108         0   6.32194   6.90066 0.0000000
#> 3   1.60338 0.0000000  1.011434         0   6.59706   7.13219 0.0269896
#> 4   0.00000 0.0000000  0.000000         0   6.91721   7.56727 0.5478744
#> 5   0.00000 0.1370409  0.000000         0   6.43765   7.06383 0.0000000
#> 6   0.00000 0.0299852  0.000000         0   6.29761   6.97990 0.0000000
#>   Event_length    Center    Offset     Width  Residual
#>      <numeric> <numeric> <numeric> <numeric> <numeric>
#> 1      3.29584   7.02432   4.57859   4.30889   4.49785
#> 2      3.61092   7.43631   4.66198   4.71573   4.90282
#> 3      3.87120   7.52199   4.53954   4.84790   4.82643
#> 4      3.29584   6.85111   4.40316   4.12833   4.71137
#> 5      3.80666   7.54676   4.75641   5.00201   5.01878
#> 6      3.91202   7.68825   4.70737   5.05778   4.97622
head(label(x))
#> [1] "cell" "cell" "cell" "cell" "cell" "cell"
head(scores(x))
#> DataFrame with 6 rows and 4 columns
#>   beadScore debrisScore doubletScore deadScore
#>   <logical>   <logical>    <logical> <logical>
#> 1        NA          NA           NA        NA
#> 2        NA          NA           NA        NA
#> 3        NA          NA           NA        NA
#> 4        NA          NA           NA        NA
#> 5        NA          NA           NA        NA
#> 6        NA          NA           NA        NA
head(initial(x))
#> DataFrame with 6 rows and 4 columns
#>   beadInitial debrisInitial doubletInitial deadInitial
#>     <numeric>     <numeric>      <numeric>   <numeric>
#> 1           0             0              0           0
#> 2           0             0              0           0
#> 3           0             0              0           0
#> 4           0             0              0           0
#> 5           0             0              0           0
#> 6           0             0              0           0
head(probs(x))
#> DataFrame with 6 rows and 4 columns
#>        bead    debris   doublet      dead
#>   <logical> <logical> <logical> <logical>
#> 1        NA        NA        NA        NA
#> 2        NA        NA        NA        NA
#> 3        NA        NA        NA        NA
#> 4        NA        NA        NA        NA
#> 5        NA        NA        NA        NA
#> 6        NA        NA        NA        NA

Use labelQC to obtain labels for each obsevation

The integral function of cytofQC is labelQC. A SingleCellExperiment obtained from readCytof can be passed to labelQC with no other arguments and it will return a data.frame that contains all of the labels and the probability that each observation belongs to each of the specified labels. The following call will label the cells using a support vector machine and 4000 data points to train the SVM for each event type. If there are not 2000 events with a -1 or 1 initial classification assignment a warning is given and a smaller dataset is used to train the data. Note that this dataset does not appear to have any dead cells.

x <- labelQC(x)
table(label(x))
#> 
#>    bead    cell  debris doublet gdpZero 
#>     150    5025     599     376     300
head(scores(x))
#> DataFrame with 6 rows and 4 columns
#>   beadScore debrisScore doubletScore deadScore
#>   <numeric>   <numeric>    <numeric> <numeric>
#> 1 0.0388780   0.2514018     0.333115  0.476094
#> 2 2.6909858   0.5996081     0.642124 -0.592169
#> 3 2.6148098  -0.0820102     1.614873 -0.531441
#> 4 0.0000000  -0.1195681     1.041812  0.640575
#> 5 0.1370409   0.1819777     1.186818 -0.592169
#> 6 0.0299852   0.2566207     1.287680 -0.592169
head(initial(x))
#> DataFrame with 6 rows and 4 columns
#>   beadInitial debrisInitial doubletInitial deadInitial
#>     <numeric>     <numeric>      <numeric>   <numeric>
#> 1          -1            -1             -1          -1
#> 2          -1            -1             -1          -1
#> 3          -1            -1             -1          -1
#> 4          -1            -1             -1          -1
#> 5          -1            -1             -1          -1
#> 6          -1            -1             -1          -1
head(probs(x))
#> DataFrame with 6 rows and 4 columns
#>          bead      debris     doublet      dead
#>     <numeric>   <numeric>   <numeric> <numeric>
#> 1 0.000288051 8.64515e-05 9.30144e-05         0
#> 2 0.000436660 2.42873e-03 2.04456e-04         0
#> 3 0.000460188 1.84206e-03 8.66004e-03         0
#> 4 0.000287349 2.03377e-05 8.27420e-04         0
#> 5 0.000320680 1.18993e-04 7.85932e-04         0
#> 6 0.000352565 1.48816e-04 1.34518e-03         0

Histograms can be created using cytofHist. The function creates a histogram colored by groups. The following uses cytofHist to examine the event scores and how the labels relate to them. The package includes four get functions that retrieve the various results of the labels. The four functions are scores, probs, label, and tech. The functions scores and probs can return the vector of scores or probabilities for a specific event type (‘bead’, ‘dead’, ‘debris’, or ‘doublet’) or can return the entire DataFrame containing all of the scores or probabilities. The function label returns a character vector containing all of the label assignments and tech returns the DataFrame containing all of the data used to label the events. Note that a plot is not done for the dead scores because none of the events in the example dataset are classified as dead.

bead <- cytofHist(scores(x, type = 'bead'), label(x), title = "Bead score")
debris <- cytofHist(scores(x, type = 'debris'), label(x), title = "Debris score")
doublet <- cytofHist(scores(x, type = 'doublet'), label(x), title = "Doublet score")
grid.arrange(bead, debris, doublet, ncol = 1)

It can be difficult to see some of the less represented event types. The argument type = “density” creates a histogram where each of the groups are represented by the same area instead of by count so that small groups are visible.

bead <- cytofHist(scores(x, type = 'bead'), label(x), type = "density", title = "Bead score")
debris <- cytofHist(scores(x, type = 'debris'), label(x), type = "density", title = "Debris score")
doublet <- cytofHist(scores(x, type = 'doublet'), label(x), type = "density", title = "Doublet score")
grid.arrange(bead, debris, doublet, ncol = 1)

Labeling can be done using a random forest or gradient boosting machine as well. The following code shows how to generate labels with a random forest or gradient boosting machine using classification error as a loss function.

x.rf <- labelQC(x, model = "rf", loss = "class")
x.gbm <- labelQC(x, model = "gbm", loss = "class")

Individual functions

Each of the event types can be modeled separately outside of labelQC. The following code snippets show how to do this. This code is the method used by labelQC so this section demonstrates the labeling methods.

Beads

The beads are typically the first type of observation that should be labeled. The beads should separate clearly from the non-beads. The first step is to obtain initial labels for the beads. This is done with the initialBead function. The function returns a SingleCellExperiment that contains the bead assignment for each observation in the scores object and an initial classification that in the initial object. This information is then used to assign a label of bead to observations that look like beads. The predicted probability returned by the classification model is reported in the probs object and indicates how much each observation looks like a bead.

x <- readCytof(f, 
                 beads = c("Bead"),
                 dna = c("DNA1", "DNA2"),
                 event_length = "Event_length",
                 viability = "Live_Dead",
                 gaussian = c("Center", "Offset", "Width", "Residual"))

x <- initialBead(x)
x <- svmLabel(x, type = "bead", n = 500)

Debris

The debris are typically classified after the beads. They are identified primarily by low DNA content and short event time. Labeling debris prior to doublets aids in identifying doublets.

x <- initialDebris(x)
x <- svmLabel(x, type = "debris", n = 500)

Doublets

Doublets are more difficult to identify than beads or debris so it is recommended that they are labeled prior to labeling doublets.

x <- initialDoublet(x)
x <- svmLabel(x, type = "doublet", n = 500)

Dead cells

Viability is not as straightforward as many CyTOF data cleaning sites indicate. The viability measure is really measuring the permeability of the cell. It may be important to the downstream analysis or it may not. It is important to understand how permeability impacts your data. However, because it is most often used to distinguish between live and dead cells, it is referred to as “dead” for brevity.

Note that this produces a warning that there are not enough dead or non-dead cells to build a classification model for viability. This is because fewer than 100 points were classified as dead cells by the initialDead function. Thus, none of the observations will be labeled “dead” in .

x <- initialDead(x)
x <- svmLabel(x, type = "dead", n = 500)

UMAP for exploration

The following is a UMAP that was created using only the tech data and colored with the labels from cytofQC. Note that the labels from cytofQC match the clusters in the UMAP and that neither they nor the scores were used to generate the UMAP.

library(uwot)
lab.umap <- umap(scale(x$tech), ret_model = TRUE)
lab.umapD <- data.frame(x = lab.umap$embedding[, 1], 
                        y = lab.umap$embedding[, 2], 
                        labs = x$label)
library(RColorBrewer)
ggplot(lab.umapD, aes(x = x, y = y)) +
    geom_point(aes(color = labs), size = 0.5, alpha = 0.5) +
    scale_color_manual(name = NULL, values = brewer.pal(5, "Dark2")) +
    guides(colour = guide_legend(override.aes = list(size = 2))) +
    labs(x = NULL, y = NULL) +
    theme_bw()

SessionInfo

library(utils)
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] RColorBrewer_1.1-3          uwot_0.2.2                 
#>  [3] Matrix_1.7-1                gridExtra_2.3              
#>  [5] ggplot2_3.5.1               CATALYST_1.31.2            
#>  [7] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
#>  [9] Biobase_2.67.0              GenomicRanges_1.59.1       
#> [11] GenomeInfoDb_1.43.2         IRanges_2.41.2             
#> [13] S4Vectors_0.45.2            BiocGenerics_0.53.3        
#> [15] generics_0.1.3              MatrixGenerics_1.19.0      
#> [17] matrixStats_1.4.1           cytofQC_1.7.0              
#> [19] BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] sys_3.4.3                   jsonlite_1.8.9             
#>   [3] shape_1.4.6.1               magrittr_2.0.3             
#>   [5] ggbeeswarm_0.7.2            TH.data_1.1-2              
#>   [7] nloptr_2.1.1                farver_2.1.2               
#>   [9] rmarkdown_2.29              GlobalOptions_0.1.2        
#>  [11] vctrs_0.6.5                 ROCR_1.0-11                
#>  [13] rstatix_0.7.2               htmltools_0.5.8.1          
#>  [15] S4Arrays_1.7.1              plotrix_3.8-4              
#>  [17] BiocNeighbors_2.1.2         broom_1.0.7                
#>  [19] SparseArray_1.7.2           Formula_1.2-5              
#>  [21] pracma_2.4.4                sass_0.4.9                 
#>  [23] bslib_0.8.0                 plyr_1.8.9                 
#>  [25] sandwich_3.1-1              zoo_1.8-12                 
#>  [27] cachem_1.1.0                buildtools_1.0.0           
#>  [29] igraph_2.1.2                lifecycle_1.0.4            
#>  [31] iterators_1.0.14            pkgconfig_2.0.3            
#>  [33] optimx_2024-12.2            rsvd_1.0.5                 
#>  [35] R6_2.5.1                    fastmap_1.2.0              
#>  [37] GenomeInfoDbData_1.2.13     clue_0.3-66                
#>  [39] numDeriv_2016.8-1.1         digest_0.6.37              
#>  [41] colorspace_2.1-1            ggnewscale_0.5.0           
#>  [43] scater_1.35.0               irlba_2.3.5.1              
#>  [45] ggpubr_0.6.0                beachmat_2.23.5            
#>  [47] labeling_0.4.3              cytolib_2.19.1             
#>  [49] colorRamps_2.3.4            nnls_1.6                   
#>  [51] httr_1.4.7                  polyclip_1.10-7            
#>  [53] abind_1.4-8                 compiler_4.4.2             
#>  [55] proxy_0.4-27                EZtune_3.1.1               
#>  [57] fontquiver_0.2.1            withr_3.0.2                
#>  [59] doParallel_1.0.17           ConsensusClusterPlus_1.71.0
#>  [61] backports_1.5.0             BiocParallel_1.41.0        
#>  [63] viridis_0.6.5               carData_3.0-5              
#>  [65] Rttf2pt1_1.3.12             ggforce_0.4.2              
#>  [67] ggsignif_0.6.4              MASS_7.3-61                
#>  [69] drc_3.0-1                   DelayedArray_0.33.3        
#>  [71] rjson_0.2.23                FlowSOM_2.15.0             
#>  [73] gtools_3.9.5                tools_4.4.2                
#>  [75] vipor_0.4.7                 beeswarm_0.4.0             
#>  [77] extrafontdb_1.0             glue_1.8.0                 
#>  [79] grid_4.4.2                  Rtsne_0.17                 
#>  [81] cluster_2.1.8               reshape2_1.4.4             
#>  [83] gtable_0.3.6                class_7.3-22               
#>  [85] tidyr_1.3.1                 data.table_1.16.4          
#>  [87] BiocSingular_1.23.0         ScaledMatrix_1.15.0        
#>  [89] car_3.1-3                   XVector_0.47.1             
#>  [91] RcppAnnoy_0.0.22            ggrepel_0.9.6              
#>  [93] foreach_1.5.2               pillar_1.10.0              
#>  [95] stringr_1.5.1               circlize_0.4.16            
#>  [97] splines_4.4.2               flowCore_2.19.0            
#>  [99] dplyr_1.1.4                 tweenr_2.0.3               
#> [101] lattice_0.22-6              survival_3.8-3             
#> [103] RProtoBufLib_2.19.0         tidyselect_1.2.1           
#> [105] fontLiberation_0.1.0        ComplexHeatmap_2.23.0      
#> [107] maketools_1.3.1             scuttle_1.17.0             
#> [109] knitr_1.49                  fontBitstreamVera_0.1.1    
#> [111] xfun_0.49                   stringi_1.8.4              
#> [113] UCSC.utils_1.3.0            yaml_2.3.10                
#> [115] evaluate_1.0.1              codetools_0.2-20           
#> [117] extrafont_0.19              gdtools_0.4.1              
#> [119] tibble_3.2.1                BiocManager_1.30.25        
#> [121] cli_3.6.3                   rpart_4.1.23               
#> [123] systemfonts_1.1.0           munsell_0.5.1              
#> [125] jquerylib_0.1.4             Rcpp_1.0.13-1              
#> [127] png_0.1-8                   XML_3.99-0.17              
#> [129] parallel_4.4.2              hrbrthemes_0.8.7           
#> [131] viridisLite_0.4.2           mvtnorm_1.3-2              
#> [133] e1071_1.7-16                scales_1.3.0               
#> [135] ggridges_0.5.6              purrr_1.0.2                
#> [137] crayon_1.5.3                GetoptLong_1.0.5           
#> [139] rlang_1.1.4                 cowplot_1.1.3              
#> [141] multcomp_1.4-26