cytofQC
This is a description of the workflow for cytofQC
, which
applies QC labels to each observation in a CyTOF dataset. Although other
packages on Bioconductor can work with CyTOF data, cytofQC
is the only package that uses a model based and labeling approach to
cleaning CyTOF data. CATALYST
does data normalization using
bead normalization and can remove identified beads from the data. It
also has many other functions, but it does not clean out debris,
doublets, or dead cells. Other Bioconductor packages that analyze CyTOF
data assume that data have been cleaned prior to using their package.
Our package does not replicate the features of any Bioconductor
packages. It uses a statistical learning approach to label each event in
a CyTOF dataset as a cell, gdpZero (has a 0 value for at least one
Gaussian parameter) bead, debris, doublet, or dead event. Data are
stored as a SingleCellExperiment
along with the labels and
other types of information that help data users interpret the labels.
Our method is able to distinguish between doublets and large cells,
something that other cleaning methods struggle to do.
The Quick Start section shows how to read in data from FCS files and
use the general function labelQC
to obtain labels for all
of the observations. We then demonstrate the inner workings of the
function to illustrate how it works for each observation type and show
how to use the individual modeling functions that make up
labelQC
. We end with a demonstration of how the labeled
data cluster using UMAP.
The function readCytof
read in the data and returns a
SingleCellExperiment
that contains all of the QC variables
and place holders for the labels, event scores, initial classification,
and event probabilities. The original data is stored in the
SingleCellExperiment
using the format used in
CATALYST
. The function for reading in the file requires
that the names of the QC variables are identified. The function
prepData
from CATALYST
can be used to identify
the names of the QC variables.
f <- system.file("extdata", "raw_cytof.fcs", package = "cytofQC")
x <- prepData(f)
names(int_colData(x))
#> [1] "reducedDims" "altExps" "colPairs" "Time" "Event_length"
#> [6] "Center" "Offset" "Width" "Residual"
rownames(x)
#> [1] "CD45" "Live_Dead" "120Sn" "127I" "131Xe"
#> [6] "133Cs" "138Ba" "Bead" "CD196_CCR6" "Bead"
#> [11] "CD123_IL-3R" "CD19" "CD4" "CD8a" "CD11c"
#> [16] "CD16" "CD45RO" "CD45RA" "CD161" "CD194_CCR4"
#> [21] "CD25_IL-2Ra" "CD27" "CD57" "CD183_CXCR3" "CD185_CXCR5"
#> [26] "CD28" "CD38" "CD56_NCAM" "TCRgd" "Bead"
#> [31] "CD294" "CD197_CCR7" "CD14" "CD33" "CD3"
#> [36] "CD20" "CD66b" "HLA-DR" "IgD" "Bead"
#> [41] "CD127_IL-7Ra" "190BCKG" "DNA1" "DNA2" "194Pt"
#> [46] "195Pt" "198Pt" "208Pb" "CD11b"
x <- readCytof(f,
beads = c("Bead"),
dna = c("DNA1", "DNA2"),
event_length = "Event_length",
viability = "Live_Dead",
gaussian = c("Center", "Offset", "Width", "Residual"))
x
#> class: SingleCellExperiment
#> dim: 49 6450
#> metadata(2): experiment_info chs_by_fcs
#> assays(2): counts exprs
#> rownames(49): CD45 Live_Dead ... 208Pb CD11b
#> rowData names(4): channel_name marker_name marker_class use_channel
#> colnames: NULL
#> colData names(6): sample_id label ... scores initial
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
The function readCytof
adds four DataFrames
that are place holders for the cytofQC
functions. The four
DataFrames
are:
tech
: Contains the data used to create the event scores
and fit the models that label the data. The Gaussian parameters and the
event length are transformed with the log1p
function.
label
: A character vector that contains the labels for
the event type: bead, debris, doublet, dead (a measure of viability),
cell for events that appear to be legitimate cells, and GDPzero for
events where at least one Gaussian parameter or the Event_length has a
value of 0. The call to readCytof
sets all labels to “cell”
or to “GDPzero”. Later function calls will change these values.
scores
: Initially filled with NA’s, but will eventually
be filled with scores for each event type that are used to select a
training dataset that is used to create a model that labels the
data.
initial
: Initially filled in with zeros. It is filled
with the initial classification for each even type. A value of -1 means
that the even is not of that event type, a value of 1 means it is very
likely that even type, and a value of 0 means it is unclear. Only events
with values of -1 or 1 are included in the training dataset.
probs
: Initially filled with NA. The predicted
probability from the model classifying each event type is recorded in
this DataFrame
. The labels are assigned in order. That is,
once an event is classified as a bead, it cannot be classified as a
different event type. The probabilities in this DataFrame
can be used to identify events that look similar to another event
type.
head(tech(x))
#> DataFrame with 6 rows and 12 columns
#> Bead1 Bead2 Bead3 Bead4 DNA1 DNA2 Viability
#> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> 1 0.00000 0.0000000 0.038878 0 6.71406 7.37525 0.4747735
#> 2 1.09948 1.1223978 0.469108 0 6.32194 6.90066 0.0000000
#> 3 1.60338 0.0000000 1.011434 0 6.59706 7.13219 0.0269896
#> 4 0.00000 0.0000000 0.000000 0 6.91721 7.56727 0.5478744
#> 5 0.00000 0.1370409 0.000000 0 6.43765 7.06383 0.0000000
#> 6 0.00000 0.0299852 0.000000 0 6.29761 6.97990 0.0000000
#> Event_length Center Offset Width Residual
#> <numeric> <numeric> <numeric> <numeric> <numeric>
#> 1 3.29584 7.02432 4.57859 4.30889 4.49785
#> 2 3.61092 7.43631 4.66198 4.71573 4.90282
#> 3 3.87120 7.52199 4.53954 4.84790 4.82643
#> 4 3.29584 6.85111 4.40316 4.12833 4.71137
#> 5 3.80666 7.54676 4.75641 5.00201 5.01878
#> 6 3.91202 7.68825 4.70737 5.05778 4.97622
head(label(x))
#> [1] "cell" "cell" "cell" "cell" "cell" "cell"
head(scores(x))
#> DataFrame with 6 rows and 4 columns
#> beadScore debrisScore doubletScore deadScore
#> <logical> <logical> <logical> <logical>
#> 1 NA NA NA NA
#> 2 NA NA NA NA
#> 3 NA NA NA NA
#> 4 NA NA NA NA
#> 5 NA NA NA NA
#> 6 NA NA NA NA
head(initial(x))
#> DataFrame with 6 rows and 4 columns
#> beadInitial debrisInitial doubletInitial deadInitial
#> <numeric> <numeric> <numeric> <numeric>
#> 1 0 0 0 0
#> 2 0 0 0 0
#> 3 0 0 0 0
#> 4 0 0 0 0
#> 5 0 0 0 0
#> 6 0 0 0 0
head(probs(x))
#> DataFrame with 6 rows and 4 columns
#> bead debris doublet dead
#> <logical> <logical> <logical> <logical>
#> 1 NA NA NA NA
#> 2 NA NA NA NA
#> 3 NA NA NA NA
#> 4 NA NA NA NA
#> 5 NA NA NA NA
#> 6 NA NA NA NA
The integral function of cytofQC is labelQC
. A
SingleCellExperiment
obtained from readCytof
can be passed to labelQC
with no other arguments and it
will return a data.frame
that contains all of the labels
and the probability that each observation belongs to each of the
specified labels. The following call will label the cells using a
support vector machine and 4000 data points to train the SVM for each
event type. If there are not 2000 events with a -1 or 1 initial
classification assignment a warning is given and a smaller dataset is
used to train the data. Note that this dataset does not appear to have
any dead cells.
x <- labelQC(x)
table(label(x))
#>
#> bead cell debris doublet gdpZero
#> 150 5027 595 378 300
head(scores(x))
#> DataFrame with 6 rows and 4 columns
#> beadScore debrisScore doubletScore deadScore
#> <numeric> <numeric> <numeric> <numeric>
#> 1 0.0388780 0.2514018 0.333115 0.476094
#> 2 2.6909858 0.5996081 0.642124 -0.592169
#> 3 2.6148098 -0.0820102 1.614873 -0.531441
#> 4 0.0000000 -0.1195681 1.041812 0.640575
#> 5 0.1370409 0.1819777 1.186818 -0.592169
#> 6 0.0299852 0.2566207 1.287680 -0.592169
head(initial(x))
#> DataFrame with 6 rows and 4 columns
#> beadInitial debrisInitial doubletInitial deadInitial
#> <numeric> <numeric> <numeric> <numeric>
#> 1 -1 -1 -1 -1
#> 2 -1 -1 -1 -1
#> 3 -1 -1 -1 -1
#> 4 -1 -1 -1 0
#> 5 -1 -1 -1 -1
#> 6 -1 -1 -1 -1
head(probs(x))
#> DataFrame with 6 rows and 4 columns
#> bead debris doublet dead
#> <numeric> <numeric> <numeric> <numeric>
#> 1 0.000288289 8.64510e-05 6.75484e-05 0
#> 2 0.000437201 2.86300e-03 2.63498e-04 0
#> 3 0.000524323 4.95308e-03 8.05475e-03 0
#> 4 0.000281412 1.82789e-05 5.09614e-04 0
#> 5 0.000327662 8.77436e-05 7.85407e-04 0
#> 6 0.000368489 1.32218e-04 1.31476e-03 0
Histograms can be created using cytofHist
. The function
creates a histogram colored by groups. The following uses
cytofHist
to examine the event scores and how the labels
relate to them. The package includes four get
functions
that retrieve the various results of the labels. The four functions are
scores
, probs
, label
, and
tech
. The functions scores
and
probs
can return the vector of scores or probabilities for
a specific event type (‘bead’, ‘dead’, ‘debris’, or ‘doublet’) or can
return the entire DataFrame
containing all of the scores or
probabilities. The function label
returns a character
vector containing all of the label assignments and tech
returns the DataFrame
containing all of the data used to
label the events. Note that a plot is not done for the dead scores
because none of the events in the example dataset are classified as
dead.
bead <- cytofHist(scores(x, type = 'bead'), label(x), title = "Bead score")
debris <- cytofHist(scores(x, type = 'debris'), label(x), title = "Debris score")
doublet <- cytofHist(scores(x, type = 'doublet'), label(x), title = "Doublet score")
grid.arrange(bead, debris, doublet, ncol = 1)
It can be difficult to see some of the less represented event types. The argument type = “density” creates a histogram where each of the groups are represented by the same area instead of by count so that small groups are visible.
bead <- cytofHist(scores(x, type = 'bead'), label(x), type = "density", title = "Bead score")
debris <- cytofHist(scores(x, type = 'debris'), label(x), type = "density", title = "Debris score")
doublet <- cytofHist(scores(x, type = 'doublet'), label(x), type = "density", title = "Doublet score")
grid.arrange(bead, debris, doublet, ncol = 1)
Labeling can be done using a random forest or gradient boosting machine as well. The following code shows how to generate labels with a random forest or gradient boosting machine using classification error as a loss function.
Each of the event types can be modeled separately outside of
labelQC
. The following code snippets show how to do this.
This code is the method used by labelQC
so this section
demonstrates the labeling methods.
The beads are typically the first type of observation that should be
labeled. The beads should separate clearly from the non-beads. The first
step is to obtain initial labels for the beads. This is done with the
initialBead
function. The function returns a
SingleCellExperiment
that contains the bead assignment for
each observation in the scores
object and an initial
classification that in the initial
object. This information
is then used to assign a label of bead
to observations that
look like beads. The predicted probability returned by the
classification model is reported in the probs
object and
indicates how much each observation looks like a bead.
The debris are typically classified after the beads. They are identified primarily by low DNA content and short event time. Labeling debris prior to doublets aids in identifying doublets.
Doublets are more difficult to identify than beads or debris so it is recommended that they are labeled prior to labeling doublets.
Viability is not as straightforward as many CyTOF data cleaning sites indicate. The viability measure is really measuring the permeability of the cell. It may be important to the downstream analysis or it may not. It is important to understand how permeability impacts your data. However, because it is most often used to distinguish between live and dead cells, it is referred to as “dead” for brevity.
Note that this produces a warning that there are not enough dead or
non-dead cells to build a classification model for viability. This is
because fewer than 100 points were classified as dead cells by the
initialDead
function. Thus, none of the observations will
be labeled “dead” in .
The following is a UMAP that was created using only the
tech
data and colored with the labels from
cytofQC
. Note that the labels from cytofQC
match the clusters in the UMAP and that neither they nor the scores were
used to generate the UMAP.
library(utils)
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] RColorBrewer_1.1-3 uwot_0.2.2
#> [3] Matrix_1.7-1 gridExtra_2.3
#> [5] ggplot2_3.5.1 CATALYST_1.31.2
#> [7] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
#> [9] Biobase_2.67.0 GenomicRanges_1.59.1
#> [11] GenomeInfoDb_1.43.2 IRanges_2.41.1
#> [13] S4Vectors_0.45.2 BiocGenerics_0.53.3
#> [15] generics_0.1.3 MatrixGenerics_1.19.0
#> [17] matrixStats_1.4.1 cytofQC_1.7.0
#> [19] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] RcppAnnoy_0.0.22 splines_4.4.2
#> [3] tibble_3.2.1 polyclip_1.10-7
#> [5] XML_3.99-0.17 rpart_4.1.23
#> [7] lifecycle_1.0.4 rstatix_0.7.2
#> [9] doParallel_1.0.17 lattice_0.22-6
#> [11] MASS_7.3-61 backports_1.5.0
#> [13] magrittr_2.0.3 sass_0.4.9
#> [15] rmarkdown_2.29 jquerylib_0.1.4
#> [17] yaml_2.3.10 plotrix_3.8-4
#> [19] cowplot_1.1.3 buildtools_1.0.0
#> [21] ConsensusClusterPlus_1.71.0 multcomp_1.4-26
#> [23] abind_1.4-8 zlibbioc_1.52.0
#> [25] Rtsne_0.17 purrr_1.0.2
#> [27] pracma_2.4.4 TH.data_1.1-2
#> [29] tweenr_2.0.3 sandwich_3.1-1
#> [31] gdtools_0.4.1 circlize_0.4.16
#> [33] GenomeInfoDbData_1.2.13 ggrepel_0.9.6
#> [35] irlba_2.3.5.1 maketools_1.3.1
#> [37] optimx_2023-10.21 codetools_0.2-20
#> [39] DelayedArray_0.33.2 scuttle_1.17.0
#> [41] ggforce_0.4.2 tidyselect_1.2.1
#> [43] shape_1.4.6.1 UCSC.utils_1.3.0
#> [45] farver_2.1.2 ScaledMatrix_1.15.0
#> [47] viridis_0.6.5 jsonlite_1.8.9
#> [49] GetoptLong_1.0.5 BiocNeighbors_2.1.1
#> [51] e1071_1.7-16 Formula_1.2-5
#> [53] ggridges_0.5.6 survival_3.7-0
#> [55] scater_1.35.0 iterators_1.0.14
#> [57] systemfonts_1.1.0 foreach_1.5.2
#> [59] tools_4.4.2 ggnewscale_0.5.0
#> [61] hrbrthemes_0.8.7 Rcpp_1.0.13-1
#> [63] glue_1.8.0 Rttf2pt1_1.3.12
#> [65] SparseArray_1.7.2 xfun_0.49
#> [67] dplyr_1.1.4 withr_3.0.2
#> [69] numDeriv_2016.8-1.1 BiocManager_1.30.25
#> [71] fastmap_1.2.0 fansi_1.0.6
#> [73] digest_0.6.37 rsvd_1.0.5
#> [75] R6_2.5.1 colorspace_2.1-1
#> [77] EZtune_3.1.1 gtools_3.9.5
#> [79] utf8_1.2.4 tidyr_1.3.1
#> [81] fontLiberation_0.1.0 data.table_1.16.2
#> [83] class_7.3-22 httr_1.4.7
#> [85] S4Arrays_1.7.1 pkgconfig_2.0.3
#> [87] gtable_0.3.6 ComplexHeatmap_2.23.0
#> [89] RProtoBufLib_2.19.0 XVector_0.47.0
#> [91] sys_3.4.3 htmltools_0.5.8.1
#> [93] fontBitstreamVera_0.1.1 carData_3.0-5
#> [95] clue_0.3-66 scales_1.3.0
#> [97] png_0.1-8 colorRamps_2.3.4
#> [99] knitr_1.49 reshape2_1.4.4
#> [101] rjson_0.2.23 nloptr_2.1.1
#> [103] proxy_0.4-27 cachem_1.1.0
#> [105] zoo_1.8-12 GlobalOptions_0.1.2
#> [107] stringr_1.5.1 parallel_4.4.2
#> [109] vipor_0.4.7 extrafont_0.19
#> [111] pillar_1.9.0 grid_4.4.2
#> [113] vctrs_0.6.5 ggpubr_0.6.0
#> [115] car_3.1-3 BiocSingular_1.23.0
#> [117] cytolib_2.19.0 beachmat_2.23.2
#> [119] cluster_2.1.6 extrafontdb_1.0
#> [121] beeswarm_0.4.0 evaluate_1.0.1
#> [123] mvtnorm_1.3-2 cli_3.6.3
#> [125] compiler_4.4.2 rlang_1.1.4
#> [127] crayon_1.5.3 ggsignif_0.6.4
#> [129] labeling_0.4.3 FlowSOM_2.15.0
#> [131] flowCore_2.19.0 plyr_1.8.9
#> [133] ggbeeswarm_0.7.2 stringi_1.8.4
#> [135] viridisLite_0.4.2 BiocParallel_1.41.0
#> [137] nnls_1.6 munsell_0.5.1
#> [139] fontquiver_0.2.1 ROCR_1.0-11
#> [141] drc_3.0-1 igraph_2.1.1
#> [143] broom_1.0.7 bslib_0.8.0