There are currently two general strategies for working with cytometry data in R. The first is to perform the entire analysis from within the R coding environment. There are some excellent packages and standards that allow the import of FCS files, compensation, data transformation, plotting, and exporting summary statistics. However, manual gating of flow cytometry data remains cumbersome and difficult to perform accurately. Of note, there are packages available now that enable automatic, data-driven gates that avoid these problems, but are themselves complex to properly prepare and validate the automatic gates. Moreover, these data-driven gates represent an unfamiliar workflow for the majority of cytometerists that are accustomed to GUI-based analyses such as those enabled by FlowJoTM, KaluzaTM, and other cytometry analysis software.
The second strategy is to first perform compensation, transformation, and gating in FlowJo, and then import the resulting workspace object into R using the flowWorkspace package. This has the advantage of allowing cytometerists to work in a familiar way while still enabling the use of cutting-edge cytometry tools. However, this approach lacks all of the other benefits that an R-exclusive cytometry package would allow. In particular, it is dependent on expensive, closed-source software and does not allow for easy commenting and version control. Nevertheless, manual gating remains sufficiently difficult in R as to make this the more common method.
flowGate was developed to fill in this missing ability for manual, GUI-based gating in R, finally enabling a familiar cytometry analysis pipeline completely within R. This vignette will demonstrate the flowGate function within the context of a complete cytometry analysis pipeline and is geared toward a researcher who has never used R for flow cytometry analysis at all.
flowGate is now submitted to the Bioconductor project, so installing it should be pretty straightforward. Note that it’s still possible to install the unstable devel version from github (see instructions below), but for general use, the Bioconductor version is the better choice.
If you’d prefer to install the devel version from GitHub, please complete steps 1 and 2 above, then proceed here instead:
if(!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
if(!requireNamespace("flowCore", quietly = TRUE)){
BiocManager::install("flowCore")
}
if(!requireNamespace("flowWorkspace", quietly = TRUE)){
BiocManager::install("flowWorkspace")
}
if(!requireNamespace("ggcyto", quietly = TRUE)){
BiocManager::install("ggcyto")
}
Once you have the BioConductor dependencies installed, you should be able to run the following code to download and install flowGate.
# install devtools if you don't already have it
if(!requireNamespace("devtools", quietly = TRUE)){
install.packages("devtools")
}
devtools::install_github("NKInstinct/flowGate")
Again, if you aren’t sure which version to use, please go with the Bioconductor version.
Before you can start drawing gates on your data, you need to read it into R and compensate it . If this seems a little complex, don’t worry—it can be a bit much to wrap your head around the first time you do it, but once you know the options you would like to use, you can write them all into a single function and then use that for automatic data import and transformation in the future. We’ll cover preparing just such a function at the end of this section in case you aren’t sure how to do that.
The example flow data used in this package comes from the GvHD data included in flowCore, the base package for flow cytometry data analysis in R. The GvHD data that comes with flowCore is already stored in a flowSet, so for this example, you can find the first three samples from GvHD bundled with flowGate as individual .FCS files, which we will turn into a flowSet here.
library(flowGate)
#> Loading required package: flowWorkspace
#> As part of improvements to flowWorkspace, some behavior of
#> GatingSet objects has changed. For details, please read the section
#> titled "The cytoframe and cytoset classes" in the package vignette:
#>
#> vignette("flowWorkspace-Introduction", "flowWorkspace")
#> Loading required package: ggcyto
#> Loading required package: ggplot2
#> Loading required package: flowCore
#> Loading required package: ncdfFlow
#> Loading required package: BH
library(flowCore)
path_to_fcs <- system.file("extdata", package = "flowGate")
The system.file()
command above is needed to get at the
.FCS files that have shipped with flowGate. However, when you are ready
to read your own flow data into R, you can provide a simple path to the
directory you are interested in. For example, if your target directory
is something called “Flow Data” in your current directory, you could
simply pass path_to_fcs <- "./Flow Data/"
and that would
do the same thing.
Running this command first finds all of the files at the location
specified by path_to_fcs and checks if any of them ends in .FCS (that’s
the pattern = ".FCS$"
part of the code). All of them that
satisfy the pattern are loaded into a flowSet called fs. Note that there
are a lot of customization options here—have a look at the flowCore
vignettes if you want to change something about this default
behaviour.
If you have a lot of flow data, you might not want to load it all into a flowSet following the above instructions. This is because the whole flowSet we just created exists in RAM, and if it is extremely large, it might cause problems depending on how much RAM you have available to you. Instead, you can use the ncdfFlow package to directly access the data on your hard drive. This will cause all operations on the data to be slightly slower, but will not consume an enormous chunk of system RAM. It also has some benefits for keeping all your data and analyses together, which we’ll touch on when we talk about saving gated cytometry data. For those reasons, I tend to prefer this NCDF approach for cytometry analysis, but it doesn’t really matter which one you use, especially when just starting out.
FlowSets are excellent containers for holding flow data, but they cannot store information about gating very well. The flowWorkspace package introduces a new, similar data structure called a GatingSet that holds both the flowSet we just made and all gating information about it. Thankfully, once we have made a flowSet, it is very easy to prepare a GatingSet.
So far we have created a flowSet and then turned it into a GatingSet. This GatingSet is a container that can hold many different kinds of information. Currently, it has the raw expression values recorded off of the cytometer and experimental metadata. Both of these were contained in the FCS file, and so were loaded into the flowSet and then brought over to the GatingSet. Next, we are going to add a compensation matrix to this container, so that the data are properly compensated. For these specific example data, the compensation matrix is stored in an external file which we will import and apply to the GatingSet. There are other ways to compensate, which we will mention below.
path_to_comp <- system.file("extdata",
"compdata",
"compmatrix",
package = "flowCore")
comp_matrix <- read.table(path_to_comp,
header = TRUE,
skip = 2,
check.names = FALSE)
comp <- compensation(comp_matrix)
gs <- compensate(gs, comp)
Although the above example uses an externally-stored compensation matrix, a common use-case will be that you have acquired flow data for which you performed instrument-level compensation. In this case, the compensation matrix is saved directly in the .FCS file upon export, and you can read that into the comp object instead of loading an external file.
There are several places in a .FCS file where a compensation matrix
can be stored. My Fortessa X20 stores it in the $SPIL slot of a .FCS,
but other cytometers may behave differently. To find out where yours is,
you need to first call spillover()
on one of the samples in
your flowSet (not the GatingSet!), and then look at the
output. One of the slots that gets returned to you will look like a
compensation matrix—take note of which one that is and then store it in
a variable called comp.
Here is an example. Note that since the .FCS files used in this example do not have an acquisition-defined compensation, trying to run this code using these data will fail.
# Not run for the purposes of the vignette
# Find out which slot the compensation data are in.
spillover(some_new_fs[[3]])
# You need to select one of the samples contained in the flowSet. I chose the
# third one here ([[3]]), but that is completely arbitrary. The should all have
# the exact same compensation matrix.
# This command should output a list of several objects. One of them should look
# like a compensation matrix. If we were running this command on data from my
# Fortessa, it would be the first object in the list (the $SPIL slot), but look
# at your data and see which object you want to work with. Once you know which
# object is your compensation matrix, proceed.
comp_matrix <- spillover(some_new_fs[[3]])[[1]]
# Note that I have selected the first object in the list, which corresponds to
# where my acquisition-defined matrices are stored. If yours is in a different
# list object, put that object's number in place of the [[1]] above.
# From here, it is exactly like before:
comp <- compensation(comp_matrix)
some_new_gs <- compensate(some_new_gs, comp)
It is also possible, using the flowCore package, to automatically create a compensation matrix from single colour control samples. Exactly how to do this is beyond the scope of this vignette, so you are encouraged to look into the flowCore vignettes for further instructions.
As mentioned above, all of that seems like a lot of work just to import the flow data into R. However, a lot of the complexity of this import step comes from not knowing exactly how your specific experiment is set up. Once you know that, you can write all of this into a single function that holds all of your defaults, and then you can just call that function to import your data. If we were to do that with the above data import, it would look something like this:
import_gs <- function(path, pattern, compensation_matrix){
fs <- read.flowSet(path = path,
pattern = pattern,
full.names = TRUE)
gs <- GatingSet(fs)
comp <- compensation(compensation_matrix)
gs <- compensate(gs, comp)
return(gs)
}
Now that we have defined this function, we can do all of the above steps in a couple of lines of code:
#Setup the paths to your data as before
path_to_fcs <- system.file("extdata",
package = "flowGate")
path_to_comp<- system.file("extdata",
"compdata",
"compmatrix",
package = "flowCore")
comp_matrix <- read.table(path_to_comp,
header = TRUE,
skip = 2,
check.names = FALSE)
#Then run the import function we just created
gs <- import_gs(path = path_to_fcs,
pattern = ".FCS$",
compensation_matrix = comp_matrix)
This is a pretty basic function, and there’s a lot more we could do to make it more useful for other experiments with slightly different conditions, but it’s a good start for now and hopefully demonstrates that once you have the hang of it, importing cytometry data into R is neither difficult nor time consuming.
If you are used to working with flow data in R, you might be wondering why we haven’t applied transforms (biex, arcsinh, etc) to the fluorescent channels of these data yet. flowGate is designed so that you can apply biexponential scaling to your flow data directly when gating, rather than at this import step, meaning you can set the biex parameters yourself while looking at the data. Note that in this way, the data are transformed at the plotting level rather than the data level. We’ll talk a little more about that later.
However: if you have very large flow files, you might notice some odd behaviour when trying to draw polygon gates in particular, where your data seems to move around the plot. The gate that you’re drawing will move with them, so you can kind of “roll with it” if you like, but if you find this unworkable, you should apply your transformations first before trying to draw gates on your data. Because flowGate is written to allow on-the-fly changes to almost all aspects of your plot, R has to re-calculate the transformation every time you change anything else in the plot, which can cause this artifact when working with large amounts of data.
The flowCore and flowWorkspace vignettes go into great detail about
how to apply transformations (start with the flowWorkspace one since
it’s designed for GatingSet data), and if you just want a decent
transform without any fuss, have a look at the
estimateLogicle
function in flowWorkspace.
Now that we have a compensated and transformed GatingSet object holding all of our flow cytometry data, it is time to start gating through it. If you were working on your own data, you would probably be able to jump right in knowing what parameters are in your dataset, but since this is an example set, it is helpful to have a quick look at the channel names contained in the data.
There are two kinds of names for each channel in this GatingSet
object: the detector name (such as “FL1-H”) and, if specified, the
marker name (such as “CD15 FITC”). We can access the first kind of name
with colnames()
like we did above, and the second kind with
markernames()
:
colnames(gs)
#> [1] "FSC-H" "SSC-H" "FL1-H" "FL2-H" "FL3-H" "FL2-A" "FL4-H" "Time"
markernames(gs)
#> FL1-H FL2-H FL3-H FL4-H
#> "CD15 FITC" "CD45 PE" "CD14 PerCP" "CD33 APC"
Note that not every detector name has a corresponding marker name. Thankfully, flowGate can handle either kind of name interchangeably, so it’s not hard to use whichever is more appropriate for the situation.
As with most cytometry experiments, the first thing we are going to look at is cells, as defined by their forward and side scatter. Ironically, because this vignette is non-interactive, you will have to do some of the legwork here yourself to draw your gates. I will annotate this to help you follow along, but your best bet is to run this code while reading to see how it works.
#> [1] 2
When you first run the gs_gate_interactive command, a window like this should appear
The sidebar on the left lets you decide what kind of gate you want to draw, and the main window in the middle shows a plot of your data that you can interact with. You can also set the number of bins with a slider to the left - more bins means the data will be plotted with higher resolution (more, smaller dots).
To draw a gate, the first thing you need to do is pick what kind of gate you want to draw. It is very important that you pick the kind of gate you want to draw first, and then draw it second. Doing it the other way tends to cause errors.
To select your leukocytes, switch the gating mode over to polygon by clicking on the polygon radio button
Once you have clicked on the kind of gate you want, you can proceed to draw it. Polygon gates (like this one) are drawn by clicking multiple times to trace a polygon. Other gates are drawn differently (Rectangle and Span gates are drawn by clicking and dragging a rectangle, and Quadrant gates are drawn by clicking once where the four quadrants meet).
Go ahead and draw a polygon gate on your data.
If you think you’ve mis-clicked and want to start over, just hit the “Reset” button on the top left and then start drawing your polygon again. Once you are happy with it, click “Done” to close the window and apply the gate to your whole GatingSet.
When you are finished gating, R will return a list containing a few items that help you to re-create the gate you just drew. The first is the actual gate definition as a flowCore gate object. This will give you the filterID, the coordinates, gate type, and so on, and can be passed to other cytometry packages or added to other GatingSets like any gate. The second is the number of bins you chose to display the data as: this is useful if you want to exactly re-create the plot you gated on later. Finally, the third item in the list is the scaling parameters used in the gating. In this case, we didn’t turn on any scaling since we were looking at linear data, so it should return “unused”, but in a moment, we’ll look at what that means more directly.
If something goes wrong when you are drawing your gates (such as if you draw a polygon gate when you still have “rectangle gate” selected), the shiny app can hang. If you exit out of the shiny window without clicking on “done” first, or if you do click “done” but get some warnings or errors, it’s a good idea to make sure that the shiny app isn’t still running in the background. Have a look at your R Console window and see if there is a stop sign.
If that button is there, it means that the shiny app is still running and you should stop it before proceeding. Trying to do anything else in R while the shiny app runs in the background can cause all kinds of mysterious errors.
Now that we have drawn a leukocytes gate, it is a good idea to have a
look at the plot and see that it looks the way we want it to. There are
a number of ways to plot flow cytometry data in R—my favourite is with
the ggcyto package, which is automatically installed with flowGate for
visualization purposes. Since we only want to peek at the data right now
to make sure our gate looks right, we can use the easy-but-rigid
autoplot()
function, like so:
autoplot(gs[[1]], gate = "Leukocytes")
#> Coordinate system already present. Adding new coordinate system, which will
#> replace the existing one.
The important thing here is that you have a gate drawn on a hex plot with a percentage in the middle of it—don’t worry if it doesn’t look exactly like the one here, and don’t worry if the plot doesn’t look the way you want it to for publication. We’ll cover how to make very nice plots at the end.
Common Mistake: if, when you run autoplot, you don’t
see a gate but you do see a big “0%” sitting roughly where your gate
should be, that means you didn’t switch the window to polygon gate
before drawing your polygon, and the program is trying to make a
rectangle gate out of the very first point you clicked for your polygon
(hence an invisibly-small rectangle with 0% of the events in it). Again,
we’re working on making this mistake harder to make, but for now, just
redo the gate by re-running gs_gate_interactive()
again.
Just make sure you add regate = TRUE
to the command so
flowGate knows to delete this Leukocytes gate before trying to add
another Leukocytes gate to root.
Now that we have a Leukocytes gate drawn, we can drill down into them and gate on the other markers in our sample. The next likely gate will be to take all of the CD45+ cells for further analysis. For illustrative purposes, let’s gate this one with a 1-D span gate on a histogram.
#> [1] 3
Note that this time, unlike previously, we specify the dimensions
(dims
) we want to work with (CD45), and also the
subset
we want to look at (the “parent gate”, in this case
Leukocytes). Running gs_gate_interactive()
with these
arguments will draw a histogram instead of a dot plot, but otherwise the
window will look as before.
Switch the gating mode over to “Span”
Then draw your gate. For span and rectangle gates, you draw them by clicking and dragging a rectangle on the plot. In fact, the only difference between span and rectangle gates is that span only considers the horizontal dimensions, while rectangle considers both. So for this gate, you can draw the rectangle as tall as you want, since span gates don’t care about the vertical dimensions of the rectangle. This can be useful for drawing the gate exactly where you want it, since you can use the vertical dimensions to help line it up with your histogram peaks.
As before, when you are happy with the gate, click Done to close the window and apply the gate. Unlike with polygon and quadrant gates, you don’t need to click Reset if you want to adjust the rectangle on your graph—you can just adjust or redraw it as many times as you like (note: nothing bad will happen if you click Reset when drawing rectangles, so don’t worry if you do).
Now, as before, we can have a peek at the gate. This time, however, we’re going to use the more expanded ggcyto command to start to get familiar with it. ggcyto uses the same grammar of graphics that ggplot uses, so if you are already familiar with ggplot, this should be very straightforward. If not, a full introduction to the grammar of graphics is beyond the scope of this vignette, but look at the code below and see if you can follow what is happening.
The idea behind the grammar of graphics is that you can draw any
graph by first specifying the data the graph comes from, and then
specifying the kind of image you want to make from those data. In this
case, the first line of code tells ggcyto that we want to make a graph
based on the first sample contained in gs (gs[[1]]
), and
that we want to look at the CD45 dimension of these data. In the grammar
of graphics, this is called an aesthetic, hence “aes”.
After this line specifying what sort of data we want to look at, the
next three apply types of images to the data. The first,
geom_density()
, draws a density plot of the CD45 aesthetic
in our data. The second, geom_gate("CD45")
, adds the gate
named “CD45” to our data. The third, geom_stats()
, adds all
relevant stats (in this case the percent parent) to any gates drawn on
the graph.
Using ggcyto()
instead of autoplot()
is a
little more cumbersome to type, but provides much more customization to
the resulting graph, and is well worth learning how to use if you are
regularly going to use R and flowGate for flow analysis. For that
reason, the rest of this vignette will use ggcyto()
calls
to generate graphs, so we can get more familiar with what they look
like.
The last gate to add to this plot is a quadrant gate on CD33 and
CD15. As before, we’ll call gs_gate_interactive()
with the
appropriate arguments, then switch our gating mode to quadrant, and then
click exactly once, where you want the center of the quadrant gate to
be. As with the polygon gate, if you mis-click, you can click on Reset
to reset your selection and then try again.
Note this time, however, the fact that we haven’t transformed any fluorescent channels is really causing problems. Fortunately, you can click on the “use FlowJo Biex” checkbox to enable re-scaling this plot on a flowJo-style biexponential scale.
This enables a number of extra sliders that allow you to set maximum values, biexponential widths, and extra positive and negative decades in real time. Adjust these sliders until your plot looks like something you can confidently draw a quadrant gate on.
#> [1] 4 5 6 7
As before, we can check the gate by plotting it with ggcyto. This
time, however, we’ll need to pass a flowJo biexponential scale to the
plot as well, so it will look like the one above. This is done with the
scale_x_flowjo_biex
and scale_y_flowjo_biex
commands. Note that you can use the output from gs_gate_interactive to
copy the exact biex scaling params you just used.
ggcyto(gs[[1]], aes("CD33", "CD15")) +
geom_hex(bins = 128) +
geom_gate("CD33 CD15") +
scale_x_flowjo_biexp(maxValue = 252245, widthBasis = -29, pos = 4) +
scale_y_flowjo_biexp(maxValue = 250000, widthBasis = -12, pos = 5) +
geom_stats()
But wait! Running this command gives an error: the gate “CD33 CD15” isn’t found. It’s always a good idea to check your spelling when you see errors like this, but we can confirm that we did definitely just make a quadrant gate called “CD33 CD15” and apply it to gs, so it should be in there like any others.
You may already have a hunch of what’s going on here, but to check for sure, it’s a good idea to ask gs for the names of all stored gates
gs_get_pop_paths(gs)
#> [1] "root"
#> [2] "/Leukocytes"
#> [3] "/Leukocytes/CD45"
#> [4] "/Leukocytes/CD45/CD33 APC-CD15 FITC+"
#> [5] "/Leukocytes/CD45/CD33 APC+CD15 FITC+"
#> [6] "/Leukocytes/CD45/CD33 APC+CD15 FITC-"
#> [7] "/Leukocytes/CD45/CD33 APC-CD15 FITC-"
As you can see above, when you draw a quadrant gate, it actually puts four new gates on the plot (one for each quadrant). Since all four of these can’t be named “CD33 CD15”, the quadrant gate ignores the filterId you give it and makes up unique names for the four gates based on the marker names involved in the plot. So to plot this quadrant gate, we need to specify all four of these gates.
ggcyto(gs[[1]], aes("CD33", "CD15")) +
geom_hex(bins = 128) +
scale_x_flowjo_biexp(maxValue = 252245, widthBasis = -29, pos = 4) +
scale_y_flowjo_biexp(maxValue = 250000, widthBasis = -12, pos = 5) +
geom_gate(c("CD33 APC-CD15 FITC+",
"CD33 APC+CD15 FITC+",
"CD33 APC+CD15 FITC-",
"CD33 APC-CD15 FITC-")) +
geom_hline(yintercept = 80.2426) +
geom_vline(xintercept = 97.42192) +
geom_stats()
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_path()`).
#> `geom_path()`: Each group consists of only one observation.
#> ℹ Do you need to adjust the group aesthetic?
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_path()`).
#> `geom_path()`: Each group consists of only one observation.
#> ℹ Do you need to adjust the group aesthetic?
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_path()`).
#> `geom_path()`: Each group consists of only one observation.
#> ℹ Do you need to adjust the group aesthetic?
(Note also that there is a slight bug in ggcyto where adding a
flowjo_biex scale to the graph makes the quad gates not appear. Note
that the gates still exist and still must be added to the plot in order
to show their stats, but the gates themselves aren’t drawn as they had
been previously. As a work-around, you can add a horizontal and a
vertical line with geom_hline()
and
geom_vline()
, and can get the exact coordinates of their
intercepts from the output of the gs_gate_interactive command’s gate
definition).
If you are comfortable with the stringr package, you can also specify this a little more efficiently by first selecting all of the population paths that contain the word “CD33” and then passing this list to geom_gate:
quadgates <- gs_get_pop_paths(gs)[stringr::str_detect(gs_get_pop_paths(gs),
"CD33")]
ggcyto(gs[[1]], aes("CD33", "CD15")) +
geom_hex(bins = 128) +
scale_x_flowjo_biexp(maxValue = 252245, widthBasis = -29, pos = 4) +
scale_y_flowjo_biexp(maxValue = 250000, widthBasis = -12, pos = 5) +
geom_gate(quadgates) +
geom_hline(yintercept = 80.2426) +
geom_vline(xintercept = 97.42192) +
geom_stats()
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_path()`).
#> `geom_path()`: Each group consists of only one observation.
#> ℹ Do you need to adjust the group aesthetic?
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_path()`).
#> `geom_path()`: Each group consists of only one observation.
#> ℹ Do you need to adjust the group aesthetic?
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_path()`).
#> `geom_path()`: Each group consists of only one observation.
#> ℹ Do you need to adjust the group aesthetic?
Again, this isn’t that important, so if stringr is unfamiliar to you, don’t worry about it yet and come back to this idea when you’re more comfortable with working with strings.
Once you have your gates drawn to your satisfaction, the last thing
to do is to save your GatingSet object so that you don’t need to re-draw
your gates when you come back to them. One detail about the
flowWorkspace package that we haven’t covered yet is that GatingSet
objects are very different from most R objects, and so saving your
GatingSet using normal R proceedures will fail (i.e. if you try to save
it as a .Rds it won’t load properly). If you want to understand what is
going on under the hood, have a look at the flowWorkspace vignette. To
save this object, you need the save_gs()
function from
flowWorkspace. This saves your GatingSet object as a directory that you
can then load in with load_gs()
. Although not necessary, I
like to end this directory name with “.gs” to remind myself that the
whole directory is the GatingSet object, so it’s helpful to think of it
more like a single file than a directory.
One very important note here—in some versions of flowWorkspace, the
load_gs()
command is very sensitive to the way file paths
are specified in a system, and in particular fails when you try to use
it in a Windows environment. Wrapping your filepath with
file.path()
from base R will solve this problem and make it
work across any OS.
It’s also worth mentioning that another benefit of the NCDF style of flow cytometry data that we mentioned all the way back in data import is that the NCDF data itself gets saved inside this GatingSet, which means that the whole thing (data, gates, compensation, etc) are stored in your GatingSet.gs directory. This is another reason that I like using the NCDF style of import, but as long as you don’t move the .FCS files that are making up your GatingSet, you shouldn’t run into any problems if you load the GatingSet through the conventional workflow (just like in FlowJo—don’t move your FCS files around after starting to analyze them!).
At this point, we know how to import flow data into R and interactively draw gates on it. Below, we’ll talk about how to nicely plot and export the gated flow data for a complete cytometry analysis solution. However, before we get to that, there is an important second function included in flowGate that we need to talk about, which will let you work through a gating strategy much more efficiently than above.
You may have noticed reading through the above section that gating on flow data using gs_gate_interactive requires a lot of typing. This is by design—one of the motivating forces behind developing flowGate was a desire to make cytometry data analysis easier to document and reproduce. Therefore, we designed gs_gate_interactive to do as little “invisible” work as possible, and make the user explicitly state all of the different components that made up their gating strategy.
Nevertheless, working with gs_gate_interactive quickly becomes
cumbersome when you have a moderately complex gating strategy you’d like
to apply to one or more GatingSets. To get around this problem while
still insisting on leaving a written record of how you gated your
samples, flowGate comes with a helper function called
gs_apply_gating_strategy()
. If you are familiar with the
purrr package, this function is essentially just a wrapper around pwalk
that lets you apply a pre-defined gating strategy to a GatingSet without
re-typing the commands each time. However, the purrr package is a
level-up in complexity when you are first learning R, so instead of
making you go learn how to think functionally, we wrote
gs_apply_gating_strategy to let you leverage this powerful package for
flow analysis right away.
To make use of gs_apply_gating_strategy, you first need to create a
gating strategy in a format the function can recognize. This strategy
must be a tibble, a special kind of dataframe implemented in the
tidyverse family of packages (which includes purrr), where each column
name is one of the parameters you would normally define in
gs_gate_interactive, and each row is a new gate you are going to draw.
You can make this tibble any way you want, but we’ve found that the
easiest approach is using the tribble()
function from the
tibble package, which lets you define a tibble as if you were actually
writing it down on a piece of paper. To demonstrate, we’ll re-create the
above gating strategy here.
strategy <- tibble::tribble(
~filterId, ~dims, ~subset, ~bins,
"Leukocytes", list("FSC-H", "SSC-H"), "root", 256,
"CD45", list("CD45"), "Leukocytes", 256,
"CD33 CD15", list("CD33", "CD15"), "CD45", 64
)
When using the tribble()
function, you define column
names with a “~” and then the name (unquoted), and then fill in each row
of data with whatever is supposed to be there. As usual in R, whitespace
doesn’t really matter much, so you are free to lay out the tribble
exactly as a table to make it nice and human readable. In the above
example, this will result in a four-column tibble, with a column for
filterId
, dims
, subset
, and
bins
, which are the four arguments we used above when
gating through our data. Any arguments not specified here will use their
defaults, or can be set outside the strategy, which we’ll talk about in
a second.
Once the gating strategy is defined, applying it to your gating set is extremely easy. Just call the following:
That single command will now go through your GatingSet object and apply the gating strategy line-by-line, automatically popping up the interactive window each time it moves to a new line and letting you draw the gate you want. This way, you don’t need to write out a full call to gs_gate_interactive for each new gate you want to apply to your GatingSet, but you still get a record of the gating strategy you used and each parameter contained in it.
As mentioned above, any parameters not included in the gating
strategy will use their defaults from gs_gate_interactive. If you don’t
want to use the default behaviour, you could just add a new column to
the gating strategy. However, this will get annoying if that new column
doesn’t change ever. For example, if we wanted to run the above gating
strategy, but use bins = 64
for all the gates, it would be
annoying to have to include that in the strategy. Instead, we can add it
to the call to gs_apply_gating_strategy
directly:
Now that we have gated on our samples, the last step is to have a look at the different flow samples and pick some example plots to show others, as well as extract statistics like percent of parent populations and MFI. Again, the flowCore, ggcyto, and flowWorkspace packages are very feature-rich in this department, so what we will show below is only the tip of the iceberg. However, this should cover many of the conventional cytometry use cases and let you do complete, basic cytometry analysis using only R.
Imagine we wanted to show the overall proportions of leukocytes,
based on the very first gate we drew. Since there’s only three samples
in the GatingSet, the best way to do this is just plot three dotplots
showing each sample’s FSC-H and SSC-H, gated with our polygon gate.
Fortunately, we already know how to do this. Remember how above, each
time we plotted something, we specified that we only wanted the first
sample in gs with gs[[1]]
? All we need to do is not specify
anything and ggcyto will plot all three graphs.
These graphs do a good job conveying the information we want to show, but they look a little primitive compared to graphs that commercial software produces, so let’s try to use the huge customization possibilities in ggcyto to make them nicer. We’ll change three things: change the gates from bright red to a semi-transparent grey, make the stats overlays slightly more friendly numbers, and increase the bins to make the dots a little less crude. Have a look at the code below and see if you can understand what is happening before reading on.
NB: the ggcyto package insists on “colour” as the only spelling of the word. Trying to set the “color” of your gate will result in hours of fruitless debugging.
ggcyto(gs, aes("FSC-H", "SSC-H")) +
geom_hex(bins = 128) +
geom_gate("Leukocytes", colour = "grey3", alpha = 0.2) +
geom_stats(digits = 1)
That’s starting to look much classier, while still clearly presenting
the data. The last thing I like to do with presentation plots is to
remove the grey in the background. Changing the overall appearance of a
plot is accomplished with the theme_
family of layers to
add to a ggcyto object. There are some default themes, as well as
packages with other nice ones out there. It’s also possible to get
extremely fine control over theme elements with the theme()
layer, but that’s an advanced topic that you should look into the first
time you run into a theme problem you can’t fix with a pre-loaded
one.
In this case, we only care about CD33 and CD15 expression on three
samples, so the graphs above are sufficient to convey all the
information we want. Normally, however, it’s necessary to retrieve some
summary statistics such as percent of parent or MFI from many different
populations and then graph them somehow. flowWorkspace has a number of
functions we can use to access this information. The easiest is
gs_pop_get_counts_fast()
, which returns a basic table of
population counts without any fuss.
gs_pop_get_count_fast(gs)
#> name Population Parent Count
#> <char> <char> <char> <int>
#> 1: s10a01.FCS /Leukocytes root 1320
#> 2: s10a01.FCS /Leukocytes/CD45 /Leukocytes 512
#> 3: s10a01.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC+ /Leukocytes/CD45 46
#> 4: s10a01.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC+ /Leukocytes/CD45 16
#> 5: s10a01.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC- /Leukocytes/CD45 70
#> 6: s10a01.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC- /Leukocytes/CD45 380
#> 7: s10a02.FCS /Leukocytes root 1273
#> 8: s10a02.FCS /Leukocytes/CD45 /Leukocytes 886
#> 9: s10a02.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC+ /Leukocytes/CD45 30
#> 10: s10a02.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC+ /Leukocytes/CD45 1
#> 11: s10a02.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC- /Leukocytes/CD45 24
#> 12: s10a02.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC- /Leukocytes/CD45 831
#> 13: s10a03.FCS /Leukocytes root 2319
#> 14: s10a03.FCS /Leukocytes/CD45 /Leukocytes 1741
#> 15: s10a03.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC+ /Leukocytes/CD45 12
#> 16: s10a03.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC+ /Leukocytes/CD45 0
#> 17: s10a03.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC- /Leukocytes/CD45 1
#> 18: s10a03.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC- /Leukocytes/CD45 1728
#> ParentCount
#> <int>
#> 1: 3420
#> 2: 1320
#> 3: 512
#> 4: 512
#> 5: 512
#> 6: 512
#> 7: 3405
#> 8: 1273
#> 9: 886
#> 10: 886
#> 11: 886
#> 12: 886
#> 13: 3435
#> 14: 2319
#> 15: 1741
#> 16: 1741
#> 17: 1741
#> 18: 1741
However, geting anything useful out of that table will require a bit
more data cleaning. Instead, we can use gs_pop_get_stats()
,
which is a more robust version of gs_pop_get_counts_fast()
.
Here, we can specify whether we want percent (of parent) or count, or
can even specify our own functions to derive a stat. We can also specify
certain gates if we don’t want to look at all the populations contained
within the GatingSet.
gs_pop_get_stats(gs, type = "percent")
#> sample pop percent
#> <char> <char> <num>
#> 1: s10a01.FCS root 1.0000000000
#> 2: s10a01.FCS /Leukocytes 0.3859649123
#> 3: s10a01.FCS /Leukocytes/CD45 0.3878787879
#> 4: s10a01.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC+ 0.0898437500
#> 5: s10a01.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC+ 0.0312500000
#> 6: s10a01.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC- 0.1367187500
#> 7: s10a01.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC- 0.7421875000
#> 8: s10a02.FCS root 1.0000000000
#> 9: s10a02.FCS /Leukocytes 0.3738619677
#> 10: s10a02.FCS /Leukocytes/CD45 0.6959937156
#> 11: s10a02.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC+ 0.0338600451
#> 12: s10a02.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC+ 0.0011286682
#> 13: s10a02.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC- 0.0270880361
#> 14: s10a02.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC- 0.9379232506
#> 15: s10a03.FCS root 1.0000000000
#> 16: s10a03.FCS /Leukocytes 0.6751091703
#> 17: s10a03.FCS /Leukocytes/CD45 0.7507546356
#> 18: s10a03.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC+ 0.0068925905
#> 19: s10a03.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC+ 0.0000000000
#> 20: s10a03.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC- 0.0005743825
#> 21: s10a03.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC- 0.9925330270
#> sample pop percent
gs_pop_get_stats(gs, type = "percent", nodes = "CD45")
#> sample pop percent
#> <char> <char> <num>
#> 1: s10a01.FCS CD45 0.3878788
#> 2: s10a02.FCS CD45 0.6959937
#> 3: s10a03.FCS CD45 0.7507546
Specifying our own function is how we get MFI from this command, although the resulting table is a little different in that it returns the MFI for each channel for each population:
gs_pop_get_stats(gs, type = pop.MFI)
#> sample pop FSC-Height SSC-Height
#> <char> <char> <num> <num>
#> 1: s10a01.FCS root 197.0 145.5
#> 2: s10a01.FCS /Leukocytes 346.0 106.0
#> 3: s10a01.FCS /Leukocytes/CD45 330.5 84.0
#> 4: s10a01.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC+ 426.0 225.0
#> 5: s10a01.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC+ 393.0 131.5
#> 6: s10a01.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC- 334.5 71.5
#> 7: s10a01.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC- 321.0 81.5
#> 8: s10a02.FCS root 155.0 73.0
#> 9: s10a02.FCS /Leukocytes 311.0 88.0
#> 10: s10a02.FCS /Leukocytes/CD45 324.0 86.5
#> 11: s10a02.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC+ 463.0 291.0
#> 12: s10a02.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC+ 680.0 513.0
#> 13: s10a02.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC- 384.5 109.0
#> 14: s10a02.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC- 317.0 85.0
#> 15: s10a03.FCS root 256.0 91.0
#> 16: s10a03.FCS /Leukocytes 307.0 98.0
#> 17: s10a03.FCS /Leukocytes/CD45 309.0 94.0
#> 18: s10a03.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC+ 339.0 111.5
#> 19: s10a03.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC+ NaN NaN
#> 20: s10a03.FCS /Leukocytes/CD45/CD33 APC+CD15 FITC- 228.0 73.0
#> 21: s10a03.FCS /Leukocytes/CD45/CD33 APC-CD15 FITC- 309.0 94.0
#> sample pop FSC-Height SSC-Height
#> CD15 FITC CD45 PE CD14 PerCP CD33 APC Time (51.20 sec.)
#> <num> <num> <num> <num> <num>
#> 1: 13.5996876 30.45058 -9.643422 8.711668 270
#> 2: 5.4676127 144.97336 -16.788250 15.042021 272
#> 3: 9.6229916 785.92584 -94.139992 27.272420 263
#> 4: 331.2456818 841.37839 -82.922741 35.039520 275
#> 5: 318.6315460 964.64413 -120.808823 366.754135 229
#> 6: 0.1012886 775.45840 -93.690350 242.710533 279
#> 7: 9.3366413 777.21405 -94.580257 23.122594 262
#> 8: 2.0787711 50.52770 -4.050773 5.188426 378
#> 9: 4.4729295 615.16132 -79.856590 21.387505 367
#> 10: 4.8510027 710.75040 -91.941307 23.976637 362
#> 11: 237.5526505 1049.94025 -129.168713 62.134062 301
#> 12: 1915.9261475 1457.85132 -258.817078 143.360886 640
#> 13: 1.8730214 689.62305 -91.864277 115.895218 392
#> 14: 4.7642722 698.62616 -91.074135 23.106012 362
#> 15: 0.9790034 853.45288 -115.138008 26.940256 373
#> 16: 0.4265432 1048.86206 -142.170654 32.176792 382
#> 17: 0.6490263 975.08081 -131.160278 29.779650 387
#> 18: 373.4992218 941.59219 -141.635803 30.404140 322
#> 19: NaN NaN NaN NaN NaN
#> 20: -5.9972925 1228.10071 -126.859680 313.914093 56
#> 21: 0.6288088 975.24161 -131.019043 29.771698 387
#> CD15 FITC CD45 PE CD14 PerCP CD33 APC Time (51.20 sec.)
As with any tabular data in R, you can save any of these results
tables as .csv files by storing them in a variable and then calling
write.csv()
around it:
This would let you analyze these data in any software you like, just like you already do with your analysis software of choice. However, R is a powerhouse of analysis and graphing capabilities, and so the best thing to do with these data is to analyze and display them in R. This is a huge subject that goes far beyond the scope of this vignette, but if you’ve never tried R’s graphing and data carpentry capabilities before, I highly recommend Wickham and Grolemund’s R for Data Science (https://r4ds.had.co.nz/), which will give you an excellent introduction to all of these subjects and let you quickly plot graphs from these data without needing any external software.
The authors thank Dr. Meromit Singer and the Partners and HMS computing cores for help getting our lab started in applying bioinformatics methods. The original code for flowGate was based on the choose_cells() function from Monocle3 by the Trapnell lab (https://cole-trapnell-lab.github.io/monocle3/). Andrew Wight was funded by an AAI Intersect fellowship for cross-training immunologists in computational biology & a grant from the Claudia Adams Barr foundation. FlowJo is a registered trademark of Beckton, Dickenson, and Company. Kaluza is a registered trademark of Beckman Coulter Inc.
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] flowGate_1.7.0 ggcyto_1.33.0 ncdfFlow_2.51.0
#> [4] BH_1.84.0-0 flowCore_2.17.0 ggplot2_3.5.1
#> [7] flowWorkspace_4.17.0 rmarkdown_2.28
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 xfun_0.48 bslib_0.8.0
#> [4] Biobase_2.67.0 lattice_0.22-6 vctrs_0.6.5
#> [7] tools_4.4.1 generics_0.1.3 stats4_4.4.1
#> [10] tibble_3.2.1 fansi_1.0.6 highr_0.11
#> [13] pkgconfig_2.0.3 data.table_1.16.2 RColorBrewer_1.1-3
#> [16] S4Vectors_0.43.2 graph_1.83.0 lifecycle_1.0.4
#> [19] compiler_4.4.1 farver_2.1.2 stringr_1.5.1
#> [22] munsell_0.5.1 httpuv_1.6.15 htmltools_0.5.8.1
#> [25] sys_3.4.3 buildtools_1.0.0 sass_0.4.9
#> [28] yaml_2.3.10 pillar_1.9.0 hexbin_1.28.4
#> [31] later_1.3.2 jquerylib_0.1.4 cachem_1.1.0
#> [34] mime_0.12 RProtoBufLib_2.17.0 tidyselect_1.2.1
#> [37] digest_0.6.37 stringi_1.8.4 dplyr_1.1.4
#> [40] maketools_1.3.1 labeling_0.4.3 fastmap_1.2.0
#> [43] grid_4.4.1 colorspace_2.1-1 cli_3.6.3
#> [46] magrittr_2.0.3 XML_3.99-0.17 utf8_1.2.4
#> [49] withr_3.0.2 scales_1.3.0 promises_1.3.0
#> [52] matrixStats_1.4.1 gridExtra_2.3 cytolib_2.17.0
#> [55] shiny_1.9.1 evaluate_1.0.1 knitr_1.48
#> [58] rlang_1.1.4 Rcpp_1.0.13 xtable_1.8-4
#> [61] glue_1.8.0 Rgraphviz_2.49.1 BiocManager_1.30.25
#> [64] BiocGenerics_0.53.0 jsonlite_1.8.9 R6_2.5.1
#> [67] plyr_1.8.9 zlibbioc_1.51.2