CytoML
provides flowjo_to_gatingset
function to parse FlowJo
workspace (xml
or
wsp
file) and FCS files into a self-contained
GatingSet
object, which captures the entire analysis
recorded in flowJo, include compensation
,
transformation
and gating
.
library(CytoML)
dataDir <- system.file("extdata",package="flowWorkspaceData")
wsfile <- list.files(dataDir, pattern="manual.xml",full=TRUE)
ws <- open_flowjo_xml(wsfile)
ws
## File location: /github/workspace/pkglib/flowWorkspaceData/extdata/manual.xml
##
## Groups in Workspace
## Name Num.Samples
## 1 All Samples 45
## 2 B-cell 4
## 3 DC 4
## 4 T-cell 4
## 5 Thelper 4
## 6 Treg 4
Once opened, sample group information can be retrieved
## groupName groupID sampleID
## 60 Thelper 4 30
## 61 Thelper 4 31
## 62 Treg 5 37
## 63 Treg 5 38
## 64 Treg 5 39
## 65 Treg 5 40
And sample information for a given group
## sampleID name count pop.counts
## 1 28 CytoTrol_CytoTrol_1.fcs 136304 23
## 2 29 CytoTrol_CytoTrol_2.fcs 115827 23
## 3 30 CytoTrol_CytoTrol_3.fcs 123170 23
## 4 31 CytoTrol_CytoTrol_4.fcs 114802 23
In majority use cases, only two parameters are required to complete the parsing, i.e.
name
: the group to importpath
: the data path of FCS files.name
parameter can be set to the group name displayed
above through flowJo_workspace
APIs.
name
can also be the numeric index
As shown above, the path
be omitted if fcs files are
located at the same folder as xml file.
Otherwise, path
is set the actual folder where FCS files
are located. The folder can contain sub-folders and the parser will
recursively look up the directory for FCS files (by matching the file
names, keywords, etc)
path
can alternatively be a , which should contain two
columns:‘sampleID’ and ‘file’. It essentially provides hardcoded mapping
between ‘sampleID’ and FCS file (absolute) path to avoid the file system
searching or sample matching process (between the flowJo sample
reference and the FCS files).
However this is rarely needed since auto-searching does pretty accurate and robust matching.
Due to the varieties of FlowJo workspace or FCS file issues, sometime the default setting won’t be sufficient to handle some edge cases, e.g. when the error occurs at specific gate due to the incorrect gate parameters defined in xml , but we want to be able to import the upstream gates that are still useful. Or there is letter case inconsistency for channels used in xml, which will trigger an error by default.
Also there are other features provided by the parser that allow users to speed up the parsing or extract more meta data from either xml or FCS files.
flowjo_to_gatingset
provides more parameters that can be
configured to solve different problems during the parsing. In document
aims to go through these parameters and explore them one by one.
It is possible to only import the gating structure without reading
the FCS data by setting execute
flag to
FALSE
.
## A GatingSet with 4 samples
Gating hierarchy is immediately available
So are the gates
## $CytoTrol_CytoTrol_1.fcs_119531
## Ellipsoid gate 'CD3+' in dimensions <V450-A> and SSC-A
##
## $CytoTrol_CytoTrol_2.fcs_115728
## Ellipsoid gate 'CD3+' in dimensions <V450-A> and SSC-A
##
## $CytoTrol_CytoTrol_3.fcs_113883
## Ellipsoid gate 'CD3+' in dimensions <V450-A> and SSC-A
##
## $CytoTrol_CytoTrol_4.fcs_110898
## Ellipsoid gate 'CD3+' in dimensions <V450-A> and SSC-A
compensations
## $CytoTrol_CytoTrol_1.fcs_119531
## Compensation object 'defaultCompensation':
## B710-A G560-A G780-A R660-A R780-A V450-A V545-A
## B710-A 1.000000 0.0009476 0.071170 0.0362400 0.1800000 0.007104 0.007608
## G560-A 0.115400 1.0000000 0.009097 0.0018360 0.0000000 0.000000 0.000000
## G780-A 0.014280 0.0380000 1.000000 0.0006481 0.1500000 0.000000 0.000000
## R660-A 0.005621 0.0000000 0.006604 1.0000000 0.1786000 0.000000 0.000000
## R780-A 0.000000 0.0000000 0.035340 0.0102100 1.0000000 0.000000 0.000000
## V450-A 0.000000 0.0000000 -0.060000 -0.0400000 0.0000000 1.000000 0.410000
## V545-A 0.002749 0.0000000 0.000000 0.0000000 0.0006963 0.035000 1.000000
transformations
## function (x, deriv = 0)
## {
## deriv <- as.integer(deriv)
## if (deriv < 0 || deriv > 3)
## stop("'deriv' must be between 0 and 3")
## if (deriv > 0) {
## z0 <- double(z$n)
## z[c("y", "b", "c")] <- switch(deriv, list(y = z$b, b = 2 *
## z$c, c = 3 * z$d), list(y = 2 * z$c, b = 6 * z$d,
## c = z0), list(y = 6 * z$d, b = z0, c = z0))
## z[["d"]] <- z0
## }
## res <- stats:::.splinefun(x, z)
## if (deriv > 0 && z$method == 2 && any(ind <- x <= z$x[1L]))
## res[ind] <- ifelse(deriv == 1, z$y[1L], 0)
## res
## }
## <bytecode: 0x55930cf58378>
## <environment: 0x559319121548>
## attr(,"type")
## [1] "biexp"
## attr(,"parameters")
## attr(,"parameters")$channelRange
## [1] 4096
##
## attr(,"parameters")$maxValue
## [1] 261589.9
##
## attr(,"parameters")$neg
## [1] 0
##
## attr(,"parameters")$pos
## [1] 4.5
##
## attr(,"parameters")$widthBasis
## [1] -10
and stats
## sample pop count
## <char> <char> <num>
## 1: CytoTrol_CytoTrol_1.fcs_119531 root 119531
## 2: CytoTrol_CytoTrol_1.fcs_119531 /not debris 91720
## 3: CytoTrol_CytoTrol_1.fcs_119531 /not debris/singlets 87033
## 4: CytoTrol_CytoTrol_1.fcs_119531 /not debris/singlets/CD3+ 54737
## 5: CytoTrol_CytoTrol_1.fcs_119531 /not debris/singlets/CD3+/CD4 34083
## 6: CytoTrol_CytoTrol_1.fcs_119531 /not debris/singlets/CD3+/CD4/38- DR+ 1124
Note that xml
flag needs to be set in order to tell it
to return the stats from xml file. Otherwise it will display the value
computed from FCS file, which will be NA
in this case since
we didn’t load FCS files.
Apparently, it is very fast to only import xml, but data won’t be available for retrieving.
## Error: gate is not parsed!
As shown above, the sample names appear to be the FCS filename appended with some numbers by default
## [1] "CytoTrol_CytoTrol_1.fcs_119531" "CytoTrol_CytoTrol_2.fcs_115728"
## [3] "CytoTrol_CytoTrol_3.fcs_113883" "CytoTrol_CytoTrol_4.fcs_110898"
These numbers are actually the value of $TOT
keyword,
which is the total number of events for each sample. This value is
typically unique for each file thus is used to tag the sample on top of
the existing sample name. This default behavior is recommended so that
samples can be uniquely identified even when duplication of file names
occur, which is not uncommon. e.g. We often see the multiple files are
named as Specimen_XXX
that are located under different
sub-folders.
However, if users decide it is unnecessary in their case and prefer
to the shorter names. It can be removed through
additional.keys
parameter
## [1] "CytoTrol_CytoTrol_1.fcs" "CytoTrol_CytoTrol_2.fcs"
## [3] "CytoTrol_CytoTrol_3.fcs" "CytoTrol_CytoTrol_4.fcs"
Or it can be tagged by some extra keywords value if $TOT
turns out to be not sufficient for the unique ID, which rarely
happens.
gs <- flowjo_to_gatingset(ws, name = 4, execute = FALSE, additional.keys = c("$TOT", "EXPERIMENT NAME"))
sampleNames(gs)
## [1] "CytoTrol_CytoTrol_1.fcs_119531_C2_Tcell"
## [2] "CytoTrol_CytoTrol_2.fcs_115728_C2_Tcell"
## [3] "CytoTrol_CytoTrol_3.fcs_113883_C2_Tcell"
## [4] "CytoTrol_CytoTrol_4.fcs_110898_C2_Tcell"
And we can even include the sample ID
used by flowJo xml
when the filename and other keywords can’t be differentiated between
samples. This can be turned on by additional.sampleID
flag
gs <- flowjo_to_gatingset(ws, name = 4, execute = FALSE, additional.sampleID = TRUE)
sampleNames(gs)
## [1] "CytoTrol_CytoTrol_1.fcs_19_119531" "CytoTrol_CytoTrol_2.fcs_20_115728"
## [3] "CytoTrol_CytoTrol_3.fcs_21_113883" "CytoTrol_CytoTrol_4.fcs_22_110898"
Be default, the pheno data of samples (returned by
pData()
) only contains single column of file name.
## name
## CytoTrol_CytoTrol_1.fcs_19_119531 CytoTrol_CytoTrol_1.fcs
## CytoTrol_CytoTrol_2.fcs_20_115728 CytoTrol_CytoTrol_2.fcs
## CytoTrol_CytoTrol_3.fcs_21_113883 CytoTrol_CytoTrol_3.fcs
## CytoTrol_CytoTrol_4.fcs_22_110898 CytoTrol_CytoTrol_4.fcs
keywords
can be specified to ask the parser to exact the
keywords and attach them to the pdata.
gs <- flowjo_to_gatingset(ws, name = 4, execute = FALSE, additional.keys = NULL, keywords = c("EXPERIMENT NAME", "TUBE NAME"))
pData(gs)
## name EXPERIMENT NAME TUBE NAME
## CytoTrol_CytoTrol_1.fcs CytoTrol_CytoTrol_1.fcs C2_Tcell CytoTrol_1
## CytoTrol_CytoTrol_2.fcs CytoTrol_CytoTrol_2.fcs C2_Tcell CytoTrol_2
## CytoTrol_CytoTrol_3.fcs CytoTrol_CytoTrol_3.fcs C2_Tcell CytoTrol_3
## CytoTrol_CytoTrol_4.fcs CytoTrol_CytoTrol_4.fcs C2_Tcell CytoTrol_4
By default, keywords
are parsed from xml file.
Alternatively it can be read from FCS files by setting
keywords.source = "FCS"
. Obviously, execute
needs to be set to TRUE
in order for this to succeed.
Occasionally, keyword names may not be case consistent across
samples, e.g. tube name
vs TUBE NAME
, which
will prevent the parser from constructing the pData
data
structure properly.
keyword.ignore.case
can be optionally set to
TRUE
in order to relax the rule to make the parser succeed.
But it is recommended to keep it as default (especially when
keywords.source = "FCS"
) and use the dedicated tool
cytoqc
package to standardize FCS
files before
running the parser.
Sometime it is useful to only select a small subset of samples to import to quickly test or review the content of gating tree instead of waiting for the entire data set to be completed, which could take longer time if the total number of samples is big.
subset
argument takes numeric indies, e.g.
gs <- flowjo_to_gatingset(ws, name = 4, execute = FALSE, additional.keys = NULL, subset = 1:2)
sampleNames(gs)
## [1] "CytoTrol_CytoTrol_1.fcs" "CytoTrol_CytoTrol_2.fcs"
or the vector of sample (FCS) names, e.g.
gs <- flowjo_to_gatingset(ws, name = 4, execute = FALSE, additional.keys = NULL, subset = c("CytoTrol_CytoTrol_3.fcs"))
sampleNames(gs)
or an that is similar to the one passed to ‘base::subset’ function to filter a data.frame. Note that the columns referred by the expression must also be explicitly specified in ‘keywords’ argument, which we will cover in the later sections.
e.g.
gs <- flowjo_to_gatingset(ws, name = 4, execute = FALSE, additional.keys = NULL
, subset = `EXPERIMENT NAME` == "C2_Tcell"
, keywords = c("EXPERIMENT NAME")
)
pData(gs)
## name EXPERIMENT NAME
## CytoTrol_CytoTrol_1.fcs CytoTrol_CytoTrol_1.fcs C2_Tcell
## CytoTrol_CytoTrol_2.fcs CytoTrol_CytoTrol_2.fcs C2_Tcell
## CytoTrol_CytoTrol_3.fcs CytoTrol_CytoTrol_3.fcs C2_Tcell
## CytoTrol_CytoTrol_4.fcs CytoTrol_CytoTrol_4.fcs C2_Tcell
It is possible to utilize multiple cpus (cores) to speed up the
parsing where each sample is parsed and gated concurrently. Parallel
parsing is enabled when mc.cores
is set to the value >
1.
However, be careful to avoid setting too many cores, which may end up slowing down the process caused by disk IO because FCS files need to be read from disk and the type of file system also contributes to the performance. e.g. parallel file system provided by HPC usually performs better than the regular personal computer.
Sometime the gating tree could be big and could be slow to import all the gates, e.g.
gs <- flowjo_to_gatingset(ws, name="Samples", subset = "1379326.fcs", execute = FALSE)
nodes <- gs_get_pop_paths(gs)
length(nodes)
plot(gs, "3+")
it contains a lot of boolean gates at the bottom level of the tree, e.g.
If these boolean gates are not important for the analysis, user can choose to skip computing them
gs <- flowjo_to_gatingset(ws, name="Samples", subset = "1379326.fcs", leaf.bool = F)
gs_pop_get_stats(gs)
As shown, these boolean leaf gates are still imported, but not gated or computed (which is why stats are NA for these gates). This will speed up the parsing by avoid calculating significant portion of the tree. And it can be computed later if user decides to do so
This parsing error is due to the incorrect channel name is used by
CD4
gate that is defined in flowJo and FCS files don’t have
that channel
gs <- flowjo_to_gatingset(ws, name = 4, path = dataDir, execute = FALSE)
plot(gs)
gs_pop_get_gate(gs, "CD4")[[1]]
The parser can be told to skip CD4
and all its
descendants so that the rest of gates can still be parsed
gs <- flowjo_to_gatingset(ws, name = 4, path = dataDir, subset = 1, skip_faulty_gate = TRUE)
head(gs_pop_get_stats(gs))
Similar to skipping leaf boolean gates, faulty gates are still imported, but they are not computed.
This parsing error is actually due to the fact that the flowJo workspace doesn’t have any gates defined. We can see that through
By default, parser will ignore and skip any sample that has zero
population
s (consider them as invalid entries in xml). This
is why it gives the message of no samples to parse.
User can choose to import it anyway (so that he can get FCS files to be compensated and transformed by flowJo xml),
This will fail to find the samples.
As the error message indicates, FCS file has the non-conventional
extention .B08
other than .fcs
. Parser can be
configured to search for the customized fcs files.