Title: | Differential discovery in high-dimensional cytometry via high-resolution clustering |
---|---|
Description: | Statistical methods for differential discovery analyses in high-dimensional cytometry data (including flow cytometry, mass cytometry or CyTOF, and oligonucleotide-tagged cytometry), based on a combination of high-resolution clustering and empirical Bayes moderated tests adapted from transcriptomics. |
Authors: | Lukas M. Weber [aut, cre] |
Maintainer: | Lukas M. Weber <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.27.0 |
Built: | 2024-12-31 12:01:04 UTC |
Source: | https://github.com/bioc/diffcyt |
Calculate number of cells per cluster-sample combination
calcCounts(d_se)
calcCounts(d_se)
d_se |
Data object from previous steps, in |
Calculate number of cells per cluster-sample combination (referred to as cluster cell 'counts', 'abundances', or 'frequencies').
The cluster cell counts are required for testing for differential abundance of cell populations, and are also used for weights and filtering when testing for differential states within cell populations.
Results are returned as a new SummarizedExperiment
object, where rows =
clusters, columns = samples, assay
= values (counts). (Note that this structure
differs from the input data object.)
d_counts
: SummarizedExperiment
object, where rows =
clusters, columns = samples, assay
= values (counts).
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate counts d_counts <- calcCounts(d_se)
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate counts d_counts <- calcCounts(d_se)
Calculate cluster medians (median expression for each cluster-sample-marker combination)
calcMedians(d_se)
calcMedians(d_se)
d_se |
Data object from previous steps, in |
Calculate median marker expression for each cluster and sample (i.e. medians for each cluster-sample-marker combination).
The data object is assumed to contain a factor marker_class
in the column
meta-data (see prepareData
), which indicates the protein marker class for
each column of data ("type"
, "state"
, or "none"
).
The cluster medians are required for testing for differential states within cell populations, and for plotting purposes.
Variables id_type_markers
and id_state_markers
are saved in the
metadata
slot of the output object. These can be used to identify the 'cell
type' and 'cell state' markers in the list of assays
in the output
SummarizedExperiment
object, which is useful in later steps of the
'diffcyt' pipeline.
Results are returned as a new SummarizedExperiment
object, where rows =
clusters, columns = samples, sheets (assays
slot) = markers. Note that there is
a separate table of values (assay
) for each marker. The metadata
slot
also contains variables id_type_markers
and id_state_markers
, which can
be used to identify the sets of cell type and cell state markers in the list of
assays
.
d_medians
: SummarizedExperiment
object, where rows =
clusters, columns = samples, sheets (assays
slot) = markers. The
metadata
slot contains variables id_type_markers
and
id_state_markers
, which can be accessed with
metadata(d_medians)$id_type_markers
and
metadata(d_medians)$id_state_markers
.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate medians d_medians <- calcMedians(d_se)
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate medians d_medians <- calcMedians(d_se)
Calculate medians for each cluster-marker combination
calcMediansByClusterMarker(d_se)
calcMediansByClusterMarker(d_se)
d_se |
Data object from previous steps, in |
Calculate median marker expression for each cluster, across all samples (i.e. medians for each cluster-marker combination).
The data object is assumed to contain a factor marker_class
in the column
meta-data (see prepareData
), which indicates the protein marker class for
each column of data ("type"
, "state"
, or "none"
). Cluster medians
are calculated for all markers.
The medians by cluster and marker are required for plotting purposes.
Variables id_type_markers
and id_state_markers
are saved in the
metadata
slot of the output object. These can be used to identify the 'cell
type' and 'cell state' markers in the sequence of markers (columns) in the output
object, which is useful in later steps of the 'diffcyt' pipeline.
Results are returned as a new SummarizedExperiment
object, where rows =
clusters, columns = markers, assay
= values (marker expression values). The
metadata
slot also contains variables id_type_markers
and
id_state_markers
, which can be used to identify the sets of cell type and cell
state markers in the columns.
d_medians_by_cluster_marker
: SummarizedExperiment
object,
where rows = clusters, columns = markers, assay
= values (marker expression
values). The metadata
slot contains variables id_type_markers
and
id_state_markers
, which can be accessed with
metadata(d_medians)$id_type_markers
and
metadata(d_medians)$id_state_markers
.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate medians (by cluster and marker) d_medians_by_cluster_marker <- calcMediansByClusterMarker(d_se)
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate medians (by cluster and marker) d_medians_by_cluster_marker <- calcMediansByClusterMarker(d_se)
Calculate medians for each sample-marker combination
calcMediansBySampleMarker(d_se)
calcMediansBySampleMarker(d_se)
d_se |
Data object from previous steps, in |
Calculate overall median marker expression for each sample (i.e. medians for each sample-marker combination).
The data object is assumed to contain a factor marker_class
in the column
meta-data (see prepareData
), which indicates the protein marker class for
each column of data ("type"
, "state"
, or "none"
). Cluster medians
are calculated for all markers.
The medians by sample and marker are required for plotting purposes.
Variables id_type_markers
and id_state_markers
are saved in the
metadata
slot of the output object. These can be used to identify the 'cell
type' and 'cell state' markers in the sequence of markers (columns) in the output
object, which is useful in later steps of the 'diffcyt' pipeline.
Results are returned as a new SummarizedExperiment
object, where rows =
samples, columns = markers, assay
= values (marker expression values). The
metadata
slot also contains variables id_type_markers
and
id_state_markers
, which can be used to identify the sets of cell type and cell
state markers in the columns.
d_medians_by_sample_marker
: SummarizedExperiment
object,
where rows = samples, columns = markers, assay
= values (marker expression
values). The metadata
slot contains variables id_type_markers
and
id_state_markers
, which can be accessed with
metadata(d_medians)$id_type_markers
and
metadata(d_medians)$id_state_markers
.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate medians (by sample and marker) d_medians_by_sample_marker <- calcMediansBySampleMarker(d_se)
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate medians (by sample and marker) d_medians_by_sample_marker <- calcMediansBySampleMarker(d_se)
Create contrast matrix for differential testing
createContrast(contrast)
createContrast(contrast)
contrast |
Vector defining the contrast of interest. This should be a numeric
vector specifying the combination of model parameters to test whether they are equal
to zero. The entries correspond to the columns of the design matrix, or the levels of
the fixed effect terms in the model formula. For example, using a design matrix:
|
Creates a contrast matrix specifying the comparison of interest, in the correct format for the differential testing functions. This can then be provided to the differential testing functions, together with either a design matrix or model formula, and the data object.
The argument contrast
defines the contrast of interest. This should be a numeric
vector specifying the combination of model parameters to test whether they are equal to
zero. In many cases, this will simply be a vector of zeros and a single entry equal to
one; this will test whether a single parameter is equal to zero (e.g. c(0, 1, 0, 0,
0)).
If a design matrix has been used, the entries of contrast
correspond to the
columns of the design matrix; and the length of contrast
equals the number of
columns in the design matrix. If a model formula has been used, the entries correspond
to the levels of the fixed effect terms; and the length equals the number of levels of
the fixed effect terms.
The contrast matrix is formatted as a matrix with a single column containing the contrast of interest. To perform tests for multiple contrasts, run this function and the corresponding differential testing function multiple times.
contrast
: Returns a contrast matrix containing the contrast of interest,
formatted as a matrix with a single column.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Example: contrast matrix createContrast(c(0, 1, 0, 0, 0))
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Example: contrast matrix createContrast(c(0, 1, 0, 0, 0))
Create design matrix for model fitting
createDesignMatrix(experiment_info, cols_design = NULL)
createDesignMatrix(experiment_info, cols_design = NULL)
experiment_info |
|
cols_design |
Argument specifying the columns of |
Creates a design matrix specifying the models to be fitted. (Alternatively,
createFormula
can be used to generate a model formula instead of a design
matrix.)
The design matrix can then be provided to the differential testing functions, together with the data object and contrast matrix.
The experiment_info
input (which was also previously provided to
prepareData
) should be a data frame containing all factors and covariates
of interest. For example, depending on the experimental design, this may include the
following columns:
group IDs (e.g. groups for differential testing)
block IDs (e.g. patient IDs in a paired design)
batch IDs (batch effects)
continuous covariates
The argument cols_design
specifies which columns in experiment_info
to
include in the design matrix. (For example, there may be an additional column of sample
IDs, which should not be included.) This can be provided as a character vector of
column names, a numeric vector of column indices, or a logical vector. By default, all
columns are included.
Columns of indicator variables (e.g. group IDs, block IDs, and batch IDs) in
experiment_info
must be formatted as factors (otherwise they will be treated as
numeric values). The indicator columns will be expanded into the design matrix format.
The names for each parameter are taken from the column names of experiment_info
.
All factors provided here will be included as fixed effect terms in the design matrix.
Alternatively, to use random effects for some factors (e.g. for block IDs), see
createFormula
; or, depending on the method used, provide them directly to
the differential testing function (testDA_voom
and
testDS_limma
).
design
: Returns a design matrix (numeric matrix), with one row per
sample, and one column per model parameter.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Example: simple design matrix experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) createDesignMatrix(experiment_info, cols_design = "group_id") # Example: more complex design matrix: patient IDs and batch IDs experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:8)), group_id = factor(rep(paste0("group", 1:2), each = 4)), patient_id = factor(rep(paste0("patient", 1:4), 2)), batch_id = factor(rep(paste0("batch", 1:2), 4)), stringsAsFactors = FALSE ) createDesignMatrix(experiment_info, cols_design = c("group_id", "patient_id", "batch_id")) # Example: more complex design matrix: continuous covariate experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), age = c(52, 35, 71, 60), stringsAsFactors = FALSE ) createDesignMatrix(experiment_info, cols_design = c("group_id", "age"))
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Example: simple design matrix experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) createDesignMatrix(experiment_info, cols_design = "group_id") # Example: more complex design matrix: patient IDs and batch IDs experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:8)), group_id = factor(rep(paste0("group", 1:2), each = 4)), patient_id = factor(rep(paste0("patient", 1:4), 2)), batch_id = factor(rep(paste0("batch", 1:2), 4)), stringsAsFactors = FALSE ) createDesignMatrix(experiment_info, cols_design = c("group_id", "patient_id", "batch_id")) # Example: more complex design matrix: continuous covariate experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), age = c(52, 35, 71, 60), stringsAsFactors = FALSE ) createDesignMatrix(experiment_info, cols_design = c("group_id", "age"))
Create model formula and corresponding data frame of variables for model fitting
createFormula(experiment_info, cols_fixed = NULL, cols_random = NULL)
createFormula(experiment_info, cols_fixed = NULL, cols_random = NULL)
experiment_info |
|
cols_fixed |
Argument specifying columns of |
cols_random |
Argument specifying columns of |
Creates a model formula and corresponding data frame of variables specifying the models
to be fitted. (Alternatively, createDesignMatrix
can be used to generate
a design matrix instead of a model formula.)
The output is a list containing the model formula and corresponding data frame of variables (one column per formula term). These can then be provided to differential testing functions that require a model formula, together with the main data object and contrast matrix.
The experiment_info
input (which was also previously provided to
prepareData
) should be a data frame containing all factors and covariates
of interest. For example, depending on the experimental design, this may include the
following columns:
group IDs (e.g. groups for differential testing)
block IDs (e.g. patient IDs in a paired design; these may be included as either fixed effect or random effects)
batch IDs (batch effects)
continuous covariates
sample IDs (e.g. to include random intercept terms for each sample, to account for overdispersion typically seen in high-dimensional cytometry data; this is known as an 'observation-level random effect' (OLRE); see see Nowicka et al., 2017, F1000Research for more details)
The arguments cols_fixed
and cols_random
specify the columns in
experiment_info
to include as fixed effect terms and random intercept terms
respectively. These can be provided as character vectors of column names, numeric
vectors of column indices, or logical vectors. The names for each formula term are
taken from the column names of experiment_info
.
Note that for some methods, random effect terms (e.g. for block IDs) must be provided
directly to the differential testing function instead (testDA_voom
and
testDS_limma
).
If there are no random effect terms, it will usually be simpler to use a design matrix
instead of a model formula; see createDesignMatrix
.
formula
: Returns a list with three elements:
formula
: model formula
data
: data frame of variables corresponding to the model formula
random_terms
: TRUE if model formula contains any random effect terms
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Example: model formula experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:8)), group_id = factor(rep(paste0("group", 1:2), each = 4)), patient_id = factor(rep(paste0("patient", 1:4), 2)), stringsAsFactors = FALSE ) createFormula(experiment_info, cols_fixed = "group_id", cols_random = c("sample_id", "patient_id"))
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Example: model formula experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:8)), group_id = factor(rep(paste0("group", 1:2), each = 4)), patient_id = factor(rep(paste0("patient", 1:4), 2)), stringsAsFactors = FALSE ) createFormula(experiment_info, cols_fixed = "group_id", cols_random = c("sample_id", "patient_id"))
Wrapper function to run complete 'diffcyt' pipeline
diffcyt( d_input, experiment_info = NULL, marker_info = NULL, design = NULL, formula = NULL, contrast, analysis_type = c("DA", "DS"), method_DA = c("diffcyt-DA-edgeR", "diffcyt-DA-voom", "diffcyt-DA-GLMM"), method_DS = c("diffcyt-DS-limma", "diffcyt-DS-LMM"), markers_to_test = NULL, clustering_to_use = NULL, cols_to_include = NULL, subsampling = FALSE, n_sub = NULL, seed_sub = NULL, transform = TRUE, cofactor = 5, cols_clustering = NULL, xdim = 10, ydim = 10, meta_clustering = FALSE, meta_k = 40, seed_clustering = NULL, min_cells = 3, min_samples = NULL, normalize = FALSE, norm_factors = "TMM", trend_method = "none", block_id = NULL, trend = TRUE, weights = TRUE, plot = FALSE, path = ".", verbose = TRUE )
diffcyt( d_input, experiment_info = NULL, marker_info = NULL, design = NULL, formula = NULL, contrast, analysis_type = c("DA", "DS"), method_DA = c("diffcyt-DA-edgeR", "diffcyt-DA-voom", "diffcyt-DA-GLMM"), method_DS = c("diffcyt-DS-limma", "diffcyt-DS-LMM"), markers_to_test = NULL, clustering_to_use = NULL, cols_to_include = NULL, subsampling = FALSE, n_sub = NULL, seed_sub = NULL, transform = TRUE, cofactor = 5, cols_clustering = NULL, xdim = 10, ydim = 10, meta_clustering = FALSE, meta_k = 40, seed_clustering = NULL, min_cells = 3, min_samples = NULL, normalize = FALSE, norm_factors = "TMM", trend_method = "none", block_id = NULL, trend = TRUE, weights = TRUE, plot = FALSE, path = ".", verbose = TRUE )
d_input |
Input data. Must be either: (i) a |
experiment_info |
|
marker_info |
|
design |
Design matrix, created with |
formula |
Model formula object, created with |
contrast |
Contrast matrix, created with |
analysis_type |
Type of differential analysis to perform: differential abundance
(DA) of cell populations, or differential states (DS) within cell populations.
Options are |
method_DA |
Method to use for calculating differential abundance (DA) tests.
Options are |
method_DS |
Method to use for calculating differential state (DS) tests. Options
are |
markers_to_test |
(Optional) Logical vector specifying which markers to test for
differential expression (from the set of markers stored in the |
clustering_to_use |
(Optional) Column name indicating which set of cluster labels
to use for differential testing, when input data are provided as a |
cols_to_include |
Logical vector indicating which columns to include from the
input data. Default = all columns. See |
subsampling |
Whether to use random subsampling to select an equal number of cells
from each sample. Default = FALSE. See |
n_sub |
Number of cells to select from each sample by random subsampling, if
|
seed_sub |
Random seed for subsampling. Set to an integer value to generate
reproducible results. Default = |
transform |
Whether to apply 'arcsinh' transform. This may be set to FALSE if the
input data has already been transformed. Default = TRUE. See
|
cofactor |
Cofactor parameter for 'arcsinh' transform. Default = 5, which is
appropriate for mass cytometry (CyTOF) data. For fluorescence flow cytometry, we
recommend cofactor = 150 instead. See |
cols_clustering |
Columns to use for clustering. Default = |
xdim |
Horizontal length of grid for self-organizing map for FlowSOM clustering
(number of clusters = |
ydim |
Vertical length of grid for self-organizing map for FlowSOM clustering
(number of clusters = |
meta_clustering |
Whether to include FlowSOM 'meta-clustering' step. Default =
|
meta_k |
Number of meta-clusters for FlowSOM, if |
seed_clustering |
Random seed for clustering. Set to an integer value to generate
reproducible results. Default = |
min_cells |
Filtering parameter. Default = 3. Clusters are kept for differential
testing if they have at least |
min_samples |
Filtering parameter. Default = |
normalize |
Whether to include optional normalization factors to adjust for
composition effects. Default = FALSE. See |
norm_factors |
Normalization factors to use, if |
trend_method |
Method for estimating dispersion trend; passed to function
|
block_id |
(Optional) Vector or factor of block IDs (e.g. patient IDs) for paired
experimental designs, to be included as random effects (for method |
trend |
(Optional) Whether to fit a mean-variance trend when calculating moderated
tests with function |
weights |
(Optional) Whether to include precision weights (for method
|
plot |
Whether to save diagnostic plots (for method |
path |
Path for diagnostic plots, if |
verbose |
Whether to print status messages during each step of the pipeline. Default = TRUE. |
This wrapper function runs the complete 'diffcyt' analysis pipeline, by calling the functions for the individual steps in the pipeline in the correct sequence.
For more details about the functions for the individual steps, see the package vignette and the function help pages. Running the individual functions may provide additional flexibility, especially for complex analyses.
The input data can be provided as a flowSet
or a list of
flowFrames
, DataFrames
, data.frames
, or matrices
(one flowFrame
or list item per sample). Alternatively, it is also possible to
provide the input as a daFrame
object from the CATALYST
Bioconductor
package (Chevrier, Crowell, Zanotelli et al., 2018). This can be useful when initial
exploratory analyses and clustering have been performed using CATALYST
; the
daFrame
object from CATALYST
(containing cluster labels in the
rowData
) can then be provided directly to the diffcyt
functions for
differential testing.
Minimum required arguments when not providing a flowSet
or list of
flowFrames
, DataFrames
, data.frames
, or matrices:
d_input
experiment_info
marker_info
either design
or formula
(depending on the differential testing
method used)
contrast
analysis_type
Minimum required arguments when providing a CATALYST
daFrame
object:
d_input
either design
or formula
(depending on the differential testing
method used)
contrast
analysis_type
Returns a list containing the results object res
, as well as the data
objects d_se
, d_counts
, d_medians
,
d_medians_by_cluster_marker
, and d_medians_by_sample_marker
. (If a
CATALYST
daFrame
object was used as input, the output list contains
objects res
, d_counts
, and d_medians
.) The structure of
res
depends on the differential testing method used. See
testDA_edgeR
, testDA_voom
, testDA_GLMM
,
testDS_limma
, or testDS_LMM
.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential abundance (DA) signal ix_DA <- 801:900 ix_cols_type <- 1:10 d_input[[3]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) d_input[[4]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) # Add differential states (DS) signal ix_DS <- 901:1000 ix_cols_DS <- 19:20 d_input[[1]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[2]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[3]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) d_input[[4]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Create design matrix design <- createDesignMatrix(experiment_info, cols_design = "group_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential abundance (DA) of clusters (using default method 'diffcyt-DA-edgeR') out_DA <- diffcyt(d_input, experiment_info, marker_info, design = design, contrast = contrast, analysis_type = "DA", method_DA = "diffcyt-DA-edgeR", seed_clustering = 123, verbose = FALSE) # Test for differential states (DS) within clusters (using default method 'diffcyt-DS-limma') out_DS <- diffcyt(d_input, experiment_info, marker_info, design = design, contrast = contrast, analysis_type = "DS", method_DS = "diffcyt-DS-limma", seed_clustering = 123, verbose = FALSE) # Display results for top DA clusters topTable(out_DA, format_vals = TRUE) # Display results for top DS cluster-marker combinations topTable(out_DS, format_vals = TRUE) # Plot heatmap for DA tests plotHeatmap(out_DA, analysis_type = "DA") # Plot heatmap for DS tests plotHeatmap(out_DS, analysis_type = "DS")
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential abundance (DA) signal ix_DA <- 801:900 ix_cols_type <- 1:10 d_input[[3]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) d_input[[4]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) # Add differential states (DS) signal ix_DS <- 901:1000 ix_cols_DS <- 19:20 d_input[[1]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[2]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[3]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) d_input[[4]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Create design matrix design <- createDesignMatrix(experiment_info, cols_design = "group_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential abundance (DA) of clusters (using default method 'diffcyt-DA-edgeR') out_DA <- diffcyt(d_input, experiment_info, marker_info, design = design, contrast = contrast, analysis_type = "DA", method_DA = "diffcyt-DA-edgeR", seed_clustering = 123, verbose = FALSE) # Test for differential states (DS) within clusters (using default method 'diffcyt-DS-limma') out_DS <- diffcyt(d_input, experiment_info, marker_info, design = design, contrast = contrast, analysis_type = "DS", method_DS = "diffcyt-DS-limma", seed_clustering = 123, verbose = FALSE) # Display results for top DA clusters topTable(out_DA, format_vals = TRUE) # Display results for top DS cluster-marker combinations topTable(out_DS, format_vals = TRUE) # Plot heatmap for DA tests plotHeatmap(out_DA, analysis_type = "DA") # Plot heatmap for DS tests plotHeatmap(out_DS, analysis_type = "DS")
Generate high-resolution clusters for diffcyt
analysis
generateClusters( d_se, cols_clustering = NULL, xdim = 10, ydim = 10, meta_clustering = FALSE, meta_k = 40, seed_clustering = NULL, ... )
generateClusters( d_se, cols_clustering = NULL, xdim = 10, ydim = 10, meta_clustering = FALSE, meta_k = 40, seed_clustering = NULL, ... )
d_se |
Transformed input data, from |
cols_clustering |
Columns to use for clustering. Default = |
xdim |
Horizontal length of grid for self-organizing map for FlowSOM clustering
(number of clusters = |
ydim |
Vertical length of grid for self-organizing map for FlowSOM clustering
(number of clusters = |
meta_clustering |
Whether to include FlowSOM 'meta-clustering' step. Default =
|
meta_k |
Number of meta-clusters for FlowSOM, if |
seed_clustering |
Random seed for clustering. Set to an integer value to generate
reproducible results. Default = |
... |
Other parameters to pass to the FlowSOM clustering algorithm (through the
function |
Performs clustering to group cells into clusters representing cell populations or subsets, which can then be further analyzed by testing for differential abundance of cell populations or differential states within cell populations. By default, we use high-resolution clustering or over-clustering (i.e. we generate a large number of small clusters), which helps ensure that rare populations are adequately separated from larger ones.
Data is assumed to be in the form of a SummarizedExperiment
object
generated with prepareData
and transformed with
transformData
.
The input data object d_se
is assumed to contain a vector marker_class
in
the column meta-data. This vector indicates the marker class for each column
("type"
, "state"
, or "none"
). By default, clustering is performed
using the 'cell type' markers only. For example, in immunological data, this may be the
lineage markers. The choice of cell type markers is an important design choice for the
user, and will depend on the underlying experimental design and research questions. It
may be made based on prior biological knowledge or using data-driven methods. For an
example of a data-driven method of marker ranking and selection, see Nowicka et al.
(2017), F1000Research.
By default, we use the FlowSOM
clustering algorithm (Van Gassen et al.
2015, Cytometry Part A, available from Bioconductor) to generate the clusters.
We previously showed that FlowSOM
gives very good clustering performance for
high-dimensional cytometry data, for both major and rare cell populations, and is
extremely fast (Weber and Robinson, 2016, Cytometry Part A).
The clustering is run at high resolution to give a large number of small clusters (i.e.
over-clustering). This is done by running only the initial 'self-organizing map'
clustering step in the FlowSOM
algorithm, i.e. without the final
'meta-clustering' step. This ensures that small or rare populations are adequately
separated from larger populations, which is crucial for detecting differential signals
for extremely rare populations.
The minimum spanning tree (MST) object from BuildMST
is stored in the
experiment metadata
slot in the SummarizedExperiment
object
d_se
, and can be accessed with metadata(d_se)$MST
.
d_se
: Returns the SummarizedExperiment
input object, with
cluster labels for each cell stored in an additional column of row meta-data. Row
meta-data can be accessed with rowData
. The minimum spanning tree (MST)
object is also stored in the metadata
slot, and can be accessed with
metadata(d_se)$MST
.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se)
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se)
Plot heatmap showing top clusters or cluster-marker combinations
plotHeatmap( out = NULL, analysis_type = c("DA", "DS"), top_n = 20, threshold = 0.1, res = NULL, d_se = NULL, d_counts = NULL, d_medians = NULL, d_medians_by_cluster_marker = NULL, sample_order = NULL )
plotHeatmap( out = NULL, analysis_type = c("DA", "DS"), top_n = 20, threshold = 0.1, res = NULL, d_se = NULL, d_counts = NULL, d_medians = NULL, d_medians_by_cluster_marker = NULL, sample_order = NULL )
out |
Output object from |
analysis_type |
Whether to plot heatmap for differential abundance (DA) or differential state (DS) test results. |
top_n |
Number of top clusters (DA tests) or cluster-marker combinations (DS tests) to display. Default = 20. |
threshold |
Threshold for significant adjusted p-values. Default = 0.1. |
res |
Object containing differential test results. Alternatively, the combined
output object from the wrapper function |
d_se |
Data object. Alternatively, the combined output object from the wrapper
function |
d_counts |
Data object. Alternatively, the combined output object from the wrapper
function |
d_medians |
Data object. (Required for DS tests only.) Alternatively, the combined
output object from the wrapper function |
d_medians_by_cluster_marker |
Data object. Alternatively, the combined output
object from the wrapper function |
sample_order |
(Optional) Custom ordering for samples (columns) in right-hand panel of heatmap. (This is useful when the default ordering does not group samples by condition; e.g. samples are ordered alphabetically by sample IDs instead.) |
Display heatmap to visualize results for the top (most highly significant) detected clusters or cluster-marker combinations.
For DA tests, the heatmap consists of the following panels:
median (arcsinh-transformed) expression (across all samples) for 'cell type' markers
cluster abundances by sample
row annotation indicating significant detected clusters
For DS tests, the heatmap consists of:
median (arcsinh-transformed) expression (across all samples) for 'cell type' markers
median (arcsinh-transformed) expression (across all samples) for 'cell state' markers
median (arcsinh-transformed) expression (by sample) for 'cell state' markers for the top cluster-marker combinations
row annotation indicating significant detected cluster-marker combinations
Heatmaps are generated using the ComplexHeatmap
package (Gu et al., 2016), and
color scales are generated using the circlize
package (Gu et al., 2014). Both
packages are available from Bioconductor.
Displays a heatmap.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential abundance (DA) signal ix_DA <- 801:900 ix_cols_type <- 1:10 d_input[[3]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) d_input[[4]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) # Add differential states (DS) signal ix_DS <- 901:1000 ix_cols_DS <- 19:20 d_input[[1]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[2]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[3]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) d_input[[4]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Create design matrix design <- createDesignMatrix(experiment_info, cols_design = "group_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential abundance (DA) of clusters (using default method 'diffcyt-DA-edgeR') out_DA <- diffcyt(d_input, experiment_info, marker_info, design = design, contrast = contrast, analysis_type = "DA", method_DA = "diffcyt-DA-edgeR", seed_clustering = 123, verbose = FALSE) # Test for differential states (DS) within clusters (using default method 'diffcyt-DS-limma') out_DS <- diffcyt(d_input, experiment_info, marker_info, design = design, contrast = contrast, analysis_type = "DS", method_DS = "diffcyt-DS-limma", seed_clustering = 123, verbose = FALSE) # Display results for top DA clusters topTable(out_DA, format_vals = TRUE) # Display results for top DS cluster-marker combinations topTable(out_DS, format_vals = TRUE) # Plot heatmap for DA tests plotHeatmap(out_DA, analysis_type = "DA") # Plot heatmap for DS tests plotHeatmap(out_DS, analysis_type = "DS")
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential abundance (DA) signal ix_DA <- 801:900 ix_cols_type <- 1:10 d_input[[3]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) d_input[[4]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) # Add differential states (DS) signal ix_DS <- 901:1000 ix_cols_DS <- 19:20 d_input[[1]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[2]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[3]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) d_input[[4]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Create design matrix design <- createDesignMatrix(experiment_info, cols_design = "group_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential abundance (DA) of clusters (using default method 'diffcyt-DA-edgeR') out_DA <- diffcyt(d_input, experiment_info, marker_info, design = design, contrast = contrast, analysis_type = "DA", method_DA = "diffcyt-DA-edgeR", seed_clustering = 123, verbose = FALSE) # Test for differential states (DS) within clusters (using default method 'diffcyt-DS-limma') out_DS <- diffcyt(d_input, experiment_info, marker_info, design = design, contrast = contrast, analysis_type = "DS", method_DS = "diffcyt-DS-limma", seed_clustering = 123, verbose = FALSE) # Display results for top DA clusters topTable(out_DA, format_vals = TRUE) # Display results for top DS cluster-marker combinations topTable(out_DS, format_vals = TRUE) # Plot heatmap for DA tests plotHeatmap(out_DA, analysis_type = "DA") # Plot heatmap for DS tests plotHeatmap(out_DS, analysis_type = "DS")
Prepare data into format for diffcyt
pipeline
prepareData( d_input, experiment_info, marker_info, cols_to_include = NULL, subsampling = FALSE, n_sub = NULL, seed_sub = NULL )
prepareData( d_input, experiment_info, marker_info, cols_to_include = NULL, subsampling = FALSE, n_sub = NULL, seed_sub = NULL )
d_input |
Input data. Must be a |
experiment_info |
|
marker_info |
|
cols_to_include |
Logical vector indicating which columns to include from the input data. Default = all columns. |
subsampling |
Whether to use random subsampling to select an equal number of cells from each sample. Default = FALSE. |
n_sub |
Number of cells to select from each sample by random subsampling, if
|
seed_sub |
Random seed for subsampling. Set to an integer value to generate
reproducible results. Default = |
Functions in the diffcyt
analysis pipeline assume that input data is provided as
a SummarizedExperiment
object, which contains a single matrix of
expression values, together with row and column meta-data.
This function accepts a flowSet
or a list of flowFrames
,
data.frames
, or matrices as input (i.e. one flowFrame
or list item per
sample). The function then concatenates the data tables into a single matrix of values,
and adds row and column meta-data.
Row meta-data should be provided as a data frame named experiment_info
,
containing columns of relevant experiment information, such as sample IDs and group
IDs (for each sample). This must contain at least a column named sample_id
.
Column meta-data should be provided as a data frame named marker_info
,
containing the following columns of marker information. The column names must be as
shown.
marker_name
: protein marker names (and column names for any other columns)
marker_class
: factor indicating the protein marker class for each column
of data (usually, entries will be either "type"
, "state"
, or
"none"
)
The split into 'cell type' and 'cell state' markers is crucial for the analysis. Cell type markers are used to define cell populations by clustering, and to test for differential abundance of cell populations; while cell state markers are used to test for differential states within cell populations.
The optional argument cols_to_include
allows unnecessary columns (e.g. any
columns not containing protein markers) to be discarded.
Optionally, random subsampling can be used to select an equal number of cells from each
sample (subsampling = TRUE
). This can be useful when there are large differences
in total numbers of cells per sample, since it ensures that samples with relatively
large numbers of cells do not dominate the clustering. However, subsampling should
generally not be used when rare cell populations are of interest, due to the
significant loss of information if cells from the rare population are discarded.
d_se
: Returns data as a SummarizedExperiment
containing a
single matrix of data (expression values) in the assays
slot, together with
row meta-data (experiment information) and column meta-data (marker information). The
metadata
slot also contains the experiment_info
data frame, and a
vector n_cells
of the number of cells per sample; these can be accessed with
metadata(d_se)$experiment_info
and metadata(d_se)$n_cells
.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info)
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info)
Calculate tests for differential abundance of cell populations using method 'diffcyt-DA-edgeR'
testDA_edgeR( d_counts, design, contrast, trend_method = "none", min_cells = 3, min_samples = NULL, normalize = FALSE, norm_factors = "TMM" )
testDA_edgeR( d_counts, design, contrast, trend_method = "none", min_cells = 3, min_samples = NULL, normalize = FALSE, norm_factors = "TMM" )
d_counts |
|
design |
Design matrix, created with |
contrast |
Contrast matrix, created with |
trend_method |
Method for estimating dispersion trend; passed to function
|
min_cells |
Filtering parameter. Default = 3. Clusters are kept for differential
testing if they have at least |
min_samples |
Filtering parameter. Default = |
normalize |
Whether to include optional normalization factors to adjust for composition effects (see details). Default = FALSE. |
norm_factors |
Normalization factors to use, if |
Calculates tests for differential abundance of clusters, using functions from the
edgeR
package.
This method uses the edgeR
package (Robinson et al. 2010,
Bioinformatics; McCarthy et al. 2012, Nucleic Acids Research) to fit
models and calculate moderated tests at the cluster level. Moderated tests improve
statistical power by sharing information on variability (i.e. variance across samples
for a single cluster) between clusters. By default, we use the option
trend.method = "none"
to calculate dispersions, since the dispersion-mean
relationship typically does not resemble RNA-sequencing data; see edgeR
User's
Guide. The statistical methods implemented in the edgeR
package were originally
designed for the analysis of gene expression data such as RNA-sequencing counts. Here,
we apply these methods to cluster cell counts.
The experimental design must be specified using a design matrix, which can be created
with createDesignMatrix
. Flexible experimental designs are possible,
including blocking (e.g. paired designs), batch effects, and continuous covariates. See
createDesignMatrix
for more details.
The contrast matrix specifying the contrast of interest can be created with
createContrast
. See createContrast
for more details.
Filtering: Clusters are kept for differential testing if they have at least
min_cells
cells in at least min_samples
samples. This removes clusters
with very low cell counts across conditions, to improve power.
Normalization for the total number of cells per sample (library sizes) and total number
of cells per cluster is automatically performed by the edgeR
functions. Optional
normalization factors can also be included to adjust for composition effects in the
cluster cell counts per sample. For example, in an extreme case, if several additional
clusters are present in only one condition, while all other clusters are approximately
equally abundant between conditions, then simply normalizing by the total number of
cells per sample will create a false positive differential abundance signal for the
non-differential clusters. (For a detailed explanation in the context of RNA sequencing
gene expression, see Robinson and Oshlack, 2010.) Normalization factors can be
calculated automatically using the 'trimmed mean of M-values' (TMM) method (Robinson
and Oshlack, 2010), implemented in the edgeR
package (see also the edgeR
User's Guide for details). Alternatively, a vector of values can be provided (the
values should multiply to 1).
Returns a new SummarizedExperiment
object, with differential test
results stored in the rowData
slot. Results include raw p-values
(p_val
) and adjusted p-values (p_adj
) from the edgeR
moderated
tests, which can be used to rank clusters by evidence for differential abundance.
Additional output columns from the edgeR
tests are also included. The results
can be accessed with the rowData
accessor function.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential abundance (DA) signal ix_DA <- 801:900 ix_cols_type <- 1:10 d_input[[3]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) d_input[[4]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate counts d_counts <- calcCounts(d_se) # Create design matrix design <- createDesignMatrix(experiment_info, cols_design = "group_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential abundance (DA) of clusters res_DA <- testDA_edgeR(d_counts, design, contrast)
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential abundance (DA) signal ix_DA <- 801:900 ix_cols_type <- 1:10 d_input[[3]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) d_input[[4]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate counts d_counts <- calcCounts(d_se) # Create design matrix design <- createDesignMatrix(experiment_info, cols_design = "group_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential abundance (DA) of clusters res_DA <- testDA_edgeR(d_counts, design, contrast)
Calculate tests for differential abundance of cell populations using method 'diffcyt-DA-GLMM'
testDA_GLMM( d_counts, formula, contrast, min_cells = 3, min_samples = NULL, normalize = FALSE, norm_factors = "TMM" )
testDA_GLMM( d_counts, formula, contrast, min_cells = 3, min_samples = NULL, normalize = FALSE, norm_factors = "TMM" )
d_counts |
|
formula |
Model formula object, created with |
contrast |
Contrast matrix, created with |
min_cells |
Filtering parameter. Default = 3. Clusters are kept for differential
testing if they have at least |
min_samples |
Filtering parameter. Default = |
normalize |
Whether to include optional normalization factors to adjust for composition effects (see details). Default = FALSE. |
norm_factors |
Normalization factors to use, if |
Calculates tests for differential abundance of clusters, using generalized linear mixed models (GLMMs).
This methodology was originally developed and described by Nowicka et al. (2017), F1000Research, and has been modified here to make use of high-resolution clustering to enable investigation of rare cell populations. Note that unlike the original method by Nowicka et al., we do not attempt to manually merge clusters into canonical cell populations. Instead, results are reported at the high-resolution cluster level, and the interpretation of significant differential clusters is left to the user via visualizations such as heatmaps (see the package vignette for an example).
This method fits generalized linear mixed models (GLMMs) for each cluster, and calculates differential tests separately for each cluster. The response variables in the models are the cluster cell counts, which are assumed to follow a binomial distribution. There is one model per cluster. We also include a filtering step to remove clusters with very small numbers of cells, to improve statistical power.
For more details on the statistical methodology, see Nowicka et al. (2017), F1000Research (section 'Differential cell population abundance'.)
The experimental design must be specified using a model formula, which can be created
with createFormula
. Flexible experimental designs are possible, including
blocking (e.g. paired designs), batch effects, and continuous covariates. Blocking
variables can be included as either random intercept terms or fixed effect terms (see
createFormula
). For paired designs, we recommend using random intercept
terms to improve statistical power; see Nowicka et al. (2017), F1000Research for
details. Batch effects and continuous covariates should be included as fixed effects.
In addition, we include random intercept terms for each sample to account for
overdispersion typically seen in high-dimensional cytometry count data. The
sample-level random intercept terms are known as 'observation-level random effects'
(OLREs); see Nowicka et al. (2017), F1000Research for more details.
The contrast matrix specifying the contrast of interest can be created with
createContrast
. See createContrast
for more details.
Filtering: Clusters are kept for differential testing if they have at least
min_cells
cells in at least min_samples
samples. This removes clusters
with very low cell counts across conditions, to improve power.
Normalization: Optional normalization factors can be included to adjust for composition
effects in the cluster cell counts per sample. For example, in an extreme case, if
several additional clusters are present in only one condition, while all other clusters
are approximately equally abundant between conditions, then simply normalizing by the
total number of cells per sample will create a false positive differential abundance
signal for the non-differential clusters. (For a detailed explanation in the context of
RNA sequencing gene expression, see Robinson and Oshlack, 2010.) Normalization factors
can be calculated automatically using the 'trimmed mean of M-values' (TMM) method
(Robinson and Oshlack, 2010), implemented in the edgeR
package (see also the
edgeR
User's Guide for details). Alternatively, a vector of values can be
provided (the values should multiply to 1).
Returns a new SummarizedExperiment
object, with differential test
results stored in the rowData
slot. Results include raw p-values
(p_val
) and adjusted p-values (p_adj
), which can be used to rank
clusters by evidence for differential abundance. The results can be accessed with the
rowData
accessor function.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential abundance (DA) signal ix_DA <- 801:900 ix_cols_type <- 1:10 d_input[[3]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) d_input[[4]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate counts d_counts <- calcCounts(d_se) # Create model formula formula <- createFormula(experiment_info, cols_fixed = "group_id", cols_random = "sample_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential abundance (DA) of clusters res_DA <- testDA_GLMM(d_counts, formula, contrast)
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential abundance (DA) signal ix_DA <- 801:900 ix_cols_type <- 1:10 d_input[[3]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) d_input[[4]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate counts d_counts <- calcCounts(d_se) # Create model formula formula <- createFormula(experiment_info, cols_fixed = "group_id", cols_random = "sample_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential abundance (DA) of clusters res_DA <- testDA_GLMM(d_counts, formula, contrast)
Calculate tests for differential abundance of cell populations using method 'diffcyt-DA-voom'
testDA_voom( d_counts, design, contrast, block_id = NULL, min_cells = 3, min_samples = NULL, normalize = FALSE, norm_factors = "TMM", plot = FALSE, path = "." )
testDA_voom( d_counts, design, contrast, block_id = NULL, min_cells = 3, min_samples = NULL, normalize = FALSE, norm_factors = "TMM", plot = FALSE, path = "." )
d_counts |
|
design |
Design matrix, created with |
contrast |
Contrast matrix, created with |
block_id |
(Optional) Vector or factor of block IDs (e.g. patient IDs) for paired
experimental designs, to be included as random effects. If provided, the block IDs
will be included as random effects using the |
min_cells |
Filtering parameter. Default = 3. Clusters are kept for differential
testing if they have at least |
min_samples |
Filtering parameter. Default = |
normalize |
Whether to include optional normalization factors to adjust for composition effects (see details). Default = FALSE. |
norm_factors |
Normalization factors to use, if |
plot |
Whether to save diagnostic plots for the |
path |
Path for diagnostic plots, if |
Calculates tests for differential abundance of clusters, using functions from the
limma
package and voom
method.
This method uses the limma
package (Ritchie et al. 2015, Nucleic
Acids Research) to fit models and calculate moderated tests at the cluster level.
Moderated tests improve statistical power by sharing information on variability (i.e.
variance across samples for a single cluster) between clusters. Since count data are
often heteroscedastic, we use the voom
method (Law et al. 2014,
Genome Biology) to transform the raw cluster cell counts and estimate
observation-level weights to stabilize the mean-variance relationship. Diagnostic plots
are shown if plot = TRUE
.
The experimental design must be specified using a design matrix, which can be created
with createDesignMatrix
. Flexible experimental designs are possible,
including blocking (e.g. paired designs), batch effects, and continuous covariates. See
createDesignMatrix
for more details.
For paired designs, either fixed effects or random effects can be used. Fixed effects
are simpler, but random effects may improve power in data sets with unbalanced designs
or very large numbers of samples. To use fixed effects, provide the block IDs (e.g.
patient IDs) to createDesignMatrix
. To use random effects, provide the
block_id
argument here instead. This will make use of the limma
duplicateCorrelation
methodology. Note that >2 measures per sample are
not possible in this case (fixed effects should be used instead). Block IDs should not
be included in the design matrix if the limma
duplicateCorrelation
methodology is used.
The contrast matrix specifying the contrast of interest can be created with
createContrast
. See createContrast
for more details.
Filtering: Clusters are kept for differential testing if they have at least
min_cells
cells in at least min_samples
samples. This removes clusters
with very low cell counts across conditions, to improve power.
Normalization for the total number of cells per sample (library sizes) and total number
of cells per cluster is automatically performed by the limma
and voom
functions. Optional normalization factors can also be included to adjust for
composition effects in the cluster cell counts per sample. For example, in an extreme
case, if several additional clusters are present in only one condition, while all other
clusters are approximately equally abundant between conditions, then simply normalizing
by the total number of cells per sample will create a false positive differential
abundance signal for the non-differential clusters. (For a detailed explanation in the
context of RNA sequencing gene expression, see Robinson and Oshlack, 2010.)
Normalization factors can be calculated automatically using the 'trimmed mean of
M-values' (TMM) method (Robinson and Oshlack, 2010), implemented in the edgeR
package (see also the edgeR
User's Guide for details). Alternatively, a vector
of values can be provided (the values should multiply to 1).
Returns a new SummarizedExperiment
object, with differential test
results stored in the rowData
slot. Results include raw p-values
(p_val
) and adjusted p-values (p_adj
) from the limma
moderated
tests, which can be used to rank clusters by evidence for differential abundance.
Additional output columns from the limma
tests are also included. The results
can be accessed with the rowData
accessor function.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential abundance (DA) signal ix_DA <- 801:900 ix_cols_type <- 1:10 d_input[[3]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) d_input[[4]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate counts d_counts <- calcCounts(d_se) # Create design matrix design <- createDesignMatrix(experiment_info, cols_design = "group_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential abundance (DA) of clusters res_DA <- testDA_voom(d_counts, design, contrast)
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential abundance (DA) signal ix_DA <- 801:900 ix_cols_type <- 1:10 d_input[[3]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) d_input[[4]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate counts d_counts <- calcCounts(d_se) # Create design matrix design <- createDesignMatrix(experiment_info, cols_design = "group_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential abundance (DA) of clusters res_DA <- testDA_voom(d_counts, design, contrast)
Calculate tests for differential states within cell populations using method 'diffcyt-DS-limma'
testDS_limma( d_counts, d_medians, design, contrast, block_id = NULL, trend = TRUE, weights = TRUE, markers_to_test = NULL, min_cells = 3, min_samples = NULL, plot = FALSE, path = "." )
testDS_limma( d_counts, d_medians, design, contrast, block_id = NULL, trend = TRUE, weights = TRUE, markers_to_test = NULL, min_cells = 3, min_samples = NULL, plot = FALSE, path = "." )
d_counts |
|
d_medians |
|
design |
Design matrix, created with |
contrast |
Contrast matrix, created with |
block_id |
(Optional) Vector or factor of block IDs (e.g. patient IDs) for paired
experimental designs, to be included as random effects. If provided, the block IDs
will be included as random effects using the |
trend |
(Optional) Whether to fit a mean-variance trend when calculating moderated
tests with function |
weights |
(Optional) Whether to use cluster cell counts as precision weights
(across all samples and clusters); this allows the |
markers_to_test |
(Optional) Logical vector specifying which markers to test for
differential expression (from the set of markers stored in the |
min_cells |
Filtering parameter. Default = 3. Clusters are kept for differential
testing if they have at least |
min_samples |
Filtering parameter. Default = |
plot |
Whether to save diagnostic plot. Default = FALSE. |
path |
Path for diagnostic plot, if |
Calculates tests for differential states within cell populations (i.e. differential expression of cell state markers within clusters). Clusters are defined using cell type markers, and cell states are characterized by the median transformed expression of cell state markers.
This method uses the limma
package (Ritchie et al. 2015, Nucleic
Acids Research) to fit models and calculate moderated tests at the cluster level.
Moderated tests improve statistical power by sharing information on variability (i.e.
variance across samples for a single cluster) between clusters. By default, we provide
option trend = TRUE
to the limma
eBayes
function; this fits
a mean-variance trend when calculating moderated tests, which is also known as the
limma-trend
method (Law et al., 2014; Phipson et al., 2016). Diagnostic plots
are shown if plot = TRUE
.
The experimental design must be specified using a design matrix, which can be created
with createDesignMatrix
. Flexible experimental designs are possible,
including blocking (e.g. paired designs), batch effects, and continuous covariates. See
createDesignMatrix
for more details.
For paired designs, either fixed effects or random effects can be used. Fixed effects
are simpler, but random effects may improve power in data sets with unbalanced designs
or very large numbers of samples. To use fixed effects, provide the block IDs (e.g.
patient IDs) to createDesignMatrix
. To use random effects, provide the
block_id
argument here instead. This will make use of the limma
duplicateCorrelation
methodology. Note that >2 measures per sample are
not possible in this case (fixed effects should be used instead). Block IDs should not
be included in the design matrix if the limma
duplicateCorrelation
methodology is used.
The contrast matrix specifying the contrast of interest can be created with
createContrast
. See createContrast
for more details.
By default, differential tests are performed for all cell state markers (which are
identified with the vector id_state_markers
stored in the meta-data of the
cluster medians input object). The optional argument markers_to_test
allows the
user to specify a different set of markers to test (e.g. to investigate differences for
cell type markers).
Filtering: Clusters are kept for differential testing if they have at least
min_cells
cells in at least min_samples
samples. This removes clusters
with very low cell counts across conditions, to improve power.
Weights: By default, cluster cell counts are used as precision weights (across all
samples and clusters); allowing the limma
model fitting functions to account for
uncertainty due to the total number of cells per sample (library sizes) and total
number of cells per cluster. This option can also be disabled with weights =
FALSE
, if required.
Returns a new SummarizedExperiment
object, where rows =
cluster-marker combinations, and columns = samples. In the rows, clusters are
repeated for each cell state marker (i.e. the sheets or assays
from the
previous d_medians
object are stacked into a single matrix). Differential test
results are stored in the rowData
slot. Results include raw p-values
(p_val
) and adjusted p-values (p_adj
) from the limma
moderated
tests, which can be used to rank cluster-marker combinations by evidence for
differential states within cell populations. Additional output columns from the
limma
tests are also included. The results can be accessed with the
rowData
accessor function.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential states (DS) signal ix_DS <- 901:1000 ix_cols_type <- 1:10 ix_cols_DS <- 19:20 d_input[[1]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[2]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[3]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) d_input[[4]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate counts d_counts <- calcCounts(d_se) # Calculate medians d_medians <- calcMedians(d_se) # Create design matrix design <- createDesignMatrix(experiment_info, cols_design = "group_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential states (DS) within clusters res_DS <- testDS_limma(d_counts, d_medians, design, contrast)
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential states (DS) signal ix_DS <- 901:1000 ix_cols_type <- 1:10 ix_cols_DS <- 19:20 d_input[[1]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[2]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[3]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) d_input[[4]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate counts d_counts <- calcCounts(d_se) # Calculate medians d_medians <- calcMedians(d_se) # Create design matrix design <- createDesignMatrix(experiment_info, cols_design = "group_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential states (DS) within clusters res_DS <- testDS_limma(d_counts, d_medians, design, contrast)
Calculate tests for differential states within cell populations using method 'diffcyt-DS-LMM'
testDS_LMM( d_counts, d_medians, formula, contrast, weights = TRUE, markers_to_test = NULL, min_cells = 3, min_samples = NULL )
testDS_LMM( d_counts, d_medians, formula, contrast, weights = TRUE, markers_to_test = NULL, min_cells = 3, min_samples = NULL )
d_counts |
|
d_medians |
|
formula |
Model formula object, created with |
contrast |
Contrast matrix, created with |
weights |
(Optional) Whether to include precision weights within each model (across samples, i.e. within the model for each cluster); these represent the relative uncertainty in calculating each median value (within each model). Accepts values of TRUE, FALSE, or a numeric vector of custom weights. Default = TRUE, in which case cluster cell counts are used as weights. |
markers_to_test |
(Optional) Logical vector specifying which markers to test for
differential expression (from the set of markers stored in the |
min_cells |
Filtering parameter. Default = 3. Clusters are kept for differential
testing if they have at least |
min_samples |
Filtering parameter. Default = |
Calculates tests for differential states within cell populations (i.e. differential expression of cell state markers within clusters), using linear mixed models (LMMs). Clusters are defined using cell type markers, and cell states are characterized by the median transformed expression of cell state markers.
This methodology was originally developed and described by Nowicka et al. (2017), F1000Research, and has been modified here to make use of high-resolution clustering to enable investigation of rare cell populations. Note that unlike the original method by Nowicka et al., we do not attempt to manually merge clusters into canonical cell populations. Instead, results are reported at the high-resolution cluster level, and the interpretation of significant differential clusters is left to the user via visualizations such as heatmaps (see the package vignette for an example).
This method fits linear mixed models (LMMs) for each cluster-marker combination (cell state markers only), and calculates differential tests separately for each cluster-marker combination. The response variable in each model is the median arcsinh-transformed marker expression of the cell state marker, which is assumed to follow a Gaussian distribution. There is one model per cluster per cell state marker. Within each model, sample-level weights are included (by default) for the number of cells per sample; these weights represent the relative uncertainty in calculating each median value. (Additional uncertainty exists due to variation in the total number of cells per cluster; however, it is not possible to account for this, since there are separate models for each cluster-marker combination.) We also include a filtering step to remove clusters with very small numbers of cells, to improve statistical power.
For more details on the statistical methodology, see Nowicka et al. (2017), F1000Research (section 'Differential analysis of marker expression stratified by cell population'.)
The experimental design must be specified using a model formula, which can be created
with createFormula
. Flexible experimental designs are possible, including
blocking (e.g. paired designs), batch effects, and continuous covariates. Blocking
variables can be included as either random intercept terms or fixed effect terms (see
createFormula
). For paired designs, we recommend using random intercept
terms to improve statistical power; see Nowicka et al. (2017), F1000Research for
details. Batch effects and continuous covariates should be included as fixed effects.
If no random intercept terms are included in the model formula, model fitting is performed using a linear model (LM) instead of a LMM.
The contrast matrix specifying the contrast of interest can be created with
createContrast
. See createContrast
for more details.
By default, differential tests are performed for all cell state markers (which are
identified with the vector id_state_markers
stored in the meta-data of the
cluster medians input object). The optional argument markers_to_test
allows the
user to specify a different set of markers to test (e.g. to investigate differences for
cell type markers).
Filtering: Clusters are kept for differential testing if they have at least
min_cells
cells in at least min_samples
samples. This removes clusters
with very low cell counts across conditions, to improve power.
Weights: By default, cluster cell counts are used as precision weights within each model (across samples only, i.e. within the model for each cluster); these represent the relative uncertainty in calculating each median value (within each model). See above for details.
Returns a new SummarizedExperiment
object, where rows =
cluster-marker combinations, and columns = samples. In the rows, clusters are
repeated for each cell state marker (i.e. the sheets or assays
from the
previous d_medians
object are stacked into a single matrix). Differential test
results are stored in the rowData
slot. Results include raw p-values
(p_val
) and adjusted p-values (p_adj
), which can be used to rank
cluster-marker combinations by evidence for differential states within cell
populations. The results can be accessed with the rowData
accessor
function.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential states (DS) signal ix_DS <- 901:1000 ix_cols_type <- 1:10 ix_cols_DS <- 19:20 d_input[[1]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[2]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[3]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) d_input[[4]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate counts d_counts <- calcCounts(d_se) # Calculate medians d_medians <- calcMedians(d_se) # Create model formula formula <- createFormula(experiment_info, cols_fixed = "group_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential states (DS) within clusters res_DS <- testDS_LMM(d_counts, d_medians, formula, contrast)
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential states (DS) signal ix_DS <- 901:1000 ix_cols_type <- 1:10 ix_cols_DS <- 19:20 d_input[[1]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[2]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[3]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) d_input[[4]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se) # Generate clusters d_se <- generateClusters(d_se) # Calculate counts d_counts <- calcCounts(d_se) # Calculate medians d_medians <- calcMedians(d_se) # Create model formula formula <- createFormula(experiment_info, cols_fixed = "group_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential states (DS) within clusters res_DS <- testDS_LMM(d_counts, d_medians, formula, contrast)
Alias for function 'topTable' (deprecated)
topClusters(...)
topClusters(...)
... |
See arguments for function |
The function topClusters
has been renamed to topTable
, to more
accurately reflect the structure of the results (results are returned for either
clusters or cluster-marker combinations, depending on the type of differential tests
performed).
This alias is provided for backward compatibility. The new function name
topTable
should be used whenever possible.
See topTable
for details.
Show table of results for top (most highly significant) clusters or cluster-marker combinations
topTable( res, d_counts = NULL, d_medians = NULL, order = TRUE, order_by = "p_adj", all = FALSE, top_n = 20, show_counts = FALSE, show_props = FALSE, show_meds = FALSE, show_logFC = FALSE, show_all_cols = FALSE, sort_cols = TRUE, format_vals = FALSE, digits = 3 )
topTable( res, d_counts = NULL, d_medians = NULL, order = TRUE, order_by = "p_adj", all = FALSE, top_n = 20, show_counts = FALSE, show_props = FALSE, show_meds = FALSE, show_logFC = FALSE, show_all_cols = FALSE, sort_cols = TRUE, format_vals = FALSE, digits = 3 )
res |
Output object from either the |
d_counts |
(Optional) |
d_medians |
(Optional) |
order |
Whether to order results by values in column |
order_by |
Name of column to use to order rows by values, if |
all |
Whether to display all clusters or cluster-marker combinations (instead of
top |
top_n |
Number of clusters or cluster-marker combinations to display (if |
show_counts |
Whether to display cluster cell counts by sample (from
|
show_props |
Whether to display cluster cell count proportions by sample
(calculated from |
show_meds |
Whether to display median expression values for each cluster-marker
combination (from |
show_logFC |
Whether to display log fold change (logFC) values. Default = FALSE. |
show_all_cols |
Whether to display all columns from output object (e.g. logFC, logCPM, LR, etc.) Default = FALSE. |
sort_cols |
Whether to sort columns of counts, proportions, and medians; by levels
of factor |
format_vals |
Whether to display rounded values in numeric columns. This improves readability of the summary table, but should not be used when exact numeric values are required for subsequent steps (e.g. plotting). Default = FALSE. |
digits |
Number of significant digits to show, if |
Summary function to display table of results for top (most highly significant) detected clusters or cluster-marker combinations.
The differential testing functions return results in the form of p-values and adjusted
p-values for each cluster (DA tests) or cluster-marker combination (DS tests), which
can be used to rank the clusters or cluster-marker combinations by their evidence for
differential abundance or differential states. The p-values and adjusted p-values are
stored in the rowData
of the output SummarizedExperiment
object
generated by the testing functions.
This function displays a summary table of results. By default, the top_n
clusters or cluster-marker combinations are shown, ordered by adjusted p-values.
Optionally, cluster counts, proportions, and median expression by cluster-marker
combination can also be included. The format_vals
and digits
arguments
can be used to display rounded values to improve readability of the summary table.
Returns a DataFrame
table of results for the top_n
clusters or cluster-marker combinations, ordered by values in column order_by
(default: adjusted p-values). Optionally, cluster counts, proportions, and median
expression by cluster-marker combination are also included.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential abundance (DA) signal ix_DA <- 801:900 ix_cols_type <- 1:10 d_input[[3]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) d_input[[4]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) # Add differential states (DS) signal ix_DS <- 901:1000 ix_cols_DS <- 19:20 d_input[[1]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[2]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[3]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) d_input[[4]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Create design matrix design <- createDesignMatrix(experiment_info, cols_design = "group_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential abundance (DA) of clusters (using default method 'diffcyt-DA-edgeR') out_DA <- diffcyt(d_input, experiment_info, marker_info, design = design, contrast = contrast, analysis_type = "DA", method_DA = "diffcyt-DA-edgeR", seed_clustering = 123, verbose = FALSE) # Test for differential states (DS) within clusters (using default method 'diffcyt-DS-limma') out_DS <- diffcyt(d_input, experiment_info, marker_info, design = design, contrast = contrast, analysis_type = "DS", method_DS = "diffcyt-DS-limma", seed_clustering = 123, verbose = FALSE) # Display results for top DA clusters topTable(out_DA, format_vals = TRUE) # Display results for top DS cluster-marker combinations topTable(out_DS, format_vals = TRUE)
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) # Add differential abundance (DA) signal ix_DA <- 801:900 ix_cols_type <- 1:10 d_input[[3]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) d_input[[4]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10) # Add differential states (DS) signal ix_DS <- 901:1000 ix_cols_DS <- 19:20 d_input[[1]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[2]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10) d_input[[3]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) d_input[[4]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Create design matrix design <- createDesignMatrix(experiment_info, cols_design = "group_id") # Create contrast matrix contrast <- createContrast(c(0, 1)) # Test for differential abundance (DA) of clusters (using default method 'diffcyt-DA-edgeR') out_DA <- diffcyt(d_input, experiment_info, marker_info, design = design, contrast = contrast, analysis_type = "DA", method_DA = "diffcyt-DA-edgeR", seed_clustering = 123, verbose = FALSE) # Test for differential states (DS) within clusters (using default method 'diffcyt-DS-limma') out_DS <- diffcyt(d_input, experiment_info, marker_info, design = design, contrast = contrast, analysis_type = "DS", method_DS = "diffcyt-DS-limma", seed_clustering = 123, verbose = FALSE) # Display results for top DA clusters topTable(out_DA, format_vals = TRUE) # Display results for top DS cluster-marker combinations topTable(out_DS, format_vals = TRUE)
Transform data prior to clustering
transformData(d_se, cofactor = 5)
transformData(d_se, cofactor = 5)
d_se |
Input data. Assumed to be in the form of a
|
cofactor |
Cofactor parameter for 'arcsinh' transform. Default = 5, which is appropriate for mass cytometry (CyTOF) data. For fluorescence flow cytometry, we recommend cofactor = 150 instead. |
Flow and mass cytometry data should be transformed prior to clustering. The raw data follows an approximately log-normal distribution. Transforming with a log (or similar) function brings the data closer to a normal distribution, which improves clustering performance and allows positive and negative populations to be distinguished more clearly.
This function implements an inverse hyperbolic sine ('arcsinh') transform with adjustable 'cofactor' parameter. The arcsinh transform is widely used for CyTOF data. It behaves similarly to a log transform at high values, but is approximately linear near zero; so unlike the log, it can handle zeros or small negative values. The cofactor parameter controls the width of the linear region. Zero values and small negatives occur in CyTOF data when no ions are detected in a given channel (negatives are due to background subtraction and randomization of integer count values, which are performed by default by the CyTOF instrument software).
Recommended values for the cofactor parameter are 5 (mass cytometry, CyTOF) or 150 (fluorescence flow cytometry); see Bendall et al. (2011), Science, Supplementary Figure S2.
The transform should be applied to protein marker columns only. The
SummarizedExperiment
object created in the previous step
(prepareData
) is assumed to contain a factor marker_class
in the
column meta-data, where entries "none"
indicate non-marker columns. (If this is
not available, all columns will be transformed instead.)
d_se
: Data with transform applied to protein marker columns.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se)
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, # see the package vignette. # Function to create random data (one sample) d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) { d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol)) d } # Create random data (without differential signal) set.seed(123) d_input <- list( sample1 = d_random(), sample2 = d_random(), sample3 = d_random(), sample4 = d_random() ) experiment_info <- data.frame( sample_id = factor(paste0("sample", 1:4)), group_id = factor(c("group1", "group1", "group2", "group2")), stringsAsFactors = FALSE ) marker_info <- data.frame( channel_name = paste0("channel", sprintf("%03d", 1:20)), marker_name = paste0("marker", sprintf("%02d", 1:20)), marker_class = factor(c(rep("type", 10), rep("state", 10)), levels = c("type", "state", "none")), stringsAsFactors = FALSE ) # Prepare data d_se <- prepareData(d_input, experiment_info, marker_info) # Transform data d_se <- transformData(d_se)