Title: | distinct: a method for differential analyses via hierarchical permutation tests |
---|---|
Description: | distinct is a statistical method to perform differential testing between two or more groups of distributions; differential testing is performed via hierarchical non-parametric permutation tests on the cumulative distribution functions (cdfs) of each sample. While most methods for differential expression target differences in the mean abundance between conditions, distinct, by comparing full cdfs, identifies, both, differential patterns involving changes in the mean, as well as more subtle variations that do not involve the mean (e.g., unimodal vs. bi-modal distributions with the same mean). distinct is a general and flexible tool: due to its fully non-parametric nature, which makes no assumptions on how the data was generated, it can be applied to a variety of datasets. It is particularly suitable to perform differential state analyses on single cell data (i.e., differential analyses within sub-populations of cells), such as single cell RNA sequencing (scRNA-seq) and high-dimensional flow or mass cytometry (HDCyto) data. To use distinct one needs data from two or more groups of samples (i.e., experimental conditions), with at least 2 samples (i.e., biological replicates) per group. |
Authors: | Simone Tiberi [aut, cre]. |
Maintainer: | Simone Tiberi <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.19.0 |
Built: | 2024-11-02 06:02:55 UTC |
Source: | https://github.com/bioc/distinct |
distinct is a statistical method to perform differential testing between two or more groups of distributions; differential testing is performed via hierarchical non-parametric permutation tests on the cumulative distribution functions (cdfs) of each sample. While most methods for differential expression target differences in the mean abundance between conditions, distinct, by comparing full cdfs, identifies, both, differential patterns involving changes in the mean, as well as more subtle variations that do not involve the mean (e.g., unimodal vs. bi-modal distributions with the same mean). distinct is a general and flexible tool: due to its fully non-parametric nature, which makes no assumptions on how the data was generated, it can be applied to a variety of datasets. It is particularly suitable to perform differential state analyses on single cell data (i.e., differential analyses within sub-populations of cells), such as single cell RNA sequencing (scRNA-seq) and high-dimensional flow or mass cytometry (HDCyto) data. To use distinct one needs data from two or more groups of samples (i.e., experimental conditions), with at least 2 samples (i.e., biological replicates) per group.
Questions relative to distinct should be either written to the Bioconductor support site, tagging the question with 'distinct', or reported as a new issue at BugReports (preferred choice).
To access the vignettes, type: browseVignettes("distinct")
.
Simone Tiberi [aut, cre].
Maintainer: Simone Tiberi <[email protected]>
distinct_test
tests for differential state between two groups of samples.
distinct_test( x, name_assays_expression = "logcounts", name_cluster = "cluster_id", name_sample = "sample_id", design, column_to_test = 2, P_1 = 100, P_2 = 500, P_3 = 2000, P_4 = 10000, N_breaks = 25, min_non_zero_cells = 20, n_cores = 1 )
distinct_test( x, name_assays_expression = "logcounts", name_cluster = "cluster_id", name_sample = "sample_id", design, column_to_test = 2, P_1 = 100, P_2 = 500, P_3 = 2000, P_4 = 10000, N_breaks = 25, min_non_zero_cells = 20, n_cores = 1 )
x |
a |
name_assays_expression |
a character ("logcounts" by default),
indicating the name of the assays(x) element which stores the expression data (i.e., assays(x)$name_assays_expression).
We strongly encourage using normalized data, such as counts per million (CPM) or log2-CPM (e.g., 'logcounts' as created via |
name_cluster |
a character ("cluster_id" by default), indicating the name of the colData(x) element which stores the cluster id of each cell (i.e., colData(x)$name_cluster). |
name_sample |
a character ("sample_id" by default), indicating the name of the colData(x) element which stores the sample id of each cell (i.e., colData(x)$name_sample). |
design |
a |
column_to_test |
indicates the column(s) of the design one wants to test (do not include the intercept). |
P_1 |
the number of permutations to use on all gene-cluster combinations. |
P_2 |
the number of permutations to use, when a (raw) p-value is < 0.1 (500 by default). |
P_3 |
the number of permutations to use, when a (raw) p-value is < 0.01 (2,000 by default). |
P_4 |
the number of permutations to use, when a (raw) p-value is < 0.001 (10,000 by default). In order to obtain a finer ranking for the most significant genes, if computational resources are available, we encourage users to set P_4 = 20,000. |
N_breaks |
the number of breaks at which to evaluate the comulative density function. |
min_non_zero_cells |
the minimum number of non-zero cells (across all samples) in each cluster for a gene to be evaluated. |
n_cores |
the number of cores to parallelize the tasks on (parallelization is at the cluster level: each cluster is parallelized on a thread). |
A data.frame
object.
Columns 'gene' and 'cluster_id' contain the gene and cell-cluster name, while 'p_val', 'p_adj.loc' and 'p_adj.glb' report the raw p-values, locally and globally adjusted p-values, via Benjamini and Hochberg (BH) correction.
In locally adjusted p-values ('p_adj.loc') BH correction is applied in each cluster separately, while in globally adjusted p-values ('p_adj.glb') BH correction is performed to the results from all clusters.
Column 'filtered' indicates whether a gene-cluster result was filtered (if TRUE), or analyzed (if FALSE).
A gene-cluster combination is filtered when fewer than 'min_non_zero_cells' non-zero cells are available.
Filtered results have raw and adjusted p-values equal to 1.
Simone Tiberi [email protected]
plot_cdfs
, plot_densities
, log2_FC
, top_results
# load the input data: data("Kang_subset", package = "distinct") Kang_subset # create the design of the study: samples = Kang_subset@metadata$experiment_info$sample_id group = Kang_subset@metadata$experiment_info$stim design = model.matrix(~group) # rownames of the design must indicate sample ids: rownames(design) = samples design # Note that the sample names in `colData(x)$name_sample` have to be the same ones as those in `rownames(design)`. rownames(design) unique(SingleCellExperiment::colData(Kang_subset)$sample_id) # In order to obtain a finer ranking for the most significant genes, if computational resources are available, we encourage users to increase P_4 (i.e., the number of permutations when a raw p-value is < 0.001) and set P_4 = 20,000 (by default P_4 = 10,000). # The group we would like to test for is in the second column of the design, therefore we will specify: column_to_test = 2 set.seed(61217) res = distinct_test( x = Kang_subset, name_assays_expression = "logcounts", name_cluster = "cell", design = design, column_to_test = 2, min_non_zero_cells = 20, n_cores = 2) # We can optionally add the fold change (FC) and log2-FC between groups: res = log2_FC(res = res, x = Kang_subset, name_assays_expression = "cpm", name_group = "stim", name_cluster = "cell") # Visualize significant results: head(top_results(res)) # Visualize significant results from a specified cluster of cells: top_results(res, cluster = "Dendritic cells") # By default, results from 'top_results' are sorted by (globally) adjusted p-value; # they can also be sorted by log2-FC: top_results(res, cluster = "Dendritic cells", sort_by = "log2FC") # Visualize significant UP-regulated genes only: top_results(res, up_down = "UP", cluster = "Dendritic cells") # Plot density and cdf for gene 'ISG15' in cluster 'Dendritic cells'. plot_densities(x = Kang_subset, gene = "ISG15", cluster = "Dendritic cells", name_assays_expression = "logcounts", name_cluster = "cell", name_sample = "sample_id", name_group = "stim") plot_cdfs(x = Kang_subset, gene = "ISG15", cluster = "Dendritic cells", name_assays_expression = "logcounts", name_cluster = "cell", name_sample = "sample_id", name_group = "stim")
# load the input data: data("Kang_subset", package = "distinct") Kang_subset # create the design of the study: samples = Kang_subset@metadata$experiment_info$sample_id group = Kang_subset@metadata$experiment_info$stim design = model.matrix(~group) # rownames of the design must indicate sample ids: rownames(design) = samples design # Note that the sample names in `colData(x)$name_sample` have to be the same ones as those in `rownames(design)`. rownames(design) unique(SingleCellExperiment::colData(Kang_subset)$sample_id) # In order to obtain a finer ranking for the most significant genes, if computational resources are available, we encourage users to increase P_4 (i.e., the number of permutations when a raw p-value is < 0.001) and set P_4 = 20,000 (by default P_4 = 10,000). # The group we would like to test for is in the second column of the design, therefore we will specify: column_to_test = 2 set.seed(61217) res = distinct_test( x = Kang_subset, name_assays_expression = "logcounts", name_cluster = "cell", design = design, column_to_test = 2, min_non_zero_cells = 20, n_cores = 2) # We can optionally add the fold change (FC) and log2-FC between groups: res = log2_FC(res = res, x = Kang_subset, name_assays_expression = "cpm", name_group = "stim", name_cluster = "cell") # Visualize significant results: head(top_results(res)) # Visualize significant results from a specified cluster of cells: top_results(res, cluster = "Dendritic cells") # By default, results from 'top_results' are sorted by (globally) adjusted p-value; # they can also be sorted by log2-FC: top_results(res, cluster = "Dendritic cells", sort_by = "log2FC") # Visualize significant UP-regulated genes only: top_results(res, up_down = "UP", cluster = "Dendritic cells") # Plot density and cdf for gene 'ISG15' in cluster 'Dendritic cells'. plot_densities(x = Kang_subset, gene = "ISG15", cluster = "Dendritic cells", name_assays_expression = "logcounts", name_cluster = "cell", name_sample = "sample_id", name_group = "stim") plot_cdfs(x = Kang_subset, gene = "ISG15", cluster = "Dendritic cells", name_assays_expression = "logcounts", name_cluster = "cell", name_sample = "sample_id", name_group = "stim")
muscData
package.Subset from the 'Kang18_8vs8()' object of the muscData
package.
Kang_subset |
contains a |
Simone Tiberi [email protected]
#################### # Object 'Kang_subset' is generated as follows: #################### # library(muscData) # sce = Kang18_8vs8() # # library(scater) # sce = computeLibraryFactors(sce) # sce = logNormCounts(sce) # cpm(sce) <- calculateCPM(sce) # # Select genes with at least 1000 non-zero cells: # sce = sce[ rowSums(assays(sce)$counts > 0) >= 1000, ] # # randomly select 100 of these genes: # set.seed(61217) # sel = sample( rownames(sce), size = 100) # sce = sce[ rownames(sce) %in% sel, ] # # select 3 individuals only: # ind_selected = levels(factor(colData(sce)$ind))[1:3] # sce = sce[, sce$ind %in% ind_selected] # # make a sample_id column: # colData(sce)$sample_id = factor(paste(colData(sce)$stim, colData(sce)$ind, sep = "_")) # # create an experiment_info object containing sample-group information: # experiment_info = unique(data.frame(sample_id = colData(sce)$sample_id, # stim = colData(sce)$stim) ) # metadata(sce)$experiment_info = data.frame(experiment_info, row.names = NULL) # # remove unnecessary information to reduce storage space: # sce$cluster = NULL; # sce$multiplets = NULL; # rowData(sce) = NULL; # colnames(sce) = NULL; # reducedDim(sce) = NULL # sce$ind = NULL # sce$sizeFactor = NULL # rm assays counts # assays(sce) = assays(sce)[2:3] # # Kang_subset = sce # save(Kang_subset, file = "Kang_subset.RData")
#################### # Object 'Kang_subset' is generated as follows: #################### # library(muscData) # sce = Kang18_8vs8() # # library(scater) # sce = computeLibraryFactors(sce) # sce = logNormCounts(sce) # cpm(sce) <- calculateCPM(sce) # # Select genes with at least 1000 non-zero cells: # sce = sce[ rowSums(assays(sce)$counts > 0) >= 1000, ] # # randomly select 100 of these genes: # set.seed(61217) # sel = sample( rownames(sce), size = 100) # sce = sce[ rownames(sce) %in% sel, ] # # select 3 individuals only: # ind_selected = levels(factor(colData(sce)$ind))[1:3] # sce = sce[, sce$ind %in% ind_selected] # # make a sample_id column: # colData(sce)$sample_id = factor(paste(colData(sce)$stim, colData(sce)$ind, sep = "_")) # # create an experiment_info object containing sample-group information: # experiment_info = unique(data.frame(sample_id = colData(sce)$sample_id, # stim = colData(sce)$stim) ) # metadata(sce)$experiment_info = data.frame(experiment_info, row.names = NULL) # # remove unnecessary information to reduce storage space: # sce$cluster = NULL; # sce$multiplets = NULL; # rowData(sce) = NULL; # colnames(sce) = NULL; # reducedDim(sce) = NULL # sce$ind = NULL # sce$sizeFactor = NULL # rm assays counts # assays(sce) = assays(sce)[2:3] # # Kang_subset = sce # save(Kang_subset, file = "Kang_subset.RData")
log2_FC
extends the results obtained via distinct_test
, by computing fold changes (FC) and log2-FC between conditions.
log2_FC( res, x, name_assays_expression = "cpm", name_group = "group_id", name_cluster = "cluster_id" )
log2_FC( res, x, name_assays_expression = "cpm", name_group = "group_id", name_cluster = "cluster_id" )
res |
a |
x |
a |
name_assays_expression |
a character ("cpm" by default), indicating the name of the assays(x) element which stores the expression data (i.e., assays(x)$name_assays_expression). We strongly encourage using normalized data, such as counts per million (CPM). Do not use logarithm transformed data to compute FCs. |
name_group |
a character ("group_id" by default), indicating the name of the colData(x) element which stores the group id of each cell (i.e., colData(x)$name_group). |
name_cluster |
a character ("cluster_id" by default), indicating the name of the colData(x) element which stores the cluster id of each cell (i.e., colData(x)$name_cluster). |
A data.frame
object, extending the results in 'res'.
Two additional columns are added: FC_group1/group2 and log2FC_group1/group2, inicating the FC and log2-FC of group1/group2.
A FC > 1 (or log2FC > 0) indicates up-regulation of group1 (compared to group2); while a FC < 1 (or log2FC < 0) indicates down-regulation of group1 (compared to group2).
Simone Tiberi [email protected]
# load pre-computed results (obtaines via `distinct_test`) data("res", package = "distinct") # load the input data: data("Kang_subset", package = "distinct") # We can optionally add the fold change (FC) and log2-FC between groups: res = log2_FC(res = res, x = Kang_subset, name_assays_expression = "cpm", name_group = "stim", name_cluster = "cell") # Visualize significant results: head(top_results(res))
# load pre-computed results (obtaines via `distinct_test`) data("res", package = "distinct") # load the input data: data("Kang_subset", package = "distinct") # We can optionally add the fold change (FC) and log2-FC between groups: res = log2_FC(res = res, x = Kang_subset, name_assays_expression = "cpm", name_group = "stim", name_cluster = "cell") # Visualize significant results: head(top_results(res))
plot_cdfs
returns a ggplot
object
with the sample-specific cumulative density function (CDF) estimates, for a specified cluster and gene.
plot_cdfs( x, name_assays_expression = "logcounts", name_cluster = "cluster_id", name_sample = "sample_id", name_group = "group_id", cluster, gene, group_level = FALSE, pad = TRUE, size = 0.75 )
plot_cdfs( x, name_assays_expression = "logcounts", name_cluster = "cluster_id", name_sample = "sample_id", name_group = "group_id", cluster, gene, group_level = FALSE, pad = TRUE, size = 0.75 )
x |
a |
name_assays_expression |
a character ("logcounts" by default), indicating the name of the assays(x) element which stores the expression data (i.e., assays(x)$name_assays_expression). We strongly encourage using normalized data, such as counts per million (CPM) or log-CPM. |
name_cluster |
a character ("cluster_id" by default), indicating the name of the colData(x) element which stores the cluster id of each cell (i.e., colData(x)$name_colData_cluster). |
name_sample |
a character ("sample_id" by default), indicating the name of the colData(x) element which stores the sample id of each cell (i.e., colData(x)$name_colData_sample). |
name_group |
a character ("group_id" by default), indicating the name of the colData(x) element which stores the group id of each cell (i.e., colData(x)$name_colData_group). |
cluster |
a character, indicating the name of the cluster to plot. |
gene |
a character, indicating the name of the gene to plot. |
group_level |
a logical, indicating whether to plot group-level (if TRUE) or sample-level curves (if FALSE). |
pad |
a logical element indicating whether to plot the lines of the CDF when 0 and 1 (TRUE) or not (FALSE). |
size |
a numeric argument defining the width of lines, passed to |
A ggplot
object.
Simone Tiberi [email protected]
data("Kang_subset", package = "distinct") Kang_subset plot_cdfs(x = Kang_subset, gene = "ISG15", cluster = "Dendritic cells", name_assays_expression = "logcounts", name_cluster = "cell", name_sample = "sample_id", name_group = "stim")
data("Kang_subset", package = "distinct") Kang_subset plot_cdfs(x = Kang_subset, gene = "ISG15", cluster = "Dendritic cells", name_assays_expression = "logcounts", name_cluster = "cell", name_sample = "sample_id", name_group = "stim")
plot_densities
returns a ggplot
object
with the sample-specific density estimates, for a specified cluster and gene.
plot_densities( x, name_assays_expression = "logcounts", name_cluster = "cluster_id", name_sample = "sample_id", name_group = "group_id", cluster, gene, group_level = FALSE, adjust = 1, size = 0.75 )
plot_densities( x, name_assays_expression = "logcounts", name_cluster = "cluster_id", name_sample = "sample_id", name_group = "group_id", cluster, gene, group_level = FALSE, adjust = 1, size = 0.75 )
x |
a |
name_assays_expression |
a character ("logcounts" by default), indicating the name of the assays(x) element which stores the expression data (i.e., assays(x)$name_assays_expression). We strongly encourage using normalized data, such as counts per million (CPM) or log-CPM. |
name_cluster |
a character ("cluster_id" by default), indicating the name of the colData(x) element which stores the cluster id of each cell (i.e., colData(x)$name_colData_cluster). |
name_sample |
a character ("sample_id" by default), indicating the name of the colData(x) element which stores the sample id of each cell (i.e., colData(x)$name_colData_sample). |
name_group |
a character ("group_id" by default), indicating the name of the colData(x) element which stores the group id of each cell (i.e., colData(x)$name_colData_group). |
cluster |
a character, indicating the name of the cluster to plot. |
gene |
a character, indicating the name of the gene to plot. |
group_level |
a logical, indicating whether to plot group-level (if TRUE) or sample-level curves (if FALSE). |
adjust |
a numeric, representing a multiplicate bandwidth adjustment, argument passed to |
size |
a numeric argument defining the width of lines, passed to |
A ggplot
object.
Simone Tiberi [email protected]
data("Kang_subset", package = "distinct") Kang_subset plot_densities(x = Kang_subset, gene = "ISG15", cluster = "Dendritic cells", name_assays_expression = "logcounts", name_cluster = "cell", name_sample = "sample_id", name_group = "stim")
data("Kang_subset", package = "distinct") Kang_subset plot_densities(x = Kang_subset, gene = "ISG15", cluster = "Dendritic cells", name_assays_expression = "logcounts", name_cluster = "cell", name_sample = "sample_id", name_group = "stim")
distinct_test
functionResults from distinct_test
function
res |
contains a |
Simone Tiberi [email protected]
# load the input data: # data("Kang_subset", package = "distinct") # Kang_subset # # create the design of the study: # samples = Kang_subset@metadata$experiment_info$sample_id # group = Kang_subset@metadata$experiment_info$stim # design = model.matrix(~group) # rownames of the design must indicate sample ids: # rownames(design) = samples # design # # Note that the sample names in `colData(x)$name_sample` have to be the same ones as those in `rownames(design)`. # rownames(design) # unique(SingleCellExperiment::colData(Kang_subset)$sample_id) # # In order to obtain a finer ranking for the most significant genes, if computational resources are available, we encourage users to increase P_4 (i.e., the number of permutations when a raw p-value is < 0.001) and set P_4 = 20,000 (by default P_4 = 10,000). # # The group we would like to test for is in the second column of the design, therefore we will specify: column_to_test = 2 # # set.seed(61217) # res = distinct_test( # x = Kang_subset, # name_assays_expression = "logcounts", # name_cluster = "cell", # design = design, # column_to_test = 2, # min_non_zero_cells = 20, # n_cores = 2) # # save(res, file = "res.RData") # saveRDS(res, file = "res.rds")
# load the input data: # data("Kang_subset", package = "distinct") # Kang_subset # # create the design of the study: # samples = Kang_subset@metadata$experiment_info$sample_id # group = Kang_subset@metadata$experiment_info$stim # design = model.matrix(~group) # rownames of the design must indicate sample ids: # rownames(design) = samples # design # # Note that the sample names in `colData(x)$name_sample` have to be the same ones as those in `rownames(design)`. # rownames(design) # unique(SingleCellExperiment::colData(Kang_subset)$sample_id) # # In order to obtain a finer ranking for the most significant genes, if computational resources are available, we encourage users to increase P_4 (i.e., the number of permutations when a raw p-value is < 0.001) and set P_4 = 20,000 (by default P_4 = 10,000). # # The group we would like to test for is in the second column of the design, therefore we will specify: column_to_test = 2 # # set.seed(61217) # res = distinct_test( # x = Kang_subset, # name_assays_expression = "logcounts", # name_cluster = "cell", # design = design, # column_to_test = 2, # min_non_zero_cells = 20, # n_cores = 2) # # save(res, file = "res.RData") # saveRDS(res, file = "res.rds")
top_results
returns the significant results obtained via distinct_test
.
top_results( res, cluster = "all", significance = 0.01, global = TRUE, up_down = "both", sort_by = "p_adj.glb" )
top_results( res, cluster = "all", significance = 0.01, global = TRUE, up_down = "both", sort_by = "p_adj.glb" )
res |
a |
cluster |
a character indicating the cluster(s) whose results have to be returned. Results from all clusters are returned by default ("all"). |
significance |
numeric, results with adjusted p-value < significance will be returned. |
global |
logical indicating whether to filter results according to p_adj.glb (when TRUE), or p_adj.loc (when FALSE). |
up_down |
a character indicating whether to return: all results ("both" or "BOTH"), only up-regulated results ("up" or "UP") or down-regulated results ("down" or "DOWN"). In 'res', a FC > 1 (or log2FC > 0) indicates up-regulation of group1 (compared to group2); while a FC < 1 (or log2FC < 0) indicates down-regulation of group1 (compared to group2). |
sort_by |
a character indicating how results should be sorted. Results can be sorted by globally adjusted p-value ("p_adj.glb", default choice), locally adjusted p-value ("p_adj.loc"), raw p-value ("p_val") or (log2)fold-change ("FC" or "log2FC"). |
A data.frame
object.
Columns 'gene' and 'cluster_id' contain the gene and cell-cluster name, while 'p_val', 'p_adj.loc' and 'p_adj.glb' report the raw p-values, locally and globally adjusted p-values, via Benjamini and Hochberg (BH) correction.
In locally adjusted p-values ('p_adj.loc') BH correction is applied in each cluster separately, while in globally adjusted p-values ('p_adj.glb') BH correction is performed to the results from all clusters.
Simone Tiberi [email protected]
# load pre-computed results (obtaines via `distinct_test`) data("res", package = "distinct") # Visualize significant results: head(top_results(res)) # Visualize significant results from a specified cluster of cells: top_results(res, cluster = "Dendritic cells") # We can optionally add the fold change (FC) and log2-FC between groups: # load the input data: data("Kang_subset", package = "distinct") res = log2_FC(res = res, x = Kang_subset, name_assays_expression = "cpm", name_group = "stim", name_cluster = "cell") # By default, results from 'top_results' are sorted by (globally) adjusted p-value; # they can also be sorted by log2-FC: top_results(res, cluster = "Dendritic cells", sort_by = "log2FC") # Visualize significant UP-regulated genes only: top_results(res, up_down = "UP", cluster = "Dendritic cells")
# load pre-computed results (obtaines via `distinct_test`) data("res", package = "distinct") # Visualize significant results: head(top_results(res)) # Visualize significant results from a specified cluster of cells: top_results(res, cluster = "Dendritic cells") # We can optionally add the fold change (FC) and log2-FC between groups: # load the input data: data("Kang_subset", package = "distinct") res = log2_FC(res = res, x = Kang_subset, name_assays_expression = "cpm", name_group = "stim", name_cluster = "cell") # By default, results from 'top_results' are sorted by (globally) adjusted p-value; # they can also be sorted by log2-FC: top_results(res, cluster = "Dendritic cells", sort_by = "log2FC") # Visualize significant UP-regulated genes only: top_results(res, up_down = "UP", cluster = "Dendritic cells")