distinct: a method for differential analyses via hierarchical permutation tests


Introduction

distinct is a statistical method to perform differential testing between two or more groups of distributions; differential testing is performed via hierarchical non-parametric permutation tests on the cumulative distribution functions (cdfs) of each sample. While most methods for differential expression target differences in the mean abundance between conditions, distinct, by comparing full cdfs, identifies, both, differential patterns involving changes in the mean, as well as more subtle variations that do not involve the mean (e.g., unimodal vs. bi-modal distributions with the same mean). distinct is a general and flexible tool: due to its fully non-parametric nature, which makes no assumptions on how the data was generated, it can be applied to a variety of datasets. It is particularly suitable to perform differential state analyses on single cell data (e.g., differential analyses within sub-populations of cells), such as single cell RNA sequencing (scRNA-seq) and high-dimensional flow or mass cytometry (HDCyto) data.

At present, covariates are not allowed, and only 2-group comparisons are implemented. In future releases, we will allow for covariates and for differential testing between more than 2 groups.

A pre-print will follow in the coming months.

To access the R code used in the vignettes, type:

browseVignettes("distinct")

Questions relative to distinct should be reported as a new issue at BugReports.

To cite distinct, type:

citation("distinct")

Bioconductor installation

distinct is available on Bioconductor and can be installed with the command:

if (!requireNamespace("BiocManager", quietly=TRUE))
  install.packages("BiocManager")
BiocManager::install("distinct")

Differential State analysis

Differential state analyses aim at investigating differential patterns between conditions in sub-populations of cells. To use distinct one needs data from two or more groups of samples (i.e., experimental conditions), with at least 2 samples (i.e., biological replicates) per group. Given a single-cell RNA-sequencing (scRNA-seq) or a high dimensional flow or mass cytometry (HDCyto) dataset, cells need first to be clustered in groups via some form of clustering algorithms; distinct is then applied to identify differential patterns between groups, within each cluster of cells.

Input data

Load the example dataset, consisting of a subset of 6 samples (3 individuals observed across 2 conditions) and 100 genes selected from the Kang18_8vs8() object of the muscData package.

library(SingleCellExperiment)
## Warning: multiple methods tables found for 'setequal'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'IRanges'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'GenomeInfoDb'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'GenomicRanges'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'XVector'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'SummarizedExperiment'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'S4Arrays'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'DelayedArray'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'SparseArray'
## Warning: replacing previous import 'S4Arrays::read_block' by
## 'DelayedArray::read_block' when loading 'SummarizedExperiment'
data("Kang_subset", package = "distinct")
Kang_subset
## class: SingleCellExperiment 
## dim: 100 9517 
## metadata(1): experiment_info
## assays(2): logcounts cpm
## rownames(100): ISG15 SYF2 ... MX2 PDXK
## rowData names(0):
## colnames: NULL
## colData names(3): stim cell sample_id
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Columns ind and stim of the colData indicate the indivual id and the experimental condition (control or stimulated) of each cell, while column sample_id shows the sample id, needed for the differential anlyses. Column cell represents the cell type, which defines the clustering structure of cells: differential testing between conditions is performed separately for each cluster of cells. Note that, if cell clustering label was unknown, we would need to cluster cells into groups via some clustering algorithm.

colData(Kang_subset)
## DataFrame with 9517 rows and 3 columns
##          stim            cell sample_id
##      <factor>        <factor>  <factor>
## 1        ctrl CD4 T cells     ctrl_107 
## 2        ctrl CD14+ Monocytes ctrl_1015
## 3        ctrl NK cells        ctrl_1015
## 4        ctrl CD4 T cells     ctrl_107 
## 5        ctrl CD14+ Monocytes ctrl_1015
## ...       ...             ...       ...
## 9513     stim CD14+ Monocytes stim_1015
## 9514     stim CD4 T cells     stim_101 
## 9515     stim CD14+ Monocytes stim_101 
## 9516     stim CD14+ Monocytes stim_107 
## 9517     stim CD4 T cells     stim_107

The experimental design compares two groups (stim vs ctrl) with 3 biological replicates each.

Kang_subset@metadata$experiment_info
##   sample_id stim
## 1  ctrl_107 ctrl
## 2 ctrl_1015 ctrl
## 3  ctrl_101 ctrl
## 4  stim_101 stim
## 5 stim_1015 stim
## 6  stim_107 stim

Differential analyses within sub-populations of cells

Load distinct.

library(distinct)

Create the design of the study:

samples = Kang_subset@metadata$experiment_info$sample_id
group = Kang_subset@metadata$experiment_info$stim
design = model.matrix(~group)
# rownames of the design must indicate sample ids:
rownames(design) = samples
design
##           (Intercept) groupstim
## ctrl_107            1         0
## ctrl_1015           1         0
## ctrl_101            1         0
## stim_101            1         1
## stim_1015           1         1
## stim_107            1         1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Perform differential state testing between conditions. Parameter name_assays_expression specifies the input data (logcounts) in assays(x), while name_cluster and name_sample define the column names of colData(x) containing the clustering of cells (cell) and the id of individual samples (sample_id). The group we would like to test for is in the second column of the design, therefore we will specify: column_to_test = 2.

Note that the sample names in colData(x)$name_sample have to be the same ones as those in rownames(design) (although not necessarily in the same order).

rownames(design)
## [1] "ctrl_107"  "ctrl_1015" "ctrl_101"  "stim_101"  "stim_1015" "stim_107"
unique(colData(Kang_subset)$sample_id)
## [1] ctrl_107  ctrl_1015 ctrl_101  stim_101  stim_1015 stim_107 
## Levels: ctrl_101 ctrl_1015 ctrl_107 stim_101 stim_1015 stim_107

In order to obtain a finer ranking for the most significant genes, if computational resources are available, we encourage users to increase P_4 (i.e., the number of permutations when a raw p-value is < 0.001) and set P_4 = 20,000 (by default P_4 = 10,000).

We strongly encourage using normalized data, such as counts per million (CPM) or log2-CPM (e.g., logcounts as created via scater::logNormCounts).

set.seed(61217)

res = distinct_test(x = Kang_subset, 
                    name_assays_expression = "logcounts",
                    name_cluster = "cell",
                    name_sample = "sample_id",
                    design = design,
                    column_to_test = 2,
                    min_non_zero_cells = 20,
                    n_cores = 2)
## 2 groups of samples provided
## Data loaded, starting differential testing
## Differential testing completed, returning results

Handling covariates and batch effects

Covariates (such as batch effects), if present, can be added to the design matrix. In each cluster of cells, we fit a linear model, with covariates as predictors, and regress them out by performing differential analyeses on the residuals. By separately fitting a linear model on each cluster, we allow the effect of covariates to vary from cluster to cluster.

When specifying covariates, we highly recommend using log-normalized data, such as log2-CPMs (e.g., logcounts as created via scater::logNormCounts), because it is generally assumed that covariates (and particularly batch effects) have an approximately linear effect on the log or log2 scale of counts.

Assume samples are associated to three different batches; we modify the design to also include batches.

batch = factor(c("A", "B", "C", "A", "B", "C"))

design = model.matrix(~group + batch)
# rownames of the design must indicate sample ids:
rownames(design) = samples
design
##           (Intercept) groupstim batchB batchC
## ctrl_107            1         0      0      0
## ctrl_1015           1         0      1      0
## ctrl_101            1         0      0      1
## stim_101            1         1      0      0
## stim_1015           1         1      1      0
## stim_107            1         1      0      1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
## 
## attr(,"contrasts")$batch
## [1] "contr.treatment"

We proceed as before to perform differential testing. Again, we specify the column of the design to be tested via column_to_test = 2.

set.seed(61217)

res_batches = distinct_test(x = Kang_subset, 
                            name_assays_expression = "logcounts",
                            name_cluster = "cell",
                            name_sample = "sample_id",
                            design = design,
                            column_to_test = 2,
                            min_non_zero_cells = 20,
                            n_cores = 2)
## 2 groups of samples provided
## Covariates detected
## Data loaded, starting differential testing
## Differential testing completed, returning results

Visualizing results

Results are reported as a data.frame, where columns gene and cluster_id contain the gene and cell-cluster name, while p_val, p_adj.loc and p_adj.glb report the raw p-values, locally and globally adjusted p-values, via Benjamini and Hochberg (BH) correction. In locally adjusted p-values (p_adj.loc) BH correction is applied in each cluster separately, while in globally adjusted p-values (p_adj.glb) BH correction is performed to the results from all clusters.

We can further compute the fold change (FC) and log2-FC between groups. To compute FCs, use normalized data, such as CPMs; do not use logarithm transformed data (e.g., logcounts).

res = log2_FC(res = res,
              x = Kang_subset, 
              name_assays_expression = "cpm",
              name_group = "stim",
              name_cluster = "cell")
## FC and log2_FC computed, returning results

log2_FC computes the log-FC between the first and the second level of the group id, in this case beween controls (numerator) and stimulated samples (denominator).

levels(colData(Kang_subset)$stim)
## [1] "ctrl" "stim"
head(res[,9:10], 3)
##   FC_ctrl/stim log2FC_ctrl/stim
## 1   0.02309151       -5.4364934
## 2   1.16891993        0.2251761
## 3   1.26131602        0.3349298

To use a different level (i.e., stim/ctrl), we can reorder the levels before running log2_FC2.

# set "stim" as 1st level:
colData(Kang_subset)$stim = relevel(colData(Kang_subset)$stim, "stim")
levels(colData(Kang_subset)$stim)
## [1] "stim" "ctrl"
res_2 = log2_FC(res = res,
              x = Kang_subset, 
              name_assays_expression = "cpm",
              name_group = "stim",
              name_cluster = "cell")
## 'res' already contains columns 'FC' and/or 'log2FC': they will be overwritten
## FC and log2_FC computed, returning results
head(res_2[,9:10], 3)
##   FC_stim/ctrl log2FC_stim/ctrl
## 1   43.3059524        5.4364934
## 2    0.8554906       -0.2251761
## 3    0.7928227       -0.3349298

We can visualize significant results via top_results function.

head(top_results(res))
##     gene cluster_id     p_val   p_adj.loc   p_adj.glb filtered mean_ctrl
## 1  ISG15    B cells 9.999e-05 0.001224878 0.000731038    FALSE  198.0569
## 41 POLD2    B cells 9.999e-05 0.001224878 0.000731038    FALSE  129.3231
## 49  RPL7    B cells 9.999e-05 0.001224878 0.000731038    FALSE 6330.3843
## 57  SPI1    B cells 9.999e-05 0.001224878 0.000731038    FALSE  159.7687
## 61 PRDX5    B cells 9.999e-05 0.001224878 0.000731038    FALSE  382.3718
## 72 PSME2    B cells 9.999e-05 0.001224878 0.000731038    FALSE  668.6966
##     mean_stim FC_ctrl/stim log2FC_ctrl/stim
## 1  8577.04339   0.02309151       -5.4364934
## 41   51.37722   2.51712882        1.3317791
## 49 4537.11929   1.39524308        0.4805165
## 57   21.60257   7.39582088        2.8867103
## 61  136.40614   2.80318652        1.4870677
## 72 1574.42658   0.42472389       -1.2354028

We can also visualize significant results for a specific cluster of cells.

top_results(res, cluster = "Dendritic cells")
##       gene      cluster_id      p_val   p_adj.loc   p_adj.glb filtered
## 401  ISG15 Dendritic cells 0.00009999 0.001549845 0.000731038    FALSE
## 449   RPL7 Dendritic cells 0.00009999 0.001549845 0.000731038    FALSE
## 472  PSME2 Dendritic cells 0.00009999 0.001549845 0.000731038    FALSE
## 499    MX2 Dendritic cells 0.00009999 0.001549845 0.000731038    FALSE
## 417 ARID5A Dendritic cells 0.00069993 0.008679132 0.004264388    FALSE
##      mean_ctrl  mean_stim FC_ctrl/stim log2FC_ctrl/stim
## 401  262.99661 21085.0630   0.01247312       -6.3250333
## 449 3568.75526  2156.7050   1.65472570        0.7265921
## 472  785.99389  2291.2623   0.34303968       -1.5435526
## 499   41.46270   502.7550   0.08247097       -3.5999698
## 417   37.40927   181.3747   0.20625400       -2.2775060

By default, results from ‘top_results’ are sorted by (globally) adjusted p-value; they can also be sorted by log2-FC.

top_results(res, cluster = "Dendritic cells", sort_by = "log2FC")
##       gene      cluster_id      p_val   p_adj.loc   p_adj.glb filtered
## 401  ISG15 Dendritic cells 0.00009999 0.001549845 0.000731038    FALSE
## 499    MX2 Dendritic cells 0.00009999 0.001549845 0.000731038    FALSE
## 417 ARID5A Dendritic cells 0.00069993 0.008679132 0.004264388    FALSE
## 472  PSME2 Dendritic cells 0.00009999 0.001549845 0.000731038    FALSE
## 449   RPL7 Dendritic cells 0.00009999 0.001549845 0.000731038    FALSE
##      mean_ctrl  mean_stim FC_ctrl/stim log2FC_ctrl/stim
## 401  262.99661 21085.0630   0.01247312       -6.3250333
## 499   41.46270   502.7550   0.08247097       -3.5999698
## 417   37.40927   181.3747   0.20625400       -2.2775060
## 472  785.99389  2291.2623   0.34303968       -1.5435526
## 449 3568.75526  2156.7050   1.65472570        0.7265921

We can further filter results to visualize significant up- or down-regulated results only. Here we visualize the down-regulated gene-cluster results; i.e., results with lower expression in ‘ctlr’ group compared to ‘stim’.

top_results(res, up_down = "down",
            cluster = "Dendritic cells")
##       gene      cluster_id      p_val   p_adj.loc   p_adj.glb filtered
## 401  ISG15 Dendritic cells 0.00009999 0.001549845 0.000731038    FALSE
## 472  PSME2 Dendritic cells 0.00009999 0.001549845 0.000731038    FALSE
## 499    MX2 Dendritic cells 0.00009999 0.001549845 0.000731038    FALSE
## 417 ARID5A Dendritic cells 0.00069993 0.008679132 0.004264388    FALSE
##     mean_ctrl  mean_stim FC_ctrl/stim log2FC_ctrl/stim
## 401 262.99661 21085.0630   0.01247312        -6.325033
## 472 785.99389  2291.2623   0.34303968        -1.543553
## 499  41.46270   502.7550   0.08247097        -3.599970
## 417  37.40927   181.3747   0.20625400        -2.277506

Plotting significant results

Density plot of one significant gene (ISG15) in Dendritic cells cluster.

plot_densities(x = Kang_subset,
               gene = "ISG15",
               cluster = "Dendritic cells",
               name_assays_expression = "logcounts",
               name_cluster = "cell",
               name_sample = "sample_id",
               name_group = "stim")

Instead of one curve per sample, we can also plot aggregated group-level curves by setting group_level = TRUE.

plot_densities(x = Kang_subset,
               gene = "ISG15",
               cluster = "Dendritic cells",
               name_assays_expression = "logcounts",
               name_cluster = "cell",
               name_sample = "sample_id",
               name_group = "stim",
               group_level = TRUE)

CDF plot of one significant gene (ISG15) in Dendritic cells cluster.

plot_cdfs(x = Kang_subset,
          gene = "ISG15",
          cluster = "Dendritic cells",
          name_assays_expression = "logcounts",
          name_cluster = "cell",
          name_sample = "sample_id",
          name_group = "stim")

Violin plots of significant genes in Dendritic cells cluster.

# select cluster of cells:
cluster = "Dendritic cells"
sel_cluster = res$cluster_id == cluster
sel_column = Kang_subset$cell == cluster

# select significant genes:
sel_genes = res$p_adj.glb < 0.01
genes = as.character(res$gene[sel_cluster & sel_genes])

# make violin plots:
library(scater)
## Loading required package: scuttle
## Loading required package: ggplot2
plotExpression(Kang_subset[,sel_column],
               features = genes, exprs_values = "logcounts",
               log2_values = FALSE,
               x = "sample_id", colour_by = "stim", ncol = 3) +
  guides(fill = guide_legend(override.aes = list(size = 5, alpha = 1))) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Visualize the concordance of differential results between cell clusters. We select as significant genes with globally adjusted p-value below 0.01.

library(UpSetR)
res_by_cluster = split( ifelse(res$p_adj.glb < 0.01, 1, 0), res$cluster_id)
upset(data.frame(do.call(cbind, res_by_cluster)), nsets = 10, nintersects = 20)

# Session info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] UpSetR_1.4.0                scater_1.35.0              
##  [3] ggplot2_3.5.1               scuttle_1.17.0             
##  [5] distinct_1.19.0             SingleCellExperiment_1.29.0
##  [7] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [9] GenomicRanges_1.59.0        GenomeInfoDb_1.43.0        
## [11] IRanges_2.41.0              S4Vectors_0.45.0           
## [13] BiocGenerics_0.53.1         generics_0.1.3             
## [15] MatrixGenerics_1.19.0       matrixStats_1.4.1          
## [17] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] viridisLite_0.4.2       vipor_0.4.7             farver_2.1.2           
##  [4] viridis_0.6.5           fastmap_1.2.0           digest_0.6.37          
##  [7] rsvd_1.0.5              lifecycle_1.0.4         statmod_1.5.0          
## [10] magrittr_2.0.3          compiler_4.4.1          rlang_1.1.4            
## [13] sass_0.4.9              rngtools_1.5.2          tools_4.4.1            
## [16] utf8_1.2.4              yaml_2.3.10             knitr_1.48             
## [19] S4Arrays_1.7.1          labeling_0.4.3          doRNG_1.8.6            
## [22] DelayedArray_0.33.1     plyr_1.8.9              abind_1.4-8            
## [25] BiocParallel_1.41.0     withr_3.0.2             sys_3.4.3              
## [28] grid_4.4.1              fansi_1.0.6             beachmat_2.23.0        
## [31] colorspace_2.1-1        scales_1.3.0            iterators_1.0.14       
## [34] cli_3.6.3               rmarkdown_2.28          crayon_1.5.3           
## [37] httr_1.4.7              ggbeeswarm_0.7.2        cachem_1.1.0           
## [40] zlibbioc_1.51.2         parallel_4.4.1          BiocManager_1.30.25    
## [43] XVector_0.47.0          vctrs_0.6.5             Matrix_1.7-1           
## [46] jsonlite_1.8.9          BiocSingular_1.23.0     BiocNeighbors_2.1.0    
## [49] ggrepel_0.9.6           irlba_2.3.5.1           beeswarm_0.4.0         
## [52] maketools_1.3.1         foreach_1.5.2           limma_3.63.0           
## [55] jquerylib_0.1.4         glue_1.8.0              codetools_0.2-20       
## [58] gtable_0.3.6            UCSC.utils_1.3.0        ScaledMatrix_1.15.0    
## [61] munsell_0.5.1           tibble_3.2.1            pillar_1.9.0           
## [64] htmltools_0.5.8.1       GenomeInfoDbData_1.2.13 R6_2.5.1               
## [67] doParallel_1.0.17       evaluate_1.0.1          lattice_0.22-6         
## [70] highr_0.11              bslib_0.8.0             Rcpp_1.0.13            
## [73] gridExtra_2.3           SparseArray_1.7.0       xfun_0.49              
## [76] buildtools_1.0.0        pkgconfig_2.0.3