The 2-Wasserstein distance W is a metric to describe the distance between two distributions, representing e.g. two different conditions A and B.
For continuous distributions, it is given by
$$W := W(F_A, F_B) = \bigg( \int_0^1 \big|F_A^{-1}(u) - F_B^{-1}(u) \big|^2 du \bigg)^\frac{1}{2},$$
where FA and FB are the corresponding cumulative distribution functions (CDFs) and FA−1 and FB−1 the respective quantile functions.
We specifically consider the squared 2-Wasserstein distance d := W2 which offers the following decomposition into location, size, and shape terms: $$d := d(F_A, F_B) = \int_0^1 \big|F^{-1}(u) - F^{-1}(u) \big|^2 du = \underbrace{\big(\mu_A - \mu_B\big)^2}_{\text{location}} + \underbrace{\big(\sigma_A - \sigma_B\big)^2}_{\text{size}} + \underbrace{2\sigma_A \sigma_B \big(1 - \rho^{A,B}\big)}_{\text{shape}},$$
where μA and μB are the respective means, σA and σB are the respective standard deviations, and ρA, B is the Pearson correlation of the points in the quantile-quantile plot of FA and FB.
In case the distributions FA and FB are not explicitly given and information is only availbale in the form of samples from FA and FB, respectively, we use the corresponding empirical CDFs F̂A and F̂B. Then, the 2-Wasserstein distance is computed by
$$d(\hat{F}_A, \hat{F}_B) \approx \frac{1}{K} \sum_{k=1}^K \big(Q_A^{\alpha_k} - Q_B^{\alpha_k} \big) \approx \big(\hat{\mu}_A - \hat{\mu}_B\big)^2 + \big(\hat{\sigma}_A - \hat{\sigma}_B\big)^2 + 2\hat{\sigma}_A \hat{\sigma}_B \big(1 - \hat{\rho}^{A,B}\big).$$
Here, QA and QB denote equidistant quantiles of FA and FB, respectively, at the levels $\alpha_k := \frac{k-0.5}{K}, k = 1, \dots , K$, where we use K := 1000 in our implementation. Moreover, μ̂A, μ̂B, σ̂A, σ̂B and ρ̂A, B denote the empirical versions of the corresponding quantities.
The waddR
package offers three functions to compute the
2-Wasserstein distance in two-sample settings.
We will use samples from normal distributions to illustrate them.
The first function, wasserstein_metric
, offers a faster
reimplementation in C++ of the wasserstein1d
function from
the R package transport
, which is able to compute general
p-Wasserstein distances. For
p = 2, we obtain the
2-Wasserstein distance W.
The corresponding value of the squared 2-Wasserstein distance d is then computed as:
The second function, squared_wass_approx
, computes the
squared 2-Wasserstein distance by calculating the mean squared
difference of the equidistant quantiles (first approximation in the
previous formula). This function is currently used to compute the
2-Wasserstein distances in the testing procedures.
The third function, squared_wass_decomp
, approximates
the squared 2-Wasserstein distance using the location, size, and shape
terms from the above decomposition (second apporximation in the previous
formula). It also returns the respective decomposition values.
squared_wass_decomp(x,y)
#> $distance
#> [1] 4.180458
#>
#> $location
#> [1] 4.114983
#>
#> $size
#> [1] 0.002307
#>
#> $shape
#> [1] 0.06316766
In the considered example, the decomposition results suggest that the two distributions differ with respect to location (mean), but not in terms of size and shape, thus confirming the underlying normal model.
The waddR
package
Two-sample tests to check for differences between two distributions
Detection of differential gene expression distributions in scRNAseq data
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] waddR_1.21.0 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.37.0 xfun_0.49
#> [3] bslib_0.8.0 Biobase_2.67.0
#> [5] lattice_0.22-6 vctrs_0.6.5
#> [7] tools_4.4.2 generics_0.1.3
#> [9] stats4_4.4.2 curl_6.0.1
#> [11] parallel_4.4.2 tibble_3.2.1
#> [13] RSQLite_2.3.9 blob_1.2.4
#> [15] pkgconfig_2.0.3 Matrix_1.7-1
#> [17] arm_1.14-4 dbplyr_2.5.0
#> [19] S4Vectors_0.45.2 lifecycle_1.0.4
#> [21] GenomeInfoDbData_1.2.13 compiler_4.4.2
#> [23] codetools_0.2-20 eva_0.2.6
#> [25] GenomeInfoDb_1.43.2 htmltools_0.5.8.1
#> [27] sys_3.4.3 buildtools_1.0.0
#> [29] sass_0.4.9 yaml_2.3.10
#> [31] nloptr_2.1.1 pillar_1.10.0
#> [33] crayon_1.5.3 jquerylib_0.1.4
#> [35] MASS_7.3-61 BiocParallel_1.41.0
#> [37] SingleCellExperiment_1.29.1 DelayedArray_0.33.3
#> [39] cachem_1.1.0 boot_1.3-31
#> [41] abind_1.4-8 nlme_3.1-166
#> [43] tidyselect_1.2.1 digest_0.6.37
#> [45] purrr_1.0.2 dplyr_1.1.4
#> [47] splines_4.4.2 maketools_1.3.1
#> [49] fastmap_1.2.0 grid_4.4.2
#> [51] SparseArray_1.7.2 cli_3.6.3
#> [53] magrittr_2.0.3 S4Arrays_1.7.1
#> [55] withr_3.0.2 filelock_1.0.3
#> [57] UCSC.utils_1.3.0 bit64_4.5.2
#> [59] XVector_0.47.0 httr_1.4.7
#> [61] matrixStats_1.4.1 lme4_1.1-35.5
#> [63] bit_4.5.0.1 coda_0.19-4.1
#> [65] memoise_2.0.1 evaluate_1.0.1
#> [67] knitr_1.49 GenomicRanges_1.59.1
#> [69] IRanges_2.41.2 BiocFileCache_2.15.0
#> [71] rlang_1.1.4 Rcpp_1.0.13-1
#> [73] glue_1.8.0 DBI_1.2.3
#> [75] BiocGenerics_0.53.3 minqa_1.2.8
#> [77] jsonlite_1.8.9 R6_2.5.1
#> [79] MatrixGenerics_1.19.0 zlibbioc_1.52.0