TDbasedUFE is a R package that intends to perform various analyses using Tensor decomposition based unsupervised feature extraction (TDbasedUFE) (Y-H. Taguchi 2020) as much as possible in an user friendly manner. Although in the future, I am planning to implement all the features possible by TDbasedUFE in this R package, at the moment, it includes only basic features.
Although [tensor] (https://bioconductor.org/packages/release/bioc/vignettes/DelayedTensor/inst/doc/DelayedTensor_1.html#12_What_is_Tensor) and [tensor decomposition] (https://bioconductor.org/packages/release/bioc/vignettes/DelayedTensor/inst/doc/DelayedTensor_1.html#13_What_is_Tensor_Decomposition) were fully described in vignettes of DelayedTensor(Tsuyuzaki 2022), we also briefly describe these.
Tensor is a natural extension of matrix that has only rows and columns, whereas tensor has more number of “rows” (or “columns”). For example, three mode tensor, xijk ∈ ℝN × M × K, has three suffix, i ∈ ℕ, j ∈ ℕ, and k ∈ ℕ, each of which ranges [1, N], [1, K] and [1, M]. This representation fitted with that of genomic datasets, e. g., xijk can represent gene expression of ith gene of kth tissue of jth subject (person). If the measurements are performed over multiple conditions, e.g., different time points, we can add additional sufix that corresponds to time. Alternatively, we can also make use of tensor to store multiomics data if we can add additional suffix that represent which of multiomics data is stored in the elements of tensor. Since the experimental design has become more and more complicated, tensor is an ideal data structure to store modern genomic observations.
Although tensor can store very complicated structure well, it does not always mean that it has ability to help us to understand these complicated structure. In order to decompose tensor, tensor decomposition was developed. There are various way to compose tensor. Usually, tensor decomposition (TD) is to decompose tensor into product of some of vectors and tensors.
The most popular TD is CP decomposition, which can decompose tensor into product some of vectors. For example, xijk ∈ ℝN × M × K can be decomposed into the product sum of three vectors, ℝN, ℝM, and ℝK.
which can be mathematically expressed as xijk = ∑ℓλℓuℓiuℓjuℓk
Alternatively, more advanced way, tensor train decomposition,
can decomposes the tensor to product of a mixture of matrix and tensor
as xijk = ∑ℓ1∑ℓ2Rℓ1iRℓ1ℓ2jRℓ2k
However, in this specific study, we employed more primitive way of
decomposition, Tucker decomposition, as xijk = ∑ℓ1∑ℓ2∑ℓ3G(ℓ1ℓ2ℓ3)uℓ1iuℓ2juℓ3k
where G ∈ ℝN × M × K
is core tensor that evaluate the weight of product $ u_{1 i}
u{2 j} u{_3 k}$ and uℓ1i ∈ ℝN × N, uℓ2j ∈ ℝM × M, uℓ3k ∈ ℝK × K
are singular value matrices and are orthogonal matrices.
Although there are no uniqueness of Tucker decomposition, we specifically employed higher order singular value decomposition (HOSVD) (Y-H. Taguchi 2020) to get Tucker decomposition. The reason why we employed Tucker decomposition was fully described in my book(Y-H. Taguchi 2020).
Here we briefly describe how we can make use of HOSVD for the feature selection. Suppose xijk represents the expression of gene i of the kth tissue of the jth subject (person) and the task is to select genes whose expression is distinct between patients and healthy controls only at one specific tissue. In order that, we need to select uℓ2j attributed to jth subject which is distinct between healthy control and patients (see (1)) and uℓ3k attributed to kth tissue is kth tissue specific (see (2)). Then we next identify which G(ℓ1ℓ2ℓ3) has the largest absolute value with fixed ℓ2 and ℓ3 that correspond to distinction between patient and healthy control and tissue specificity, respectively (see (3)). Finally, with assuming that uℓ1i obeys Gaussian distribution (null hypothesis), P-values are attributed to ith gene. P-values are corrected with considering multiple comparison corrections (specifically, we employed Benjamini Hochberg method(Y-H. Taguchi 2020)) and genes associated with adjusted P-values less than 0.01 are usually selected (see (4) and (5)).
Although TD based unsupervised FE was proposed five years ago, it was successfully applied to wide range of genomic science, together with the proto type method, principal component analysis (PCA) based unsupervised FE proposed ten years ago, we recently recognize that the optimization of the standard deviation (SD) in Gaussian distribution employed as the null hypothesis drastically improved the performance(Y.-H. Taguchi and Turki 2022) (Y.-H. Taguchi and Turki 2023) (Roy and Taguchi 2022). How we can optimize SD and why the optimization of SD can improve the performance is as follows. Previously, we computed SD of uℓ1i, σℓ1 as $$ \sigma_{\ell_1} = \sqrt{\frac{1}{N} \sum_i \left ( u_{\ell_1 i} - \langle u_{\ell_1i} \rangle\right)^2} $$ $$ \langle u_{\ell_1 i} \rangle = \frac{1}{N} \sum_i u_{\ell_1 i}. $$ However, σℓ1 has tendency to be over estimated since some of ith does not follow Gaussian distribution and has too large absolute values. In order to exclude those i that does not follow Gaussian, we tried to minimize SD of hn which is the frequency of Pi computed with assuming that uℓ1i obeys Gaussian and falling into nth bin of histogram Pi $$ \sigma_h = \sqrt{ \frac{1}{N_h} \sum_n \left ( h_n - \langle h_n \rangle \right)^2} $$ where Nh is the number of bins of the histogram of Pi and $$ \langle h_n \rangle = \frac{1}{N_h} \sum_n h_n $$ with excluding some of Pi which is supposed to be too small to assume that obeys Gaussian (Null hypothesis). This is because hn should be constant if all of Pi follows Gaussian (Null hypothesis).
The above plot is output of examples of function selectFeature in TDbasedUFE package and represents the dependence of σh upon σℓ1 as well as the resulting histogram hn of 1 − Pi when uℓ1i is fully generated from Gaussian. The smallest σh correctly corresponds to the true σℓ1( = 0.01) and hn is almost independent of n as expected.
Thus it is expected that σh will take the smallest value when we correctly exclude is whose Pi does not obey Gaussian (Null hypothesis). The following is the step to optimized σℓ1.
Prepare initial σℓ1 and define threshold adjusted P-value, P0.
Attribute Pis to is with assuming Gaussian distribution of uℓ1i using σℓ1.
Attribute adjusted Pis to is with using Benjamini-Hochberg criterion.
Exclude is associated with adjusted Pi less than P0.
Compute histogram of 1 − Pi and σh.
Modify σℓ1 until σh is minimized.
Finally, we identify excluded is as the selected features because these is do not obey Gaussian (Null hypothesis)
When we have multiomics data set composed of K omics data measured on common M samples as xikjk ∈ ℝNk × M × K where Nk is the number of features in kth omics data, we can make use of a tensor which is a bundle of linear kanel(Y. Taguchi and Turki 2022). In this case, we can generate liner kernel of kth omics data as xjj′k = ∑ikxikjkxikj′k ∈ ℝM × M × K to which HOSVD was applied and we get xjj′k = ∑ℓ1∑ℓ2∑ℓ3G(ℓ1ℓ2ℓ3)uℓ1juℓ2j′uℓ3k where ℓ1 and ℓ2 are exchangeable, since xjj′k = xj′jk. Singular values attributed to features can be retrieved as uℓik = ∑juℓjxikj Then the same procedure was applied to uℓi in order to select features.
In this vignettes, we explain how TD based unsupervised FE can work with optimized SD. In the other vignettes, we introduce how it works well in the real examples.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] MOFAdata_1.22.0 tximportData_1.34.0 tximport_1.35.0
#> [4] readr_2.1.5 rTensor_1.4.8 GenomicRanges_1.59.1
#> [7] GenomeInfoDb_1.43.2 IRanges_2.41.2 S4Vectors_0.45.2
#> [10] BiocGenerics_0.53.3 generics_0.1.3 TDbasedUFE_1.7.0
#> [13] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 hms_1.1.3 digest_0.6.37
#> [4] magrittr_2.0.3 evaluate_1.0.1 fastmap_1.2.0
#> [7] jsonlite_1.8.9 promises_1.3.2 BiocManager_1.30.25
#> [10] httr_1.4.7 UCSC.utils_1.3.0 jquerylib_0.1.4
#> [13] cli_3.6.3 shiny_1.10.0 crayon_1.5.3
#> [16] rlang_1.1.4 XVector_0.47.0 bit64_4.5.2
#> [19] cachem_1.1.0 yaml_2.3.10 parallel_4.4.2
#> [22] tools_4.4.2 tzdb_0.4.0 httpuv_1.6.15
#> [25] GenomeInfoDbData_1.2.13 buildtools_1.0.0 vctrs_0.6.5
#> [28] R6_2.5.1 mime_0.12 lifecycle_1.0.4
#> [31] zlibbioc_1.52.0 bit_4.5.0.1 vroom_1.6.5
#> [34] pkgconfig_2.0.3 pillar_1.10.0 bslib_0.8.0
#> [37] later_1.4.1 glue_1.8.0 Rcpp_1.0.13-1
#> [40] tidyselect_1.2.1 xfun_0.49 tibble_3.2.1
#> [43] sys_3.4.3 knitr_1.49 xtable_1.8-4
#> [46] htmltools_0.5.8.1 rmarkdown_2.29 maketools_1.3.1
#> [49] compiler_4.4.2