library(TDbasedUFE)

Installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("TDbasedUFE")

Introduction

TDbasedUFE is a R package that intends to perform various analyses using Tensor decomposition based unsupervised feature extraction (TDbasedUFE) (Y-H. Taguchi 2020) as much as possible in an user friendly manner. Although in the future, I am planning to implement all the features possible by TDbasedUFE in this R package, at the moment, it includes only basic features.

What are tensor and tensor decomposition?

Although [tensor] (https://bioconductor.org/packages/release/bioc/vignettes/DelayedTensor/inst/doc/DelayedTensor_1.html#12_What_is_Tensor) and [tensor decomposition] (https://bioconductor.org/packages/release/bioc/vignettes/DelayedTensor/inst/doc/DelayedTensor_1.html#13_What_is_Tensor_Decomposition) were fully described in vignettes of DelayedTensor(Tsuyuzaki 2022), we also briefly describe these.

Tensor is a natural extension of matrix that has only rows and columns, whereas tensor has more number of “rows” (or “columns”). For example, three mode tensor, x_ijk ∈ ℝ^{N × M × K}, has three suffix, i ∈ ℕ, j ∈ ℕ, and k ∈ ℕ, each of which ranges [1, N], [1, K] and [1, M]. This representation fitted with that of genomic datasets, e. g., x_ijk can represent gene expression of ith gene of kth tissue of jth subject (person). If the measurements are performed over multiple conditions, e.g., different time points, we can add additional sufix that corresponds to time. Alternatively, we can also make use of tensor to store multiomics data if we can add additional suffix that represent which of multiomics data is stored in the elements of tensor. Since the experimental design has become more and more complicated, tensor is an ideal data structure to store modern genomic observations.

Although tensor can store very complicated structure well, it does not always mean that it has ability to help us to understand these complicated structure. In order to decompose tensor, tensor decomposition was developed. There are various way to compose tensor. Usually, tensor decomposition (TD) is to decompose tensor into product of some of vectors and tensors.

The most popular TD is CP decomposition, which can decompose tensor into product some of vectors. For example, x_ijk ∈ ℝ^{N × M × K} can be decomposed into the product sum of three vectors, ℝ^N, ℝ^M, and ℝ^K.

Tensor Decomposition

which can be mathematically expressed as x_ijk = ∑_ℓλ_ℓu_ℓiu_ℓju_ℓk Alternatively, more advanced way, tensor train decomposition,
can decomposes the tensor to product of a mixture of matrix and tensor as x_ijk = ∑_ℓ₁∑_ℓ₂R_ℓ₁iR_{ℓ₁ℓ₂j}R_ℓ₂k However, in this specific study, we employed more primitive way of decomposition, Tucker decomposition, as x_ijk = ∑_ℓ₁∑_ℓ₂∑_ℓ₃G(ℓ₁ℓ₂ℓ₃)u_ℓ₁iu_ℓ₂ju_ℓ₃k where G ∈ ℝ^{N × M × K} is core tensor that evaluate the weight of product $ u_{1 i} u{2 j} u{_3 k}$ and u_ℓ₁i ∈ ℝ^N × N, u_ℓ₂j ∈ ℝ^M × M, u_ℓ₃k ∈ ℝ^K × K are singular value matrices and are orthogonal matrices.

Although there are no uniqueness of Tucker decomposition, we specifically employed higher order singular value decomposition (HOSVD) (Y-H. Taguchi 2020) to get Tucker decomposition. The reason why we employed Tucker decomposition was fully described in my book(Y-H. Taguchi 2020).

Tensor decomposition based unsupervised feature extraction.

Tensor decomposition based unsupervised feature extraction

Here we briefly describe how we can make use of HOSVD for the feature selection. Suppose x_ijk represents the expression of gene i of the kth tissue of the jth subject (person) and the task is to select genes whose expression is distinct between patients and healthy controls only at one specific tissue. In order that, we need to select u_ℓ₂j attributed to jth subject which is distinct between healthy control and patients (see (1)) and u_ℓ₃k attributed to kth tissue is kth tissue specific (see (2)). Then we next identify which G(ℓ₁ℓ₂ℓ₃) has the largest absolute value with fixed ℓ₂ and ℓ₃ that correspond to distinction between patient and healthy control and tissue specificity, respectively (see (3)). Finally, with assuming that u_ℓ₁i obeys Gaussian distribution (null hypothesis), P-values are attributed to ith gene. P-values are corrected with considering multiple comparison corrections (specifically, we employed Benjamini Hochberg method(Y-H. Taguchi 2020)) and genes associated with adjusted P-values less than 0.01 are usually selected (see (4) and (5)).

Optimization of standard deviation

Although TD based unsupervised FE was proposed five years ago, it was successfully applied to wide range of genomic science, together with the proto type method, principal component analysis (PCA) based unsupervised FE proposed ten years ago, we recently recognize that the optimization of the standard deviation (SD) in Gaussian distribution employed as the null hypothesis drastically improved the performance(Y.-H. Taguchi and Turki 2022) (Y.-H. Taguchi and Turki 2023) (Roy and Taguchi 2022). How we can optimize SD and why the optimization of SD can improve the performance is as follows. Previously, we computed SD of u_ℓ₁i, σ_ℓ₁ as $$ \sigma_{\ell_1} = \sqrt{\frac{1}{N} \sum_i \left ( u_{\ell_1 i} - \langle u_{\ell_1i} \rangle\right)^2} $$ $$ \langle u_{\ell_1 i} \rangle = \frac{1}{N} \sum_i u_{\ell_1 i}. $$ However, σ_ℓ₁ has tendency to be over estimated since some of ith does not follow Gaussian distribution and has too large absolute values. In order to exclude those i that does not follow Gaussian, we tried to minimize SD of h_n which is the frequency of P_i computed with assuming that u_ℓ₁i obeys Gaussian and falling into nth bin of histogram P_i $$ \sigma_h = \sqrt{ \frac{1}{N_h} \sum_n \left ( h_n - \langle h_n \rangle \right)^2} $$ where N_h is the number of bins of the histogram of P_i and $$ \langle h_n \rangle = \frac{1}{N_h} \sum_n h_n $$ with excluding some of P_i which is supposed to be too small to assume that obeys Gaussian (Null hypothesis). This is because h_n should be constant if all of P_i follows Gaussian (Null hypothesis).

$Left: \sigma_h as a function of \sigma_{\ell_1}. Right: Histogram, h_n, of 1-P_i being computed with the optimized \sigma_{\ell_1}.$

Left: σ_h as a function of σ_ℓ₁. Right: Histogram, h_n, of 1 − P_i being computed with the optimized σ_ℓ₁.

The above plot is output of examples of function selectFeature in TDbasedUFE package and represents the dependence of σ_h upon σ_ℓ₁ as well as the resulting histogram h_n of 1 − P_i when u_ℓ₁i is fully generated from Gaussian. The smallest σ_h correctly corresponds to the true σ_ℓ₁( = 0.01) and h_n is almost independent of n as expected.

Thus it is expected that σ_h will take the smallest value when we correctly exclude is whose P_i does not obey Gaussian (Null hypothesis). The following is the step to optimized σ_ℓ₁.

Prepare initial σ_ℓ₁ and define threshold adjusted P-value, P₀.
Attribute P_is to is with assuming Gaussian distribution of u_ℓ₁i using σ_ℓ₁.
Attribute adjusted P_is to is with using Benjamini-Hochberg criterion.
Exclude is associated with adjusted P_i less than P₀.
Compute histogram of 1 − P_i and σ_h.
Modify σ_ℓ₁ until σ_h is minimized.
Finally, we identify excluded is as the selected features because these is do not obey Gaussian (Null hypothesis)

Multiomics data analysis

When we have multiomics data set composed of K omics data measured on common M samples as x_{i_kjk} ∈ ℝ^{N_k × M × K} where N_k is the number of features in kth omics data, we can make use of a tensor which is a bundle of linear kanel(Y. Taguchi and Turki 2022). In this case, we can generate liner kernel of kth omics data as x_jj′k = ∑_{i_k}x_{i_kjk}x_{i_kj′k} ∈ ℝ^{M × M × K} to which HOSVD was applied and we get x_jj′k = ∑_ℓ₁∑_ℓ₂∑_ℓ₃G(ℓ₁ℓ₂ℓ₃)u_ℓ₁ju_ℓ₂j′u_ℓ₃k where ℓ₁ and ℓ₂ are exchangeable, since x_jj′k = x_j′jk. Singular values attributed to features can be retrieved as u_{ℓi_k} = ∑_ju_ℓjx_{i_kj} Then the same procedure was applied to u_ℓi in order to select features.

Multiomics analysis

Conclusions

In this vignettes, we explain how TD based unsupervised FE can work with optimized SD. In the other vignettes, we introduce how it works well in the real examples.

sessionInfo()
#> R version 4.4.3 (2025-02-28)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] MOFAdata_1.22.0      tximportData_1.34.0  tximport_1.35.0     
#>  [4] readr_2.1.5          rTensor_1.4.8        GenomicRanges_1.59.1
#>  [7] GenomeInfoDb_1.43.4  IRanges_2.41.3       S4Vectors_0.45.4    
#> [10] BiocGenerics_0.53.6  generics_0.1.3       TDbasedUFE_1.7.0    
#> [13] BiocStyle_2.35.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9              hms_1.1.3               digest_0.6.37          
#>  [4] magrittr_2.0.3          evaluate_1.0.3          fastmap_1.2.0          
#>  [7] jsonlite_1.9.1          promises_1.3.2          BiocManager_1.30.25    
#> [10] httr_1.4.7              UCSC.utils_1.3.1        jquerylib_0.1.4        
#> [13] cli_3.6.4               shiny_1.10.0            crayon_1.5.3           
#> [16] rlang_1.1.5             XVector_0.47.2          bit64_4.6.0-1          
#> [19] cachem_1.1.0            yaml_2.3.10             parallel_4.4.3         
#> [22] tools_4.4.3             tzdb_0.5.0              httpuv_1.6.15          
#> [25] GenomeInfoDbData_1.2.13 buildtools_1.0.0        vctrs_0.6.5            
#> [28] R6_2.6.1                mime_0.13               lifecycle_1.0.4        
#> [31] bit_4.6.0               vroom_1.6.5             pkgconfig_2.0.3        
#> [34] pillar_1.10.1           bslib_0.9.0             later_1.4.1            
#> [37] glue_1.8.0              Rcpp_1.0.14             tidyselect_1.2.1       
#> [40] xfun_0.51               tibble_3.2.1            sys_3.4.3              
#> [43] knitr_1.50              xtable_1.8-4            htmltools_0.5.8.1      
#> [46] rmarkdown_2.29          maketools_1.3.2         compiler_4.4.3

Roy, Sanjiban Sekhar, and Y-H. Taguchi. 2022. “Tensor Decomposition and Principal Component Analysis-Based Unsupervised Feature Extraction Outperforms State-of-the-Art Methods When Applied to Histone Modification Profiles.” bioRxiv. https://doi.org/10.1101/2022.04.29.490081.

Taguchi, Y-H. 2020. Unsupervised Feature Extraction Applied to Bioinformatics. Springer International Publishing. https://doi.org/10.1007/978-3-030-22456-1.

Taguchi, Y-H, and Turki Turki. 2022. “Adapted tensor decomposition and PCA based unsupervised feature extraction select more biologically reasonable differentially expressed genes than conventional methods.” Scientific Reports 12 (1): 17438. https://doi.org/10.1038/s41598-022-21474-z.

Taguchi, Y-h, and Turki Turki. 2022. “Novel feature selection method via kernel tensor decomposition for improved multi-omics data analysis.” BMC Medical Genomics 15 (1): 37. https://doi.org/10.1186/s12920-022-01181-4.

Taguchi, Y.-H., and Turki Turki. 2023. “Principal Component Analysis- and Tensor Decomposition-Based Unsupervised Feature Extraction to Select More Suitable Differentially Methylated Cytosines: Optimization of Standard Deviation Versus State-of-the-Art Methods.” Genomics, 110577. https://doi.org/https://doi.org/10.1016/j.ygeno.2023.110577.

Tsuyuzaki, Koki. 2022. DelayedTensor: R Package for Sparse and Out-of-Core Arithmetic and Decomposition of Tensor. https://doi.org/10.18129/B9.bioc.DelayedTensor.