TDbasedUFE

library(TDbasedUFE)

Installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("TDbasedUFE")

Introduction

TDbasedUFE is a R package that intends to perform various analyses using Tensor decomposition based unsupervised feature extraction (TDbasedUFE) (Y-H. Taguchi 2020) as much as possible in an user friendly manner. Although in the future, I am planning to implement all the features possible by TDbasedUFE in this R package, at the moment, it includes only basic features.

What are tensor and tensor decomposition?

Although [tensor] (https://bioconductor.org/packages/release/bioc/vignettes/DelayedTensor/inst/doc/DelayedTensor_1.html#12_What_is_Tensor) and [tensor decomposition] (https://bioconductor.org/packages/release/bioc/vignettes/DelayedTensor/inst/doc/DelayedTensor_1.html#13_What_is_Tensor_Decomposition) were fully described in vignettes of DelayedTensor(Tsuyuzaki 2022), we also briefly describe these.

Tensor is a natural extension of matrix that has only rows and columns, whereas tensor has more number of “rows” (or “columns”). For example, three mode tensor, xijk ∈ ℝN × M × K, has three suffix, i ∈ ℕ, j ∈ ℕ, and k ∈ ℕ, each of which ranges [1, N], [1, K] and [1, M]. This representation fitted with that of genomic datasets, e. g., xijk can represent gene expression of ith gene of kth tissue of jth subject (person). If the measurements are performed over multiple conditions, e.g., different time points, we can add additional sufix that corresponds to time. Alternatively, we can also make use of tensor to store multiomics data if we can add additional suffix that represent which of multiomics data is stored in the elements of tensor. Since the experimental design has become more and more complicated, tensor is an ideal data structure to store modern genomic observations.

Although tensor can store very complicated structure well, it does not always mean that it has ability to help us to understand these complicated structure. In order to decompose tensor, tensor decomposition was developed. There are various way to compose tensor. Usually, tensor decomposition (TD) is to decompose tensor into product of some of vectors and tensors.

The most popular TD is CP decomposition, which can decompose tensor into product some of vectors. For example, xijk ∈ ℝN × M × K can be decomposed into the product sum of three vectors, N, ℝM, and K.

Tensor Decomposition
Tensor Decomposition

which can be mathematically expressed as xijk = ∑λuiujuk Alternatively, more advanced way, tensor train decomposition,
can decomposes the tensor to product of a mixture of matrix and tensor as xijk = ∑12R1iR12jR2k However, in this specific study, we employed more primitive way of decomposition, Tucker decomposition, as xijk = ∑123G(ℓ123)u1iu2ju3k where G ∈ ℝN × M × K is core tensor that evaluate the weight of product $ u_{1 i} u{2 j} u{_3 k}$ and u1i ∈ ℝN × N, u2j ∈ ℝM × M, u3k ∈ ℝK × K are singular value matrices and are orthogonal matrices.

Although there are no uniqueness of Tucker decomposition, we specifically employed higher order singular value decomposition (HOSVD) (Y-H. Taguchi 2020) to get Tucker decomposition. The reason why we employed Tucker decomposition was fully described in my book(Y-H. Taguchi 2020).

Tensor decomposition based unsupervised feature extraction.

Tensor decomposition based unsupervised feature extraction
Tensor decomposition based unsupervised feature extraction

Here we briefly describe how we can make use of HOSVD for the feature selection. Suppose xijk represents the expression of gene i of the kth tissue of the jth subject (person) and the task is to select genes whose expression is distinct between patients and healthy controls only at one specific tissue. In order that, we need to select u2j attributed to jth subject which is distinct between healthy control and patients (see (1)) and u3k attributed to kth tissue is kth tissue specific (see (2)). Then we next identify which G(ℓ123) has the largest absolute value with fixed 2 and 3 that correspond to distinction between patient and healthy control and tissue specificity, respectively (see (3)). Finally, with assuming that u1i obeys Gaussian distribution (null hypothesis), P-values are attributed to ith gene. P-values are corrected with considering multiple comparison corrections (specifically, we employed Benjamini Hochberg method(Y-H. Taguchi 2020)) and genes associated with adjusted P-values less than 0.01 are usually selected (see (4) and (5)).

Optimization of standard deviation

Although TD based unsupervised FE was proposed five years ago, it was successfully applied to wide range of genomic science, together with the proto type method, principal component analysis (PCA) based unsupervised FE proposed ten years ago, we recently recognize that the optimization of the standard deviation (SD) in Gaussian distribution employed as the null hypothesis drastically improved the performance(Y.-H. Taguchi and Turki 2022) (Y.-H. Taguchi and Turki 2023) (Roy and Taguchi 2022). How we can optimize SD and why the optimization of SD can improve the performance is as follows. Previously, we computed SD of u1i, σ1 as $$ \sigma_{\ell_1} = \sqrt{\frac{1}{N} \sum_i \left ( u_{\ell_1 i} - \langle u_{\ell_1i} \rangle\right)^2} $$ $$ \langle u_{\ell_1 i} \rangle = \frac{1}{N} \sum_i u_{\ell_1 i}. $$ However, σ1 has tendency to be over estimated since some of ith does not follow Gaussian distribution and has too large absolute values. In order to exclude those i that does not follow Gaussian, we tried to minimize SD of hn which is the frequency of Pi computed with assuming that u1i obeys Gaussian and falling into nth bin of histogram Pi $$ \sigma_h = \sqrt{ \frac{1}{N_h} \sum_n \left ( h_n - \langle h_n \rangle \right)^2} $$ where Nh is the number of bins of the histogram of Pi and $$ \langle h_n \rangle = \frac{1}{N_h} \sum_n h_n $$ with excluding some of Pi which is supposed to be too small to assume that obeys Gaussian (Null hypothesis). This is because hn should be constant if all of Pi follows Gaussian (Null hypothesis).

Left: \sigma_h as a function of \sigma_{\ell_1}. Right: Histogram, h_n, of 1-P_i being computed with the optimized \sigma_{\ell_1}.
Left: σh as a function of σ1. Right: Histogram, hn, of 1 − Pi being computed with the optimized σ1.

The above plot is output of examples of function selectFeature in TDbasedUFE package and represents the dependence of σh upon σ1 as well as the resulting histogram hn of 1 − Pi when u1i is fully generated from Gaussian. The smallest σh correctly corresponds to the true σ1( = 0.01) and hn is almost independent of n as expected.

Thus it is expected that σh will take the smallest value when we correctly exclude is whose Pi does not obey Gaussian (Null hypothesis). The following is the step to optimized σ1.

  1. Prepare initial σ1 and define threshold adjusted P-value, P0.

  2. Attribute Pis to is with assuming Gaussian distribution of u1i using σ1.

  3. Attribute adjusted Pis to is with using Benjamini-Hochberg criterion.

  4. Exclude is associated with adjusted Pi less than P0.

  5. Compute histogram of 1 − Pi and σh.

  6. Modify σ1 until σh is minimized.

  7. Finally, we identify excluded is as the selected features because these is do not obey Gaussian (Null hypothesis)

Multiomics data analysis

When we have multiomics data set composed of K omics data measured on common M samples as xikjk ∈ ℝNk × M × K where Nk is the number of features in kth omics data, we can make use of a tensor which is a bundle of linear kanel(Y. Taguchi and Turki 2022). In this case, we can generate liner kernel of kth omics data as xjjk = ∑ikxikjkxikjk ∈ ℝM × M × K to which HOSVD was applied and we get xjjk = ∑123G(ℓ123)u1ju2ju3k where 1 and 2 are exchangeable, since xjjk = xjjk. Singular values attributed to features can be retrieved as uik = ∑jujxikj Then the same procedure was applied to ui in order to select features.

Multiomics analysis
Multiomics analysis

Conclusions

In this vignettes, we explain how TD based unsupervised FE can work with optimized SD. In the other vignettes, we introduce how it works well in the real examples.

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] MOFAdata_1.22.0      tximportData_1.34.0  tximport_1.35.0     
#>  [4] readr_2.1.5          rTensor_1.4.8        GenomicRanges_1.59.1
#>  [7] GenomeInfoDb_1.43.2  IRanges_2.41.2       S4Vectors_0.45.2    
#> [10] BiocGenerics_0.53.3  generics_0.1.3       TDbasedUFE_1.7.0    
#> [13] BiocStyle_2.35.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9              hms_1.1.3               digest_0.6.37          
#>  [4] magrittr_2.0.3          evaluate_1.0.1          fastmap_1.2.0          
#>  [7] jsonlite_1.8.9          promises_1.3.2          BiocManager_1.30.25    
#> [10] httr_1.4.7              UCSC.utils_1.3.0        jquerylib_0.1.4        
#> [13] cli_3.6.3               shiny_1.10.0            crayon_1.5.3           
#> [16] rlang_1.1.4             XVector_0.47.0          bit64_4.5.2            
#> [19] cachem_1.1.0            yaml_2.3.10             parallel_4.4.2         
#> [22] tools_4.4.2             tzdb_0.4.0              httpuv_1.6.15          
#> [25] GenomeInfoDbData_1.2.13 buildtools_1.0.0        vctrs_0.6.5            
#> [28] R6_2.5.1                mime_0.12               lifecycle_1.0.4        
#> [31] zlibbioc_1.52.0         bit_4.5.0.1             vroom_1.6.5            
#> [34] pkgconfig_2.0.3         pillar_1.10.0           bslib_0.8.0            
#> [37] later_1.4.1             glue_1.8.0              Rcpp_1.0.13-1          
#> [40] tidyselect_1.2.1        xfun_0.49               tibble_3.2.1           
#> [43] sys_3.4.3               knitr_1.49              xtable_1.8-4           
#> [46] htmltools_0.5.8.1       rmarkdown_2.29          maketools_1.3.1        
#> [49] compiler_4.4.2
Roy, Sanjiban Sekhar, and Y-H. Taguchi. 2022. “Tensor Decomposition and Principal Component Analysis-Based Unsupervised Feature Extraction Outperforms State-of-the-Art Methods When Applied to Histone Modification Profiles.” bioRxiv. https://doi.org/10.1101/2022.04.29.490081.
Taguchi, Y-H. 2020. Unsupervised Feature Extraction Applied to Bioinformatics. Springer International Publishing. https://doi.org/10.1007/978-3-030-22456-1.
Taguchi, Y-H, and Turki Turki. 2022. Adapted tensor decomposition and PCA based unsupervised feature extraction select more biologically reasonable differentially expressed genes than conventional methods.” Scientific Reports 12 (1): 17438. https://doi.org/10.1038/s41598-022-21474-z.
Taguchi, Y-h, and Turki Turki. 2022. Novel feature selection method via kernel tensor decomposition for improved multi-omics data analysis.” BMC Medical Genomics 15 (1): 37. https://doi.org/10.1186/s12920-022-01181-4.
Taguchi, Y.-H., and Turki Turki. 2023. “Principal Component Analysis- and Tensor Decomposition-Based Unsupervised Feature Extraction to Select More Suitable Differentially Methylated Cytosines: Optimization of Standard Deviation Versus State-of-the-Art Methods.” Genomics, 110577. https://doi.org/https://doi.org/10.1016/j.ygeno.2023.110577.
Tsuyuzaki, Koki. 2022. DelayedTensor: R Package for Sparse and Out-of-Core Arithmetic and Decomposition of Tensor. https://doi.org/10.18129/B9.bioc.DelayedTensor.