Authors: Koki Tsuyuzaki [aut, cre]
Last modified: 2024-12-29 05:04:56.986594
Compiled: Sun Dec 29 05:07:00 2024
Biological systems have very complicated structures, as in this figure [1].
For example, inside the cell, DNA sequences are folded in the nucleus, RNA molecules are transcribed from the DNA, proteins are translated from the RNAs, and finally the proteins are related to cellular functions. Outside the cell, there are also many signals, such as bacterial infection, added chemical reagents, drugs, and lifestyle. Changes in these molecular types and phenomena ultimately affect phenotypes such as disease, BMI, and morphology. It is not possible to measure all molecular types or phenomena simultaneously, so one or two of them are chosen and measured exhaustively. This approach is called an omics study and is widely used. For example, genomics measures all DNA sequences, and transcriptomics measures all RNA molecules.
There is a need for a framework that can handle and analyze such heterogeneous data structures in a unified manner and provide biological interpretation. Tensors are a mathematical framework that can be very useful in such a situation.
A tensor can be considered a generalized form of data representation [2]. For example, a scalar, a vector, and a matrix are also called a 0th-order tensor, a 1st-order tensor, and a 2nd-order tensor, respectively. If a dataset has three “modes” (1. height, 2. width, and 3. depth), it is called a 3rd-order tensor.
In this sense, almost any data can be regarded as a tensor, but in most cases the term tensor implies a tensor of 3rd or higher order.
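The correspondence between order and familiar R objects can be sketched in base R as follows. The values and sizes are arbitrary and chosen only for illustration; the order of an array is the length of its `dim()`.

```r
# Orders 0 to 3 represented with base R objects (values are arbitrary).
s <- 3.14                              # 0th-order tensor: scalar
v <- c(1, 2, 3)                        # 1st-order tensor: vector
m <- matrix(1:6, nrow = 2, ncol = 3)   # 2nd-order tensor: matrix
x <- array(1:24, dim = c(2, 3, 4))     # 3rd-order tensor: height x width x depth

# The order is the number of modes, i.e. the length of dim().
length(dim(m))  # 2
length(dim(x))  # 3
dim(x)          # 2 3 4
```

Note that a plain R vector has no `dim` attribute, so `dim(v)` returns `NULL` rather than a length-1 dimension.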
Tensor decomposition is an extension of matrix decomposition. For example, given a third-order tensor of gene × tissue × condition, tensor decomposition can extract a small number of patterns [3].
The vectors for each mode can be collected into matrices, which are called factor matrices. The scalar values are collected into a small tensor, which is called the core tensor.
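To make the roles of the factor matrices and the core tensor concrete, here is a minimal base R sketch that rebuilds a gene × tissue × condition tensor from them. The sizes, the random values, and the names `A`, `B`, `C`, and `G` are assumptions made for illustration; this shows only the reconstruction step, not a decomposition algorithm.

```r
# Hypothetical sizes: 4 genes x 3 tissues x 2 conditions, 2 patterns per mode.
set.seed(1)
A <- matrix(rnorm(4 * 2), nrow = 4)             # gene factor matrix
B <- matrix(rnorm(3 * 2), nrow = 3)             # tissue factor matrix
C <- matrix(rnorm(2 * 2), nrow = 2)             # condition factor matrix
G <- array(rnorm(2 * 2 * 2), dim = c(2, 2, 2))  # core tensor (scalar weights)

# Each core entry G[p, q, r] weights the rank-1 tensor formed by the outer
# product of one column from each factor matrix.
X <- array(0, dim = c(nrow(A), nrow(B), nrow(C)))
for (p in seq_len(dim(G)[1])) {
  for (q in seq_len(dim(G)[2])) {
    for (r in seq_len(dim(G)[3])) {
      X <- X + G[p, q, r] * outer(outer(A[, p], B[, q]), C[, r])
    }
  }
}

dim(X)  # 4 3 2
```

The 24 entries of `X` are described here by 18 factor-matrix entries and 8 core entries; for realistic sizes, the factors and core are far smaller than the original tensor.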
A tensor is more than just a multi-dimensional array; as we will see later, there are various operations that are specific to tensors, such as reshaping, mode-wise statistics, and various tensor products. These operations are essential in the analysis of tensor data and the implementation of tensor decomposition algorithms. Although the standard array in R can represent tensors of any order, it does not provide tensor-specific operations. Therefore, many R users currently manipulate tensor data with the functions implemented in the rTensor package. Although rTensor is very useful, it assumes that the input object is an in-memory array. Tensors, however, can easily become huge as the order and the size of each mode increase, and may no longer fit in memory.
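A quick back-of-the-envelope calculation shows how fast the memory requirement grows. The sketch below assumes a dense tensor of double-precision values (8 bytes each); the helper `tensor_gib` and the dimensions are hypothetical.

```r
# Memory needed to hold a dense double-precision tensor in RAM, in GiB.
# Hypothetical helper; each double takes 8 bytes.
tensor_gib <- function(dims, bytes_per_element = 8) {
  prod(dims) * bytes_per_element / 1024^3
}

tensor_gib(c(1000, 1000))            # 2nd-order: about 0.0075 GiB
tensor_gib(c(1000, 1000, 1000))      # 3rd-order: about 7.45 GiB
tensor_gib(c(1000, 1000, 1000, 10))  # 4th-order: about 74.5 GiB
```

Adding a single mode multiplies the footprint by the size of that mode, which is why higher-order data quickly exceeds the memory of a typical workstation.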
DelayedTensor is implemented for such extremely large tensor data. DelayedTensor provides some of the functions of rTensor by reimplementing them with DelayedArray. DelayedArray is a framework that allows us to use data on disk as if it were a standard array in R. DelayedArray can use out-of-core backends such as HDF5Array and TileDBArray, and calculations can be performed incrementally by implementing functions that support “block processing”.
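The idea behind block processing can be sketched in base R without any package. The code below is not the DelayedArray API: a small in-memory array stands in for on-disk data, and `blockwise_sum` is a hypothetical helper that touches only a few slices at a time and accumulates a running result.

```r
# Stand-in for on-disk data: a small in-memory array (hypothetical sizes).
x <- array(seq_len(4 * 3 * 10), dim = c(4, 3, 10))

# Hypothetical helper: sum all elements while handling at most
# `block_size` slices along the third mode at a time.
blockwise_sum <- function(x, block_size = 3) {
  n <- dim(x)[3]
  starts <- seq(1, n, by = block_size)
  total <- 0
  for (s in starts) {
    idx <- s:min(s + block_size - 1, n)
    block <- x[, , idx, drop = FALSE]  # only this block is processed now
    total <- total + sum(block)
  }
  total
}

blockwise_sum(x) == sum(x)  # TRUE
```

With a real out-of-core backend, only the current block would need to be read into memory, so the peak memory use depends on the block size rather than on the size of the whole tensor.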
The functionality of DelayedTensor is fourfold.
Block-Processing Tensor Reshaping: Operations such as unfolding higher-order tensor data into a matrix, and folding a matrix back into a tensor, can be performed while taking care of the block size.
Block-Processing Tensor Arithmetic: Sums and averages for each mode, and operations such as the Hadamard, Kronecker, and Khatri-Rao products, can be calculated while taking care of the block size.
Block-Processing Tensor Decomposition: Some of the tensor decomposition algorithms implemented in rTensor have been reimplemented using reshaping and arithmetic functions of DelayedTensor.
Block-Processing einsum: In addition to the rTensor functions, the einsum function, which is well known from NumPy (Python), has also been reimplemented based on DelayedArray and the block processing framework. einsum is a very powerful preprocessing method for merging multiple tensor data into a single tensor.
Although what is executed inside the functions is very different, the function names, argument names, and value names are exactly the same as those of rTensor, so that rTensor users can adopt them easily.
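For readers unfamiliar with the reshaping and arithmetic operations listed above, the following base R sketch shows what they compute on a small in-memory array. It does not use the rTensor or DelayedTensor functions; `unfold_mode` and `fold_mode` are hypothetical helpers, and the column ordering of the unfolded matrix is one possible convention that may differ from the one used by those packages.

```r
x <- array(1:24, dim = c(2, 3, 4))

# Hypothetical helper: mode-k unfolding. Mode k becomes the rows and the
# remaining modes are flattened into the columns.
unfold_mode <- function(x, k) {
  d <- dim(x)
  perm <- c(k, seq_along(d)[-k])
  matrix(aperm(x, perm), nrow = d[k])
}

# Hypothetical helper: inverse of unfold_mode for target dimensions d.
fold_mode <- function(m, k, d) {
  perm <- c(k, seq_along(d)[-k])
  aperm(array(m, dim = d[perm]), order(perm))
}

m2 <- unfold_mode(x, 2)
dim(m2)                                 # 3 8
identical(fold_mode(m2, 2, dim(x)), x)  # TRUE

# Mode-wise statistics: one value per index of the chosen mode.
mode1_sums <- apply(x, 1, sum)
mode3_means <- apply(x, 3, mean)

# Matrix products that appear in decomposition algorithms.
A <- matrix(1:4, nrow = 2)  # 2 x 2
B <- matrix(1:6, nrow = 3)  # 3 x 2
hadamard <- A * A           # element-wise product of same-shaped matrices
kron <- kronecker(A, B)     # Kronecker product, 6 x 4
# Khatri-Rao product: column-wise Kronecker product, 6 x 2
khatri_rao <- sapply(seq_len(ncol(A)), function(j) {
  as.vector(kronecker(A[, j], B[, j]))
})

# Contracting mode 2 of x with a matrix U (written "ijk,lj->ilk" in
# einsum notation) via unfold, multiply, and fold.
U <- matrix(1, nrow = 5, ncol = 3)
y <- fold_mode(U %*% unfold_mode(x, 2), 2, c(2, 5, 4))
dim(y)                                  # 2 5 4
```

Because `U` contains only ones, every slice `y[, l, ]` equals the sum of `x` over its second mode, which gives a simple way to check the contraction.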
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0
## [4] xfun_0.49 maketools_1.3.1 cachem_1.1.0
## [7] knitr_1.49 htmltools_0.5.8.1 rmarkdown_2.29
## [10] buildtools_1.0.0 lifecycle_1.0.4 cli_3.6.3
## [13] sass_0.4.9 jquerylib_0.1.4 compiler_4.4.2
## [16] sys_3.4.3 tools_4.4.2 evaluate_1.0.1
## [19] bslib_0.8.0 yaml_2.3.10 BiocManager_1.30.25
## [22] jsonlite_1.8.9 rlang_1.1.4