The singular value decomposition (SVD) is an important procedure in general data analysis, primarily due to its use in the principal components analysis (PCA). The BiocSingular package provides methods to perform both exact and approximate SVD with a single wrapper function. Our aim is to provide a standard framework for use in other Bioconductor packages, where users can easily switch between different SVD algorithms. We also support parallelization throughout various parts of the SVD calculations via the BiocParallel framework.
# SVD algorithm choices
To perform any SVD, we call the runSVD()
function. The
code below performs an exact SVD to obtain the first k
singular vectors and values (in this case, 10):
dummy <- matrix(rnorm(10000), ncol=25)
e.out <- runSVD(dummy, k=10, BSPARAM=ExactParam())
str(e.out)
## List of 3
## $ d: num [1:10] 24.4 23.9 23.1 22.9 22.5 ...
## $ u: num [1:400, 1:10] -0.09372 -0.02707 0.00729 -0.00641 0.01066 ...
## $ v: num [1:25, 1:10] 0.065061 0.465998 -0.338473 -0.029838 -0.000806 ...
… but we can change the algorithm by specifying a different
BiocSingularParam
class to the BSPARAM=
argument. For example, we could use an approximate approach with irlba
instead:
… or we could use a randomized SVD via rsvd:
By default, the exact algorithm is used when
BSPARAM=NULL
. The other algorithms are approximate but are
much faster than the exact SVD for low k
. They also involve
randomization, so note the use of a random seed.
For tall or fat matrices, it may be faster to perform the SVD on the cross-product rather than on the matrix itself. The cross-product is usually much smaller than the original matrix and can be decomposed rapidly, with time savings offsettting the cost of the cross-product in the first place. The results of the cross-product decomposition is used to obtain the decomposition results for the original matrix.
The definition of “tall” or “fat” is based on the fold difference in
dimensions. All BiocSingularParam
objects have a
fold
slot that specifies the minimum fold difference
between dimensions for a cross-product to be computed. This provides an
automated mechanism for deciding whether or not a cross-product should
be used.
## class: ExactParam
## cross-product fold-threshold: 10.00
## deferred centering/scaling: off
Setting fold=1
will compute the cross-product for all
matrices, even for square matrices where the gains are negligible.
Setting fold=Inf
will never compute the cross-product for
any matrix. The default is to use fold=5
.
Centering and scaling of the input matrix can be performed with the
center=
and scale=
arguments, respectively.
These should be numeric vectors of length equal to the number of rows of
the input matrix. The SVD will then be effectively performed on
t(t(dummy)-center)/scale)
.
Many of the approximate algorithms (and computation of the cross-product) involve matrix multiplications. By default, any centering and scaling is applied before any multiplication. However, it is possible to “defer” the centering and scaling such that the multiplication can make use of features of the underlying matrix representation (e.g., sparsity). This enables faster calculations at the expense of numerical stability.
The main practical purpose of the SVD is to perform PCAs. This is
achieved with the runPCA()
wrapper function, which also
accepts a BSPARAM=
argument specifying the type of
algorithm to use:
## List of 3
## $ sdev : num [1:10] 1.22 1.2 1.15 1.15 1.12 ...
## $ rotation: num [1:25, 1:10] 0.065141 0.465864 -0.33853 -0.029862 -0.000804 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
## $ x : num [1:400, 1:10] -2.288 -0.662 0.176 -0.157 0.26 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
The output is equivalent to that of prcomp()
- a matrix
of principal component scores, the rotation vectors, and the standard
deviation explained by each component.
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] -2.28836543 -0.005088504 -0.1215437 -1.14325491 2.4615798 0.3795913
## [2,] -0.66231190 -0.593552869 0.8889813 0.08355081 -1.6417589 -0.5222694
## [3,] 0.17610681 -0.581214302 -1.3314832 0.40977421 -1.4605159 0.4954922
## [4,] -0.15681334 -0.783137887 0.6542274 -0.11570220 0.8610293 0.3942107
## [5,] 0.25961665 0.174974164 1.2359999 0.74731536 -0.5603019 -0.9505845
## [6,] 0.05301977 0.539412924 0.4282860 0.79711774 -1.3153335 0.1527365
## PC7 PC8 PC9 PC10
## [1,] -1.3398241 1.2252040 -0.43533736 -1.08645020
## [2,] -0.2877867 -0.3340612 0.09030705 0.26171719
## [3,] -1.4815112 -0.1693165 3.53207643 -0.52140348
## [4,] 1.3050660 0.2166619 -0.20253499 -0.70411648
## [5,] 0.4708046 0.1893536 -1.42183396 -0.00935545
## [6,] -2.6508875 -0.9834861 0.04690857 -0.29286364
Column centering is performed by default with
center=TRUE
. Standardization of each column can be enabled
with scale=TRUE
. Alternatively, numeric vectors can be
passed to either argument for subtraction from and scaling of each
column.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.41.0 BiocSingular_1.23.0 knitr_1.49
## [4] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-1 jsonlite_1.8.9 compiler_4.4.2
## [4] BiocManager_1.30.25 crayon_1.5.3 rsvd_1.0.5
## [7] Rcpp_1.0.13-1 parallel_4.4.2 jquerylib_0.1.4
## [10] IRanges_2.41.2 yaml_2.3.10 fastmap_1.2.0
## [13] lattice_0.22-6 XVector_0.47.0 R6_2.5.1
## [16] S4Arrays_1.7.1 generics_0.1.3 ScaledMatrix_1.15.0
## [19] BiocGenerics_0.53.3 DelayedArray_0.33.3 MatrixGenerics_1.19.0
## [22] maketools_1.3.1 bslib_0.8.0 rlang_1.1.4
## [25] cachem_1.1.0 xfun_0.49 sass_0.4.9
## [28] sys_3.4.3 SparseArray_1.7.2 cli_3.6.3
## [31] zlibbioc_1.52.0 digest_0.6.37 grid_4.4.2
## [34] irlba_2.3.5.1 lifecycle_1.0.4 S4Vectors_0.45.2
## [37] evaluate_1.0.1 codetools_0.2-20 buildtools_1.0.0
## [40] beachmat_2.23.5 abind_1.4-8 stats4_4.4.2
## [43] rmarkdown_2.29 matrixStats_1.4.1 tools_4.4.2
## [46] htmltools_0.5.8.1