Cancer is an umbrella term that includes a range of disorders, from those that are fast-growing and lethal to indolent lesions with low or delayed potential for progression to death. One critical unmet challenge is that molecular disease subtypes characterized by relevant clinical differences, such as survival, are difficult to differentiate. With the advancement of multi-omics technologies, subtyping methods have shifted toward data integration in order to differentiate among subtypes from a holistic perspective that takes into consideration phenomena at multiple levels. However, these integrative methods are still limited by their statistical assumptions and their sensitivity to noise. In addition, they are unable to predict the risk scores of patients using multi-omics data.
To address this problem, we introduce Subtyping via Consensus Factor Analysis (SCFA), a novel method for cancer subtyping and risk prediction using consensus factor analysis. SCFA follows a three-stage hierarchical process to ensure the robustness of the discovered subtypes. First, the method uses an autoencoder to filter out genes with an insignificant contribution in characterizing each patient. Second, it applies a modified factor analysis to generate a collection of factor representations of the high-dimensional multi-omics data. Finally, it utilizes a consensus ensemble to find subtypes that are shared across all factor representations.
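The consensus idea in the final stage can be illustrated with a small toy sketch. This is not the SCFA implementation: PCA stands in for the modified factor analysis, k-means stands in for the per-representation clustering, and the function name consensus_clusters and its arguments are hypothetical. It only shows how cluster assignments from several low-dimensional representations might be merged through a co-association matrix.

```r
# Toy consensus clustering over several low-dimensional representations.
# ASSUMPTION: PCA and k-means are stand-ins chosen for illustration only;
# SCFA itself uses an autoencoder and a modified factor analysis.
consensus_clusters <- function(x, k, dims = 2:4, seed = 1) {
  set.seed(seed)
  n <- nrow(x)
  pcs <- prcomp(x, center = TRUE, scale. = FALSE)$x
  co <- matrix(0, n, n)  # co-association counts between samples
  for (d in dims) {
    # Cluster the samples using the first d principal components
    cl <- kmeans(pcs[, seq_len(d), drop = FALSE], centers = k, nstart = 10)$cluster
    # Add 1 for every pair of samples placed in the same cluster
    co <- co + outer(cl, cl, "==")
  }
  co <- co / length(dims)  # fraction of representations that agree
  # Final partition: hierarchical clustering on the disagreement matrix
  hc <- hclust(as.dist(1 - co), method = "average")
  cutree(hc, k = k)
}

# Two well-separated toy groups of 20 samples with 10 features each
set.seed(42)
toy <- rbind(matrix(rnorm(20 * 10, mean = 0), 20, 10),
             matrix(rnorm(20 * 10, mean = 5), 20, 10))
labels <- consensus_clusters(toy, k = 2)
table(labels)
```

Samples that are grouped together in most representations end up in the same final cluster, which is the intuition behind looking for subtypes shared across representations.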
To install SCFA, you need to install the R package from Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("SCFA")
SCFA depends on the torch package to build and train the autoencoders. When the SCFA package is loaded, it checks for the availability of C++ libtorch. The torch package can be used to install C++ libtorch, which is necessary for neural network computation.
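The libtorch check can also be done by hand before loading SCFA. The sketch below uses torch::torch_is_installed() and torch::install_torch() from the torch package; the wrapper name ensure_libtorch and its install argument are hypothetical and only serve this example.

```r
# Return TRUE if libtorch is available; optionally try to install it first.
# ASSUMPTION: ensure_libtorch() is an illustrative helper, not part of SCFA or torch.
ensure_libtorch <- function(install = FALSE) {
  if (!requireNamespace("torch", quietly = TRUE)) {
    message("The 'torch' R package is not installed.")
    return(FALSE)
  }
  if (!torch::torch_is_installed() && install) {
    torch::install_torch()  # downloads and installs the libtorch binaries
  }
  torch::torch_is_installed()
}

# Only report the status; pass install = TRUE to trigger the download
ensure_libtorch(install = FALSE)
```

Keeping the download behind an explicit argument avoids an unexpected large download in non-interactive sessions.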
Load the example data GBM. GBM is the Glioblastoma cancer dataset.
library(SCFA)
## libtorch is not installed. Use `torch::install_torch()` to download and install libtorch
library(survival)
# Load example data (GBM dataset). For other datasets, download the rds file from the Data folder at https://bioinformatics.cse.unr.edu/software/scfa/Data/ and load the rds object
data("GBM")
# List containing one matrix of microRNA data; other examples would have 3 matrices for 3 data types
dataList <- GBM$data
# Survival information
survival <- GBM$survival
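Because SCFA takes a list of matrices with samples in rows and features in columns, a quick sanity check of the input may catch orientation or sample-order problems early. The helper check_data_list below is hypothetical and not part of the SCFA package; it is a minimal sketch that only checks properties stated in this document plus matching row names across data types.

```r
# Check that a multi-omics input list looks like what SCFA expects:
# numeric matrices, same number of samples, same sample order.
# ASSUMPTION: check_data_list() is an illustrative helper, not an SCFA function.
check_data_list <- function(dataList) {
  stopifnot(is.list(dataList), length(dataList) >= 1)
  for (i in seq_along(dataList)) {
    m <- dataList[[i]]
    if (!is.matrix(m) || !is.numeric(m)) {
      stop("Element ", i, " is not a numeric matrix.")
    }
  }
  nSamples <- vapply(dataList, nrow, integer(1))
  if (length(unique(nSamples)) != 1) {
    stop("Matrices have different numbers of rows (samples).")
  }
  ids <- lapply(dataList, rownames)
  sameIds <- all(vapply(ids, function(x) identical(x, ids[[1]]), logical(1)))
  if (!sameIds) {
    stop("Sample names (row names) differ across data types.")
  }
  invisible(TRUE)
}

# Toy example: two data types measured on the same three samples
samples <- c("s1", "s2", "s3")
toyList <- list(
  mirna = matrix(rnorm(6), nrow = 3, dimnames = list(samples, NULL)),
  mrna  = matrix(rnorm(9), nrow = 3, dimnames = list(samples, NULL))
)
check_data_list(toyList)
```

With the example data, the same call would be check_data_list(dataList).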
We can use the main function SCFA to generate subtypes from multi-omics data. The input of this function is a list of matrices from different data types. Each matrix has rows as samples and columns as features. The output of this function is a subtype assignment for each patient. We can then perform survival analysis to determine the significance of the survival differences between the discovered subtypes.
# Generating subtyping result
set.seed(1)
subtype <- SCFA(dataList, seed = 1, ncores = 4L)
# Perform survival analysis on the result
coxFit <- coxph(Surv(time = Survival, event = Death) ~ as.factor(subtype), data = survival, ties="exact")
coxP <- round(summary(coxFit)$sctest[3],digits = 20)
print(coxP)
## pvalue
## 0.01213664
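As a complement to the Cox score test above, survival differences between subtypes can also be inspected with Kaplan-Meier estimates and a log-rank test from the survival package. The sketch below runs on simulated toy data so that it is self-contained; the data frame toy and its values are made up, and only the column names Survival and Death follow the example above.

```r
library(survival)

# Simulated stand-in for the survival table and subtype labels (NOT real data)
set.seed(1)
toy <- data.frame(
  Survival = c(rexp(30, rate = 0.10), rexp(30, rate = 0.03)),
  Death    = rbinom(60, size = 1, prob = 0.8),
  subtype  = rep(c(1, 2), each = 30)
)

# Kaplan-Meier estimate per subtype (median survival and event counts)
kmFit <- survfit(Surv(time = Survival, event = Death) ~ as.factor(subtype), data = toy)
print(kmFit)

# Log-rank test for a difference between the subtype survival curves
lr <- survdiff(Surv(time = Survival, event = Death) ~ as.factor(subtype), data = toy)
logrankP <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
print(logrankP)
```

To apply this to the SCFA result, one would replace toy with the survival table and use the subtype vector returned by SCFA as the grouping variable.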
We can use the function SCFA.class to predict the risk scores of patients using the survival information available in the training data. We need to provide the function with training data, the corresponding survival information, and testing data. The output is the risk score of each patient in the testing data. Patients with higher risk scores have a higher probability of experiencing the event earlier than patients with lower scores. The concordance index is used to confirm the agreement between the predicted risk scores and the survival information.
# Split data to train and test
set.seed(1)
idx <- sample.int(nrow(dataList[[1]]), round(nrow(dataList[[1]])/2) )
survival$Survival <- survival$Survival - min(survival$Survival) + 1 # Survival time must be positive
trainList <- lapply(dataList, function(x) x[idx, ] )
trainSurvival <- Surv(time = survival[idx,]$Survival, event = survival[idx,]$Death)
testList <- lapply(dataList, function(x) x[-idx, ] )
testSurvival <- Surv(time = survival[-idx,]$Survival, event = survival[-idx,]$Death)
# Perform risk prediction
result <- SCFA.class(trainList, trainSurvival, testList, seed = 1, ncores = 4L)
# Validation using concordance index
c.index <- survival::concordance(coxph(testSurvival ~ result))$concordance
print(c.index)
## [1] 0.5783241
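To make the meaning of the concordance index more concrete, here is a simplified hand-written version for right-censored data. It is a sketch for illustration only: the function name c_index is hypothetical, pairs with tied survival times are ignored, and it is not guaranteed to reproduce the value returned by survival::concordance in every case.

```r
# Simplified concordance index: among comparable pairs, the fraction in which
# the patient with the shorter observed survival has the higher risk score.
# A pair is comparable when the shorter time corresponds to an observed event.
# ASSUMPTION: c_index() is an illustrative helper; tied times are skipped.
c_index <- function(time, event, risk) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    if (event[i] != 1) next          # censored patients cannot be the earlier one
    for (j in seq_len(n)) {
      if (time[i] < time[j]) {       # patient i had the event before patient j
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1
        } else if (risk[i] == risk[j]) {
          concordant <- concordant + 0.5   # tied risk scores count as half
        }
      }
    }
  }
  concordant / comparable
}

# Toy example: risk scores perfectly ordered against survival time
toyTime  <- c(1, 2, 3, 4)
toyEvent <- c(1, 1, 1, 1)
c_index(toyTime, toyEvent, risk = c(4, 3, 2, 1))
```

A value near 1 means the scores rank patients in the same order as their survival, a value near 0.5 means the ranking is no better than chance.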
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] survival_3.7-0 SCFA_1.17.0 knitr_1.49 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-1 glmnet_4.1-8 bit_4.5.0
## [4] jsonlite_1.8.9 compiler_4.4.2 BiocManager_1.30.25
## [7] psych_2.4.6.26 Rcpp_1.0.13-1 parallel_4.4.2
## [10] callr_3.7.6 cluster_2.1.6 jquerylib_0.1.4
## [13] splines_4.4.2 BiocParallel_1.41.0 RhpcBLASctl_0.23-42
## [16] yaml_2.3.10 fastmap_1.2.0 lattice_0.22-6
## [19] R6_2.5.1 igraph_2.1.1 shape_1.4.6.1
## [22] iterators_1.0.14 snow_0.4-4 maketools_1.3.1
## [25] bslib_0.8.0 rlang_1.1.4 cachem_1.1.0
## [28] xfun_0.49 sass_0.4.9 sys_3.4.3
## [31] bit64_4.5.2 cli_3.6.3 magrittr_2.0.3
## [34] ps_1.8.1 foreach_1.5.2 digest_0.6.37
## [37] grid_4.4.2 processx_3.8.4 torch_0.13.0
## [40] nlme_3.1-166 lifecycle_1.0.4 coro_1.1.0
## [43] mnormt_2.1.1 evaluate_1.0.1 codetools_0.2-20
## [46] buildtools_1.0.0 rmarkdown_2.29 matrixStats_1.4.1
## [49] pkgconfig_2.0.3 tools_4.4.2 htmltools_0.5.8.1