Multiple co-inertia analysis (MCIA) is a member of the family of joint dimensionality reduction (jDR) methods that extend unsupervised dimension reduction techniques such as Principal Components Analysis (PCA) and Non-negative Matrix Factorization (NMF) to datasets with multiple data blocks (alternatively called views) (Cantini, 2021).
Here, we present a new implementation in R of MCIA,
nipalsMCIA
, that uses an extension of Non-linear Iterative
Partial Least Squares (NIPALS) to solve the MCIA optimization problem
(Hanafi,
2011). This implementation has several features, including speed-up
over approaches that employ the Singular Value Decomposition (SVD),
several options for pre-processing and deflation to customize algorithm
performance, methodology to perform out-of-sample global embedding, and
analysis and visualization capabilities to maximize result
interpretation.
While there exist additional implementations of MCIA (e.g. mogsa, omicade4), ours is unique in providing a pipeline that incorporates pre-processing data options including those present in the original development of MCIA (including a theoretically grounded calculation of inertia, or total variance) with an iterative solver that shows speed-up for larger datasets, and is explicitly designed for simultaneous ease of use as a tool for multi-view data decomposition as well as a foundation for theoretical and computational development of MCIA and related methodology. A manuscript detailing our implementation is forthcoming.
In this vignette, we will cover the most important functions within
the nipalsMCIA
package as well as downstream analyses that
can help interpret the MCIA decomposition using a cancer data set from
Meng et al., 2016 that
includes 21 subjects with three data blocks. The data blocks include
mRNA levels (12895 features), microRNA levels (537 features) and protein
levels (7016 features).
The nipals_multiblock
function performs MCIA using the
NIPALS algorithm. nipals_multiblock
outputs a decomposition
that includes a low-dimensional embedding of the data in the form of
global scores, and the contributions of the data blocks (block score
weights) and features (global loadings) to these same global scores.
nipalsMCIA
provides several additional functions to
visualize, analyze, and interpret these results.
The nipals_multiblock
function accepts as input a MultiAssayExperiment
(MAE) object. Such objects represent a modern classed-based approach to
organizing multi-omics data in which each assay can be stored as an
individual experiment alongside relevant metadata for samples and
experiments. If users have a list of data blocks with matching sample
names (and optional sample-level metadata), we provide a simple
conversion function (simple_mae.R
) to generate an MAE
object. For more sophisticated MAE object construction, please consult
the MAE documentation.
In the context of the NCI-60 data set and this vignette, we will show you the power of MCIA to find important relationships between mRNA, microRNA and proteins. More specifically, we will show you how to interpret the global factor scores in Part 1: Interpreting Global Factor Scores and global loadings in Part 2: Interpreting Global Loadings.
# devel version
# install.packages("devtools")
devtools::install_github("Muunraker/nipalsMCIA", ref = "devel",
force = TRUE, build_vignettes = TRUE) # devel version
The NCI-60 data set has been included with the
nipalsMCIA
package and is easily available as shown
below:
# load the dataset which uses the name data_blocks
data(NCI60)
# examine the contents
data_blocks$miRNA[1:3, 1:3]
## MI0000060_miRNA MI0000061_miRNA MI0000061.1_miRNA
## CNS.SF_268 11.91 6.71 13.11
## CNS.SF_295 11.94 7.13 12.86
## CNS.SF_539 11.50 5.79 12.10
## 5-HT3C2_1_mrna A1BG-AS1_2_mrna A2LD1_3_mrna
## CNS.SF_268 0.53 0.35 -0.05
## CNS.SF_295 -0.42 0.54 -1.04
## CNS.SF_539 0.00 0.80 0.85
## STAU1_1_prot NRAS_2_prot HRAS_3_prot
## CNS.SF_268 5.712331 7.385177 5.758845
## CNS.SF_295 0.000000 6.327175 0.000000
## CNS.SF_539 0.000000 6.597432 0.000000
To convert data_blocks into an MAE object we provide the
simple_mae()
function:
We can compute the MCIA decomposition for r global factors. For our example, we take r = 10.
set.seed(42)
mcia_results <- nipals_multiblock(data_blocks_mae,
col_preproc_method = "colprofile",
num_PCs = 10, tol = 1e-12, plots = "none")
The result is an NipalsResult object containing several outputs from the decomposition:
## [1] "global_scores" "global_loadings" "block_score_weights"
## [4] "block_scores" "block_loadings" "eigvals"
## [7] "col_preproc_method" "block_preproc_method" "block_variances"
## [10] "metadata"
We describe the first two in more detail below, and will discuss several others in the remainder of the vignette. For additional details on the decomposition, see (Hanafi, 2011, Mattesich, 2022).
The global_scores
matrix is represented by F with dimensions n × r, where n is the number of samples and r is the number of factors chosen by
using the num_PCs = r
argument. Each column j of this matrix represents the
global scores for factor j,
$$ \mathbf{F} = \begin{pmatrix} | & |& & |\\ \mathbf{f}^{(1)} &\mathbf{f}^{(2)} & \dots & \mathbf{f}^{(r)}\\ | & |& & | \end{pmatrix} \in \mathbb{R}^{n \times r} $$
This matrix encodes a low-dimensional representation of the data set, with the i-th row representing a set of r-dimensional coordinates for the i-th sample.
The global_loadings
matrix is represented by A that is p × r, where p is the number of features across
all omics and r is as before.
Each column j of this matrix
represents the global loadings for factor j, i.e.
$$ \mathbf{A} = \begin{pmatrix} | & |& & |\\ \mathbf{a}^{(1)} &\mathbf{a}^{(2)} & \dots & \mathbf{a}^{(r)}\\ | & |& & | \end{pmatrix} \in \mathbb{R}^{p \times r} $$
This matrix encodes the contribution (loading
) of each
feature to the global score.
The remainder of this vignette will be broken down into two sections, Part 1: Interpreting Global Factor Scores and Part 2: Interpreting Global Loadings where we show how to interpret F and A, respectively.
In the introduction we showed how to calculate the MCIA decomposition
using nipals_multiblock()
but used the parameter
plots = "none"
to avoid the default plotting behavior of
this function. By default, this function will generate two plots which
help establish an initial intuition for the MCIA decomposition. Here we
will re-run nipals_multiblock()
with the default
plots
parameter (all
):
set.seed(42)
mcia_results <- nipals_multiblock(data_blocks_mae,
col_preproc_method = "colprofile",
num_PCs = 10, tol = 1e-12)
The first plot visualizes each sample using factor 1 and 2 as a lower dimensional representation (factor plot).
The second plot is a scree plot of normalized singular values corresponding to the variance explained by each factor.
For clustering, it is useful to look at global factor scores without
block factor scores. The projection_plot()
function can be
used to generate such a plot using
projection = "global"
.
In addition, scores can be colored by a meaningful label such as
cancer type which is highly relevant to NCI-60. To do so, the colData
slot of the associated MAE object must be loaded with sample-level
metadata prior to invoking projection_plot()
. The sample
metadata is composed of row names corresponding to the primary sample
names of the MAE object, and columns contain different metadata
(e.g. age, disease status, etc). For instance, each of the 21 samples in
the NCI-60 dataset represents a cell line with one of three cancer
types: CNS, Leukemia, or Melanoma. We have provided this metadata as
part of the data(NCI60)
dataset and we next show how it can
be included in the resulting MAE object using the colData
parameter in simple_mae()
:
## cancerType
## CNS.SF_268 CNS
## CNS.SF_295 CNS
## CNS.SF_539 CNS
## CNS.SNB_19 CNS
## CNS.SNB_75 CNS
## CNS.U251 CNS
# loading of mae with metadata
data_blocks_mae <- simple_mae(data_blocks, row_format = "sample",
colData = metadata_NCI60)
We now rerun nipals_multiblock()
using the updated MAE
object, where the colData
is passed to the
metadata
slot of the NipalsResult
instance,
and then visualize the global factor scores using the
projection_plot()
function.
# adding metadata as part of the nipals_multiblock() function
set.seed(42)
mcia_results <- nipals_multiblock(data_blocks_mae,
col_preproc_method = "colprofile",
plots = "none",
num_PCs = 10, tol = 1e-12)
The color_col
argument of projection_plot()
can then be used to determine which column of metadata
is
used for coloring the individual data points, in this case
cancerType
. color_pal
is used to assign a
color palette and requires a vector of colors
(i.e. c("blue", "red", "green")
). To help create this
vector we also provide get_metadata_colors()
, a helper
function (used below) that can be used with a
scales::<function>
to return an appropriate vector of
colors. Note: colors are applied by lexicographically sorting the list
of unique metadata values then assigning the first color to the first
value, second with second and so on.
# meta_colors = c("blue", "grey", "yellow") can use color names
# meta_colors = c("#00204DFF", "#7C7B78FF", "#FFEA46FF") can use hex codes
meta_colors <- get_metadata_colors(mcia_results, color_col = 1,
color_pal_params = list(option = "E"))
projection_plot(mcia_results, projection = "global", orders = c(1, 2),
color_col = "cancerType", color_pal = meta_colors,
legend_loc = "bottomleft")
Using this plot one can observe that global factor scores for factor 1 and 2 can separate samples into their cancer types.
A heatmap can be used to cluster samples based on global scores
across all factors using global_scores_heatmap()
. The
samples can be colored by the associated metadata using
color_cor
+ color_pal
as shown below.
global_scores_heatmap(mcia_results = mcia_results,
color_col = "cancerType", color_pal = meta_colors)
Like the projection plot, one can observe that the clustering can differentiate between cancer type quite effectively.
In addition to the global scores matrix, MCIA also calculates a global loadings matrix A that is p × r.
The global loadings matrix is calculated using a weighted sum of the
block loadings. These weights can be interpreted as the contribution of
each omic to the global factor score (i.e. pseudoeigenvalue),
and can be visualized using the function
block_weights_heatmap()
.
For example, we can observe that factors 7-9 have the strongest contribution from the protein data.
MCIA returns feature loadings for each factor that can be used to
understand important contributions. To get a sense of which omics
features show high contribution to each factor, one can plot feature
loadings for any two omics using the vis_load_plot
function. The features that take the most extreme values will be the
strongest contributors to that factor.
# colors_omics = c("red", "blue", "green")
# colors_omics <- get_colors(mcia_results, color_pal = colors_omics)
colors_omics <- get_colors(mcia_results)
vis_load_plot(mcia_results, axes = c(1, 4), color_pal = colors_omics)
## Warning: Use of `gl_f[[colnames(gl_f)[1]]]` is discouraged.
## ℹ Use `.data[[colnames(gl_f)[1]]]` instead.
## Warning: Use of `gl_f[[colnames(gl_f)[2]]]` is discouraged.
## ℹ Use `.data[[colnames(gl_f)[2]]]` instead.
From this plot, we can deduce that individual miRNA features tend to have the most extreme values in factors 1 and 4, thereby contributing most strongly to the global scores for those factors.
To dive into each factor users can generate a scree plot of top
loading values using the following two steps/functions: 1)
ord_loadings()
which calculates a list of ranked feature
loadings, and 2) vis_load_ord()
which takes as input ranked
values from step (1) to generate a scree plot. Features can be ranked
based on ascending or descending values and/or absolute value. Users can
also plot features from all omics together or focus on a specific omic
by using a corresponding block name (i.e. mrna).
Here, we visualize two ranked lists for Factor 1: using all omics, or just mrna.
# generate the visualizations
all_pos_1_vis <- vis_load_ord(mcia_results, omic = "all", factor = 1,
absolute = FALSE, descending = TRUE)
mrna_pos_1_vis <- vis_load_ord(mcia_results, omic = "mrna", factor = 1,
absolute = FALSE, descending = TRUE)
ggpubr::ggarrange(all_pos_1_vis, mrna_pos_1_vis)
The ordering itself can be easily obtained using the
ord_loadings(...)
function.
# obtain the loadings as plotted above
all_pos_1 <- ord_loadings(mcia_results, omic = "all", factor = 1,
absolute = FALSE, descending = TRUE)
mrna_pos_1 <- ord_loadings(mcia_results, omic = "mrna", factor = 1,
absolute = FALSE, descending = TRUE)
## loading omic omic_name factor
## MI0000284_miRNA 0.09759126 miRNA MI0000284_miRNA 1
## MI0000266_miRNA 0.09732188 miRNA MI0000266_miRNA 1
## MI0000267_miRNA 0.08809128 miRNA MI0000267_miRNA 1
## loading omic omic_name factor
## GNS_3797_mrna 0.01854570 mrna GNS_3797_mrna 1
## WASL_12215_mrna 0.01850120 mrna WASL_12215_mrna 1
## LAPTM4A_4988_mrna 0.01749899 mrna LAPTM4A_4988_mrna 1
The first plot shows that that miRNA has the overall top positive feature loadings for factor 1, whereas the second plot allows us to focus on only mRNA signals, where genes like GNS and WASL come up as top contributors.
Here, we return and visualize two ranked lists for Factor 2: using all omics, or just protein, this time ranking features by magnitude (absolute value).
# define the loadings
all_abs_2 <- vis_load_ord(mcia_results, omic = "all", factor = 2,
absolute = TRUE, descending = TRUE)
prot_abs_2 <- vis_load_ord(mcia_results, omic = "prot", factor = 2,
absolute = TRUE, descending = TRUE)
ggpubr::ggarrange(all_abs_2, prot_abs_2)
# obtain the loadings as plotted above
all_abs_2_vis <- ord_loadings(mcia_results, omic = "all", factor = 2,
absolute = TRUE, descending = TRUE)
prot_abs_2_vis <- ord_loadings(mcia_results, omic = "prot", factor = 2,
absolute = TRUE, descending = TRUE)
## loading omic abs omic_name factor
## MI0001735_miRNA -0.1505451 miRNA 0.1505451 MI0001735_miRNA 2
## MI0003132_miRNA -0.1326578 miRNA 0.1326578 MI0003132_miRNA 2
## MI0000744_miRNA -0.1254875 miRNA 0.1254875 MI0000744_miRNA 2
## loading omic abs omic_name factor
## THY1_1801_prot -0.02671046 prot 0.02671046 THY1_1801_prot 2
## IGFBP7_1337_prot -0.02612204 prot 0.02612204 IGFBP7_1337_prot 2
## COL5A1_6730_prot -0.02450671 prot 0.02450671 COL5A1_6730_prot 2
For factor 2, we again see that across all omics, miRNA is playing the strongest role. One important difference (relative to factor 1) is made by ranking the values according to the absolute value. Doing so allows us to identify 9 (out of 15) features with a strong negative loading value. The right plot shows that the negatively values protein features have the largest absolute values, indicating that they may play a distinct role from the miRNA in global scores for this factor.
We return and visualize a ranked list for Factor 4 with the top 60 features.
# visualization
all_4_plot <- vis_load_ord(mcia_results, omic = "all", factor = 4,
absolute = FALSE, descending = TRUE, n_feat = 60)
all_4_plot
# visualize the tabular data
all_4 <- ord_loadings(mcia_results, omic = "all", factor = 4,
absolute = FALSE, descending = TRUE)
One can observe that while miRNA dominate the top features, proteins are also represented.
The NCI-60 data set includes gene expression data, and its
corresponding global loading matrix is a gene by factor matrix. We can
learn more about the pathways over-represented in a given factor by
running the global loadings for the factor through a gene set enrichment
analysis (GSEA). Here, we look at factors 1 and 3. We run
gsea_report()
, which reports on the p-value of the most
significant pathway as well as the total number of significant pathways
for each factor.
# extract mRNA global loadings
mrna_gfscores <- nmb_get_gl(mcia_results)
mrna_rows <- str_detect(row.names(mrna_gfscores), "_mrna")
mrna_gfscores <- mrna_gfscores[mrna_rows, ]
# rename rows to contain HUGO based gene symbols
row.names(mrna_gfscores) <- str_remove(rownames(mrna_gfscores), "_[0-9]*_.*")
# load pathway data
path.database <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/6.2/c2.cp.reactome.v6.2.symbols.gmt"
pathways <- fgsea::gmtPathways(gmt.file = path.database)
# generate the GSEA report
geneset_report <- gsea_report(metagenes = mrna_gfscores, path.database,
factors = seq(1, 3), pval.thr = 0.05, nproc = 2)
The report comes in the form of a list where the first element is a data frame with summary level of the GSEA analysis per factor. Ideally, each factor is capturing a select number of pathways with a high significance. From this report (below) we can see that the most significant pathway is associated with Factor 3 and that there is a large variation in the number of total (significant) pathways ranging from 7 (Factor 8) to 143 (Factor 4).
## min_pval total_pathways
## Factor1 2.38e-09 119
## Factor2 2.81e-08 95
## Factor3 8.45e-21 88
As just mentioned, Factor 3 contains the most enriched gene set so we can re-run GSEA for this factor in order to get a full list of enrichment scores across all gene sets:
# re-running GSEA
factor3_paths <- fgseaMultilevel(pathways, stats = mrna_gfscores[, 3],
nPermSimple = 10000, minSize = 15,
nproc = 2, maxSize = 500)
## pathway padj
## <char> <char>
## 1: REACTOME CELL CYCLE 1.59e-20
## 2: REACTOME CELL CYCLE MITOTIC 4.50e-18
## 3: REACTOME GENERIC TRANSCRIPTION PATHWAY 5.81e-13
## 4: REACTOME DNA REPLICATION 1.92e-09
## 5: REACTOME MITOTIC M M G1 PHASES 2.24e-09
## 6: REACTOME PROCESSING OF CAPPED INTRON CONTAINING PR... 4.43e-09
We observe that the most significant gene set is the REACTOME_CELL_CYCLE gene set. Below we can take a look at the list of genes from REACTOME_CELL_CYCLE that likely contribute to factor 3. This analysis can be repeated as necessary to make sense of other gene based factor loadings.
# extracting REACTOME_CELL_CYCLE, the most significant gene set
sig_path3 <- factor3_paths[min(factor3_paths$padj) == factor3_paths$padj, ][1, ]
sig_path3$leadingEdge[[1]][1:10]
## [1] "XRN2" "FBXW7" "FBXL5" "USP11" "FBXW4" "NOP56" "FBXL3" "FBXO4" NA
## [10] NA
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] stringr_1.5.1 nipalsMCIA_1.5.2 ggpubr_0.6.0
## [4] ggplot2_3.5.1 fgsea_1.33.0 dplyr_1.1.4
## [7] ComplexHeatmap_2.23.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] rlang_1.1.4 magrittr_2.0.3
## [3] clue_0.3-66 GetoptLong_1.0.5
## [5] matrixStats_1.4.1 compiler_4.4.2
## [7] png_0.1-8 vctrs_0.6.5
## [9] pkgconfig_2.0.3 shape_1.4.6.1
## [11] crayon_1.5.3 fastmap_1.2.0
## [13] backports_1.5.0 XVector_0.47.0
## [15] labeling_0.4.3 rmarkdown_2.29
## [17] pracma_2.4.4 UCSC.utils_1.3.0
## [19] purrr_1.0.2 xfun_0.49
## [21] MultiAssayExperiment_1.33.1 zlibbioc_1.52.0
## [23] cachem_1.1.0 GenomeInfoDb_1.43.2
## [25] jsonlite_1.8.9 DelayedArray_0.33.3
## [27] BiocParallel_1.41.0 broom_1.0.7
## [29] parallel_4.4.2 cluster_2.1.8
## [31] R6_2.5.1 bslib_0.8.0
## [33] stringi_1.8.4 RColorBrewer_1.1-3
## [35] car_3.1-3 GenomicRanges_1.59.1
## [37] jquerylib_0.1.4 Rcpp_1.0.13-1
## [39] SummarizedExperiment_1.37.0 iterators_1.0.14
## [41] knitr_1.49 IRanges_2.41.2
## [43] BiocBaseUtils_1.9.0 Matrix_1.7-1
## [45] tidyselect_1.2.1 abind_1.4-8
## [47] yaml_2.3.10 doParallel_1.0.17
## [49] codetools_0.2-20 lattice_0.22-6
## [51] tibble_3.2.1 Biobase_2.67.0
## [53] withr_3.0.2 evaluate_1.0.1
## [55] circlize_0.4.16 pillar_1.10.0
## [57] BiocManager_1.30.25 MatrixGenerics_1.19.0
## [59] carData_3.0-5 foreach_1.5.2
## [61] stats4_4.4.2 generics_0.1.3
## [63] S4Vectors_0.45.2 munsell_0.5.1
## [65] scales_1.3.0 glue_1.8.0
## [67] maketools_1.3.1 tools_4.4.2
## [69] sys_3.4.3 data.table_1.16.4
## [71] RSpectra_0.16-2 ggsignif_0.6.4
## [73] buildtools_1.0.0 fastmatch_1.1-4
## [75] cowplot_1.1.3 tidyr_1.3.1
## [77] colorspace_2.1-1 GenomeInfoDbData_1.2.13
## [79] Formula_1.2-5 cli_3.6.3
## [81] S4Arrays_1.7.1 viridisLite_0.4.2
## [83] gtable_0.3.6 rstatix_0.7.2
## [85] sass_0.4.9 digest_0.6.37
## [87] BiocGenerics_0.53.3 SparseArray_1.7.2
## [89] farver_2.1.2 rjson_0.2.23
## [91] htmltools_0.5.8.1 lifecycle_1.0.4
## [93] httr_1.4.7 GlobalOptions_0.1.2