The consICa package: User’s manual

This document describes the usage of the functions integrated into the package and is meant to be a reference document for the end-user.

Introduction

Independent component analysis (ICA) of omics data can be used for deconvolution of biological signals and technical biases, correction of batch effects, feature engineering for classification and integration of the data [4]. The consICA package allows building robust consensus ICA-based deconvolution of the data as well as running post-hoc biological interpretation of the results. The implementation of parallel computing in the package ensures efficient analysis using modern multicore systems.

Installing and loading the package

Load the package with the following command:

if (!requireNamespace("BiocManager", quietly = TRUE)) { 
    install.packages("BiocManager")
}
BiocManager::install("consICA")
library('consICA')

Example dataset

The usage of the package functions for an independent component deconvolution is shown for an example set of 472 samples and 2454 most variable genes from the skin cutaneous melanoma (SKCM) TCGA cohort. It stored as a SummarizedExperiment object. In addition the data includes metadata such as age, gender, cancer subtype etc. and survival time-to-event data. Use data as

library(SummarizedExperiment)
data("samples_data")
samples_data
#> class: SummarizedExperiment 
#> dim: 2454 472 
#> metadata(0):
#> assays(1): X
#> rownames(2454): A2ML1 AACSP1 ... ZNF883 ZP1
#> rowData names(0):
#> colnames(472): 3N.A9WB.Metastatic 3N.A9WC.Metastatic ...
#>   Z2.AA3S.Metastatic Z2.AA3V.Metastatic
#> colData names(103): age_at_initial_pathologic_diagnosis gender ... time
#>   event

# According to our terminology expression matrix X of 2454 genes and 472 samples
X <- data.frame(assay(samples_data))
dim(X)
#> [1] 2454  472
# variables described each sample
Var <- data.frame(colData(samples_data))
dim(Var)
#> [1] 472 103
colnames(Var)[1:20] # print first 20
#>  [1] "age_at_initial_pathologic_diagnosis" "gender"                             
#>  [3] "race"                                "Weight"                             
#>  [5] "Height"                              "BMI"                                
#>  [7] "Ethinicity"                          "Cancer.Type.Detailed"               
#>  [9] "Tumor.location.site"                 "ajcc_pathologic_tumor_stage"        
#> [11] "Cancer.stage.M"                      "Cancer.stage.N"                     
#> [13] "Cancer.stage.T"                      "initial_pathologic_dx_year"         
#> [15] "birth_days_to_initial_diagnosis"     "Year.of.Birth"                      
#> [17] "age.at.last.contact"                 "Year.of.last.contact"               
#> [19] "last_contact_days_to"                "age.at.death"
# survival time and event for each sample
Sur <- data.frame(colData(samples_data))[,c("time", "event")]
Var <- Var[,which(colnames(Var) != "time" & colnames(Var) != "event")]

Consensus independent component analysis

Independent component analysis (ICA) is an unsupervised method of signal deconvolution [3]. ICA decomposes combined gene expression matrix from multiple samples into meaningful signals in the space of genes (metagenes, S) and weights in the space of samples (weight matrix, M). Figure 1. ICA decomposes combined gene expression matrix from multiple samples into meaningful signals in the space of genes (metagenes, S) and weights in the space of samples (weight matrix, M). Biological processes and signatures of cell subtypes can be found in S, while M could be linked to patient groups and patient survival (Cox regression and Eq.1).

The consICA function calculates the consensus ICA for an expression matrix: X = 𝑆 × 𝑀, where S is a matrix of statistically independent and biologically meaningful signals (metagenes) and M – their weights (metasamples). You can set the number of components, the number of consensus runs. The function will print logs every show.every run. Use ncores for parallel calculation. To filter out genes (rows) with values lower than a threshold set the filter.thr argument.

set.seed(2022)
cica <- consICA(samples_data, ncomp=20, ntry=10, ncores=1, show.every=0)

The output of consensus ICA is a list with input data X (assays of SummarizedExperiment samples_data object) and it dimensions, consensus metagene S and weight matrix M, number of components, mean R2 between rows of M (mr2), stability as mean R2 between consistent columns of S in multiple tries (stab) and number of best iteration (ibest). For compact output use reduced = TRUE.

str(cica, max.level=1)
#> List of 10
#>  $ X        : num [1:2454, 1:472] 5.13 0 3.7 1.58 5.7 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ X_num    : num [1:2454, 1:472] 5.13 0 3.7 1.58 5.7 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ S        : num [1:2454, 1:20] 0.2929 -0.0551 0.2261 -0.474 0.1712 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ M        : num [1:20, 1:472] 0.427 -0.45 -0.131 -1.13 -0.6 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ mr2      : num [1:10] 0.0302 0.0328 0.0341 0.0298 0.0295 ...
#>  $ i.best   : int 5
#>  $ stab     : num [1:10, 1:20] 0.982 0.988 0.983 0.988 0.988 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ ncomp    : num 20
#>  $ nsamples : int 472
#>  $ nfeatures: int 2454

Now we can extract the names of features (rows in X and S matrices) and their false discovery rates values for positive/negative contribution to each component. Use getFeatures to extract them.

features <- getFeatures(cica, alpha = 0.05, sort = TRUE)
#Positive and negative affecting features for first components are
head(features$ic.1$pos)
#>        features fdr
#> DDX3Y     DDX3Y   0
#> EIF1AY   EIF1AY   0
#> GYG2P1   GYG2P1   0
#> KDM5D     KDM5D   0
#> NLGN4Y   NLGN4Y   0
#> PRKY       PRKY   0
head(features$ic.1$neg)
#>        features          fdr
#> XIST       XIST 0.000000e+00
#> PAGE2     PAGE2 1.718979e-06
#> MAGEB2   MAGEB2 3.286000e-05
#> FDCSP     FDCSP 4.664017e-05
#> FRG2DP   FRG2DP 1.478113e-04
#> DSCR8     DSCR8 2.588904e-04

Two lists of top-contributing genes resulted from the getFeatures – positively and negatively involved. The plot of genes contribution to the first component (metagene) is shown below.

icomp <- 1
plot(sort(cica$S[,icomp]),col = "#0000FF", type="l", ylab=("involvement"),xlab="genes",las=2,cex.axis=0.4, main=paste0("Metagene #", icomp, "\n(involvement of features)"),cex.main=0.6)

Estimate the variance explained by the independent components with estimateVarianceExplained. Use plotICVarianceExplained to plot it.

var_ic <- estimateVarianceExplained(cica)
p <- plotICVarianceExplained(cica)

Independent components could be linked to factors from meta data with ANOVA.

## Run ANOVA for components 1 and 5
# Get metadata from SummarizedExperiment object
# Var <- data.frame(SummarizedExperiment::colData(samples_data))
var_ic1 <- anovaIC(cica, Var, icomp = 1, color_by_pv = TRUE, plot = TRUE)

var_ic5 <- anovaIC(cica, Var, icomp = 5, mode = 'box', plot = FALSE)

head(var_ic1, 3)
#>   factor       p.value p.value_disp
#> 1 gender 2.373407e-186    2.37e-186
#> 2 Height  3.090046e-29     3.09e-29
#> 3 Weight  4.130014e-07     4.13e-07
head(var_ic5, 5)
#>                      factor      p.value p.value_disp
#> 1 RNASEQ.CLUSTER_CONSENHIER 1.623478e-32     1.62e-32
#> 2         Tumor.tissue.site 3.957786e-24     3.96e-24
#> 3       Tumor.location.site 1.356281e-23     1.36e-23
#> 4           ICD.10.TopLevel 2.765156e-15     2.77e-15
#> 5               Sample.Type 1.952728e-12     1.95e-12

In this example 1st component isolated gender differences. To remove them, the reconstruction of initial matrix could be done with zero weights for first component.

Enrichment analysis

For each component we can carry out an enrichment analysis. The genes with expression above the selected threshold in at least one sample, were used as a background gene list and significantly overrepresented(adj.p-value < alpha) GO terms. You can select the DB for search: biological process (“BP”), molecular function (“MF”) and/or cellular component (“CC”).

## To save time for this example load result of getGO(cica, alpha = 0.05, db = c("BP", "MF", "CC"))
if(!file.exists("cica_GOs_20_s2022.rds")){
  
  # Old version < 1.1.1 - GO was an external object. Add it to cica and rotate if need
  # GOs <- readRDS(url("http://edu.modas.lu/data/consICA/GOs_40_s2022.rds", "r"))
  # cica$GO <- GOs
  # cica <- setOrientation(cica)
  
  # Actual version >= 1.1.1
  cica <- readRDS(url("http://edu.modas.lu/data/consICA/cica_GOs_20_s2022.rds", "r"))
  saveRDS(cica, "cica_GOs_20_s2022.rds")
} else{
  cica <- readRDS("cica_GOs_20_s2022.rds")
}

## Search GO (biological process)
# cica <- getGO(cica, alpha = 0.05, db = "BP", ncores = 4)
## Search GO (biological process, molecular function, cellular component)
# cica <- getGO(cica, alpha = 0.05, db = c("BP", "MF", "CC"), ncores = 4)

You can compare lists of enriched GO from different runs or datasets. See function overlapGO() for more.

Survival analysis

Weights of the components (rows of matrix M) can be statistically linked to patient survival using Cox partial hazard regression [4]. In survivalAnalysis function adjusted p-values of the log-rank test are used to select significant components. However, the prognostic power of each individual component might not have been high enough to be applied to the patients from the new cohort. Therefore, survivalAnalysis integrates weights of several components, calculating the risk score (RS) with improved prognostic power. For each sample, its RS is the sum of the products of significant log-hazard ratios (LHR) of the univariable Cox regression, the component stability R2 and the standardized row of weight matrix M.

RS <- survivalAnalysis(cica, surv = Sur)

cat("Hazard score for significant components")
#> Hazard score for significant components
RS$hazard.score
#>       ic.2       ic.3       ic.5       ic.9      ic.10      ic.11      ic.15 
#>  0.1806700  0.3661800  0.2620083  0.2987519 -0.3415036 -0.1648588 -0.1555157 
#>      ic.16      ic.19 
#>  0.1930739  0.1813600

Automatic report generation

The best way to get a full description of extracted components is using our automatic report. We union all analyses of independent components into a single function-generated report.

# Generate report with independent components description
if(FALSE){
  saveReport(cica, Var=Var, surv = Sur)
}

The copy of this report you can find at http://edu.modas.lu/data/consICA/report_ICA_20.pdf

# delete loaded file after vignette run
unlink("cica_GOs_20_s2022.rds")

Session info

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SummarizedExperiment_1.35.5 Biobase_2.67.0             
#>  [3] GenomicRanges_1.57.2        GenomeInfoDb_1.41.2        
#>  [5] IRanges_2.39.2              S4Vectors_0.43.2           
#>  [7] BiocGenerics_0.53.0         MatrixGenerics_1.17.1      
#>  [9] matrixStats_1.4.1           consICA_2.5.0              
#> [11] BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1        farver_2.1.2            dplyr_1.1.4            
#>  [4] blob_1.2.4              Biostrings_2.75.0       fastmap_1.2.0          
#>  [7] fastICA_1.2-5.1         digest_0.6.37           lifecycle_1.0.4        
#> [10] topGO_2.57.0            survival_3.7-0          KEGGREST_1.45.1        
#> [13] RSQLite_2.3.7           magrittr_2.0.3          compiler_4.4.1         
#> [16] rlang_1.1.4             sass_0.4.9              tools_4.4.1            
#> [19] utf8_1.2.4              yaml_2.3.10             sm_2.2-6.0             
#> [22] knitr_1.48              labeling_0.4.3          S4Arrays_1.5.11        
#> [25] bit_4.5.0               DelayedArray_0.31.14    RColorBrewer_1.1-3     
#> [28] abind_1.4-8             BiocParallel_1.39.0     withr_3.0.2            
#> [31] sys_3.4.3               grid_4.4.1              fansi_1.0.6            
#> [34] colorspace_2.1-1        GO.db_3.20.0            ggplot2_3.5.1          
#> [37] scales_1.3.0            cli_3.6.3               rmarkdown_2.28         
#> [40] crayon_1.5.3            generics_0.1.3          RcppParallel_5.1.9     
#> [43] httr_1.4.7              DBI_1.2.3               cachem_1.1.0           
#> [46] splines_4.4.1           zlibbioc_1.51.2         parallel_4.4.1         
#> [49] AnnotationDbi_1.69.0    BiocManager_1.30.25     XVector_0.45.0         
#> [52] vctrs_0.6.5             Matrix_1.7-1            jsonlite_1.8.9         
#> [55] SparseM_1.84-2          bit64_4.5.2             maketools_1.3.1        
#> [58] jquerylib_0.1.4         glue_1.8.0              codetools_0.2-20       
#> [61] gtable_0.3.6            UCSC.utils_1.1.0        RcppZiggurat_0.1.6     
#> [64] munsell_0.5.1           tibble_3.2.1            pillar_1.9.0           
#> [67] htmltools_0.5.8.1       graph_1.83.0            GenomeInfoDbData_1.2.13
#> [70] R6_2.5.1                evaluate_1.0.1          lattice_0.22-6         
#> [73] highr_0.11              png_0.1-8               Rfast_2.1.0            
#> [76] pheatmap_1.0.12         memoise_2.0.1           bslib_0.8.0            
#> [79] Rcpp_1.0.13             SparseArray_1.5.45      org.Hs.eg.db_3.20.0    
#> [82] xfun_0.48               buildtools_1.0.0        pkgconfig_2.0.3

References

  1. Golebiewska, A., et al. Patient-derived organoids and orthotopic xenografts of primary and recurrent gliomas represent relevant patient avatars for precision oncology. Acta Neuropathol 2020;140(6):919-949.

  2. Scherer, M., et al. Reference-free deconvolution, visualization and interpretation of complex DNA methylation data using DecompPipeline, MeDeCom and FactorViz. Nat Protoc 2020;15(10):3240-3263.

  3. Sompairac, N.; Nazarov, P.V.; Czerwinska, U.; Cantini, L.; Biton, A,; Molkenov, A,; Zhumadilov, Z.; Barillot, E.; Radvanyi, F.; Gorban, A.; Kairov, U.; Zinovyev, A. Independent component analysis for unravelling complexity of cancer omics datasets. International Journal of Molecular Sciences 20, 18 (2019). https://doi.org/10.3390/ijms20184414

4.Nazarov, P.V., Wienecke-Baldacchino, A.K., Zinovyev, A. et al. Deconvolution of transcriptomes and miRNomes by independent component analysis provides insights into biological processes and clinical outcomes of melanoma patients. BMC Med Genomics 12, 132 (2019). https://doi.org/10.1186/s12920-019-0578-4