The rapid advancement of new technologies for omics data acquisition has resulted in the production of massive amounts of data that combine different omics types, such as gene expression, methylation and copy number variation data. These technologies enable the generation of large-scale, high-resolution and multidimensional data that can reveal the molecular mechanisms and interactions underlying biological phenomena. For instance, The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancers and matched normal samples spanning 33 cancer types. These data offer a rich source of information, but they are still underutilized because the new techniques and statistical models for data analysis have not kept up with the pace of data acquisition technologies. Indeed, most of the conventional approaches for omics data modeling depend on the comparison of only one data type across different groups of samples. Moreover, the analysis and integration of omics data pose significant challenges in terms of data interpretation. Therefore, new technologies for omics data acquisition require the development of novel methods and models for data processing and mining that can cope with the volume, variety and complexity of omics data.
In this work, we present gINTomics, our new Multi Omics integration R package, approaching Multi-Omics integration from a new perspective, with the goal to assess the impact of different omics on the final outcome of regulatory networks, that is gene expression. Therefore, for each gene/miRNA, we try to determine the association between its expression and that of its regulators, taking into account also genomic modifications such as copy number variations (CNV) and methylation.
To install this package:
gINTomics is designed to perform both two-Omics and Multi Omics
integrations. There are four different functions for the two-omics
integration. The run_cnv_integration
function can be used
to integrate Gene or miRNA Expression data with Copy Number Variation
data. The run_met_integration
is designed for the
integration between Gene Expression data with methylation data. When
both gene CNV and methylation data are available, the
run_genomic_integration
function can be used to integrate
them together with gene expression. Finally, the
run_tf_integration
function can be used for the integration
between the Expression of a target gene or miRNA and every kind of
regulator, such as Transcription Factors or miRNAs expression. The
package also gives the possibility to perform Multi Omics integration.
The run_multiomics
function takes as input a
MultiAssayExperiment, that can be generated with the help of the
create_multiassay
function, and will perform all the
possible integration available for the omics provided by the user. All
the integration functions exploit different statistical models depending
on the input data type, the available models are negative binomial and
linear regression. When the number of covariates is too high, a random
forest model will select only the most importants which will be included
in the integration models. In order to make the results more
interpretable, gINTomics provides a comprehensive and interactive
shiny app for the graphical representation of the results, the
shiny app can be called with the run_shiny
function, which
takes as input the output of the run_multiomics
function.
Moreover additional functions are available to build single plots to
visualize the results of the integration outside of the shiny app. In
the following sections, we use an example MultiAssayExperiment of
ovarian cancer to show how to use gINTomics
with the
different available integrations. The object contains all the available
input data types: Gene expression data, miRNA expression data, gene
methylation data, gene Copy Number Variations and miRNA Copy Number
Variations.
The package contains a pre built MultiAssayExperiment, anyway in this section we will divide it in signle omics matrices and see how to build a proper MultiAssayExperiment from zero.
# loading packages
library(gINTomics)
library(MultiAssayExperiment)
library(shiny)
data("mmultiassay_ov")
mmultiassay_ov
## A MultiAssayExperiment object of 5 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 5:
## [1] gene_exp: RangedSummarizedExperiment with 1000 rows and 20 columns
## [2] methylation: SummarizedExperiment with 1000 rows and 20 columns
## [3] cnv_data: SummarizedExperiment with 1000 rows and 20 columns
## [4] miRNA_exp: SummarizedExperiment with 1000 rows and 20 columns
## [5] miRNA_cnv_data: SummarizedExperiment with 1000 rows and 20 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
Here we will extract the single omics from the MultiAssayExperiment
## Here we just select part of the data o speed up the process
tmp <- lapply(experiments(mmultiassay_ov), function(x) x[1:400,])
mmultiassay_ov <- MultiAssayExperiment(experiments = tmp)
gene_exp_matrix <- as.matrix(assay(mmultiassay_ov[["gene_exp"]]))
miRNA_exp_matrix <- as.matrix(assay(mmultiassay_ov[["miRNA_exp"]]))
meth_matrix <- as.matrix(assay(mmultiassay_ov[["methylation"]]))
gene_cnv_matrix <- as.matrix(assay(mmultiassay_ov[["cnv_data"]]))
miRNA_cnv_matrix <- as.matrix(assay(mmultiassay_ov[["miRNA_cnv_data"]]))
Now let’s build a proper MultiAssayExperiment starting from single matrices
new_multiassay <- create_multiassay(methylation = meth_matrix,
gene_exp = gene_exp_matrix,
cnv_data = gene_cnv_matrix,
miRNA_exp = miRNA_exp_matrix,
miRNA_cnv_data = miRNA_cnv_matrix)
new_multiassay
## A MultiAssayExperiment object of 5 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 5:
## [1] gene_exp: matrix with 400 rows and 20 columns
## [2] methylation: matrix with 400 rows and 20 columns
## [3] cnv_data: matrix with 400 rows and 20 columns
## [4] miRNA_exp: matrix with 400 rows and 20 columns
## [5] miRNA_cnv_data: matrix with 400 rows and 20 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
In this section we will see how to perform a gene_expression~CNV+met
integration. The input data should be provided as matrices or
data.frames. Rows represent samples, while each column represents the
different response variables/covariates of the models. By default all
the integration functions will assume that expression data are obtained
through sequencing, you should set the sequencing_data
argument to FALSE if they are not. Expression data could be both
normalized or not normalized, the function is able to normalize data by
setting the normalize
argument to TRUE (default). If you
want you can specify custom interaction through the
interactions
argument otherwise the function will
automatically look for the genes of expression
in
cnv_data
. NOTE: if you customize genomics interactions,
linking more covariates to a single response, the output may be not more
compatible with the shiny visualization.
gene_genomic_integration <- run_genomic_integration(expression = t(gene_exp_matrix),
cnv_data = t(gene_cnv_matrix),
methylation = t(meth_matrix))
summary(gene_genomic_integration)
## Length Class Mode
## coef_data 3 data.frame list
## pval_data 3 data.frame list
## fdr_data 3 data.frame list
## residuals 20 data.frame list
## data 3 -none- list
In this section we will see how to perform an expression~CNV
integration. The input data (for both expression and CNV) should be
provided as matrices or data.frames. Rows represent samples, while each
column represents the different response variables/covariates of the
models. By default all the integration functions will assume that
expression data are obtained through sequencing, you should set the
sequencing_data
argument to FALSE if they are not.
Expression data could be both normalized or not normalized, the function
is able to normalize data by setting the normalize
argument
to TRUE (default). If you want you can specify custom interaction
through the interactions
argument otherwise the function
will automatically look for the genes of expression
in
cnv_data
.
gene_cnv_integration <- run_cnv_integration(expression = t(gene_exp_matrix),
cnv_data = t(gene_cnv_matrix))
summary(gene_cnv_integration)
## Length Class Mode
## coef_data 2 data.frame list
## pval_data 2 data.frame list
## fdr_data 2 data.frame list
## residuals 20 data.frame list
## data 3 -none- list
In this section we will see how to perform an expression~methylation
integration. The input data should be the same of
run_cnv_integration
, but the covariates matrix will contain
methylation data instead of Copy Number Variations data. Expression data
could be both normalized or not normalized, the function is able to
normalize data by setting the normalize
argument to TRUE
(default). If you want you can specify custom interaction through the
interactions
argument otherwise the function will
automatically look for the genes of expression
in
methylation
.
gene_met_integration <- run_met_integration(expression = t(gene_exp_matrix),
methylation = t(meth_matrix))
summary(gene_met_integration)
## Length Class Mode
## coef_data 2 data.frame list
## pval_data 2 data.frame list
## fdr_data 2 data.frame list
## residuals 20 data.frame list
## data 3 -none- list
In this section we will see how to perform an expression~TF
integration. In this case you can use as input data a single gene
expression matrix if both your TF and targets are genes and are
contained in the gene expression matrix. If you want the package to
automatically download the interactions between TFs and targets you have
to set the argument type
to “tf”. Otherwise you can also
specify your custom interactions providing them with the
interactions
argument. You can handle data normalization as
in the previous functions through the normalize
argument.
This function is designed to integrate TF expression data but it can
handle every type of numerical data representing a gene expression
regulator. So you can pass the regulators matrix to the
tf_expression
argument and specify your custom
interactions
. If expression data are not obtained through
sequencing, remember to set sequencing data
to FALSE.
## Warning in import_tf_target_interactions(organism = org[species]): 'import_tf_target_interactions' is deprecated.
## Use 'tf_target' instead.
## See help("Deprecated")
## Length Class Mode
## coef_data 2 data.frame list
## pval_data 2 data.frame list
## fdr_data 2 data.frame list
## residuals 20 data.frame list
## data 3 -none- list
In this section we will see how to perform an expression~miRNA
integration. Gene expression data will be provided through the
expression
argument while miRNA expression data through the
tf_expression
argument. Input matrices should follow the
same rules of the previous functions. If you want the package to
automatically download the interactions between miRNA and targets you
have to set the argument type
to “miRNA_target”. Otherwise
you can also specify your custom interactions providing them with the
interactions
argument. You can handle data normalization as
in the previous functions through the normalize
argument.
miRNA_target_integration <- run_tf_integration(expression = t(gene_exp_matrix),
tf_expression = t(miRNA_exp_matrix),
type = "miRNA_target")
## Warning in import_mirnatarget_interactions(organism = org[species]): 'import_mirnatarget_interactions' is deprecated.
## Use 'mirna_target' instead.
## See help("Deprecated")
## Length Class Mode
## coef_data 4 data.frame list
## pval_data 4 data.frame list
## fdr_data 4 data.frame list
## residuals 20 data.frame list
## data 3 -none- list
In this section we will see how to perform an miRNA~TF integration.
miRNA expression data will be provided through the
expression
argument while gene expression data through the
tf_expression
argument. Input matrices should follow the
same rules of the previous functions. If you want the package to
automatically download the interactions between TF and miRNA you have to
set the argument type
to “tf_miRNA”. Otherwise you can also
specify your custom interactions providing them with the
interactions
argument. You can handle data normalization as
in the previous functions through the normalize
argument.
tf_miRNA_integration <- run_tf_integration(expression = t(miRNA_exp_matrix),
tf_expression = t(gene_exp_matrix),
type = "tf_miRNA")
## Warning in import_tf_mirna_interactions(organism = org[species], resources = "TransmiR"): 'import_tf_mirna_interactions' is deprecated.
## Use 'tf_mirna' instead.
## See help("Deprecated")
## Length Class Mode
## coef_data 7 data.frame list
## pval_data 7 data.frame list
## fdr_data 7 data.frame list
## residuals 20 data.frame list
## data 3 -none- list
Finally we will see how to perform a complete integration using all the available omics. In order to run this function you need to generate a MultiAssayExperiment as showed at the beginning of this vignette. The function will automatically use all the omics contained in the MultiAssayExperiment to perform all the possible integrations showed before. Moreover, if genomic data are available (CNV and/or Methylation), the first step will be the genomic integration and all the following integrations that contain gene expression data as response variable will use instead the residuals of the genomic model in order to filter out the effect of CNV and/or methylation. This framework is used also for miRNA, but in this case only CNV data are supported.
## Warning in import_tf_target_interactions(organism = org[species]): 'import_tf_target_interactions' is deprecated.
## Use 'tf_target' instead.
## See help("Deprecated")
## Warning in import_tf_mirna_interactions(organism = org[species], resources = "TransmiR"): 'import_tf_mirna_interactions' is deprecated.
## Use 'tf_mirna' instead.
## See help("Deprecated")
## Warning in import_mirnatarget_interactions(organism = org[species]): 'import_mirnatarget_interactions' is deprecated.
## Use 'mirna_target' instead.
## See help("Deprecated")
## Length Class Mode
## gene_genomic_res 5 -none- list
## mirna_cnv_res 5 -none- list
## tf_res 5 -none- list
## tf_mirna_res 5 -none- list
## mirna_target_res 5 -none- list
Now let’s see how you can visualize the results of your integration models.
gINTomics provides an interactive environment, based on Shiny, that
allows the user to easily visualize output results from integration
models and to save them for downstream tasks and reports creation. Once
multiomics integration has been performed users can provide the results
to the run_shiny
function. NOTE: Only the output of the
run_multiomics
function is compatible with
run_shiny
Network plot shows the significant interactions between
transcriptional regulators (TFs and miRNAs, if present) and their
targets genes/miRNA. Nodes and edges are selected ordering them by the
most high coefficient values (absolute value) and by default the top 200
interactions are showed. You can change the number of interactions with
the num_interactions
argument. Here you can see the code to
generate a network from a multiomics integration
## 2162 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
## 'select()' returned 1:many mapping between keys and columns
The Venn Diagram is designed for the genomic integration. It can help to identify genes which are significantly regulated by both CNV and methylation. Here you can see the code to generate a Venn Diagram from a multiomics integration
The Volcano Plot shows the distribution of integration coefficients for every integration type associated with a genomic class (cnv, met, cnv-mirna). For each integration coefficient, on the y axis you have the -log10 of Pvalue/FDR and on the x axis the value of the coefficient. Here you can see the code to generate a Volcano plot for CNV and one for methylation from a multiomics integration
The ridgeline plot is designed to compare different distributions, it has been integrated in the package with the aim to compare the distribution of significant and non significant coefficients returned by our integration models. For each distribution, on the y axis you have the frequencies and on the x axis the values of the coefficients. Here you can see the code to generate a Ridgeline plot for CNV and one for methylation from a multiomics integration
This barplot highlights the distribution of significant and non significant covariates among chromosomes. This kind of visualization will identify chromosomes in which the type of regulation under analysis is particularly active. Here you can see the code to generate a chromosome distribution plot specif for the genomic integration, starting from the results of the multiomics integration.
This barplot highlights the number of significant tragets for each TFs. Here you can see the code to generate a TF distribution plot starting from the results of the multiomics integration.
The enrichment plot shows the enrichment results obtained with enrichGO and enrichKEGG (clusterProfiler). The genomic enrichment is performed providing the list of genes significantly regulated by methylation or CNV, while the transcriptional one with the list of genes significantly regulated by each transcription factor (we run an enrichment for each TF that significantly regulates at least 12 targets) Here you can see the code to run a genomic enrichment starting from the results of the multiomics integration and to visualize the top results with a DotPlot
Here is the output of sessionInfo() on the system on which this document was compiled.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] shiny_1.9.1 MultiAssayExperiment_1.31.5
## [3] SummarizedExperiment_1.35.5 Biobase_2.67.0
## [5] GenomicRanges_1.57.2 GenomeInfoDb_1.41.2
## [7] IRanges_2.39.2 S4Vectors_0.43.2
## [9] BiocGenerics_0.53.0 MatrixGenerics_1.17.1
## [11] matrixStats_1.4.1 gINTomics_1.3.0
## [13] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] fs_1.6.4
## [2] bitops_1.0-9
## [3] enrichplot_1.25.5
## [4] fontawesome_0.5.2
## [5] lubridate_1.9.3
## [6] httr_1.4.7
## [7] RColorBrewer_1.1-3
## [8] doParallel_1.0.17
## [9] tools_4.4.1
## [10] backports_1.5.0
## [11] utf8_1.2.4
## [12] R6_2.5.1
## [13] DT_0.33
## [14] lazyeval_0.2.2
## [15] GetoptLong_1.0.5
## [16] withr_3.0.2
## [17] graphite_1.51.1
## [18] prettyunits_1.2.0
## [19] gridExtra_2.3
## [20] cli_3.6.3
## [21] sass_0.4.9
## [22] randomForest_4.7-1.2
## [23] readr_2.1.5
## [24] ggridges_0.5.6
## [25] Rsamtools_2.21.2
## [26] systemfonts_1.1.0
## [27] yulab.utils_0.1.7
## [28] gson_0.1.0
## [29] DOSE_4.1.0
## [30] svglite_2.1.3
## [31] R.utils_2.12.3
## [32] InteractiveComplexHeatmap_1.13.0
## [33] limma_3.61.12
## [34] readxl_1.4.3
## [35] rstudioapi_0.17.1
## [36] RSQLite_2.3.7
## [37] visNetwork_2.1.2
## [38] generics_0.1.3
## [39] gridGraphics_0.5-1
## [40] shape_1.4.6.1
## [41] BiocIO_1.17.0
## [42] vroom_1.6.5
## [43] gtools_3.9.5
## [44] dplyr_1.1.4
## [45] zip_2.3.1
## [46] GO.db_3.20.0
## [47] Matrix_1.7-1
## [48] fansi_1.0.6
## [49] logger_0.4.0
## [50] abind_1.4-8
## [51] R.methodsS3_1.8.2
## [52] lifecycle_1.0.4
## [53] edgeR_4.3.21
## [54] yaml_2.3.10
## [55] qvalue_2.37.0
## [56] SparseArray_1.5.45
## [57] BiocFileCache_2.15.0
## [58] grid_4.4.1
## [59] blob_1.2.4
## [60] promises_1.3.0
## [61] shinydashboard_0.7.2
## [62] crayon_1.5.3
## [63] ggtangle_0.0.3
## [64] lattice_0.22-6
## [65] cowplot_1.1.3
## [66] GenomicFeatures_1.57.1
## [67] KEGGREST_1.45.1
## [68] sys_3.4.3
## [69] maketools_1.3.1
## [70] pillar_1.9.0
## [71] knitr_1.48
## [72] ComplexHeatmap_2.23.0
## [73] fgsea_1.31.6
## [74] rjson_0.2.23
## [75] clisymbols_1.2.0
## [76] codetools_0.2-20
## [77] fastmatch_1.1-4
## [78] glue_1.8.0
## [79] ggvenn_0.1.10
## [80] ggfun_0.1.7
## [81] data.table_1.16.2
## [82] vctrs_0.6.5
## [83] png_0.1-8
## [84] treeio_1.29.2
## [85] org.Mm.eg.db_3.20.0
## [86] cellranger_1.1.0
## [87] gtable_0.3.6
## [88] cachem_1.1.0
## [89] OmnipathR_3.13.28
## [90] xfun_0.48
## [91] S4Arrays_1.5.11
## [92] mime_0.12
## [93] tidygraph_1.3.1
## [94] iterators_1.0.14
## [95] statmod_1.5.0
## [96] nlme_3.1-166
## [97] ggtree_3.13.2
## [98] bit64_4.5.2
## [99] progress_1.2.3
## [100] filelock_1.0.3
## [101] bslib_0.8.0
## [102] colorspace_2.1-1
## [103] DBI_1.2.3
## [104] tidyselect_1.2.1
## [105] processx_3.8.4
## [106] bit_4.5.0
## [107] compiler_4.4.1
## [108] curl_5.2.3
## [109] shiny.gosling_1.1.0
## [110] rvest_1.0.4
## [111] httr2_1.0.5
## [112] graph_1.83.0
## [113] xml2_1.3.6
## [114] plotly_4.10.4
## [115] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
## [116] DelayedArray_0.33.1
## [117] rtracklayer_1.65.0
## [118] checkmate_2.3.2
## [119] scales_1.3.0
## [120] callr_3.7.6
## [121] rappdirs_0.3.3
## [122] stringr_1.5.1
## [123] digest_0.6.37
## [124] rmarkdown_2.28
## [125] XVector_0.45.0
## [126] htmltools_0.5.8.1
## [127] pkgconfig_2.0.3
## [128] highr_0.11
## [129] dbplyr_2.5.0
## [130] fastmap_1.2.0
## [131] rlang_1.1.4
## [132] GlobalOptions_0.1.2
## [133] htmlwidgets_1.6.4
## [134] UCSC.utils_1.1.0
## [135] farver_2.1.2
## [136] jquerylib_0.1.4
## [137] jsonlite_1.8.9
## [138] BiocParallel_1.41.0
## [139] GOSemSim_2.31.2
## [140] R.oo_1.26.0
## [141] RCurl_1.98-1.16
## [142] magrittr_2.0.3
## [143] kableExtra_1.4.0
## [144] GenomeInfoDbData_1.2.13
## [145] ggplotify_0.1.2
## [146] patchwork_1.3.0
## [147] munsell_0.5.1
## [148] Rcpp_1.0.13
## [149] ape_5.8
## [150] viridis_0.6.5
## [151] stringi_1.8.4
## [152] ggraph_2.2.1
## [153] zlibbioc_1.51.2
## [154] MASS_7.3-61
## [155] org.Hs.eg.db_3.20.0
## [156] plyr_1.8.9
## [157] parallel_4.4.1
## [158] ggrepel_0.9.6
## [159] Biostrings_2.75.0
## [160] graphlayouts_1.2.0
## [161] splines_4.4.1
## [162] hms_1.1.3
## [163] circlize_0.4.16
## [164] locfit_1.5-9.10
## [165] ps_1.8.1
## [166] igraph_2.1.1
## [167] buildtools_1.0.0
## [168] reshape2_1.4.4
## [169] biomaRt_2.63.0
## [170] XML_3.99-0.17
## [171] evaluate_1.0.1
## [172] BiocManager_1.30.25
## [173] selectr_0.4-2
## [174] tzdb_0.4.0
## [175] foreach_1.5.2
## [176] tweenr_2.0.3
## [177] httpuv_1.6.15
## [178] tidyr_1.3.1
## [179] purrr_1.0.2
## [180] polyclip_1.10-7
## [181] clue_0.3-65
## [182] ggplot2_3.5.1
## [183] BiocBaseUtils_1.9.0
## [184] ReactomePA_1.49.1
## [185] ggforce_0.4.2
## [186] xtable_1.8-4
## [187] restfulr_0.0.15
## [188] reactome.db_1.89.0
## [189] tidytree_0.4.6
## [190] later_1.3.2
## [191] viridisLite_0.4.2
## [192] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0
## [193] tibble_3.2.1
## [194] clusterProfiler_4.15.0
## [195] aplot_0.2.3
## [196] memoise_2.0.1
## [197] AnnotationDbi_1.69.0
## [198] GenomicAlignments_1.41.0
## [199] cluster_2.1.6
## [200] timechange_0.3.0