This document is a presentation of the R implementation of the tool CIMICE. It shows the main features of this software and how it is built as a modular pipeline, with the goal of making it easy to change and update.
CIMICE is a tool in the field of tumor phylogenetics and its goal is to build a Markov Chain (called Cancer Progression Markov Chain, CPMC) in order to model tumor subtypes evolution. The input of CIMICE is a Mutational Matrix, so a boolean matrix representing altered genes in a collection of samples. These samples are assumed to be obtained with single-cell DNA analysis techniques and the tool is specifically written to use the peculiarities of this data for the CMPC construction.
CIMICE data processing and analysis can be divided in four section:
These steps will be presented in the following sections.
This implementation of CIMICE is built as a single library on its own:
and it requires the following libraries:
# Dataframe manipulation
library(dplyr)
# Plot display
library(ggplot2)
# Improved string operations
library(glue)
# Dataframe manipulation
library(tidyr)
# Graph data management
library(igraph)
# Remove transitive edges on a graph
# library(relations)
# Interactive graph visualization
library(networkD3)
# Interactive graph visualization
library(visNetwork)
# Correlation plot visualization
library(ggcorrplot)
# Functional R programming
library(purrr)
# Graph Visualization
library(ggraph)
# sparse matrices
library(Matrix)
CIMICE requires a boolean dataframe as input, structured as follows:
It is possible to load this information from a file. The default input format for CIMICE is the “CAPRI/CAPRESE” TRONCO format:
This is a scheme of CIMICE’s input format:
s\g gene_1 gene_2 ... gene_n
sample_1 1 0 ... 0
...
sample_m 1 1 ... 1
and this an example on how to load a dataset from the file system:
# Read input dataset in CAPRI/CAPRESE format
dataset.big <- read_CAPRI(system.file("extdata", "example.CAPRI", package = "CIMICE", mustWork = TRUE))
## sfmF ulaF dapB yibT phnL ispA
## GCF_001281685.1_ASM128168v1_genomic.fna 1 1 0 0 0 0
## GCF_001281725.1_ASM128172v1_genomic.fna 0 0 0 0 0 0
## GCF_001281755.1_ASM128175v1_genomic.fna 0 0 0 0 0 0
## GCF_001281775.1_ASM128177v1_genomic.fna 0 0 0 0 0 0
## GCF_001281795.1_ASM128179v1_genomic.fna 0 1 0 0 0 0
## GCF_001281815.1_ASM128181v1_genomic.fna 0 0 0 0 0 0
## [1] "ncol: 999 - nrow: 160"
Another option is to define directly the dataframe in R. This is made
easy by the functions make_dataset
and
update_df
, used as follows:
# genes
dataset <- make_dataset(A,B,C,D) %>%
# samples
update_df("S1", 0, 0, 0, 1) %>%
update_df("S2", 1, 0, 0, 0) %>%
update_df("S3", 1, 0, 0, 0) %>%
update_df("S4", 1, 0, 0, 1) %>%
update_df("S5", 1, 1, 0, 1) %>%
update_df("S6", 1, 1, 0, 1) %>%
update_df("S7", 1, 0, 1, 1) %>%
update_df("S8", 1, 1, 0, 1)
with the following outcome:
## 8 x 4 Matrix of class "dgeMatrix"
## A B C D
## S1 0 0 0 1
## S2 1 0 0 0
## S3 1 0 0 0
## S4 1 0 0 1
## S5 1 1 0 1
## S6 1 1 0 1
## S7 1 0 1 1
## S8 1 1 0 1
In the case your data is composed by samples with associated frequencies it is possible to use an alternative format that we call “CAPRIpop”:
s/g gene_1 gene_2 ... gene_n freq
sample_1 1 0 ... 0 freq_s1
...
sample_m 1 1 ... 1 freq_sm
where the freq
column is mandatory and sample must not
be repeated. Frequencies in the freq
column will be
automatically normalized. This format is meant to be used with the
functions quick_run(dataset, mode="CAPRIpop")
for the full
analysis and dataset_preprocessing_population(...)
for the
preprocessing stage only. The subsequent operations remain otherwise
equal to those of the default format.
Another option is to compute a mutational matrix directly from a MAF file, which can be done as follows:
# path to MAF file
read_MAF(system.file("extdata", "paac_jhu_2014_500.maf", package = "CIMICE", mustWork = TRUE))[1:5,1:5]
## -Reading
## -Validating
## -Silent variants: 49
## -Summarizing
## --Possible FLAGS among top ten genes:
## HMCN1
## -Processing clinical data
## --Missing clinical data
## -Finished in 0.064s elapsed (0.084s cpu)
## 5 x 5 sparse Matrix of class "dgCMatrix"
## CSMD2 C2orf61 DCHS2 DPYS GPR158
## ACINAR28 1 1 . . .
## ACINAR27 1 . . . .
## ACINAR25 1 . . . .
## ACINAR02 . . 1 . 1
## ACINAR13 . . 1 . .
This implementation of CIMICE includes simple functions to quickly analyze the distributions of mutations among genes and samples.
The following code displays an histogram showing the distribution of the number of mutations hitting a gene:
And this does the same but from the samples point of view:
In case of huge dataset, it could be necessary to focus only on a subset of the input samples or genes. The following procedures aim to provide an easy way to do so when the goal is to use the most (or least) mutated samples or genes.
Keeps the first n (=100) most mutated genes:
## eptA yohC yedK yeeO narU rsxC
## GCF_001281685.1_ASM128168v1_genomic.fna 1 1 1 1 1 1
## GCF_001281725.1_ASM128172v1_genomic.fna 1 1 1 0 0 0
## GCF_001281755.1_ASM128175v1_genomic.fna 1 1 1 1 1 1
## GCF_001281775.1_ASM128177v1_genomic.fna 1 1 1 1 1 0
## GCF_001281795.1_ASM128179v1_genomic.fna 1 1 1 1 1 1
## GCF_001281815.1_ASM128181v1_genomic.fna 1 1 1 1 1 1
## [1] "ncol: 100 - nrow: 160"
Keeps the first n (=100) least mutated samples:
## sfmF ulaF dapB yibT phnL ispA
## GCF_001281725.1_ASM128172v1_genomic.fna 0 0 0 0 0 0
## GCF_001281855.1_ASM128185v1_genomic.fna 0 0 0 0 0 0
## GCF_001281815.1_ASM128181v1_genomic.fna 0 0 0 0 0 0
## GCF_001281775.1_ASM128177v1_genomic.fna 0 0 0 0 0 0
## GCF_001607735.1_ASM160773v1_genomic.fna 0 0 0 0 0 0
## GCF_001297965.1_ASM129796v1_genomic.fna 0 0 0 0 0 0
## [1] "ncol: 999 - nrow: 100"
It is easy to combine these selections by using the pipe operator
%>%
:
## eptA yohC yedK argK yeeO narU
## GCF_001281725.1_ASM128172v1_genomic.fna 1 1 1 1 0 0
## GCF_001281855.1_ASM128185v1_genomic.fna 1 1 1 1 0 0
## GCF_001281815.1_ASM128181v1_genomic.fna 1 1 1 1 1 1
## GCF_001281775.1_ASM128177v1_genomic.fna 1 1 1 1 1 1
## GCF_001607735.1_ASM160773v1_genomic.fna 1 1 1 1 1 1
## GCF_001297965.1_ASM129796v1_genomic.fna 1 1 1 1 1 1
## [1] "ncol: 100 - nrow: 100"
It may be of interest to show correlations among gene or sample
mutations. The library corrplots
provides an easy way to do
so by preparing an heatmap based on the correlation matrix. We can show
these plots by using the following comands:
gene mutations correlation:
sample mutations correlation:
The first step of the CIMICE algorithm is based on grouping the
genotypes contained in the dataset to compute their observed
frequencies. In this implementation we used a simple approach using the
library dplyr
. However, this solution is not optimal from
an efficiency point of view and might be problematic with very large
datasets. An Rcpp implementation is planned and, moreover, it is
possible to easily modify this step by changing the algorithm as long as
its output is a dataframe containing only unique genotypes with an
additional column named “freq” for the observed frequencies count.
## $matrix
## 5 x 4 Matrix of class "dgeMatrix"
## A B C D
## S1 0 0 0 1
## S2 1 0 0 0
## S4 1 0 0 1
## S7 1 0 1 1
## S5 1 1 0 1
##
## $counts
## [1] 1 2 1 1 3
##
## $row_names
## $row_names[[1]]
## [1] "S1"
##
## $row_names[[2]]
## [1] "S2, S3"
##
## $row_names[[3]]
## [1] "S4"
##
## $row_names[[4]]
## [1] "S7"
##
## $row_names[[5]]
## [1] "S5, S6, S8"
The subsequent stage goal is to prepare the topology for the final Cancer Progression Markov Chain. We racall that this topology is assumed to be a DAG. These eraly steps are required to prepare the information necessary for this and the following pahses.
Convert dataset to matricial form:
## 5 x 4 Matrix of class "dgeMatrix"
## A B C D
## S1 0 0 0 1
## S2 1 0 0 0
## S4 1 0 0 1
## S7 1 0 1 1
## S5 1 1 0 1
Extract gene names:
## [1] "A" "B" "C" "D"
Compute observed frequency of each genotype:
## [1] 0.125 0.250 0.125 0.125 0.375
Add “Clonal” genotype to the dataset (if not present) that will be used as DAG root:
# prepare node labels listing the mutated genes for each node
labels <- prepare_labels(samples, genes)
if( is.null(compactedDataset$row_names) ){
compactedDataset$row_names <- rownames(compactedDataset$matrix)
}
matching_samples <- compactedDataset$row_names
# fix Colonal genotype absence, if needed
fix <- fix_clonal_genotype(samples, freqs, labels, matching_samples)
samples = fix[["samples"]]
freqs = fix[["freqs"]]
labels = fix[["labels"]]
matching_samples <- fix[["matching_samples"]]
## 6 x 4 Matrix of class "dgeMatrix"
## A B C D
## S1 0 0 0 1
## S2 1 0 0 0
## S4 1 0 0 1
## S7 1 0 1 1
## S5 1 1 0 1
## Clonal 0 0 0 0
Build the topology of the graph based on the “superset” relation:
and finally prepare and show with the current topology of the graph:
that can be (badly) plotted using basic igraph:
In this sections, it is shown how to call the procedures to the four steps weight computation used in CIMICE. This is in fact based in computing “UP” weights, normalized “UP” weights, “DOWN” weights and normalized “DOWN” weights.
The process is based on the graph adjacency matrix “A”:
## Warning: `as_adj()` was deprecated in igraph 2.1.0.
## ℹ Please use `as_adjacency_matrix()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## 6 x 6 sparse Matrix of class "dgCMatrix"
##
## [1,] . . 1 . . .
## [2,] . . 1 . . .
## [3,] . . . 1 1 .
## [4,] . . . . . .
## [5,] . . . . . .
## [6,] 1 1 . . . .
and on the number of successors for each node:
## [1] 1 1 2 0 0 2
$$W_{up}(\langle a,b \rangle \in E) = \frac{1}{ |\Lambda_a|}(P(a) + \sum_{x\in \Pi_a}W_{up}(\langle x,a \rangle)) $$ given that Λa and Πa denote the set of children of a node a and the set of parents of a node a respectively and that P(a) is the observed frequency of node a.
## [1] 0.000 0.000 0.125 0.250 0.250 0.250
$$\overline{W}_{up}(\langle a,b \rangle \in E_1) = \begin{cases} 1 & \mbox{if $a[0]=\emptyset$} \\ \frac{ W_{up}(\langle a,b \rangle \in E)}{\sum_{x \in \Pi_b} W_{up}(\langle x,b \rangle)} & \mbox{else} \end{cases}$$
## [1] 1.0000000 1.0000000 0.3333333 0.6666667 1.0000000 1.0000000
$$W_{down}(\langle a,b \rangle) = \overline{W}_{up}(\langle a,b \rangle)(P(b) + \sum_{x\in \Lambda_b}W_{down}(\langle b,x \rangle))$$
## [1] 0.3333333 0.6666667 0.2083333 0.4166667 0.1250000 0.3750000
$$ P(\langle a,b \rangle) = \overline{W}_{down}(\langle a,b \rangle) = \frac{W_{down}(\langle a,b \rangle)}{\sum_{x\in \Lambda_a}{W_{down}(\langle a,x \rangle)}} $$
## [1] 0.3333333 0.6666667 1.0000000 1.0000000 0.2500000 0.7500000
To better show the results of the analysis there were prepared three
ouput methods based on three different libraries: ggraph
,
networkD3
and visNetwork
. These libraries
improve the dafault igraph
output visualization. Note that
output interaction is disabled in this document, check the Quick Guide
instead.
This is the output based on ggraph
, it is ideal for
small graphs but, for legibility reasons, it is better not to use it
with long labels.
The networkD3
is a quite valid interactive approach, but
it lacks the option to draw labels on edges, limiting the representation
to thicker or thinner edges.
The visNetwork
approach is overall the best for
interactive purposes. It allows almost arbitrary long labels, as it is
compatible with HTML
elements and in particular with
textboxes and the “hovering condition” for vertex and edges.
Finally, it is also possible to export CIMICE’s output to the standard dot format for use in other visualization applications.
## digraph G{
## 1[label="S1"]
## 2[label="S2, S3"]
## 3[label="S4"]
## 4[label="S7"]
## 5[label="S5, S6, S8"]
## 6[label="Clonal"]
## 6 -> 1[label="0.333"]
## 6 -> 2[label="0.667"]
## 1 -> 3[label="1"]
## 2 -> 3[label="1"]
## 3 -> 4[label="0.25"]
## 3 -> 5[label="0.75"]
## }
This vignette was prepared using a R session with the following specifications:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Matrix_1.7-1 ggraph_2.2.1 purrr_1.0.2 ggcorrplot_0.1.4.1
## [5] visNetwork_2.1.2 networkD3_0.4 igraph_2.1.1 tidyr_1.3.1
## [9] glue_1.8.0 ggplot2_3.5.1 dplyr_1.1.4 CIMICE_1.15.0
## [13] BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.48 bslib_0.8.0
## [4] htmlwidgets_1.6.4 ggrepel_0.9.6 lattice_0.22-6
## [7] vctrs_0.6.5 tools_4.4.1 generics_0.1.3
## [10] tibble_3.2.1 fansi_1.0.6 highr_0.11
## [13] pkgconfig_2.0.3 data.table_1.16.2 RColorBrewer_1.1-3
## [16] assertthat_0.2.1 lifecycle_1.0.4 stringr_1.5.1
## [19] compiler_4.4.1 farver_2.1.2 munsell_0.5.1
## [22] ggforce_0.4.2 graphlayouts_1.2.0 htmltools_0.5.8.1
## [25] sys_3.4.3 buildtools_1.0.0 sass_0.4.9
## [28] yaml_2.3.10 crayon_1.5.3 pillar_1.9.0
## [31] jquerylib_0.1.4 MASS_7.3-61 cachem_1.1.0
## [34] viridis_0.6.5 tidyselect_1.2.1 digest_0.6.37
## [37] stringi_1.8.4 reshape2_1.4.4 labeling_0.4.3
## [40] maketools_1.3.1 splines_4.4.1 maftools_2.21.3
## [43] polyclip_1.10-7 fastmap_1.2.0 grid_4.4.1
## [46] colorspace_2.1-1 expm_1.0-0 cli_3.6.3
## [49] magrittr_2.0.3 tidygraph_1.3.1 survival_3.7-0
## [52] utf8_1.2.4 withr_3.0.2 scales_1.3.0
## [55] rmarkdown_2.28 gridExtra_2.3 memoise_2.0.1
## [58] DNAcopy_1.79.0 evaluate_1.0.1 knitr_1.48
## [61] viridisLite_0.4.2 rlang_1.1.4 Rcpp_1.0.13
## [64] tweenr_2.0.3 BiocManager_1.30.25 jsonlite_1.8.9
## [67] plyr_1.8.9 R6_2.5.1