The netboost package implements a three-step dimension reduction technique. First, a boosting-based filter is combined with the topological overlap measure to identify the essential edges of the network. Second, sparse hierarchical clustering is applied to the selected edges to identify modules. Third, module information is aggregated by the first principal components. The primary analysis is then carried out on these summary measures instead of the original data.
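To give a feel for the topological overlap measure used in the first step, here is a minimal base R sketch on toy data. It assumes the dense, unsigned WGCNA-style formulation computed over all feature pairs; netboost itself applies the measure only to edges selected by the boosting filter, so this is an illustration of the concept and not the package's internal code. The toy data and all variable names are hypothetical.

```r
# Toy data: 20 samples, 6 features; features 1-3 share a common signal.
set.seed(1)
signal <- rnorm(20)
x <- cbind(
  sapply(1:3, function(i) signal + rnorm(20, sd = 0.3)),
  matrix(rnorm(20 * 3), nrow = 20)
)
colnames(x) <- paste0("f", 1:6)

# Soft-thresholded (unsigned) adjacency: |cor|^soft_power, no self-loops.
soft_power <- 3
adj <- abs(cor(x))^soft_power
diag(adj) <- 0

# Topological overlap: weight of shared neighbours plus the direct link,
# normalised by the smaller of the two connectivities.
k <- rowSums(adj)
shared <- adj %*% adj
tom <- (shared + adj) / (outer(k, k, pmin) + 1 - adj)
diag(tom) <- 1

# Dissimilarity that a clustering step could work on.
tom_diss <- 1 - tom
round(tom[1:3, 1:3], 2)
```

Features that share a signal end up with a high overlap, while pairs of pure-noise features stay close to zero, which is what makes the measure useful for separating essential edges from incidental ones.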
The package comes with an example dataset. We import the acute myeloid leukemia patient data from The Cancer Genome Atlas public domain database. The dataset consists of 500 features (DNA methylation sites and gene expression levels) on chromosome 18 for 80 patients.
## Loading required package: netboost
##
## netboost 2.15.0 loaded
## Default CPU cores: 1
## [1] 80 500
The netboost() function integrates all major analysis steps and generates multiple plots. In this step we also set analysis parameters:
stepno: defines the number of boosting steps taken
soft_power: the exponent in the transformation of the correlation (if null, automatically chosen)
min_cluster_size: the minimal size of clusters
n_pc: the number of maximally computed principal components
scale: if data should be scaled and centered prior to analysis
ME_diss_thres: defines the merging threshold for identified clusters
For details on the options please see ?netboost and the corresponding paper Schlosser et al. 2020.
results <- netboost(datan = tcga_aml_meth_rna_chr18, stepno = 20L,
soft_power = 3L, min_cluster_size = 10L, n_pc = 2, scale = TRUE, ME_diss_thres = 0.25)
## idx: 1 (0.2%) - Fri Nov 29 08:28:15 2024
##
## Netboost extracted 10 modules (including background) with an average size of 17.5555555555556 (excluding background) from Tree 1.
##
## Netboost detected 9 modules and 1 background modules in 1 trees resulting in 15 aggregate measures.
## Average size of the modules was 17.5555555555556.
## 342 of 500 features (68.4%) were not assigned to modules.
For each independent tree detected in the dataset (here one), three graphs are produced: the first shows a dendrogram of the initial modules and the level at which they are merged, the second shows the module dendrogram after merging, and the third shows the dendrogram of features including the module color code.
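The merging of initial modules can be illustrated with a small base R sketch. It assumes that modules are merged when the dissimilarity (one minus the correlation) of their summary measures falls below ME_diss_thres, using average-linkage clustering as in WGCNA-style workflows; the exact procedure inside netboost may differ. The module summaries here are simulated and hypothetical.

```r
set.seed(42)
n <- 50
base_a <- rnorm(n)
base_b <- rnorm(n)

# Hypothetical module summaries: ME1 and ME2 are near-duplicates, ME3 is distinct.
MEs <- cbind(
  ME1 = base_a + rnorm(n, sd = 0.2),
  ME2 = base_a + rnorm(n, sd = 0.2),
  ME3 = base_b
)

# Cluster modules on 1 - correlation and cut the tree at the merging threshold.
ME_diss_thres <- 0.25
ME_diss <- as.dist(1 - cor(MEs))
tree <- hclust(ME_diss, method = "average")
merged <- cutree(tree, h = ME_diss_thres)
merged
```

Modules that receive the same label in `merged` would be combined into one module. Raising the threshold merges more aggressively, lowering it keeps more of the initial modules separate.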
results contains the dendrograms (dendros), feature identifiers (names) matched to module assignments (colors), the aggregated dataset (MEs), the rotation matrix used to compute the aggregated dataset (rotation) and the proportion of variance explained by the aggregate measures (var_explained). Depending on the minimum proportion of variance explained set in the netboost() call (default 0.5), up to n_pc principal components are exported.
## [1] "dendros" "names" "colors" "MEs"
## [5] "rotation" "var_explained" "filter"
## [1] "ME0_1_pc1" "ME0_1_pc2" "ME7_pc1" "ME1_pc1" "ME1_pc2" "ME2_pc1"
## [7] "ME2_pc2" "ME8_pc1" "ME5_pc1" "ME3_pc1" "ME3_pc2" "ME4_pc1"
## [13] "ME4_pc2" "ME9_pc1" "ME6_pc1"
As you can see, for most modules the first principal component already explains more than 50% of the variance in the original features of the module. ME0_X_pcY denotes the background module (unclustered features) of the independent tree X.
Explained variance is reported as a matrix for the first n_pc principal components. Here we list the first 5 modules:
## ME0_1 ME7 ME1 ME2 ME8
## PC1 0.06700469 0.61004480 0.4403646 0.49237958 0.59426699
## PC2 0.05502278 0.07484705 0.1174992 0.07779346 0.08341562
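The output above is consistent with a simple export rule: for ME7 the first component explains 0.61 of the variance and only ME7_pc1 is exported, while for ME1 the first component explains 0.44 and ME1_pc2 is exported as well. The following base R sketch shows one way such a rule could work, assuming the threshold is applied to the cumulative proportion of variance explained. The simulated module and the name `min_var_expl` are hypothetical and not taken from the netboost API.

```r
set.seed(7)
n <- 80
latent <- rnorm(n)

# Hypothetical module of 10 features driven by one latent factor plus noise.
module <- sapply(1:10, function(i) latent + rnorm(n, sd = 1.5))

n_pc <- 2            # maximum number of components to export
min_var_expl <- 0.5  # minimum cumulative proportion of variance explained

pca <- prcomp(module, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Smallest number of components reaching the threshold, capped at n_pc.
needed <- which(cumsum(var_explained) >= min_var_expl)[1]
n_export <- min(needed, n_pc)

# Aggregate measures: one column per exported component.
aggregate_measures <- pca$x[, seq_len(n_export), drop = FALSE]
round(var_explained[seq_len(n_pc)], 3)
```

A strongly correlated module needs only one column in `aggregate_measures`, whereas a noisier module is summarised by as many components as the cap allows.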
results$colors uses a numeric coding for the modules which matches their module names. To list the features of module ME8 we can extract them by:
## [1] "cg00027037" "cg00034852" "cg00220661" "cg00228017" "cg00366917"
## [6] "cg00430895" "cg00474194" "cg00481457" "cg00511081" "cg00539368"
## [11] "cg00576121" "cg00615915" "cg00736530" "cg00917154" "cg00940278"
## [16] "cg00955482"
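Such an extraction can be done with plain logical indexing. The sketch below uses a small hypothetical stand-in for the result object, with made-up feature identifiers, so that it runs without netboost; it assumes that `names` and `colors` are parallel vectors and that code 0 marks the background module.

```r
# Hypothetical stand-in for the `names` and `colors` elements of a netboost result.
mock_results <- list(
  names  = c("feat_a", "feat_b", "feat_c", "feat_d", "feat_e"),
  colors = c(8, 0, 8, 1, 8)
)

# Features assigned to module 8.
module_features <- mock_results$names[mock_results$colors == 8]
module_features

# Number of features per module code.
table(mock_results$colors)
```

With a real result object, the same indexing would be applied to results$names and results$colors.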
The final dendrogram including all trees can be plotted with labels (results$names) for individual features. colorsrandom controls whether module-color matching is randomized to get a clearly differentiable pattern of the potentially many modules. Labels are only suitable in applications with few features or with an appropriately large pdf device.
Next, the primary analysis can be carried out on the aggregated dataset (results$MEs). We also implemented a convenience function to transfer a clustering to a new dataset. Here, we transfer the clustering to the same dataset, resulting in identical aggregate measures.
ME_transfer <- nb_transfer(nb_summary = results,
new_data = tcga_aml_meth_rna_chr18, scale = TRUE)
all(round(results$MEs, 10) == round(ME_transfer, 10))
## [1] TRUE
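The idea behind transferring aggregate measures with a stored rotation matrix can be sketched in base R with prcomp. This is a conceptual illustration on simulated data, not the implementation of nb_transfer(); the helper `project()` is hypothetical, and it assumes new data are centered and scaled with the statistics of the original fit.

```r
set.seed(3)
train <- matrix(rnorm(40 * 5), nrow = 40,
                dimnames = list(NULL, paste0("f", 1:5)))
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Project a dataset onto stored loadings after the same centering and scaling.
project <- function(new_data, pca_fit) {
  scaled <- scale(new_data, center = pca_fit$center, scale = pca_fit$scale)
  scaled %*% pca_fit$rotation
}

# Projecting the training data reproduces the original scores.
scores_again <- project(train, pca)
all.equal(unname(scores_again), unname(pca$x))
```

Because the projection is just a matrix product with fixed loadings, it can be applied to any dataset that has the same features in the same order.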
Netboost now also has a fully non-parametric implementation. The corresponding code, which also showcases the multicore option, is not run here because the Bioconductor vignette builder does not allow multicore execution. Adjust cores to your machine.
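As a small aid for choosing a value for cores, the following sketch queries the number of available cores with the base parallel package and leaves one core free. The policy of reserving one core is an assumption made here for illustration, not a recommendation from the netboost documentation.

```r
# detectCores() may return NA on some platforms, so fall back to one core.
available <- parallel::detectCores()
if (is.na(available)) available <- 1L

# Leave one core free for the rest of the system, but never go below one.
cores <- max(1L, as.integer(available) - 1L)
cores
```

The resulting value could then be supplied wherever a number of cores is expected.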
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] netboost_2.15.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] WGCNA_1.73 blob_1.2.4 R.utils_2.12.3
## [4] Biostrings_2.75.1 fastmap_1.2.0 digest_0.6.37
## [7] rpart_4.1.23 lifecycle_1.0.4 cluster_2.1.6
## [10] survival_3.7-0 KEGGREST_1.47.0 RSQLite_2.3.8
## [13] magrittr_2.0.3 compiler_4.4.2 rlang_1.1.4
## [16] Hmisc_5.2-0 sass_0.4.9 tools_4.4.2
## [19] utf8_1.2.4 yaml_2.3.10 data.table_1.16.2
## [22] knitr_1.49 htmlwidgets_1.6.4 bit_4.5.0
## [25] foreign_0.8-87 BiocGenerics_0.53.3 sys_3.4.3
## [28] R.oo_1.27.0 nnet_7.3-19 dynamicTreeCut_1.63-1
## [31] grid_4.4.2 stats4_4.4.2 preprocessCore_1.69.0
## [34] fansi_1.0.6 colorspace_2.1-1 fastcluster_1.2.6
## [37] GO.db_3.20.0 ggplot2_3.5.1 scales_1.3.0
## [40] iterators_1.0.14 cli_3.6.3 rmarkdown_2.29
## [43] crayon_1.5.3 generics_0.1.3 RcppParallel_5.1.9
## [46] rstudioapi_0.17.1 httr_1.4.7 DBI_1.2.3
## [49] cachem_1.1.0 stringr_1.5.1 zlibbioc_1.52.0
## [52] splines_4.4.2 parallel_4.4.2 impute_1.81.0
## [55] AnnotationDbi_1.69.0 BiocManager_1.30.25 XVector_0.47.0
## [58] matrixStats_1.4.1 base64enc_0.1-3 vctrs_0.6.5
## [61] Matrix_1.7-1 jsonlite_1.8.9 IRanges_2.41.1
## [64] S4Vectors_0.45.2 bit64_4.5.2 htmlTable_2.4.3
## [67] Formula_1.2-5 maketools_1.3.1 foreach_1.5.2
## [70] jquerylib_0.1.4 glue_1.8.0 codetools_0.2-20
## [73] stringi_1.8.4 gtable_0.3.6 GenomeInfoDb_1.43.2
## [76] UCSC.utils_1.3.0 munsell_0.5.1 tibble_3.2.1
## [79] pillar_1.9.0 htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
## [82] R6_2.5.1 doParallel_1.0.17 evaluate_1.0.1
## [85] lattice_0.22-6 Biobase_2.67.0 backports_1.5.0
## [88] R.methodsS3_1.8.2 png_0.1-8 memoise_2.0.1
## [91] bslib_0.8.0 Rcpp_1.0.13-1 checkmate_2.3.2
## [94] gridExtra_2.3 xfun_0.49 buildtools_1.0.0
## [97] pkgconfig_2.0.3