Example workflow

This vignette shows an example workflow for ensemble biclustering analysis with the mosbi package. Every function in the package has a help page with detailed documentation. To access these pages, type help(package=mosbi) in the R console.

Load packages

Import dependencies.

library(mosbi)

Helper functions

Two helper functions are defined: one to calculate z-scores of the data and one to visualize the bicluster sizes as histograms.

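# Compute z-scores along rows (margin = 1) or columns (margin = 2)
z_score <- function(x, margin = 2) {
    z_fun <- function(y) {
        (y - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE)
    }

    if (margin == 2) {
        return(apply(x, margin, z_fun))
    } else if (margin == 1) {
        # apply() returns the row results as columns, so transpose back
        return(t(apply(x, margin, z_fun)))
    }
    stop("margin must be 1 (rows) or 2 (columns)")
}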

bicluster_histo <- function(biclusters) {
    cols <- mosbi::colhistogram(biclusters)
    rows <- mosbi::rowhistogram(biclusters)

    graphics::par(mfrow = c(1, 2))
    hist(cols, main = "Column size distribution")
    hist(rows, main = "Row size distribution")
}
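
As a quick sanity check of the row-wise z-scoring, the following sketch applies a copy of the z_score helper to a small made-up matrix. The matrix and its row and column names are hypothetical and only serve as an illustration; the helper is repeated so that the block runs on its own.

```r
# Copy of the z_score helper from above, so this block is self-contained
z_score <- function(x, margin = 2) {
    z_fun <- function(y) (y - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE)
    if (margin == 2) {
        return(apply(x, margin, z_fun))
    }
    t(apply(x, margin, z_fun))
}

# Hypothetical toy data: two "lipids" measured in three "samples"
toy <- matrix(c(1, 2, 3, 10, 20, 30),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("lipid_a", "lipid_b"), c("s1", "s2", "s3"))
)

# Row-wise scaling: each lipid ends up with mean 0 and standard deviation 1
scaled <- z_score(toy, 1)
print(scaled)
```

Both rows are mapped to the same scaled profile although their raw magnitudes differ by a factor of ten, which is the reason for z-scoring each lipid before biclustering.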

1. Download and prepare data

Biclustering is done on a data matrix. As an example, a lipidomics dataset from the MetaboLights database is used (https://www.ebi.ac.uk/metabolights/MTBLS562). The data consist of 40 samples (columns) and 245 lipids (rows).

# get data
data(mouse_data)

mouse_data <- mouse_data[c(
    grep(
        "metabolite_identification",
        colnames(mouse_data)
    ),
    grep("^X", colnames(mouse_data))
)]

# Make data matrix
data_matrix <- z_score(log2(as.matrix(mouse_data[2:ncol(mouse_data)])), 1)

rownames(data_matrix) <- mouse_data$metabolite_identification

stats::heatmap(data_matrix)

The data have a Gaussian-like distribution and no missing values, so we can proceed with biclustering.
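
The two properties mentioned here can also be checked numerically. The sketch below is only an illustration: example_matrix is a simulated stand-in with the same dimensions as the data matrix, not the mouse data, and the 68% rule is used as a rough indicator of a Gaussian-like shape rather than a formal test.

```r
# Simulated stand-in for the data matrix (hypothetical values)
set.seed(1)
example_matrix <- matrix(rnorm(245 * 40), nrow = 245, ncol = 40)

# Missing values, and non-finite values such as -Inf from log2(0)
n_missing <- sum(is.na(example_matrix))
n_nonfinite <- sum(!is.finite(example_matrix))

# For roughly Gaussian data, about 68% of the values lie within
# one standard deviation of the mean
centered <- example_matrix - mean(example_matrix)
share_within_1sd <- mean(abs(centered) <= sd(as.vector(example_matrix)))

cat("Missing:", n_missing, " Non-finite:", n_nonfinite, "\n")
cat("Share within one SD:", round(share_within_1sd, 3), "\n")
```

With real data, the same checks can be run on data_matrix directly, and a call such as hist(data_matrix) gives a visual impression of the distribution.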

2. Compute biclusters

The mosbi package works with the results of different biclustering algorithms and unites them in a single ensemble. Here, the results of four example algorithms are computed and converted to mosbi::bicluster objects. For a list of all supported biclustering algorithms/packages, type ?mosbi::get_biclusters in the R console.

# Fabia
# In case the algorithm throws an error, return an empty list
fb <- mosbi::run_fabia(data_matrix)
#> Cycle: 0Cycle: 20Cycle: 40Cycle: 60Cycle: 80Cycle: 100Cycle: 120Cycle: 140Cycle: 160Cycle: 180Cycle: 200Cycle: 220Cycle: 240Cycle: 260Cycle: 280Cycle: 300Cycle: 320Cycle: 340Cycle: 360Cycle: 380Cycle: 400Cycle: 420Cycle: 440Cycle: 460Cycle: 480Cycle: 500

# isa2
BCisa <- mosbi::run_isa(data_matrix)

# Plaid
BCplaid <- mosbi::run_plaid(data_matrix)
#> layer: 0 
#>  5882.744
#> layer: 1 
#> [1]   0 107  10
#> [1]   1 101  10
#> [1]  30 101  10
#> [1] 31 39 10
#> [1] 32 39  9
#> [1] 33 37  9
#> [1] 34 37  9
#> [1] 35 37  9
#> [1] 60 37  9
#> [1] 2
#> [1] 154.413   0.000   0.000   0.000
#> back fitting 2 times
#> layer: 2 
#> [1]  0 92 11
#> [1]  1 83 10
#> [1]  2 80 10
#> [1]  3 79 10
#> [1]  4 78 10
#> [1] 30 78 10
#> [1] 31 39 10
#> [1] 32 39  6
#> [1] 33 39  6
#> [1] 60 39  6
#> [1] 5
#> [1] 207.751   0.000   0.000   0.000
#> back fitting 2 times
#> layer: 3 
#> [1]  0 95 20
#> [1]  1 83 19
#> [1]  2 78 19
#> [1]  3 77 19
#> [1] 30 77 19
#> [1] 31  0 19
#> [1] 32
#> [1] 0 0 0 0
#>      
#> Layer Rows Cols  Df      SS    MS Convergence Rows Released Cols Released
#>     0  245   40 284 6213.04 21.88          NA            NA            NA
#>     1   37    9  45  298.40  6.63           1            64             1
#>     2   39    6  44  381.48  8.67           1            39             4

# QUBIC
BCqubic <- mosbi::run_qubic(data_matrix)

# Merge results of all algorithms
all_bics <- c(fb, BCisa, BCplaid, BCqubic)

bicluster_histo(all_bics)

The histograms show the distribution of bicluster sizes, separately for the number of columns (left) and rows (right) per bicluster. The total number of biclusters found can be checked with length(all_bics).
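
To illustrate what such size distributions summarize, the following sketch computes row and column counts for a few made-up biclusters. The plain lists with rows and cols entries are a hypothetical stand-in for mosbi::bicluster objects, and the counts are only assumed to correspond to what the histogram helper plots.

```r
# Hypothetical biclusters, each given by its row and column indices
toy_bics <- list(
    list(rows = c(1, 2, 3), cols = c(1, 2)),
    list(rows = c(2, 3, 4, 5), cols = c(2, 3, 4)),
    list(rows = c(10, 11), cols = c(1, 5, 6, 7))
)

# Number of rows and columns per bicluster
row_sizes <- vapply(toy_bics, function(b) length(b$rows), integer(1))
col_sizes <- vapply(toy_bics, function(b) length(b$cols), integer(1))

print(row_sizes)
print(col_sizes)
cat("Number of biclusters:", length(toy_bics), "\n")
```

Passing row_sizes or col_sizes to hist() would give a plot analogous to the ones above.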

3. Compute network

The next step of the ensemble approach is the computation of a similarity network of biclusters. To filter out similarities that arise from random overlaps of biclusters, we apply an error model (for more details, refer to our publication). Different similarity metrics are available. For details, type ?mosbi::bicluster_network in the R console.

bic_net <- mosbi::bicluster_network(all_bics, # List of biclusters
    data_matrix, # Data matrix
    n_randomizations = 5,
    # Number of randomizations for the
    # error model
    MARGIN = "both",
    # Use datapoints for metric evaluation
    metric = 4, # Fowlkes–Mallows index
    # For information about the metrics,
    # visit the "Similarity metrics
    # evaluation" vignette
    n_steps = 1000,
    # Number of steps at which
    # the cut-off is evaluated
    plot_edge_dist = TRUE
    # Plot the evaluation of cut-off estimation
)
#> Esimated cut-off:  0.06206206

The two resulting plots visualize the process of cut-off estimation. The right plot shows the number of remaining edges for the computed bicluster network (red) and for randomizations of the biclusters (black). The vertical red line marks the threshold with the highest signal-to-noise ratio (SNR). All evaluated SNRs are visualized in the left plot.
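
The idea behind this cut-off estimation can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the implementation used by mosbi: the similarity values are simulated, and the SNR is taken to be the number of observed edges divided by the mean number of edges in the randomizations plus a pseudo-count of one, which is just one plausible definition.

```r
set.seed(42)

# Hypothetical pairwise similarities: mostly noise plus a few real overlaps
observed <- c(runif(80, 0, 0.1), runif(20, 0.3, 0.9))
# Five randomizations that contain only noise
randomized <- replicate(5, runif(100, 0, 0.1), simplify = FALSE)

# Number of edges that remain above a given cut-off
count_edges <- function(sims, cutoff) sum(sims > cutoff)

thresholds <- seq(0, 1, length.out = 1000)
obs_edges <- vapply(
    thresholds,
    function(th) count_edges(observed, th), numeric(1)
)
rand_edges <- vapply(
    thresholds,
    function(th) mean(vapply(randomized, count_edges, numeric(1), cutoff = th)),
    numeric(1)
)

# Assumed SNR definition; the pseudo-count avoids division by zero
snr <- obs_edges / (rand_edges + 1)
best_cutoff <- thresholds[which.max(snr)]
cat("Estimated cut-off in this toy example:", best_cutoff, "\n")
```

In this toy setting the selected cut-off lies just above the range of the noise similarities, so the random edges are removed while the edges from real overlaps are kept.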

The next plot shows the bicluster similarity matrix. It reveals highly similar biclusters.

stats::heatmap(get_adjacency(bic_net))
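
To make the notion of similarity more concrete, the sketch below computes a Fowlkes–Mallows index for two made-up biclusters, treating each bicluster as the set of data points spanned by its rows and columns. The list representation and the function name fowlkes_mallows are hypothetical, and the calculation is not necessarily identical to what mosbi does internally.

```r
# Fowlkes-Mallows index of two sets A and B: |A n B| / sqrt(|A| * |B|)
# Here a set is the collection of (row, column) data points of a bicluster
fowlkes_mallows <- function(bic_a, bic_b) {
    shared <- length(intersect(bic_a$rows, bic_b$rows)) *
        length(intersect(bic_a$cols, bic_b$cols))
    size_a <- length(bic_a$rows) * length(bic_a$cols)
    size_b <- length(bic_b$rows) * length(bic_b$cols)
    shared / sqrt(size_a * size_b)
}

# Hypothetical biclusters with 40 data points each
bic_1 <- list(rows = 1:10, cols = 1:4)
bic_2 <- list(rows = 6:15, cols = 3:6)

# 5 shared rows x 2 shared columns = 10 shared data points
fm <- fowlkes_mallows(bic_1, bic_2)
print(fm) # 10 / sqrt(40 * 40) = 0.25
```

Identical biclusters yield a value of 1 and biclusters without shared data points yield 0, so the index can be used directly as an edge weight.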

Visualize network

Before the final step, the extraction of bicluster communities (ensemble biclusters), the bicluster network can be laid out and plotted.

plot(bic_net)

The networks are plotted with the igraph package, so igraph-specific plotting parameters can be added. For help, type ?igraph::plot.igraph in the R console.

To see which algorithm generated each bicluster, the following function can be executed:

mosbi::plot_algo_network(bic_net, all_bics, vertex.label = NA)

The downloaded data contain samples from different weeks of development. This can be visualized on the network by showing which weeks the samples within each bicluster come from.

# Prepare groups for plotting
weeks <- vapply(
    strsplit(colnames(data_matrix), "\\."),
    function(x) {
        return(x[1])
    }, ""
)

names(weeks) <- colnames(data_matrix)

print(sort(unique(weeks))) # 5 colors required
#> [1] "X12W" "X24W" "X32W" "X4W"  "X52W"

week_cols <- c("yellow", "orange", "red", "green", "brown")

# Plot network colored by week
mosbi::plot_piechart_bicluster_network(bic_net, all_bics, weeks,
    week_cols,
    vertex.label = NA
)
graphics::legend("topright",
    legend = sort(unique(weeks)),
    fill = week_cols, title = "Week"
)
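
The parsing of the column names used above can be tried out in isolation. The column names in this sketch are hypothetical and merely follow the week.sample pattern that the strsplit() call assumes; the colors are arbitrary.

```r
# Hypothetical column names of the form "<week>.<sample>"
example_cols <- c("X4W.A", "X4W.B", "X12W.A", "X52W.B")

# Split at the dot and take the first or second part
parts <- strsplit(example_cols, "\\.")
example_weeks <- vapply(parts, function(x) x[1], "")
example_samples <- vapply(parts, function(x) x[2], "")
names(example_weeks) <- example_cols

# sort() orders the labels as strings, so "X12W" comes before "X4W"
week_levels <- sort(unique(example_weeks))

# Naming the colors makes the label-to-color assignment explicit
palette_by_week <- c("orange", "green", "brown")
names(palette_by_week) <- week_levels
print(palette_by_week)
```

Because the labels are sorted as strings, the order of the colors has to follow this sorted order for the legend to match, which is why X4W appears fourth in the output above.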

Such a visualization is also possible for the samples:

# Prepare groups for plotting
samples <- vapply(
    strsplit(colnames(data_matrix), "\\."),
    function(x) {
        return(x[2])
    }, ""
)

names(samples) <- colnames(data_matrix)

samples_cols <- RColorBrewer::brewer.pal(
    n = length(sort(unique(samples))),
    name = "Set3"
)


# Plot network colored by sample
mosbi::plot_piechart_bicluster_network(bic_net, all_bics, samples,
    samples_cols,
    vertex.label = NA
)
graphics::legend("topright",
    legend = sort(unique(samples)),
    fill = samples_cols, title = "Sample"
)

4. Extract Louvain communities

Calculate the communities

coms <- mosbi::get_louvain_communities(bic_net,
    min_size = 3,
    bics = all_bics
)
# Only communities with a minimum size of 3 biclusters are saved.
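
The effect of the min_size filter can be illustrated on a toy graph with igraph. This is a sketch under assumptions and not the mosbi implementation: the graph is made up, and igraph::cluster_louvain() is called directly.

```r
# Hypothetical toy network: two cliques of four nodes and one isolated pair
edges <- rbind(
    t(combn(1:4, 2)), # clique of nodes 1 to 4
    t(combn(5:8, 2)), # clique of nodes 5 to 8
    c(9, 10) # pair of nodes 9 and 10
)
g <- igraph::graph_from_edgelist(edges, directed = FALSE)

# Louvain community detection
comm <- igraph::cluster_louvain(g)
members <- split(
    seq_len(igraph::vcount(g)),
    as.integer(igraph::membership(comm))
)

# Keep only communities with at least min_size members
min_size <- 3
large_communities <- members[lengths(members) >= min_size]
print(large_communities)
```

The pair of nodes forms a community of its own but is dropped by the size filter, in the same way that small bicluster communities are discarded above.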

Visualization of the communities

# Plot all communities
for (i in seq_along(coms)) {
    tmp_bics <- mosbi::select_biclusters_from_bicluster_network(
        coms[[i]],
        all_bics
    )

    mosbi::plot_piechart_bicluster_network(coms[[i]], tmp_bics,
        weeks, week_cols,
        main = paste0("Community ", i)
    )
    graphics::legend("topright",
        legend = sort(unique(weeks)),
        fill = week_cols, title = "Week"
    )

    cat("\nCommunity ", i, " consists of results from the
             following algorithms:\n")
    cat(get_bic_net_algorithms(coms[[i]]))
    cat("\n")
}

#> 
#> Community  1  consists of results from the
#>              following algorithms:
#> fabia isa2

#> 
#> Community  2  consists of results from the
#>              following algorithms:
#> fabia isa2 biclust-plaid biclust-qubic

#> 
#> Community  3  consists of results from the
#>              following algorithms:
#> fabia

#> 
#> Community  4  consists of results from the
#>              following algorithms:
#> isa2 biclust-qubic

#> 
#> Community  5  consists of results from the
#>              following algorithms:
#> isa2

#> 
#> Community  6  consists of results from the
#>              following algorithms:
#> biclust-qubic

#> 
#> Community  7  consists of results from the
#>              following algorithms:
#> biclust-qubic

Extraction of the communities

Finally, the communities of the network can be extracted as ensemble biclusters. They are saved as a list of mosbi::bicluster objects and are therefore in the same format as the imported results of the individual algorithms. The parameters row_threshold and col_threshold define the minimum occurrence of a row or column element in the biclusters of a community.

ensemble_bicluster_list <- mosbi::ensemble_biclusters(coms, all_bics,
    data_matrix,
    row_threshold = .1,
    col_threshold = .1
)
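
How such an occurrence threshold could act is sketched below for the rows of one community. The sketch assumes that the threshold is the fraction of biclusters in a community that must contain an element; the row names and the value 0.5 are hypothetical, and the actual behaviour should be checked in the help page of mosbi::ensemble_biclusters.

```r
# Hypothetical community: the row names of four member biclusters
community_rows <- list(
    c("a", "b", "c"),
    c("a", "b"),
    c("a", "d"),
    c("a", "b", "e")
)

# Fraction of biclusters in which each row occurs
occurrence <- table(unlist(community_rows)) / length(community_rows)

# Keep rows that reach the (assumed) fractional threshold
example_threshold <- 0.5
kept_rows <- names(occurrence)[occurrence >= example_threshold]
print(kept_rows)
```

Under this assumption, a low threshold such as 0.1 keeps nearly every element that occurs in the community, whereas higher values restrict the ensemble bicluster to elements on which many member biclusters agree.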

Session Info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] mosbi_1.13.0     BiocStyle_2.35.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyr_1.3.1             sass_0.4.9              generics_0.1.3         
#>  [4] class_7.3-22            lattice_0.22-6          digest_0.6.37          
#>  [7] magrittr_2.0.3          evaluate_1.0.1          grid_4.4.2             
#> [10] RColorBrewer_1.1-3      fastmap_1.2.0           jsonlite_1.8.9         
#> [13] BiocManager_1.30.25     purrr_1.0.2             scales_1.3.0           
#> [16] modeltools_0.2-23       jquerylib_0.1.4         cli_3.6.3              
#> [19] isa2_0.3.6              rlang_1.1.4             Biobase_2.67.0         
#> [22] munsell_0.5.1           cachem_1.1.0            yaml_2.3.10            
#> [25] tools_4.4.2             parallel_4.4.2          biclust_2.0.3.1        
#> [28] dplyr_1.1.4             colorspace_2.1-1        ggplot2_3.5.1          
#> [31] BiocGenerics_0.53.3     buildtools_1.0.0        vctrs_0.6.5            
#> [34] R6_2.5.1                stats4_4.4.2            lifecycle_1.0.4        
#> [37] QUBIC_1.35.0            MASS_7.3-61             pkgconfig_2.0.3        
#> [40] RcppParallel_5.1.9      bslib_0.8.0             pillar_1.10.0          
#> [43] gtable_0.3.6            glue_1.8.0              Rcpp_1.0.13-1          
#> [46] tidyselect_1.2.1        xfun_0.49               tibble_3.2.1           
#> [49] sys_3.4.3               flexclust_1.4-2         knitr_1.49             
#> [52] fabia_2.53.0            igraph_2.1.2            htmltools_0.5.8.1      
#> [55] rmarkdown_2.29          BH_1.87.0-1             maketools_1.3.1        
#> [58] compiler_4.4.2          additivityTests_1.1-4.2