--- title: "Benchmarking BamScale Across Step1, GAlignments, and SeqQual Workloads" output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{Benchmarking BamScale Across Step1, GAlignments, and SeqQual Workloads} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} params: step1_ga_results_dir: "run_20260320_133141" seqqual_results_dir: "run_20260320_162359" --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = FALSE, warning = FALSE, message = FALSE ) ``` ## Overview This article combines the final benchmark runs used for the current BamScale benchmark summary: - `step1` and `galignments` from `r params$step1_ga_results_dir` - `seqqual` from `r params$seqqual_results_dir` These runs were generated with the same server-first benchmark harness, the same balanced profile family, the same deterministic case order, and the same worker/thread budget policy. `Seqqual` is reported separately because it includes both the fair compatibility track and the optimized compact track. ## Data Loading ```{r benchmark-data-loading} suppressPackageStartupMessages({ library(readr) library(dplyr) library(tidyr) library(ggplot2) library(scales) }) resolve_results_dir <- function(path) { input_dir <- tryCatch(dirname(normalizePath(knitr::current_input(), winslash = "/", mustWork = TRUE)), error = function(e) NA_character_) wd <- tryCatch(normalizePath(getwd(), winslash = "/", mustWork = TRUE), error = function(e) NA_character_) candidates <- c( path, if (!is.na(input_dir)) file.path(input_dir, path) else NA_character_, if (!is.na(input_dir)) file.path(input_dir, "..", path) else NA_character_, if (!is.na(input_dir)) file.path(input_dir, "..", "..", path) else NA_character_, if (!is.na(input_dir)) file.path(input_dir, "..", "inst", "benchmarks", path) else NA_character_, if (!is.na(wd)) file.path(wd, path) else NA_character_, if (!is.na(wd)) file.path(wd, "inst", "benchmarks", path) else NA_character_ ) candidates <- unique(candidates[!is.na(candidates) & nzchar(candidates)]) hits <- candidates[file.exists(candidates)] if (length(hits) == 0) { stop("Could not locate results_dir: ", path, call. = FALSE) } normalizePath(hits[[1]], winslash = "/", mustWork = TRUE) } read_if_exists <- function(path) { if (file.exists(path)) readr::read_csv(path, show_col_types = FALSE) else tibble::tibble() } load_run <- function(path_label) { dir_path <- resolve_results_dir(path_label) summary_path <- if (file.exists(file.path(dir_path, "summary_qmd.csv"))) { file.path(dir_path, "summary_qmd.csv") } else { file.path(dir_path, "summary.csv") } list( label = basename(dir_path), dir = dir_path, summary = read_if_exists(summary_path), files = read_if_exists(file.path(dir_path, "files.csv")), host = read_if_exists(file.path(dir_path, "host_info.csv")), reference_counts = read_if_exists(file.path(dir_path, "reference_counts.csv")), correctness = read_if_exists(file.path(dir_path, "correctness_preflight.csv")) ) } run_main <- load_run(params$step1_ga_results_dir) run_seqqual <- load_run(params$seqqual_results_dir) main_ok <- run_main$summary %>% filter(status == "ok", workload %in% c("step1", "galignments"), comparison_track == "fair") seq_ok <- run_seqqual$summary %>% filter(status == "ok", workload == "seqqual") single_main <- main_ok %>% filter(scenario == "single") multi_main <- main_ok %>% filter(scenario == "multi") seq_single <- seq_ok %>% filter(scenario == "single") seq_multi <- seq_ok %>% filter(scenario == "multi") host_main <- if (nrow(run_main$host) > 0) run_main$host[1, ] else tibble::tibble() host_seq <- if (nrow(run_seqqual$host) > 0) run_seqqual$host[1, ] else tibble::tibble() ``` ## Benchmark Provenance - Step1 and GAlignments run: `r run_main$label` - SeqQual run: `r run_seqqual$label` - Step1/GAlignments results directory: `r run_main$dir` - SeqQual results directory: `r run_seqqual$dir` ```{r benchmark-provenance-table} run_overview_tbl <- tibble::tibble( run = c(run_main$label, run_seqqual$label), workloads = c("step1, galignments", "seqqual"), profile = c( if (nrow(run_main$host) > 0) host_main$profile[[1]] else NA_character_, if (nrow(run_seqqual$host) > 0) host_seq$profile[[1]] else NA_character_ ), cpu = c( if (nrow(run_main$host) > 0) host_main$cpu_model[[1]] else NA_character_, if (nrow(run_seqqual$host) > 0) host_seq$cpu_model[[1]] else NA_character_ ), logical_cores = c( if (nrow(run_main$host) > 0) host_main$logical_cores[[1]] else NA_integer_, if (nrow(run_seqqual$host) > 0) host_seq$logical_cores[[1]] else NA_integer_ ), total_mem_gb = c( if (nrow(run_main$host) > 0) round(host_main$total_mem_gb[[1]], 1) else NA_real_, if (nrow(run_seqqual$host) > 0) round(host_seq$total_mem_gb[[1]], 1) else NA_real_ ), successful_cases = c(nrow(main_ok), nrow(seq_ok)) ) knitr::kable(run_overview_tbl, col.names = c("Run", "Workloads", "Profile", "CPU", "Logical cores", "RAM (GB)", "Successful cases")) ``` ## Methods Rationale This benchmark suite covers three distinct access patterns: - `step1`: alignment-metadata extraction without sequence/quality materialization. This is a good proxy for BAM filtering, fragment-length profiling, and fragment-distribution QC. - `galignments`: construction of alignment objects suitable for Bioconductor workflows centered on `GenomicAlignments`. - `seqqual`: full sequence and quality extraction. This is reported in two BamScale modes: - `fair` compatibility mode, which is the closest comparator to `Rsamtools::scanBam` - `optimized` compact mode, which reduces internal overhead and should be interpreted as an optimized BamScale track rather than a strict apples-to-apples replacement for the compatibility path Across all runs, the benchmark design emphasizes: - deterministic case order - balanced BamScale worker/thread plans - explicit comparator baselines - single-file and multi-file reporting - fixed 48-thread budget for multi-file plans ## Input Files The same underlying four selected BAMs were used for both runs, with repeated files allowed to populate the 12-file multi-file benchmark set. ```{r benchmark-input-files} run_main$files %>% transmute( file, source, selected_for_single, selected_for_multi, size_mb = round(size_mb, 1), has_index ) %>% knitr::kable() ``` ## Reference Counts ```{r benchmark-reference-counts} bind_rows( run_main$reference_counts %>% mutate(run = run_main$label), run_seqqual$reference_counts %>% mutate(run = run_seqqual$label) ) %>% mutate(total_mb = round(total_mb, 1)) %>% select(run, scenario, workload, n_files, n_records, total_mb) %>% knitr::kable() ``` ## Best-Observed Summary ```{r best-observed-summary} best_summary_tbl <- bind_rows(main_ok, seq_ok) %>% group_by(scenario, workload, comparison_track, method_family) %>% slice_min(order_by = median_s, n = 1, with_ties = FALSE) %>% ungroup() %>% mutate( plan = paste0(bp_workers_effective, "x", threads_effective), median_s = round(median_s, 3) ) %>% select(scenario, workload, comparison_track, method_family, method, plan, median_s) knitr::kable(best_summary_tbl, col.names = c("Scenario", "Workload", "Track", "Method family", "Method", "Plan", "Median time (s)")) ``` ## BamScale-versus-Comparator Fold Change ```{r fold-change-summary} best_fold_tbl <- function(df, scenario_name, include_optimized = FALSE) { bamscale_fair <- df %>% filter(scenario == scenario_name, method_family == "BamScale", comparison_track == "fair") %>% group_by(workload) %>% slice_min(order_by = median_s, n = 1, with_ties = FALSE) %>% ungroup() %>% transmute( scenario = scenario_name, workload, track = "fair", bamscale_method = method, bamscale_plan = paste0(bp_workers_effective, "x", threads_effective), bamscale_s = median_s ) %>% left_join( df %>% filter(scenario == scenario_name, method_family != "BamScale", comparison_track == "fair") %>% group_by(workload) %>% slice_min(order_by = median_s, n = 1, with_ties = FALSE) %>% ungroup() %>% transmute( workload, comparator_family = method_family, comparator_method = method, comparator_plan = paste0(bp_workers_effective, "x", threads_effective), comparator_s = median_s ), by = "workload" ) %>% mutate( fold_change = comparator_s / bamscale_s, pct_faster = 100 * (1 - bamscale_s / comparator_s) ) if (!include_optimized) { return(bamscale_fair) } bamscale_optimized <- df %>% filter(scenario == scenario_name, method_family == "BamScale", comparison_track == "optimized") %>% group_by(workload) %>% slice_min(order_by = median_s, n = 1, with_ties = FALSE) %>% ungroup() %>% transmute( scenario = scenario_name, workload, track = "optimized", bamscale_method = method, bamscale_plan = paste0(bp_workers_effective, "x", threads_effective), bamscale_s = median_s ) %>% left_join( df %>% filter(scenario == scenario_name, method_family != "BamScale", comparison_track == "fair") %>% group_by(workload) %>% slice_min(order_by = median_s, n = 1, with_ties = FALSE) %>% ungroup() %>% transmute( workload, comparator_family = method_family, comparator_method = method, comparator_plan = paste0(bp_workers_effective, "x", threads_effective), comparator_s = median_s ), by = "workload" ) %>% mutate( fold_change = comparator_s / bamscale_s, pct_faster = 100 * (1 - bamscale_s / comparator_s) ) bind_rows(bamscale_fair, bamscale_optimized) } fold_tbl <- bind_rows( best_fold_tbl(main_ok, "single"), best_fold_tbl(main_ok, "multi"), best_fold_tbl(seq_ok, "single", include_optimized = TRUE), best_fold_tbl(seq_ok, "multi", include_optimized = TRUE) ) %>% mutate( across(c(bamscale_s, comparator_s, fold_change, pct_faster), ~ round(.x, 3)) ) knitr::kable( fold_tbl %>% select(scenario, workload, track, comparator_family, comparator_method, comparator_plan, comparator_s, bamscale_method, bamscale_plan, bamscale_s, fold_change, pct_faster), col.names = c("Scenario", "Workload", "Track", "Comparator family", "Comparator method", "Comparator plan", "Comparator time (s)", "BamScale method", "BamScale plan", "BamScale time (s)", "Comparator/BamScale", "BamScale % faster") ) ``` ```{r fold-change-plot, fig.width=9, fig.height=4.8} ggplot(fold_tbl, aes(x = interaction(workload, scenario, track, sep = "\n"), y = fold_change, fill = workload)) + geom_col(width = 0.75) + geom_text(aes(label = sprintf("%.2fx", fold_change)), vjust = -0.3, size = 3.5) + scale_fill_brewer(palette = "Set2") + scale_y_continuous(expand = expansion(mult = c(0, 0.1))) + labs( x = NULL, y = "Speedup over comparator (x)", fill = "Workload", title = "Best-observed BamScale speedup over comparator" ) + theme_minimal(base_size = 12) + theme(axis.text.x = element_text(angle = 18, hjust = 1)) ``` ## Single-File `step1` ```{r step1-single-plot, fig.width=7.5, fig.height=4.8} single_main %>% filter(workload == "step1") %>% mutate( method_label = if_else(method_family == "BamScale", "BamScale", "Rsamtools::scanBam") ) %>% ggplot(aes(x = threads_effective, y = median_s, color = method_label)) + geom_line(linewidth = 1) + geom_point(size = 2.7) + scale_x_continuous(breaks = c(1, 4, 12, 24, 48)) + labs( x = "Threads", y = "Median time (s)", color = NULL, title = "Single-file step1 scaling" ) + theme_minimal(base_size = 12) ``` ```{r step1-single-table} single_main %>% filter(workload == "step1") %>% mutate(plan = paste0(bp_workers_effective, "x", threads_effective)) %>% select(method_family, method, plan, median_s) %>% arrange(median_s) %>% knitr::kable(col.names = c("Method family", "Method", "Plan", "Median time (s)")) ``` ## Single-File `galignments` ```{r galignments-single-plot, fig.width=7.5, fig.height=4.8} single_main %>% filter(workload == "galignments") %>% mutate( method_label = if_else(method_family == "BamScale", "BamScale", "GenomicAlignments::readGAlignments") ) %>% ggplot(aes(x = threads_effective, y = median_s, color = method_label)) + geom_line(linewidth = 1) + geom_point(size = 2.7) + scale_x_continuous(breaks = c(1, 4, 12, 24, 48)) + labs( x = "Threads", y = "Median time (s)", color = NULL, title = "Single-file galignments scaling" ) + theme_minimal(base_size = 12) ``` ```{r galignments-single-table} single_main %>% filter(workload == "galignments") %>% mutate(plan = paste0(bp_workers_effective, "x", threads_effective)) %>% select(method_family, method, plan, median_s) %>% arrange(median_s) %>% knitr::kable(col.names = c("Method family", "Method", "Plan", "Median time (s)")) ``` ## Multi-File `step1` ```{r step1-multi-plot, fig.width=7.5, fig.height=4.8} multi_main %>% filter(workload == "step1") %>% mutate( method_label = if_else(method_family == "BamScale", "BamScale", "Rsamtools::scanBam + BiocParallel") ) %>% ggplot(aes(x = bp_workers_effective, y = median_s, color = method_label, group = method_label)) + geom_line(linewidth = 1) + geom_point(size = 2.7) + scale_x_continuous(breaks = c(1, 2, 4, 8, 12)) + labs( x = "Workers", y = "Median time (s)", color = NULL, title = "Multi-file step1 scaling" ) + theme_minimal(base_size = 12) ``` ## Multi-File `galignments` ```{r galignments-multi-plot, fig.width=7.5, fig.height=4.8} multi_main %>% filter(workload == "galignments") %>% mutate( method_label = if_else(method_family == "BamScale", "BamScale", "GenomicAlignments::readGAlignments + BiocParallel") ) %>% ggplot(aes(x = bp_workers_effective, y = median_s, color = method_label, group = method_label)) + geom_line(linewidth = 1) + geom_point(size = 2.7) + scale_x_continuous(breaks = c(1, 2, 4, 8, 12)) + labs( x = "Workers", y = "Median time (s)", color = NULL, title = "Multi-file galignments scaling" ) + theme_minimal(base_size = 12) ``` ## Single-File `seqqual` ```{r seqqual-single-plot, fig.width=8, fig.height=4.8} seq_single %>% mutate( method_label = case_when( method_family == "BamScale" & comparison_track == "fair" ~ "BamScale (compatible)", method_family == "BamScale" & comparison_track == "optimized" ~ "BamScale (compact)", TRUE ~ "Rsamtools::scanBam" ) ) %>% ggplot(aes(x = threads_effective, y = median_s, color = method_label)) + geom_line(linewidth = 1) + geom_point(size = 2.7) + scale_x_continuous(breaks = c(1, 4, 12, 24, 48)) + labs( x = "Threads", y = "Median time (s)", color = NULL, title = "Single-file seqqual scaling" ) + theme_minimal(base_size = 12) ``` ```{r seqqual-single-table} seq_single %>% mutate( method_label = case_when( method_family == "BamScale" & comparison_track == "fair" ~ "BamScale (compatible)", method_family == "BamScale" & comparison_track == "optimized" ~ "BamScale (compact)", TRUE ~ "Rsamtools::scanBam" ), plan = paste0(bp_workers_effective, "x", threads_effective) ) %>% select(method_label, plan, median_s) %>% arrange(median_s) %>% knitr::kable(col.names = c("Method", "Plan", "Median time (s)")) ``` ## Multi-File `seqqual` ```{r seqqual-multi-plot, fig.width=8, fig.height=4.8} seq_multi %>% mutate( method_label = case_when( method_family == "BamScale" & comparison_track == "fair" ~ "BamScale (compatible)", method_family == "BamScale" & comparison_track == "optimized" ~ "BamScale (compact)", TRUE ~ "Rsamtools::scanBam + BiocParallel" ) ) %>% ggplot(aes(x = bp_workers_effective, y = median_s, color = method_label, group = method_label)) + geom_line(linewidth = 1) + geom_point(size = 2.7) + scale_x_continuous(breaks = c(1, 2, 4, 8, 12)) + labs( x = "Workers", y = "Median time (s)", color = NULL, title = "Multi-file seqqual scaling" ) + theme_minimal(base_size = 12) ``` ## Compact-versus-Compatible `seqqual` ```{r compact-gain-table} compact_gain_tbl <- bind_rows(seq_single, seq_multi) %>% filter(method_family == "BamScale") %>% select(scenario, workload, comparison_track, bp_workers_effective, threads_effective, median_s) %>% tidyr::pivot_wider(names_from = comparison_track, values_from = median_s) %>% mutate( plan = paste0(bp_workers_effective, "x", threads_effective), compact_gain = fair / optimized ) %>% arrange(scenario, bp_workers_effective, threads_effective) knitr::kable( compact_gain_tbl %>% transmute( scenario, workload, plan, compatible_s = round(fair, 3), compact_s = round(optimized, 3), compact_gain = round(compact_gain, 3) ), col.names = c("Scenario", "Workload", "Plan", "Compatible time (s)", "Compact time (s)", "Compatible/Compact") ) ``` ```{r compact-gain-plot, fig.width=7.5, fig.height=4.8} compact_gain_tbl %>% ggplot(aes(x = plan, y = compact_gain, fill = scenario)) + geom_col(position = position_dodge(width = 0.8), width = 0.7) + geom_text( aes(label = sprintf("%.2fx", compact_gain)), position = position_dodge(width = 0.8), vjust = -0.2, size = 3.3 ) + scale_y_continuous(expand = expansion(mult = c(0, 0.1))) + labs( x = "BamScale plan", y = "Compatible / compact", fill = "Scenario", title = "Compact seqqual reduces extraction overhead" ) + theme_minimal(base_size = 12) ``` ## Interpretation The benchmark shows a consistent pattern across the final runs: - `step1` and `galignments` are the strongest BamScale workloads. - For these workloads, the best BamScale plans are usually low-worker, high-thread configurations, indicating that within-file multithreading contributes more than aggressively increasing the number of file workers. - `Seqqual` is more difficult because sequence and quality extraction amplifies output-construction overhead. - BamScale compact mode narrows this gap substantially and produces the strongest single-file `seqqual` result, but the best multi-file `seqqual` comparator remains faster. In practical terms, these results support the following guidance: - use BamScale when metadata extraction or alignment-object construction is a bottleneck in BAM-heavy Bioconductor workflows - favor low-worker, high-thread plans when traversing one or a small number of large BAM files - interpret compact `seqqual` as an optimized throughput path rather than as a strict compatibility benchmark ## Session Information ```{r benchmark-session-info} sessionInfo() ```