CrcBiomeScreen is an R package designed to streamline microbiome-based colorectal cancer (CRC) screening workflows.
It provides standardized functions for preprocessing, taxonomic data handling, machine learning model training,
and cross-cohort validation, supporting reproducible and interpretable microbiome analysis for biomarker discovery.
This version marks the first public release of the package, submitted to Bioconductor.
SplitTaxas() now automatically detects taxonomy separators
(supports |, ., _, and ;) to handle common formats from MetaPhlAn, QIIME, etc.
Adds:
OriginalTaxa column to retain the raw taxonomy string.
Labeling of unclassified taxa with their parent rank (e.g., f__Rikenellaceae|g__unclassified → Rikenellaceae_unclassified).
KeepTaxonomicLevel() filters data at a user-defined rank (e.g., Genus, Family, Species)
and automatically collapses lower-level abundances.
Handles multiple nested unclassified levels (e.g., D_2__Clostridia.D_3__uncultured.D_4__uncultured).
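The taxonomy handling described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper names detect_separator() and label_at_rank() are hypothetical, the "first separator found in every string" rule is an assumed heuristic, and the choice to label nested unclassified or uncultured levels as Parent_unclassified is an assumption based on the single example given in these notes. Here rank_index is simply the position of the rank within the split lineage.

```r
# Illustrative sketch only; helper names are hypothetical, not CrcBiomeScreen's API.

# Pick the first candidate separator that occurs in every taxonomy string.
# "_" is tried last because rank prefixes such as "g__" also contain it.
detect_separator <- function(taxa, candidates = c("|", ";", ".", "_")) {
  stopifnot(length(taxa) > 0)
  for (sep in candidates) {
    # fixed = TRUE matches "|" and "." literally, not as regex metacharacters
    if (all(grepl(sep, taxa, fixed = TRUE))) return(sep)
  }
  NA_character_
}

# Name a lineage at a given position, borrowing the nearest classified
# parent when the requested level is unclassified or uncultured.
label_at_rank <- function(lineage, rank_index) {
  stopifnot(rank_index >= 1, rank_index <= length(lineage))
  # Strip rank prefixes such as "f__" or "D_3__"
  names_only <- sub("^.*__", "", lineage[seq_len(rank_index)])
  known <- !(tolower(names_only) %in% c("unclassified", "uncultured", ""))
  if (known[rank_index]) return(names_only[rank_index])
  if (!any(known)) return("unclassified")
  paste0(names_only[max(which(known))], "_unclassified")
}

taxa <- c(
  "f__Rikenellaceae|g__Alistipes|s__Alistipes_putredinis",
  "f__Rikenellaceae|g__Alistipes|s__Alistipes_finegoldii",
  "f__Rikenellaceae|g__unclassified|s__unclassified"
)
# Rows are taxa, columns are samples (toy abundances)
abund <- matrix(c(5, 2, 7, 1, 4, 3), nrow = 3,
                dimnames = list(taxa, c("S1", "S2")))

sep <- detect_separator(taxa)
lineages <- strsplit(taxa, sep, fixed = TRUE)
genus <- vapply(lineages, label_at_rank, character(1), rank_index = 2)

# Collapse lower-level rows by summing abundances within each genus label
collapsed <- rowsum(abund, group = genus)

# Nested unclassified levels fall back to the last classified parent
nested <- strsplit("D_2__Clostridia.D_3__uncultured.D_4__uncultured",
                   ".", fixed = TRUE)[[1]]
nested_label <- label_at_rank(nested, 3)
```

In this toy input the two Alistipes species are summed into one genus row, and the unclassified genus is kept as a separate row named after its family.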
TrainModel() serves as a unified interface for training Random Forest (RF) and XGBoost classifiers.
Uses withr::with_seed() for local reproducibility instead of set.seed().
EvaluateRF() and EvaluateXGBoost() now return standardized performance metrics
(AUC, accuracy, recall, F1) and store model outputs in the main object structure.
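For readers unfamiliar with the four metrics listed above, the following base R sketch shows one common way to compute them for a binary classifier. It is an assumption-laden illustration, not the code behind EvaluateRF() or EvaluateXGBoost(): the function name binary_metrics(), the 0.5 threshold, and the rank-based (Mann-Whitney) AUC are all choices made here for clarity.

```r
# Illustrative sketch only; binary_metrics() is a hypothetical helper.
# truth: 0/1 labels (1 = case), prob: predicted probability of class 1.
binary_metrics <- function(truth, prob, threshold = 0.5) {
  stopifnot(length(truth) == length(prob), all(truth %in% c(0, 1)))
  pred <- as.integer(prob >= threshold)

  tp <- sum(pred == 1 & truth == 1)
  tn <- sum(pred == 0 & truth == 0)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)

  accuracy  <- (tp + tn) / length(truth)
  recall    <- tp / (tp + fn)
  precision <- tp / (tp + fp)
  f1        <- 2 * precision * recall / (precision + recall)

  # AUC from the Mann-Whitney U statistic; rank() averages tied scores
  r <- rank(prob)
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  auc <- (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

  c(AUC = auc, accuracy = accuracy, recall = recall, F1 = f1)
}

truth <- c(1, 1, 1, 0, 0, 0)
prob  <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.2)
metrics <- binary_metrics(truth, prob)
```

Note that precision, and therefore F1, is undefined when no sample is predicted positive; a production implementation would need to handle that case explicitly.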
ValidateModelOnData() supports model evaluation across independent datasets.
CrcBiomeScreenObject class to store:
This ensures data provenance and reproducibility across the full workflow.
PlotAUC)
Added comprehensive vignette:
Implemented unit tests under tests/testthat/ for key components:
BugReports and URL fields in DESCRIPTION.
.Rproj, .DS_Store).
Replaced set.seed() with withr::with_seed() for Bioconductor compliance.
Maintainer: Chengxin Li (University of Leeds)
Date: 2025-10-21
License: MIT
Repository: https://github.com/omicsForestry/CrcBiomeScreen