--- title: "Pipeline Use of dnaEPICO" author: - name: Paul Ruiz affiliation: - Queensland University of Technology email: ruizpint@qut.edu.au - name: Divya Mehta affiliation: - Queensland University of Technology output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 date: "`r Sys.Date()`" package: dnaEPICO bibliography: dnaEPICO.bib vignette: > %\VignetteIndexEntry{3. Pipeline Use of dnaEPICO} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r, include = FALSE} startTime <- Sys.time() knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ) library(dnaEPICO) can_run_preprocessing <- requireNamespace("minfiData", quietly = TRUE) && requireNamespace("IlluminaHumanMethylation450kmanifest", quietly = TRUE) && requireNamespace("IlluminaHumanMethylation450kanno.ilmn12.hg19", quietly = TRUE) can_run_sva <- can_run_preprocessing can_run_glm <- requireNamespace( "IlluminaHumanMethylation450kanno.ilmn12.hg19", quietly = TRUE ) can_run_glmm <- can_run_glm && requireNamespace("lmerTest", quietly = TRUE) ``` # Introduction This vignette describes the file-based route in `dnaEPICO`. In this mode, the same main functions still return structured result objects, but they also write files to disk when `saveOutputs = TRUE`. This route is appropriate when you want a reproducible folder structure, Makefile-driven execution, or a pipeline that can be resumed on a local workstation or an HPC system. # Citation The `r Biocpkg("dnaEPICO")` package uses methods from several external packages. Because no single manuscript describes all components, the guidance below explains how to cite `dnaEPICO` depending on the functions you use. This citation guidance is adapted from the vignettes and user guides of [minfi](https://github.com/hansenlab/minfi/blob/devel/vignettes/minfi.Rmd), [ENmix](https://github.com/xuz1/ENmix/blob/master/vignettes/ENmix.Rmd), and [wateRmelon](https://github.com/schalkwyk/wateRmelon/blob/master/vignettes/references.bib). - If you use `make f3 MODEL=model1`, please cite @minfi; @minfiEPIC; @ENmix; @wateRmelon; @SWAN; @funnorm; @noob; @quantile; @ewastools. - If you use `make f4 MODEL=model1`, please cite @minfi; @minfiEPIC; @ENmix; @wateRmelon; @SWAN; @funnorm; @noob; @quantile; @ewastools; @glm2. - If you use `make f3lme MODEL=model1`, please cite @minfi; @minfiEPIC; @ENmix; @wateRmelon; @SWAN; @funnorm; @noob; @quantile; @ewastools; @lme4. - If you use `make all MODEL=model1`, please cite @minfi; @minfiEPIC; @ENmix; @wateRmelon; @SWAN; @funnorm; @noob; @quantile; @ewastools; @glm2; @lme4. # Required knowledge `r Biocpkg("dnaEPICO")` is built on core Bioconductor infrastructure for high-dimensional genomic data, with a focus on Illumina DNA methylation arrays. To read this vignette, we assume that you are already familiar with the general DNA methylation workflow. If not, we recommend first reading [this tutorial](https://paulyrp.github.io/2025-cpgpneurogenomics-workshop/tutorial.html), which provides a practical introduction to the main concepts and analysis steps. Preprocessing and quality control are performed using established Bioconductor tools, including `r Biocpkg("minfi")`, `r Biocpkg("ENmix")`, and `r Biocpkg("wateRmelon")`. Downstream statistical modelling relies on base R and CRAN frameworks, including generalised linear models and linear mixed-effects models. Users are expected to have basic familiarity with R, command-line execution, and Illumina IDAT file structures. If you are asking yourself the question "Where do I start using Bioconductor?", you might be interested in [this blog post](https://www.bioconductor.org/install/). # Installation The `r Biocpkg("dnaEPICO")` package depends on `r Biocpkg("minfi")`, `r Biocpkg("ENmix")`, and `r Biocpkg("wateRmelon")`, among others. Install dependencies with BiocManager: ```{r, eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("dnaEPICO") BiocManager::valid() ``` ```{r, eval = FALSE, message = FALSE} library("dnaEPICO") ``` # Extract the example Makefile The package includes a template Makefile that can be copied into a project directory and adapted to local paths and model definitions. ```{r makefile-example} makefilePath <- extractMake(destDir = tempdir(), overwrite = TRUE) basename(makefilePath) ``` After extraction, the file should be edited so that the project-specific paths, model labels, and cluster settings match the target analysis environment. Arguments: - `destDir = tempdir()` writes the template to a temporary directory so the example does not modify the working tree. - `overwrite = TRUE` allows the template to be replaced if it already exists in that temporary location. # Write preprocessingMinfiEwasWater outputs to disk The file-based route usually starts with `preprocessingMinfiEwasWater()`. This step writes the filtered `RGSet`, normalized metric matrices, quality-control figures, and the `phenoLC.csv` file that will be consumed by `svaEnmix()` and `preprocessingPheno()`. ### Create example input files The next chunk shows how to recreate the temporary files used in this example. It writes a small phenotype table to a temporary directory, copies a small set of IDAT files, and records the cross-reactive probe reference used during probe filtering. ```{r preprocessing-pipeline-inputs, eval = can_run_preprocessing, message = FALSE, warning = FALSE} preprocessing_inputs <- dnaEPICO:::exampleMinfiIdatInputsDnaEpico(n = 6L) names(preprocessing_inputs) preprocessing_inputs$tempDir basename(preprocessing_inputs$phenoFile) head(basename(list.files(preprocessing_inputs$idatFolder, full.names = TRUE)), 4) basename(preprocessing_inputs$crossReactivePath) ``` `preprocessing_inputs` is a small list returned by the internal example helper. `names(preprocessing_inputs)` shows its main elements: `tempDir` is the temporary working directory, `idatFolder` contains the copied example IDAT files, `phenoFile` is the phenotype table used by the function, `targets` is the same phenotype information already loaded into memory, `arrayType` and `annotationVersion` describe the array platform, and `crossReactivePath` points to the cross-reactive probe reference used during probe filtering. ```{r preprocessing-pipeline, eval = can_run_preprocessing, message = FALSE, warning = FALSE} preprocessPipelineResult <- preprocessingMinfiEwasWater( phenoFile = preprocessing_inputs$phenoFile, idatFolder = preprocessing_inputs$idatFolder, outputLogs = file.path(preprocessing_inputs$tempDir, "logs"), nSamples = 6, SampleID = "Sample_Name", arrayType = preprocessing_inputs$arrayType, annotationVersion = preprocessing_inputs$annotationVersion, scriptLabel = "preprocessingMinfiEwasWater", baseDataFolder = file.path(preprocessing_inputs$tempDir, "rData"), figureBaseDir = file.path(preprocessing_inputs$tempDir, "figures"), detPThreshold = 1, normMethods = "quantile", sexColumn = "Sex", pvalThreshold = 1, chrToRemove = "", snpsToRemove = "SBE", mafThreshold = 1, crossReactivePath = preprocessing_inputs$crossReactivePath, plotGroupVar = "Sex", lcRef = "saliva", phenoOrder = "Sample_Name;Sex;Basename;Sentrix_ID;Sentrix_Position", lcPhenoDir = file.path( preprocessing_inputs$tempDir, "data", "preprocessingMinfiEwasWater" ), display = FALSE, verbose = FALSE, logs = TRUE, saveOutputs = TRUE ) preprocess_paths <- c( phenoLC = file.path( preprocessing_inputs$tempDir, "data", "preprocessingMinfiEwasWater", "phenoLC.csv" ), rgset = file.path( preprocessing_inputs$tempDir, "rData", "preprocessingMinfiEwasWater", "objects", "RGSet.RData" ), beta = file.path( preprocessing_inputs$tempDir, "rData", "preprocessingMinfiEwasWater", "metrics", "beta_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData" ), qc = file.path( preprocessing_inputs$tempDir, "figures", "preprocessingMinfiEwasWater", "qc", "quality_control(MSet).tiff" ) ) class(preprocessPipelineResult) basename(preprocess_paths) file.exists(preprocess_paths) ``` The returned object still has class `dnaEPICO_preprocessingMinfiEwasWater`, but the printed vectors in this chunk focus on the written outputs. In `preprocess_paths`, `phenoLC` is the phenotype table with estimated cell proportions, `rgset` is the filtered `RGSet`, `beta` is the saved beta matrix used later by phenotype preparation, and `qc` is an example quality-control figure. `basename(preprocess_paths)` shows the file names, while `file.exists(preprocess_paths)` confirms that each one was written successfully. Arguments: - `phenoFile` and `idatFolder` define the phenotype table and the IDAT folder used to start the pipeline. - `SampleID = "Sample_Name"` is the key used to align phenotype rows with the methylation objects. - `arrayType` and `annotationVersion` define the array manifest and annotation used by `minfi`. - `baseDataFolder`, `figureBaseDir`, and `lcPhenoDir` define where serialized objects, figures, and `phenoLC.csv` are written. - `detPThreshold`, `normMethods`, `pvalThreshold`, `chrToRemove`, `snpsToRemove`, `mafThreshold`, and `crossReactivePath` define the main preprocessing and filtering choices. - `plotGroupVar`, `lcRef`, and `phenoOrder` control QC grouping, cell-type estimation, and the structure of the exported phenotype table. - `saveOutputs = TRUE` is what turns this function into the first file-producing step of the pipeline. # Write SVA outputs to disk The next example shows `svaEnmix()` in file-writing mode. The function still returns a structured result object, but the `savedFiles` element now records the main output files written to disk. ### Reuse the temporary files written in Step 1 In the file-based route, `svaEnmix()` consumes the `phenoLC.csv` and `RGSet.RData` files written by `preprocessingMinfiEwasWater()`. The next chunk shows how those temporary file paths are reconstructed locally. ```{r sva-pipeline-inputs, eval = can_run_sva, message = FALSE, warning = FALSE} sva_targets_file <- file.path( preprocessing_inputs$tempDir, "data", "preprocessingMinfiEwasWater", "phenoLC.csv" ) sva_rgset_file <- file.path( preprocessing_inputs$tempDir, "rData", "preprocessingMinfiEwasWater", "objects", "RGSet.RData" ) basename(c(sva_targets_file, sva_rgset_file)) file.exists(c(sva_targets_file, sva_rgset_file)) ``` Unlike the previous helper objects, `sva_targets_file` and `sva_rgset_file` are plain file paths reconstructed from `preprocessing_inputs$tempDir`. They point to the concrete outputs written by Step 1: `phenoLC.csv`, which is the phenotype table augmented with cell-composition estimates, and `RGSet.RData`, which is the filtered methylation object saved to disk. ```{r sva-pipeline, eval = can_run_sva, message = FALSE, warning = FALSE} svaPipelineResult <- svaEnmix( phenoFile = sva_targets_file, rgsetData = sva_rgset_file, outputLogs = file.path(preprocessing_inputs$tempDir, "logs"), SampleID = "Sample_Name", arrayType = "IlluminaHumanMethylation450k", annotationVersion = "ilmn12.hg19", SentrixIDColumn = "Sentrix_ID", SentrixPositionColumn = "Sentrix_Position", ctrlSvaPercVar = 0.90, ctrlSvaFlag = 1, scriptLabel = "svaEnmix", dataBaseDir = file.path(preprocessing_inputs$tempDir, "data"), rBaseDir = file.path(preprocessing_inputs$tempDir, "rData"), figureBaseDir = file.path(preprocessing_inputs$tempDir, "figures"), display = FALSE, verbose = FALSE, logs = TRUE, saveOutputs = TRUE ) class(svaPipelineResult$savedFiles) names(svaPipelineResult$savedFiles) basename(unlist(svaPipelineResult$savedFiles, use.names = FALSE)) ``` The returned `savedFiles` object records the main paths written by this step. `names(svaPipelineResult$savedFiles)` identifies the output groups: `svaRData` is the serialized surrogate-variable matrix, `svaCSV` is the CSV version of that matrix, `phenoWithSva` is the updated phenotype table with appended SVA columns, and `dataDir` / `rDir` are the parent directories used for the saved outputs. The `basename(...)` call shows the file names generated from those paths. Arguments: - `phenoFile` and `rgsetData` identify the phenotype table and saved `RGSet` produced by the earlier preprocessing step. - `SampleID`, `SentrixIDColumn`, and `SentrixPositionColumn` specify how samples and array positions are identified in the phenotype table. - `ctrlSvaPercVar = 0.90` keeps enough control-derived surrogate variables to explain 90% of the control-probe variance. - `ctrlSvaFlag = 1` enables the ENmix control-based SVA workflow. - `dataBaseDir` and `rBaseDir` define where file outputs are written when `saveOutputs = TRUE`. - `display = FALSE` and `verbose = FALSE` keep the console output quiet, while `logs = TRUE` writes the log files used by the report. # Write preprocessingPheno outputs to disk `preprocessingPheno()` is the step that usually prepares the largest number of pipeline-ready files because it writes timepoint-specific tables, combined longitudinal tables, and the Clock Foundation export inputs. ### Create example input files The next chunk shows how to recreate the temporary phenotype and matrix files used in this example. The helper writes a phenotype table plus aligned beta, M-value, and copy-number objects to a temporary directory. ```{r preprocessingPheno-pipeline-inputs, message = FALSE, warning = FALSE} pheno_inputs <- dnaEPICO:::examplePreprocessingPhenoStateDnaEpico() names(pheno_inputs) pheno_inputs$tempDir basename(c( pheno_inputs$phenoPath, pheno_inputs$betaPath, pheno_inputs$mPath, pheno_inputs$cnPath )) ``` `pheno_inputs` is a list that bundles the temporary files and the aligned in-memory objects used in this stage. `names(pheno_inputs)` shows that it contains the temporary directory, the phenotype table, the saved beta, M-value, and copy-number files, and precomputed helper objects such as `metricsData`, `timepointData`, `combinedData`, and `clockFoundation`. ```{r preprocessingPheno-pipeline, message = FALSE, warning = FALSE} phenoPipelineResult <- preprocessingPheno( phenoFile = pheno_inputs$phenoPath, betaPath = pheno_inputs$betaPath, mPath = pheno_inputs$mPath, cnPath = pheno_inputs$cnPath, SampleID = "Sample_Name", timeVar = "Timepoint", timepoints = "1,2", combineTimepoints = "1,2", outputPheno = file.path(pheno_inputs$tempDir, "data", "preprocessingPheno"), outputRData = file.path( pheno_inputs$tempDir, "rData", "preprocessingPheno", "metrics" ), outputRDataMerge = file.path( pheno_inputs$tempDir, "rData", "preprocessingPheno", "mergeData" ), sexColumn = "Sex", outputLogs = file.path(pheno_inputs$tempDir, "logs"), outputDir = file.path(pheno_inputs$tempDir, "clockFoundation"), verbose = FALSE, logs = TRUE, saveOutputs = TRUE ) names(phenoPipelineResult$savedFiles) names(phenoPipelineResult$savedFiles$timepoints) basename(unlist(phenoPipelineResult$savedFiles$timepoints[["1"]])) basename(unlist(phenoPipelineResult$savedFiles[c( "combinedPheno", "combinedPhenoBeta", "betaCSV", "phenoCF" )])) ``` These outputs are the files most commonly consumed by downstream modeling functions. `names(phenoPipelineResult$savedFiles)` shows the high-level output groups returned by the function. `names(phenoPipelineResult$savedFiles$timepoints)` lists the available timepoint-specific subsets. The `basename(...)` calls then show examples of the concrete files written for one timepoint and for the combined outputs. In this structure, `combinedPhenoBeta` is the merged phenotype-plus-beta object that becomes the direct input to the GLM and GLMM functions, while `betaCSV`, `betaZIP`, and `phenoCF` are the Clock Foundation exports. Arguments: - `phenoFile`, `betaPath`, `mPath`, and `cnPath` point to the phenotype table and the aligned methylation matrices from the previous workflow stages. - `SampleID = "Sample_Name"` defines the sample key used during alignment. - `timeVar = "Timepoint"`, `timepoints = "1,2"`, and `combineTimepoints = "1,2"` determine the timepoint-specific outputs and the combined longitudinal object. - `outputPheno`, `outputRData`, `outputRDataMerge`, and `outputDir` define the directories used for phenotype exports, metric objects, merged objects, and Clock Foundation inputs. - `saveOutputs = TRUE` is what makes this step useful in a file-based pipeline, because later modeling steps consume these written files. # Write GLM model outputs to disk The modeling function also separate in-memory results from file outputs. The next example runs a small GLM analysis and prints the names of the files written to disk. ### Create example input files The next chunk shows how to recreate the temporary merged phenotype-plus-beta input file used by the GLM example. ```{r glm-pipeline-inputs, eval = can_run_glm, message = FALSE, warning = FALSE} glm_inputs <- dnaEPICO:::exampleMethylationGLMStateDnaEpico() names(glm_inputs) glm_inputs$tempDir basename(glm_inputs$inputPath) file.exists(glm_inputs$inputPath) ``` `glm_inputs` is a list returned by the GLM example helper. `names(glm_inputs)` shows that it contains the temporary directory, the saved merged phenotype-plus-beta input file (`inputPath`), and the corresponding in-memory objects produced during helper construction: `preparedData`, `modelResults`, and `modelSummaries`. ```{r glm-pipeline, eval = can_run_glm, message = FALSE, warning = FALSE} glmPipelineResult <- methylationGLM_T1( inputPheno = glm_inputs$inputPath, phenotypes = "status", covariates = "sex,age", factorVars = "status,sex", cpgLimit = 2, nCores = 1, outputLogs = file.path(glm_inputs$tempDir, "logs"), outputRData = file.path(glm_inputs$tempDir, "rData", "models"), outputPlots = file.path(glm_inputs$tempDir, "figures"), significantCpGDir = file.path(glm_inputs$tempDir, "significant"), summaryTxtDir = file.path(glm_inputs$tempDir, "summaries"), summaryPval = 1, significantCpGPval = 1, annotationPackage = "IlluminaHumanMethylation450kanno.ilmn12.hg19", annotationCols = "Name,chr,pos", annotatedGLMOut = file.path(glm_inputs$tempDir, "annotated"), display = FALSE, verbose = FALSE, logs = TRUE, saveOutputs = TRUE ) class(glmPipelineResult$savedFiles) names(glmPipelineResult$savedFiles) ``` The returned `savedFiles` object records the groups of outputs written by the GLM step. `names(glmPipelineResult$savedFiles)` typically includes the model files, summary files, diagnostic plot files, significant-CpG exports, summary text files, and the annotated results workbook. This is the object to inspect when you want to confirm which modeling outputs were written and how they are grouped. Arguments: - `inputPheno` points to the merged phenotype-plus-beta object generated by `preprocessingPheno()`. - `phenotypes` lists the outcome variables that will be modeled one at a time. - `covariates` lists the adjustment variables included in every model. - `factorVars` identifies the categorical variables that should be treated as factors before model fitting. - `cpgLimit = 2` keeps the example fast by fitting only two CpG models. - `nCores = 1` keeps the example deterministic and lightweight; production HPC runs typically use higher values. - `annotationPackage` and `annotationCols` control which array annotation is merged into the final summary tables. - `outputRData`, `outputPlots`, `significantCpGDir`, `summaryTxtDir`, and `annotatedGLMOut` define where model objects, plots, summaries, and the annotated GLM workbook are written. - `saveOutputs = TRUE` enables the file-based pipeline behaviour expected by Makefile and HPC use. # Write LME model outputs to disk The longitudinal modeling step follows the same pattern as the GLM function, but the saved outputs now correspond to mixed-effects models, longitudinal summary tables, and interaction-specific result files. ### Create example input files The next chunk shows how to recreate the temporary combined longitudinal phenotype-plus-beta input file used by the LME example. ```{r glmm-pipeline-inputs, eval = can_run_glmm, message = FALSE, warning = FALSE} glmm_inputs <- dnaEPICO:::exampleMethylationGLMMStateDnaEpico() names(glmm_inputs) glmm_inputs$tempDir basename(glmm_inputs$inputPath) file.exists(glmm_inputs$inputPath) ``` `glmm_inputs` plays the same role for the longitudinal example. The helper returns a list with the temporary directory, the saved combined longitudinal input file (`inputPath`), and the in-memory objects used to build that example: `preparedData`, `modelResults`, and `modelSummaries`. ```{r glmm-pipeline, eval = can_run_glmm, message = FALSE, warning = FALSE} glmmPipelineResult <- methylationGLMM_T1T2( inputPheno = glmm_inputs$inputPath, outputLogs = file.path(glmm_inputs$tempDir, "logs"), outputRData = file.path(glmm_inputs$tempDir, "rData", "models"), outputPlots = file.path(glmm_inputs$tempDir, "figures"), personVar = "person", timeVar = "Timepoint", phenotypes = "score", covariates = "sex", factorVars = "sex,Timepoint", cpgLimit = 2, nCores = 1, summaryPval = 1, saveSignificantInteractions = TRUE, significantInteractionDir = file.path( glmm_inputs$tempDir, "results", "cpgs", "methylationGLMM_T1T2" ), significantInteractionPval = 1, saveTxtSummaries = TRUE, summaryTxtDir = file.path( glmm_inputs$tempDir, "results", "summary", "methylationGLMM_T1T2" ), annotationPackage = "IlluminaHumanMethylation450kanno.ilmn12.hg19", annotationCols = "Name,chr,pos", annotatedLMEOut = file.path(glmm_inputs$tempDir, "annotated"), display = FALSE, verbose = FALSE, logs = TRUE, saveOutputs = TRUE ) class(glmmPipelineResult$savedFiles) names(glmmPipelineResult$savedFiles) ``` The returned `savedFiles` object records the groups of outputs written by the LME step. `names(glmmPipelineResult$savedFiles)` typically includes the model files, summary files, diagnostic plot files, summary text files, significant interaction exports, and the annotated longitudinal results table. This is the object to inspect when you want to confirm which longitudinal modeling outputs were written and how they are grouped. Arguments: - `inputPheno` points to the combined longitudinal phenotype-plus-beta object. - `personVar` defines the participant identifier used for the random effect. - `timeVar` defines the repeated-measures time variable. - `phenotypes`, `covariates`, and `factorVars` define the fixed-effect portion of the mixed model. - `cpgLimit = 2` keeps the example short by fitting only two CpG models. - `nCores = 1` keeps the example lightweight; production HPC runs usually use higher parallel settings. - `saveSignificantInteractions`, `significantInteractionDir`, and `significantInteractionPval` control the export of interaction-specific results. - `summaryTxtDir` defines where plain-text model summaries are written. - `annotationPackage`, `annotationCols`, and `annotatedLMEOut` define the annotation resources and output location for the final longitudinal summary workbook. - `saveOutputs = TRUE` enables the file-based longitudinal workflow expected by Makefile and HPC use. ## Generate the report from the example outputs After the file-based examples above have been run, `dnamReport()` can assemble the phenotype table, ENmix control figures, quality-control figures, batch effect figures, metrics figures, model annotation tables, and logs into one website report. The example below uses the concrete output paths created by the previous chunks. The preprocessing and SVA examples share `preprocessing_inputs$tempDir`, while the GLM and LME examples write model outputs into their own temporary directories. Because `dnamReport()` accepts one path per tab, those outputs can be passed directly. When running this section from an R session, `dnamReport()` still needs the Quarto command line interface because the function renders the `.qmd` files into the final website. You can check whether R can find Quarto with: ```{r quarto-check-r, eval = FALSE} Sys.which("quarto") system2("quarto", "--version") ``` If `Sys.which("quarto")` returns an empty string, install the Quarto command line interface from the [Quarto get-started page](https://quarto.org/docs/get-started/) before rendering the report. The R package `quarto` provides R helper functions for an existing Quarto installation; it does not replace the Quarto command line interface used by `dnamReport()`. If Quarto is installed but not on `PATH`, set the executable path before calling `dnamReport()`: ```{r quarto-path-r, eval = FALSE} Sys.setenv(QUARTO_BIN = "/full/path/to/quarto") ``` ```{r dnamReport-pipeline-example, eval = FALSE} report_log_dir <- file.path(preprocessing_inputs$tempDir, "logs", "report") dir.create(report_log_dir, recursive = TRUE, showWarnings = FALSE) report_logs <- c( file.path( preprocessing_inputs$tempDir, "logs", "log_preprocessingMinfiEwasWater.txt" ), file.path(pheno_inputs$tempDir, "logs", "log_preprocessingPheno.txt"), file.path(preprocessing_inputs$tempDir, "logs", "log_svaEnmix.txt"), file.path(glm_inputs$tempDir, "logs", "log_methylationGLM_T1.txt"), file.path(glmm_inputs$tempDir, "logs", "log_methylationGLMM_T1T2.txt") ) existing_logs <- report_logs[file.exists(report_logs)] if (length(existing_logs)) { file.copy(existing_logs, report_log_dir, overwrite = TRUE) } reportResult <- dnamReport( outputDir = file.path(preprocessing_inputs$tempDir, "reports", "example"), phenoTab = file.path( preprocessing_inputs$tempDir, "data", "preprocessingMinfiEwasWater", "phenoLC.csv" ), enmixTab = file.path( preprocessing_inputs$tempDir, "figures", "preprocessingMinfiEwasWater", "enmix" ), qcTab = file.path( preprocessing_inputs$tempDir, "figures", "preprocessingMinfiEwasWater", "qc" ), svaTab = file.path(preprocessing_inputs$tempDir, "figures", "svaEnmix"), metricTab = file.path( preprocessing_inputs$tempDir, "figures", "preprocessingMinfiEwasWater", "metrics" ), glmTab = glmPipelineResult$savedFiles$annotatedGLM, lmerTab = glmmPipelineResult$savedFiles$annotatedLME, logTab = report_log_dir, detPPath = file.path( preprocessing_inputs$tempDir, "rData", "preprocessingMinfiEwasWater", "qc", "detP_RGSet.RData" ), detPThreshold = 0.01, verbose = FALSE, logs = TRUE ) reportResult$outputFile ``` ## GLM model structure and PRS terms For each CpG, `methylationGLM_T1()` fits a regression model of the form: $$ \text{Methylation}_{CpG_i} = \beta_0 + \beta_1 \cdot \text{Phenotype} + \sum_{k = 1}^{K} \beta_{k + 1} \cdot \text{Covariate}_k + \varepsilon $$ where $\text{Methylation}_{CpG_i}$ is the methylation value at CpG $i$, $\beta_0$ is the intercept, the phenotype term is the variable of interest, and the remaining fixed effects represent the listed covariates. If no polygenic risk score is included, a model can be read schematically as: $$ \begin{aligned} \text{Methylation}_{CpG_i} =\;& \beta_0 + \beta_1 \cdot \text{Phenotype} + \beta_2 \cdot \text{Age} + \beta_3 \cdot \text{Sex} + \beta_4 \cdot \text{Ethnicity} \\ &+ \varepsilon \end{aligned} $$ When `prsMap` is supplied, a phenotype-specific PRS term is appended only for the matching phenotype. For example, the mapping `"Pheno1:PRS_1,Pheno2:PRS_2"` means that: - models for `Pheno1` include `PRS_1` - models for `Pheno2` include `PRS_2` - phenotypes without a mapping are fit without a PRS term For a phenotype mapped to a PRS, the model becomes: $$ \text{Methylation}_{CpG_i} = \beta_0 + \beta_1 \cdot \text{Phenotype} + \sum_{k = 1}^{K} \beta_{k + 1} \cdot \text{Covariate}_k + \beta_{\text{PRS}} \cdot \text{PRS} + \varepsilon $$ If `interactionTerm` is supplied, the same framework is extended by adding the requested interaction to the fixed-effect portion of the model. ## Longitudinal mixed-effects structure The longitudinal function `methylationGLMM_T1T2()` follows the same file-based pattern, but uses a mixed-effects model with a participant-level random intercept. In schematic form: $$ \begin{aligned} \text{Methylation}_{CpG_i} =\;& \beta_0 + \beta_1 \cdot \text{Phenotype} + \beta_2 \cdot \text{Timepoint} + \beta_3 \cdot (\text{Phenotype} \times \text{Timepoint}) \\ &+ \sum_{k = 1}^{K} \beta_{k + 3} \cdot \text{Covariate}_k + \beta_{\text{PRS}} \cdot \text{PRS} + b_{\text{person}} + \varepsilon \end{aligned} $$ where $b_{\text{person}}$ is the subject-specific random intercept. As in the GLM workflow, the PRS term is included only when the current phenotype is matched in `prsMap`. Common arguments in this longitudinal stage are: - `inputPheno` for the combined longitudinal phenotype-plus-beta object - `personVar` for the participant identifier used in the random effect - `timeVar` for the repeated-measures variable - `phenotypes`, `covariates`, and `factorVars` for the fixed-effect structure - `prsMap` for phenotype-specific PRS terms - `cpgLimit` and `nCores` for computational control - `annotationPackage` and `annotationCols` for annotated summaries # Makefile use In a project directory, the Makefile route is simpler than calling each function by hand. A minimal project layout looks like this: ```text my_project/ Makefile metadata/ pheno_model1.csv data/ preprocessingMinfiEwasWater/ idats/ ``` The exported template should then be edited so that at least the model name, phenotype path, and directory variables match the project. A minimal example is: ```make MODEL = model1 PHENO_FILE = metadata/pheno_model1.csv LOGS_DIR := $(abspath logs) DATA_DIR = data RDATA_DIR = rData RESULTS_DIR = results FIGURES_DIR = figures ``` With that structure in place, a typical sequence of commands looks like this: ```bash make step1 MODEL=model1 make step2 MODEL=model1 make step3 MODEL=model1 make f3 MODEL=model1 make f4 MODEL=model1 make f3lme MODEL=model1 make all MODEL=model1 make status MODEL=model1 make clean MODEL=model1 ``` These targets correspond to the grouped outputs defined in the exported rules: - `make step1 MODEL=model1` runs preprocessing only. - `make step2 MODEL=model1` runs the SVA step using the files written by Step 1. - `make step3 MODEL=model1` runs phenotype preparation using the files written by Step 1. - `make f3 MODEL=model1` runs preprocessing, SVA, phenotype preparation, and the report target defined by the current rules. - `make f4 MODEL=model1` runs the same route, adds the GLM step, and refreshes the report target. - `make f3lme MODEL=model1` runs the same route, adds the longitudinal LME step, and refreshes the report target. - `make all MODEL=model1` runs the full pipeline, including both modeling steps and the report. - `make status MODEL=model1` checks which expected output files are already present. - `make clean MODEL=model1` removes the files generated for the selected model while keeping the shared raw input area. If several models are declared in `MODELS`, the grouped multi-model targets can be used instead: ```bash make f3_models make f4_models make f3lme_models make models ``` These grouped targets loop over every model listed in `MODELS`: - `make f3_models` runs the `f3` route for each declared model. - `make f4_models` runs the `f4` route for each declared model. - `make f3lme_models` runs the `f3lme` route for each declared model. - `make models` runs the full `all` pipeline for each declared model. Two maintenance targets are also useful during routine work: ```bash make status MODEL=model1 make clean MODEL=model1 make clean MODEL=all ``` - `make status MODEL=model1` reports which expected outputs are already present for one model. - `make clean MODEL=model1` removes the generated outputs for one model. - `make clean MODEL=all` removes the generated outputs for every model listed in `MODELS`. The exact target names and variables depend on the Makefile settings extracted for the project. In practice, `saveOutputs = TRUE` is what makes the individual steps compatible with this route because later steps consume the files written by earlier ones. ## Report generation The exported Makefile contains a report rule whose output is `$(STEP6)/$(MODEL)/docs/index.html`. With the default directory settings, this becomes `reports//docs/index.html`. Users usually do not need to call that file target directly: the grouped targets above depend on it, so Make refreshes the report automatically after the required pipeline outputs are available. For example, the `f3` route builds the report from the Data, ENmix QC, Quality Control, Batch Effect, Metrics, Report, and Logs inputs. The `f4`, `f3lme`, and `all` routes additionally make the GLM and LMER tables available when the corresponding model outputs exist. The report target reads the same types of outputs generated in the examples above, organised under the Makefile project layout: the phenotype table in `data/`, quality-control and metric figures in `figures/`, the detection P-value object in `rData/`, model annotation tables in `data/`, and workflow logs in `logs/`. The generated site is opened from the report target path: ```text $(STEP6)/$(MODEL)/docs/index.html ``` Internally, the Makefile calls `dnamReport()` with one argument per tab, using the project paths defined by the Makefile variables. The report target requires the Quarto command line interface in the same environment where `make` is run. If Quarto is not already available, install the Quarto command line interface from the [Quarto get-started page](https://quarto.org/docs/get-started/), or use the installation method provided by the local computing environment. Before launching the report target, check: ```bash quarto --version ``` If Quarto is installed outside `PATH`, export its executable path before running `make`: ```bash export QUARTO_BIN=/full/path/to/quarto ``` # Full exported Makefile The full Makefile exported by `extractMake()` is shown below. This is the template that should be adapted for a project before running the pipeline in the target analysis environment. ```{r makefile-template-full, echo = FALSE, results = "asis"} makefile_lines <- readLines(makefilePath, warn = FALSE) cat("```make\n", paste(makefile_lines, collapse = "\n"), "\n```\n", sep = "") ``` # Makefile section guide The exported Makefile is organised into sections so that project-specific choices can be edited without changing the pipeline rules themselves. ## User configuration - `MAKEFLAGS += --output-sync` keeps parallel Make output grouped by recipe, which makes logs easier to read. ## Models selection and parallel run - `MODEL` defines a single pipeline configuration to run. - `MODELS` defines a set of configurations that can be launched in parallel by the grouped Make targets. In single-model use, commands such as `make f4 MODEL=model1` run one workflow. In multi-model use, commands such as `make f4_models` evaluate the same template for each name listed in `MODELS`. ## Per-model overrides - `PHENO_FILE` is reassigned according to the selected `MODEL`. This block is useful when different cohorts, case definitions, or phenotype encodings should reuse the same analysis pipeline without duplicating the full Makefile. ## Directories - `LOGS_DIR`, `DATA_DIR`, `RDATA_DIR`, `RESULTS_DIR`, and `FIGURES_DIR` define the top-level folder structure. - `STEP1` to `STEP6` define the canonical names of the major pipeline stages. - `METRICS_DIR`, `IDAT_DIR`, `OBJ_DIR`, `MERGE_DIR`, `MODEL_DIR`, `CPG_DIR`, `SUMMARY_DIR`, `ENMIX_DIR`, `SVA_DIR`, and `QC_DIR` define the subdirectory names used by the rules file. - `R_DIR` optionally points to a custom R library path on shared systems. When set to `NULL`, the default library path is used. These variables control where files are written and how paths are composed across all stages of the pipeline. ## Global parameters These parameters are shared across multiple workflow steps. ### Step 1 to Step 5 shared parameters - `PHENO_FILE`: input phenotype table used to start the workflow. - `SEP_TYPE`: separator used when reading text files. - `SAMPLE_ID`: phenotype column used to align samples. - `N_SAMPLES`: optional sample subset for testing; `NA` uses all samples. - `ARRAY_TYPE` and `ANNOTATION_VERSION`: array platform and annotation build. - `TIFF_WIDTH`, `TIFF_HEIGHT`, and `TIFF_RES`: default plot dimensions and resolution. - `SEX_COLUMN`: phenotype column used for sex-based checks and prediction. - `SENTRIX_ID_COLUMN` and `SENTRIX_POSITION_COLUMN`: chip-position metadata used by the SVA stage. - `BASENAME_COLUMN`: column holding the array basename. - `TIME_VAR`: phenotype column used for repeated-measures or longitudinal workflows. - `PHENO_ORDER`: desired order of key phenotype columns in exported tables. ### Step 4 and Step 5 shared modeling parameters - `COVARIATES`: comma-separated adjustment variables included in each model. - `FACTOR_VARS`: variables that should be treated as categorical before model fitting. - `CPG_PREFIX`: prefix used to identify CpG columns. - `CPG_LIMIT`: optional cap on the number of CpGs processed; `NA` uses all. - `PRS_MAP`: mapping from phenotype names to PRS variables. - `SUMMARY_PVAL`: optional p-value cutoff for summary tables. - `N_CORES`: number of cores used within each R job. - `SAVE_TXT_SUMMARIES`: whether plain-text summaries should be written. - `CHUNK_SIZE`: number of CpGs processed per chunk. - `FDR_THRESHOLD`: false-discovery-rate threshold used for filtering. - `PADJ_METHOD`: multiple-testing correction method. - `ANNOTATION_PACKAGE`: array annotation package used for annotated summaries. - `ANNOTATION_COLS`: annotation fields to merge into output tables. ## Step 1 parameters - `QC_CUTOFF`: sample-quality threshold applied during preprocessing. - `DET_PTYPE`: detection p-value method. - `DET_PTHRESHOLD`: detection p-value cutoff. - `PVAL_THRESHOLD`: probe-level p-value filter applied after preprocessing. - `CHR_TO_REMOVE`: chromosomes to exclude from downstream matrices. - `SNPS_TO_REMOVE`: SNP-related probe categories to remove. - `CROSS_REACTIVE_FILE`: cross-reactive probe reference file. - `MAF_THRESHOLD`: minor-allele-frequency threshold used during SNP filtering. - `PLOT_GROUP_VAR`: phenotype variable used for grouped QC plots. - `LC_REF`: reference panel used for cell-type estimation. These values control the main preprocessing and filtering choices in `preprocessingMinfiEwasWater()`. ## Step 2 parameters - `CTRL_SVA_PERC_VAR`: target proportion of control-probe variance explained by the retained surrogate variables. - `CTRL_SVA_FLAG`: enables the control-based ENmix SVA workflow. These values determine how `svaEnmix()` estimates and retains surrogate variables. ## Step 3 parameters - `TIMEPOINTS`: comma-separated timepoints exported separately. - `COMBINE_TIMEPOINTS`: comma-separated timepoints combined into the longitudinal object. These values determine how `preprocessingPheno()` splits and recombines the phenotype-methylation data. ## Step 4 parameters - `PHENOTYPES_GLM`: comma-separated phenotype variables that will be modelled one at a time by `methylationGLM_T1()`. - `GLM_LIBS`: GLM implementation used by `methylationGLM_T1()`. - `INTERACTION_GLM`: optional interaction term added to the fixed-effect part of the GLM. - `SUMMARY_RESIDUAL_SD`: whether residual standard deviations are reported in summaries. - `SAVE_SIGNIFICANT_CPGS`: whether tables of significant CpGs are exported. - `SIGNIFICANT_CPG_PVAL`: p-value threshold used for those significant-CpG exports. ## Step 5 parameters - `PHENOTYPES_LME`: comma-separated phenotype variables that will be modelled one at a time by `methylationGLMM_T1T2()`. - `PERSON_VAR`: participant identifier used as the random effect grouping factor. - `LME_LIBS`: mixed-effects libraries used by `methylationGLMM_T1T2()`. - `INTERACTION_LME`: optional interaction term used in the longitudinal model. - `SAVE_SIGNIFICANT_INTERACTIONS`: whether significant interaction tables are exported. - `SIGNIFICANT_INTERACTION_PVAL`: p-value threshold for those exported interaction results. ## Argument normalisation - `PRS_MAP_ARG`, `R_DIR_ARG`, `INTERACTION_GLM_ARG`, and `INTERACTION_LME_ARG` convert Makefile string values into arguments that can be passed safely to R. This section is especially important for values such as `NULL`, because it prevents the literal string `"NULL"` from being interpreted as a real model value inside the main functions. ## Include pipeline from dnaEPICO - `DNAPIPE_MK` resolves the packaged `Makefile.rules.pipeline` file. - `include $(DNAPIPE_MK)` imports the actual pipeline rules after the user configuration has been defined. This separation is what allows the template Makefile to stay compact while the package keeps the rule implementation in one maintained location. # Summary The file-based route is the correct choice when you need: - a reproducible folder structure, - files that can be reused by later workflow stages, - Makefile execution, or - repeated analyses across one or more models. The interactive route and the file-based route are therefore complementary: local use is best for understanding the returned objects, while pipeline use is best for reproducible multi-step execution. # Basics Date the vignette was generated. ```{r, echo = FALSE} Sys.time() ``` Wallclock time spent generating the vignette. ```{r, echo = FALSE} totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits = 3) ``` `R` session information. ```{r, echo = FALSE} sessionInfo() ``` ## Asking for help As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve, so it is critical to learn where to ask for help. We would like to highlight the [Bioconductor support site](https://support.bioconductor.org/) as the main resource for getting help. Please remember to use the `dnaEPICO` tag and check the [older posts](https://support.bioconductor.org/tag/dnaEPICO/). If you want to receive help, please provide a small reproducible example and your session information so the source of the problem can be tracked efficiently. # References