---
title: "Pipeline Use of dnaEPICO"
author:
  - name: Paul Ruiz
    affiliation:
      - Queensland University of Technology
    email: ruizpint@qut.edu.au
  - name: Divya Mehta
    affiliation:
      - Queensland University of Technology
output:
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
date: "`r Sys.Date()`"
package: dnaEPICO
bibliography: dnaEPICO.bib
vignette: >
  %\VignetteIndexEntry{3. Pipeline Use of dnaEPICO}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r, include = FALSE}
startTime <- Sys.time()
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  crop = NULL
)

library(dnaEPICO)

can_run_preprocessing <- requireNamespace("minfiData", quietly = TRUE) &&
  requireNamespace("IlluminaHumanMethylation450kmanifest", quietly = TRUE) &&
  requireNamespace("IlluminaHumanMethylation450kanno.ilmn12.hg19", quietly = TRUE)

can_run_sva <- can_run_preprocessing
can_run_glm <- requireNamespace(
  "IlluminaHumanMethylation450kanno.ilmn12.hg19",
  quietly = TRUE
)
can_run_glmm <- can_run_glm && requireNamespace("lmerTest", quietly = TRUE)
```

# Introduction

This vignette describes the file-based route in `dnaEPICO`. In this mode, the
same main functions still return structured result objects, but they also
write files to disk when `saveOutputs = TRUE`. This route is appropriate when
you want a reproducible folder structure, Makefile-driven execution, or a
pipeline that can be resumed on a local workstation or an HPC system.

# Citation

The `r Biocpkg("dnaEPICO")` package uses methods from several external
packages. Because no single manuscript describes all components, the guidance
below explains how to cite `dnaEPICO` depending on the functions you use.

This citation guidance is adapted from the vignettes and user guides of
[minfi](https://github.com/hansenlab/minfi/blob/devel/vignettes/minfi.Rmd),
[ENmix](https://github.com/xuz1/ENmix/blob/master/vignettes/ENmix.Rmd), and
[wateRmelon](https://github.com/schalkwyk/wateRmelon/blob/master/vignettes/references.bib).

- If you use `make f3 MODEL=model1`, please cite @minfi; @minfiEPIC; @ENmix;
  @wateRmelon; @SWAN; @funnorm; @noob; @quantile; @ewastools.
- If you use `make f4 MODEL=model1`, please cite @minfi; @minfiEPIC; @ENmix;
  @wateRmelon; @SWAN; @funnorm; @noob; @quantile; @ewastools; @glm2.
- If you use `make f3lme MODEL=model1`, please cite @minfi; @minfiEPIC;
  @ENmix; @wateRmelon; @SWAN; @funnorm; @noob; @quantile; @ewastools; @lme4.
- If you use `make all MODEL=model1`, please cite @minfi; @minfiEPIC; @ENmix;
  @wateRmelon; @SWAN; @funnorm; @noob; @quantile; @ewastools; @glm2; @lme4.

# Required knowledge

`r Biocpkg("dnaEPICO")` is built on core Bioconductor infrastructure for
high-dimensional genomic data, with a focus on Illumina DNA methylation arrays.
To read this vignette, we assume that you are already familiar with the general
DNA methylation workflow. If not, we recommend first reading
[this tutorial](https://paulyrp.github.io/2025-cpgpneurogenomics-workshop/tutorial.html),
which provides a practical introduction to the main concepts and analysis steps.

Preprocessing and quality control are performed using established Bioconductor
tools, including `r Biocpkg("minfi")`, `r Biocpkg("ENmix")`, and
`r Biocpkg("wateRmelon")`. Downstream statistical modelling relies on base R
and CRAN frameworks, including generalised linear models and linear
mixed-effects models. Users are expected to have basic familiarity with R,
command-line execution, and Illumina IDAT file structures.

If you are asking yourself the question "Where do I start using
Bioconductor?", you might be interested in
[this blog post](https://www.bioconductor.org/install/).

# Installation

The `r Biocpkg("dnaEPICO")` package depends on `r Biocpkg("minfi")`,
`r Biocpkg("ENmix")`, and `r Biocpkg("wateRmelon")`, among others. Install
dependencies with BiocManager:

```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install("dnaEPICO")

BiocManager::valid()
```

```{r, eval = FALSE, message = FALSE}
library("dnaEPICO")
```

# Extract the example Makefile

The package includes a template Makefile that can be copied into a project
directory and adapted to local paths and model definitions.

```{r makefile-example}
makefilePath <- extractMake(destDir = tempdir(), overwrite = TRUE)
basename(makefilePath)
```

After extraction, the file should be edited so that the project-specific paths,
model labels, and cluster settings match the target analysis environment.

Arguments:

- `destDir = tempdir()` writes the template to a temporary directory so the
  example does not modify the working tree.
- `overwrite = TRUE` allows the template to be replaced if it already exists in
  that temporary location.

# Write preprocessingMinfiEwasWater outputs to disk

The file-based route usually starts with `preprocessingMinfiEwasWater()`. This
step writes the filtered `RGSet`, normalized metric matrices, quality-control
figures, and the `phenoLC.csv` file that will be consumed by `svaEnmix()` and
`preprocessingPheno()`.

### Create example input files

The next chunk shows how to recreate the temporary files used in this example.
It writes a small phenotype table to a temporary directory, copies a small set
of IDAT files, and records the cross-reactive probe reference used during probe
filtering.

```{r preprocessing-pipeline-inputs, eval = can_run_preprocessing, message = FALSE, warning = FALSE}
preprocessing_inputs <- dnaEPICO:::exampleMinfiIdatInputsDnaEpico(n = 6L)

names(preprocessing_inputs)
preprocessing_inputs$tempDir
basename(preprocessing_inputs$phenoFile)
head(basename(list.files(preprocessing_inputs$idatFolder, full.names = TRUE)), 4)
basename(preprocessing_inputs$crossReactivePath)
```

`preprocessing_inputs` is a small list returned by the internal example helper.
`names(preprocessing_inputs)` shows its main elements: `tempDir` is the
temporary working directory, `idatFolder` contains the copied example IDAT
files, `phenoFile` is the phenotype table used by the function, `targets` is the
same phenotype information already loaded into memory, `arrayType` and
`annotationVersion` describe the array platform, and `crossReactivePath` points
to the cross-reactive probe reference used during probe filtering.

```{r preprocessing-pipeline, eval = can_run_preprocessing, message = FALSE, warning = FALSE}
preprocessPipelineResult <- preprocessingMinfiEwasWater(
  phenoFile = preprocessing_inputs$phenoFile,
  idatFolder = preprocessing_inputs$idatFolder,
  outputLogs = file.path(preprocessing_inputs$tempDir, "logs"),
  nSamples = 6,
  SampleID = "Sample_Name",
  arrayType = preprocessing_inputs$arrayType,
  annotationVersion = preprocessing_inputs$annotationVersion,
  scriptLabel = "preprocessingMinfiEwasWater",
  baseDataFolder = file.path(preprocessing_inputs$tempDir, "rData"),
  figureBaseDir = file.path(preprocessing_inputs$tempDir, "figures"),
  detPThreshold = 1,
  normMethods = "quantile",
  sexColumn = "Sex",
  pvalThreshold = 1,
  chrToRemove = "",
  snpsToRemove = "SBE",
  mafThreshold = 1,
  crossReactivePath = preprocessing_inputs$crossReactivePath,
  plotGroupVar = "Sex",
  lcRef = "saliva",
  phenoOrder = "Sample_Name;Sex;Basename;Sentrix_ID;Sentrix_Position",
  lcPhenoDir = file.path(
    preprocessing_inputs$tempDir,
    "data",
    "preprocessingMinfiEwasWater"
  ),
  display = FALSE,
  verbose = FALSE,
  logs = TRUE,
  saveOutputs = TRUE
)

preprocess_paths <- c(
  phenoLC = file.path(
    preprocessing_inputs$tempDir,
    "data",
    "preprocessingMinfiEwasWater",
    "phenoLC.csv"
  ),
  rgset = file.path(
    preprocessing_inputs$tempDir,
    "rData",
    "preprocessingMinfiEwasWater",
    "objects",
    "RGSet.RData"
  ),
  beta = file.path(
    preprocessing_inputs$tempDir,
    "rData",
    "preprocessingMinfiEwasWater",
    "metrics",
    "beta_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData"
  ),
  qc = file.path(
    preprocessing_inputs$tempDir,
    "figures",
    "preprocessingMinfiEwasWater",
    "qc",
    "quality_control(MSet).tiff"
  )
)

class(preprocessPipelineResult)
basename(preprocess_paths)
file.exists(preprocess_paths)
```

The returned object still has class `dnaEPICO_preprocessingMinfiEwasWater`, but
the printed vectors in this chunk focus on the written outputs. In
`preprocess_paths`, `phenoLC` is the phenotype table with estimated cell
proportions, `rgset` is the filtered `RGSet`, `beta` is the saved beta matrix
used later by phenotype preparation, and `qc` is an example quality-control
figure. `basename(preprocess_paths)` shows the file names, while
`file.exists(preprocess_paths)` confirms that each one was written successfully.

Arguments:

- `phenoFile` and `idatFolder` define the phenotype table and the IDAT folder
  used to start the pipeline.
- `SampleID = "Sample_Name"` is the key used to align phenotype rows with the
  methylation objects.
- `arrayType` and `annotationVersion` define the array manifest and annotation
  used by `minfi`.
- `baseDataFolder`, `figureBaseDir`, and `lcPhenoDir` define where serialized
  objects, figures, and `phenoLC.csv` are written.
- `detPThreshold`, `normMethods`, `pvalThreshold`, `chrToRemove`,
  `snpsToRemove`, `mafThreshold`, and `crossReactivePath` define the main
  preprocessing and filtering choices.
- `plotGroupVar`, `lcRef`, and `phenoOrder` control QC grouping, cell-type
  estimation, and the structure of the exported phenotype table.
- `saveOutputs = TRUE` is what turns this function into the first file-producing
  step of the pipeline.

# Write SVA outputs to disk

The next example shows `svaEnmix()` in file-writing mode. The function still
returns a structured result object, but the `savedFiles` element now records the
main output files written to disk.

### Reuse the temporary files written in Step 1

In the file-based route, `svaEnmix()` consumes the `phenoLC.csv` and
`RGSet.RData` files written by `preprocessingMinfiEwasWater()`. The next chunk
shows how those temporary file paths are reconstructed locally.

```{r sva-pipeline-inputs, eval = can_run_sva, message = FALSE, warning = FALSE}
sva_targets_file <- file.path(
  preprocessing_inputs$tempDir,
  "data",
  "preprocessingMinfiEwasWater",
  "phenoLC.csv"
)
sva_rgset_file <- file.path(
  preprocessing_inputs$tempDir,
  "rData",
  "preprocessingMinfiEwasWater",
  "objects",
  "RGSet.RData"
)

basename(c(sva_targets_file, sva_rgset_file))
file.exists(c(sva_targets_file, sva_rgset_file))
```

Unlike the previous helper objects, `sva_targets_file` and `sva_rgset_file` are
plain file paths reconstructed from `preprocessing_inputs$tempDir`. They point
to the concrete outputs written by Step 1: `phenoLC.csv`, which is the
phenotype table augmented with cell-composition estimates, and `RGSet.RData`,
which is the filtered methylation object saved to disk.

```{r sva-pipeline, eval = can_run_sva, message = FALSE, warning = FALSE}
svaPipelineResult <- svaEnmix(
  phenoFile = sva_targets_file,
  rgsetData = sva_rgset_file,
  outputLogs = file.path(preprocessing_inputs$tempDir, "logs"),
  SampleID = "Sample_Name",
  arrayType = "IlluminaHumanMethylation450k",
  annotationVersion = "ilmn12.hg19",
  SentrixIDColumn = "Sentrix_ID",
  SentrixPositionColumn = "Sentrix_Position",
  ctrlSvaPercVar = 0.90,
  ctrlSvaFlag = 1,
  scriptLabel = "svaEnmix",
  dataBaseDir = file.path(preprocessing_inputs$tempDir, "data"),
  rBaseDir = file.path(preprocessing_inputs$tempDir, "rData"),
  figureBaseDir = file.path(preprocessing_inputs$tempDir, "figures"),
  display = FALSE,
  verbose = FALSE,
  logs = TRUE,
  saveOutputs = TRUE
)

class(svaPipelineResult$savedFiles)
names(svaPipelineResult$savedFiles)
basename(unlist(svaPipelineResult$savedFiles, use.names = FALSE))
```

The returned `savedFiles` object records the main paths written by this step.
`names(svaPipelineResult$savedFiles)` identifies the output groups: `svaRData`
is the serialized surrogate-variable matrix, `svaCSV` is the CSV version of
that matrix, `phenoWithSva` is the updated phenotype table with appended SVA
columns, and `dataDir` / `rDir` are the parent directories used for the saved
outputs. The `basename(...)` call shows the file names generated from those
paths.

Arguments:

- `phenoFile` and `rgsetData` identify the phenotype table and saved `RGSet`
  produced by the earlier preprocessing step.
- `SampleID`, `SentrixIDColumn`, and `SentrixPositionColumn` specify how
  samples and array positions are identified in the phenotype table.
- `ctrlSvaPercVar = 0.90` keeps enough control-derived surrogate variables to
  explain 90% of the control-probe variance.
- `ctrlSvaFlag = 1` enables the ENmix control-based SVA workflow.
- `dataBaseDir` and `rBaseDir` define where file outputs are written when
  `saveOutputs = TRUE`.
- `display = FALSE` and `verbose = FALSE` keep the console output quiet, while
  `logs = TRUE` writes the log files used by the report.

# Write preprocessingPheno outputs to disk

`preprocessingPheno()` is the step that usually prepares the largest number of
pipeline-ready files because it writes timepoint-specific tables, combined
longitudinal tables, and the Clock Foundation export inputs.

### Create example input files

The next chunk shows how to recreate the temporary phenotype and matrix files
used in this example. The helper writes a phenotype table plus aligned beta,
M-value, and copy-number objects to a temporary directory.

```{r preprocessingPheno-pipeline-inputs, message = FALSE, warning = FALSE}
pheno_inputs <- dnaEPICO:::examplePreprocessingPhenoStateDnaEpico()

names(pheno_inputs)
pheno_inputs$tempDir
basename(c(
  pheno_inputs$phenoPath,
  pheno_inputs$betaPath,
  pheno_inputs$mPath,
  pheno_inputs$cnPath
))
```

`pheno_inputs` is a list that bundles the temporary files and the aligned
in-memory objects used in this stage. `names(pheno_inputs)` shows that it
contains the temporary directory, the phenotype table, the saved beta, M-value,
and copy-number files, and precomputed helper objects such as `metricsData`,
`timepointData`, `combinedData`, and `clockFoundation`.

```{r preprocessingPheno-pipeline, message = FALSE, warning = FALSE}
phenoPipelineResult <- preprocessingPheno(
  phenoFile = pheno_inputs$phenoPath,
  betaPath = pheno_inputs$betaPath,
  mPath = pheno_inputs$mPath,
  cnPath = pheno_inputs$cnPath,
  SampleID = "Sample_Name",
  timeVar = "Timepoint",
  timepoints = "1,2",
  combineTimepoints = "1,2",
  outputPheno = file.path(pheno_inputs$tempDir, "data", "preprocessingPheno"),
  outputRData = file.path(
    pheno_inputs$tempDir,
    "rData",
    "preprocessingPheno",
    "metrics"
  ),
  outputRDataMerge = file.path(
    pheno_inputs$tempDir,
    "rData",
    "preprocessingPheno",
    "mergeData"
  ),
  sexColumn = "Sex",
  outputLogs = file.path(pheno_inputs$tempDir, "logs"),
  outputDir = file.path(pheno_inputs$tempDir, "clockFoundation"),
  verbose = FALSE,
  logs = TRUE,
  saveOutputs = TRUE
)

names(phenoPipelineResult$savedFiles)
names(phenoPipelineResult$savedFiles$timepoints)
basename(unlist(phenoPipelineResult$savedFiles$timepoints[["1"]]))
basename(unlist(phenoPipelineResult$savedFiles[c(
  "combinedPheno",
  "combinedPhenoBeta",
  "betaCSV",
  "phenoCF"
)]))
```

These outputs are the files most commonly consumed by downstream modeling
functions. `names(phenoPipelineResult$savedFiles)` shows the high-level output
groups returned by the function. `names(phenoPipelineResult$savedFiles$timepoints)`
lists the available timepoint-specific subsets. The `basename(...)` calls then
show examples of the concrete files written for one timepoint and for the
combined outputs. In this structure, `combinedPhenoBeta` is the merged
phenotype-plus-beta object that becomes the direct input to the GLM and GLMM
functions, while `betaCSV`, `betaZIP`, and `phenoCF` are the Clock Foundation
exports.

Arguments:

- `phenoFile`, `betaPath`, `mPath`, and `cnPath` point to the phenotype table
  and the aligned methylation matrices from the previous workflow stages.
- `SampleID = "Sample_Name"` defines the sample key used during alignment.
- `timeVar = "Timepoint"`, `timepoints = "1,2"`, and
  `combineTimepoints = "1,2"` determine the timepoint-specific outputs and the
  combined longitudinal object.
- `outputPheno`, `outputRData`, `outputRDataMerge`, and `outputDir` define the
  directories used for phenotype exports, metric objects, merged objects, and
  Clock Foundation inputs.
- `saveOutputs = TRUE` is what makes this step useful in a file-based pipeline,
  because later modeling steps consume these written files.

# Write GLM model outputs to disk

The modeling function also separate in-memory results from file outputs. The
next example runs a small GLM analysis and prints the names of the files written
to disk.

### Create example input files

The next chunk shows how to recreate the temporary merged phenotype-plus-beta
input file used by the GLM example.

```{r glm-pipeline-inputs, eval = can_run_glm, message = FALSE, warning = FALSE}
glm_inputs <- dnaEPICO:::exampleMethylationGLMStateDnaEpico()

names(glm_inputs)
glm_inputs$tempDir
basename(glm_inputs$inputPath)
file.exists(glm_inputs$inputPath)
```

`glm_inputs` is a list returned by the GLM example helper. `names(glm_inputs)`
shows that it contains the temporary directory, the saved merged
phenotype-plus-beta input file (`inputPath`), and the corresponding in-memory
objects produced during helper construction: `preparedData`, `modelResults`, and
`modelSummaries`.

```{r glm-pipeline, eval = can_run_glm, message = FALSE, warning = FALSE}
glmPipelineResult <- methylationGLM_T1(
  inputPheno = glm_inputs$inputPath,
  phenotypes = "status",
  covariates = "sex,age",
  factorVars = "status,sex",
  cpgLimit = 2,
  nCores = 1,
  outputLogs = file.path(glm_inputs$tempDir, "logs"),
  outputRData = file.path(glm_inputs$tempDir, "rData", "models"),
  outputPlots = file.path(glm_inputs$tempDir, "figures"),
  significantCpGDir = file.path(glm_inputs$tempDir, "significant"),
  summaryTxtDir = file.path(glm_inputs$tempDir, "summaries"),
  summaryPval = 1,
  significantCpGPval = 1,
  annotationPackage = "IlluminaHumanMethylation450kanno.ilmn12.hg19",
  annotationCols = "Name,chr,pos",
  annotatedGLMOut = file.path(glm_inputs$tempDir, "annotated"),
  display = FALSE,
  verbose = FALSE,
  logs = TRUE,
  saveOutputs = TRUE
)

class(glmPipelineResult$savedFiles)
names(glmPipelineResult$savedFiles)
```

The returned `savedFiles` object records the groups of outputs written by the
GLM step. `names(glmPipelineResult$savedFiles)` typically includes the model
files, summary files, diagnostic plot files, significant-CpG exports, summary
text files, and the annotated results workbook. This is the object to inspect when
you want to confirm which modeling outputs were written and how they are
grouped.

Arguments:

- `inputPheno` points to the merged phenotype-plus-beta object generated by
  `preprocessingPheno()`.
- `phenotypes` lists the outcome variables that will be modeled one at a time.
- `covariates` lists the adjustment variables included in every model.
- `factorVars` identifies the categorical variables that should be treated as
  factors before model fitting.
- `cpgLimit = 2` keeps the example fast by fitting only two CpG models.
- `nCores = 1` keeps the example deterministic and lightweight; production HPC
  runs typically use higher values.
- `annotationPackage` and `annotationCols` control which array annotation is
  merged into the final summary tables.
- `outputRData`, `outputPlots`, `significantCpGDir`, `summaryTxtDir`, and
  `annotatedGLMOut` define where model objects, plots, summaries, and the
  annotated GLM workbook are written.
- `saveOutputs = TRUE` enables the file-based pipeline behaviour expected by
  Makefile and HPC use.

# Write LME model outputs to disk

The longitudinal modeling step follows the same pattern as the GLM function, but
the saved outputs now correspond to mixed-effects models, longitudinal summary
tables, and interaction-specific result files.

### Create example input files

The next chunk shows how to recreate the temporary combined longitudinal
phenotype-plus-beta input file used by the LME example.

```{r glmm-pipeline-inputs, eval = can_run_glmm, message = FALSE, warning = FALSE}
glmm_inputs <- dnaEPICO:::exampleMethylationGLMMStateDnaEpico()

names(glmm_inputs)
glmm_inputs$tempDir
basename(glmm_inputs$inputPath)
file.exists(glmm_inputs$inputPath)
```

`glmm_inputs` plays the same role for the longitudinal example. The helper
returns a list with the temporary directory, the saved combined longitudinal
input file (`inputPath`), and the in-memory objects used to build that example:
`preparedData`, `modelResults`, and `modelSummaries`.

```{r glmm-pipeline, eval = can_run_glmm, message = FALSE, warning = FALSE}
glmmPipelineResult <- methylationGLMM_T1T2(
  inputPheno = glmm_inputs$inputPath,
  outputLogs = file.path(glmm_inputs$tempDir, "logs"),
  outputRData = file.path(glmm_inputs$tempDir, "rData", "models"),
  outputPlots = file.path(glmm_inputs$tempDir, "figures"),
  personVar = "person",
  timeVar = "Timepoint",
  phenotypes = "score",
  covariates = "sex",
  factorVars = "sex,Timepoint",
  cpgLimit = 2,
  nCores = 1,
  summaryPval = 1,
  saveSignificantInteractions = TRUE,
  significantInteractionDir = file.path(
    glmm_inputs$tempDir,
    "results",
    "cpgs",
    "methylationGLMM_T1T2"
  ),
  significantInteractionPval = 1,
  saveTxtSummaries = TRUE,
  summaryTxtDir = file.path(
    glmm_inputs$tempDir,
    "results",
    "summary",
    "methylationGLMM_T1T2"
  ),
  annotationPackage = "IlluminaHumanMethylation450kanno.ilmn12.hg19",
  annotationCols = "Name,chr,pos",
  annotatedLMEOut = file.path(glmm_inputs$tempDir, "annotated"),
  display = FALSE,
  verbose = FALSE,
  logs = TRUE,
  saveOutputs = TRUE
)

class(glmmPipelineResult$savedFiles)
names(glmmPipelineResult$savedFiles)
```

The returned `savedFiles` object records the groups of outputs written by the
LME step. `names(glmmPipelineResult$savedFiles)` typically includes the model
files, summary files, diagnostic plot files, summary text files, significant
interaction exports, and the annotated longitudinal results table. This is the
object to inspect when you want to confirm which longitudinal modeling outputs
were written and how they are grouped.

Arguments:

- `inputPheno` points to the combined longitudinal phenotype-plus-beta object.
- `personVar` defines the participant identifier used for the random effect.
- `timeVar` defines the repeated-measures time variable.
- `phenotypes`, `covariates`, and `factorVars` define the fixed-effect portion
  of the mixed model.
- `cpgLimit = 2` keeps the example short by fitting only two CpG models.
- `nCores = 1` keeps the example lightweight; production HPC runs usually use
  higher parallel settings.
- `saveSignificantInteractions`, `significantInteractionDir`, and
  `significantInteractionPval` control the export of interaction-specific
  results.
- `summaryTxtDir` defines where plain-text model summaries are written.
- `annotationPackage`, `annotationCols`, and `annotatedLMEOut` define the
  annotation resources and output location for the final longitudinal summary
  workbook.
- `saveOutputs = TRUE` enables the file-based longitudinal workflow expected by
  Makefile and HPC use.

## Generate the report from the example outputs

After the file-based examples above have been run, `dnamReport()` can assemble
the phenotype table, ENmix control figures, quality-control figures, batch
effect figures, metrics figures, model annotation tables, and logs into one
website report. The example below uses the concrete output paths created by the
previous chunks.

The preprocessing and SVA examples share `preprocessing_inputs$tempDir`, while
the GLM and LME examples write model outputs into their own temporary
directories. Because `dnamReport()` accepts one path per tab, those outputs can
be passed directly.

When running this section from an R session, `dnamReport()` still needs the
Quarto command line interface because the function renders the `.qmd` files into
the final website. You can check whether R can find Quarto with:

```{r quarto-check-r, eval = FALSE}
Sys.which("quarto")
system2("quarto", "--version")
```

If `Sys.which("quarto")` returns an empty string, install the Quarto command
line interface from the
[Quarto get-started page](https://quarto.org/docs/get-started/) before
rendering the report. The R package `quarto` provides R helper functions for an
existing Quarto installation; it does not replace the Quarto command line
interface used by `dnamReport()`.

If Quarto is installed but not on `PATH`, set the executable path before calling
`dnamReport()`:

```{r quarto-path-r, eval = FALSE}
Sys.setenv(QUARTO_BIN = "/full/path/to/quarto")
```

```{r dnamReport-pipeline-example, eval = FALSE}
report_log_dir <- file.path(preprocessing_inputs$tempDir, "logs", "report")
dir.create(report_log_dir, recursive = TRUE, showWarnings = FALSE)

report_logs <- c(
  file.path(
    preprocessing_inputs$tempDir,
    "logs",
    "log_preprocessingMinfiEwasWater.txt"
  ),
  file.path(pheno_inputs$tempDir, "logs", "log_preprocessingPheno.txt"),
  file.path(preprocessing_inputs$tempDir, "logs", "log_svaEnmix.txt"),
  file.path(glm_inputs$tempDir, "logs", "log_methylationGLM_T1.txt"),
  file.path(glmm_inputs$tempDir, "logs", "log_methylationGLMM_T1T2.txt")
)

existing_logs <- report_logs[file.exists(report_logs)]
if (length(existing_logs)) {
  file.copy(existing_logs, report_log_dir, overwrite = TRUE)
}

reportResult <- dnamReport(
  outputDir = file.path(preprocessing_inputs$tempDir, "reports", "example"),
  phenoTab = file.path(
    preprocessing_inputs$tempDir,
    "data",
    "preprocessingMinfiEwasWater",
    "phenoLC.csv"
  ),
  enmixTab = file.path(
    preprocessing_inputs$tempDir,
    "figures",
    "preprocessingMinfiEwasWater",
    "enmix"
  ),
  qcTab = file.path(
    preprocessing_inputs$tempDir,
    "figures",
    "preprocessingMinfiEwasWater",
    "qc"
  ),
  svaTab = file.path(preprocessing_inputs$tempDir, "figures", "svaEnmix"),
  metricTab = file.path(
    preprocessing_inputs$tempDir,
    "figures",
    "preprocessingMinfiEwasWater",
    "metrics"
  ),
  glmTab = glmPipelineResult$savedFiles$annotatedGLM,
  lmerTab = glmmPipelineResult$savedFiles$annotatedLME,
  logTab = report_log_dir,
  detPPath = file.path(
    preprocessing_inputs$tempDir,
    "rData",
    "preprocessingMinfiEwasWater",
    "qc",
    "detP_RGSet.RData"
  ),
  detPThreshold = 0.01,
  verbose = FALSE,
  logs = TRUE
)

reportResult$outputFile
```

## GLM model structure and PRS terms

For each CpG, `methylationGLM_T1()` fits a regression model of the form:

$$
\text{Methylation}_{CpG_i} =
\beta_0 +
\beta_1 \cdot \text{Phenotype} +
\sum_{k = 1}^{K} \beta_{k + 1} \cdot \text{Covariate}_k +
\varepsilon
$$

where $\text{Methylation}_{CpG_i}$ is the methylation value at CpG $i$,
$\beta_0$ is the intercept, the phenotype term is the variable of interest, and
the remaining fixed effects represent the listed covariates.

If no polygenic risk score is included, a model can be read schematically as:

$$
\begin{aligned}
\text{Methylation}_{CpG_i} =\;&
\beta_0 +
\beta_1 \cdot \text{Phenotype} +
\beta_2 \cdot \text{Age} +
\beta_3 \cdot \text{Sex} +
\beta_4 \cdot \text{Ethnicity} \\
&+
\varepsilon
\end{aligned}
$$

When `prsMap` is supplied, a phenotype-specific PRS term is appended only for
the matching phenotype. For example, the mapping
`"Pheno1:PRS_1,Pheno2:PRS_2"` means that:

- models for `Pheno1` include `PRS_1`
- models for `Pheno2` include `PRS_2`
- phenotypes without a mapping are fit without a PRS term

For a phenotype mapped to a PRS, the model becomes:

$$
\text{Methylation}_{CpG_i} =
\beta_0 +
\beta_1 \cdot \text{Phenotype} +
\sum_{k = 1}^{K} \beta_{k + 1} \cdot \text{Covariate}_k +
\beta_{\text{PRS}} \cdot \text{PRS} +
\varepsilon
$$

If `interactionTerm` is supplied, the same framework is extended by adding the
requested interaction to the fixed-effect portion of the model.

## Longitudinal mixed-effects structure

The longitudinal function `methylationGLMM_T1T2()` follows the same file-based
pattern, but uses a mixed-effects model with a participant-level random
intercept. In schematic form:

$$
\begin{aligned}
\text{Methylation}_{CpG_i} =\;&
\beta_0 +
\beta_1 \cdot \text{Phenotype} +
\beta_2 \cdot \text{Timepoint} +
\beta_3 \cdot (\text{Phenotype} \times \text{Timepoint}) \\
&+
\sum_{k = 1}^{K} \beta_{k + 3} \cdot \text{Covariate}_k +
\beta_{\text{PRS}} \cdot \text{PRS} +
b_{\text{person}} +
\varepsilon
\end{aligned}
$$

where $b_{\text{person}}$ is the subject-specific random intercept. As in the
GLM workflow, the PRS term is included only when the current phenotype is
matched in `prsMap`.

Common arguments in this longitudinal stage are:

- `inputPheno` for the combined longitudinal phenotype-plus-beta object
- `personVar` for the participant identifier used in the random effect
- `timeVar` for the repeated-measures variable
- `phenotypes`, `covariates`, and `factorVars` for the fixed-effect structure
- `prsMap` for phenotype-specific PRS terms
- `cpgLimit` and `nCores` for computational control
- `annotationPackage` and `annotationCols` for annotated summaries

# Makefile use

In a project directory, the Makefile route is simpler than calling each
function by hand. A minimal project layout looks like this:

```text
my_project/
  Makefile
  metadata/
    pheno_model1.csv
  data/
    preprocessingMinfiEwasWater/
      idats/
        <IDAT files>
```

The exported template should then be edited so that at least the model name,
phenotype path, and directory variables match the project. A minimal example is:

```make
MODEL = model1
PHENO_FILE = metadata/pheno_model1.csv
LOGS_DIR := $(abspath logs)
DATA_DIR = data
RDATA_DIR = rData
RESULTS_DIR = results
FIGURES_DIR = figures
```

With that structure in place, a typical sequence of commands looks like this:

```bash
make step1 MODEL=model1
make step2 MODEL=model1
make step3 MODEL=model1
make f3 MODEL=model1
make f4 MODEL=model1
make f3lme MODEL=model1
make all MODEL=model1
make status MODEL=model1
make clean MODEL=model1
```

These targets correspond to the grouped outputs defined in the exported rules:

- `make step1 MODEL=model1` runs preprocessing only.
- `make step2 MODEL=model1` runs the SVA step using the files written by Step 1.
- `make step3 MODEL=model1` runs phenotype preparation using the files written by Step 1.
- `make f3 MODEL=model1` runs preprocessing, SVA, phenotype preparation, and the report target defined by the current rules.
- `make f4 MODEL=model1` runs the same route, adds the GLM step, and refreshes the report target.
- `make f3lme MODEL=model1` runs the same route, adds the longitudinal LME step, and refreshes the report target.
- `make all MODEL=model1` runs the full pipeline, including both modeling steps and the report.
- `make status MODEL=model1` checks which expected output files are already present.
- `make clean MODEL=model1` removes the files generated for the selected model while keeping the shared raw input area.

If several models are declared in `MODELS`, the grouped multi-model targets can
be used instead:

```bash
make f3_models
make f4_models
make f3lme_models
make models
```

These grouped targets loop over every model listed in `MODELS`:

- `make f3_models` runs the `f3` route for each declared model.
- `make f4_models` runs the `f4` route for each declared model.
- `make f3lme_models` runs the `f3lme` route for each declared model.
- `make models` runs the full `all` pipeline for each declared model.

Two maintenance targets are also useful during routine work:

```bash
make status MODEL=model1
make clean MODEL=model1
make clean MODEL=all
```

- `make status MODEL=model1` reports which expected outputs are already present for one model.
- `make clean MODEL=model1` removes the generated outputs for one model.
- `make clean MODEL=all` removes the generated outputs for every model listed in `MODELS`.

The exact target names and variables depend on the Makefile settings extracted
for the project. In practice, `saveOutputs = TRUE` is what makes the individual
steps compatible with this route because later steps consume the files written by
earlier ones.

## Report generation

The exported Makefile contains a report rule whose output is
`$(STEP6)/$(MODEL)/docs/index.html`. With the default directory settings, this
becomes `reports/<model>/docs/index.html`. Users usually do not need to call
that file target directly: the grouped targets above depend on it, so Make
refreshes the report automatically after the required pipeline outputs are
available.

For example, the `f3` route builds the report from the Data, ENmix QC, Quality
Control, Batch Effect, Metrics, Report, and Logs inputs. The `f4`, `f3lme`, and
`all` routes additionally make the GLM and LMER tables available when the
corresponding model outputs exist.

The report target reads the same types of outputs generated in the examples
above, organised under the Makefile project layout: the phenotype table in
`data/`, quality-control and metric figures in `figures/`, the detection
P-value object in `rData/`, model annotation tables in `data/`, and workflow
logs in `logs/`.

The generated site is opened from the report target path:

```text
$(STEP6)/$(MODEL)/docs/index.html
```

Internally, the Makefile calls `dnamReport()` with one argument per tab, using
the project paths defined by the Makefile variables. The report target requires
the Quarto command line interface in the same environment where `make` is run.
If Quarto is not already available, install the Quarto command line interface
from the [Quarto get-started page](https://quarto.org/docs/get-started/), or
use the installation method provided by the local computing environment. Before
launching the report target, check:

```bash
quarto --version
```

If Quarto is installed outside `PATH`, export its executable path before running
`make`:

```bash
export QUARTO_BIN=/full/path/to/quarto
```

# Full exported Makefile

The full Makefile exported by `extractMake()` is shown below. This is the
template that should be adapted for a project before running the pipeline in the
target analysis environment.

```{r makefile-template-full, echo = FALSE, results = "asis"}
makefile_lines <- readLines(makefilePath, warn = FALSE)
cat("```make\n", paste(makefile_lines, collapse = "\n"), "\n```\n", sep = "")
```

# Makefile section guide

The exported Makefile is organised into sections so that project-specific
choices can be edited without changing the pipeline rules themselves.

## User configuration

- `MAKEFLAGS += --output-sync` keeps parallel Make output grouped by recipe,
  which makes logs easier to read.

## Models selection and parallel run

- `MODEL` defines a single pipeline configuration to run.
- `MODELS` defines a set of configurations that can be launched in parallel by
  the grouped Make targets.

In single-model use, commands such as `make f4 MODEL=model1` run one workflow.
In multi-model use, commands such as `make f4_models` evaluate the same template
for each name listed in `MODELS`.

## Per-model overrides

- `PHENO_FILE` is reassigned according to the selected `MODEL`.

This block is useful when different cohorts, case definitions, or phenotype
encodings should reuse the same analysis pipeline without duplicating the full
Makefile.

## Directories

- `LOGS_DIR`, `DATA_DIR`, `RDATA_DIR`, `RESULTS_DIR`, and `FIGURES_DIR` define
  the top-level folder structure.
- `STEP1` to `STEP6` define the canonical names of the major pipeline stages.
- `METRICS_DIR`, `IDAT_DIR`, `OBJ_DIR`, `MERGE_DIR`, `MODEL_DIR`, `CPG_DIR`,
  `SUMMARY_DIR`, `ENMIX_DIR`, `SVA_DIR`, and `QC_DIR` define the subdirectory
  names used by the rules file.
- `R_DIR` optionally points to a custom R library path on shared systems. When
  set to `NULL`, the default library path is used.

These variables control where files are written and how paths are composed
across all stages of the pipeline.

## Global parameters

These parameters are shared across multiple workflow steps.

### Step 1 to Step 5 shared parameters

- `PHENO_FILE`: input phenotype table used to start the workflow.
- `SEP_TYPE`: separator used when reading text files.
- `SAMPLE_ID`: phenotype column used to align samples.
- `N_SAMPLES`: optional sample subset for testing; `NA` uses all samples.
- `ARRAY_TYPE` and `ANNOTATION_VERSION`: array platform and annotation build.
- `TIFF_WIDTH`, `TIFF_HEIGHT`, and `TIFF_RES`: default plot dimensions and
  resolution.
- `SEX_COLUMN`: phenotype column used for sex-based checks and prediction.
- `SENTRIX_ID_COLUMN` and `SENTRIX_POSITION_COLUMN`: chip-position metadata
  used by the SVA stage.
- `BASENAME_COLUMN`: column holding the array basename.
- `TIME_VAR`: phenotype column used for repeated-measures or longitudinal
  workflows.
- `PHENO_ORDER`: desired order of key phenotype columns in exported tables.

### Step 4 and Step 5 shared modeling parameters

- `COVARIATES`: comma-separated adjustment variables included in each model.
- `FACTOR_VARS`: variables that should be treated as categorical before model
  fitting.
- `CPG_PREFIX`: prefix used to identify CpG columns.
- `CPG_LIMIT`: optional cap on the number of CpGs processed; `NA` uses all.
- `PRS_MAP`: mapping from phenotype names to PRS variables.
- `SUMMARY_PVAL`: optional p-value cutoff for summary tables.
- `N_CORES`: number of cores used within each R job.
- `SAVE_TXT_SUMMARIES`: whether plain-text summaries should be written.
- `CHUNK_SIZE`: number of CpGs processed per chunk.
- `FDR_THRESHOLD`: false-discovery-rate threshold used for filtering.
- `PADJ_METHOD`: multiple-testing correction method.
- `ANNOTATION_PACKAGE`: array annotation package used for annotated summaries.
- `ANNOTATION_COLS`: annotation fields to merge into output tables.

## Step 1 parameters

- `QC_CUTOFF`: sample-quality threshold applied during preprocessing.
- `DET_PTYPE`: detection p-value method.
- `DET_PTHRESHOLD`: detection p-value cutoff.
- `PVAL_THRESHOLD`: probe-level p-value filter applied after preprocessing.
- `CHR_TO_REMOVE`: chromosomes to exclude from downstream matrices.
- `SNPS_TO_REMOVE`: SNP-related probe categories to remove.
- `CROSS_REACTIVE_FILE`: cross-reactive probe reference file.
- `MAF_THRESHOLD`: minor-allele-frequency threshold used during SNP filtering.
- `PLOT_GROUP_VAR`: phenotype variable used for grouped QC plots.
- `LC_REF`: reference panel used for cell-type estimation.

These values control the main preprocessing and filtering choices in
`preprocessingMinfiEwasWater()`.

## Step 2 parameters

- `CTRL_SVA_PERC_VAR`: target proportion of control-probe variance explained by
  the retained surrogate variables.
- `CTRL_SVA_FLAG`: enables the control-based ENmix SVA workflow.

These values determine how `svaEnmix()` estimates and retains surrogate
variables.

## Step 3 parameters

- `TIMEPOINTS`: comma-separated timepoints exported separately.
- `COMBINE_TIMEPOINTS`: comma-separated timepoints combined into the
  longitudinal object.

These values determine how `preprocessingPheno()` splits and recombines the
phenotype-methylation data.

## Step 4 parameters

- `PHENOTYPES_GLM`: comma-separated phenotype variables that will be modelled
  one at a time by `methylationGLM_T1()`.
- `GLM_LIBS`: GLM implementation used by `methylationGLM_T1()`.
- `INTERACTION_GLM`: optional interaction term added to the fixed-effect part
  of the GLM.
- `SUMMARY_RESIDUAL_SD`: whether residual standard deviations are reported in
  summaries.
- `SAVE_SIGNIFICANT_CPGS`: whether tables of significant CpGs are exported.
- `SIGNIFICANT_CPG_PVAL`: p-value threshold used for those significant-CpG
  exports.

## Step 5 parameters

- `PHENOTYPES_LME`: comma-separated phenotype variables that will be modelled
  one at a time by `methylationGLMM_T1T2()`.
- `PERSON_VAR`: participant identifier used as the random effect grouping
  factor.
- `LME_LIBS`: mixed-effects libraries used by `methylationGLMM_T1T2()`.
- `INTERACTION_LME`: optional interaction term used in the longitudinal model.
- `SAVE_SIGNIFICANT_INTERACTIONS`: whether significant interaction tables are
  exported.
- `SIGNIFICANT_INTERACTION_PVAL`: p-value threshold for those exported
  interaction results.

## Argument normalisation

- `PRS_MAP_ARG`, `R_DIR_ARG`, `INTERACTION_GLM_ARG`, and `INTERACTION_LME_ARG`
  convert Makefile string values into arguments that can be passed safely to R.

This section is especially important for values such as `NULL`, because it
prevents the literal string `"NULL"` from being interpreted as a real model
value inside the main functions.

## Include pipeline from dnaEPICO

- `DNAPIPE_MK` resolves the packaged `Makefile.rules.pipeline` file.
- `include $(DNAPIPE_MK)` imports the actual pipeline rules after the user
  configuration has been defined.

This separation is what allows the template Makefile to stay compact while the
package keeps the rule implementation in one maintained location.

# Summary

The file-based route is the correct choice when you need:

- a reproducible folder structure,
- files that can be reused by later workflow stages,
- Makefile execution, or
- repeated analyses across one or more models.

The interactive route and the file-based route are therefore complementary:
local use is best for understanding the returned objects, while pipeline use is
best for reproducible multi-step execution.

# Basics

Date the vignette was generated.

```{r, echo = FALSE}
Sys.time()
```

Wallclock time spent generating the vignette.

```{r, echo = FALSE}
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits = 3)
```

`R` session information.

```{r, echo = FALSE}
sessionInfo()
```

## Asking for help

As package developers, we try to explain clearly how to use our packages and in
which order to use the functions. But `R` and `Bioconductor` have a steep
learning curve, so it is critical to learn where to ask for help. We would like
to highlight the [Bioconductor support site](https://support.bioconductor.org/)
as the main resource for getting help. Please remember to use the `dnaEPICO`
tag and check the
[older posts](https://support.bioconductor.org/tag/dnaEPICO/). If you want to
receive help, please provide a small reproducible example and your session
information so the source of the problem can be tracked efficiently.

# References