This notebook describes how to run
power analyses for DNAm arrays on user-defined datasets with the
pwrEWAS
method. The original pwrEWAS
method is
available as a Bioconductor package.
There was need to make the original method extensible to new
user-provided datasets, and this vignette describes how to do this with
a lightly modified power analysis function,
pwrEWAS_itable()
.
pwrEWAS_itable()
Source the revised function from GitHub with
source_url()
. This runs the script
pwrEWAS_revised.R
, producing a series of callable functions
in the active R session.
pwrEWAS
requires a set of DNAm summary statistics,
specifically a table with DNAm means (e.g. a column named “mu”) and
variances (e.g. a column named “var”) to inform power analysis
simulations. For this example, get DNAm summary statistics from
minfiData
’s example array data, stored as a
MethylSet
, then use rowMeans()
and
rowVars()
to compute summaries from Beta-values.
library(minfiData)
data("MsetEx") # load MethylSet
ms <- MsetEx
bval <- getBeta(ms) # get DNAm fractions
# get the summary data frame
dfpwr <- data.frame(mu = rowMeans(bval, na.rm = T),
var = rowVars(bval, na.rm = T))
## mu var
## cg19413110 0.04736183 4.913379e-05
## cg15130433 0.04950389 1.548990e-04
## cg05313588 0.77365106 1.749416e-02
## ch.17.795614F 0.13965785 4.563469e-03
## cg03074421 0.57182014 3.423462e-02
## cg22792014 0.90923771 5.471781e-04
The particular samples used to generate the CpG probe summary statistics above can be important. Samples from a specific tissue and/or demographic may yield more relevant information from power analysis for a given experiment design task.
pwrEWAS_itable()
There are numerous parameters for fine-tuning power analyses. For the demonstration runs below, set the following parameter values.
ttype <- dfpwr # tissueType
mintss <- 10 # minTotSampleSize
maxtss <- 1000 # maxTotSampleSize
sstep <- 100 # SampleSizeSteps
tdeltav <- c(0.05, 0.1, 0.2) # targetDelta
dmethod <- "limma" # DMmethod
fdr <- 0.05 # FDRcritVal
nsim <- 20 # sims
j <- 1000 # J
ndmp <- 50 # targetDmCpGs
detlim <- 0.01 # detectionLimit
maxctau <- 100 # maxCnt.tau
ncntper <- 0.5 # NcntPer
These effectively define the power analysis as a series of tests varying samples from a minimum of 10, to a maximum of 230, at intervals of 20 samples, for a total of 12 total max sample sizes tested with evenly distributed experiment groups. For instance, at 10 total samples each experiment group has 5 samples, etc.
Further, simulations use the limma()
function for
differential methylation test, with 50 significant probes expected at an
FDR significance of 0.05 from among 5000 total simulated probes. Mean
DNAm differences between experiment groups are varied across 3 possible
values, either 0.05, 0.1, or 0.2. Finally, each set of test parameters
will be simulated 20 times.
Setting the method parameters as above, run pwrEWAS
on
multiple cores by setting the argument core
to some value
>1. But first set the seed to ensure run reproducibility.
set.seed(0)
lpwr.c2 <- pwrEWAS_itable(core = 2,
tissueType = ttype, minTotSampleSize = mintss,
maxTotSampleSize = maxtss, SampleSizeSteps = sstep,
NcntPer = ncntper, targetDelta = tdeltav, J = j,
targetDmCpGs = ndmp, detectionLimit = detlim,
DMmethod = dmethod, FDRcritVal = fdr,
sims = nsim, maxCnt.tau = maxctau)
# [2022-02-17 13:44:51] Finding tau...done [2022-02-17 13:45:06]
# [1] "The following taus were chosen: 0.013671875, 0.02734375, 0.0546875"
# [2022-02-17 13:45:06] Running simulation
# [2022-02-17 13:45:06] Running simulation ... done [2022-02-17 13:48:23]
The commented status messages show the example run time was about 3:30.
Power analysis results are returned in a list of four objects called
"meanPower"
(a matrix), "powerArray"
(an
array), "deltaArray"
(a list), and "metric"
,
(also a list).
The first object, meanPower
, shows the mean power (cell
values) by total sample size (y-axis, rows, from 10 to 230) and delta
DNAm difference (x-axis, columns) across simulations. The dimensions and
first rows of this table are shown below.
lpwr <- lpwr.c1 # get results from an above example
mp <- lpwr[["meanPower"]] # get the mean power table
The second object is powerArray
, an array of matrices
containing 720 data points. This data is used to calculate the
meanPower
summaries, which can be seen by comparing mean of
the first 20 powerArray
values (e.g. the 20 simulations
where total samples is 10 and delta is 0.05) to the
meanPower
cell [1,1] corresponding to delta = 0.1, total
samples = 10.
## [1] 600
## [1] TRUE
The final objects show various observed values for the delta DNAm (in
the deltaArray
object), and the marginal type I error,
classical power, FDR, FDC, and true positive probability (in the
metric
object).
ggplot2
This section shows how to use the ggplot2
package to
generate publication-ready plot summaries of pwrEWAS
power
analysis results.
To plot the full simulation results with smooths and standard errors,
reformat the array of matrices in the powerArray
object.
Extract the power values according to the dimensions of our simulations
(e.g. 10 sample sizes times 10 simulations times 2 deltas = 200 total
simulations). Finally, coerce and harmonize power values across deltas
to form a tall data frame for plotting.
# extract power values from the array of matrices
parr <- pa
m1 <- data.frame(power = parr[1:200]) # first delta power values
m2 <- data.frame(power = parr[201:400]) # second delta power values
m3 <- data.frame(power = parr[401:600]) # third delta power values
# assign total samples to power values
m1$total.samples <- m2$total.samples <-
m3$total.samples <- rep(seq(10, 910, 100), each = 20)
# add delta labels
m1$`Delta\nDNAm` <- rep("0.05", 200)
m2$`Delta\nDNAm` <- rep("0.1", 200)
m3$`Delta\nDNAm` <- rep("0.2", 200)
# make the tall data frame for plotting
dfp <- rbind(m1, rbind(m2, m3))
Make the final plot using goem_smooth()
, which uses
method=loess
here by default. You can specify other methods
with the method
argument (see ?geom_smooth
for
details). Again, the horizontal line at 80% power is included for
reference.
ggplot(dfp, aes(x = total.samples, y = power, color = `Delta\nDNAm`)) +
geom_smooth(se = T, method = "loess") + theme_bw() + xlab("Total samples") +
ylab("Power") + geom_hline(yintercept = 0.8, color = "black", linetype = "dotted")
Including the standard errors lends some confidence to our findings. First, we can tell there is a great deal of separation between each of the delta models throughout all simulations except at the very lowest total sample sizes. Further, at the highest total sample size the power exceeds 80% with high confidence at delta = 0.2 (e.g. lowest standard error well above reference line), with less confidence where delta = 0.1 (e.g. lowest standard error touches reference line), and not at all where delta = 0.05 (e.g. highest standard error is well below reference line).
This vignette showed how to used a lightly modified implementation of
pwrEWAS
on a custom user-provided DNAm dataset. It showed
how to generate DNAm summary statistics, use these in the
pwrEWAS_itable()
function, identify simulation and summary
outcomes in returned results data, and plot simulation results with
errors using ggplot2
.
Note the values of arguments like J
, nsim
,
and maxtss
can be increased in practice to yield a more
robust power model. The values of arguments including
FDRcritVal
and tdeltav
can further be set
according to the empirical results of perliminary analyses or literature
review to yield more informative results.
More details about the pwrEWAS
method, including more
fuction parameter details, can be found in the function docstrings
(e.g. check ?pwrEWAS
), descriptions in the Bioconductor
package documentation
and in the main study, Graw et al. (2019).
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] limma_3.63.2
## [2] gridExtra_2.3
## [3] knitr_1.49
## [4] recountmethylation_1.17.0
## [5] HDF5Array_1.35.2
## [6] rhdf5_2.51.1
## [7] DelayedArray_0.33.3
## [8] SparseArray_1.7.2
## [9] S4Arrays_1.7.1
## [10] abind_1.4-8
## [11] Matrix_1.7-1
## [12] ggplot2_3.5.1
## [13] minfiDataEPIC_1.32.0
## [14] IlluminaHumanMethylationEPICanno.ilm10b2.hg19_0.6.0
## [15] IlluminaHumanMethylationEPICmanifest_0.3.0
## [16] minfiData_0.52.0
## [17] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
## [18] IlluminaHumanMethylation450kmanifest_0.4.0
## [19] minfi_1.53.1
## [20] bumphunter_1.49.0
## [21] locfit_1.5-9.10
## [22] iterators_1.0.14
## [23] foreach_1.5.2
## [24] Biostrings_2.75.3
## [25] XVector_0.47.1
## [26] SummarizedExperiment_1.37.0
## [27] Biobase_2.67.0
## [28] MatrixGenerics_1.19.0
## [29] matrixStats_1.4.1
## [30] GenomicRanges_1.59.1
## [31] GenomeInfoDb_1.43.2
## [32] IRanges_2.41.2
## [33] S4Vectors_0.45.2
## [34] BiocGenerics_0.53.3
## [35] generics_0.1.3
## [36] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] jsonlite_1.8.9 magrittr_2.0.3
## [5] GenomicFeatures_1.59.1 farver_2.1.2
## [7] rmarkdown_2.29 BiocIO_1.17.1
## [9] zlibbioc_1.52.0 vctrs_0.6.5
## [11] multtest_2.63.0 memoise_2.0.1
## [13] Rsamtools_2.23.1 DelayedMatrixStats_1.29.0
## [15] RCurl_1.98-1.16 askpass_1.2.1
## [17] htmltools_0.5.8.1 curl_6.0.1
## [19] Rhdf5lib_1.29.0 sass_0.4.9
## [21] nor1mix_1.3-3 bslib_0.8.0
## [23] plyr_1.8.9 cachem_1.1.0
## [25] buildtools_1.0.0 GenomicAlignments_1.43.0
## [27] lifecycle_1.0.4 pkgconfig_2.0.3
## [29] R6_2.5.1 fastmap_1.2.0
## [31] GenomeInfoDbData_1.2.13 digest_0.6.37
## [33] colorspace_2.1-1 siggenes_1.81.0
## [35] reshape_0.8.9 AnnotationDbi_1.69.0
## [37] RSQLite_2.3.9 base64_2.0.2
## [39] labeling_0.4.3 mgcv_1.9-1
## [41] httr_1.4.7 compiler_4.4.2
## [43] beanplot_1.3.1 rngtools_1.5.2
## [45] withr_3.0.2 bit64_4.5.2
## [47] BiocParallel_1.41.0 DBI_1.2.3
## [49] MASS_7.3-61 openssl_2.3.0
## [51] rjson_0.2.23 tools_4.4.2
## [53] rentrez_1.2.3 glue_1.8.0
## [55] quadprog_1.5-8 restfulr_0.0.15
## [57] nlme_3.1-166 rhdf5filters_1.19.0
## [59] grid_4.4.2 gtable_0.3.6
## [61] tzdb_0.4.0 preprocessCore_1.69.0
## [63] tidyr_1.3.1 data.table_1.16.4
## [65] hms_1.1.3 xml2_1.3.6
## [67] pillar_1.10.0 genefilter_1.89.0
## [69] splines_4.4.2 dplyr_1.1.4
## [71] lattice_0.22-6 survival_3.8-3
## [73] rtracklayer_1.67.0 bit_4.5.0.1
## [75] GEOquery_2.75.0 annotate_1.85.0
## [77] tidyselect_1.2.1 maketools_1.3.1
## [79] xfun_0.49 scrime_1.3.5
## [81] statmod_1.5.0 UCSC.utils_1.3.0
## [83] yaml_2.3.10 evaluate_1.0.1
## [85] codetools_0.2-20 tibble_3.2.1
## [87] BiocManager_1.30.25 cli_3.6.3
## [89] xtable_1.8-4 munsell_0.5.1
## [91] jquerylib_0.1.4 Rcpp_1.0.13-1
## [93] png_0.1-8 XML_3.99-0.17
## [95] readr_2.1.5 blob_1.2.4
## [97] mclust_6.1.1 doRNG_1.8.6
## [99] sparseMatrixStats_1.19.0 bitops_1.0-9
## [101] scales_1.3.0 illuminaio_0.49.0
## [103] purrr_1.0.2 crayon_1.5.3
## [105] rlang_1.1.4 KEGGREST_1.47.0