RaMWAS provides a complete toolset for methylome-wide association studies (MWAS). It is primarily designed for data from enrichment-based methylation assays, but can be applied to other methylomic data (e.g. Illumina methylation array) as well as other data types like gene expression and genotype data (see vignette).
RaMWAS includes the following major components (steps):
Most steps of RaMWAS are internally parallelized. This is made possible, in part, by the use of the filematrix package for storing the data and accessing it in parallel jobs.
To install the most recent version of RaMWAS please follow
instructions at GitHub.com (R 3.3 or
newer required).
To install the Bioconductor version of RaMWAS (R 3.4 or newer requred)
use the following commands:
The package vignettes and reference manual can be viewed online and with the following commands.
To illustrate the main steps of RaMWAS we first create an artificial data set.
library(ramwas)
dr = paste0(tempdir(), "/simulated_project")
ramwas0createArtificialData(
dir = dr,
nsamples = 20,
nreads = 100e3,
ncpgs = 25e3,
threads = 2)
cat("Project directory:", dr)
## Project directory: /tmp/RtmpE42VVP/simulated_project
Note. The project directory dr
can be
set to a more convenient location when running the code.
The function ramwas0createArtificialData
created the
following files and subdirectories in the project directory.
bams
– directory with 20 BAM filesbam_list.txt
– file with names of all the BAM
filescovariates.txt
– file with age and sex variables for
the samples.Simulated_chromosome.rds
- file with genomic locations
for all CpGs.Each RaMWAS step accepts parameters in the form of a list. Here is the parameter set we will use for all steps below.
param = ramwasParameters(
dirproject = dr,
dirbam = "bams",
filebamlist = "bam_list.txt",
filecpgset = "Simulated_chromosome.rds",
cputhreads = 2,
scoretag = "MAPQ",
minscore = 4,
minfragmentsize = 50,
maxfragmentsize = 250,
minavgcpgcoverage = 0.3,
minnonzerosamples = 0.3,
filecovariates = "covariates.txt",
modelcovariates = NULL,
modeloutcome = "age",
modelPCs = 0,
toppvthreshold = 1e-5,
bihost = "grch37.ensembl.org",
bimart = "ENSEMBL_MART_ENSEMBL",
bidataset = "hsapiens_gene_ensembl",
biattributes = c("hgnc_symbol","entrezgene","strand"),
bifilters = list(with_hgnc_trans_name = TRUE),
biflank = 0,
cvnfolds = 5,
mmalpha = 0,
mmncpgs = c(5, 10, 50, 100, 500, 1000, 2000)
)
We describe the role of each parameter as they are used below. The complete description of all RaMWAS parameters is available in the parameter vignette.
This step scans all BAM files listed in the filebamlist
file, it records read start locations, and calculates a number of
quality control metrics. The BAMs must be located in dirbam
directory.
Reads are filtered by the scoretag
parameter, which is
usually either the “MAPQ” field or “AS” tag in the BAM file (see BAM file
format). Reads with scores below minscore
are
excluded.
The minfragmentsize
and maxfragmentsize
parameters define the minimum and maximum size of DNA fragments that
were sequenced. Please note, these parameters are not equal to the read
length but instead reflect the length of the DNA fragments that were
extracted and sequenced.
The set of CpGs is loaded from the filecpgset
file. For
more information see the CpG set
vignette.
RaMWAS uses cputhreads
CPU cores (parallel jobs) to scan
BAMs. Hard disk speed can be a bottleneck for this step. If BAMs are
stored on a single rotational hard drive, using more than 4 parallel
jobs may not provide further speed improvements.
This creates the following subdirectories in the project directory:
qc
– directory with a number of quality control (QC)
plots, one plot per QC metrix per BAM.edit_distance
– plots showing distribution of edit
distance, i.e. number of mismatches between the aligned read and the
reference genome.isolated_distance
– plots showing distribution of
distances from read starts to isolated CpGs.maxfragmentsize
basepairs away from any other CpG.matched_length
– plots showing distribution of the
length of the reads aligned to the reference genome.score
– plots showing distribution of the score
(scoretag
parameter).coverage_by_density
– plots showing average CpG score
(coverage) as a function of CpG density.rds_rbam
and rds_qc
– directories with
RaMWAS BAM info files (BAM read start locations) and BAM quality control
metrics, one RDS file per BAM.Here is a coverage_by_density
plot for the simulated
data. It shows higher average CpG score (fragment coverage) for regions
with higher CpG densities, up to the saturation point.
pfull = parameterPreprocess(param)
qc = readRDS(paste0(pfull$dirrqc, "/BAM007.qc.rds"))
plot(qc$qc$avg.coverage.by.density)
Note. If a BAM file has previously been scanned, it
will not be scanned again in subsequent calls of
ramwas1scanBams
. This way Step 1 can be rerun multiple
times to efficiently include additional BAMs (samples) when they become
available.
Note. The BAM files are no longer needed after this
step and the RaMWAS BAM info files are 50 to 100 times smaller than the
original BAMs.
This step aggregates quality control metrics across all scanned BAM files, produces a number of summary plots and tables, and estimates fragment size distribution.
In practice, multiple BAM files may correspond to the same sample.
For simplicity, in the example here each BAM corresponds to one sample
with the same name. RaMWAS can be instructed about BAM to sample
correspondence via filebam2sample
or
bam2sample
parameters (See parameter vignette).
The following files and directories are created in the project directory:
Fragment_size_distribution.txt
– text file with
estimated fragment size distribution.qc/Fragment_size_distribution_estimate.pdf
– plot
showing both estimated fragment size distribution and the distribution
of distances from read starts to isolated CpGs.summary_bams
– by BAMs.summary_bams_in_bam2sample
– by BAMs, but only those
listed in filebam2sample
parameter.summary_by_sample
– by sample.summary_total
– total across all BAMs.filebam2sample
.Summary_QC.txt
– table with a number of numerical QC
measures, in an Excel friendly format.Summary_QC_R.txt
– table with a number of numerical QC
measures, in an R friendly format.qclist.rds
– an R file with list of QC objects.Fig_hist_*.pdf
– histograms of several QC measures
across samples.Fig_*.pdf
– PDF files with various QC plots, one page
per BAM or sample, depending on the directory.After exclusion of BAMs and samples not passing QC, step 2 should be rerun. This ensures that fragment size distribution is estimated using selected data only.
The fragment size distribution estimation plot is presented below. The points indicate the number of reads (y-axis) observed at varying distances from isolated CpGs (x-axis). The red line is the parametic fit for these points. For more information on estimation of fragment size distribution see (Oord et al. 2013).
This step creates a CpG score matrix (a.k.a. fragment coverage
matrix) for all samples and all CpGs in the CpG set. The samples are
defined by either filebam2sample
or bam2sample
parameter. The CpGs are defined by the filecpgset
parameter, see CpG set vignette for more
information.
RaMWAS can filter out CpGs with low scores. These CpGs are unmethylated and are unlikely to produce any findings. A CpG is preserved if
minavgcpgcoverage
(default is 0.3).minnonzerosamples
proportion of
samples with nonzero coverageFor each sample, the CpG scores are affected by the number of sequenced DNA fragments, which varies from sample to sample. To remove this variation, the CpG score matrix is normalized to have the same average score for each sample.
This step a new creates directory named coverage_norm_X
in the project directory (X is the number of samples, see Directory names) with the following
files:
Coverage.*
– filematrix with the CpG scores for all
samples and all CpGs that passed the filtering.CpG_locations.*
– filematrix with the location of the
CpGs that passed the threshold.chr
and position
.CpG_chromosome_names.txt
– file with chromosome names
(factor levels) for the integer column chr
in
CpG_locations.*
filematrix.raw_sample_sums.*
– filematrix with total coverage for
each sample before normalization.Note. This step created temporary files in the
dirtemp
directory.
This step performs principal component analysis (PCA) on the CpG score matrix.
PCA is capturing the main directions of unmeasured sources of variation in the data. The main goal of PCA is estimation of laboratory technical variations which can be used as covariates in the association analyses. PCA can be performed after regressing out known covariates as they represent measured sources of variation, which we need not estimate.
Additionally, large sample loadings of the top PCs can indicate multidimensional outlying samples. Such sample may be excluded from the analysis.
If measured covariates are regressed out prior to conducting the PCA,
these covariates are loaded from the file filecovariates
.
The file can be comma or tab-separated, with sample names in the first
column. The artificial data set includes a covariate file
covariates.txt.
The parameter modelcovariates
names the covariates
regressed out before PCA (NULL
for none).
This step creates sub-directory PCA_X_cvrts_Y
in the
directory with score matrix, where X is the number of covariates
regressed out and Y is a code distinguishing different sets with the
same number of covariates (see Directory
names). The sub-directory includes:
covmat.*
, eigenvalues.*
, and
eigenvectors.*
– filematrices with the sample covariance
matrix and its eigenvalue decomposition.PC_values.txt
– principal components scores.PC_loadings.txt
– sample loadings for the top 20
PCs.PC_plot_covariates_removed.pdf
– plots of principal
components scores (i.e. % variance explained on page 1) and sample
loadings for the top 40 PCs (pages 2+).PC_vs_covariates_corr.txt
– correlations of principal
components with phenotypes/covariates (from filecovariates
file).PC_vs_covariates_pvalue.txt
– p-values for these
correlations.The PC plot for artificial data shows one strong component and no outliers in the sample loadings, with first PC clearly capturing sample sex.
eigenvalues = fm.load(paste0(pfull$dirpca, "/eigenvalues"));
eigenvectors = fm.open(
filenamebase = paste0(pfull$dirpca, "/eigenvectors"),
readonly = TRUE);
plotPCvalues(eigenvalues)
The file with correlations shows strong correlation of the top PCs with age and sex.
tblcr = read.table(
file = paste0(pfull$dirpca, "/PC_vs_covs_corr.txt"),
header = TRUE,
sep = "\t")
pander(head(tblcr, 3))
name | age | sex |
---|---|---|
PC1 | -0.28 | 0.997 |
PC2 | 0.723 | 0.0335 |
PC3 | -0.184 | 0.0281 |
The file with p-values indicate statistical significance of these correlations.
Table:
Correlations in PC_vs_covariates_corr.txt
file.
tblpv = read.table(
file = paste0(pfull$dirpca, "/PC_vs_covs_pvalue.txt"),
header = TRUE,
sep = "\t")
pander(head(tblpv, 3))
name | age | sex |
---|---|---|
PC1 | 0.231 | 1.62e-21 |
PC2 | 0.000319 | 0.889 |
PC3 | 0.437 | 0.906 |
This step performs tests for association between normalized CpG
scores and the outcome variable named by modeloutcome
parameter. The analysis corrects for covariates listed in
modelcovariates
parameter and a number of top principal
components (modelPCs
). Tests are performed using the linear
regression model if the outcome variable is numeric, ANOVA if
categorical.
All results are saved in a filematrix. Findings passing the
toppvthreshold
p-value threshold are recorded in a text
file.
This step creates a sub-directory in the PCA directory named
Testing_X_Y_PCs
, where X is the name of the outcome
variable and Y is the number of top PCs included as covariates (see Directory names). The sub-directory
includes:
QQ_plot.pdf
– QQ-plot with a confidence band and
inflation factor lambda (median of the chi-squared test statistics
divided by the expected median of the chi-squared distribution under
null).Top_tests.txt
– text file with the top findings.Stats_and_pvalues.*
– filematrix with MWAS results. The
columns include test statistic, p-value, and q-value (calculated using
Benjamini-Hochberg Procedure (Benjamini and
Hochberg 1995)). The rows match the CpGs of the coverage
matrix.DegreesOfFreedom.txt
– file with the numbers of degrees
of freedom for the t- or F test used for testing.For the simulated data the QQ-plot for age shows moderate deviation from the diagonal for many CpGs, suggesting many markers with small effects. This is consistent with how the data was generated – there is weak signal in 1% of CpGs.
RaMWAS also creates a Manhattan plot with matching Y axis.
RaMWAS can annotate top findings using data from biomaRt
.
The parameters include:
bihost
– biomart host site.bimart
– BioMart database name.bidataset
– data set.biattributes
– are attributes of interest.bifilters
– list of filters (if any).biflank
– indicates the maximum allowed distance from
the CpG to the annotation element.For detailed description of these parameters and R script for listing allowed values see parameter vignette
Here we annotate top findings with human gene symbols, gene ids, and strand information.
The function updates the Top_tests.txt
file in the MWAS
directory.
chr | position | tstat | pvalue | qvalue | hgnc_symbol | entrezgene | strand |
---|---|---|---|---|---|---|---|
chr1 | 15,975,530 | 8.771 | 6.446x10-8 | 0.022 | DDI2/DDI2 | 84301/6248 | 1/1 |
chr1 | 15,975,533 | 8.568 | 9.097x10-8 | 0.022 | DDI2/DDI2 | 84301/6248 | 1/1 |
chr1 | 15,418,248 | -7.654 | 4.571x10-7 | 0.071 | KAZN | 23254 | 1 |
While it is important to find individual CpGs associated with a phenotype, we can often achieve better predictive power by combining information from multiple CpGs.
We take an approach similar to the one proposed by (Horvath 2013) for predicting biological age from methylation data. In particular, we combine signal across multiple CpGs using the elastic net model (Tibshirani et al. 2012). To avoid overfitting and correctly estimate the predictive power of the model we use k-fold cross-validation. Within the cross-validation procedure, for each training set of samples we perform MWAS, select top MWAS sites, train the elastic net, and make predictions for the test samples. The set of predictions is recorded as the MRS.
RaMWAS predicts the outcome variable (modeloutcomes
parameter) using top mmncpgs
CpGs from the MWAS on the
training set of samples. If mmncpgs
is a vector of several
values, each number of CpGs is tested separately.
The elastic net parameter alpha can be set via mmalpha
parameter. The number of folds cvnfolds
in the K-fold cross
validation is 10 by default.
This step creates CV_X_folds
sub-directory in MWAS
directory, where X is the number of folds in k-fold cross validation
(see Directory names). It contains:
fold_*
– directories with MWAS on 90% of the data, each
with different 10% of the samples excluded from the analysis.NA
) outcome variable are not included
in these MWAS.MMCVN_prediction_folds=10_CpGs=*.txt
– table with true
outcome and cross-validation prediction.MMCVN_prediction_folds=10_CpGs=*.pdf
– scatter-plot of
true outcome vs. cross-validation prediction.Prediction_alpha=0.000000.pdf
– scatter plot of
outcome-prediction correlations as a function of the number of CpGs used
for prediction.For the simulated data we get moderately good prediction using just 50 top CpGs.
cv = readRDS(paste0(pfull$dircv, "/rds/CpGs=000050_alpha=0.000000.rds"))
plotPrediction(
param = pfull,
outcome = cv$outcome,
forecast = cv$forecast,
cpgs2use = 50,
main = "Prediction success (EN on coverage)")
The correlation of prediction with actual age is maximal when we use 100 top CpGs in elastic net.
Point mutated CpG sites, called CpG-SNPs (Shoemaker et al. 2010), are interesting sites for human diseases because in addition to the sequence variation they may show individual differences in DNA methylation. RaMWAS can perform CpG-SNP analyses if SNP data from the same subjects/samples is also available. These tests are performed using a regression model that assesses whether the case-control differences are proportional to the number of CpG sites (Oord et al. 2015).
This type of analysis is explained in the CpG-SNP vignette.
The names of CpG score/PCA/MWAS/MRS directories can be recovered by
first calling parameterPreprocess
function.
## CpG score directory:
## /tmp/RtmpE42VVP/simulated_project/coverage_norm_20
## PCA directory:
## /tmp/RtmpE42VVP/simulated_project/coverage_norm_20/PCA_00_cvrts
## MWAS directory:
## /tmp/RtmpE42VVP/simulated_project/coverage_norm_20/PCA_00_cvrts/Testing_age_0_PCs
## MRS directory:
## /tmp/RtmpE42VVP/simulated_project/coverage_norm_20/PCA_00_cvrts/Testing_age_0_PCs/CV_05_folds
## CpG-SNP directory:
## /tmp/RtmpE42VVP/simulated_project/coverage_norm_20/Testing_wSNPs_age_0cvrts_0PCs
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ramwas_1.31.0 filematrix_1.3 pander_0.6.5 knitr_1.48
## [5] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4
## [3] blob_1.2.4 filelock_1.0.3
## [5] Biostrings_2.75.0 bitops_1.0-9
## [7] fastmap_1.2.0 BiocFileCache_2.15.0
## [9] GenomicAlignments_1.43.0 digest_0.6.37
## [11] lifecycle_1.0.4 survival_3.7-0
## [13] KEGGREST_1.47.0 RSQLite_2.3.7
## [15] magrittr_2.0.3 compiler_4.4.1
## [17] rlang_1.1.4 sass_0.4.9
## [19] progress_1.2.3 tools_4.4.1
## [21] utf8_1.2.4 yaml_2.3.10
## [23] prettyunits_1.2.0 S4Arrays_1.6.0
## [25] curl_5.2.3 bit_4.5.0
## [27] DelayedArray_0.33.1 xml2_1.3.6
## [29] abind_1.4-8 BiocParallel_1.41.0
## [31] KernSmooth_2.23-24 BiocGenerics_0.53.0
## [33] sys_3.4.3 grid_4.4.1
## [35] stats4_4.4.1 fansi_1.0.6
## [37] iterators_1.0.14 biomaRt_2.63.0
## [39] SummarizedExperiment_1.36.0 cli_3.6.3
## [41] rmarkdown_2.28 crayon_1.5.3
## [43] generics_0.1.3 httr_1.4.7
## [45] DBI_1.2.3 cachem_1.1.0
## [47] stringr_1.5.1 zlibbioc_1.52.0
## [49] splines_4.4.1 parallel_4.4.1
## [51] AnnotationDbi_1.69.0 BiocManager_1.30.25
## [53] XVector_0.46.0 matrixStats_1.4.1
## [55] vctrs_0.6.5 glmnet_4.1-8
## [57] Matrix_1.7-1 jsonlite_1.8.9
## [59] IRanges_2.41.0 hms_1.1.3
## [61] S4Vectors_0.44.0 bit64_4.5.2
## [63] maketools_1.3.1 foreach_1.5.2
## [65] jquerylib_0.1.4 glue_1.8.0
## [67] codetools_0.2-20 stringi_1.8.4
## [69] shape_1.4.6.1 GenomeInfoDb_1.43.0
## [71] GenomicRanges_1.59.0 UCSC.utils_1.2.0
## [73] tibble_3.2.1 pillar_1.9.0
## [75] rappdirs_0.3.3 htmltools_0.5.8.1
## [77] GenomeInfoDbData_1.2.13 R6_2.5.1
## [79] dbplyr_2.5.0 httr2_1.0.5
## [81] evaluate_1.0.1 lattice_0.22-6
## [83] Biobase_2.67.0 highr_0.11
## [85] png_0.1-8 Rsamtools_2.22.0
## [87] memoise_2.0.1 bslib_0.8.0
## [89] Rcpp_1.0.13 SparseArray_1.6.0
## [91] xfun_0.48 MatrixGenerics_1.19.0
## [93] buildtools_1.0.0 pkgconfig_2.0.3