There are several ways to initialize the parameters for calling RaMWAS pipeline functions. The parameters can be stored in an R list like this:
param = ramwasParameters(
dirproject = ".",
dirbam = "bams",
filebamlist = "bam_list.txt",
filecpgset = "Simulated_chromosome.rds",
cputhreads = 2,
scoretag = "MAPQ",
minscore = 4,
minfragmentsize = 50,
maxfragmentsize = 250,
filecovariates = "covariates.txt",
modelcovariates = NULL,
modeloutcome = "age",
modelPCs = 0,
toppvthreshold = 1e-5,
cvnfolds = 10,
mmalpha = 0,
mmncpgs = c(5, 10, 50, 100, 500, 1000, 5000, 10000)
)
Alternatively, the parameters can be set in a separate R code file,
which is processed into a list as above by
parametersFromFile
function. The R code file can contain
lines like this:
### R parameter file
dirbam = "/ramwas_project/bams/"
dirproject = "/ramwas_project/"
filebamlist = "/ramwas_project/000_list_of_files.txt"
scoretag = "AS"
minscore = 100
### platform dependent part
if(.Platform$OS.type == "windows"){
filecpgset="C:/RaMWAS/CpG_set/cpgset_hg19_SNPS_at_MAF_0.05.rds"
} else {
filecpgset="/computing_cluster/ramwas/cpgset_hg19_SNPS_at_MAF_0.05.rds"
}
The project directory parameter is dirproject
. Files
specified by file*
parameters are looked for here, unless
they have the full path specified. By default dirproject
is
set to the current directory.
The dirbam
directory is the location where RaMWAS
expects to find BAM files. If it is not an absolute path, it is
considered to be relative to the dirproject
.
The dirfilter
directory is, by default, the same as
dirproject
. All files created by RaMWAS are created within
this directory. If the user wants to test different read filtering
rules, they can set dirfilter
to TRUE
. This
will set it to something like “Filter_MAPQ_4”, there “MAPQ” is the BAM
field used for filtering and “4” is the threshold.
The dirrbam
parameter is the location where RaMWAS saves
RaMWAS raw data files (read start locations) after scanning BAMs. It is
“rds_rbam” by default and is located in dirfilter
.
The dirrqc
parameter is the location where RaMWAS saves
QC files in R format after scanning BAMs. It is “rds_qc” by default and
is located in dirfilter
.
The dirqc
parameter is the location where RaMWAS saves
QC plots and text files (BAM QC info) after scanning BAMs. It is “qc” by
default and is located in dirfilter
.
The dircoveragenorm
parameter is the sub-directory where
RaMWAS saves coverage matrix at Step 3 of the pipeline. It is
“coverage_norm_123” by default (123 is the number of samples) and is
located in dirfilter
.
The dirtemp
parameter is the directory where RaMWAS
stores temporary files during construction of coverage matrix at Step 3
of the pipeline. It is “temp” by default and is located in
dircoveragenorm
. For better performance it can be set to a
location on a different hard drive than
dircoveragenorm
.
The dirpca
parameter is the sub-directory where RaMWAS
saves results of PCA analysis at Step 4 of the pipeline. It is
“PCA_12_cvrts_0b0a0c” by default (12 is the number of covariates
regressed out and 0b0a0c is a unique code to differentiate different
sets of 12 covariates) and is located in
dircoveragenorm
.
The dirmwas
parameter is the sub-directory where RaMWAS
saves results of MWAS analysis at Step 5. It is “Testing_age_7_PCs” by
default (age is the phenotype being tested and 7 is number of top PCs
included in the model) and is located in dirpca
.
The dircv
parameter is the sub-directory where RaMWAS
saves results of Methylation Risk Score analysis at Step 7. It is
“CV_10_folds” by default (10 is number of folds in N-fold cross
validation) and is located in dirmwas
.
Parameter filebamlist
, if defined, must point to a text
file with one BAM file name per line. BAM file names can include path,
relative to dirbam
or absolute.
Such file may looks like this.
batch1/b1sample1.bam
batch1/b1sample2.bam
batch2/b2sample1.bam
batch2/b2sample2.bam
batch2/b2sample3.bam
batch4/sample4.bam
This file is then loaded into bamnames
parameter, with
“.bam” extension stripped.
Note: BAM file names must be different. For example, the list of BAMS below is NOT allowed, as it contains “sample1.bam” twice:
batch1/sample1.bam
batch1/sample2.bam
batch2/sample1.bam
The filebam2sample
parameter lets RaMWAS know the BAM to
sample correspondence. It provides information on how BAMs from the same
sample are to be combined. Each line in filebam2sample
must
have information for one sample. If sample1 contains reads from bam1,
bam2 and bam3, the line should be
sample1=bam1,bam2,bam3
If the sample name matches the bam name, the line can simply contain that name
sample2
The filebam2sample
file is scanned into
bam2sample
list. The elements of the list are bam names,
and their names are sample names. For example:
RaMWAS calculates CpG scores and performs further analyses at a set
of CpGs (or locations in general) defined by the user via
filecpgset
parameter. The filecpgset
parameter
must point to an .rds file (a file saved using saveRDS
function), with the set of locations stored as a list
with
one sorted vector of CpG locations per chromosome.
cpgset = list(
chr1 = c(12L, 57L, 123L),
chr2 = c(45L, 95L, 99L, 111L),
chr3 = c(22L, 40L, 199L, 211L) )
In practice, the set should depend on the reference genome and can include CpGs created by common SNPs.
Optionally, parameter filenoncpgset
, can point to a file
storing vetted locations away from any CpGs.
For more on CpG sets see the CpG set vignette
The parameter filecovariates
, if defined, must point to
a file containing phenotype information and covariates for the available
samples. If the file has extension “.csv”, it is assumed to be comma
separated, otherwise - tab separated. It must have a header and the
first column must have sample names as defined by
bam2sample
parameter (see
above).
The data in filecovariates
is read into the
covariates
parameter.
Many parts of RaMWAS are parallelized. The cputhreads
parameter determines the maximum number of CPU intensive tasks running
in parallel. By default cputhreads
is set to the number of
CPU cores.
Some tasks are disk intensive. The maximum number of such tasks
running in parallel is set by the diskthreads
parameter. By
default diskthreads
value is 2. Higher values can be
beneficial on machine with lots of RAM.
On some systems the performance is better if different jobs are
prevented from simultaneous access to files. To enforce this for
filematrices set usefilelock=TRUE
.
The reads are filtered by scoretag
parameter, which is
usually the “MAPQ” field or “AS” tag in the BAM file (BAM file
format). The minscore
parameter defines the minimum
admissible score, reads with scores below that are excluded.
If there the are more than maxrepeats
read with the same
start position, this excess is assumed to be the result of template
preparation or amplification artifacts and count is reset to
maxrepeaets
(it is set to 3 by default).
The CpGs in CpG set defined by filecpgset
are filtered
based on their coverage.
minavgcpgcoverage
(default is 0.3).minnonzerosamples
proportion
of samples with nonzero coverageThe file operations in this step can be performed faster if done in
large blocks. To set the block size use buffersize
parameter. Be default it is set to 1 GB
(buffersize = 1e9
).
Numerical values take 8 bytes is stored with full precision. The
coverage matrix does not need such precision and can safely be stored
with 4 bytes per value (single precision). The value size is set by
doublesize
parameter, which is 4 by default.
Both PCA and MWAS correct for variation explained by selected
covariates set by modelcovariates
. The
modelcovariates
parameter must name variables in
filecovariates
/covariates
.
By default, the tested linear model includes a constant. To exclude
it, set modelhasconstant
parameter to
FALSE
.
MWAS tests for association of normalized CpG coverage with
modeloutcome
, accounting for variation of top
modelPCs
principal components.
MWAS produces a QQ-plot in dirmwas
. The title of the
QQ-plot can be changed by the qqplottitle
parameter. To
exclude the title set qqplottitle=""
.
Top MWAS results are saved in a text file Top_tests.txt
.
Parameter toppvthreshold
defines p-value threshold for
selection of top results. Alternatively, it can define the number of top
results, if it is set to a value larger than 1.
The annotation is done using biomaRt
.
package.
The parameters include:
bihost
– BioMart host site.grch37.ensembl.org
.bimart
– BioMart database name, see listMarts().ENSEMBL_MART_ENSEMBL
.bidataset
– BioMart data set, see listDatasets().biattributes
– are attributes of interest, see
listAttributes(). Default is
c("hgnc_symbol","entrezgene","strand")
.bifilters
– lists filters (if any), see
listFilters().biflank
– indicates the maximum allowed distance from
the CpG to the annotation element.Here is an example on how to select a custom biomart annotation track:
library(biomaRt)
library(ramwas)
# First pick a host.
bihost = "grch37.ensembl.org"
# First we list databases
listOfMarts = listMarts(host = bihost)
pander(head(listOfMarts, 10))
# Pick a database
bimart = "ENSEMBL_MART_ENSEMBL"
# Connect to the database
mart = useMart(biomart = bimart, host = bihost)
# List the data sets in the database
listOfDatasets = listDatasets(mart = mart)
pander(head(listOfDatasets, 10))
# Pisk a data set
bidataset = "hsapiens_gene_ensembl"
# Connect to the data set
mart = useMart(biomart = bimart, dataset = bidataset, host = bihost)
# List the attributes
listOfAttributes = listAttributes(mart)
pander(head(listOfAttributes, 10))
# Pick attributes
biattributes = c("hgnc_symbol", "entrezgene", "strand")
listOfFilters = listFilters(mart)
pander(head(listOfFilters, 20))
# Pick a filter
bifilters = list(with_hgnc_trans_name=TRUE)
# Test a location
chr = "chr1"
pos = 15975530
param = ramwasParameters(
bihost = bihost,
bimart = bimart,
bidataset = bidataset,
biattributes = biattributes,
bifilters = bifilters,
biflank = 0)
anno = ramwasAnnotateLocations(param, chr, pos)
pander(anno)
RaMWAS predicts the outcome variable (modeloutcomes
parameter) using top mmncpgs
CpGs from the MWAS. This
prediction is done for each fold in k-fold cross validation and the
prediction performance is measured via correlations and (for binary
outcomes) ROC curves.
To run the procedure for multiple number of top CpGs, The parameter
mmncpgs
can be set to a vector of multiple values.
The elastic net mixing parameter alpha can be set via
mmalpha
parameter. The number of folds
cvnfolds
in the K-fold cross validation is 10 by
default.
The split into folds is random. The random seed can be set with the
randseed
parameter, which is set to 18090212
by default for consistency across runs.
cvnfolds
in the cross
validationWhen selecting the number of folds, K, in K-fold cross validation a researcher faces a trade off. On one hand, larger K allows the training set [of size approximately $\frac{K-1}{K}N$ to better match the size of the complete data set. On the other hand, the computational complexity of cross validation grows linearly with K. As a balance, K = 5 or 10 is often chosen. The most extreme case of K = N is known as the leave-one-out cross validation procedure.
The joint analysis of methylation and genotype data is described in the corresponding vignette.
The SNP data must be stored in a filematrix with dimensions matching
the CpG score matrix. Its name must be defined by fileSNPs
parameter, with absolute path or relative to
dircoveragenorm
.
The results of the joint analysis are stored in dirSNPs
directory. By default, the directory is created within
dircoveragenorm
directory.