Title: | Fast Methylome-Wide Association Study Pipeline for Enrichment Platforms |
---|---|
Description: | A complete toolset for methylome-wide association studies (MWAS). It is specifically designed for data from enrichment based methylation assays, but can be applied to other data as well. The analysis pipeline includes seven steps: (1) scanning aligned reads from BAM files, (2) calculation of quality control measures, (3) creation of methylation score (coverage) matrix, (4) principal component analysis for capturing batch effects and detection of outliers, (5) association analysis with respect to phenotypes of interest while correcting for top PCs and known covariates, (6) annotation of significant findings, and (7) multi-marker analysis (methylation risk score) using elastic net. Additionally, RaMWAS include tools for joint analysis of methlyation and genotype data. This work is published in Bioinformatics, Shabalin et al. (2018) <doi:10.1093/bioinformatics/bty069>. |
Authors: | Andrey A Shabalin [aut, cre] , Shaunna L Clark [aut], Mohammad W Hattab [aut], Karolina A Aberg [aut], Edwin J C G van den Oord [aut] |
Maintainer: | Andrey A Shabalin <[email protected]> |
License: | LGPL-3 |
Version: | 1.31.0 |
Built: | 2024-12-30 03:38:58 UTC |
Source: | https://github.com/bioc/ramwas |
RaMWAS provides a complete toolset for methylome-wide association studies (MWAS). It is specifically designed for data from enrichment based methylation assays, but can be applied to other methylomic data as well. The analysis pipeline includes seven steps: (1) scanning aligned reads from BAM files, (2) calculation of quality control measures, (3) creation of methylation score (coverage) matrix, (4) principal component analysis for capturing batch effects and detection of outliers, (5) association analysis with respect to phenotypes of interest while correctingfor top PCs and known covariates, (6) annotation of significant findings, and (7) multi-marker analysis (methylation risk score) using elastic net.
Package: | ramwas |
Type: | Package |
License: | LGPL-3 |
LazyLoad: | yes |
Depends: | methods |
Andrey A Shabalin [email protected]
Maintainer: Andrey A Shabalin <[email protected]>
See vignettes: browseVignettes("ramwas")
.
Loads an .rds file rdsfilename
using readRDS
and returns the loaded object.
The object is also saved in a cache so that repeated calls of the
function with the same filename return the same object instanteneously.
cachedRDSload(rdsfilename)
cachedRDSload(rdsfilename)
rdsfilename |
Name of the RDS file. |
The cached object is stored in a private package environment.
Returns the object loaded with readRDS
from
rdsfilename
at this or a previous call of the function.
Andrey A Shabalin [email protected]
### Change filename to hg19 CpGset filename = system.file("extdata", "qc_sample.rds", package = "ramwas") time1 = system.time( {obj1 = cachedRDSload(filename)} ) time2 = system.time( {obj1 = cachedRDSload(filename)} ) cat("First loading time:",time1[3],"seconds","\n") cat("Second loading time:",time2[3],"seconds","\n")
### Change filename to hg19 CpGset filename = system.file("extdata", "qc_sample.rds", package = "ramwas") time1 = system.time( {obj1 = cachedRDSload(filename)} ) time2 = system.time( {obj1 = cachedRDSload(filename)} ) cat("First loading time:",time1[3],"seconds","\n") cat("Second loading time:",time2[3],"seconds","\n")
Finding top, say, 100 p-values out of millions can be slow.
This function does it much faster than the usual application of
order(pv)[1:N]
.
findBestNpvs(pv, n)
findBestNpvs(pv, n)
pv |
Vector of p-values. |
n |
Number of best p-values to select. |
The function is a faster analog of sort(order(pv)[1:N])
Return a vector of positions of the smallest N
p-values in pv
.
Andrey A Shabalin [email protected]
See order
.
pv = runif(1000)^10 n = 100 # Faster version topSites1 = findBestNpvs(pv, n) # Slow alternative topSites2 = sort(order(pv)[1:n]) # The results must match stopifnot(all( topSites1 == topSites2 ))
pv = runif(1000)^10 n = 100 # Faster version topSites1 = findBestNpvs(pv, n) # Slow alternative topSites2 = sort(order(pv)[1:n]) # The results must match stopifnot(all( topSites1 == topSites2 ))
Functions for access to data, MWAS results, and location information.
Function getLocations
obtains the location information for
all variables (CpGs).
Function getMWASandLocations
obtains both MWAS results and
location information in a single data frame.
Functions getDataByLocation
and getMWASrange
return the data (coverage) and MWAS results for the selected set
of variables (CpGs).
getLocations(x) getMWAS(x) getMWASandLocations(x) getMWASrange(x, chr, start, end) getDataByLocation(x, chr, start, end)
getLocations(x) getMWAS(x) getMWASandLocations(x) getMWASrange(x, chr, start, end) getDataByLocation(x, chr, start, end)
x |
Name of directory or
list of RaMWAS parameters as described in the "RW6_param.Rmd" vignette.
If a directory name is provided, it must point to
|
chr |
Chromosome name or number. |
start |
Start position of the genomic region of interest. |
end |
End position of the genomic region of interest. |
The functions return the MWAS results and/or locations.
Function getLocations
returns a data frame with
chr |
Chromosome |
start |
Start position |
end |
End position |
Function getMWAS
returns a data frame with
cor |
coverage - phenotype correlation |
t.test |
t-statistic |
p.value |
p-value |
q.value |
q-value (FDR) |
If the outcome variable was categorical,
columns cor
and t.test
are replaced with
R.squared
and F-test
.
Functions getMWASandLocations
and getMWASrange
return a data frame with elements of output
of both getLocations
and getMWAS
Function getDataByLocation
returns a list with
locations |
Chromosomal location information for located variables |
matrix |
Data (coverage) matrix for the selected locations |
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
## Not run: # Extract locations using parameter vector getLocations(param) # Extract locations using directory name getLocations("/data/myMWAS") # Extract MWAS using parameter vector getMWAS(param) # Extract MWAS using directory name getMWAS("/data/myMWAS") # Extract MWAS using parameter vector getMWASandLocations(param) # Extract MWAS using directory name getMWASandLocations("/data/myMWAS") # Extract MWAS for a region getMWASrange(param, 1, 123321, 223321) # Chromosome can be character getMWASrange(param, "chr1", 123321, 223321) # Extract data for a region getDataByLocation(param, 1, 123321, 223321) # Chromosome can be character getDataByLocation(param, "chr1", 123321, 223321) ## End(Not run)
## Not run: # Extract locations using parameter vector getLocations(param) # Extract locations using directory name getLocations("/data/myMWAS") # Extract MWAS using parameter vector getMWAS(param) # Extract MWAS using directory name getMWAS("/data/myMWAS") # Extract MWAS using parameter vector getMWASandLocations(param) # Extract MWAS using directory name getMWASandLocations("/data/myMWAS") # Extract MWAS for a region getMWASrange(param, 1, 123321, 223321) # Chromosome can be character getMWASrange(param, "chr1", 123321, 223321) # Extract data for a region getDataByLocation(param, 1, 123321, 223321) # Chromosome can be character getDataByLocation(param, "chr1", 123321, 223321) ## End(Not run)
Finds all CpGs in a reference genome.
getCpGsetCG(genome) getCpGsetALL(genome)
getCpGsetCG(genome) getCpGsetALL(genome)
genome |
A |
The getCpGsetCG
function searches for all
CG
pairs in the genome.
The getCpGsetALL
function also works for genomes with injected SNPs.
Returns a list with CpG coordinates for each genome sequence.
Andrey A Shabalin [email protected]
### Using a BSGenome input library(BSgenome.Ecoli.NCBI.20080805) cpgset = getCpGsetCG(BSgenome.Ecoli.NCBI.20080805) print("First 10 CpGs in NC_008253:") print(cpgset$NC_008253[1:10]) ### Using a character vector input genome = list( chr1 = "AGCGTTTTCATTCTGACTGCAACGGGCYR", chr2 = "AAAAAACGCCTTAGTAAGTGATTTTCGYR") cpgset1 = getCpGsetCG(genome) cpgset2 = getCpGsetALL(genome) print("Pure CG coordinates in the toy genome:") print(cpgset1) print("CG coordinates in the toy genome possible with SNPs:") print(cpgset2)
### Using a BSGenome input library(BSgenome.Ecoli.NCBI.20080805) cpgset = getCpGsetCG(BSgenome.Ecoli.NCBI.20080805) print("First 10 CpGs in NC_008253:") print(cpgset$NC_008253[1:10]) ### Using a character vector input genome = list( chr1 = "AGCGTTTTCATTCTGACTGCAACGGGCYR", chr2 = "AAAAAACGCCTTAGTAAGTGATTTTCGYR") cpgset1 = getCpGsetCG(genome) cpgset2 = getCpGsetALL(genome) print("Pure CG coordinates in the toy genome:") print(cpgset1) print("CG coordinates in the toy genome possible with SNPs:") print(cpgset2)
Injects SNPs from a VCF count file into a DNA sequence.
injectSNPsMAF(gensequence, frqcount, MAF = 0.01)
injectSNPsMAF(gensequence, frqcount, MAF = 0.01)
gensequence |
A string or |
frqcount |
File name of the allele count file produced by
|
MAF |
SNPs with minor allele frequency at or above |
Returns a string with the genome sequence with SNPs injected.
Andrey A Shabalin [email protected]
See injectSNPs
for the standard analog
function without MAF filtering.
gensequence1 = "AAAACAAAA" frqcount = c( "CHROM\tPOS\tN_ALLELES\tN_CHR\t{ALLELE:COUNT}", "1\t6\t2\t1000\tA:400\tG:600", "1\t7\t2\t1000\tA:800\tC:200", "1\t9\t2\t1000\tA:900\tG:100") MAF = 0.01 gensequence2 = injectSNPsMAF(gensequence1, frqcount, MAF) ### No CpGs without SNPs show(gensequence1) getCpGsetCG(gensequence1) ### SNPs create 1 CpG show(gensequence2) getCpGsetALL(gensequence2)
gensequence1 = "AAAACAAAA" frqcount = c( "CHROM\tPOS\tN_ALLELES\tN_CHR\t{ALLELE:COUNT}", "1\t6\t2\t1000\tA:400\tG:600", "1\t7\t2\t1000\tA:800\tC:200", "1\t9\t2\t1000\tA:900\tG:100") MAF = 0.01 gensequence2 = injectSNPsMAF(gensequence1, frqcount, MAF) ### No CpGs without SNPs show(gensequence1) getCpGsetCG(gensequence1) ### SNPs create 1 CpG show(gensequence2) getCpGsetALL(gensequence2)
Creates a FASTQ file with all fragments of fraglength
bp long.
insilicoFASTQ(con, gensequence, fraglength)
insilicoFASTQ(con, gensequence, fraglength)
con |
A |
gensequence |
A string or |
fraglength |
Fragment length. |
The function a FASTQ file with all fragments of fraglength
bp
long from the forward strand of the DNA sequence.
Returns a list with CpG coordinates for each genome sequence.
Andrey A Shabalin [email protected]
## There are four 4 bp fragments in a 7 basepair sequence: insilicoFASTQ(con="", gensequence = "ABCDEFG", fraglength=4)
## There are four 4 bp fragments in a 7 basepair sequence: insilicoFASTQ(con="", gensequence = "ABCDEFG", fraglength=4)
Check whether a path is relative or absolute.
isAbsolutePath(path)
isAbsolutePath(path)
path |
Path to be tested. |
The function is designed to word with both Windows and Unix paths.
TRUE
if the path is absolute, FALSE
otherwise.
This function improves upon the analog function
in R.utils
package.
For instance, "~hi" is not an absolute path.
Andrey A Shabalin [email protected]
See also makefullpath
.
isAbsolutePath( "C:/123" ) # TRUE isAbsolutePath( "~123" ) # FALSE isAbsolutePath( "~/123" ) # TRUE isAbsolutePath( "/123" ) # TRUE isAbsolutePath( "\\123" ) # TRUE isAbsolutePath( "asd\\123" ) # FALSE isAbsolutePath( "a\\123" ) # FALSE
isAbsolutePath( "C:/123" ) # TRUE isAbsolutePath( "~123" ) # FALSE isAbsolutePath( "~/123" ) # TRUE isAbsolutePath( "/123" ) # TRUE isAbsolutePath( "\\123" ) # TRUE isAbsolutePath( "asd\\123" ) # FALSE isAbsolutePath( "a\\123" ) # FALSE
Functions for exporting MWAS results in BED format files.
Function madeBED
saves MWAS findings in BED format for
all variables (CpGs), while madeBEDrange
selects only variables on
a given chromosome between given locations.
Functions madeBEDgraph
and madeBEDgraphRange
do the same,
but create a file in BedGraph format.
madeBED(x, filename) madeBEDrange(x, filename, chr, start, end) madeBEDgraph(x, filename) madeBEDgraphRange(x, filename, chr, start, end)
madeBED(x, filename) madeBEDrange(x, filename, chr, start, end) madeBEDgraph(x, filename) madeBEDgraphRange(x, filename, chr, start, end)
x |
Name of MWAS directory (parameter |
filename |
Name of the BED file to create. If file exists, it's overwritten. |
chr |
Chromosome name or number. |
start |
Start position of the genomic region of interest. |
end |
End position of the genomic region of interest. |
The function returns the MWAS results with locations.
Returns a data.frame with BED file content:
chrom |
Chromosome |
chromStart |
Start position |
chromEnd |
End position |
name |
Empty name column. BED format only |
score |
p-value |
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
## Not run: # Extract BED file using parameter vector madeBED(param, "file.bed") madeBEDrange(param, "file.bed", 1, 123321, 223321) # Extract BED file using directory name madeBED("/data/myMWAS", "file.bed") madeBEDrange("/data/myMWAS", "file.bed", 1, 123321, 223321) ## End(Not run)
## Not run: # Extract BED file using parameter vector madeBED(param, "file.bed") madeBEDrange(param, "file.bed", 1, 123321, 223321) # Extract BED file using directory name madeBED("/data/myMWAS", "file.bed") madeBEDrange("/data/myMWAS", "file.bed", 1, 123321, 223321) ## End(Not run)
Combine a path with a filename into filename with path.
makefullpath(path, filename)
makefullpath(path, filename)
path |
Path, relative to which the filename is expected to be.
Can be absolute, relative, or |
filename |
Can be just filename, include relative path, or include absolute path. |
Function returns filename
if it includes absolute path or
if path
is NULL
.
Filename with the path included.
Andrey A Shabalin [email protected]
See also isAbsolutePath
.
makefullpath("dir1/dir2", "file.txt") # "dir1/dir2/file.txt" makefullpath("dir1/dir2", "dir3/file.txt") # "dir1/dir2/dir3/file.txt" # Path is ignored if the filename already includes absolute path makefullpath("dir1/dir2", "/file.txt") # "/file.txt" makefullpath("dir1/dir2", "C:/file.txt") # "C:/file.txt"
makefullpath("dir1/dir2", "file.txt") # "dir1/dir2/file.txt" makefullpath("dir1/dir2", "dir3/file.txt") # "dir1/dir2/dir3/file.txt" # Path is ignored if the filename already includes absolute path makefullpath("dir1/dir2", "/file.txt") # "/file.txt" makefullpath("dir1/dir2", "C:/file.txt") # "C:/file.txt"
The function manPlotFast
creates a Manhattan plot.
The function manPlotPrepare
extracts necessary information
from a vector of p-values sufficient for creating a Manhattan plot.
It optimized to work quickly even for tens of millions of p-values.
manPlotPrepare( pvalues, chr, pos, ismlog10 = FALSE, chrmargins = 5e6) manPlotFast( man, ylim = NULL, colorSet = c('steelblue4',"#2C82D1","#4CB2D1"), yaxmax = NULL, lwd = 3, axistep = 2, cex = 1)
manPlotPrepare( pvalues, chr, pos, ismlog10 = FALSE, chrmargins = 5e6) manPlotFast( man, ylim = NULL, colorSet = c('steelblue4',"#2C82D1","#4CB2D1"), yaxmax = NULL, lwd = 3, axistep = 2, cex = 1)
pvalues |
Vector of p-values.
As is (if |
chr , pos
|
Vectors indicating the chromosomes and genomic positions (in basepairs)
for each p-value in |
ismlog10 |
Specifies whether the provides p-values ( |
chrmargins |
The plot margins at the ends of chromosomes (in basepairs). |
man |
Object returned by |
ylim |
Numeric vectors of length 2, giving the y coordinate range. Exactly as in Plotting Parameters. |
colorSet |
Colors of points, rotating over chromosomes. |
yaxmax |
Maximum reach of the y axis. |
lwd |
The line width. |
axistep |
Distance between axis label ticks for y axis. |
cex |
The size of Manhattan plot points. As in Graphics Parameters. |
The function manPlotFast
creates Manhattan plot.
It requires the use of the function manPlotPrepare
which extracts the necessary information
from a vector of p-values sufficient for creating Manhattan plot.
The resulting object is many times smaller than the vector of p-values.
This function manPlotPrepare
returns an object with
information for creating Manhattan plot.
The plot has no title. To add a title use title
.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
# Simulate data (9 chromosomes, million tests each) chr = rep(paste0('chr',1:9), each = 1e6) pos = rep(1:1e6, 9) pv = runif(9e6)^1.1 # Extract the Manhattan plot info man = manPlotPrepare(pv, chr, pos, chrmargins = 1000) # Create Manhattan plot manPlotFast(man) title("Manhattan plot") # Size of p-values before extraction of Manhattan plot info object.size(list(pv, chr, pos)) # Size of the Manhattan plot info object object.size(man)
# Simulate data (9 chromosomes, million tests each) chr = rep(paste0('chr',1:9), each = 1e6) pos = rep(1:1e6, 9) pv = runif(9e6)^1.1 # Extract the Manhattan plot info man = manPlotPrepare(pv, chr, pos, chrmargins = 1000) # Create Manhattan plot manPlotFast(man) title("Manhattan plot") # Size of p-values before extraction of Manhattan plot info object.size(list(pv, chr, pos)) # Size of the Manhattan plot info object object.size(man)
Internal function for splitting a matrix into column vectors.
mat2cols(x)
mat2cols(x)
x |
A matrix. |
List of matrix columns.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
# Sample data data = matrix(1:12, nrow = 3) # Split it mat2cols(data)
# Sample data data = matrix(1:12, nrow = 3) # Split it mat2cols(data)
Takes a matrix of data frame with covariates, adds a constant covariate (optional), and orthonormalizes the set.
orthonormalizeCovariates(cvrt, modelhasconstant)
orthonormalizeCovariates(cvrt, modelhasconstant)
cvrt |
A matrix or data frame with covariates (one column per covariate). |
modelhasconstant |
Set to |
Factor variables are split into dummy variables before orthonormalization.
The operation is performed via QR decomposition (qr).
Returns a matrix with orthogonal columns with unit length,
whose columns spans the same space as the covariates plus a constant
(if modelhasconstant
is TRUE
).
This function is used in several parts of the pipeline.
Andrey A Shabalin [email protected]
# Sample matrix of covariates covariates = data.frame(a = 1:12, b = 12:1) # Orthonormalizing Covariates cvrtqr = orthonormalizeCovariates(covariates, modelhasconstant = TRUE) # Checking the results (round to ignore rounding errors) print( round(crossprod(cvrtqr),15) ) # Stop if not orthonormal stopifnot(all.equal( crossprod(cvrtqr), diag(ncol(cvrtqr)) )) # Example with a factor variable groups = data.frame(gr = c("a","a","a","b","b","b","c","c","c")) orthonormalizeCovariates(groups)
# Sample matrix of covariates covariates = data.frame(a = 1:12, b = 12:1) # Orthonormalizing Covariates cvrtqr = orthonormalizeCovariates(covariates, modelhasconstant = TRUE) # Checking the results (round to ignore rounding errors) print( round(crossprod(cvrtqr),15) ) # Stop if not orthonormal stopifnot(all.equal( crossprod(cvrtqr), diag(ncol(cvrtqr)) )) # Example with a factor variable groups = data.frame(gr = c("a","a","a","b","b","b","c","c","c")) orthonormalizeCovariates(groups)
Saves parameters in a text file,
prioritizing those listed in toplines
.
parameterDump(dir, param, toplines = NULL)
parameterDump(dir, param, toplines = NULL)
dir |
Directory to save the parameters to.
The file is named |
param |
A list with RaMWAS parameters. Or any list in general. |
toplines |
Names of the elements in |
This function is used internally by multiple RaMWAS functions to record parameters used to run the analysis.
The function creates a file and returns nothing.
This function is not intended to be run by the user.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
param = ramwasParameters( number = 123123, integer = 312L, textline = "Hi there", characterVector = c("Hi","Hi again","Bye"), dataframe = data.frame(a = 1:12, b = 12:1) ) thedir = tempdir() parameterDump(thedir, param, c("integer","characterVector")) cat( readLines( paste0(thedir,"/UsedSettings.txt") ), sep = "\n") file.remove( paste0(thedir,"/UsedSettings.txt") )
param = ramwasParameters( number = 123123, integer = 312L, textline = "Hi there", characterVector = c("Hi","Hi again","Bye"), dataframe = data.frame(a = 1:12, b = 12:1) ) thedir = tempdir() parameterDump(thedir, param, c("integer","characterVector")) cat( readLines( paste0(thedir,"/UsedSettings.txt") ), sep = "\n") file.remove( paste0(thedir,"/UsedSettings.txt") )
Fill in missing parameters with default values, read supporting data files, make relative directory path parameters absolute.
parameterPreprocess(param)
parameterPreprocess(param)
param |
List with RaMWAS parameters. |
A number of common preprocessing steps necessary for parameters of multiple pipeline parts are combined in this function. The actions include
Fill in default values for all missing parameters.
Set bamnames parameter to the content filebamlist file (if bamnames was not set).
Set bam2sample parameter to processed content of filebam2sample file (if bam2sample was not set).
Set covariates parameter to the data frame from filecovariates file (if covariates was not set).
Check parameters for consistency, i.e. that modelcovariates include only names of columns in covariates.
Check that files filecpgset and filenoncpgset exist if the parameters are set.
Returns preprocessed list of parameters.
This function is not intended to be run by the user.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
param = ramwasParameters( dirproject = "." ) param2 = parameterPreprocess(param) print(param2)
param = ramwasParameters( dirproject = "." ) param2 = parameterPreprocess(param) print(param2)
The pipeline parameters can be stored in a simple file,
formatted as R code.
The parametersFromFile
function transforms them into
a parameter list used by RaMWAS steps.
parametersFromFile(.parameterfile)
parametersFromFile(.parameterfile)
.parameterfile |
Name of the file with the parameters set as R variables. See the example below. |
Variables with names starting with period (.) are ignored.
Returns the list with all the variables set in the file.
The file .parameterfile
is executed as R code,
so use only trusted parameter files.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
filename = tempfile() # Create a file with lines # dirproject = "." # modelcovariates = c("Age","Sex") writeLines( con = filename, text = c( "dirproject = \".\"", "modelcovariates = c(\"Age\",\"Sex\")") ) # Scan the file into a list param = parametersFromFile(filename) # Show the list print(param) file.remove(filename)
filename = tempfile() # Create a file with lines # dirproject = "." # modelcovariates = c("Age","Sex") writeLines( con = filename, text = c( "dirproject = \".\"", "modelcovariates = c(\"Age\",\"Sex\")") ) # Scan the file into a list param = parametersFromFile(filename) # Show the list print(param) file.remove(filename)
These functions provide a simple way to run all steps of RaMWAS pipeline.
ramwas1scanBams(param) pipelineProcessBam(bamname, param) ramwas2collectqc(param) ramwas3normalizedCoverage(param) ramwas4PCA(param) ramwas5MWAS(param) ramwas6annotateTopFindings(param) ramwas7ArunMWASes(param) ramwas7BrunElasticNet(param) ramwas7CplotByNCpGs(param) ramwas7riskScoreCV(param) ramwasSNPs(param)
ramwas1scanBams(param) pipelineProcessBam(bamname, param) ramwas2collectqc(param) ramwas3normalizedCoverage(param) ramwas4PCA(param) ramwas5MWAS(param) ramwas6annotateTopFindings(param) ramwas7ArunMWASes(param) ramwas7BrunElasticNet(param) ramwas7CplotByNCpGs(param) ramwas7riskScoreCV(param) ramwasSNPs(param)
param |
List with RaMWAS parameters. |
bamname |
Name of the BAM file to process.
Can be absolute or relative to |
See vignettes for details: browseVignettes("ramwas")
.
Function pipelineProcessBam
returns "OK. <bamname>"
if no error occurred. Otherwise, returns text with error.
Other functions return nothing.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
param = ramwasParameters( dirbam = "/project/bams", dirproject = "/project", filebamlist = "000_list_of_files.txt", scoretag = "AS", minscore = 100, cputhreads = 4, filecpgset = "/RaMWAS/hg19_1kG_MAF_0.01_chr1-22_bowtie2_75bp.rds", filenoncpgset = "/RaMWAS/hg19_1kG_MAF_0.01_chr1-22_bowtie2_75bp_nonCpG.rds", maxrepeats = 3, maxfragmentsize = 250, minfragmentsize = 75, filebam2sample = "000_list_of_files.txt", filecovariates = "Covariates.txt", modelcovariates = c("Age","Sex"), modeloutcome = "CellType", modelPCs = 1, cvnfolds = 10, mmncpgs = 1000, mmalpha = 0 ) ## Not run: ramwas1scanBams(param) ramwas2collectqc(param) ramwas3normalizedCoverage(param) ramwas4PCA(param) ramwas5MWAS(param) ramwas6annotateTopFindings(param) ramwas7riskScoreCV(param) ## End(Not run)
param = ramwasParameters( dirbam = "/project/bams", dirproject = "/project", filebamlist = "000_list_of_files.txt", scoretag = "AS", minscore = 100, cputhreads = 4, filecpgset = "/RaMWAS/hg19_1kG_MAF_0.01_chr1-22_bowtie2_75bp.rds", filenoncpgset = "/RaMWAS/hg19_1kG_MAF_0.01_chr1-22_bowtie2_75bp_nonCpG.rds", maxrepeats = 3, maxfragmentsize = 250, minfragmentsize = 75, filebam2sample = "000_list_of_files.txt", filecovariates = "Covariates.txt", modelcovariates = c("Age","Sex"), modeloutcome = "CellType", modelPCs = 1, cvnfolds = 10, mmncpgs = 1000, mmalpha = 0 ) ## Not run: ramwas1scanBams(param) ramwas2collectqc(param) ramwas3normalizedCoverage(param) ramwas4PCA(param) ramwas5MWAS(param) ramwas6annotateTopFindings(param) ramwas7riskScoreCV(param) ## End(Not run)
The function plotPrediction
plots
cross validation predictions of a phenotype
against true values of the phenotype
with multiple summary stats in the title.
The function plotCVcors
plots
the predictive power (correlations) across
predictions using various numbers of markers.
The function plotROC
plots an ROC (Receiver operating characteristic)
curve for predictions of a binary outcome.
plotPrediction( param, outcome, forecast, cpgs2use, main, dfFull = NULL) plotCVcors(cl, param) plotROC(outcome, forecast)
plotPrediction( param, outcome, forecast, cpgs2use, main, dfFull = NULL) plotCVcors(cl, param) plotROC(outcome, forecast)
param |
List of parameters as described in the "RW6_param.Rmd" vignette. |
outcome |
Values of a phenotype. Must be binary for |
forecast |
Predictions for the phenotype. |
cpgs2use |
Number of variables used for prediction (for the legend). |
main |
Part of the title (summary stats are added beneath). |
dfFull |
Number of degrees of freedom for the significance testing. |
cl |
List with three elements:
|
The plotROC
and plot has no title.
To add a title use title
.
The plotROC
returns the area under the curve (AUC) for the ROC.
The plotPrediction
function returns the list of
calculated statistics printed in the title.
The plotCVcors
returns nothing (NULL
).
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
# Sample data n = 300 param = list(modeloutcome = "Age", mmalpha = 0, cvnfolds = 5) outcome = rnorm(n, mean = 50, sd = 20) forecast = outcome + rnorm(n, mean = 0, sd = 20) cpgs2use = 1000 main = "Prediction success (simulated data)" # Plot phenotype-prediction plot plotPrediction( param, outcome, forecast, cpgs2use, main) # Artificial data for plotCVcors() cl = list( x = c(50, 100, 200, 500, 1000), corp = c(0.1, 0.6, 0.7, 0.85, 0.8), cors = c(0.1, 0.6, 0.7, 0.85, 0.8) + rnorm(5)*0.1) # Plot prediction performance by the number of markers plotCVcors(cl, param) # Make the outcome binary for ROC plot outcome = (outcome > 50) # Plot ROC curve and calculate the AUC plotROC(outcome, forecast)
# Sample data n = 300 param = list(modeloutcome = "Age", mmalpha = 0, cvnfolds = 5) outcome = rnorm(n, mean = 50, sd = 20) forecast = outcome + rnorm(n, mean = 0, sd = 20) cpgs2use = 1000 main = "Prediction success (simulated data)" # Plot phenotype-prediction plot plotPrediction( param, outcome, forecast, cpgs2use, main) # Artificial data for plotCVcors() cl = list( x = c(50, 100, 200, 500, 1000), corp = c(0.1, 0.6, 0.7, 0.85, 0.8), cors = c(0.1, 0.6, 0.7, 0.85, 0.8) + rnorm(5)*0.1) # Plot prediction performance by the number of markers plotCVcors(cl, param) # Make the outcome binary for ROC plot outcome = (outcome > 50) # Plot ROC curve and calculate the AUC plotROC(outcome, forecast)
RaMWAS functions for estimation and plotting of the fragment size distribution.
estimateFragmentSizeDistribution(frdata, seqLength) plotFragmentSizeDistributionEstimate( frdata, estimate, col1 = "blue", col2 = "red")
estimateFragmentSizeDistribution(frdata, seqLength) plotFragmentSizeDistributionEstimate( frdata, estimate, col1 = "blue", col2 = "red")
frdata |
Distribution of distances from the starts of isolated reads to the respective CpGs. |
seqLength |
The length of sequenced part of the fragments. |
estimate |
Fragment size distribution estimate. |
col1 |
Color of |
col2 |
Color of |
The function estimateFragmentSizeDistribution
returns the estimate of the fragment size distribution.
If the length of frdata
is equal to seqLength
,
the fragments are assumed to all be of length seqLength
.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
# Simulate data x = 0:250 truemean = 1 - pnorm(x, mean = 150, sd = 50) frdata = rpois(n = length(x), lambda = truemean*300) # Estimate fragment size distribution estimate = estimateFragmentSizeDistribution(frdata, seqLength = 50) # Plot fragment size distribution estimate plotFragmentSizeDistributionEstimate(frdata, estimate)
# Simulate data x = 0:250 truemean = 1 - pnorm(x, mean = 150, sd = 50) frdata = rpois(n = length(x), lambda = truemean*300) # Estimate fragment size distribution estimate = estimateFragmentSizeDistribution(frdata, seqLength = 50) # Plot fragment size distribution estimate plotFragmentSizeDistributionEstimate(frdata, estimate)
The function plotPCvalues
plots PC values (variation explained).
The function plotPCvectors
plots PC vectors (loadings).
plotPCvalues(values, n = 40, ylim = NULL, col = "blue") plotPCvectors(eigenvector, i, col = "blue1")
plotPCvalues(values, n = 40, ylim = NULL, col = "blue") plotPCvectors(eigenvector, i, col = "blue1")
values |
Vector of PC values. |
n |
Number of top PCs to plot. |
ylim |
Numeric vectors of length 2, giving the y coordinate range. Exactly as in Plotting Parameters. |
col |
Color of the plotted points. |
eigenvector |
The i-th eigenvector. See |
i |
Indicates loadings of which PC to plot. |
This function creates a PC plot and returns nothing (NULL
).
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
# Sample data # for 1000 observations and 10 samples m = 1000 n = 10 data = matrix(rnorm(n*m), nrow = m) # Covariance and eigenvalue decomposition covmat = crossprod(data) e = eigen(covmat) # Plot PC values plotPCvalues(e$values) # Plot PC vectors plotPCvectors(e$vectors[,1], 1) plotPCvectors(e$vectors[,2], 2)
# Sample data # for 1000 observations and 10 samples m = 1000 n = 10 data = matrix(rnorm(n*m), nrow = m) # Covariance and eigenvalue decomposition covmat = crossprod(data) e = eigen(covmat) # Plot PC values plotPCvalues(e$values) # Plot PC vectors plotPCvectors(e$vectors[,1], 1) plotPCvectors(e$vectors[,2], 2)
The pipeline parameters can be provided via command line.
For example: R pipeline.r dirproject="/project" maxrepeats=0 modeloutcome="Age"
Each command line argument is treated as an R statement.
All variables defined this way are collected in a list which is returned.
processCommandLine(.arg = NULL)
processCommandLine(.arg = NULL)
.arg |
Vector of command line parameters. Obtained from
|
If a command line argument defines variable "fileparam"
,
it is assumed to be a filename, and the file with this name
is scanned for extra pipeline parameters,
as by parametersFromFile
.
Returns the list with all the variables set by the statement in the command line.
Variables with names starting with period (.) are ignored.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
filename = tempfile() # Assume command line with two components: # dirproject="." # modelcovariates=c("Age","Sex") arg = c( "dirproject = \".\"", "modelcovariates = c(\"Age\",\"Sex\")") # Process the command line param = processCommandLine(arg) # Show the list print(param)
filename = tempfile() # Assume command line with two components: # dirproject="." # modelcovariates=c("Age","Sex") arg = c( "dirproject = \".\"", "modelcovariates = c(\"Age\",\"Sex\")") # Process the command line param = processCommandLine(arg) # Show the list print(param)
Calculate Benjamini-Hochberg q-values for a vector of p-values.
pvalue2qvalue(pv, n = length(pv))
pvalue2qvalue(pv, n = length(pv))
pv |
Vector of p-values. |
n |
If |
The q-values can be slightly conservative compared to other popular q-value calculation methods.
Return a vector of q-values matching p-values in pv
.
The function runs faster if the vector pv
is sorted.
Andrey A Shabalin [email protected]
pv = runif(20)^2 qv = pvalue2qvalue(pv)
pv = runif(20)^2 qv = pvalue2qvalue(pv)
RaMWAS calculates a number of QC measures for each BAM and saves them in R .rds files.
For full description of the QC measures and the ploting options
runvignette("RW3_BAM_QCs")
qcmean(x) ## S3 method for class 'NULL' qcmean(x) ## S3 method for class 'qcChrX' qcmean(x) ## S3 method for class 'qcChrY' qcmean(x) ## S3 method for class 'qcCoverageByDensity' qcmean(x) ## S3 method for class 'qcEditDist' qcmean(x) ## S3 method for class 'qcEditDistBF' qcmean(x) ## S3 method for class 'qcFrwrev' qcmean(x) ## S3 method for class 'qcHistScore' qcmean(x) ## S3 method for class 'qcHistScoreBF' qcmean(x) ## S3 method for class 'qcIsoDist' qcmean(x) ## S3 method for class 'qcLengthMatched' qcmean(x) ## S3 method for class 'qcLengthMatchedBF' qcmean(x) ## S3 method for class 'qcNonCpGreads' qcmean(x) ## S3 method for class 'qcHistScore' plot(x, samplename="", xstep = 25, ...) ## S3 method for class 'qcHistScoreBF' plot(x, samplename="", xstep = 25, ...) ## S3 method for class 'qcEditDist' plot(x, samplename="", xstep = 5, ...) ## S3 method for class 'qcEditDistBF' plot(x, samplename="", xstep = 5, ...) ## S3 method for class 'qcLengthMatched' plot(x, samplename="", xstep = 25, ...) ## S3 method for class 'qcLengthMatchedBF' plot(x, samplename="", xstep = 25, ...) ## S3 method for class 'qcIsoDist' plot(x, samplename="", xstep = 25, ...) ## S3 method for class 'qcCoverageByDensity' plot(x, samplename="", ...)
qcmean(x) ## S3 method for class 'NULL' qcmean(x) ## S3 method for class 'qcChrX' qcmean(x) ## S3 method for class 'qcChrY' qcmean(x) ## S3 method for class 'qcCoverageByDensity' qcmean(x) ## S3 method for class 'qcEditDist' qcmean(x) ## S3 method for class 'qcEditDistBF' qcmean(x) ## S3 method for class 'qcFrwrev' qcmean(x) ## S3 method for class 'qcHistScore' qcmean(x) ## S3 method for class 'qcHistScoreBF' qcmean(x) ## S3 method for class 'qcIsoDist' qcmean(x) ## S3 method for class 'qcLengthMatched' qcmean(x) ## S3 method for class 'qcLengthMatchedBF' qcmean(x) ## S3 method for class 'qcNonCpGreads' qcmean(x) ## S3 method for class 'qcHistScore' plot(x, samplename="", xstep = 25, ...) ## S3 method for class 'qcHistScoreBF' plot(x, samplename="", xstep = 25, ...) ## S3 method for class 'qcEditDist' plot(x, samplename="", xstep = 5, ...) ## S3 method for class 'qcEditDistBF' plot(x, samplename="", xstep = 5, ...) ## S3 method for class 'qcLengthMatched' plot(x, samplename="", xstep = 25, ...) ## S3 method for class 'qcLengthMatchedBF' plot(x, samplename="", xstep = 25, ...) ## S3 method for class 'qcIsoDist' plot(x, samplename="", xstep = 25, ...) ## S3 method for class 'qcCoverageByDensity' plot(x, samplename="", ...)
x |
The QC object. See the examples below. |
samplename |
Name of the sample for plot title. |
xstep |
The distance between x axis ticks. |
... |
Parameters passed to the underlying
|
Function qcmean
returns one value summary of most QC measures.
Run vignette("RW3_BAM_QCs")
for description of values returned by it.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
# Load QC data from a sample project filename = system.file("extdata", "bigQC.rds", package = "ramwas") qc = readRDS(filename)$qc ## The number of BAM files cat("N BAMs:", qc$nbams) ## Total number of reads in the BAM file(s) cat("Reads total:", qc$reads.total) ## Number of reads aligned to the reference genome cat("Reads aligned:", qc$reads.aligned, "\n") cat("This is ", qc$reads.aligned / qc$reads.total * 100, "% of all reads", sep="") ## Number of reads that passed minimum score filter and are recorded cat("Reads recorded:",qc$reads.recorded,"\n") cat("This is ", qc$reads.recorded / qc$reads.aligned * 100, "% of aligned reads", sep="") ## Number of recorded reads aligned to ## the forward and reverse strands respectively cat("Reads on forward strand:", qc$frwrev[1],"\n") cat("Reads on reverse strand:", qc$frwrev[2],"\n") cat("Fraction of reads on forward strand:", qcmean(qc$frwrev), "\n") ## Distribution of the read scores cat("Average alignment score:", qcmean(qc$hist.score1), "\n") cat("Average alignment score, no filter:", qcmean(qc$bf.hist.score1), "\n") par(mfrow=c(1,2)) plot(qc$hist.score1) plot(qc$bf.hist.score1) ## Distribution of the length of the aligned part of the reads cat("Average aligned length:", qcmean(qc$hist.length.matched), "\n") cat("Average aligned length, no filter:", qcmean(qc$bf.hist.length.matched), "\n") par(mfrow = c(1,2)) plot(qc$hist.length.matched) plot(qc$bf.hist.length.matched) ## Distribution of edit distance between ## the aligned part of the read and the reference genome cat("Average edit distance:", qcmean(qc$hist.edit.dist1), "\n") cat("Average edit distance, no filter:", qcmean(qc$bf.hist.edit.dist1), "\n") par(mfrow = c(1,2)) plot(qc$hist.edit.dist1) plot(qc$bf.hist.edit.dist1) ## Number of reads after removal of duplicate reads cat("Reads without duplicates:", qc$reads.recorded.no.repeats, "\n") cat("This is ", qc$reads.recorded.no.repeats / qc$reads.recorded * 100, "% of aligned reads", "\n", sep="") cat("Fraction of reads on forward strand (with duplicates):", qcmean(qc$frwrev), "\n") cat("Fraction of reads on forward strand (without duplicates):", qcmean(qc$frwrev.no.repeats), "\n") ## Number of reads away from CpGs cat("Non-CpG reads:", qc$cnt.nonCpG.reads[1], "\n") cat("This is ", qcmean(qc$cnt.nonCpG.reads)*100, "% of recorded reads", sep="") ## Average coverage of CpGs and non-CpGs cat("Summed across", qc$nbams, "bams", "\n") cat("Average CpG coverage:", qc$avg.cpg.coverage, "\n") cat("Average non-CpG coverage:", qc$avg.noncpg.coverage,"\n") cat("Enrichment ratio:", qc$avg.cpg.coverage / qc$avg.noncpg.coverage) ## Coverage around isolated CpGs plot(qc$hist.isolated.dist1) ## Fraction of reads from chrX and chrY cat("ChrX reads: ", qc$chrX.count[1], ", which is ", qcmean(qc$chrX.count)*100, "% of total", sep="", "\n") cat("ChrX reads: ", qc$chrY.count[1], ", which is ", qcmean(qc$chrY.count)*100, "% of total", sep="", "\n") ## Coverage vs. CpG density cat("Highest coverage is observed at CpG density of", qcmean(qc$avg.coverage.by.density)^2) plot(qc$avg.coverage.by.density)
# Load QC data from a sample project filename = system.file("extdata", "bigQC.rds", package = "ramwas") qc = readRDS(filename)$qc ## The number of BAM files cat("N BAMs:", qc$nbams) ## Total number of reads in the BAM file(s) cat("Reads total:", qc$reads.total) ## Number of reads aligned to the reference genome cat("Reads aligned:", qc$reads.aligned, "\n") cat("This is ", qc$reads.aligned / qc$reads.total * 100, "% of all reads", sep="") ## Number of reads that passed minimum score filter and are recorded cat("Reads recorded:",qc$reads.recorded,"\n") cat("This is ", qc$reads.recorded / qc$reads.aligned * 100, "% of aligned reads", sep="") ## Number of recorded reads aligned to ## the forward and reverse strands respectively cat("Reads on forward strand:", qc$frwrev[1],"\n") cat("Reads on reverse strand:", qc$frwrev[2],"\n") cat("Fraction of reads on forward strand:", qcmean(qc$frwrev), "\n") ## Distribution of the read scores cat("Average alignment score:", qcmean(qc$hist.score1), "\n") cat("Average alignment score, no filter:", qcmean(qc$bf.hist.score1), "\n") par(mfrow=c(1,2)) plot(qc$hist.score1) plot(qc$bf.hist.score1) ## Distribution of the length of the aligned part of the reads cat("Average aligned length:", qcmean(qc$hist.length.matched), "\n") cat("Average aligned length, no filter:", qcmean(qc$bf.hist.length.matched), "\n") par(mfrow = c(1,2)) plot(qc$hist.length.matched) plot(qc$bf.hist.length.matched) ## Distribution of edit distance between ## the aligned part of the read and the reference genome cat("Average edit distance:", qcmean(qc$hist.edit.dist1), "\n") cat("Average edit distance, no filter:", qcmean(qc$bf.hist.edit.dist1), "\n") par(mfrow = c(1,2)) plot(qc$hist.edit.dist1) plot(qc$bf.hist.edit.dist1) ## Number of reads after removal of duplicate reads cat("Reads without duplicates:", qc$reads.recorded.no.repeats, "\n") cat("This is ", qc$reads.recorded.no.repeats / qc$reads.recorded * 100, "% of aligned reads", "\n", sep="") cat("Fraction of reads on forward strand (with duplicates):", qcmean(qc$frwrev), "\n") cat("Fraction of reads on forward strand (without duplicates):", qcmean(qc$frwrev.no.repeats), "\n") ## Number of reads away from CpGs cat("Non-CpG reads:", qc$cnt.nonCpG.reads[1], "\n") cat("This is ", qcmean(qc$cnt.nonCpG.reads)*100, "% of recorded reads", sep="") ## Average coverage of CpGs and non-CpGs cat("Summed across", qc$nbams, "bams", "\n") cat("Average CpG coverage:", qc$avg.cpg.coverage, "\n") cat("Average non-CpG coverage:", qc$avg.noncpg.coverage,"\n") cat("Enrichment ratio:", qc$avg.cpg.coverage / qc$avg.noncpg.coverage) ## Coverage around isolated CpGs plot(qc$hist.isolated.dist1) ## Fraction of reads from chrX and chrY cat("ChrX reads: ", qc$chrX.count[1], ", which is ", qcmean(qc$chrX.count)*100, "% of total", sep="", "\n") cat("ChrX reads: ", qc$chrY.count[1], ", which is ", qcmean(qc$chrY.count)*100, "% of total", sep="", "\n") ## Coverage vs. CpG density cat("Highest coverage is observed at CpG density of", qcmean(qc$avg.coverage.by.density)^2) plot(qc$avg.coverage.by.density)
Function qqPlotFast
creates a QQ-plot
with a confidence band and
an estimate of inflation factor lambda.
It optimized to work quickly even for tens of millions of p-values.
qqPlotPrepare( pvalues, ntests = NULL, ismlog10 = FALSE) qqPlotFast( x, ntests = NULL, ismlog10 = FALSE, ci.level = 0.05, ylim = NULL, newplot = TRUE, col = "#D94D4C", cex = 0.5, yaxmax = NULL, lwd = 3, axistep = 2, col.band = "#ECA538", makelegend = TRUE, xlab = expression( paste("\u2013", " log"[10]*"(", italic("P"), "), null")), ylab = expression( paste("\u2013", " log"[10]*"(", italic("P"), "), observed")))
qqPlotPrepare( pvalues, ntests = NULL, ismlog10 = FALSE) qqPlotFast( x, ntests = NULL, ismlog10 = FALSE, ci.level = 0.05, ylim = NULL, newplot = TRUE, col = "#D94D4C", cex = 0.5, yaxmax = NULL, lwd = 3, axistep = 2, col.band = "#ECA538", makelegend = TRUE, xlab = expression( paste("\u2013", " log"[10]*"(", italic("P"), "), null")), ylab = expression( paste("\u2013", " log"[10]*"(", italic("P"), "), observed")))
pvalues |
Vector of p-values.
As is (if |
ntests |
If only significant p-values are provided,
the total number of tests performed. |
ismlog10 |
Specifies whether the provides p-values ( |
x |
Either a vector of p-values, as in |
ci.level |
Significance level of the confidence band.
Set to |
ylim |
Numeric vectors of length 2, giving the y coordinate range. Exactly as in Plotting Parameters. |
newplot |
If |
col |
The QQ-plot curve color. |
col.band |
Confidence band curve color. |
cex |
The size of QQ-plot points. As in Graphics Parameters. |
lwd |
The line width. |
axistep |
Distance between axis label ticks for both axis. |
yaxmax |
Maximum reach of the y axis. |
makelegend |
If true, add legend to the plot. |
xlab , ylab
|
Axis labels. As in plot function. |
The function qqPlotFast
creates a QQ-plot.
The function qqPlotPrepare
extracts the necessary information
from a vector of p-values sufficient for creating QQ-plot.
The resulting object is many times smaller than the vector of p-values.
The function qqPlotPrepare
returns an object with
the necessary information from a vector of p-values
sufficient for creating QQ-plot.
The plot has no title. To add a title use title
.
The function works faster if the p-values are sorted.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
# Million p-values n = 1e6 # Null p-values pv = runif(n) # QQ-plot should be nearly diagonal qqPlotFast(pv) title("QQ-plot") # Size of p-values before extraction of QQ-plot info object.size(pv) # Extract the QQ-plot info qq = qqPlotPrepare(pv) # Size of the QQ-plot info object object.size(qq) # Create QQ-plot, it is the same qqPlotFast(qq) # Create QQ-plot with plotting parameters qqPlotFast(qq, ylim = c(0,10), yaxmax = 9, axistep = 3, lwd = 3, cex = 1)
# Million p-values n = 1e6 # Null p-values pv = runif(n) # QQ-plot should be nearly diagonal qqPlotFast(pv) title("QQ-plot") # Size of p-values before extraction of QQ-plot info object.size(pv) # Extract the QQ-plot info qq = qqPlotPrepare(pv) # Size of the QQ-plot info object object.size(qq) # Create QQ-plot, it is the same qqPlotFast(qq) # Create QQ-plot with plotting parameters qqPlotFast(qq, ylim = c(0,10), yaxmax = 9, axistep = 3, lwd = 3, cex = 1)
Creates a set of artificial BAM files and supplementary files which can be used to test run the pipeline. The BAMs contain reads aligned only to one human chromosome, with methylation effects embedded for simulated age and case-control status.
ramwas0createArtificialData(dir, nsamples = 20, nreads = 1e6, ncpgs = 500e3, randseed = 18090212, threads = 1)
ramwas0createArtificialData(dir, nsamples = 20, nreads = 1e6, ncpgs = 500e3, randseed = 18090212, threads = 1)
dir |
Directory for generated RaMWAS project files and BAMs. |
nsamples |
Number of samples/BAMs to create. |
nreads |
Number of reads in each BAM file. |
ncpgs |
Number of CpGs in the generated genome (with a single chromosome). |
randseed |
Random number generator seed for consistency of the output. |
threads |
Number of CPU cores to use for data generation. |
The function generates a number of files within dir
directory.
bam_list.txt
- list of created BAM files.
To be used in filebamlist
and
filebam2sample
parameters in the pipeline.
covariates.txt
- table with age and sex status covariates.
For use in filecovariates
parameter in the pipeline.
Single_chromosome.rds
- CpG location file
with the selected chromosome only.
bams
- directory with all the BAM files.
The generated BAMs have 600 CpGs affected by sex
,
namely fully methylated or not methylated at all, depending on sex
.
The methylation level of 1% of all CpGs is affected by age.
The methylation of those CpGs is
equal to age/100
or 1-age/100
.
The age is generated randomly in the range from 20 to 80.
The function creates multiple files but returns no value.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
### Location for the artificial project dr = paste0(tempdir(), "/simulated_project") ramwas0createArtificialData( dr, nsamples = 4, nreads = 10e3, ncpgs = 1e3) # Artificial project files created in: dr # The generated files are:" as.matrix(list.files(dr, recursive=TRUE)) ### Clean up unlink(paste0(dr,"/*"), recursive=TRUE)
### Location for the artificial project dr = paste0(tempdir(), "/simulated_project") ramwas0createArtificialData( dr, nsamples = 4, nreads = 10e3, ncpgs = 1e3) # Artificial project files created in: dr # The generated files are:" as.matrix(list.files(dr, recursive=TRUE)) ### Clean up unlink(paste0(dr,"/*"), recursive=TRUE)
Calls biomart annotation database for a vector of locations and assignes the tracks to the locations.
ramwasAnnotateLocations(param, chr, pos)
ramwasAnnotateLocations(param, chr, pos)
param |
List of parameters as described in the "RW6_param.Rmd" vignette. |
chr |
A vector of chromosome names or numbers. |
pos |
A vector of genomic locations on the chromosomes. |
This function is used internally by RaMWAS annotation step.
An annotation table, on line per supplied location.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
bihost = "grch37.ensembl.org" bimart = "ENSEMBL_MART_ENSEMBL" bidataset = "hsapiens_gene_ensembl" biattributes = c("hgnc_symbol", "entrezgene", "strand") bifilters = list(with_hgnc_trans_name = TRUE) biflank = 0 param = ramwasParameters( bihost = bihost, bimart = bimart, bidataset = bidataset, biattributes = biattributes, bifilters = bifilters, biflank = biflank) # Test a location chr = "chr1" pos = 15975530 ## Not run: ramwasAnnotateLocations(param, chr, pos) ## End(Not run)
bihost = "grch37.ensembl.org" bimart = "ENSEMBL_MART_ENSEMBL" bidataset = "hsapiens_gene_ensembl" biattributes = c("hgnc_symbol", "entrezgene", "strand") bifilters = list(with_hgnc_trans_name = TRUE) biflank = 0 param = ramwasParameters( bihost = bihost, bimart = bimart, bidataset = bidataset, biattributes = biattributes, bifilters = bifilters, biflank = biflank) # Test a location chr = "chr1" pos = 15975530 ## Not run: ramwasAnnotateLocations(param, chr, pos) ## End(Not run)
RaMWAS parameter vector which is used by major functions of the pipeline is a regular R list and setting it does not require a special function. However, using this function makes it much simpler in RStudio as the names and role of every parameter is showed in the RStudio IDE.
ramwasParameters( dirproject, dirfilter, dirrbam, dirrqc, dirqc, dircoveragenorm, dirtemp, dirpca, dirmwas, dircv, dirbam, filebamlist, bamnames, filebam2sample, bam2sample, filecpgset, filenoncpgset, filecovariates, covariates, cputhreads, diskthreads, usefilelock, scoretag, minscore, maxrepeats, minavgcpgcoverage, minnonzerosamples, buffersize, doublesize, modelcovariates, modeloutcome, modelPCs, modelhasconstant, qqplottitle, toppvthreshold, mmncpgs, mmalpha, cvnfolds, bihost, bimart, bidataset, biattributes, bifilters, biflank, fileSNPs, dirSNPs, ...)
ramwasParameters( dirproject, dirfilter, dirrbam, dirrqc, dirqc, dircoveragenorm, dirtemp, dirpca, dirmwas, dircv, dirbam, filebamlist, bamnames, filebam2sample, bam2sample, filecpgset, filenoncpgset, filecovariates, covariates, cputhreads, diskthreads, usefilelock, scoretag, minscore, maxrepeats, minavgcpgcoverage, minnonzerosamples, buffersize, doublesize, modelcovariates, modeloutcome, modelPCs, modelhasconstant, qqplottitle, toppvthreshold, mmncpgs, mmalpha, cvnfolds, bihost, bimart, bidataset, biattributes, bifilters, biflank, fileSNPs, dirSNPs, ...)
dirproject |
The project directory. Default is currect directory. |
dirfilter |
By default, the same as "dirproject". |
dirrbam |
Directory where RaMWAS saves RaMWAS raw data files
(read start locations) after scanning BAMs. |
dirrqc |
Directory where RaMWAS saves QC files in R format after scanning BAMs.
|
dirqc |
Directory where RaMWAS saves QC plots and text files (BAM QC info)
after scanning BAMs. |
dircoveragenorm |
Directory where RaMWAS saves coverage matrix at Step 3 of the pipeline.
|
dirtemp |
Directory where RaMWAS stores temporary files
during construction of coverage matrix at Step 3 of the pipeline. |
dirpca |
Directory where RaMWAS saves results of PCA analysis at Step 4
of the pipeline. |
dirmwas |
Directory where RaMWAS saves results of MWAS analysis at Step 5
of the pipeline. |
dircv |
Directory where RaMWAS saves results of Methylation Risk Score analysis
at Step 7 of the pipeline. |
dirbam |
Location of BAM files. |
filebamlist |
If defined, must point to a text file with one BAM file name per line.
|
bamnames |
A character vector with BAM file names. |
filebam2sample |
Allowes multiple BAMs contain information about common sample. |
bam2sample |
Allowes multiple BAMs contain information about common sample. |
filecpgset |
Name of the file storing a set of CpGs. |
filenoncpgset |
If defined, must point to a file storing vetted locations away from any CpGs. |
filecovariates |
Name of the file containing phenotype and covariates
for the available samples. |
covariates |
Data frame with phenotype and covariates
for the available samples. |
cputhreads |
Maximum number of CPU intensive tasks running in parallel. |
diskthreads |
Maximum number of disk intensive tasks running in parallel. |
usefilelock |
If TRUE, parallel jobs are prevented from simultaneous access
to file matrices. |
scoretag |
Reads from BAM files are filtered by this tag. |
minscore |
Reads from BAM files with score "scoretag" below this are excluded. |
maxrepeats |
Duplicate reads (reads with the same start position and direction) in excess of this limit are removed. |
minavgcpgcoverage |
CpGs with average coverage below this threshold are removed. |
minnonzerosamples |
CpGs with fraction of samples with non-zero coverage below this threshold are removed. |
buffersize |
Coverage matrix transposition is performed using buffers of this size.
|
doublesize |
The coverage matrix is stored with this number of bytes per value. |
modelcovariates |
Names of covariates included in PCA and MWAS. |
modeloutcome |
Name of the outcome variable for MWAS. |
modelPCs |
Number of principal components accounted for in MWAS. |
modelhasconstant |
By default, the tested linear model includes a constant. |
qqplottitle |
The title of the QQ-plot produced by MWAS (step 4 of the pipeline). |
toppvthreshold |
Determines the number of top MWAS results saved in text file. |
mmncpgs |
Parameter for multi-marker elastic net cross validation (MRS). |
mmalpha |
Parameter for multi-marker elastic net cross validation (MRS). |
cvnfolds |
Parameter for multi-marker elastic net cross validation (MRS). |
bihost |
Parameter for BiomaRt annotation (Step 6 of the pipeline). |
bimart |
Parameter for BiomaRt annotation (Step 6 of the pipeline). |
bidataset |
Parameter for BiomaRt annotation (Step 6 of the pipeline). |
biattributes |
Parameter for BiomaRt annotation (Step 6 of the pipeline). |
bifilters |
Parameter for BiomaRt annotation (Step 6 of the pipeline). |
biflank |
Parameter for BiomaRt annotation (Step 6 of the pipeline). |
fileSNPs |
Name of the filematrix with genotype (SNP) data. |
dirSNPs |
Directory where RaMWAS saves the results of joint methylation-genotype analysis. |
... |
Any other named parameters can be added here. |
The function simply collects all the parameters in a list.
The main benefit of the function is that the user does not
need to memorize the names of RaMWAS parameters.
Here is how it helps in RStudio:
List with provided parameters.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
ramwasParameters(dirproject = ".", cputhreads = 4)
ramwasParameters(dirproject = ".", cputhreads = 4)
Form row and column sums of squares for numeric matrices.
The functions are introduced as faster analogs of
rowSums(x^2)
and colSums(x^2)
calls.
rowSumsSq(x) colSumsSq(x)
rowSumsSq(x) colSumsSq(x)
x |
Numeric matrix. |
The function is implemented in C for better performance.
Return a vector of sums of values in each row/column for matrix
x
(rowSumsSq
/colSumsSq
).
Andrey A Shabalin [email protected]
See rowSums
and colSums
for simple (not squared) row/column sums.
x = matrix( 1:99, 9, 11) # Calculate sums of squared elements in each row rsum2 = rowSumsSq(x) # Compare with alternative calculation stopifnot( all.equal( rsum2, rowSums(x^2) )) # Calculate sums of squared elements in each column csum2 = colSumsSq(x) # Compare with alternative calculation stopifnot( all.equal( csum2, colSums(x^2) ))
x = matrix( 1:99, 9, 11) # Calculate sums of squared elements in each row rsum2 = rowSumsSq(x) # Compare with alternative calculation stopifnot( all.equal( rsum2, rowSums(x^2) )) # Calculate sums of squared elements in each column csum2 = colSumsSq(x) # Compare with alternative calculation stopifnot( all.equal( csum2, colSums(x^2) ))
This class is a wrapper for accessing the data (coverage) matrix. It automatically subsets the samples to those listed in the covariates. Data access function imputes missing values and can residualize the variables.
rwDataClass
is a reference classes
(see envRefClass
).
fmdata
:Filematrix object for the data matrix.
Not intended to be accessed directly.
samplenames
:Vector of sample names.
nsamples
:Number of samples
ncpgs
:Number of variables (CpG sites) in the data matrix.
ndatarows
:Number of variables in the data matrix (may be bigger than the number of samples).
rowsubset
:Indices of samples in the data matrix.
cvrtqr
:Matrix of orthonormalized covariates.
initialize(param = NULL, getPCs = TRUE, lockfile = NULL)
:Create the data access class.
'param' should contain the RaMWAS parameter vector.
'getPCs' indicates if the covariate set should include
Principal components.
'lockfile' is the 'lockfile' parameter used in accessing the
data filematrix.
open(param = NULL, getPCs = TRUE, lockfile = NULL)
:The same as 'initialize' method, but for already created object.
close()
:Clears the object. Closes the filematrix.
getDataRez(colset, resid = TRUE)
:Extracts data for variables indexed by 'colset'.
The data is residualized unless resid = FALSE
.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
# Create an empty rwDataClass data = new("rwDataClass") ## Not run: # Connect to the data data$open(param) # Create a rwDataClass and connect to the data data = new("rwDataClass", param = param) # close the object data$close() ## End(Not run)
# Create an empty rwDataClass data = new("rwDataClass") ## Not run: # Connect to the data data$open(param) # Create a rwDataClass and connect to the data data = new("rwDataClass", param = param) # close the object data$close() ## End(Not run)
Subset a data (coverage) matrix and corresponding matrix of locations to a specified set of locations.
subsetCoverageDirByLocation(x, chr, start, targetdir)
subsetCoverageDirByLocation(x, chr, start, targetdir)
x |
Name of data (coverage) directory or
list of RaMWAS parameters as described in the "RW6_param.Rmd" vignette.
|
chr |
Vector of chromosome names or numbers. |
start |
Start positions of the CpGs of interest. |
targetdir |
Directory name for the new (subset) data matrix and locations. |
The function returns nothing.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
x = "/data/myCoverageMatrix" chr = c("chr1", "chr2", "chr3") start = c(12345, 123, 12) targetdir = "/data/subsetCoverageMatrix" ## Not run: subsetCoverageDirByLocation(x, chr, start, targetdir) ## End(Not run)
x = "/data/myCoverageMatrix" chr = c("chr1", "chr2", "chr3") start = c(12345, 123, 12) targetdir = "/data/subsetCoverageMatrix" ## Not run: subsetCoverageDirByLocation(x, chr, start, targetdir) ## End(Not run)
An internal, function for fast association testing. It tests the phenotype of interest for association with methylation coverage (columns of the data parameter).
testPhenotype(phenotype, data1, cvrtqr)
testPhenotype(phenotype, data1, cvrtqr)
phenotype |
Vector with phenotype. Can be numerical, character, or factor vector. |
data1 |
Matrix with data (normalized coverage), one variable (CpG) per column. |
cvrtqr |
Orthonormalized covariates, one covariate per column.
See |
The testing is performed using matrix operations and C/C++ code, emplying an approach similar to that in MatrixEQTL.
If the phenotype is numerical, the output is a list with
correlation |
Correlations between residualized phenotype and data columns. |
tstat |
Corresponding T-statistics |
pvalue |
Corresponding P-values |
nVarTested |
Always 1 |
dfFull |
Number of degrees of freedom of the T-test |
If the phenotype is a factor (or character)
Rsquared |
R-squared for the residualized ANOVA F-test. |
Fstat |
Corresponding F-test |
pvalue |
Corresponding P-values |
nVarTested |
First number of degrees of freedom for the F-test. Equal to the number of factor levels reduced by 1 |
dfFull |
Second number of degrees of freedom for the F-test. |
This function is used in several parts of the pipeline.
Andrey A Shabalin [email protected]
See vignettes: browseVignettes("ramwas")
.
Also check orthonormalizeCovariates
.
### Generate data inputs # Random data matrix with signal in the first column data = matrix(runif(30*5), nrow = 30, ncol = 5) data[,1] = data[,1] + rep(0:2, each = 10) # Two random covariates cvrt = matrix(runif(2*30), nrow = 30, ncol = 2) cvrtqr = orthonormalizeCovariates(cvrt) ### First, illustrate with numerical phenotype # Numerical, 3 value phenotype phenotype = rep(1:3, each = 10) # Test for association output = testPhenotype(phenotype, data, cvrtqr) # Show the results print(output) # Comparing with standard R code for the first variable summary(lm( data[,1] ~ phenotype + cvrt )) ### First, illustrate with numerical phenotype # Categorical, 3 group phenotype phenotype = rep(c("Normal", "Sick", "Dead"), each = 10) # Test for association output = testPhenotype(phenotype, data, cvrtqr) # Show the results print(output) # Comparing with standard R code for the first variable anova(lm( data[,1] ~ cvrt + phenotype ))
### Generate data inputs # Random data matrix with signal in the first column data = matrix(runif(30*5), nrow = 30, ncol = 5) data[,1] = data[,1] + rep(0:2, each = 10) # Two random covariates cvrt = matrix(runif(2*30), nrow = 30, ncol = 2) cvrtqr = orthonormalizeCovariates(cvrt) ### First, illustrate with numerical phenotype # Numerical, 3 value phenotype phenotype = rep(1:3, each = 10) # Test for association output = testPhenotype(phenotype, data, cvrtqr) # Show the results print(output) # Comparing with standard R code for the first variable summary(lm( data[,1] ~ phenotype + cvrt )) ### First, illustrate with numerical phenotype # Categorical, 3 group phenotype phenotype = rep(c("Normal", "Sick", "Dead"), each = 10) # Test for association output = testPhenotype(phenotype, data, cvrtqr) # Show the results print(output) # Comparing with standard R code for the first variable anova(lm( data[,1] ~ cvrt + phenotype ))