Title: | Accurate identification of co-methylated and differentially methylated regions in epigenome-wide association studies |
---|---|
Description: | coMethDMR identifies genomic regions associated with continuous phenotypes by optimally leverages covariations among CpGs within predefined genomic regions. Instead of testing all CpGs within a genomic region, coMethDMR carries out an additional step that selects co-methylated sub-regions first without using any outcome information. Next, coMethDMR tests association between methylation within the sub-region and continuous phenotype using a random coefficient mixed effects model, which models both variations between CpG sites within the region and differential methylation simultaneously. |
Authors: | Fernanda Veitzman [cre], Lissette Gomez [aut], Tiago Silva [aut], Ning Lijiao [ctb], Boissel Mathilde [ctb], Lily Wang [aut], Gabriel Odom [aut] |
Maintainer: | Fernanda Veitzman <[email protected]> |
License: | GPL-3 |
Version: | 1.11.0 |
Built: | 2024-10-30 05:38:04 UTC |
Source: | https://github.com/bioc/coMethDMR |
coMethDMR
Pipeline ResultsGiven a data frame with regions in the genome, add gene symbols, UCSC reference gene accession, UCSC reference gene group and relation to CpG island.
AnnotateResults(lmmRes_df, arrayType = c("450k", "EPIC"), nCores_int = 1L, ...)
AnnotateResults(lmmRes_df, arrayType = c("450k", "EPIC"), nCores_int = 1L, ...)
lmmRes_df |
A data frame returned by
|
arrayType |
Type of array: 450k or EPIC |
nCores_int |
Number of computing cores to be used when executing code in parallel. Defaults to 1 (serial computing). |
... |
Dots for additional arguments passed to the cluster constructor.
See |
The region types include "NSHORE"
, "NSHELF"
,
"SSHORE"
, "SSHELF"
, "TSS1500"
, "TSS200"
,
"UTR5"
, "EXON1"
, "GENEBODY"
, "UTR3"
, and
"ISLAND"
.
A data frame with
the location of the genomic region's chromosome (chrom
),
start (start
), and end (end
);
UCSC annotation information (UCSC_RefGene_Group
,
UCSC_RefGene_Accession
, and UCSC_RefGene_Name
); and
a list of all of the probes in that region (probes
).
lmmResults_df <- data.frame( chrom = c("chr22", "chr22", "chr22", "chr22", "chr22"), start = c("39377790", "50987294", "19746156", "42470063", "43817258"), end = c("39377930", "50987527", "19746368", "42470223", "43817384"), regionType = c("TSS1500", "EXON1", "ISLAND", "TSS200", "ISLAND"), stringsAsFactors = FALSE ) AnnotateResults( lmmRes_df = lmmResults_df, arrayType = "450k" )
lmmResults_df <- data.frame( chrom = c("chr22", "chr22", "chr22", "chr22", "chr22"), start = c("39377790", "50987294", "19746156", "42470063", "43817258"), end = c("39377930", "50987527", "19746368", "42470223", "43817384"), regionType = c("TSS1500", "EXON1", "ISLAND", "TSS200", "ISLAND"), stringsAsFactors = FALSE ) AnnotateResults( lmmRes_df = lmmResults_df, arrayType = "450k" )
Subset of an Alzheimer's Disease methylation data set, with beta values for measured CpGs methylation levels.
data("betaMatrix_ex1")
data("betaMatrix_ex1")
A data frame containing beta values for 4 CpGs in one CpG islands for 110 subjects. Each column is a CpG, each row is a sample.
GEO accession: GSE59685
Subset of an Alzheimer's Disease methylation data set, with beta values for measured CpGs methylation levels.
data("betaMatrix_ex2")
data("betaMatrix_ex2")
A data frame containing beta values for 4 CpGs in one CpG islands for 110 subjects. Each column is a CpG, each row is a sample.
GEO accession: GSE59685
Subset of an Alzheimer's Disease methylation data set, with beta values for measured CpGs methylation levels.
data("betaMatrix_ex3")
data("betaMatrix_ex3")
A data frame containing beta values for 6 CpGs in one CpG islands for 110 subjects. Each column is a CpG, each row is a sample.
GEO accession: GSE59685
Subset of an Alzheimer's Disease methylation data set, with beta values for measured CpGs methylation levels.
data("betaMatrix_ex4")
data("betaMatrix_ex4")
A data frame containing beta values for 52 CpGs in one CpG islands for 110 subjects. Each column is a CpG, each row is a sample.
GEO accession: GSE59685
Subset of an Alzheimer's methylation dataset, with beta values for CpGs.
data("betasChr22_df")
data("betasChr22_df")
A data frame containing beta values for 8552 CpGs in Chr22 for a subset of 20 subjects.
GEO accession: GSE59685
Extract clusters of CpGs located closely in a genomic region.
CloseBySingleRegion( CpGs_char, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), manifest_gr = NULL, maxGap = 200, minCpGs = 3 )
CloseBySingleRegion( CpGs_char, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), manifest_gr = NULL, maxGap = 200, minCpGs = 3 )
CpGs_char |
a list of CpG IDs |
genome |
Human genome of reference hg19 or hg38 |
arrayType |
Type of array, 450k or EPIC |
manifest_gr |
A GRanges object with the genome manifest (as returned by
|
maxGap |
an integer, genomic locations within maxGap from each other are placed into the same cluster |
minCpGs |
an integer, minimum number of CpGs for the resulting CpG cluster |
Note that this function depends only on CpG locations, and not on
any methylation data. The algorithm is based on the
clusterMaker
function in the bumphunter
R package. Each cluster is essentially a group of CpG locations such that
two consecutive locations in the cluster are separated by less than
maxGap
.
a list, each item in the list is a character vector of CpG IDs located closely (i.e. in the same cluster)
CpGs_char <- c( "cg02505293", "cg03618257", "cg04421269", "cg17885402", "cg19890033", "cg20566587", "cg27505880" ) cluster_ls <- CloseBySingleRegion( CpGs_char, genome = "hg19", arrayType = "450k", maxGap = 100, minCpGs = 3 )
CpGs_char <- c( "cg02505293", "cg03618257", "cg04421269", "cg17885402", "cg19890033", "cg20566587", "cg27505880" ) cluster_ls <- CloseBySingleRegion( CpGs_char, genome = "hg19", arrayType = "450k", maxGap = 100, minCpGs = 3 )
Extract contiguous co-methylated genomic regions from a list of pre-defined genomic regions
CoMethAllRegions( dnam, betaToM = FALSE, method = c("pearson", "spearman"), rDropThresh_num = 0.4, minCpGs = 3, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), CpGs_ls, file = NULL, returnAllCpGs = FALSE, output = c("CpGs", "dataframe"), nCores_int = 1L, ... )
CoMethAllRegions( dnam, betaToM = FALSE, method = c("pearson", "spearman"), rDropThresh_num = 0.4, minCpGs = 3, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), CpGs_ls, file = NULL, returnAllCpGs = FALSE, output = c("CpGs", "dataframe"), nCores_int = 1L, ... )
dnam |
matrix (or data frame) of beta values, with row names = CpG IDs, column names = sample IDs. This is typically genome-wide methylation beta values. |
betaToM |
indicates if converting methylation beta values to mvalues |
method |
method for computing correlation, can be "spearman" or "pearson" |
rDropThresh_num |
threshold for min correlation between a cpg with sum of the rest of the CpGs |
minCpGs |
minimum number of CpGs to be considered a "region".
Only regions with more than |
genome |
Human genome of reference hg19 or hg38 |
arrayType |
Type of array, can be "450k" or "EPIC" |
CpGs_ls |
list where each item is a character vector of CpGs IDs. This should be CpG probes located closely on the array. |
file |
an RDS file with clusters of CpG locations (i.e. CpGs
located closely to each other on the genome). This file can be generated
by the |
returnAllCpGs |
When there is not a contiguous comethylated region in
the inputting pre-defined region, |
output |
a character vector of CpGs or a dataframe of CpGs along with rDrop info |
nCores_int |
Number of computing cores to be used when executing code in parallel. Defaults to 1 (serial computing). |
... |
Dots for additional arguments passed to the cluster constructor.
See |
There are two ways to input genomic regions for this function: (1)
use CpGs_ls
argument, or (2) use file
argument. Examples of
these files are in /inst/extdata/ folder of the package.
When output = "dataframe"
is selected, returns a list of data
frames, each with CpG
(CpG name), Chr
(chromosome number),
MAPINFO
(genomic position), r_drop
(correlation between the
CpG with rest of the CpGs), keep
(indicator for co-methylated CpG),
keep_contiguous
(index for contiguous comethylated subregions).
When output = "CpGs"
is selected, returns a list, each item is a
list of CpGs in the contiguous co-methylated subregion.
data(betasChr22_df) CpGisland_ls <- readRDS( system.file( "extdata", "CpGislandsChr22_ex.rds", package = 'coMethDMR', mustWork = TRUE ) ) coMeth_ls <- CoMethAllRegions ( dnam = betasChr22_df, betaToM = TRUE, method = "pearson", CpGs_ls = CpGisland_ls, arrayType = "450k", returnAllCpGs = FALSE, output = "CpGs" )
data(betasChr22_df) CpGisland_ls <- readRDS( system.file( "extdata", "CpGislandsChr22_ex.rds", package = 'coMethDMR', mustWork = TRUE ) ) coMeth_ls <- CoMethAllRegions ( dnam = betasChr22_df, betaToM = TRUE, method = "pearson", CpGs_ls = CpGisland_ls, arrayType = "450k", returnAllCpGs = FALSE, output = "CpGs" )
Wrapper function to find contiguous and comethyalted sub-regions within a pre-defined genomic region
CoMethSingleRegion( CpGs_char, dnam, betaToM = TRUE, rDropThresh_num = 0.4, method = c("pearson", "spearman"), minCpGs = 3, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), manifest_gr = NULL, returnAllCpGs = FALSE )
CoMethSingleRegion( CpGs_char, dnam, betaToM = TRUE, rDropThresh_num = 0.4, method = c("pearson", "spearman"), minCpGs = 3, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), manifest_gr = NULL, returnAllCpGs = FALSE )
CpGs_char |
vector of CpGs in the inputting pre-defined genomic region. |
dnam |
matrix (or data frame) of beta values, with row names = CpG ids,
column names = sample ids. This should include the CpGs in |
betaToM |
indicates if converting methylation beta values mvalues |
rDropThresh_num |
threshold for min correlation between a cpg with sum of the rest of the CpGs |
method |
method for computing correlation, can be "pearson" or "spearman" |
minCpGs |
minimum number of CpGs to be considered a "region".
Only regions with more than |
genome |
Human genome of reference hg19 or hg38 |
arrayType |
Type of array, can be "450k" or "EPIC" |
manifest_gr |
A GRanges object with the genome manifest (as returned by
|
returnAllCpGs |
When there is not a contiguous comethylated region in
the inputing pre-defined region, |
A list with two components:
Contiguous_Regions
: a data frame with CpG
(CpG ID),
Chr
(chromosome number), MAPINFO
(genomic position),
r_drop
(correlation between the CpG with rest of the CpGs),
keep
(indicator for co-methylated CpG), keep_contiguous
(index for contiguous comethylated subregion)
CpGs_subregions
: lists of CpGs in each contiguous
co-methylated subregion
data(betasChr22_df) CpGsChr22_char <- c( "cg02953382", "cg12419862", "cg24565820", "cg04234412", "cg04824771", "cg09033563", "cg10150615", "cg18538332", "cg20007245", "cg23131131", "cg25703541" ) CoMethSingleRegion( CpGs_char = CpGsChr22_char, dnam = betasChr22_df ) data(betaMatrix_ex3) CpGsEx3_char <- c( "cg14221598", "cg02433884", "cg07372974", "cg13419809", "cg26856676", "cg25246745" ) CoMethSingleRegion( CpGs_char = CpGsEx3_char, dnam = t(betaMatrix_ex3), returnAllCpGs = TRUE )
data(betasChr22_df) CpGsChr22_char <- c( "cg02953382", "cg12419862", "cg24565820", "cg04234412", "cg04824771", "cg09033563", "cg10150615", "cg18538332", "cg20007245", "cg23131131", "cg25703541" ) CoMethSingleRegion( CpGs_char = CpGsChr22_char, dnam = betasChr22_df ) data(betaMatrix_ex3) CpGsEx3_char <- c( "cg14221598", "cg02433884", "cg07372974", "cg13419809", "cg26856676", "cg25246745" ) CoMethSingleRegion( CpGs_char = CpGsEx3_char, dnam = t(betaMatrix_ex3), returnAllCpGs = TRUE )
Test associations of individual CpGs in multiple genomic regions with a continuous phenotype
CpGsInfoAllRegions( AllRegionNames_char, allRegions_gr = NULL, betas_df, pheno_df, contPheno_char, covariates_char, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC") )
CpGsInfoAllRegions( AllRegionNames_char, allRegions_gr = NULL, betas_df, pheno_df, contPheno_char, covariates_char, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC") )
AllRegionNames_char |
vector of character strings with location info for
all the genomic regions. Each region should be specified in this format:
|
allRegions_gr |
An object of class |
betas_df |
data frame of beta values for all genomic regions, with row names = CpG IDs amd column names = sample IDs |
pheno_df |
a data frame with phenotype and covariate variables, with variable "Sample" for sample IDs. |
contPheno_char |
character string of the continuous phenotype to be tested against methylation values |
covariates_char |
character vector of covariate variables names |
genome |
human genome of reference hg19 (default) or hg38 |
arrayType |
Type of array, can be "450k" or "EPIC" |
a data frame with locations of the genomic region (Region), CpG ID (cpg), chromosome (chr), position (pos), results for testing association of methylation in individual CpGs with the continuous phenotype (slopeEstimate, slopePval), UCSC_RefGene_Name, UCSC_RefGene_Accession, and UCSC_RefGene_Group
data(betasChr22_df) data(pheno_df) AllRegionNames_char <- c( "chr22:18267969-18268249", "chr22:18531243-18531447" ) CpGsInfoAllRegions( AllRegionNames_char, betas_df = betasChr22_df, pheno_df = pheno_df, contPheno_char = "stage", covariates_char = c("age.brain", "sex") )
data(betasChr22_df) data(pheno_df) AllRegionNames_char <- c( "chr22:18267969-18268249", "chr22:18531243-18531447" ) CpGsInfoAllRegions( AllRegionNames_char, betas_df = betasChr22_df, pheno_df = pheno_df, contPheno_char = "stage", covariates_char = c("age.brain", "sex") )
Test associations of individual CpGs in a genomic region with a continuous phenotype
CpGsInfoOneRegion( regionName_char, region_gr = NULL, betas_df, pheno_df, contPheno_char, covariates_char = NULL, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), manifest_gr = NULL )
CpGsInfoOneRegion( regionName_char, region_gr = NULL, betas_df, pheno_df, contPheno_char, covariates_char = NULL, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), manifest_gr = NULL )
regionName_char |
character string of location information for a genomic
region, specified in the format of |
region_gr |
An object of class |
betas_df |
data frame of beta values with row names = CpG IDs, column names = sample IDs |
pheno_df |
a data frame with phenotype and covariate variables, with variable "Sample" for sample IDs. |
contPheno_char |
character string of the continuous phenotype to be tested against methylation values |
covariates_char |
character vector of covariate variables names |
genome |
human genome of reference hg19 (default) or hg38 |
arrayType |
Type of array, can be "450k" or "EPIC" |
manifest_gr |
A GRanges object with the genome manifest (as returned by
|
This function implements linear models that test association between methylation values in a genomic region with a continuous phenotype. Note that methylation M values are used as regression outcomes in these models. The model for each CpG is:
methylation M value ~ contPheno_char + covariates_char
a data frame with location of the genomic region (Region), CpG ID (cpg), chromosome (chr), position (pos), results for testing association of methylation in individual CpGs with continuous phenotype (slopeEstimate, slopePval) and annotations for the region.
data(betasChr22_df) data(pheno_df) myRegion_gr <- RegionsToRanges("chr22:18267969-18268249") CpGsInfoOneRegion( region_gr = myRegion_gr, betas_df = betasChr22_df, pheno_df = pheno_df, contPheno_char = "stage", covariates_char = c("age.brain", "sex"), arrayType = "450k" )
data(betasChr22_df) data(pheno_df) myRegion_gr <- RegionsToRanges("chr22:18267969-18268249") CpGsInfoOneRegion( region_gr = myRegion_gr, betas_df = betasChr22_df, pheno_df = pheno_df, contPheno_char = "stage", covariates_char = c("age.brain", "sex"), arrayType = "450k" )
This function is an operating-system agnostic wrapper for the
SnowParam
and
MulticoreParam
constructor functions.
CreateParallelWorkers(nCores, ...)
CreateParallelWorkers(nCores, ...)
nCores |
The number of computing cores to make available for coMethDMR computation |
... |
Additional arguments passed to the cluster constructors. |
This function checks the operating system and then creates a
cluster of workers using the SnowParam
function for Windows machines
and the MulticoreParam
function for non-Windows machines.
A parameter class for use in parallel evaluation
workers_cl <- CreateParallelWorkers(nCores = 4)
workers_cl <- CreateParallelWorkers(nCores = 4)
Computes leave-one-out correlations (rDrop) for each CpG
CreateRdrop(data, method = c("pearson", "spearman"), use = "complete.obs")
CreateRdrop(data, method = c("pearson", "spearman"), use = "complete.obs")
data |
a dataframe with rownames = sample IDs, column names = CpG IDs. |
method |
method for computing correlation, can be "pearson" or
"spearman", and is passed to the |
use |
method for handling missing values when calculating the
correlation. Defaults to |
An outlier CpG in a genomic region will typically have low
correlation with the rest of the CpGs in a genomic region. On the other
hand, in a cluster of co-methylated CpGs, we expect each CpG to have high
correlation with the rest of the CpGs. The r.drop
statistic is used
to identify these co-methylated CpGs here.
A data frame with the following columns:
CpG
: CpG ID
r_drop
: The correlation between each CpG with the sum of
the rest of the CpGs
data(betaMatrix_ex1) CreateRdrop(data = betaMatrix_ex1, method = "pearson")
data(betaMatrix_ex1) CreateRdrop(data = betaMatrix_ex1, method = "pearson")
Find contiguous comethylated regions based on output file from
function MarkComethylatedCpGs
FindComethylatedRegions(CpGs_df, minCpGs_int = 3)
FindComethylatedRegions(CpGs_df, minCpGs_int = 3)
CpGs_df |
an output dataframe from function |
minCpGs_int |
an integer indicating the minimum number of CpGs for output genomic regions |
A data frame with variables ProbeID
and Subregion
(index for each output contiguous comethylated region)
data(betaMatrix_ex4) CpGs_df <- MarkComethylatedCpGs(betaCluster_mat = betaMatrix_ex4) FindComethylatedRegions(CpGs_df)
data(betaMatrix_ex4) CpGs_df <- MarkComethylatedCpGs(betaCluster_mat = betaMatrix_ex4) FindComethylatedRegions(CpGs_df)
Extract probe IDs for CpGs located in a genomic region
GetCpGsInRegion( regionName_char, region_gr = NULL, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), manifest_gr = NULL, ignoreStrand = TRUE )
GetCpGsInRegion( regionName_char, region_gr = NULL, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), manifest_gr = NULL, ignoreStrand = TRUE )
regionName_char |
character string with location information for one
region in the format |
region_gr |
An object of class |
genome |
human genome of reference hg19 (default) or hg38 |
arrayType |
Type of array, 450k or EPIC |
manifest_gr |
A GRanges object with the genome manifest (as returned by
|
ignoreStrand |
Whether strand can be ignored, default is TRUE |
vector of CpG probe IDs mapped to the genomic region
myRegion_gr <- RegionsToRanges("chr22:18267969-18268249") GetCpGsInRegion( region_gr = myRegion_gr, genome = "hg19", arrayType = "450k", ignoreStrand = TRUE )
myRegion_gr <- RegionsToRanges("chr22:18267969-18268249") GetCpGsInRegion( region_gr = myRegion_gr, genome = "hg19", arrayType = "450k", ignoreStrand = TRUE )
Remove covariate effects from methylayion values by fitting probe-specific linear models
GetResiduals( dnam, betaToM = TRUE, epsilon = 1e-08, pheno_df, covariates_char, nCores_int = 1L, ... )
GetResiduals( dnam, betaToM = TRUE, epsilon = 1e-08, pheno_df, covariates_char, nCores_int = 1L, ... )
dnam |
data frame or matrix of methylation values with row names = CpG IDs and column names = sample IDs. This is often the genome-wide array data. |
betaToM |
indicates if methylation beta values (ranging from [0, 1])
should be converted to M values (ranging from (-Inf, Inf)). Note that if
beta values are the input to |
epsilon |
When transforming beta values to M values, what should be done
to values exactly equal to 0 or 1? The M value transformation would yield
|
pheno_df |
a data frame with phenotype and covariates, with variable
|
covariates_char |
character vector for names of the covariate variables |
nCores_int |
Number of computing cores to be used when executing code in parallel. Defaults to 1 (serial computing). |
... |
Dots for additional arguments passed to the cluster constructor.
See |
This function fits an ordinary linear model predicting methylation values for each probe from the specified covariates. This process will be useful in scenarios where methylation values in a region or at an individual probe are known a priori to have differential methylation independent of the disease or condition of interest.
output a matrix of residual values in the same dimension as
dnam
data(betasChr22_df) data(pheno_df) GetResiduals( dnam = betasChr22_df[1:10, 1:10], betaToM = TRUE, pheno_df = pheno_df, covariates_char = c("age.brain", "sex", "slide") )
data(betasChr22_df) data(pheno_df) GetResiduals( dnam = betasChr22_df[1:10, 1:10], betaToM = TRUE, pheno_df = pheno_df, covariates_char = c("age.brain", "sex", "slide") )
Load either the HM540 and EPIC manifests into working memory
ImportSesameData(manifest_char)
ImportSesameData(manifest_char)
manifest_char |
Which manifest should be loaded? Currently, this package has been tested to work with 450k and EPIC arrays for HG19 and HG38. |
This function assumes that the .onLoad() function has executed properly and (therefore) that the necessary data sets are in the cache.
hm450k_gr <- ImportSesameData("HM450.hg19.manifest") head(hm450k_gr)
hm450k_gr <- ImportSesameData("HM450.hg19.manifest") head(hm450k_gr)
Fit mixed model to methylation values in one genomic region
lmmTest( betaOne_df, pheno_df, contPheno_char, covariates_char, modelType = c("randCoef", "simple"), genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), manifest_gr = NULL, ignoreStrand = TRUE, outLogFile = NULL )
lmmTest( betaOne_df, pheno_df, contPheno_char, covariates_char, modelType = c("randCoef", "simple"), genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), manifest_gr = NULL, ignoreStrand = TRUE, outLogFile = NULL )
betaOne_df |
matrix of beta values for one genomic region, with row names = CpG IDs and column names = sample IDs |
pheno_df |
a data frame with phenotype and covariates, with variable
|
contPheno_char |
character string of the main effect (a continuous phenotype) to be tested for association with methylation values in the region |
covariates_char |
character vector for names of the covariate variables |
modelType |
type of mixed model: can be |
genome |
Human genome of reference: hg19 or hg38 |
arrayType |
Type of array: "450k" or "EPIC" |
manifest_gr |
A GRanges object with the genome manifest (as returned by
|
ignoreStrand |
Whether strand can be ignored, default is TRUE |
outLogFile |
Name of log file for messages of mixed model analysis |
This function implements a mixed model to test association between methylation M values in a genomic region with a continuous phenotype. In our simulation studies, we found both models shown below are conservative, so p-values are estimated from normal distributions instead of Student's t distributions.
When modelType = "randCoef"
, the model is:
M ~ contPheno_char + covariates_char + (1|Sample) + (contPheno_char|CpG)
.
The last term specifies random intercept and slope for each CpG. When
modelType = "simple"
, the model is
M ~ contPheno_char + covariates_char + (1|Sample)
.
A dataframe with one row for association result of one region and
the following columns: Estimate
, StdErr
, and pvalue
showing the association of methylation values in the genomic region tested
with the continuous phenotype supplied in contPheno_char
data(betasChr22_df) CpGsChr22_char <- c( "cg02953382", "cg12419862", "cg24565820", "cg04234412", "cg04824771", "cg09033563", "cg10150615", "cg18538332", "cg20007245", "cg23131131", "cg25703541" ) coMethCpGs <- CoMethSingleRegion(CpGsChr22_char, betasChr22_df) # test only the first co-methylated region coMethBeta_df <- betasChr22_df[coMethCpGs$CpGsSubregions[[1]], ] data(pheno_df) res <- lmmTest( betaOne_df = coMethBeta_df, pheno_df, contPheno_char = "stage", covariates_char = c("age.brain", "sex"), modelType = "randCoef", arrayType = "450k", ignoreStrand = TRUE )
data(betasChr22_df) CpGsChr22_char <- c( "cg02953382", "cg12419862", "cg24565820", "cg04234412", "cg04824771", "cg09033563", "cg10150615", "cg18538332", "cg20007245", "cg23131131", "cg25703541" ) coMethCpGs <- CoMethSingleRegion(CpGsChr22_char, betasChr22_df) # test only the first co-methylated region coMethBeta_df <- betasChr22_df[coMethCpGs$CpGsSubregions[[1]], ] data(pheno_df) res <- lmmTest( betaOne_df = coMethBeta_df, pheno_df, contPheno_char = "stage", covariates_char = c("age.brain", "sex"), modelType = "randCoef", arrayType = "450k", ignoreStrand = TRUE )
Fit mixed model to test association between a continuous phenotype and methylation values in a list of genomic regions
lmmTestAllRegions( betas, region_ls, pheno_df, contPheno_char, covariates_char, modelType = c("randCoef", "simple"), genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), ignoreStrand = TRUE, outFile = NULL, outLogFile = NULL, nCores_int = 1L, ... )
lmmTestAllRegions( betas, region_ls, pheno_df, contPheno_char, covariates_char, modelType = c("randCoef", "simple"), genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), ignoreStrand = TRUE, outFile = NULL, outLogFile = NULL, nCores_int = 1L, ... )
betas |
data frame or matrix of beta values for all genomic regions, with row names = CpG IDs and column names = sample IDs. This is often the genome-wide array data. |
region_ls |
a list of genomic regions; each item is a vector of CpG IDs
within a genomic region. The co-methylated regions can be obtained by
function |
pheno_df |
a data frame with phenotype and covariates, with variable
|
contPheno_char |
character string of the main effect (a continuous phenotype) to be tested for association with methylation values in each region |
covariates_char |
character vector for names of the covariate variables |
modelType |
type of mixed model; can be |
genome |
Human genome of reference: hg19 or hg38 |
arrayType |
Type of array: "450k" or "EPIC" |
ignoreStrand |
Whether strand can be ignored, default is TRUE |
outFile |
output .csv file with the results for the mixed model analysis |
outLogFile |
log file for mixed models analysis messages |
nCores_int |
Number of computing cores to be used when executing code in parallel. Defaults to 1 (serial computing). |
... |
Dots for additional arguments passed to the cluster constructor.
See |
This function implements a mixed model to test association between methylation M values in a genomic region with a continuous phenotype. In our simulation studies, we found both models shown below are conservative, so p-values are estimated from normal distributions instead of Student's t distributions.
When modelType = "randCoef"
, the model is:
M ~ contPheno_char + covariates_char + (1|Sample) + (contPheno_char|CpG)
.
The last term specifies random intercept and slope for each CpG. When
modelType = "simple"
, the model is
M ~ contPheno_char + covariates_char + (1|Sample)
.
For the results of mixed models, note that if the mixed model failed to converge, p-value for mixed model is set to 1. Also, if the mixed model is singular, at least one of the estimated variance components for intercepts or slopes random effects is 0, because there isn't enough variability in the data to estimate the random effects. In this case, the mixed model reduces to a fixed effects model. The p-values for these regions are still valid.
If outFile
is NULL
, this function returns a data frame
as described below. If outFile
is specified, this function writes a
data frame in .csv format with the following information to the disk:
location of the genomic region (chrom, start, end
), number of CpGs
(nCpGs
), Estimate
, Standard error (StdErr
) of the test
statistic, p-value and False Discovery Rate (FDR) for association between
methylation values in each genomic region with phenotype (pValue
).
If outLogFile
is specified, this function also writes a .txt file
that includes messages for mixed model fitting to the disk.
data(betasChr22_df) data(pheno_df) CpGisland_ls <- readRDS( system.file( "extdata", "CpGislandsChr22_ex.rds", package = 'coMethDMR', mustWork = TRUE ) ) coMeth_ls <- CoMethAllRegions( dnam = betasChr22_df, betaToM = TRUE, CpGs_ls = CpGisland_ls, arrayType = "450k", rDropThresh_num = 0.4, returnAllCpGs = FALSE ) results_df <- lmmTestAllRegions( betas = betasChr22_df, region_ls = coMeth_ls, pheno_df = pheno_df, contPheno_char = "stage", covariates_char = "age.brain", modelType = "randCoef", arrayType = "450k", ignoreStrand = TRUE, # generates a log file in the current directory # outLogFile = paste0("lmmLog_", Sys.Date(), ".txt") )
data(betasChr22_df) data(pheno_df) CpGisland_ls <- readRDS( system.file( "extdata", "CpGislandsChr22_ex.rds", package = 'coMethDMR', mustWork = TRUE ) ) coMeth_ls <- CoMethAllRegions( dnam = betasChr22_df, betaToM = TRUE, CpGs_ls = CpGisland_ls, arrayType = "450k", rDropThresh_num = 0.4, returnAllCpGs = FALSE ) results_df <- lmmTestAllRegions( betas = betasChr22_df, region_ls = coMeth_ls, pheno_df = pheno_df, contPheno_char = "stage", covariates_char = "age.brain", modelType = "randCoef", arrayType = "450k", ignoreStrand = TRUE, # generates a log file in the current directory # outLogFile = paste0("lmmLog_", Sys.Date(), ".txt") )
Mark CpGs in contiguous and co-methylated region
MarkComethylatedCpGs( betaCluster_mat, betaToM = TRUE, epsilon = 1e-08, rDropThresh_num = 0.4, method = c("pearson", "spearman"), use = "complete.obs" )
MarkComethylatedCpGs( betaCluster_mat, betaToM = TRUE, epsilon = 1e-08, rDropThresh_num = 0.4, method = c("pearson", "spearman"), use = "complete.obs" )
betaCluster_mat |
matrix of beta values, with rownames = sample ids and
column names = CpG ids. Note that the CpGs need to be ordered by their
genomic positions, this can be accomplished by the
|
betaToM |
indicates if beta values should be converted to M values before computing correlations. Defaults to TRUE. |
epsilon |
When transforming beta values to M values, what should be done
to values exactly equal to 0 or 1? The M value transformation would yield
|
rDropThresh_num |
threshold for minimum correlation between a cpg with the rest of the CpGs. Defaults to 0.4. |
method |
correlation method; can be "pearson" or "spearman" |
use |
method for handling missing values when calculating the
correlation. Defaults to |
An outlier CpG in a genomic region will typically have low
correlation with the rest of the CpGs in a genomic region. On the other
hand, in a cluster of co-methylated CpGs, we expect each CpG to have high
correlation with the rest of the CpGs. The r.drop
statistic is used
to identify these co-methylated CpGs here.
A data frame with the following columns:
CpG
: CpG ID
keep
: The CpGs with keep = 1
belong to the
contiguous and co-methylated region
ind
: Index for the CpGs
r_drop
: The correlation between each CpG with the sum of
the rest of the CpGs
data(betaMatrix_ex1) MarkComethylatedCpGs( betaCluster_mat = betaMatrix_ex1, betaToM = FALSE, method = "pearson" )
data(betaMatrix_ex1) MarkComethylatedCpGs( betaCluster_mat = betaMatrix_ex1, betaToM = FALSE, method = "pearson" )
Return Column and Row Names of Samples and Probes under the Missingness Theshold
MarkMissing(dnaM_df, sampMissing_p = 0.5, probeMissing_p = 0.25)
MarkMissing(dnaM_df, sampMissing_p = 0.5, probeMissing_p = 0.25)
dnaM_df |
A data frame of DNA methylation values. Samples are columns. Row names are probe IDs. |
sampMissing_p |
The maximum proportion of missingness allowed in a sample. Defaults to 50%. |
probeMissing_p |
The maximum proportion of missingness allowed in a probe. Defaults to 25%. |
Before calculating the missing proportion of samples, probes with missingness greater than the threshold are dropped first.
A list of four entries:
dropSamples
: the column names of samples with more than
sampMissing_p
percent missing values
keepSamples
: the column names of samples with less than or
equal to sampMissing_p
percent missing values
dropProbes
: the row names of probes with more than
probeMissing_p
percent missing values
keepProbes
: the row names of probes with less than or equal
to probeMissing_p
percent missing values
### Setup ### values_num <- c( 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, NA, 0.1, 0.1, 0.1, 0.1, NA, 0.1, 0.1, 0.1, NA, NA, 0.1, 0.1, 0.1, NA, NA, 0.1, 0.1, NA, NA, NA, 0.1, 0.1, NA, NA, NA, 0.1, NA, NA, NA, NA, NA, NA, NA, NA, NA ) values_mat <- matrix(values_num, nrow = 9, ncol = 5, byrow = TRUE) rownames(values_mat) <- paste0("probe_0", 1:9) colnames(values_mat) <- paste0("sample_0", 1:5) values_df <- as.data.frame(values_mat) ### Simple Calculations ### MarkMissing(values_df) MarkMissing(values_df, probeMissing_p = 0.5) MarkMissing(values_df, sampMissing_p = 0.25) ### Using the Output ### mark_ls <- MarkMissing(values_df, probeMissing_p = 0.5) valuesPurged_df <- values_df[ mark_ls$keepProbes, mark_ls$keepSamples ] valuesPurged_df
### Setup ### values_num <- c( 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, NA, 0.1, 0.1, 0.1, 0.1, NA, 0.1, 0.1, 0.1, NA, NA, 0.1, 0.1, 0.1, NA, NA, 0.1, 0.1, NA, NA, NA, 0.1, 0.1, NA, NA, NA, 0.1, NA, NA, NA, NA, NA, NA, NA, NA, NA ) values_mat <- matrix(values_num, nrow = 9, ncol = 5, byrow = TRUE) rownames(values_mat) <- paste0("probe_0", 1:9) colnames(values_mat) <- paste0("sample_0", 1:5) values_df <- as.data.frame(values_mat) ### Simple Calculations ### MarkMissing(values_df) MarkMissing(values_df, probeMissing_p = 0.5) MarkMissing(values_df, sampMissing_p = 0.25) ### Using the Output ### mark_ls <- MarkMissing(values_df, probeMissing_p = 0.5) valuesPurged_df <- values_df[ mark_ls$keepProbes, mark_ls$keepSamples ] valuesPurged_df
Name a region with several CpGs based on its genomic location
NameRegion(CpGsOrdered_df)
NameRegion(CpGsOrdered_df)
CpGsOrdered_df |
dataframe with columns for Probe IDs as character (cpg), chromosome number as character (chr), and genomic location as integer (pos) |
genome location of the CpGs formatted as "chrxx:xxxxxx-xxxxxx"
# Consider four probe IDs: CpGs_char <- c("cg04677227", "cg07146435", "cg11632906", "cg20214853") # After querying these four probes against an EPIC array via the # OrderCpGsByLocation() function, we get the following data frame: CpGsOrdered_df <- data.frame( chr = c("chr10", "chr10", "chr10", "chr10"), pos = c(100028236L, 100028320L, 100028468L, 100028499L), cpg = c("cg20214853", "cg04677227", "cg11632906", "cg07146435"), stringsAsFactors = FALSE ) # Now, we can name the region that contains these four probes: NameRegion(CpGsOrdered_df)
# Consider four probe IDs: CpGs_char <- c("cg04677227", "cg07146435", "cg11632906", "cg20214853") # After querying these four probes against an EPIC array via the # OrderCpGsByLocation() function, we get the following data frame: CpGsOrdered_df <- data.frame( chr = c("chr10", "chr10", "chr10", "chr10"), pos = c(100028236L, 100028320L, 100028468L, 100028499L), cpg = c("cg20214853", "cg04677227", "cg11632906", "cg07146435"), stringsAsFactors = FALSE ) # Now, we can name the region that contains these four probes: NameRegion(CpGsOrdered_df)
Order CpGs by genomic location
OrderCpGsByLocation( CpGs_char, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), manifest_gr = NULL, ignoreStrand = TRUE, output = c("vector", "dataframe") )
OrderCpGsByLocation( CpGs_char, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), manifest_gr = NULL, ignoreStrand = TRUE, output = c("vector", "dataframe") )
CpGs_char |
vector of CpGs |
genome |
Human genome of reference: hg19 or hg38 |
arrayType |
Type of array: 450k or EPIC |
manifest_gr |
A GRanges object with the genome manifest (as returned by
|
ignoreStrand |
Whether strand can be ignored, default is TRUE |
output |
vector of CpGs or dataframe with CpGs, CHR, MAPINFO |
vector of CpGs ordered by location or dataframe with CpGs ordered by location (cpg), chromosome (chr), position (pos)
CpGs_char <- c("cg04677227", "cg07146435", "cg11632906", "cg20214853") OrderCpGsByLocation( CpGs_char, genome = "hg19", arrayType = "450k", ignoreStrand = TRUE, output = "dataframe" )
CpGs_char <- c("cg04677227", "cg07146435", "cg11632906", "cg20214853") OrderCpGsByLocation( CpGs_char, genome = "hg19", arrayType = "450k", ignoreStrand = TRUE, output = "dataframe" )
Subset of phenotype information for Alzheimer's methylation dataset.
data("pheno_df")
data("pheno_df")
A data frame containing variables for Braak stage (stage), subject.id, Batch (slide), Sex, Sample, age of brain sample (age.brain)
GEO accession: GSE59685
Convert genomic regions in a data frame to GRanges format
RegionsToRanges(regionName_char)
RegionsToRanges(regionName_char)
regionName_char |
a character vector of regions in the format
|
genomic regions in GRanges format
regions <- c("chr22:19709548-19709755", "chr2:241721922-241722113") RegionsToRanges(regions)
regions <- c("chr22:19709548-19709755", "chr2:241721922-241722113") RegionsToRanges(regions)
Extract clusters of CpG probes located closely
WriteCloseByAllRegions( fileName, regions, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), ignoreStrand = TRUE, maxGap = 200, minCpGs = 3, ... )
WriteCloseByAllRegions( fileName, regions, genome = c("hg19", "hg38"), arrayType = c("450k", "EPIC"), ignoreStrand = TRUE, maxGap = 200, minCpGs = 3, ... )
fileName |
Name of the RDS file where the output genomic regions will be saved. |
regions |
GRanges of input genomic regions |
genome |
Human genome of reference: hg19 or hg38 |
arrayType |
Type of array: "450k" or "EPIC" |
ignoreStrand |
Whether strand can be ignored, default is TRUE |
maxGap |
an integer, genomic locations within maxGap from each other are placed into the same cluster |
minCpGs |
an integer, minimum number of CpGs for each resulting region |
... |
Dots for internal arguments. Currently unused. |
For maxGap
= 200 and minCpGs
= 3, we have already
calculated the clusters of CpGs. They are saved in folder
/inst/extdata/
.
Nothing. Instead, file with the genomic regions containing CpGs located closely within each inputting pre-defined genomic region will be written to the disk
regions <- GenomicRanges::GRanges( seqnames = c("chr4", "chr6", "chr16", "chr16", "chr22", "chr19"), ranges = c( "174202697-174203520", "28226203-28227482", "89572934-89574634", "67232460-67234167", "38244199-38245362", "39402823-39403373" ) ) # Uncomment out the example code below: # WriteCloseByAllRegions( # regions = regions, # arrayType = "EPIC", # maxGap = 50, # minCpGs = 3, # fileName = "closeByRegions.rds" # )
regions <- GenomicRanges::GRanges( seqnames = c("chr4", "chr6", "chr16", "chr16", "chr22", "chr19"), ranges = c( "174202697-174203520", "28226203-28227482", "89572934-89574634", "67232460-67234167", "38244199-38245362", "39402823-39403373" ) ) # Uncomment out the example code below: # WriteCloseByAllRegions( # regions = regions, # arrayType = "EPIC", # maxGap = 50, # minCpGs = 3, # fileName = "closeByRegions.rds" # )