Title: | Analysing Illumina HumanMethylation BeadChip Data |
---|---|
Description: | Normalisation, testing for differential variability and differential methylation and gene set testing for data from Illumina's Infinium HumanMethylation arrays. The normalisation procedure is subset-quantile within-array normalisation (SWAN), which allows Infinium I and II type probes on a single array to be normalised together. The test for differential variability is based on an empirical Bayes version of Levene's test. Differential methylation testing is performed using RUV, which can adjust for systematic errors of unknown origin in high-dimensional data by using negative control probes. Gene ontology analysis is performed by taking into account the number of probes per gene on the array, as well as taking into account multi-gene associated probes. |
Authors: | Belinda Phipson and Jovana Maksimovic |
Maintainer: | Belinda Phipson <[email protected]>, Jovana Maksimovic <[email protected]>, Andrew Lonsdale <[email protected]>, Calandra Grima <[email protected]> |
License: | GPL-2 |
Version: | 1.41.0 |
Built: | 2024-10-30 09:20:03 UTC |
Source: | https://github.com/bioc/missMethyl |
missMethyl is a library for the analysis of Illumina's 450K human methylation BeadChip. Specifically, functions for SWAN normalisation and differential variability analysis are provided. SWAN normalisation uses probe specific information, and the differential variability procedure uses linear models which can handle any designed experiment.
Package: | missMethyl |
Type: | Package |
Version: | 0.99.0 |
Date: | 2014-06-30 |
License: | GPL-2 |
Normalisation of the 450K arrays can be performed using the function
SWAN
.
Differential variability analysis can be performed by calling varFit
followed by topVar
for a list of the top ranked differentially
variable CpGs between conditions.
More detailed help documentation is provided in each function's help page.
Belinda Phipson and Jovana Maksimovic
Maintainer: Belinda Phipson <[email protected]>, Jovana Maksimovic <[email protected]>
Maksimovic, J., Gordon, L., Oshlack, A. (2012). SWAN: Subset-quantile within array normalization for illumina infinium HumanMethylation450 BeadChips. Genome Biology, 13:R44.
Phipson, B., and Oshlack, A. (2014). DiffVar: A new method for detecting differential variability with application to methylation in cancer and aging. Genome Biology, 15:465.
Compute estimated coefficients, standard errors and LogVarRatios for a given set of contrasts.
contrasts.varFit(fit, contrasts = NULL)
contrasts.varFit(fit, contrasts = NULL)
fit |
List containing a linear model fit produced by |
contrasts |
Numeric matrix with rows corresponding to coefficients in
|
This function calls the contrasts.fit
function in limma
to
compute coefficients and standard errors for the specified contrasts
corresponding to a linear model fit obtained from the varFit
function. LogVarRatios are also computed in terms of the contrasts. A
contrasts matrix can be computed using the makeContrasts
function.
A list object of the same class as fit
.
Belinda Phipson
varFit
, contrasts.fit
, makeContrasts
# Randomly generate data for a 3 group problem with 100 CpG sites and 4 # arrays in each group. library(limma) y<-matrix(rnorm(1200),ncol=12) group<-factor(rep(c(1,2,3),each=4)) design<-model.matrix(~0+group) colnames(design)<-c("grp1","grp2","grp3") design # Fit linear model for differential variability # Please always specify the coef parameter in the call to varFit vfit<-varFit(y,design,coef=c(1,2,3)) # Specify contrasts contr<-makeContrasts(grp2-grp1,grp3-grp1,grp3-grp2,levels=colnames(design)) # Compute contrasts from fit object vfit.contr<-contrasts.varFit(vfit,contrasts=contr) summary(decideTests(vfit.contr)) # Look at top table of results for first contrast topVar(vfit.contr,coef=1)
# Randomly generate data for a 3 group problem with 100 CpG sites and 4 # arrays in each group. library(limma) y<-matrix(rnorm(1200),ncol=12) group<-factor(rep(c(1,2,3),each=4)) design<-model.matrix(~0+group) colnames(design)<-c("grp1","grp2","grp3") design # Fit linear model for differential variability # Please always specify the coef parameter in the call to varFit vfit<-varFit(y,design,coef=c(1,2,3)) # Specify contrasts contr<-makeContrasts(grp2-grp1,grp3-grp1,grp3-grp2,levels=colnames(design)) # Compute contrasts from fit object vfit.contr<-contrasts.varFit(vfit,contrasts=contr) summary(decideTests(vfit.contr)) # Look at top table of results for first contrast topVar(vfit.contr,coef=1)
Plot the overall density distribution of beta values and the density distributions of the Infinium I and II probe types.
densityByProbeType( data, legendPos = "top", colors = c("black", "red", "blue"), main = "", lwd = 3, cex.legend = 1 )
densityByProbeType( data, legendPos = "top", colors = c("black", "red", "blue"), main = "", lwd = 3, cex.legend = 1 )
data |
A |
legendPos |
The x and y co-ordinates to be used to position the legend.
They can be specified by keyword or in any way which is accepted by
|
colors |
Colors to be used for the different beta value density distributions. Must be a vector of length 3. |
main |
Plot title. |
lwd |
The line width to be used for the different beta value density distributions. |
cex.legend |
The character expansion factor for the legend text. |
The density distribution of the beta values for a single sample is plotted.
The density distributions of the Infinium I and II probes are then plotted
individually, showing how they contribute to the overall distribution. This
is useful for visualising how using SWAN
affects the data.
No return value. Plot is produced as a side-effect.
Jovana Maksimovic
densityPlot
, densityBeanPlot
,
par
, legend
if (require(minfi) & require(minfiData)) { dat <- preprocessRaw(RGsetEx) datSwan <- SWAN(dat) par(mfrow=c(1,2)) densityByProbeType(dat[,1], main="Raw") densityByProbeType(datSwan[,1], main="SWAN") }
if (require(minfi) & require(minfiData)) { dat <- preprocessRaw(RGsetEx) datSwan <- SWAN(dat) par(mfrow=c(1,2)) densityByProbeType(dat[,1], main="Raw") densityByProbeType(datSwan[,1], main="SWAN") }
Extract values adjusted for unwanted variation by RUVm.
getAdj(Y, fit)
getAdj(Y, fit)
Y |
A matrix of M-values. |
fit |
The list |
This function extracts values adjusted for unwanted variation by RUVm. These values are ONLY intendeded to be used for visualisation purposes. It is NOT recommended that they are used for any further analysis.
An matrix of M-values.
Jovana Maksimovic
if(require(minfi) & require(minfiData) & require(limma)) { # Get methylation data for a 2 group comparison meth <- getMeth(MsetEx) unmeth <- getUnmeth(MsetEx) Mval <- log2((meth + 100)/(unmeth + 100)) group <- factor(pData(MsetEx)$Sample_Group, labels=c(0,1)) design <- model.matrix(~group) # Perform initial analysis to empirically identify negative control features # when not known a priori lFit <- lmFit(Mval,design) lFit2 <- eBayes(lFit) lTop <- topTable(lFit2,coef=2,num=Inf) # The negative control features should *not* be associated with factor of # interest but *should* be affected by unwanted variation ctl <- rownames(Mval) %in% rownames(lTop[lTop$adj.P.Val > 0.5,]) # Perform RUV adjustment and fit fit <- RUVfit(Y=Mval, X=group, ctl=ctl) fit2 <- RUVadj(Y=Mval, fit=fit) # get adjusted values Madj <- getAdj(Y=Mval,fit=fit) }
if(require(minfi) & require(minfiData) & require(limma)) { # Get methylation data for a 2 group comparison meth <- getMeth(MsetEx) unmeth <- getUnmeth(MsetEx) Mval <- log2((meth + 100)/(unmeth + 100)) group <- factor(pData(MsetEx)$Sample_Group, labels=c(0,1)) design <- model.matrix(~group) # Perform initial analysis to empirically identify negative control features # when not known a priori lFit <- lmFit(Mval,design) lFit2 <- eBayes(lFit) lTop <- topTable(lFit2,coef=2,num=Inf) # The negative control features should *not* be associated with factor of # interest but *should* be affected by unwanted variation ctl <- rownames(Mval) %in% rownames(lTop[lTop$adj.P.Val > 0.5,]) # Perform RUV adjustment and fit fit <- RUVfit(Y=Mval, X=group, ctl=ctl) fit2 <- RUVadj(Y=Mval, fit=fit) # get adjusted values Madj <- getAdj(Y=Mval,fit=fit) }
Extracts the intensity data for the Illumina negative controls found on 450k or EPIC arrays and returns a matrix of M-values (log2 ratio of the green to red intensities).
getINCs(rgSet)
getINCs(rgSet)
rgSet |
An object of class |
The getINCs function extracts the intensity data for the INCs from an RGChannelSet object. The function retrieves both the green and red channel intensity values and returns the data as the log2 ratio of the green to red intensities. Essentially, the INCs are being treated like Type II probes for which the M-values are also given as the log2 ratio of the green to red intensities.
An matrix of M-values.
Jovana Maksimovic
if (require(minfi) & require(minfiData)) { INCs <- getINCs(RGsetEx) head(INCs) dim(INCs) }
if (require(minfi) & require(minfiData)) { INCs <- getINCs(RGsetEx) head(INCs) dim(INCs) }
Obtain absolute or squared Levene residuals for each CpG given a series of methylation arrays
getLeveneResiduals(data, design = NULL, type = NULL)
getLeveneResiduals(data, design = NULL, type = NULL)
data |
Object of class |
design |
The design matrix of the experiment, with rows corresponding to arrays/samples and columns to coefficients to be estimated. Defaults to the unit vector. |
type |
Character string, |
This function will return absolute or squared Levene residuals given a matrix of M values and a design matrix. This can be used for graphing purposes or for downstream analysis such a gene set testing based on differential variability rather than differential methylation. If no design matrix is given, the residuals are determined by treating all samples as coming from one group.
Returns a list with three components. data
contains a matrix
of absolute or squared residuals, AvgVar
is a vector of sample
variances and LogVarRatio
corresponds to the columns of the design
matrix and is usually the ratios of the log of the group variances.
Belinda Phipson
Phipson, B., and Oshlack, A. (2014). A method for detecting differential variability in methylation data shows CpG islands are highly variably methylated in cancers. Genome Biology, 15:465.
# Randomly generate data for a 2 group problem with 100 CpG sites and 5 # arrays in each group y <- matrix(rnorm(1000),ncol=10) group <- factor(rep(c(1,2),each=5)) design <- model.matrix(~group) # Get absolute Levene Residuals resid <- getLeveneResiduals(y,design) # Plot the first CpG barplot(resid$data[1,],col=rep(c(2,4),each=5), ylab="Absolute Levene Residuals",names=group)
# Randomly generate data for a 2 group problem with 100 CpG sites and 5 # arrays in each group y <- matrix(rnorm(1000),ncol=10) group <- factor(rep(c(1,2),each=5)) design <- model.matrix(~group) # Get absolute Levene Residuals resid <- getLeveneResiduals(y,design) # Plot the first CpG barplot(resid$data[1,],col=rep(c(2,4),each=5), ylab="Absolute Levene Residuals",names=group)
Given a set of CpG probe names and optionally all the CpG sites tested, this function outputs a list containing the mapped Entrez Gene IDs as well as the numbers of probes per gene, and a vector indicating significance.
getMappedEntrezIDs( sig.cpg, all.cpg = NULL, array.type = c("450K", "EPIC", "EPIC_V2"), anno = NULL, genomic.features = c("ALL", "TSS200", "TSS1500", "Body", "1stExon", "3'UTR", "5'UTR", "ExonBnd") )
getMappedEntrezIDs( sig.cpg, all.cpg = NULL, array.type = c("450K", "EPIC", "EPIC_V2"), anno = NULL, genomic.features = c("ALL", "TSS200", "TSS1500", "Body", "1stExon", "3'UTR", "5'UTR", "ExonBnd") )
sig.cpg |
Character vector of significant CpG sites used for testing gene set enrichment. |
all.cpg |
Character vector of all CpG sites tested. Defaults to all CpG sites on the array. |
array.type |
The Illumina methylation array used. Options are "450K", "EPIC" or "EPIC_V2". |
anno |
Optional. A |
genomic.features |
Character vector or scalar indicating whether the gene set enrichment analysis should be restricted to CpGs from specific genomic locations. Options are "ALL", "TSS200","TSS1500","Body","1stExon", "3'UTR","5'UTR","ExonBnd"; and the user can select any combination. Defaults to "ALL". |
This function is used by the gene set testing functions gometh
and
gsameth
. It maps the significant CpG probe names to Entrez Gene IDs,
as well as all the CpG sites tested. It also calculates the numbers of
probes for gene. Input CpGs are able to be restricted by genomic features
using the genomic.features
argument.
Genes associated with each CpG site are obtained from the annotation package
IlluminaHumanMethylation450kanno.ilmn12.hg19
if the array type is
"450K". For the EPIC array, the annotation package
IlluminaHumanMethylationEPICanno.ilm10b4.hg19
is used. For the EPIC v2
array, the annotation package
IlluminaHumanMethylationEPICv2anno.20a1.hg38
is used. To use a
different annotation package, please supply it using the anno
argument.
A list with the following elements
sig.eg |
mapped Entrez Gene IDs for the significant probes |
universe |
mapped Entrez Gene IDs for all probes on the array, or for all the CpG probes tested. |
freq |
table output with numbers of probes associated with each gene |
equiv |
table output with equivalent numbers of probes associated with each gene taking into account multi-gene bias |
de |
a vector of ones and zeroes of the same length of universe indicating which genes in the universe are significantly differentially methylated. |
fract.counts |
a dataframe with 2 columns corresponding to the Entrez Gene IDS for the significant probes and the associated weight to account for multi-gene probes. |
Belinda Phipson
## Not run: # to avoid timeout on Bioconductor build library(IlluminaHumanMethylation450kanno.ilmn12.hg19) library(org.Hs.eg.db) library(limma) ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19) # Randomly select 1000 CpGs to be significantly differentially methylated sigcpgs <- sample(rownames(ann),1000,replace=FALSE) # All CpG sites tested allcpgs <- rownames(ann) mappedEz <- getMappedEntrezIDs(sigcpgs,allcpgs,array.type="450K") names(mappedEz) # Entrez IDs of the significant genes mappedEz$sig.eg[1:10] # Entrez IDs for the universe mappedEz$universe[1:10] # Number of CpGs per gene mappedEz$freq[1:10] # Equivalent numbers of CpGs measured per gene mappedEz$equiv[1:10] A vector of 0s and 1s indicating which genes in the universe are significant mappedEz$de[1:10] ## End(Not run)
## Not run: # to avoid timeout on Bioconductor build library(IlluminaHumanMethylation450kanno.ilmn12.hg19) library(org.Hs.eg.db) library(limma) ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19) # Randomly select 1000 CpGs to be significantly differentially methylated sigcpgs <- sample(rownames(ann),1000,replace=FALSE) # All CpG sites tested allcpgs <- rownames(ann) mappedEz <- getMappedEntrezIDs(sigcpgs,allcpgs,array.type="450K") names(mappedEz) # Entrez IDs of the significant genes mappedEz$sig.eg[1:10] # Entrez IDs for the universe mappedEz$universe[1:10] # Number of CpGs per gene mappedEz$freq[1:10] # Equivalent numbers of CpGs measured per gene mappedEz$equiv[1:10] A vector of 0s and 1s indicating which genes in the universe are significant mappedEz$de[1:10] ## End(Not run)
Tests gene ontology enrichment for significant CpGs from Illumina's Infinium HumanMethylation450 or MethylationEPIC array, taking into account two different sources of bias: 1) the differing number of probes per gene present on the array, and 2) CpGs that are annotated to multiple genes.
gometh( sig.cpg, all.cpg = NULL, collection = c("GO", "KEGG"), array.type = c("450K", "EPIC", "EPIC_V2"), plot.bias = FALSE, prior.prob = TRUE, anno = NULL, equiv.cpg = TRUE, fract.counts = TRUE, genomic.features = c("ALL", "TSS200", "TSS1500", "Body", "1stExon", "3'UTR", "5'UTR", "ExonBnd"), sig.genes = FALSE )
gometh( sig.cpg, all.cpg = NULL, collection = c("GO", "KEGG"), array.type = c("450K", "EPIC", "EPIC_V2"), plot.bias = FALSE, prior.prob = TRUE, anno = NULL, equiv.cpg = TRUE, fract.counts = TRUE, genomic.features = c("ALL", "TSS200", "TSS1500", "Body", "1stExon", "3'UTR", "5'UTR", "ExonBnd"), sig.genes = FALSE )
sig.cpg |
Character vector of significant CpG sites to test for GO term enrichment. |
all.cpg |
Character vector of all CpG sites tested. Defaults to all CpG sites on the array. |
collection |
The collection of pathways to test. Options are "GO" and "KEGG". Defaults to "GO". |
array.type |
The Illumina methylation array used. Options are "450K", "EPIC" or "EPIC_V2". Defaults to "450K". |
plot.bias |
Logical, if true a plot showing the bias due to the differing numbers of probes per gene will be displayed. |
prior.prob |
Logical, if true will take into account the probability of significant differential methylation due to numbers of probes per gene. If false, a hypergeometric test is performed ignoring any bias in the data. |
anno |
Optional. A |
equiv.cpg |
Logical, if true then equivalent numbers of cpgs are used
for odds calculation rather than total number cpgs. Only used if
|
fract.counts |
Logical, if true then fractional counting of Cpgs is used
to account for CpGs that are annotated to multiple genes. Only used if
|
genomic.features |
Character vector or scalar indicating whether the gene set enrichment analysis should be restricted to CpGs from specific genomic locations. Options are "ALL", "TSS200","TSS1500","Body","1stExon", "3'UTR","5'UTR","ExonBnd"; and the user can select any combination. Defaults to "ALL". |
sig.genes |
Logical, if true then the significant differentially methylated genes that overlap with the gene set of interest is outputted as the final column in the results table. Default is FALSE. |
This function takes a character vector of significant CpG sites, maps the
CpG sites to Entrez Gene IDs, and tests for GO term or KEGG pathway
enrichment using a Wallenius' non central hypergeometric test, taking into
account the number of CpG sites per gene on the 450K/EPIC array and
multi-gene annotated CpGs. Geeleher et al. (2013) showed
that a severe bias exists when performing gene set analysis for genome-wide
methylation data that occurs due to the differing numbers of CpG sites
profiled for each gene. gometh
is based on the goseq
method
(Young et al., 2010), and is a modification of the goana
function in
the limma
package. If prior.prob
is set to
FALSE, then prior probabilities are not used and it is assumed that each
gene is equally likely to have a significant CpG site associated with it.
The testing now also takes into account that some CpGs are annotated to
multiple genes. For a small number of gene families, this previously caused
their associated GO categories/gene sets to be erroneously overrepresented
and thus highly significant. If fract.counts=FALSE
then CpGs are
allowed to map to multiple genes (this is NOT recommended).
A new feature of gometh
and gsameth
is the ability to restrict
the input CpGs by genomic feature with the argument genomic.features
.
The possible options include "ALL", "TSS200", "TSS1500", "Body", "1stExon",
"3'UTR", "5'UTR" and "ExonBnd", and the user may specify any combination.
Please not that "ExonBnd" is not an annotated feature on 450K arrays. For
example if you are interested in the promoter region only, you could specify
genomic.features = c("TSS1500","TSS200","1stExon")
. The default
behaviour is to test all input CpGs sig.cpg
even if the user specifies
"ALL" and one or more other features.
Genes associated with each CpG site are obtained from the annotation package
IlluminaHumanMethylation450kanno.ilmn12.hg19
if the array type is
"450K". For the EPIC array, the annotation package
IlluminaHumanMethylationEPICanno.ilm10b4.hg19
is used. For the EPIC v2
array, the annotation package
IlluminaHumanMethylationEPICv2anno.20a1.hg38
is used. To use a
different annotation package, please supply it using the anno
argument.
If you are interested in which genes overlap with the genes in the gene set,
setting sig.genes
to TRUE will output an additional column in the
results data frame that contains all the significant differentially
methylated gene symbols, comma separated. The default is FALSE.
In order to get a list which contains the mapped Entrez gene IDs,
please use the getMappedEntrezIDs
function. gometh
tests all
GO or KEGG terms, and false discovery rates are calculated using the method
of Benjamini and Hochberg (1995). The topGSA
function can be used to
display the top 20 most enriched pathways.
For more generalised gene set testing where the user can specify the gene
set/s of interest to be tested, please use the gsameth
function.
If you are interested in performing gene set testing following a region
analysis, then the functions goregion
and gsaregion
can be
used.
A data frame with a row for each GO or KEGG term and the following columns:
Term |
GO term if testing GO pathways |
Ont |
ontology that the GO term belongs to if testing GO pathways. "BP" - biological process, "CC" - cellular component, "MF" - molecular function. |
Pathway |
the KEGG pathway being tested if testing KEGG terms. |
N |
number of genes in the GO or KEGG term |
DE |
number of genes that are differentially methylated |
P.DE |
p-value for over-representation of the GO or KEGG term term |
FDR |
False discovery rate |
SigGenesInSet |
Significant differentially methylated genes overlapping with the gene set of interest. |
Belinda Phipson
Phipson, B., Maksimovic, J., and Oshlack, A. (2016). missMethyl: an R package for analysing methylation data from Illuminas HumanMethylation450 platform. Bioinformatics, 15;32(2), 286–8.
Geeleher, P., Hartnett, L., Egan, L. J., Golden, A., Ali, R. A. R., and Seoighe, C. (2013). Gene-set analysis is severely biased when applied to genome-wide methylation data. Bioinformatics, 29(15), 1851–1857.
Young, M. D., Wakefield, M. J., Smyth, G. K., and Oshlack, A. (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology, 11, R14.
Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, gkv007.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series, B, 57, 289-300.
gsameth,goregion,gsaregion,
getMappedEntrezIDs
## Not run: # to avoid timeout on Bioconductor build library(IlluminaHumanMethylation450kanno.ilmn12.hg19) library(limma) ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19) # Randomly select 1000 CpGs to be significantly differentially methylated sigcpgs <- sample(rownames(ann),1000,replace=FALSE) # All CpG sites tested allcpgs <- rownames(ann) # GO testing with prior probabilities taken into account # Plot of bias due to differing numbers of CpG sites per gene gst <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "GO", plot.bias = TRUE, prior.prob = TRUE, anno = ann) # Total number of GO categories significant at 5% FDR table(gst$FDR<0.05) # Table of top GO results topGSA(gst) # GO testing ignoring bias gst.bias <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "GO", prior.prob=FALSE, anno = ann) # Total number of GO categories significant at 5% FDR ignoring bias table(gst.bias$FDR<0.05) # Table of top GO results ignoring bias topGSA(gst.bias) # GO testing ignoring multi-mapping CpGs gst.multi <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "GO", plot.bias = TRUE, prior.prob = TRUE, fract.counts = FALSE, anno = ann) topGSA(gst.multi, n=10) # Restrict to CpGs in promoter regions gst.promoter <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "GO", anno = ann, genomic.features=c("TSS200","TSS1500","1stExon")) topGSA(gst.promoter) # KEGG testing kegg <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "KEGG", prior.prob=TRUE, anno = ann) # Table of top KEGG results topGSA(kegg) # Add significant genes to KEGG output kegg.siggenes <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "KEGG", anno = ann, sig.genes = TRUE) # Output top 5 KEGG pathways topGSA(kegg.siggenes, n=5) ## End(Not run)
## Not run: # to avoid timeout on Bioconductor build library(IlluminaHumanMethylation450kanno.ilmn12.hg19) library(limma) ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19) # Randomly select 1000 CpGs to be significantly differentially methylated sigcpgs <- sample(rownames(ann),1000,replace=FALSE) # All CpG sites tested allcpgs <- rownames(ann) # GO testing with prior probabilities taken into account # Plot of bias due to differing numbers of CpG sites per gene gst <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "GO", plot.bias = TRUE, prior.prob = TRUE, anno = ann) # Total number of GO categories significant at 5% FDR table(gst$FDR<0.05) # Table of top GO results topGSA(gst) # GO testing ignoring bias gst.bias <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "GO", prior.prob=FALSE, anno = ann) # Total number of GO categories significant at 5% FDR ignoring bias table(gst.bias$FDR<0.05) # Table of top GO results ignoring bias topGSA(gst.bias) # GO testing ignoring multi-mapping CpGs gst.multi <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "GO", plot.bias = TRUE, prior.prob = TRUE, fract.counts = FALSE, anno = ann) topGSA(gst.multi, n=10) # Restrict to CpGs in promoter regions gst.promoter <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "GO", anno = ann, genomic.features=c("TSS200","TSS1500","1stExon")) topGSA(gst.promoter) # KEGG testing kegg <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "KEGG", prior.prob=TRUE, anno = ann) # Table of top KEGG results topGSA(kegg) # Add significant genes to KEGG output kegg.siggenes <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "KEGG", anno = ann, sig.genes = TRUE) # Output top 5 KEGG pathways topGSA(kegg.siggenes, n=5) ## End(Not run)
Tests gene ontology or KEGG pathway enrichment for differentially methylated regions (DMRs) identified from Illumina's Infinium HumanMethylation450 or MethylationEPIC array, taking into account the differing number of probes per gene present on the array.
goregion( regions, all.cpg = NULL, collection = c("GO", "KEGG"), array.type = c("450K", "EPIC", "EPIC_V2"), plot.bias = FALSE, prior.prob = TRUE, anno = NULL, equiv.cpg = TRUE, fract.counts = TRUE, genomic.features = c("ALL", "TSS200", "TSS1500", "Body", "1stExon", "3'UTR", "5'UTR", "ExonBnd"), sig.genes = FALSE )
goregion( regions, all.cpg = NULL, collection = c("GO", "KEGG"), array.type = c("450K", "EPIC", "EPIC_V2"), plot.bias = FALSE, prior.prob = TRUE, anno = NULL, equiv.cpg = TRUE, fract.counts = TRUE, genomic.features = c("ALL", "TSS200", "TSS1500", "Body", "1stExon", "3'UTR", "5'UTR", "ExonBnd"), sig.genes = FALSE )
regions |
|
all.cpg |
Character vector of all CpG sites tested. Defaults to all CpG sites on the array. |
collection |
The collection of pathways to test. Options are "GO" and "KEGG". Defaults to "GO". |
array.type |
The Illumina methylation array used. Options are "450K", "EPIC" or "EPIC_V2". Defaults to "450K". |
plot.bias |
Logical, if true a plot showing the bias due to the differing numbers of probes per gene will be displayed. |
prior.prob |
Logical, if true will take into account the probability of significant differentially methylation due to numbers of probes per gene. If false, a hypergeometric test is performed ignoring any bias in the data. |
anno |
Optional. A |
equiv.cpg |
Logical, if true then equivalent numbers of cpgs are used
for odds calculation rather than total number cpgs. Only used if
|
fract.counts |
Logical, if true then fractional counting of cpgs is used
to account for cgps that map to multiple genes. Only used if
|
genomic.features |
Character vector or scalar indicating whether the gene set enrichment analysis should be restricted to CpGs from specific genomic locations. Options are "ALL", "TSS200","TSS1500","Body","1stExon", "3'UTR","5'UTR","ExonBnd"; and the user can select any combination. Defaults to "ALL". |
sig.genes |
Logical, if true then the significant differentially methylated genes that overlap with the gene set of interest is outputted as the final column in the results table. Default is FALSE. |
This function takes a GRanges
object of DMR coordinates, maps them to
CpG sites on the array and then to Entrez Gene IDs, and tests for GO term or
KEGG pathway enrichment using Wallenius' noncentral hypergeometric test,
taking into account the number of CpG sites per gene on the 450K/EPIC array.
If prior.prob
is set to FALSE, then prior probabilities are not used
and it is assumed that each gene is equally likely to have a significant CpG
site associated with it. Please not that we have tested goregion
and
gsaregion
extensively using the DMRCate
package to identify
differentially methylated regions (Peters, et al., 2015).
The testing now also takes into account that some CpGs map to multiple genes.
For a small number of gene families, this previously caused their associated
GO categories/gene sets to be erroneously overrepresented and thus highly
significant. If fract.counts=FALSE
then CpGs are allowed to map to
multiple genes (this is NOT recommended).
Genes associated with each CpG site are obtained from the annotation
package IlluminaHumanMethylation450kanno.ilmn12.hg19
if the array
type is "450K". For the EPIC array, the annotation package
IlluminaHumanMethylationEPICanno.ilm10b4.hg19
is used. For the EPIC v2
array, the annotation package
IlluminaHumanMethylationEPICv2anno.20a1.hg38
is used. To use a
different annotation package, please supply it using the anno
argument.
In order to get a list which contains the mapped Entrez gene IDS,
please use the getMappedEntrezIDs
function. goregion
tests all
GO or KEGG terms, and false discovery rates are calculated using the method
of Benjamini and Hochberg (1995). The topGSA
function can be used to
display the top 20 most enriched pathways.
If you are interested in which genes overlap with the genes in the gene set,
setting sig.genes
to TRUE will output an additional column in the
results data frame that contains all the significant differentially
methylated gene symbols, comma separated. The default is FALSE.
For more generalised gene set testing where the user can specify the gene
set/s of interest to be tested, please use the gsaregion
function.
A data frame with a row for each GO or KEGG term and the following columns:
Term |
GO term if testing GO pathways |
Ont |
ontology that the GO term belongs to if testing GO pathways. "BP" - biological process, "CC" - cellular component, "MF" - molecular function. |
Pathway |
the KEGG pathway being tested if testing KEGG terms. |
N |
number of genes in the GO or KEGG term |
DE |
number of genes that are differentially methylated |
P.DE |
p-value for over-representation of the GO or KEGG term term |
FDR |
False discovery rate |
SigGenesInSet |
Significant differentially methylated genes overlapping with the gene set of interest. |
Jovana Maksimovic
Phipson, B., Maksimovic, J., and Oshlack, A. (2016). missMethyl: an R package for analysing methylation data from Illuminas HumanMethylation450 platform. Bioinformatics, 15;32(2), 286–8.
Geeleher, P., Hartnett, L., Egan, L. J., Golden, A., Ali, R. A. R., and Seoighe, C. (2013). Gene-set analysis is severely biased when applied to genome-wide methylation data. Bioinformatics, 29(15), 1851–1857.
Young, M. D., Wakefield, M. J., Smyth, G. K., and Oshlack, A. (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology, 11, R14.
Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, gkv007.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series, B, 57, 289-300.
Peters, T.J., Buckley, M.J., Statham, A.L., Pidsley, R., Samaras, K., Lord, R.V., Clark, S.J.,Molloy, P.L. (2015). De novo identification of differentially methylated regions in the human genome. Epigenetics & Chromatin, 8, 6.
## Not run: # to avoid timeout on Bioconductor build library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19) library(limma) library(DMRcate) library(ExperimentHub) # Follow the example for the dmrcate function to get some EPIC data from # ExperimentHub eh <- ExperimentHub() FlowSorted.Blood.EPIC <- eh[["EH1136"]] tcell <- FlowSorted.Blood.EPIC[,colData(FlowSorted.Blood.EPIC)$CD4T==100 | colData(FlowSorted.Blood.EPIC)$CD8T==100] detP <- detectionP(tcell) remove <- apply(detP, 1, function (x) any(x > 0.01)) tcell <- tcell[!remove,] tcell <- preprocessFunnorm(tcell) #Subset to chr2 only tcell <- tcell[seqnames(tcell) == "chr2",] tcellms <- getM(tcell) tcellms.noSNPs <- rmSNPandCH(tcellms, dist=2, mafcut=0.05) tcell$Replicate[tcell$Replicate==""] <- tcell$Sample_Name[tcell$Replicate==""] tcellms.noSNPs <- avearrays(tcellms.noSNPs, tcell$Replicate) tcell <- tcell[,!duplicated(tcell$Replicate)] tcell <- tcell[rownames(tcellms.noSNPs),] colnames(tcellms.noSNPs) <- colnames(tcell) assays(tcell)[["M"]] <- tcellms.noSNPs assays(tcell)[["Beta"]] <- ilogit2(tcellms.noSNPs) # Perform region analysis type <- factor(tcell$CellType) design <- model.matrix(~type) myannotation <- cpg.annotate("array", tcell, arraytype = "EPIC", analysis.type="differential", design=design, coef=2) # Run DMRCate with beta value cut-off filter of 0.1 dmrcoutput <- dmrcate(myannotation, lambda=1000, C=2, betacutoff = 0.1) regions <- extractRanges(dmrcoutput) length(regions) ann <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19) # All CpG sites tested (limited to chr 2) allcpgs <- rownames(tcell) # GO testing with prior probabilities taken into account # Plot of bias due to differing numbers of CpG sites per gene gst <- goregion(regions = regions, all.cpg = allcpgs, collection = "GO", array.type = "EPIC", plot.bias = TRUE, prior.prob = TRUE, anno = ann) # Table of top GO results topGSA(gst, n=10) # KEGG testing kegg <- goregion(regions = regions, all.cpg = allcpgs, collection = "KEGG", array.type = "EPIC", prior.prob=TRUE, anno = ann) # Table of top KEGG results topGSA(kegg, n=10) # Restrict to promoter regions gst.prom <- goregion(regions = regions, all.cpg = allcpgs, collection = "GO", array.type = "EPIC", plot.bias = TRUE, prior.prob = TRUE, anno = ann, genomic.features = c("TSS200","TSS1500")) topGSA(gst.prom, n=10) # Add significant genes in gene set to KEGG output kegg <- goregion(regions = regions, all.cpg = allcpgs, collection = "KEGG", array.type = "EPIC", prior.prob=TRUE, anno = ann, sig.genes = TRUE) # Table of top KEGG results topGSA(kegg, n=5) ## End(Not run)
## Not run: # to avoid timeout on Bioconductor build library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19) library(limma) library(DMRcate) library(ExperimentHub) # Follow the example for the dmrcate function to get some EPIC data from # ExperimentHub eh <- ExperimentHub() FlowSorted.Blood.EPIC <- eh[["EH1136"]] tcell <- FlowSorted.Blood.EPIC[,colData(FlowSorted.Blood.EPIC)$CD4T==100 | colData(FlowSorted.Blood.EPIC)$CD8T==100] detP <- detectionP(tcell) remove <- apply(detP, 1, function (x) any(x > 0.01)) tcell <- tcell[!remove,] tcell <- preprocessFunnorm(tcell) #Subset to chr2 only tcell <- tcell[seqnames(tcell) == "chr2",] tcellms <- getM(tcell) tcellms.noSNPs <- rmSNPandCH(tcellms, dist=2, mafcut=0.05) tcell$Replicate[tcell$Replicate==""] <- tcell$Sample_Name[tcell$Replicate==""] tcellms.noSNPs <- avearrays(tcellms.noSNPs, tcell$Replicate) tcell <- tcell[,!duplicated(tcell$Replicate)] tcell <- tcell[rownames(tcellms.noSNPs),] colnames(tcellms.noSNPs) <- colnames(tcell) assays(tcell)[["M"]] <- tcellms.noSNPs assays(tcell)[["Beta"]] <- ilogit2(tcellms.noSNPs) # Perform region analysis type <- factor(tcell$CellType) design <- model.matrix(~type) myannotation <- cpg.annotate("array", tcell, arraytype = "EPIC", analysis.type="differential", design=design, coef=2) # Run DMRCate with beta value cut-off filter of 0.1 dmrcoutput <- dmrcate(myannotation, lambda=1000, C=2, betacutoff = 0.1) regions <- extractRanges(dmrcoutput) length(regions) ann <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19) # All CpG sites tested (limited to chr 2) allcpgs <- rownames(tcell) # GO testing with prior probabilities taken into account # Plot of bias due to differing numbers of CpG sites per gene gst <- goregion(regions = regions, all.cpg = allcpgs, collection = "GO", array.type = "EPIC", plot.bias = TRUE, prior.prob = TRUE, anno = ann) # Table of top GO results topGSA(gst, n=10) # KEGG testing kegg <- goregion(regions = regions, all.cpg = allcpgs, collection = "KEGG", array.type = "EPIC", prior.prob=TRUE, anno = ann) # Table of top KEGG results topGSA(kegg, n=10) # Restrict to promoter regions gst.prom <- goregion(regions = regions, all.cpg = allcpgs, collection = "GO", array.type = "EPIC", plot.bias = TRUE, prior.prob = TRUE, anno = ann, genomic.features = c("TSS200","TSS1500")) topGSA(gst.prom, n=10) # Add significant genes in gene set to KEGG output kegg <- goregion(regions = regions, all.cpg = allcpgs, collection = "KEGG", array.type = "EPIC", prior.prob=TRUE, anno = ann, sig.genes = TRUE) # Table of top KEGG results topGSA(kegg, n=5) ## End(Not run)
Given a user specified list of gene sets to test, gsameth
tests
whether significantly differentially methylated CpG sites are enriched in
these gene sets.
gsameth( sig.cpg, all.cpg = NULL, collection, array.type = c("450K", "EPIC", "EPIC_V2"), plot.bias = FALSE, prior.prob = TRUE, anno = NULL, equiv.cpg = TRUE, fract.counts = TRUE, genomic.features = c("ALL", "TSS200", "TSS1500", "Body", "1stExon", "3'UTR", "5'UTR", "ExonBnd"), sig.genes = FALSE )
gsameth( sig.cpg, all.cpg = NULL, collection, array.type = c("450K", "EPIC", "EPIC_V2"), plot.bias = FALSE, prior.prob = TRUE, anno = NULL, equiv.cpg = TRUE, fract.counts = TRUE, genomic.features = c("ALL", "TSS200", "TSS1500", "Body", "1stExon", "3'UTR", "5'UTR", "ExonBnd"), sig.genes = FALSE )
sig.cpg |
Character vector of significant CpG sites to test for gene set enrichment. |
all.cpg |
Character vector of all CpG sites tested. Defaults to all CpG sites on the array. |
collection |
A list of user specified gene sets to test. Can also be a single character vector gene set. Gene identifiers must be Entrez Gene IDs. |
array.type |
The Illumina methylation array used. Options are "450K", "EPIC" or "EPIC_V2". Defaults to "450K". |
plot.bias |
Logical, if true a plot showing the bias due to the differing numbers of probes per gene will be displayed |
prior.prob |
Logical, if true will take into account the probability of significant differentially methylation due to numbers of probes per gene. If false, a hypergeometric test is performed ignoring any bias in the data. |
anno |
Optional. A |
equiv.cpg |
Logical, if true then equivalent numbers of cpgs are used
for odds calculation rather than total number cpgs. Only used if
|
fract.counts |
Logical, if true then fractional counting of cpgs is used
to account for cgps that map to multiple genes. Only used if
|
genomic.features |
Character vector or scalar indicating whether the gene set enrichment analysis should be restricted to CpGs from specific genomic locations. Options are "ALL", "TSS200","TSS1500","Body","1stExon", "3'UTR","5'UTR","ExonBnd"; and the user can select any combination. Defaults to "ALL". |
sig.genes |
Logical, if true then the significant differentially methylated genes that overlap with the gene set of interest is outputted as the final column in the results table. Default is FALSE. |
This function extends gometh
, which only tests GO and KEGG pathways.
gsameth
can take a list of user specified gene sets and test whether
the significant CpG sites are enriched in these pathways. gsameth
maps the CpG sites to Entrez Gene IDs and tests for pathway enrichment using
Wallenius' concentral hypergeometric test, taking into account the number of
CpG sites per gene on the 450K/EPIC arrays. Please note the gene ids for the
collection of gene,sets must be Entrez Gene IDs. If prior.prob
is set
to FALSE, then prior probabilities are not used and it is assumed that each
gene is equally likely to have a significant CpG site associated with it.
The testing now also takes into account that some CpGs map to multiple genes.
For a small number of gene families, this previously caused their associated
GO categories/gene sets to be erroneously overrepresented and thus highly
significant. If fract.counts=FALSE
then CpGs are allowed to map to
multiple genes (this is NOT recommended).
A new feature of gometh
and gsameth
is the ability to restrict
the input CpGs by genomic feature with the argument genomic.features
.
The possible options include "ALL", "TSS200", "TSS1500", "Body", "1stExon",
"3'UTR", "5'UTR" and "ExonBnd", and the user may specify any combination.
Please not that "ExonBnd" is not an annotatedfeature on 450K arrays. For
example if you are interested in the promoter region only, you could specify
genomic.features = c("TSS1500","TSS200","1stExon")
. The default
behaviour is to test all input CpGs sig.cpg
even if the user specifies
"ALL" and one or more other features.
Genes associated with each CpG site are obtained from the annotation
package IlluminaHumanMethylation450kanno.ilmn12.hg19
if the array
type is "450K". For the EPIC array, the annotation package
IlluminaHumanMethylationEPICanno.ilm10b4.hg19
is used. For the EPIC v2
array, the annotation package
IlluminaHumanMethylationEPICv2anno.20a1.hg38
is used. To use a
different annotation package, please supply it using the anno
argument.
In order to get a list which contains the mapped Entrez gene IDS,
please use the getMappedEntrezIDs
function.
If you are interested in which genes overlap with the genes in the gene set,
setting sig.genes
to TRUE will output an additional column in the
results data frame that contains all the significant differentially
methylated gene symbols, comma separated. The default is FALSE.
A data frame with a row for each gene set and the following columns:
N |
number of genes in the gene set |
DE |
number of genes that are differentially methylated |
P.DE |
p-value for over-representation of the gene set |
FDR |
False discovery rate, calculated using the method of Benjamini and Hochberg (1995). |
SigGenesInSet |
Significant differentially methylated genes overlapping with the gene set of interest. |
Belinda Phipson
Phipson, B., Maksimovic, J., and Oshlack, A. (2016). missMethyl: an R package for analysing methylation data from Illuminas HumanMethylation450 platform. Bioinformatics, 15;32(2), 286–8.
Geeleher, P., Hartnett, L., Egan, L. J., Golden, A., Ali, R. A. R., and Seoighe, C. (2013). Gene-set analysis is severely biased when applied to genome-wide methylation data. Bioinformatics, 29(15), 1851–1857.
Young, M. D., Wakefield, M. J., Smyth, G. K., and Oshlack, A. (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology, 11, R14.
Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, gkv007.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series, B, 57, 289-300.
## Not run: # to avoid timeout on Bioconductor build library(IlluminaHumanMethylation450kanno.ilmn12.hg19) library(org.Hs.eg.db) library(limma) ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19) # Randomly select 1000 CpGs to be significantly differentially methylated sigcpgs <- sample(rownames(ann),1000,replace=FALSE) # All CpG sites tested allcpgs <- rownames(ann) # Use org.Hs.eg.db to extract a GO term GOtoID <- suppressMessages(select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), columns=c("ENTREZID","GO"), keytype="ENTREZID")) setname1 <- GOtoID$GO[1] setname1 keep.set1 <- GOtoID$GO %in% setname1 set1 <- GOtoID$ENTREZID[keep.set1] setname2 <- GOtoID$GO[2] setname2 keep.set2 <- GOtoID$GO %in% setname2 set2 <- GOtoID$ENTREZID[keep.set2] # Make the gene sets into a list sets <- list(set1, set2) names(sets) <- c(setname1,setname2) # Testing with prior probabilities taken into account # Plot of bias due to differing numbers of CpG sites per gene gst <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, plot.bias = TRUE, prior.prob = TRUE) topGSA(gst) # Add significant gene symbols in each set to output gst <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, plot.bias = TRUE, prior.prob = TRUE, sig.genes = TRUE) topGSA(gst) # Testing ignoring bias gst.bias <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, prior.prob = FALSE) topGSA(gst.bias) # Restrict to CpGs in gene bodies gst.body <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, genomic.features = "Body") topGSA(gst.body) ## End(Not run)
## Not run: # to avoid timeout on Bioconductor build library(IlluminaHumanMethylation450kanno.ilmn12.hg19) library(org.Hs.eg.db) library(limma) ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19) # Randomly select 1000 CpGs to be significantly differentially methylated sigcpgs <- sample(rownames(ann),1000,replace=FALSE) # All CpG sites tested allcpgs <- rownames(ann) # Use org.Hs.eg.db to extract a GO term GOtoID <- suppressMessages(select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), columns=c("ENTREZID","GO"), keytype="ENTREZID")) setname1 <- GOtoID$GO[1] setname1 keep.set1 <- GOtoID$GO %in% setname1 set1 <- GOtoID$ENTREZID[keep.set1] setname2 <- GOtoID$GO[2] setname2 keep.set2 <- GOtoID$GO %in% setname2 set2 <- GOtoID$ENTREZID[keep.set2] # Make the gene sets into a list sets <- list(set1, set2) names(sets) <- c(setname1,setname2) # Testing with prior probabilities taken into account # Plot of bias due to differing numbers of CpG sites per gene gst <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, plot.bias = TRUE, prior.prob = TRUE) topGSA(gst) # Add significant gene symbols in each set to output gst <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, plot.bias = TRUE, prior.prob = TRUE, sig.genes = TRUE) topGSA(gst) # Testing ignoring bias gst.bias <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, prior.prob = FALSE) topGSA(gst.bias) # Restrict to CpGs in gene bodies gst.body <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, genomic.features = "Body") topGSA(gst.body) ## End(Not run)
Given a user specified list of gene sets to test, gsaregion
tests
whether differentially methylated regions (DMRs) identified from Illumina's
Infinium HumanMethylation450 or MethylationEPIC array are enriched, taking
into account the differing number of probes per gene present on the array.
gsaregion( regions, all.cpg = NULL, collection, array.type = c("450K", "EPIC", "EPIC_V2"), plot.bias = FALSE, prior.prob = TRUE, anno = NULL, equiv.cpg = TRUE, fract.counts = TRUE, genomic.features = c("ALL", "TSS200", "TSS1500", "Body", "1stExon", "3'UTR", "5'UTR", "ExonBnd"), sig.genes = FALSE )
gsaregion( regions, all.cpg = NULL, collection, array.type = c("450K", "EPIC", "EPIC_V2"), plot.bias = FALSE, prior.prob = TRUE, anno = NULL, equiv.cpg = TRUE, fract.counts = TRUE, genomic.features = c("ALL", "TSS200", "TSS1500", "Body", "1stExon", "3'UTR", "5'UTR", "ExonBnd"), sig.genes = FALSE )
regions |
|
all.cpg |
Character vector of all CpG sites tested. Defaults to all CpG sites on the array. |
collection |
A list of user specified gene sets to test. Can also be a single character vector gene set. Gene identifiers must be Entrez Gene IDs. |
array.type |
The Illumina methylation array used. Options are "450K", "EPIC" or "EPIC_V2". Defaults to "450K". |
plot.bias |
Logical, if true a plot showing the bias due to the differing numbers of probes per gene will be displayed. |
prior.prob |
Logical, if true will take into account the probability of significant differentially methylation due to numbers of probes per gene. If false, a hypergeometric test is performed ignoring any bias in the data. |
anno |
Optional. A |
equiv.cpg |
Logical, if true then equivalent numbers of cpgs are used
for odds calculation rather than total number cpgs. Only used if
|
fract.counts |
Logical, if true then fractional counting of cpgs is used
to account for cgps that map to multiple genes. Only used if
|
genomic.features |
Character vector or scalar indicating whether the gene set enrichment analysis should be restricted to CpGs from specific genomic locations. Options are "ALL", "TSS200","TSS1500","Body","1stExon", "3'UTR","5'UTR","ExonBnd"; and the user can select any combination. Defaults to "ALL". |
sig.genes |
Logical, if true then the significant differentially methylated genes that overlap with the gene set of interest is outputted as the final column in the results table. Default is FALSE. |
This function extends goregion
, which only tests GO and KEGG pathways.
gsaregion
can take a list of user specified gene sets and test whether
the significant DMRs are enriched in these pathways. This function takes a
GRanges
object of DMR coordinates, maps them to CpG sites on the
array and then to Entrez Gene IDs, and tests for enrichment using Wallenius'
noncentral hypergeometric test, taking into account the number of CpG sites
per gene on the 450K/EPIC array. If prior.prob
is set to FALSE, then
prior probabilities are not used and it is assumed that each gene is equally
likely to have a significant CpG site associated with it.
The testing now also takes into account that some CpGs map to multiple genes.
For a small number of gene families, this previously caused their associated
GO categories/gene sets to be erroneously overrepresented and thus highly
significant. If fract.counts=FALSE
then CpGs are allowed to map to
multiple genes (this is NOT recommended).
Genes associated with each CpG site are obtained from the annotation package
IlluminaHumanMethylation450kanno.ilmn12.hg19
if the array type is
"450K". For the EPIC array, the annotation package
IlluminaHumanMethylationEPICanno.ilm10b4.hg19
is used. For the EPIC v2
array, the annotation package
IlluminaHumanMethylationEPICv2anno.20a1.hg38
is used. To use a
different annotation package, please supply it using the anno
argument.
In order to get a list which contains the mapped Entrez gene IDS,
please use the getMappedEntrezIDs
function. The topGSA
function
can be used to display the top 20 most enriched pathways.
If you are interested in which genes overlap with the genes in the gene set,
setting sig.genes
to TRUE will output an additional column in the
results data frame that contains all the significant differentially
methylated gene symbols, comma separated. The default is FALSE.
A data frame with a row for each gene set and the following columns:
N |
number of genes in the gene set |
DE |
number of genes that are differentially methylated |
P.DE |
p-value for over-representation of the gene set |
FDR |
False discovery rate, calculated using the method of Benjamini and Hochberg (1995). |
SigGenesInSet |
Significant differentially methylated genes overlapping with the gene set of interest. |
Jovana Maksimovic
Phipson, B., Maksimovic, J., and Oshlack, A. (2016). missMethyl: an R package for analysing methylation data from Illuminas HumanMethylation450 platform. Bioinformatics, 15;32(2), 286–8.
Geeleher, P., Hartnett, L., Egan, L. J., Golden, A., Ali, R. A. R., and Seoighe, C. (2013). Gene-set analysis is severely biased when applied to genome-wide methylation data. Bioinformatics, 29(15), 1851–1857.
Young, M. D., Wakefield, M. J., Smyth, G. K., and Oshlack, A. (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology, 11, R14.
Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, gkv007.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series, B, 57, 289-300.
gometh,goregion,gsameth,
getMappedEntrezIDs
## Not run: # to avoid timeout on Bioconductor build library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19) library(limma) library(DMRcate) library(ExperimentHub) library(org.Hs.eg.db) # Follow the example for the dmrcate function to get some EPIC data from # ExperimentHub eh <- ExperimentHub() FlowSorted.Blood.EPIC <- eh[["EH1136"]] tcell <- FlowSorted.Blood.EPIC[,colData(FlowSorted.Blood.EPIC)$CD4T==100 | colData(FlowSorted.Blood.EPIC)$CD8T==100] detP <- detectionP(tcell) remove <- apply(detP, 1, function (x) any(x > 0.01)) tcell <- tcell[!remove,] tcell <- preprocessFunnorm(tcell) #Subset to chr2 only tcell <- tcell[seqnames(tcell) == "chr2",] tcellms <- getM(tcell) tcellms.noSNPs <- rmSNPandCH(tcellms, dist=2, mafcut=0.05) tcell$Replicate[tcell$Replicate==""] <- tcell$Sample_Name[tcell$Replicate==""] tcellms.noSNPs <- avearrays(tcellms.noSNPs, tcell$Replicate) tcell <- tcell[,!duplicated(tcell$Replicate)] tcell <- tcell[rownames(tcellms.noSNPs),] colnames(tcellms.noSNPs) <- colnames(tcell) assays(tcell)[["M"]] <- tcellms.noSNPs assays(tcell)[["Beta"]] <- ilogit2(tcellms.noSNPs) # Perform region analysis type <- factor(tcell$CellType) design <- model.matrix(~type) myannotation <- cpg.annotate("array", tcell, arraytype = "EPIC", analysis.type="differential", design=design, coef=2) # Run DMRCate with beta value cut-off filter of 0.1 dmrcoutput <- dmrcate(myannotation, lambda=1000, C=2, betacutoff = 0.1) regions <- extractRanges(dmrcoutput) length(regions) ann <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19) # All CpG sites tested (limited to chr 2) allcpgs <- rownames(tcell) # Use org.Hs.eg.db to extract a GO term GOtoID <- suppressMessages(select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), columns=c("ENTREZID","GO"), keytype="ENTREZID")) keep.set1 <- GOtoID$GO %in% "GO:0010951" set1 <- GOtoID$ENTREZID[keep.set1] keep.set2 <- GOtoID$GO %in% "GO:0042742" set2 <- GOtoID$ENTREZID[keep.set2] keep.set3 <- GOtoID$GO %in% "GO:0031295" set3 <- GOtoID$ENTREZID[keep.set3] # Make the gene sets into a list sets <- list(set1, set2, set3) names(sets) <- c("GO:0010951","GO:0042742","GO:0031295") # Testing with prior probabilities taken into account # Plot of bias due to differing numbers of CpG sites per gene gst <- gsaregion(regions = regions, all.cpg = allcpgs, collection = sets, array.type = "EPIC", plot.bias = TRUE, prior.prob = TRUE, anno = ann) topGSA(gst) # Add significant genes in gene set to output gst <- gsaregion(regions = regions, all.cpg = allcpgs, collection = sets, array.type = "EPIC", plot.bias = TRUE, prior.prob = TRUE, anno = ann, sig.genes = TRUE) topGSA(gst) ## End(Not run)
## Not run: # to avoid timeout on Bioconductor build library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19) library(limma) library(DMRcate) library(ExperimentHub) library(org.Hs.eg.db) # Follow the example for the dmrcate function to get some EPIC data from # ExperimentHub eh <- ExperimentHub() FlowSorted.Blood.EPIC <- eh[["EH1136"]] tcell <- FlowSorted.Blood.EPIC[,colData(FlowSorted.Blood.EPIC)$CD4T==100 | colData(FlowSorted.Blood.EPIC)$CD8T==100] detP <- detectionP(tcell) remove <- apply(detP, 1, function (x) any(x > 0.01)) tcell <- tcell[!remove,] tcell <- preprocessFunnorm(tcell) #Subset to chr2 only tcell <- tcell[seqnames(tcell) == "chr2",] tcellms <- getM(tcell) tcellms.noSNPs <- rmSNPandCH(tcellms, dist=2, mafcut=0.05) tcell$Replicate[tcell$Replicate==""] <- tcell$Sample_Name[tcell$Replicate==""] tcellms.noSNPs <- avearrays(tcellms.noSNPs, tcell$Replicate) tcell <- tcell[,!duplicated(tcell$Replicate)] tcell <- tcell[rownames(tcellms.noSNPs),] colnames(tcellms.noSNPs) <- colnames(tcell) assays(tcell)[["M"]] <- tcellms.noSNPs assays(tcell)[["Beta"]] <- ilogit2(tcellms.noSNPs) # Perform region analysis type <- factor(tcell$CellType) design <- model.matrix(~type) myannotation <- cpg.annotate("array", tcell, arraytype = "EPIC", analysis.type="differential", design=design, coef=2) # Run DMRCate with beta value cut-off filter of 0.1 dmrcoutput <- dmrcate(myannotation, lambda=1000, C=2, betacutoff = 0.1) regions <- extractRanges(dmrcoutput) length(regions) ann <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19) # All CpG sites tested (limited to chr 2) allcpgs <- rownames(tcell) # Use org.Hs.eg.db to extract a GO term GOtoID <- suppressMessages(select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), columns=c("ENTREZID","GO"), keytype="ENTREZID")) keep.set1 <- GOtoID$GO %in% "GO:0010951" set1 <- GOtoID$ENTREZID[keep.set1] keep.set2 <- GOtoID$GO %in% "GO:0042742" set2 <- GOtoID$ENTREZID[keep.set2] keep.set3 <- GOtoID$GO %in% "GO:0031295" set3 <- GOtoID$ENTREZID[keep.set3] # Make the gene sets into a list sets <- list(set1, set2, set3) names(sets) <- c("GO:0010951","GO:0042742","GO:0031295") # Testing with prior probabilities taken into account # Plot of bias due to differing numbers of CpG sites per gene gst <- gsaregion(regions = regions, all.cpg = allcpgs, collection = sets, array.type = "EPIC", plot.bias = TRUE, prior.prob = TRUE, anno = ann) topGSA(gst) # Add significant genes in gene set to output gst <- gsaregion(regions = regions, all.cpg = allcpgs, collection = sets, array.type = "EPIC", plot.bias = TRUE, prior.prob = TRUE, anno = ann, sig.genes = TRUE) topGSA(gst) ## End(Not run)
Given a user defined list of gene sets, gsaseq
will test whether
significantly differentially expressed genes are enriched in these gene sets.
gsaseq( sig.de, universe, collection, plot.bias = FALSE, gene.length = NULL, sort = TRUE )
gsaseq( sig.de, universe, collection, plot.bias = FALSE, gene.length = NULL, sort = TRUE )
sig.de |
Character vector of significant differentially expressed genes to test for gene set enrichment. Must be Entrez Gene ID format. |
universe |
Character vector of all genes analysed in the experiment. Must be Entrez Gene ID format. |
collection |
A list of user specified gene sets to test. Can also be a single character vector gene set. Gene identifiers must be Entrez Gene IDs. |
plot.bias |
Logical, if true a plot showing gene length bias related to differential expression will be displayed. |
gene.length |
A vector containing the gene lengths for each gene in the
same order as |
sort |
Logical, if TRUE then the output dataframe is sorted by p-value. |
This function is a generalised version of goana
and kegga
from
the limma
package in that it can take a user-defined list of
differentially expressed genes and perform gene set enrichment analysis, and
is not limited to only testing GO and KEGG categories. It is not as flexible
as goana
and kegga
. Please note the vector of differentially
expressed genes and list of gene sets must be Entrez Gene IDs.
The gsaseq
function will test for enrichment using a hypergeometric
test if the gene.length
parameter is NULL. If the
gene.length
parameter is supplied then the p-values are derived from
Walllenius' noncentral hypergeometric distribution from the BiasedUrn
package. Please note that the gene.length
parameter must be in the
same order and of the same length as the universe
parameter.
A data frame with a row for each gene set and the following columns:
N |
number of genes in the gene set |
DE |
number of genes that are differentially expressed |
P.DE |
p-value for over-representation of the gene set |
FDR |
False discovery rate, calculated using the method of Benjamini and Hochberg (1995). |
Belinda Phipson
## Not run: # to avoid timeout on Bioconductor build library(org.Hs.eg.db) # Use org.Hs.eg.db to extract GO terms GOtoID <- suppressMessages(select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), columns=c("ENTREZID","GO"), keytype="ENTREZID")) head(GOtoID) # Define the universe as random sample of 20000 genes in the annotation universe <- sample(unique(GOtoID$ENTREZID),20000) # Randomly sample 500 genes as DE de.genes <- sample(universe, 500) # Generate random gene lengths for genes in universe # This is based on the true distribution of log(gene length) of genes in the # hg19 genome logGL <- rnorm(length(universe),mean=7.9, sd=1.154) genelength <- exp(logGL) # Define a list of gene sets containing two GO terms setname1 <- GOtoID$GO[1] setname1 keep.set1 <- GOtoID$GO %in% setname1 set1 <- GOtoID$ENTREZID[keep.set1] setname2 <- GOtoID$GO[2] setname2 keep.set2 <- GOtoID$GO %in% setname2 set2 <- GOtoID$ENTREZID[keep.set2] # Make the gene sets into a list sets <- list(set1, set2) names(sets) <- c(setname1,setname2) # Test for enrichment of gene sets with no gene length bias # The genes are randomly selected so we don't expect significant results gsaseq(sig.de = de.genes, universe = universe, collection = sets) # Test for enrichment of gene sets taking into account gene length bias # Since the gene lengths are randomly generated this shouldn't make much # difference to the results # Using log(gene length) or gene length doesn't make a difference to the # p-values because the probability weighting function is transformation # invariant gsaseq(sig.de = de.genes, univers = universe, collection = sets, gene.length = genelength) ## End(Not run)
## Not run: # to avoid timeout on Bioconductor build library(org.Hs.eg.db) # Use org.Hs.eg.db to extract GO terms GOtoID <- suppressMessages(select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), columns=c("ENTREZID","GO"), keytype="ENTREZID")) head(GOtoID) # Define the universe as random sample of 20000 genes in the annotation universe <- sample(unique(GOtoID$ENTREZID),20000) # Randomly sample 500 genes as DE de.genes <- sample(universe, 500) # Generate random gene lengths for genes in universe # This is based on the true distribution of log(gene length) of genes in the # hg19 genome logGL <- rnorm(length(universe),mean=7.9, sd=1.154) genelength <- exp(logGL) # Define a list of gene sets containing two GO terms setname1 <- GOtoID$GO[1] setname1 keep.set1 <- GOtoID$GO %in% setname1 set1 <- GOtoID$ENTREZID[keep.set1] setname2 <- GOtoID$GO[2] setname2 keep.set2 <- GOtoID$GO %in% setname2 set2 <- GOtoID$ENTREZID[keep.set2] # Make the gene sets into a list sets <- list(set1, set2) names(sets) <- c(setname1,setname2) # Test for enrichment of gene sets with no gene length bias # The genes are randomly selected so we don't expect significant results gsaseq(sig.de = de.genes, universe = universe, collection = sets) # Test for enrichment of gene sets taking into account gene length bias # Since the gene lengths are randomly generated this shouldn't make much # difference to the results # Using log(gene length) or gene length doesn't make a difference to the # p-values because the probability weighting function is transformation # invariant gsaseq(sig.de = de.genes, univers = universe, collection = sets, gene.length = genelength) ## End(Not run)
Post-process and summarize the results of call to RUVfit
.
RUVadj( Y, fit, var.type = c("ebayes", "standard", "pooled"), p.type = c("standard", "rsvar", "evar"), cpginfo = NULL, ... )
RUVadj( Y, fit, var.type = c("ebayes", "standard", "pooled"), p.type = c("standard", "rsvar", "evar"), cpginfo = NULL, ... )
Y |
The original data matrix used in the call to |
fit |
A RUV model fit (a |
var.type |
Which type of estimate for sigma2 should be used from the
call to |
p.type |
Which type of p-values should be used from the call to
|
cpginfo |
A matrix or dataframe containing information about the CpGs. This information is included in the summary that is returned. |
... |
Other parameters that can be passed to |
This function post-processes the results of a call to RUVfit
and then summarizes the output. The post-processing step primarily consists
of a call to ruv_summary
and variance_adjust
,
which computes various adjustments to variances, t-statistics, and and
p-values. See variance_adjust
for details. The
var.type
and p.type
options determine which of these
adjustments are used.
After post-processing, the results are summarized into a list containing 4
objects: 1) the data matrix Y; 2) a dataframe R containing information about
the rows (samples); 3) a dataframe C containing information about the
columns (features, e.g. genes), and 4) a list misc of other information
returned by RUVfit
.
An list
containing:
Y |
The original data matrix.. |
R |
A dataframe of sample-wise information, including X, Z, and any
other data passed in with |
C |
A dataframe of cpg-wise
information, including p-values, estimated regression coefficients,
estimated variances, column means, an index of the negative controls, and
any other data passed in with |
misc |
A list of
additional information returned by |
Jovana Maksimovic [email protected]
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series, B, 57, 289-300.
Gagnon-Bartsch JA, Speed TP. (2012). Using control genes to correct for unwanted variation in microarray data. Biostatistics. 13(3), 539-52. Available at: http://biostatistics.oxfordjournals.org/content/13/3/539.full.
Gagnon-Bartsch, Jacob, and Speed. 2013. Removing Unwanted Variation from High Dimensional Data with Negative Controls. Available at: http://statistics.berkeley.edu/tech-reports/820.
Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, Volume 3, Article 3. http://www.statsci.org/smyth/pubs/ebayes.pdf.
MArrayLM
, RUV2
,
RUV4
, RUVinv
, RUVrinv
,
p.adjust
, get_empirical_variances
,
sigmashrink
if(require(minfi) & require(minfiData) & require(limma)) { # Get methylation data for a 2 group comparison meth <- getMeth(MsetEx) unmeth <- getUnmeth(MsetEx) Mval <- log2((meth + 100)/(unmeth + 100)) group<-factor(pData(MsetEx)$Sample_Group) design<-model.matrix(~group) # Perform initial analysis to empirically identify negative control features # when not known a priori lFit <- lmFit(Mval,design) lFit2 <- eBayes(lFit) lTop <- topTable(lFit2,coef=2,num=Inf) # The negative control features should *not* be associated with factor of # interest but *should* be affected by unwanted variation ctl <- rownames(Mval) %in% rownames(lTop[lTop$adj.P.Val > 0.5,]) # Perform RUV adjustment and fit fit <- RUVfit(Y=Mval, X=group, ctl=ctl) fit2 <- RUVadj(Y=Mval, fit=fit) # Look at table of top results top <- topRUV(fit2) }
if(require(minfi) & require(minfiData) & require(limma)) { # Get methylation data for a 2 group comparison meth <- getMeth(MsetEx) unmeth <- getUnmeth(MsetEx) Mval <- log2((meth + 100)/(unmeth + 100)) group<-factor(pData(MsetEx)$Sample_Group) design<-model.matrix(~group) # Perform initial analysis to empirically identify negative control features # when not known a priori lFit <- lmFit(Mval,design) lFit2 <- eBayes(lFit) lTop <- topTable(lFit2,coef=2,num=Inf) # The negative control features should *not* be associated with factor of # interest but *should* be affected by unwanted variation ctl <- rownames(Mval) %in% rownames(lTop[lTop$adj.P.Val > 0.5,]) # Perform RUV adjustment and fit fit <- RUVfit(Y=Mval, X=group, ctl=ctl) fit2 <- RUVadj(Y=Mval, fit=fit) # Look at table of top results top <- topRUV(fit2) }
Provides an interface similar to lmFit
from
limma
to the RUV2
, RUV4
,
RUVinv
and RUVrinv
functions from the
ruv
package, which facilitates the removal of unwanted
variation in a differential methylation analysis. A set of negative control
variables, as described in the references, must be specified.
RUVfit( Y, X, ctl, Z = 1, k = NULL, method = c("inv", "rinv", "ruv4", "ruv2"), ... )
RUVfit( Y, X, ctl, Z = 1, k = NULL, method = c("inv", "rinv", "ruv4", "ruv2"), ... )
Y |
numeric |
X |
The factor(s) of interest. A m by p matrix, where m is the number
of samples and p is the number of factors of interest. Very often p = 1.
Factors and dataframes are also permissible, and converted to a matrix by
|
ctl |
logical vector, |
Z |
Any additional covariates to include in the model, typically a m by
q matrix. Factors and dataframes are also permissible, and converted to a
matrix by |
k |
integer, required if |
method |
character string, indicates which |
... |
additional arguments that can be passed to |
This function depends on the ruv
package and is used to
estimate and adjust for unwanted variation in a differential methylation
analysis. Briefly, the unwanted factors W
are estimated using
negative control variables. Y
is then regressed on the variables
X
, Z
, and W
. For methylation data, the analysis is
performed on the M-values, defined as the log base 2 ratio of the methylated
signal to the unmethylated signal.
A list
containing:
betahat |
The estimated coefficients of the factor(s) of interest. A p by n matrix. |
sigma2 |
Estimates of the features' variances. A vector of length n. |
t |
t statistics for the factor(s) of interest. A p by n matrix. |
p |
P-values for the factor(s) of interest. A p by n matrix. |
Fstats |
F statistics for testing all of the factors in X simultaneously.. |
Fpvals |
P-values for testing all of the factors in X simultaneously. |
multiplier |
The
constant by which |
df |
The number of residual degrees of freedom. |
W |
The estimated unwanted factors. |
alpha |
The estimated coefficients of W. |
byx |
The coefficients in a regression of Y on X (after both Y and X have been "adjusted" for Z). Useful for projection plots. |
bwx |
The coefficients in a regression of W on X (after X has been "adjusted" for Z). Useful for projection plots. |
X |
|
k |
|
ctl |
|
Z |
|
fullW0 |
Can be used to
speed up future calls of |
include.intercept |
|
method |
Character variable with value indicating which RUV method was used. Included for reference. |
Jovana Maksimovic
Gagnon-Bartsch JA, Speed TP. (2012). Using control genes to correct for unwanted variation in microarray data. Biostatistics. 13(3), 539-52. Available at: http://biostatistics.oxfordjournals.org/content/13/3/539.full.
Gagnon-Bartsch, Jacob, and Speed. 2013. Removing Unwanted Variation from High Dimensional Data with Negative Controls. Available at: http://statistics.berkeley.edu/tech-reports/820.
RUV2
, RUV4
, RUVinv
,
RUVrinv
, topRUV
if(require(minfi) & require(minfiData) & require(limma)) { # Get methylation data for a 2 group comparison meth <- getMeth(MsetEx) unmeth <- getUnmeth(MsetEx) Mval <- log2((meth + 100)/(unmeth + 100)) group <- factor(pData(MsetEx)$Sample_Group) design <- model.matrix(~group) # Perform initial analysis to empirically identify negative control features # when not known a priori lFit <- lmFit(Mval,design) lFit2 <- eBayes(lFit) lTop <- topTable(lFit2,coef=2,num=Inf) # The negative control features should *not* be associated with factor of # interest but *should* be affected by unwanted variation ctl <- rownames(Mval) %in% rownames(lTop[lTop$adj.P.Val > 0.5,]) # Perform RUV adjustment and fit fit <- RUVfit(Y=Mval, X=group, ctl=ctl) fit2 <- RUVadj(Y=Mval, fit=fit) # Look at table of top results top <- topRUV(fit2) }
if(require(minfi) & require(minfiData) & require(limma)) { # Get methylation data for a 2 group comparison meth <- getMeth(MsetEx) unmeth <- getUnmeth(MsetEx) Mval <- log2((meth + 100)/(unmeth + 100)) group <- factor(pData(MsetEx)$Sample_Group) design <- model.matrix(~group) # Perform initial analysis to empirically identify negative control features # when not known a priori lFit <- lmFit(Mval,design) lFit2 <- eBayes(lFit) lTop <- topTable(lFit2,coef=2,num=Inf) # The negative control features should *not* be associated with factor of # interest but *should* be affected by unwanted variation ctl <- rownames(Mval) %in% rownames(lTop[lTop$adj.P.Val > 0.5,]) # Perform RUV adjustment and fit fit <- RUVfit(Y=Mval, X=group, ctl=ctl) fit2 <- RUVadj(Y=Mval, fit=fit) # Look at table of top results top <- topRUV(fit2) }
Subset-quantile Within Array Normalisation (SWAN) is a within array normalisation method for the Illumina Infinium HumanMethylation450 platform. It allows Infinium I and II type probes on a single array to be normalized together.
SWAN(data, verbose = FALSE) ## S3 method for class 'MethyLumiSet' SWAN(data, verbose = FALSE) ## S3 method for class 'RGChannelSet' SWAN(data, verbose = FALSE) ## Default S3 method: SWAN(data, verbose = FALSE)
SWAN(data, verbose = FALSE) ## S3 method for class 'MethyLumiSet' SWAN(data, verbose = FALSE) ## S3 method for class 'RGChannelSet' SWAN(data, verbose = FALSE) ## Default S3 method: SWAN(data, verbose = FALSE)
data |
An object of class either |
verbose |
Should the function be verbose? |
The SWAN method has two parts. First, an average quantile distribution is created using a subset of probes defined to be biologically similar based on the number of CpGs underlying the probe body. This is achieved by randomly selecting N Infinium I and II probes that have 1, 2 and 3 underlying CpGs, where N is the minimum number of probes in the 6 sets of Infinium I and II probes with 1, 2 or 3 probe body CpGs. If no probes have previously been filtered out e.g. sex chromosome probes, etc. N=11,303. This results in a pool of 3N Infinium I and 3N Infinium II probes. The subset for each probe type is then sorted by increasing intensity. The value of each of the 3N pairs of observations is subsequently assigned to be the mean intensity of the two probe types for that row or 'quantile'. This is the standard quantile procedure. The intensities of the remaining probes are then separately adjusted for each probe type using linear interpolation between the subset probes.
An object of class MethylSet
.
NULL
NULL
NULL
SWAN uses a random subset of probes to perform the within-array
normalization. In order to achive reproducible results, the seed needs to be
set using set.seed
.
Jovana Maksimovic
J Maksimovic, L Gordon and A Oshlack (2012). SWAN: Subset quantile Within-Array Normalization for Illumina Infinium HumanMethylation450 BeadChips. Genome Biology 13, R44.
RGChannelSet
and
MethylSet
as well as MethyLumiSet
and IlluminaMethylationManifest
.
if (require(minfi) & require(minfiData)) { set.seed(100) datSwan1 <- SWAN(RGsetEx) dat <- preprocessRaw(RGsetEx) set.seed(100) datSwan2 <- SWAN(dat) head(getMeth(datSwan2)) == head(getMeth(datSwan1)) }
if (require(minfi) & require(minfiData)) { set.seed(100) datSwan1 <- SWAN(RGsetEx) dat <- preprocessRaw(RGsetEx) set.seed(100) datSwan2 <- SWAN(dat) head(getMeth(datSwan2)) == head(getMeth(datSwan1)) }
After using gsameth
, calling topGSA will output the top 20 most
significantly enriched pathways.
topGSA(gsa, number = 20, sort = TRUE)
topGSA(gsa, number = 20, sort = TRUE)
gsa |
Matrix, from output of |
number |
Scalar, number of pathway results to output. Default is 20. |
sort |
Logical, should the table be ordered by p-value. Default is TRUE. |
This function will output the top 20 most significant pathways from a
pathway analysis using the gsameth
function. The output is ordered by
p-value.
A matrix ordered by P.DE, with a row for each gene set and the following columns:
N |
number of genes in the gene set |
DE |
number of genes that are differentially methylated |
P.DE |
p-value for over-representation of the gene set |
FDR |
False discovery rate, calculated using the method of Benjamini and Hochberg (1995) |
.
SigGenesInSet |
Significant differentially methylated genes overlapping with the gene set of interest. |
Belinda Phipson
library(IlluminaHumanMethylation450kanno.ilmn12.hg19) library(org.Hs.eg.db) library(limma) ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19) # Randomly select 1000 CpGs to be significantly differentially methylated sigcpgs <- sample(rownames(ann),1000,replace=FALSE) # All CpG sites tested allcpgs <- rownames(ann) # Use org.Hs.eg.db to extract a GO term GOtoID <- toTable(org.Hs.egGO2EG) setname1 <- GOtoID$go_id[1] setname1 keep.set1 <- GOtoID$go_id %in% setname1 set1 <- GOtoID$gene_id[keep.set1] setname2 <- GOtoID$go_id[2] setname2 keep.set2 <- GOtoID$go_id %in% setname2 set2 <- GOtoID$gene_id[keep.set2] # Make the gene sets into a list sets <- list(set1, set2) names(sets) <- c(setname1,setname2) # Testing with prior probabilities taken into account # Plot of bias due to differing numbers of CpG sites per gene gst <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, plot.bias = TRUE, prior.prob = TRUE) topGSA(gst) # Testing ignoring bias gst.bias <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, prior.prob = FALSE) topGSA(gst.bias)
library(IlluminaHumanMethylation450kanno.ilmn12.hg19) library(org.Hs.eg.db) library(limma) ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19) # Randomly select 1000 CpGs to be significantly differentially methylated sigcpgs <- sample(rownames(ann),1000,replace=FALSE) # All CpG sites tested allcpgs <- rownames(ann) # Use org.Hs.eg.db to extract a GO term GOtoID <- toTable(org.Hs.egGO2EG) setname1 <- GOtoID$go_id[1] setname1 keep.set1 <- GOtoID$go_id %in% setname1 set1 <- GOtoID$gene_id[keep.set1] setname2 <- GOtoID$go_id[2] setname2 keep.set2 <- GOtoID$go_id %in% setname2 set2 <- GOtoID$gene_id[keep.set2] # Make the gene sets into a list sets <- list(set1, set2) names(sets) <- c(setname1,setname2) # Testing with prior probabilities taken into account # Plot of bias due to differing numbers of CpG sites per gene gst <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, plot.bias = TRUE, prior.prob = TRUE) topGSA(gst) # Testing ignoring bias gst.bias <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, prior.prob = FALSE) topGSA(gst.bias)
Extract a table of the top-ranked CpGs from a linear model fit after
performing a differential methylation analysis using RUVfit
and
RUVadj
.
topRUV(fitsum, number = 10, sort.by = c("p", "F.p"), p.BH = 1)
topRUV(fitsum, number = 10, sort.by = c("p", "F.p"), p.BH = 1)
fitsum |
An object containing the summary fit object produced by
|
number |
integer, maximum number of genes to list. Default is 10. |
sort.by |
character string, what the results should be sorted by. Default is unadjusted p-value. |
p.BH |
numeric, cutoff value for Benjamini-Hochberg adjusted p-values. Only features with lower p-values are listed. Must be between 0 and 1. Default is 1. |
This function summarises the results of a differential methylation analysis
performed using RUVfit
, followed by RUVadj
. The top ranked
CpGs are sorted by p-value.
Produces a dataframe with rows corresponding to the top
number
CpGs and the following columns: F.p F.p.BH p_X1 p.BH_X1 b_X1
sigma2 var.b_X1 fit.ctl mean
F.p |
P-values for testing all of the factors of interest simultaneously. |
F.p.BH |
Benjamini-Hochberg adjusted p-values for testing all of the factors of interest simultaneously. |
p_X1 |
p-values for the factor of interest. |
p.BH_X1 |
Benjamini-Hochberg adjusted p-values for the factor of interest. |
b_X1 |
The estimated coefficients of the factor of interest. |
sigma2 |
Estimate of the methylation variance. |
var.b_X1 |
Variance estimate of |
fit.ctl |
logical, indicating whether CpG was designated as a negative control. |
mean |
The mean methylation (M-value). |
Jovana Maksimovic [email protected]
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series, B, 57, 289-300.
Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, Volume 3, Article 3. http://www.statsci.org/smyth/pubs/ebayes.pdf.
if(require(minfi) & require(minfiData) & require(limma)){ # Get methylation data for a 2 group comparison meth <- getMeth(MsetEx) unmeth <- getUnmeth(MsetEx) Mval <- log2((meth + 100)/(unmeth + 100)) group <- factor(pData(MsetEx)$Sample_Group) design <- model.matrix(~group) # Perform initial analysis to empirically identify negative control features # when *not* known a priori lFit <- lmFit(Mval,design) lFit2 <- eBayes(lFit) lTop <- topTable(lFit2,coef=2,num=Inf) # The negative control features should *not* be associated with factor of # interest but *should* be affected by unwanted variation ctl <- rownames(Mval) %in% rownames(lTop[lTop$adj.P.Val > 0.5,]) # Perform RUV adjustment and fit fit <- RUVfit(Y=Mval, X=group, ctl=ctl) fit2 <- RUVadj(Y=Mval, fit=fit) # Look at table of top results top <- topRUV(fit2) }
if(require(minfi) & require(minfiData) & require(limma)){ # Get methylation data for a 2 group comparison meth <- getMeth(MsetEx) unmeth <- getUnmeth(MsetEx) Mval <- log2((meth + 100)/(unmeth + 100)) group <- factor(pData(MsetEx)$Sample_Group) design <- model.matrix(~group) # Perform initial analysis to empirically identify negative control features # when *not* known a priori lFit <- lmFit(Mval,design) lFit2 <- eBayes(lFit) lTop <- topTable(lFit2,coef=2,num=Inf) # The negative control features should *not* be associated with factor of # interest but *should* be affected by unwanted variation ctl <- rownames(Mval) %in% rownames(lTop[lTop$adj.P.Val > 0.5,]) # Perform RUV adjustment and fit fit <- RUVfit(Y=Mval, X=group, ctl=ctl) fit2 <- RUVadj(Y=Mval, fit=fit) # Look at table of top results top <- topRUV(fit2) }
Extract a table of the top-ranked CpGs from a linear model fit after a differential variability analysis.
topVar(fit, coef = NULL, number = 10, sort = TRUE)
topVar(fit, coef = NULL, number = 10, sort = TRUE)
fit |
List containing a linear model fit produced by |
coef |
Column number or column name specifying which coefficient of the linear model fit is of interest. It should be the same coefficient that the differential variability testing was performed on. Default is last column of fit object. |
number |
Maximum number of genes to list. Default is 10. |
sort |
Logical, default is TRUE. Sorts output according the P-value. FALSE will return results in same order as fit object. |
This function summarises the results of a differential variability analysis
performed with varFit
. The p-values from the comparison of interest
are adjusted using Benjamini and Hochberg's false discovery rate with the
function p.adjust
. The top ranked CpGs are selected by first ranking
the adjusted p-values, then ranking the raw p-values. At this time no other
sorting option is catered for.
Produces a dataframe with rows corresponding to the top CpGs and the following columns:
genelist |
one or more columns of annotation for
each CpG, if the gene information is available in |
AvgVar |
average of the absolute or squared Levene residuals across all samples |
DiffVar |
estimate of the difference in the Levene residuals corresponding to the comparison of interest |
t |
moderated t-statistic |
P.Value |
raw p-value |
Adj.P.Value |
adjusted p-value |
Belinda Phipson
Phipson, B., and Oshlack, A. (2014). A method for detecting differential variability in methylation data shows CpG islands are highly variably methylated in cancers. Genome Biology, 15:465.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series, B, 57, 289-300.
varFit
, p.adjust
# Randomly generate data for a 2 group problem with 100 CpG sites and 5 # arrays in each group. y<-matrix(rnorm(1000),ncol=10) group<-factor(rep(c(1,2),each=5)) design<-model.matrix(~group) # Fit linear model for differential variability vfit<-varFit(y,design) # Look at top table of results topVar(vfit,coef=2)
# Randomly generate data for a 2 group problem with 100 CpG sites and 5 # arrays in each group. y<-matrix(rnorm(1000),ncol=10) group<-factor(rep(c(1,2),each=5)) design<-model.matrix(~group) # Fit linear model for differential variability vfit<-varFit(y,design) # Look at top table of results topVar(vfit,coef=2)
Fit linear model on mean absolute or squared deviations for each CpG given a series of methylation arrays
varFit( data, design = NULL, coef = NULL, type = NULL, trend = TRUE, robust = TRUE, weights = NULL ) ## S3 method for class 'MethylSet' varFit( data, design = NULL, coef = NULL, type = NULL, trend = TRUE, robust = TRUE, weights = NULL ) ## S3 method for class 'DGEList' varFit( data, design = NULL, coef = NULL, type = NULL, trend = TRUE, robust = TRUE, weights = NULL ) ## Default S3 method: varFit( data, design = NULL, coef = NULL, type = NULL, trend = TRUE, robust = TRUE, weights = NULL )
varFit( data, design = NULL, coef = NULL, type = NULL, trend = TRUE, robust = TRUE, weights = NULL ) ## S3 method for class 'MethylSet' varFit( data, design = NULL, coef = NULL, type = NULL, trend = TRUE, robust = TRUE, weights = NULL ) ## S3 method for class 'DGEList' varFit( data, design = NULL, coef = NULL, type = NULL, trend = TRUE, robust = TRUE, weights = NULL ) ## Default S3 method: varFit( data, design = NULL, coef = NULL, type = NULL, trend = TRUE, robust = TRUE, weights = NULL )
data |
Object of class |
design |
The design matrix of the experiment, with rows corresponding to arrays/samples and columns to coefficients to be estimated. Defaults to the unit vector. |
coef |
The columns of the design matrix containing the comparisons to test for differential variability. Defaults to all columns of design matrix. |
type |
Character string, |
trend |
Logical, if true fits a mean variance trend on the absolute or squared deviations. |
robust |
Logical, if true performs robust empirical Bayes shrinkage of the variances for the moderated t statistics. |
weights |
Non-negative observation weights. Can be a numeric matrix of individual weights, of same size as the object matrix, or a numeric vector of array weights, or a numeric vector of gene/feature weights. |
This function depends on the limma
package and is used to rank
features such as CpG sites or genes in order of evidence of differential
variability between different comparisons corresponding to the columns of
the design matrix. A measure of variability is calculated for each CpG in
each sample by subtracting out the group mean and taking the absolute or
squared deviation. A linear model is then fitted to the absolute or squared
deviations. The residuals of the linear model fit are subjected to empirical
Bayes shrinkage and moderated t statistics (Smyth, 2004) calculated. False
discovery rates are calculated using the method of Benjamini and Hochberg
(1995).
Please always specify the coef
parameter in the call to varFit
,
which indicates which groups are to be tested for differential variability.
If coef
is not specified, then group means are estimated based on all
the columns of the design matrix and subtracted out before testing for
differential variability. If the design matrix contains nuisance parameters,
then subsetting the design matrix columns by coef
should remove these
columns from the design matrix. If the design matrix includes an intercept
term, this should be included in coef
. The nuisance parameters are
included in the linear model fit to the absolute or squared deviations, but
should not be considered when subtracting group means to obtain the
deviations. Note that design matrices without an intercept term are
permitted, and specific contrasts tested using the function
contrasts.varFit
.
For methylation data, the analysis is performed on the M-values, defined as
the log base 2 ratio of the methylated signal to the unmethylated signal. If
a MethylSet
object is supplied, M-values are extracted with an offset
of 100 added to the numerator and denominator.
For testing differential variability on RNA-Seq data, a DGEList
object can be supplied directly to the function. A voom
transformation is applied before testing for differential variability. The
weights calculated in voom
are used in the linear model fit.
Since the output is of class MArrayLM
, any functions that can be
applied to fit objects from lmFit
and eBayes
can be applied,
for example, topTable
and decideTests
.
Produces an object of class MArrayLM
(see
MArrayLM-class
) containing everything found in a fitted model
object produced by lmFit
and eBayes
as well as a vector
containing the sample CpG-wise variances and a matrix of LogVarRatios
corresponding to the differential variability analysis.
NULL
NULL
NULL
Belinda Phipson
Phipson, B., and Oshlack, A. (2014). A method for detecting differential variability in methylation data shows CpG islands are highly variably methylated in cancers. Genome Biology, 15:465.
Smyth, G.K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, Volume 3, Article 3.
Smyth, G. K. (2005). Limma: linear models for microarray data. In: Bioinformatics and Computational Biology Solutions using R and Bioconductor. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds), Springer, New York, 2005.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series, B, 57, 289-300.
contrasts.varFit
, topVar
,
getLeveneResiduals
, lmFit
, eBayes
,
topTable
, decideTests
, voom
# Randomly generate data for a 2 group problem with 100 CpG sites and 5 # arrays in each # group. y<-matrix(rnorm(1000),ncol=10) group<-factor(rep(c(1,2),each=5)) design<-model.matrix(~group) # Fit linear model for differential variability vfit<-varFit(y,design,coef=c(1,2)) # Look at top table of results topVar(vfit,coef=2)
# Randomly generate data for a 2 group problem with 100 CpG sites and 5 # arrays in each # group. y<-matrix(rnorm(1000),ncol=10) group<-factor(rep(c(1,2),each=5)) design<-model.matrix(~group) # Fit linear model for differential variability vfit<-varFit(y,design,coef=c(1,2)) # Look at top table of results topVar(vfit,coef=2)