The missMethyl package contains functions to analyse methylation data from Illumina’s HumanMethylation450 and MethylationEPIC beadchip. These arrays are a cost-effective alternative to whole genome bisulphite sequencing, and as such are widely used to profile DNA methylation. Specifically, missMethyl contains functions to perform SWAN normalisation (Maksimovic, Gordon, and Oshlack 2012),perform differential methylation analysis using RUVm (Maksimovic et al. 2015), differential variability analysis (Phipson and Oshlack 2014) and gene set enrichment analysis (Phipson, Maksimovic, and Oshlack 2016). As our lab’s research into specialised analyses of these arrays continues we anticipate that the package will be updated with new functions.
Raw data files are in IDAT format, which can be read into R using the
minfi
package (Aryee et al. 2014). Statistical
analyses are usually performed on M-values, and β values are used for visualisation,
both of which can be extracted from MethylSet
data objects,
which is a class of object created by minfi. For
detecting differentially variable CpGs we recommend that the analysis is
performed on M-values. All analyses described here are performed at the
CpG site level.
The missMethyl package has been updated to support Illumina’s MethylationEPIC v2.0 beadchip. This array contains replicate probes which map to the same CpG dinucleotides. We recommend filtering these replicates prior to analysis with the missMethyl package using the DMRcate package. An example of this is shown in this vignette.
We will use the data in the minfiData
package to demonstrate the functions in missMethyl.
The example dataset has 6 samples across two slides. There are 3 cancer
samples and 3 normal sample. The sample information is in the targets
file. An essential column in the targets file is the
Basename
column which tells us where the idat files to be
read in are located. The R commands to read in the data are taken from
the minfi
User’s Guide. For additional details on how to read the IDAT files into
R, as well as information regarding quality control please refer to the
minfi
User’s Guide.
## [1] "/github/workspace/pkglib/minfiData/extdata/SampleSheet.csv"
## Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID person age sex
## 1 GroupA_3 H5 <NA> GroupA <NA> id3 83 M
## 2 GroupA_2 D5 <NA> GroupA <NA> id2 58 F
## 3 GroupB_3 C6 <NA> GroupB <NA> id3 83 M
## 4 GroupB_1 F7 <NA> GroupB <NA> id1 75 F
## 5 GroupA_1 G7 <NA> GroupA <NA> id1 75 F
## 6 GroupB_2 H7 <NA> GroupB <NA> id2 58 F
## status
## 1 normal
## 2 normal
## 3 cancer
## 4 cancer
## 5 normal
## 6 cancer
## Array Slide
## 1 R02C02 5723646052
## 2 R04C01 5723646052
## 3 R05C02 5723646052
## 4 R04C02 5723646053
## 5 R05C02 5723646053
## 6 R06C02 5723646053
## Basename
## 1 /github/workspace/pkglib/minfiData/extdata/5723646052/5723646052_R02C02
## 2 /github/workspace/pkglib/minfiData/extdata/5723646052/5723646052_R04C01
## 3 /github/workspace/pkglib/minfiData/extdata/5723646052/5723646052_R05C02
## 4 /github/workspace/pkglib/minfiData/extdata/5723646053/5723646053_R04C02
## 5 /github/workspace/pkglib/minfiData/extdata/5723646053/5723646053_R05C02
## 6 /github/workspace/pkglib/minfiData/extdata/5723646053/5723646053_R06C02
The data is now an RGChannelSet
object and needs to be
normalised and converted to a MethylSet
object.
When using MethylationEPIC v2.0 datasets, the array and annotation
details in the RGChannelSet
object must be set
manually.
SWAN (subset-quantile within array normalization) is a within-array normalization method for Illumina 450k & EPIC BeadChips. Technical differencs have been demonstrated to exist between the Infinium I and Infinium II assays on a single Illumina HumanMethylation array Dedeurwaerder, Defrance, and Calonne (2011). Using the SWAN method substantially reduces the technical variability between the assay designs whilst maintaining important biological differences. The SWAN method makes the assumption that the number of CpGs within the 50bp probe sequence reflects the underlying biology of the region being interrogated. Hence, the overall distribution of intensities of probes with the same number of CpGs in the probe body should be the same regardless of assay type. The method then uses a subset quantile normalization approach to adjust the intensities of each array (Maksimovic, Gordon, and Oshlack 2012).
SWAN
can take a MethylSet
,
RGChannelSet
or MethyLumiSet
as input. It
should be noted that, in order to create the normalization subset,
SWAN
randomly selects Infinium I and II probes that have
one, two and three underlying CpGs; as such, we recommend using
set.seed
before to ensure that the normalized intensities
will be identical, if the normalization is repeated.
The technical differences between Infinium I and II assay designs can result in aberrant β value distributions (Figure @ref(fig:betasByType), panel “Raw”). Using SWAN corrects for the technical differences between the Infinium I and II assay designs and produces a smoother overall β value distribution (Figure @ref(fig:betasByType), panel “SWAN”).
## [SWAN] Preparing normalization subset
## 450k
## [SWAN] Normalizing methylated channel
## [SWAN] Normalizing array 1 of 6
## [SWAN] Normalizing array 2 of 6
## [SWAN] Normalizing array 3 of 6
## [SWAN] Normalizing array 4 of 6
## [SWAN] Normalizing array 5 of 6
## [SWAN] Normalizing array 6 of 6
## [SWAN] Normalizing unmethylated channel
## [SWAN] Normalizing array 1 of 6
## [SWAN] Normalizing array 2 of 6
## [SWAN] Normalizing array 3 of 6
## [SWAN] Normalizing array 4 of 6
## [SWAN] Normalizing array 5 of 6
## [SWAN] Normalizing array 6 of 6
par(mfrow=c(1,2), cex=1.25)
densityByProbeType(mSet[,1], main = "Raw")
densityByProbeType(mSetSw[,1], main = "SWAN")
Poor quality probes can be filtered out based on the detection p-value. For this example, to retain a CpG for further analysis, we require that the detection p-value is less than 0.01 in all samples.
Now that the data has been SWAN
normalised we can
extract β and M-values from
the object. We prefer to add an offset to the methylated and
unmethylated intensities when calculating M-values, hence we extract the
methylated and unmethylated channels separately and perform our own
calculation. For all subsequent analyses we use a random selection of
20000 CpGs to reduce computation time.
set.seed(10)
mset_reduced <- mSetSw[sample(1:nrow(mSetSw), 20000),]
meth <- getMeth(mset_reduced)
unmeth <- getUnmeth(mset_reduced)
Mval <- log2((meth + 100)/(unmeth + 100))
beta <- getBeta(mset_reduced)
dim(Mval)
## [1] 20000 6
par(mfrow=c(1,1))
plotMDS(Mval, labels=targets$Sample_Name, col=as.integer(factor(targets$status)))
legend("topleft",legend=c("Cancer","Normal"),pch=16,cex=1.2,col=1:2)
An MDS plot (Figure @ref(fig:mdsplot)) is a good sanity check to make sure samples cluster together according to the main factor of interest, in this case, cancer and normal.
The MethylationEPIC v2.0 array contains replicate probes which map to the same CpG dinucleotides. We recommend filtering these replicate probes using the DMRcate package as shown below. Please see the [DMRcate](https://bioconductor.org/packages/3.20/DMRcate) EPICv2 vignette for more details.
### This code is not run within this vignette
# For EPIC_V2, please run these lines to remove replicate and non-cg probes:
# remove non-cg probes
Mval <- Mval[grepl("^cg", rownames(Mval)),]
# select replicate probes with best sensitivity
Mval <- DMRcate::rmPosReps(Mval, filter.strategy="sensitivity")
### End of not run
To test for differential methylation we use the limma
package (Smyth 2005), which employs an
empirical Bayes framework based on Guassian model theory. First we need
to set up the design matrix. There are a number of ways to do this, the
most straightforward is directly from the targets file. There are a
number of variables, with the status
column indicating
cancer/normal samples. From the person
column of the targets file, we see that the
cancer/normal samples are matched, with 3 individuals
each contributing both a cancer and
normal sample. Since the limma model
framework can handle any experimental design which can be summarised by
a design matrix, we can take into account the paired nature of the data
in the analysis. For more complicated experimental designs, please refer
to the limma
User’s Guide.
group <- factor(targets$status,levels=c("normal","cancer"))
id <- factor(targets$person)
design <- model.matrix(~id + group)
design
## (Intercept) idid2 idid3 groupcancer
## 1 1 0 1 0
## 2 1 1 0 0
## 3 1 0 1 1
## 4 1 0 0 1
## 5 1 0 0 0
## 6 1 1 0 1
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$id
## [1] "contr.treatment"
##
## attr(,"contrasts")$group
## [1] "contr.treatment"
Now we can test for differential methylation using the
lmFit
and eBayes
functions from limma. As
input data we use the matrix of M-values.
The numbers of hyper-methylated (1) and hypo-methylated (-1) can be
displayed using the decideTests
function in limma and
the top 10 differentially methylated CpGs for cancer versus
normal extracted using topTable
.
## (Intercept) idid2 idid3 groupcancer
## Down 6966 0 118 562
## NotSig 3359 20000 19877 18969
## Up 9675 0 5 469
## logFC AveExpr t P.Value adj.P.Val B
## cg16969623 4.732328 -1.3236286 18.19048 5.203068e-06 0.02960379 4.430065
## cg24115221 4.268874 -0.4492736 16.37216 9.053483e-06 0.02960379 4.065794
## cg13692446 4.345686 0.3051789 15.70099 1.127710e-05 0.02960379 3.912218
## cg11334818 4.032687 0.9351987 15.30336 1.289968e-05 0.02960379 3.815707
## cg10341556 4.987135 -1.5818239 15.38701 1.360335e-05 0.02960379 3.764979
## cg17815252 3.698873 -2.3661852 14.69926 1.592661e-05 0.02960379 3.660599
## cg24331301 3.853080 -1.4333732 14.51856 1.699062e-05 0.02960379 3.612100
## cg26976732 3.851935 -0.9336429 14.37287 1.791004e-05 0.02960379 3.572265
## cg26328335 3.713341 0.3506268 14.10089 1.978915e-05 0.02960379 3.496085
## cg05621343 4.265564 -0.5639405 14.10660 2.029158e-05 0.02960379 3.473979
Note that since we performed our analysis on M-values, the
logFC
and AveExpr
columns are computed on the
M-value scale. For interpretability and visualisation we can look at the
β values. The β values for the top 4
differentially methylated CpGs shown in Figure @ref(fig:top4).
cpgs <- rownames(top)
par(mfrow=c(2,2))
for(i in 1:4){
stripchart(beta[rownames(beta)==cpgs[i],]~design[,4],method="jitter",
group.names=c("Normal","Cancer"),pch=16,cex=1.5,col=c(4,2),ylab="Beta values",
vertical=TRUE,cex.axis=1.5,cex.lab=1.5)
title(cpgs[i],cex.main=1.5)
}
Like other platforms, 450k array studies are subject to unwanted technical variation such as batch effects and other, often unknown, sources of variation. The adverse effects of unwanted variation have been extensively documented in gene expression array studies and have been shown to be able to both reduce power to detect true differences and to increase the number of false discoveries. As such, when it is apparent that data is significantly affected by unwanted variation, it is advisable to perform an adjustment to mitigate its effects.
missMethyl provides a limma inspired interface to functions from the CRAN package ruv, which enable the removal of unwanted variation when performing a differential methylation analysis (Maksimovic et al. 2015).
RUVfit
uses the RUV-inverse method by default,
as this does not require the user to specify a k parameter. The ridged version of
RUV-inverse is also available by setting
method = rinv
. The RUV-2 and RUV-4
functions can also be used by setting method = ruv2
or
method = ruv4
, respectively, and specifying an appropriate
value for k (number of components of unwanted variation to
remove) where 0 ≤ k < no.samples.
All of the methods rely on negative control features to accurately estimate the components of unwanted variation. Negative control features are probes/genes/etc. that are known a priori to not truly be associated with the biological factor of interest, but are affected by unwanted variation. For example, in a microarray gene expression study, these could be house-keeping genes or a set of spike-in controls. Negative control features are extensively discussed in Gagnon-Bartsch and Speed (2012) and Gagnon-Bartsch et al. (2013). Once the unwanted factors are accurately estimated from the data, they are adjusted for in the linear model that describes the differential analysis.
If the negative control features are not known a priori, they can be identified empirically. This can be achieved via a 2-stage approach, RUVm. Stage 1 involves performing a differential methylation analysis using RUV-inverse (by default) and the 613 Illumina negative controls (INCs) as negative control features. This will produce a list of CpGs ranked by p-value according to their level of association with the factor of interest. This list can then be used to identify a set of empirical control probes (ECPs), which will capture more of the unwanted variation than using the INCs alone. ECPs are selected by designating a proportion of the CpGs least associated with the factor of interest as negative control features; this can be done based on either an FDR cut-off or by taking a fixed percentage of probes from the bottom of the ranked list. Stage 2 involves performing a second differential methylation analysis on the original data using RUV-inverse (by default) and the ECPs. For simplicity, we are ignoring the paired nature of the cancer and normal samples in this example.
# get M-values for ALL probes
meth <- getMeth(mSet)
unmeth <- getUnmeth(mSet)
M <- log2((meth + 100)/(unmeth + 100))
### This code is not run within this vignette
# For EPIC_V2, please run these lines to remove replicate and non-cg probes:
# remove non-cg probes
M <- M[grepl("^cg", rownames(M)),]
# select replicate probes with best sensitivity
M <- DMRcate::rmPosReps(M, filter.strategy="sensitivity")
### End of not run
# setup the factor of interest
grp <- factor(targets$status, labels=c(0,1))
# extract Illumina negative control data
INCs <- getINCs(rgSet)
head(INCs)
## 5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
## 13792480 -0.3299654 -1.0955482 -0.5266103
## 69649505 -1.0354488 -1.4943396 -1.0067050
## 34772371 -1.1286422 -0.2995603 -0.8192636
## 28715352 -0.5553373 -0.7599489 -0.7186973
## 74737439 -1.1169178 -0.8656399 -0.6429681
## 33730459 -0.7714684 -0.5622424 -0.7724825
## 5723646053_R04C02 5723646053_R05C02 5723646053_R06C02
## 13792480 -0.6374299 -1.116598 -0.4332793
## 69649505 -0.8854881 -1.586679 -0.9217329
## 34772371 -0.6895514 -1.161155 -0.6186795
## 28715352 -1.7903619 -1.348105 -1.0067259
## 74737439 -0.8872082 -1.064986 -0.9841833
## 33730459 -1.5623138 -2.079184 -1.0445246
# add negative control data to M-values
Mc <- rbind(M,INCs)
# create vector marking negative controls in data matrix
ctl1 <- rownames(Mc) %in% rownames(INCs)
table(ctl1)
## ctl1
## FALSE TRUE
## 485512 613
rfit1 <- RUVfit(Y = Mc, X = grp, ctl = ctl1) # Stage 1 analysis
rfit2 <- RUVadj(Y = Mc, fit = rfit1)
Now that we have performed an initial differential methylation analysis to rank the CpGs with respect to their association with the factor of interest, we can designate the CpGs that are least associated with the factor of interest based on FDR-adjusted p-value as ECPs.
## F.p F.p.BH p_X1.1 p.BH_X1.1 b_X1.1 sigma2
## cg04743961 3.516091e-07 0.01017357 3.516091e-07 0.01017357 -4.838190 0.10749571
## cg07155336 3.583107e-07 0.01017357 3.583107e-07 0.01017357 -5.887409 0.16002329
## cg20925841 3.730375e-07 0.01017357 3.730375e-07 0.01017357 -4.790211 0.10714494
## cg03607359 4.721205e-07 0.01017357 4.721205e-07 0.01017357 -4.394397 0.09636129
## cg10566121 5.238865e-07 0.01017357 5.238865e-07 0.01017357 -4.787914 0.11780108
## cg07655636 6.080091e-07 0.01017357 6.080091e-07 0.01017357 -4.571758 0.11201883
## var.b_X1.1 fit.ctl mean
## cg04743961 0.07729156 FALSE -2.31731496
## cg07155336 0.11505994 FALSE -1.27676413
## cg20925841 0.07703935 FALSE -0.87168892
## cg03607359 0.06928569 FALSE -2.19187147
## cg10566121 0.08470133 FALSE 0.03138961
## cg07655636 0.08054378 FALSE -1.29294851
## ctl2
## FALSE TRUE
## 172540 312972
We can then use the ECPs to perform a second differential methylation with RUV-inverse, which is adjusted for the unwanted variation estimated from the data.
# Perform RUV adjustment and fit
rfit3 <- RUVfit(Y = M, X = grp, ctl = ctl2) # Stage 2 analysis
rfit4 <- RUVadj(Y = M, fit = rfit3)
# Look at table of top results
topRUV(rfit4)
## F.p F.p.BH p_X1.1 p.BH_X1.1 b_X1.1 sigma2 var.b_X1.1 fit.ctl
## cg07155336 1e-24 1e-24 1e-24 1e-24 -5.769286 0.1668414 0.1349771 FALSE
## cg06463958 1e-24 1e-24 1e-24 1e-24 -5.733093 0.1668414 0.1349771 FALSE
## cg00024472 1e-24 1e-24 1e-24 1e-24 -5.662959 0.1668414 0.1349771 FALSE
## cg02040433 1e-24 1e-24 1e-24 1e-24 -5.651399 0.1668414 0.1349771 FALSE
## cg13355248 1e-24 1e-24 1e-24 1e-24 -5.595396 0.1668414 0.1349771 FALSE
## cg02467990 1e-24 1e-24 1e-24 1e-24 -5.592707 0.1668414 0.1349771 FALSE
## cg00817367 1e-24 1e-24 1e-24 1e-24 -5.527501 0.1668414 0.1349771 FALSE
## cg11396157 1e-24 1e-24 1e-24 1e-24 -5.487992 0.1668414 0.1349771 FALSE
## cg16306898 1e-24 1e-24 1e-24 1e-24 -5.466780 0.1668414 0.1349771 FALSE
## cg03735888 1e-24 1e-24 1e-24 1e-24 -5.396242 0.1668414 0.1349771 FALSE
## mean
## cg07155336 -1.2767641
## cg06463958 0.2776252
## cg00024472 -2.4445762
## cg02040433 -1.2918259
## cg13355248 -0.8483387
## cg02467990 -0.4154370
## cg00817367 -0.2911294
## cg11396157 -1.3800170
## cg16306898 -2.1469768
## cg03735888 -1.1527557
If the number of samples in your experiment is greater than the number of Illumina negative controls on the array platform used - 613 for 450k, 411 for EPIC - stage 1 of RUVm will not work. In such cases, we recommend performing a standard limma analysis in stage 1.
## (Intercept) grp1
## 1 1 1
## 2 1 1
## 3 1 0
## 4 1 0
## 5 1 1
## 6 1 0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$grp
## [1] "contr.treatment"
# limma differential methylation analysis
lfit1 <- lmFit(M, design=des)
lfit2 <- eBayes(lfit1) # Stage 1 analysis
# Look at table of top results
topTable(lfit2)
## Removing intercept from test coefficients
## logFC AveExpr t P.Value adj.P.Val B
## cg07155336 -6.037439 -1.276764 -19.22210 1.175108e-07 0.005755968 7.635736
## cg04743961 -4.887986 -2.317315 -19.21709 1.177367e-07 0.005755968 7.634494
## cg03607359 -4.393946 -2.191871 -18.07007 1.852304e-07 0.005755968 7.334032
## cg13272280 -4.559707 -2.099665 -17.25531 2.599766e-07 0.005755968 7.099628
## cg22263007 -4.438420 -1.010994 -17.12384 2.749857e-07 0.005755968 7.060036
## cg03556069 -5.456754 -1.811718 -17.00720 2.891269e-07 0.005755968 7.024476
## cg08443814 -4.597347 -2.062275 -16.80835 3.151706e-07 0.005755968 6.962907
## cg18672939 -5.159383 -0.705992 -16.65643 3.368597e-07 0.005755968 6.915046
## cg24385334 -4.157473 -1.943370 -16.59313 3.463909e-07 0.005755968 6.894890
## cg18044663 -4.426118 -1.197724 -16.57851 3.486357e-07 0.005755968 6.890216
The results of this can then be used to define ECPs for stage 2, as in the previous example.
## Removing intercept from test coefficients
## logFC AveExpr t P.Value adj.P.Val B
## cg07155336 -6.037439 -1.276764 -19.22210 1.175108e-07 0.005755968 7.635736
## cg04743961 -4.887986 -2.317315 -19.21709 1.177367e-07 0.005755968 7.634494
## cg03607359 -4.393946 -2.191871 -18.07007 1.852304e-07 0.005755968 7.334032
## cg13272280 -4.559707 -2.099665 -17.25531 2.599766e-07 0.005755968 7.099628
## cg22263007 -4.438420 -1.010994 -17.12384 2.749857e-07 0.005755968 7.060036
## cg03556069 -5.456754 -1.811718 -17.00720 2.891269e-07 0.005755968 7.024476
## ctl3
## FALSE TRUE
## 199150 286362
We can then use the ECPs to perform a second differential methylation
with RUV-inverse
as before.
# Perform RUV adjustment and fit
rfit5 <- RUVfit(Y = M, X = grp, ctl = ctl3) # Stage 2 analysis
rfit6 <- RUVadj(Y = M, fit = rfit5)
# Look at table of top results
topRUV(rfit6)
## F.p F.p.BH p_X1.1 p.BH_X1.1 b_X1.1 sigma2 var.b_X1.1 fit.ctl
## cg06463958 1e-24 1e-24 1e-24 1e-24 -5.910598 0.1667397 0.1170589 FALSE
## cg07155336 1e-24 1e-24 1e-24 1e-24 -5.909549 0.1667397 0.1170589 FALSE
## cg02467990 1e-24 1e-24 1e-24 1e-24 -5.841079 0.1667397 0.1170589 FALSE
## cg00024472 1e-24 1e-24 1e-24 1e-24 -5.823529 0.1667397 0.1170589 FALSE
## cg01893212 1e-24 1e-24 1e-24 1e-24 -5.699627 0.1667397 0.1170589 FALSE
## cg11396157 1e-24 1e-24 1e-24 1e-24 -5.699331 0.1667397 0.1170589 FALSE
## cg13355248 1e-24 1e-24 1e-24 1e-24 -5.658606 0.1667397 0.1170589 FALSE
## cg00817367 1e-24 1e-24 1e-24 1e-24 -5.649284 0.1667397 0.1170589 FALSE
## cg16306898 1e-24 1e-24 1e-24 1e-24 -5.610118 0.1667397 0.1170589 FALSE
## cg16556906 1e-24 1e-24 1e-24 1e-24 -5.567659 0.1667397 0.1170589 FALSE
## mean
## cg06463958 0.27762518
## cg07155336 -1.27676413
## cg02467990 -0.41543703
## cg00024472 -2.44457624
## cg01893212 -0.08273355
## cg11396157 -1.38001701
## cg13355248 -0.84833866
## cg00817367 -0.29112939
## cg16306898 -2.14697683
## cg16556906 -0.96821744
To visualise the effect that the RUVm adjustment is
having on the data, using an MDS plot for example, the
getAdj
function can be used to extract the adjusted values
from the RUVm fit object produced by
RUVfit
. NOTE: The adjusted values should only be used for
visualisations - it is NOT recommended that they are used in any
downstream analysis.
The MDS plots below show how the relationship between the samples changes with and without RUVm adjustment. RUVm reduces the distance between the samples in each group by removing unwanted variation. It can be useful to examine this type of plot when trying to decide on the best set of ECPs or to help select the optimal value of k, if using RUV-4 or RUV-2.
par(mfrow=c(1,2))
plotMDS(M, labels=targets$Sample_Name, col=as.integer(factor(targets$status)),
main="Unadjusted", gene.selection = "common")
legend("right",legend=c("Cancer","Normal"),pch=16,cex=1,col=1:2)
plotMDS(Madj, labels=targets$Sample_Name, col=as.integer(factor(targets$status)),
main="Adjusted: RUV-inverse", gene.selection = "common")
## Warning in plotMDS.default(Madj, labels = targets$Sample_Name, col =
## as.integer(factor(targets$status)), : dimension 2 is degenerate or all zero
To illustrate how the getAdj
function can be used to
help select an appropriate value for k, we will run the second stage of
the RUVm analysis using RUV-4 with two
different k values.
# Use RUV-4 in stage 2 of RUVm with k=1 and k=2
rfit7 <- RUVfit(Y = M, X = grp, ctl = ctl3,
method = "ruv4", k=1) # Stage 2 with RUV-4, k=1
rfit9 <- RUVfit(Y = M, X = grp, ctl = ctl3,
method = "ruv4", k=2) # Stage 2 with RUV-4, k=2
# get adjusted values
Madj1 <- getAdj(M, rfit7)
Madj2 <- getAdj(M, rfit9)
The following MDS plots show how the relationship between the samples changes from the unadjusted data to data adjusted with RUV-inverse and RUV-4 with two different k values. For this small dataset, RUV-inverse appears to be removing far too much variation as we can see the samples in each group are completely overlapping. Using RUV-4 and choosing a smaller value for k produces more sensible results.
par(mfrow=c(2,2))
plotMDS(M, labels=targets$Sample_Name, col=as.integer(factor(targets$status)),
main="Unadjusted", gene.selection = "common")
legend("top",legend=c("Cancer","Normal"),pch=16,cex=1,col=1:2)
plotMDS(Madj, labels=targets$Sample_Name, col=as.integer(factor(targets$status)),
main="Adjusted: RUV-inverse", gene.selection = "common")
## Warning in plotMDS.default(Madj, labels = targets$Sample_Name, col =
## as.integer(factor(targets$status)), : dimension 2 is degenerate or all zero
legend("topright",legend=c("Cancer","Normal"),pch=16,cex=1,col=1:2)
plotMDS(Madj1, labels=targets$Sample_Name, col=as.integer(factor(targets$status)),
main="Adjusted: RUV-4, k=1", gene.selection = "common")
legend("bottom",legend=c("Cancer","Normal"),pch=16,cex=1,col=1:2)
plotMDS(Madj2, labels=targets$Sample_Name, col=as.integer(factor(targets$status)),
main="Adjusted: RUV-4, k=2", gene.selection = "common")
legend("bottomright",legend=c("Cancer","Normal"),pch=16,cex=1,col=1:2)
More information about the various RUV methods can be found at http://www-personal.umich.edu/~johanngb/ruv/, including links to all relevant publications. Further examples of RUV analyses, with code, can be found at https://github.com/johanngb/ruv-useR2018. The tutorials demonstrate how the various plotting functions available in the ruv package (which are not covered in this vignette) can be used to select sensible parameters and assess if the adjustment is “helping” your analysis.
Rather than testing for differences in mean methylation, we may be interested in testing for differences between group variances. For example, it has been hypothesised that highly variable CpGs in cancer are important for tumour progression (Hansen et al. 2011). Hence we may be interested in CpG sites that are consistently methylated in the normal samples, but variably methylated in the cancer samples.
In general we recommend at least 10 samples in each group for accurate variance estimation, however for the purpose of this vignette we perform the analysis on 3 vs 3. In this example, we are interested in testing for differential variability in the cancer versus normal group.
An important note on the coef
parameter: please always
explicitly state which columns of design matrix correspond to the groups
that you are interested in testing for differential variability. The
default setting for coef
is to include all columns of the
design matrix when calculating the Levene residuals, which is not
suitable for when there are additional variables that need to be taken
into account in the linear model. To avoid misspecification of the
model, it is best to explicitly state the coef
parameter.
The additional nuisance or confounding variables will still be taken
into account in the linear modelling step, however we find that
including them when calculating Levene residuals often removes all the
(possibly interesting) variation in the data.
When the design matrix includes an intercept term, the
coef
parameter must include both the intercept and groups
of interest. Consider the design matrix that was used when performing
the limma analysis:
## (Intercept) idid2 idid3 groupcancer
## 1 1 0 1 0
## 2 1 1 0 0
## 3 1 0 1 1
## 4 1 0 0 1
## 5 1 0 0 0
## 6 1 1 0 1
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$id
## [1] "contr.treatment"
##
## attr(,"contrasts")$group
## [1] "contr.treatment"
The first column of the design matrix is the intercept term, and the
fourth column tells us which samples are cancer and normal samples. The
2nd and 3rd columns correspond to the ID parameter, which is not
interesting in terms of finding differentially variable CpGs, but may be
important to include in the linear model. Hence for this example we
would specify coef = c(1,4)
in the call to
varFit
.
For methylation data, the varFit
function will take
either a matrix of M-values, β
values or a MethylSet
object as input. If β values are supplied, a logit
transformation is performed. Note that as a default, varFit
uses the robust setting in the limma
framework, which requires the use of the statmod
package.
The numbers of hyper-variable (1) and hypo-variable (-1) genes in
cancer vs normal can be obtained using
decideTests
. In the cancer vs normal context, we would
expect to see more variability in methylation in cancer compared to
normal (i.e. hyper-variable).
## (Intercept) idid2 idid3 groupcancer
## Down 0 3 1 0
## NotSig 19721 19996 19992 19991
## Up 279 1 7 9
## SampleVar LogVarRatio DiffLevene t P.Value Adj.P.Value
## cg13516820 5.693427 3.438152 2.891378 5.460085 4.787335e-08 0.0009574671
## cg14267725 8.274259 4.376604 2.883810 4.885952 1.033275e-06 0.0074201573
## cg22091297 6.139080 4.060245 2.820586 4.871277 1.113024e-06 0.0074201573
## cg26744375 6.120996 2.683816 2.478359 4.785455 1.712088e-06 0.0085604409
## cg23071808 5.559516 3.380280 2.655461 4.713859 2.438865e-06 0.0095323005
## cg24879782 4.783563 3.855533 2.725247 4.681319 2.859690e-06 0.0095323005
## cg15174834 4.887070 3.833003 2.633604 4.442969 8.896376e-06 0.0254182172
## cg09912667 4.199660 5.044736 2.773995 4.384781 1.163986e-05 0.0290996396
## cg09028204 5.212884 2.962118 2.413784 4.356993 1.321892e-05 0.0293753768
## cg17969902 4.858522 3.681015 2.491479 4.096117 4.209631e-05 0.0841926111
An alternate parameterisation of the design matrix that does not
include an intercept term can also be used (i.e. a cell means model),
and specific contrasts tested with contrasts.varFit
. Here
we specify the design matrix such that the first two columns correspond
to the normal and cancer groups,
respectively. Note that we now specify coef=c(1,2)
.
design2 <- model.matrix(~0+group+id)
fitvar.contr <- varFit(Mval, design=design2, coef=c(1,2))
contr <- makeContrasts(groupcancer-groupnormal,levels=colnames(design2))
fitvar.contr <- contrasts.varFit(fitvar.contr,contrasts=contr)
The results are identical to before.
## groupcancer - groupnormal
## Down 0
## NotSig 19991
## Up 9
## SampleVar LogVarRatio DiffLevene t P.Value Adj.P.Value
## cg13516820 5.693427 3.438152 2.891378 5.460085 4.787335e-08 0.0009574671
## cg14267725 8.274259 4.376604 2.883810 4.885952 1.033275e-06 0.0074201573
## cg22091297 6.139080 4.060245 2.820586 4.871277 1.113024e-06 0.0074201573
## cg26744375 6.120996 2.683816 2.478359 4.785455 1.712088e-06 0.0085604409
## cg23071808 5.559516 3.380280 2.655461 4.713859 2.438865e-06 0.0095323005
## cg24879782 4.783563 3.855533 2.725247 4.681319 2.859690e-06 0.0095323005
## cg15174834 4.887070 3.833003 2.633604 4.442969 8.896376e-06 0.0254182172
## cg09912667 4.199660 5.044736 2.773995 4.384781 1.163986e-05 0.0290996396
## cg09028204 5.212884 2.962118 2.413784 4.356993 1.321892e-05 0.0293753768
## cg17969902 4.858522 3.681015 2.491479 4.096117 4.209631e-05 0.0841926111
The β values for the top 4 differentially variable CpGs can be seen in Figure @ref(fig:top4DV).
cpgsDV <- rownames(topDV)
par(mfrow=c(2,2))
for(i in 1:4){
stripchart(beta[rownames(beta)==cpgsDV[i],]~design[,4],method="jitter",
group.names=c("Normal","Cancer"),pch=16,cex=1.5,col=c(4,2),ylab="Beta values",
vertical=TRUE,cex.axis=1.5,cex.lab=1.5)
title(cpgsDV[i],cex.main=1.5)
}
Testing for differential variability in expression data is
straightforward if the technology is gene expression microarrays. The
matrix of expression values can be supplied directly to the
varFit
function. For RNA-Seq data, the mean-variance
relationship that occurs in count data needs to be taken into account.
In order to deal with this issue, we apply a voom
transformation (Law et al. 2014) to obtain
observation weights, which are then used in the linear modelling step.
For RNA-Seq data, the varFit
function will take a
DGElist
object as input.
To demonstrate this, we use data from the tweeDEseqCountData package. This data is part of the International HapMap project, consisting of RNA-Seq profiles from 69 unrelated Nigerian individuals (Pickrell et al. 2010). The only covariate is gender, so we can look at differentially variable expression between males and females. We follow the code from the limma vignette to read in and process the data before testing for differential variability.
First we load up the data and extract the relevant information.
## [1] 38415 69
## gender
## female male
## 40 29
rm(pickrell1.eset)
data(genderGenes)
data(annotEnsembl63)
annot <- annotEnsembl63[,c("Symbol","Chr")]
rm(annotEnsembl63)
We now have the counts, gender of each sample and annotation (gene
symbol and chromosome) for each Ensemble gene. We can form a
DGElist
object using the edgeR
package.
We filter out lowly expressed genes by keeping genes with at least 1 count per million reads in at least 20 samples, as well as genes that have defined annotation. Finally we perform scaling normalisation.
isexpr <- rowSums(cpm(y)>1) >= 20
hasannot <- rowSums(is.na(y$genes))==0
y <- y[isexpr & hasannot,,keep.lib.sizes=FALSE]
dim(y)
## [1] 17310 69
We set up the design matrix and test for differential variability.
design.hapmap <- model.matrix(~gender)
fitvar.hapmap <- varFit(y, design = design.hapmap, coef=c(1,2))
## Converting counts to log counts-per-million using voom.
We can display the results of the test:
## (Intercept) gendermale
## Down 0 2
## NotSig 0 17308
## Up 17310 0
## Symbol Chr SampleVar LogVarRatio DiffLevene t
## ENSG00000213318 RP11-331F4.1 16 5.69839463 -2.562939 -0.9859943 -8.031243
## ENSG00000129824 RPS4Y1 Y 2.32497726 -2.087025 -0.4585620 -4.957005
## ENSG00000233864 TTTY15 Y 6.79004140 -2.245369 -0.6085233 -4.612934
## ENSG00000176171 BNIP3 10 0.41317384 1.199292 0.3632133 4.219404
## ENSG00000197358 BNIP3P1 14 0.39969125 1.149754 0.3353288 4.058147
## ENSG00000025039 RRAGD 6 0.91837213 1.091229 0.4926839 3.977022
## ENSG00000103671 TRIP4 15 0.07456448 -1.457139 -0.1520583 -3.911300
## ENSG00000171100 MTM1 X 0.44049558 -1.133295 -0.3334619 -3.896490
## ENSG00000149476 DAK 11 0.13289523 -1.470460 -0.1919880 -3.785893
## ENSG00000064886 CHI3L2 1 2.70234584 1.468059 0.8449434 3.782010
## P.Value Adj.P.Value
## ENSG00000213318 7.238039e-12 1.252905e-07
## ENSG00000129824 3.960855e-06 3.428120e-02
## ENSG00000233864 1.496237e-05 8.633290e-02
## ENSG00000176171 6.441668e-05 2.787632e-01
## ENSG00000197358 1.147886e-04 3.973982e-01
## ENSG00000025039 1.527695e-04 4.375736e-01
## ENSG00000103671 1.921104e-04 4.375736e-01
## ENSG00000171100 2.022293e-04 4.375736e-01
## ENSG00000149476 2.956364e-04 4.425050e-01
## ENSG00000064886 2.995692e-04 4.425050e-01
The log counts per million for the top 4 differentially variable genes can be seen in Figure @ref(fig:top4DVhapmap).
genesDV <- rownames(topDV.hapmap)
par(mfrow=c(2,2))
for(i in 1:4){
stripchart(cpm(y,log=TRUE)[rownames(y)==genesDV[i],]~design.hapmap[,ncol(design.hapmap)],method="jitter",
group.names=c("Female","Male"),pch=16,cex=1.5,col=c(4,2),ylab="Log counts per million",
vertical=TRUE,cex.axis=1.5,cex.lab=1.5)
title(genesDV[i],cex.main=1.5)
}
Once a differential methylation or differential variability analysis has been performed, it may be of interest to know which gene pathways are targeted by the significant CpG sites. Geeleher et al. (Geeleher et al. 2013) showed there is a severe bias when performing gene ontology analysis with methylation data. This is due to the fact that there are differing numbers of probes per gene on several different array technologies. For the Illumina Infinium HumanMethylation450 array the number of probes per gene ranges from 1 to 1299, with a median of 15 probes per gene. For the EPIC array, the range is 1 to 1487, with a median of 20 probes per gene. This means that when annotating CpG sites to genes, a gene is more likely to be identified as differentially methylated if there are many CpG sites associated with the gene. We refer to this source of bias as “probe number bias”.
One way to take into account this selection bias is to model the relationship between the number of probes per gene and the probability of being differentially methylated. This can be performed by adapting the goseq method of Young et al. (Young et al. 2010). Each gene then has a prior probability associated with it, and a modified version of a hypergeometric test can be performed, testing for over-representation of the selected genes in each gene set. The BiasedUrn package can be used to obtain p-values from the Wallenius’ noncentral hypergeometric distribution, taking into account the odds of differential methylation for each gene set. Note that the BiasedUrn package can occassionally return p-values of 0. For the gene sets where a p-value of exactly zero is outputted we perform a hypergeometric test, which ensures non-zero p-values and hence false discovery rates.
We have recently uncovered a new source of bias in gene set testing for methylation array data (Maksimovic, Oshlack, and Phipson 2020) that we refer to as “multi-gene bias”. This second source of bias arises due to the fact that around 10% of gene-annotated CpGs are annotated to more than one gene, violating assumptions of independently measured genes. This can lead to some GO categories being identified as significantly enriched as they contain genes with methylation measurements from shared CpGs. This can occur for large gene families such as the protocadherin gamma gene cluster which happen to be over-represented in the GO category “GO:0007156: homophilic cell adhesion via plasma membrane adhesion molecules”. This is now taken into account in the gene set testing functions in missMethyl.
We have developed methods for both CpG and region level analyses. The
gometh
function performs gene set testing on GO categories
or KEGG pathways for significant CpG sites. The gsameth
function is a more generalised gene set testing function which can take
as input a list of user specified gene sets. Note that for
gsameth
, the format for the gene ids for each gene in the
gene set needs to be Entrez Gene IDs. For example, the
entire curated gene set list (C2) from the Broad’s Molecular Signatures
Database can be specified as input. The R version of these lists can be
downloaded from http://bioinf.wehi.edu.au/software/MSigDB/index.html.
Both functions take a vector of significant CpG probe names as input.
For a region level analysis, using the DMRcate
package, for example, goregion
and gsaregion
can perform gene set enrichment analysis using the ranged object as
input.
NOTE ON UPDATES IN SEPTEMBER 2020: We have added new functionality to
the gene set testing functions to allow the user to limit the input CpGs
to specific genomic features of interest using the
genomic.features
argument. This is based on the annotation
information in the manifest files for the 450K and EPIC arrays. Possible
values are ““ALL”, “TSS200”, “TSS1500”, “Body”, “1stExon”, “3’UTR”,
“5’UTR” and “ExonBnd” (only for EPIC), and any combination can be
specified. In order to include the significant genes that overlap with
the gene set of interest, the sig.genes
parameter can be
set to TRUE
. This adds an additional column to the results
dataframe.
To illustrate how to use gometh
, consider the results
from the differential methylation analysis with
RUVm.
##
## FALSE TRUE
## 424168 61344
At a 1% false discovery rate cut-off, there are more than 60,000 CpG sites differentially methylated. These CpGs are annotated to almost 10,000 genes, which means that a gene ontology analysis is unlikely to be relevant or reveal anything biologically interesting. One option for selecting CpGs in this context is to apply not only a false discovery rate cut-off, but also a Δβ cut-off. However, for this dataset, taking a relatively large Δβ cut-off of 0.25 still leaves more than 30000 CpGs differentially methylated, which can be annotated to more than 6000 genes.
beta <- getBeta(mSet)
# make sure that order of beta values matches orer after analysis
beta <- beta[match(rownames(top),rownames(beta)),]
beta_norm <- rowMeans(beta[,grp==0])
beta_can <- rowMeans(beta[,grp==1])
Delta_beta <- beta_can - beta_norm
sigDM <- top$p.BH_X1.1 < 0.01 & abs(Delta_beta) > 0.25
table(sigDM)
## sigDM
## FALSE TRUE
## 451760 33748
Instead, we take the top 10000 CpG sites as input to
gometh
which can be annotated to around 2500 genes.
## [1] "cg07155336" "cg06463958" "cg00024472" "cg02040433" "cg13355248"
## [6] "cg02467990" "cg00817367" "cg11396157" "cg16306898" "cg03735888"
# Check number of genes that significant CpGs are annotated to
check <- getMappedEntrezIDs(sig.cpg = sigCpGs)
## All input CpGs are used for testing.
## [1] 2491
Thegometh
function takes as input a character vector of
CpG names, and optionally, a character vector of all CpG sites tested.
This is important to include if filtering of the CpGs has been performed
prior to differential methylation analysis. If the all.cpg
argument is omitted, all the CpGs on the array are used as background.
To change the array type, the array.type
argument can be
specified as either “450K”, “EPIC” or “EPIC_V2”. The default is
“450K”.
If the plot.bias
argument is TRUE
, a figure
showing the relationship between the probability of differential
methylation and the number of probes per gene will be displayed.
For testing of GO terms, the collection
argument takes
the value “GO”, which is the default setting. For KEGG pathway analysis,
set collection
to “KEGG”. The function topGSA
shows the top enriched GO categories. The gsameth
function
is called for GO and KEGG pathway analysis with the appropriate
inputs.
For GO testing on our example dataset:
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
gst <- gometh(sig.cpg=sigCpGs, all.cpg=rownames(top), collection="GO",
plot.bias=TRUE)
## All input CpGs are used for testing.
## ONTOLOGY TERM N DE P.DE
## GO:0048731 BP system development 4114 917 7.056363e-39
## GO:0007399 BP nervous system development 2595 654 1.202074e-34
## GO:0007275 BP multicellular organism development 4774 988 8.191426e-32
## GO:0048699 BP generation of neurons 1519 430 1.847685e-29
## GO:0048856 BP anatomical structure development 6003 1140 1.288330e-28
## GO:0030182 BP neuron differentiation 1434 406 1.678079e-27
## GO:0022008 BP neurogenesis 1772 471 8.340452e-27
## GO:0032502 BP developmental process 6557 1197 2.652977e-26
## GO:0032501 BP multicellular organismal process 7360 1269 1.150266e-25
## GO:0071944 CC cell periphery 5945 1054 8.906403e-25
## FDR
## GO:0048731 1.580907e-34
## GO:0007399 1.346563e-30
## GO:0007275 6.117357e-28
## GO:0048699 1.034888e-25
## GO:0048856 5.772750e-25
## GO:0030182 6.265947e-24
## GO:0022008 2.669421e-23
## GO:0032502 7.429661e-23
## GO:0032501 2.863395e-22
## GO:0071944 1.995391e-21
Testing all GO categories (>20,000) can be a little slow. To
demonstrate the genomic.features
parameter, let us rather
focus on KEGG pathways (~330 pathways).
## All input CpGs are used for testing.
## Description N DE P.DE
## hsa04080 Neuroactive ligand-receptor interaction 362 80 3.622173e-08
## hsa04020 Calcium signaling pathway 250 76 5.951427e-08
## hsa04514 Cell adhesion molecules 150 47 1.123402e-05
## hsa04970 Salivary secretion 92 27 1.105137e-04
## hsa04024 cAMP signaling pathway 225 57 1.319237e-04
## hsa04950 Maturity onset diabetes of the young 26 12 2.862867e-04
## hsa04151 PI3K-Akt signaling pathway 347 79 7.426530e-04
## hsa04911 Insulin secretion 86 27 1.307017e-03
## hsa04974 Protein digestion and absorption 102 27 1.483637e-03
## hsa05217 Basal cell carcinoma 63 22 1.546653e-03
## FDR
## hsa04080 1.074233e-05
## hsa04020 1.074233e-05
## hsa04514 1.351827e-03
## hsa04970 9.524890e-03
## hsa04024 9.524890e-03
## hsa04950 1.722492e-02
## hsa04151 3.829967e-02
## hsa04911 5.583419e-02
## hsa04974 5.583419e-02
## hsa05217 5.583419e-02
We can limit the input CpGs to those in the promoter regions of genes:
gst.kegg.prom <- gometh(sig.cpg=sigCpGs, all.cpg=rownames(top),
collection="KEGG", genomic.features = c("TSS200",
"TSS1500",
"1stExon"))
topGSA(gst.kegg.prom, n=10)
## Description N DE P.DE
## hsa04080 Neuroactive ligand-receptor interaction 362 57 5.817842e-09
## hsa04020 Calcium signaling pathway 250 46 8.426609e-06
## hsa04514 Cell adhesion molecules 150 29 1.446670e-04
## hsa04024 cAMP signaling pathway 225 37 1.713354e-04
## hsa04911 Insulin secretion 86 19 5.037349e-04
## hsa05032 Morphine addiction 91 20 9.693239e-04
## hsa05217 Basal cell carcinoma 63 15 1.536957e-03
## hsa04916 Melanogenesis 101 20 1.608201e-03
## hsa04727 GABAergic synapse 89 18 2.016696e-03
## hsa04970 Salivary secretion 92 16 2.319422e-03
## FDR
## hsa04080 2.100241e-06
## hsa04020 1.521003e-03
## hsa04514 1.546302e-02
## hsa04024 1.546302e-02
## hsa04911 3.636966e-02
## hsa05032 5.832099e-02
## hsa05217 7.257006e-02
## hsa04916 7.257006e-02
## hsa04727 8.089190e-02
## hsa04970 8.373112e-02
We can see if the results are different if we only include CpGs in gene bodies:
gst.kegg.body <- gometh(sig.cpg=sigCpGs, all.cpg=rownames(top),
collection="KEGG", genomic.features = c("Body"))
topGSA(gst.kegg.body, n=10)
## Description N DE
## hsa04820 Cytoskeleton in muscle cells 232 38
## hsa04974 Protein digestion and absorption 102 19
## hsa04514 Cell adhesion molecules 150 28
## hsa04020 Calcium signaling pathway 250 40
## hsa04512 ECM-receptor interaction 89 19
## hsa04640 Hematopoietic cell lineage 92 14
## hsa04950 Maturity onset diabetes of the young 26 7
## hsa04151 PI3K-Akt signaling pathway 347 45
## hsa04550 Signaling pathways regulating pluripotency of stem cells 141 24
## hsa05410 Hypertrophic cardiomyopathy 99 17
## P.DE FDR
## hsa04820 0.0003185460 0.05084416
## hsa04974 0.0004211805 0.05084416
## hsa04514 0.0004225277 0.05084416
## hsa04020 0.0007280317 0.05995196
## hsa04512 0.0008303595 0.05995196
## hsa04640 0.0018299702 0.11010320
## hsa04950 0.0056797373 0.29291217
## hsa04151 0.0066334332 0.29933367
## hsa04550 0.0077328075 0.31017150
## hsa05410 0.0101666956 0.34180319
The KEGG pathways are quite different when limiting CpGs to those in
gene bodies versus CpGs in promoters. To include the significant genes
that overlap with each gene set, set the sig.genes
parameter to TRUE.
gst.kegg.body <- gometh(sig.cpg=sigCpGs, all.cpg=rownames(top),
collection="KEGG", genomic.features = c("Body"),
sig.genes = TRUE)
topGSA(gst.kegg.body, n=5)
## Description N DE P.DE FDR
## hsa04820 Cytoskeleton in muscle cells 232 38 0.0003185460 0.05084416
## hsa04974 Protein digestion and absorption 102 19 0.0004211805 0.05084416
## hsa04514 Cell adhesion molecules 150 28 0.0004225277 0.05084416
## hsa04020 Calcium signaling pathway 250 40 0.0007280317 0.05995196
## hsa04512 ECM-receptor interaction 89 19 0.0008303595 0.05995196
## SigGenesInSet
## hsa04820 COL1A2,COL4A1,COL4A2,COL4A3,COL5A1,COL5A2,COL6A3,COL11A2,COMP,SGCZ,SYNE3,ELN,FBLN1,FBLN2,MYH15,SUN1,COL24A1,PDLIM3,LAMA1,ITGA4,ITGB4,ITGB5,AGRN,LAMA2,MYBPC3,MYH11,NID1,ATP1B2,FMN2,SPTBN4,ACTA1,SDC2,TNNI2,TNNT3,VIM,OBSCN,ITGA8,COL27A1,SGCE
## hsa04974 COL1A2,COL4A1,COL4A2,COL4A3,COL5A1,COL5A2,COL6A3,COL11A2,COL15A1,CPA2,COL26A1,COL22A1,ELN,COL24A1,KCNQ1,ATP1B2,SLC8A2,COL18A1,COL27A1,COL23A1
## hsa04514 CDH2,CDH4,CNTNAP2,ICOS,HLA-DQA2,HLA-DQB1,HLA-F,HLA-G,ITGA4,ITGAM,MAG,NCAM1,CLDN11,CLDN18,PDCD1,PTPRM,PTPRS,JAM2,SDC2,JAM3,NTNG2,ITGA8,CLDN12,CLDN1,CD8A,NRXN1,NRXN2,CD34
## hsa04020 ADCY1,GRIN3A,ADRA1B,ADCY4,ERBB4,FGF3,FGFR2,P2RX2,FLT1,FLT4,GDNF,GRIN2A,GRIN2D,GRM5,HTR2A,ITPKB,ITPR2,NFATC1,ATP2A1,ATP2B2,NTRK1,NTRK2,NTRK3,PDE1A,PDGFB,PLCG2,PRKCB,RYR1,RYR2,RYR3,SLC8A2,CACNA1B,CACNA1D,PDGFD,CAMK4,CAMK2B,CACNA1I,CACNA1H,PLCZ1,CD38,FGF19
## hsa04512 COL1A2,COL4A1,COL4A2,COL4A3,COL6A3,COMP,LAMA1,FREM2,ITGA4,ITGB4,ITGB5,AGRN,LAMA2,LAMA4,GP6,RELN,TNXB,FRAS1,ITGA8,CD36
For a more generalised version of gene set testing in methylation
data where the user can specify the gene set to be tested, the function
gsameth
can be used. To display the top 20 pathways,
topGSA
can be called. gsameth
can take a
single gene set, or a list of gene sets. The gene identifiers in the
gene set must be Entrez Gene IDs. To demonstrate
gsameth
, we download and use the Hallmark gene set.
hallmark <- readRDS(url("http://bioinf.wehi.edu.au/MSigDB/v7.1/Hs.h.all.v7.1.entrez.rds"))
gsa <- gsameth(sig.cpg=sigCpGs, all.cpg=rownames(top), collection=hallmark)
## All input CpGs are used for testing.
## N DE P.DE FDR
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 199 62 1.517786e-07 7.588932e-06
## HALLMARK_KRAS_SIGNALING_DN 200 47 1.608166e-04 4.020416e-03
## HALLMARK_PANCREAS_BETA_CELLS 40 14 7.632959e-04 1.272160e-02
## HALLMARK_MYOGENESIS 200 46 1.790637e-02 2.238296e-01
## HALLMARK_APICAL_JUNCTION 199 43 3.230033e-02 3.184648e-01
## HALLMARK_SPERMATOGENESIS 135 25 3.821577e-02 3.184648e-01
## HALLMARK_UV_RESPONSE_DN 144 36 9.694014e-02 6.193276e-01
## HALLMARK_HEDGEHOG_SIGNALING 36 11 9.909242e-02 6.193276e-01
## HALLMARK_KRAS_SIGNALING_UP 198 34 1.264278e-01 7.023766e-01
## HALLMARK_NOTCH_SIGNALING 32 9 1.416844e-01 7.084222e-01
Note that if it is of interest to obtain the Entrez Gene
IDs that the significant CpGs are mapped to, the
getMappedEntrezIDs
function can be called.
We have extended gometh
and gsameth
to
perform gene set testing following a region analysis. The region gene
set testing function analogues are goregion
and
gsaregion
. Instead of inputting significant CpGs, a ranged
object as typically outputted from region finding software is used. The
CpGs overlapping the significant differentially methylated regions
(DMRs) are extracted and the same statistical framework in
gometh
and gsameth
is applied to test for
enrichment of gene sets. We find that genes that have more CpGs
measuring methylation are more likely to be called as differentially
methylated regions, and hence it is important to take into account this
bias when performing gene set testing.
To demonstrate goregion
and gsaregion
we
will perform a region analysis on the cancer dataset using the DMRcate
package.
First, the matrix of M-values is annotated with the relevant annotation information about the probes such as their genomic position, gene annotation, etc. By default, this is done using the ilmn12.hg19 annotation. The limma pipeline is then used for differential methylation analysis to calculate moderated t-statistics.
myAnnotation <- cpg.annotate(object = M, datatype = "array", what = "M",
arraytype = c("450K"),
analysis.type = "differential", design = design,
coef = 4)
## Your contrast returned 23435 individually significant probes. We recommend the default setting of pcutoff in dmrcate().
Next we can use the dmrcate
function to combine the CpGs
to identify differentially methylated regions. We can use the
extractRanges
function to extract a GRanges
object with the genomic location and the relevant statistics associated
with each DMR. This object can then be used as input to
goregion
and gsaregion
.
## Fitting chr1...
## Fitting chr2...
## Fitting chr3...
## Fitting chr4...
## Fitting chr5...
## Fitting chr6...
## Fitting chr7...
## Fitting chr8...
## Fitting chr9...
## Fitting chr10...
## Fitting chr11...
## Fitting chr12...
## Fitting chr13...
## Fitting chr14...
## Fitting chr15...
## Fitting chr16...
## Fitting chr17...
## Fitting chr18...
## Fitting chr19...
## Fitting chr20...
## Fitting chr21...
## Fitting chr22...
## Fitting chrX...
## Fitting chrY...
## Demarcating regions...
## Done!
## DMRcatedata not installed.
## Full functionality, documentation, and loading of data might not be possible without installing
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## GRanges object with 2578 ranges and 8 metadata columns:
## seqnames ranges strand | no.cpgs min_smoothed_fdr
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## [1] chr4 154709441-154714069 * | 54 0
## [2] chr6 133561224-133564066 * | 50 0
## [3] chr13 78491982-78494462 * | 50 0
## [4] chr11 32454216-32461240 * | 43 0
## [5] chr10 118030292-118034357 * | 31 0
## ... ... ... ... . ... ...
## [2574] chr6 91005117-91005162 * | 2 3.59972e-25
## [2575] chr2 241085024-241085043 * | 2 3.64478e-25
## [2576] chr17 48071685-48071706 * | 2 3.86812e-25
## [2577] chr6 166720997-166721155 * | 2 4.22386e-25
## [2578] chr19 58446898-58446988 * | 3 4.47316e-25
## Stouffer HMFDR Fisher maxdiff meandiff
## <numeric> <numeric> <numeric> <numeric> <numeric>
## [1] 1.14510e-52 0.00102944 3.61330e-54 0.586292 0.224951
## [2] 3.29083e-86 0.00103909 4.98468e-81 0.653903 0.362511
## [3] 5.17970e-75 0.00119367 4.78794e-69 0.548184 0.323927
## [4] 1.59959e-71 0.00102449 1.02503e-64 0.648133 0.298009
## [5] 5.43504e-75 0.00102449 1.44576e-66 0.688941 0.411306
## ... ... ... ... ... ...
## [2574] 2.15797e-05 0.00197580 4.71393e-05 0.284672 0.2309882
## [2575] 1.94558e-06 0.00113219 4.68351e-06 -0.344832 -0.3291230
## [2576] 3.47644e-05 0.00256174 7.50944e-05 0.331912 0.3141771
## [2577] 1.33409e-04 0.00251525 2.35477e-04 0.252990 0.2377386
## [2578] 1.89862e-04 0.00923636 4.63475e-04 -0.200463 0.0256059
## overlapping.genes
## <character>
## [1] SFRP2
## [2] EYA4
## [3] RNF219-AS1, EDNRB
## [4] WT1-AS, WT1
## [5] GFRA1
## ... ...
## [2574] BACH2
## [2575] <NA>
## [2576] DLX3
## [2577] PRR18
## [2578] <NA>
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
We can visualise the top DMR using the DMR.plot
function:
cols <- c(2,4)[group]
names(cols) <-group
beta <- getBeta(mSet)
par(mfrow=c(1,1))
DMR.plot(ranges=results.ranges, dmr=2, CpGs=beta, phen.col=cols,
what="Beta", arraytype="450K", genome="hg19")
Now that we have performed our region analysis, we can use
goregion
and gsaregion
to perform gene set
testing. Setting plot.bias
to TRUE, we can see strong probe
number bias in the data. This can be interpretted that a gene is more
likely to have a DMR called if it has more CpGs measuring
methylation.
gst.region <- goregion(results.ranges, all.cpg=rownames(M),
collection="GO", array.type="450K", plot.bias=TRUE)
## All input CpGs are used for testing.
## ONTOLOGY
## GO:0048731 BP
## GO:0007399 BP
## GO:0007275 BP
## GO:0003700 MF
## GO:0000981 MF
## GO:0048856 BP
## GO:0030182 BP
## GO:0048699 BP
## GO:0032501 BP
## GO:0000977 MF
## TERM
## GO:0048731 system development
## GO:0007399 nervous system development
## GO:0007275 multicellular organism development
## GO:0003700 DNA-binding transcription factor activity
## GO:0000981 DNA-binding transcription factor activity, RNA polymerase II-specific
## GO:0048856 anatomical structure development
## GO:0030182 neuron differentiation
## GO:0048699 generation of neurons
## GO:0032501 multicellular organismal process
## GO:0000977 RNA polymerase II transcription regulatory region sequence-specific DNA binding
## N DE P.DE FDR
## GO:0048731 4114 744 4.349353e-40 9.331936e-36
## GO:0007399 2595 551 8.330598e-40 9.331936e-36
## GO:0007275 4774 804 2.659730e-35 1.986286e-31
## GO:0003700 1401 308 2.699146e-32 1.511792e-28
## GO:0000981 1318 293 8.628196e-31 3.866122e-27
## GO:0048856 6003 910 4.047706e-30 1.511413e-26
## GO:0030182 1434 341 4.126928e-29 1.320853e-25
## GO:0048699 1519 354 9.569565e-29 2.679957e-25
## GO:0032501 7360 1005 4.657137e-28 1.159317e-24
## GO:0000977 1350 288 1.062605e-27 2.380661e-24
We can also test for enrichment of KEGG pathways:
gst.region.kegg <- goregion(results.ranges, all.cpg=rownames(M),
collection="KEGG", array.type="450K")
## All input CpGs are used for testing.
## Description N DE P.DE
## hsa04080 Neuroactive ligand-receptor interaction 362 82 4.960622e-15
## hsa04514 Cell adhesion molecules 150 43 3.583958e-07
## hsa04020 Calcium signaling pathway 250 61 5.288811e-07
## hsa04950 Maturity onset diabetes of the young 26 12 2.118983e-05
## hsa04024 cAMP signaling pathway 225 48 6.083002e-05
## hsa04713 Circadian entrainment 97 27 4.680541e-04
## hsa04970 Salivary secretion 92 21 5.560922e-04
## hsa04742 Taste transduction 85 17 1.427412e-03
## hsa04725 Cholinergic synapse 113 28 1.982539e-03
## hsa04911 Insulin secretion 86 22 2.171758e-03
## FDR
## hsa04080 1.790785e-12
## hsa04514 6.364203e-05
## hsa04020 6.364203e-05
## hsa04950 1.912382e-03
## hsa04024 4.391928e-03
## hsa04713 2.816125e-02
## hsa04970 2.867847e-02
## hsa04742 6.441197e-02
## hsa04725 7.777573e-02
## hsa04911 7.777573e-02
What is interesting is that although very similar pathways are highly ranked compared to the CpG level analysis, gene set testing on the regions is more powerful i.e. the p-values are more significant. In experiments where there are tens of thousands of significant CpGs from a CpG level analysis, we recommend that a good quality region analysis can be a more powerful approach for gene set enrichment analysis.
We can also perform gene set testing on the Hallmark gene sets using
gsaregion
:
## All input CpGs are used for testing.
## N DE P.DE FDR
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 199 41 0.001372868 0.06864339
## HALLMARK_PANCREAS_BETA_CELLS 40 11 0.002838708 0.07096770
## HALLMARK_KRAS_SIGNALING_DN 200 33 0.005944217 0.09907029
## HALLMARK_SPERMATOGENESIS 135 22 0.012163372 0.15204215
## HALLMARK_APICAL_JUNCTION 199 34 0.043709454 0.43709454
## HALLMARK_MYOGENESIS 200 34 0.067441327 0.55864017
## HALLMARK_KRAS_SIGNALING_UP 198 28 0.078209624 0.55864017
## HALLMARK_ANGIOGENESIS 36 7 0.107929922 0.67456201
## HALLMARK_APICAL_SURFACE 44 7 0.287851086 1.00000000
## HALLMARK_INFLAMMATORY_RESPONSE 200 20 0.314445563 1.00000000
Note that the DMR analysis can be further refined by imposing a Δβ value cut-off and changing various parameters. Please refer to the DMRcate package vignette for more details on how to do this.
R version 4.4.1 (2024-06-14) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS
Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC tzcode source: system (glibc)
attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base
other attached packages: [1] DMRcate_3.1.10
[2] edgeR_4.3.21
[3] tweeDEseqCountData_1.43.0
[4] minfiData_0.51.0
[5] IlluminaHumanMethylation450kmanifest_0.4.0
[6] limma_3.61.12
[7] missMethyl_1.41.0
[8] IlluminaHumanMethylationEPICv2anno.20a1.hg38_1.0.0 [9]
IlluminaHumanMethylationEPICanno.ilm10b4.hg19_0.6.0 [10]
IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1 [11]
minfi_1.51.0
[12] bumphunter_1.49.0
[13] locfit_1.5-9.10
[14] iterators_1.0.14
[15] foreach_1.5.2
[16] Biostrings_2.75.0
[17] XVector_0.45.0
[18] SummarizedExperiment_1.35.5
[19] Biobase_2.67.0
[20] MatrixGenerics_1.17.1
[21] matrixStats_1.4.1
[22] GenomicRanges_1.57.2
[23] GenomeInfoDb_1.41.2
[24] IRanges_2.39.2
[25] S4Vectors_0.43.2
[26] BiocGenerics_0.53.0
[27] BiocStyle_2.35.0
loaded via a namespace (and not attached): [1] ProtGenerics_1.37.1
bitops_1.0-9
[3] httr_1.4.7 RColorBrewer_1.1-3
[5] tools_4.4.1 doRNG_1.8.6
[7] backports_1.5.0 utf8_1.2.4
[9] R6_2.5.1 HDF5Array_1.33.8
[11] lazyeval_0.2.2 Gviz_1.49.0
[13] rhdf5filters_1.17.0 permute_0.9-7
[15] withr_3.0.2 prettyunits_1.2.0
[17] gridExtra_2.3 base64_2.0.2
[19] preprocessCore_1.67.1 cli_3.6.3
[21] sass_0.4.9 readr_2.1.5
[23] genefilter_1.87.0 askpass_1.2.1
[25] Rsamtools_2.21.2 foreign_0.8-87
[27] siggenes_1.79.0 illuminaio_0.47.0
[29] R.utils_2.12.3 rentrez_1.2.3
[31] dichromat_2.0-0.1 scrime_1.3.5
[33] BSgenome_1.75.0 rstudioapi_0.17.1
[35] RSQLite_2.3.7 generics_0.1.3
[37] BiocIO_1.17.0 gtools_3.9.5
[39] dplyr_1.1.4 GO.db_3.20.0
[41] Matrix_1.7-1 interp_1.1-6
[43] fansi_1.0.6 abind_1.4-8
[45] R.methodsS3_1.8.2 lifecycle_1.0.4
[47] yaml_2.3.10 rhdf5_2.49.0
[49] SparseArray_1.5.45 BiocFileCache_2.15.0
[51] grid_4.4.1 blob_1.2.4
[53] ExperimentHub_2.13.1 crayon_1.5.3
[55] lattice_0.22-6 GenomicFeatures_1.57.1
[57] annotate_1.85.0 KEGGREST_1.45.1
[59] sys_3.4.3 maketools_1.3.1
[61] pillar_1.9.0 knitr_1.48
[63] beanplot_1.3.1 rjson_0.2.23
[65] codetools_0.2-20 glue_1.8.0
[67] data.table_1.16.2 vctrs_0.6.5
[69] png_0.1-8 gtable_0.3.6
[71] cachem_1.1.0 xfun_0.48
[73] mime_0.12 S4Arrays_1.5.11
[75] survival_3.7-0 statmod_1.5.0
[77] nlme_3.1-166 bit64_4.5.2
[79] bsseq_1.43.0 progress_1.2.3
[81] filelock_1.0.3 bslib_0.8.0
[83] nor1mix_1.3-3 rpart_4.1.23
[85] colorspace_2.1-1 DBI_1.2.3
[87] Hmisc_5.2-0 nnet_7.3-19
[89] tidyselect_1.2.1 bit_4.5.0
[91] compiler_4.4.1 curl_5.2.3
[93] httr2_1.0.5 htmlTable_2.4.3
[95] BiasedUrn_2.0.12 xml2_1.3.6
[97] DelayedArray_0.33.1 rtracklayer_1.65.0
[99] checkmate_2.3.2 scales_1.3.0
[101] quadprog_1.5-8 rappdirs_0.3.3
[103] stringr_1.5.1 digest_0.6.37
[105] rmarkdown_2.28 GEOquery_2.73.5
[107] htmltools_0.5.8.1 pkgconfig_2.0.3
[109] jpeg_0.1-10 base64enc_0.1-3
[111] sparseMatrixStats_1.17.2 highr_0.11
[113] dbplyr_2.5.0 ruv_0.9.7.1
[115] fastmap_1.2.0 ensembldb_2.29.1
[117] rlang_1.1.4 htmlwidgets_1.6.4
[119] UCSC.utils_1.1.0 DelayedMatrixStats_1.29.0 [121] jquerylib_0.1.4
jsonlite_1.8.9
[123] BiocParallel_1.41.0 mclust_6.1.1
[125] R.oo_1.26.0 VariantAnnotation_1.51.2 [127] RCurl_1.98-1.16
magrittr_2.0.3
[129] Formula_1.2-5 GenomeInfoDbData_1.2.13
[131] Rhdf5lib_1.27.0 munsell_0.5.1
[133] Rcpp_1.0.13 stringi_1.8.4
[135] zlibbioc_1.51.2 MASS_7.3-61
[137] AnnotationHub_3.15.0 plyr_1.8.9
[139] org.Hs.eg.db_3.20.0 deldir_2.0-4
[141] splines_4.4.1 multtest_2.61.0
[143] hms_1.1.3 rngtools_1.5.2
[145] buildtools_1.0.0 biomaRt_2.63.0
[147] BiocVersion_3.21.1 XML_3.99-0.17
[149] evaluate_1.0.1 latticeExtra_0.6-30
[151] biovizBase_1.55.0 BiocManager_1.30.25
[153] tzdb_0.4.0 tidyr_1.3.1
[155] openssl_2.2.2 purrr_1.0.2
[157] reshape_0.8.9 ggplot2_3.5.1
[159] xtable_1.8-4 restfulr_0.0.15
[161] AnnotationFilter_1.31.0 tibble_3.2.1
[163] memoise_2.0.1 AnnotationDbi_1.69.0
[165] GenomicAlignments_1.41.0 cluster_2.1.6