Experiments comprising different omic data analyses are becoming common. In this cases, the researchers are interested in analyzing the different omic types independently as well as finding association between the different layers.
In this vignette, we present how address to an analysis of methylation and expression data using MEAL. We will show how to analyze each data type independently and how to integrate the results of both analyses. We will also exemplify how to run an analysis when we are interested in our target region.
We will use the data from brgedata package. This package
contains data from a spanish children cohort. We will work with the
methylation and the expression data encapsulated in a
GenomicRatioSet
and in an ExpressionSet
respectively. In our analysis, we will evaluate the effect of sex on
methylation and expression. We will also deeply explore changes in a
region of chromosome 11 around MMP3, a gene differently expressed in men
and women (chr11:102600000-103300000).
Let’s start by loading the required packages and data:
library(MEAL)
library(brgedata)
library(MultiDataSet)
library(missMethyl)
library(minfi)
library(GenomicRanges)
library(ggplot2)
From the six objects of brgedata, we will focus on those containing methylation and expression data: brge_methy and brge_gexp.
brge_methy is a GenomicRatioSet
with methylation data
corresponding to the Illumina Human Methylation 450K. It contains 392277
probes and 20 samples. The object contains 9 phenotypic variables (age,
sex and cell types proportions):
## class: GenomicRatioSet
## dim: 392277 20
## metadata(0):
## assays(1): Beta
## rownames(392277): cg13869341 cg24669183 ... cg26251715 cg25640065
## rowData names(14): Forward_Sequence SourceSeq ...
## Regulatory_Feature_Group DHS
## colnames(20): x0017 x0043 ... x0077 x0079
## colData names(9): age sex ... Mono Neu
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
## Preprocessing
## Method: NA
## minfi version: NA
## Manifest version: NA
## DataFrame with 20 rows and 9 columns
## age sex NK Bcell CD4T CD8T
## <numeric> <factor> <numeric> <numeric> <numeric> <numeric>
## x0017 4 Female -5.81386e-19 0.1735078 0.20406869 0.1009741
## x0043 4 Female 1.64826e-03 0.1820172 0.16171272 0.1287722
## x0036 4 Male 1.13186e-02 0.1690173 0.15546368 0.1277417
## x0041 4 Female 8.50822e-03 0.0697158 0.00732789 0.0321861
## x0032 4 Male 0.00000e+00 0.1139780 0.22230399 0.0216090
## ... ... ... ... ... ... ...
## x0018 4 Female 0.0170284 0.0781698 0.112735 0.06796816
## x0057 4 Female 0.0000000 0.0797774 0.111072 0.00910489
## x0061 4 Female 0.0000000 0.1640266 0.224203 0.13125212
## x0077 4 Female 0.0000000 0.1122731 0.168056 0.07840593
## x0079 4 Female 0.0120148 0.0913650 0.205830 0.11475389
## Eos Mono Neu
## <numeric> <numeric> <numeric>
## x0017 0.00000e+00 0.0385654 0.490936
## x0043 0.00000e+00 0.0499542 0.491822
## x0036 0.00000e+00 0.1027105 0.459632
## x0041 0.00000e+00 0.0718280 0.807749
## x0032 2.60504e-18 0.0567246 0.575614
## ... ... ... ...
## x0018 8.37770e-03 0.0579535 0.658600
## x0057 4.61887e-18 0.1015260 0.686010
## x0061 8.51074e-03 0.0382647 0.429575
## x0077 6.13397e-02 0.0583411 0.515284
## x0079 0.00000e+00 0.0750535 0.513597
brge_gexp is an ExpressionSet
with the expression data
corresponding to an Affimetrix GeneChip Human Gene 2.0 ST Array. It
contains 67528 features and 100 samples as well as two phenotypic
variables (age and sex):
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 67528 features, 100 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: x0001 x0002 ... x0139 (100 total)
## varLabels: age sex
## varMetadata: labelDescription
## featureData
## featureNames: TC01000001.hg.1 TC01000002.hg.1 ...
## TCUn_gl000247000001.hg.1 (67528 total)
## fvarLabels: transcript_cluster_id probeset_id ... notes (11 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## $age
##
## 4
## 99
##
## $sex
##
## Female Male
## 48 51
Finally, we will create a GenomicRanges
with our target
region:
Let’s illustrate the use of this package by analyzing the effect of the sex in methylation and expression. First, a genome wide analysis will be performed and then the region funcionalities will be introduced. It should be noticed that methylation and expression analyses will include hypothesis testing and visualitzation.
This demonstration will show how to perform a probe methylation analysis. To run a more complete analysis, including DMR methods, see MEAL vignette.
The function runDiffMeanAnalysis
detects probes
differentially methylated using limma
. This function
accepts a GenomicRatioSet
, an ExpressionSet
or
a SummarizedExperiment
, so it can analyze methylation and
gene expression data. We can pass our model to function
runDiffMeanAnalysis
with the parameter model
.
We are interested in evaluating the effect of sex and We will include
cell counts as covariables to adjust their effect:
cellCounts <- colnames(colData(brge_methy))[3:9]
methRes <- runDiffMeanAnalysis(set = brge_methy,
model = formula(paste("~ sex +", paste(cellCounts, collapse = "+"))))
methRes
## Object of class 'ResultSet'
## . created with: DAProbe
## . sva:
## . #results: 1 ( error: 1 )
## . featureData: 392277 probes x 19 variables
## [1] "DiffMean"
The analysis generates a ResultSet
object containing the
results of our analysis. The first step after running the pipeline will
be evaluate how this model fits the data. A common way to evaluate the
goodness of fit of these models is by making a QQplot on the p-values of
the linear regression. This plot compares the distribution of p-values
obtained in the analysis with a theoretical distribution. Our plot also
shows the lambda, a measure of inflation. If lambda is higher than 1,
the p-values are smaller than expected. If it is lower than 1, p-values
are bigger than expected. In both causes, the most common cause is that
we should include more variables in the model.
We can get a QQplot with the function plot
. When applied
to a ResultSet
, we can make different plots depending on
the method. We can select the name of the result with the parameter
rid
and the kind of plot with the parameter
type
. We will set rid
to “DiffMean” to use
limma results and type
to “qq” to get a QQplot:
When there are few CpGs modicated, we expect points in the QQplot lying on the theoretical line. However, as all the CpGs from the sexual chromosomes will be differentially methylated between males and females, we obtained a S-shape. In this case, lambda is lower than 1 (0.851).
There are different ways to address this issue. An option is to
include other covariates in our model. When we do not have more
covariates, as it is our case, we can apply Surrogate Variable Analysis
(SVA) to the data. SVA is a statistical technique that tries to
determine hidden covariates or SV (Surrogate Variables) based on the
measurements. This method is very useful to correct inflation when we do
not have the variables that are missing in the model. Our function
runPipeline
includes the parameter sva
that
runs SVA and includes these variables as covariates in the linear model.
More information can be found in MEAL
vignette.
We can get an overview of the results using a Manhattan plot. A
Manhattan plot represents the p-value of each CpG versus their location
in the genome. This representation is useful to find genomic regions
exhibiting differentiated behaviours. We can get a Manhattan plot from a
ResultSet
using the function plot
and setting
type
to “manhattan”. We will highlight the CpGs of our
target region by passing a GenomicRanges
to highlight
argument. GenomicRanges
passed to plot
should
have chromosomes coded as number:
targetRangeNum <- GRanges("11:102600000-103300000")
plot(methRes, rid = "DiffMean", type = "manhattan", main = "Differences in Means", highlight = targetRangeNum)
There are no genomic regions with a lot of CpGs with differences in means. Our target region does not seem to be very different from the rest. On the other hand, chromosome X has a lot of CpGs with big differences in variance. Our target region does not look different.
The expression analysis can be performed in the same way than
methylation. The main difference is that an ExpressionSet
will be used instead of a GenomicRatioSet
. We will run the
same model than for methylation. We will check the effect of sex on gene
expression and we will further evaluate our target region:
targetRange <- GRanges("chr11:102600000-103300000")
gexpRes <- runDiffMeanAnalysis(set = brge_gexp, model = ~ sex)
names(gexpRes)
## [1] "DiffMean"
As we did with methylation data, the first step will be check the goodness of our model. To do so, we will run a QQ-plot using the p-values of the differences of means:
Most of the p-values lie on the theoretical line and the lambda is close to 1 (0.94). Therefore, we will not need to include other covariates in our model.
We will now get a general overview of the results. In gene expression
data, it is very common to use a Volcano plot which shows the change in
expression of a probe against their p-value. We can easily make these
plots using the function plot
and setting the argument
type
to volcano:
Several transcripts have a big difference in mean gene expression and a very small p-value. Transcripts with a positive fold change have a higher expression in boys than in girls and are placed in chromosome Y (name starts by TC0Y). Transcripts with a negative fold change, have a lower expression in boys than in girls and are placed in chromosome X (name starts by TC0X).
We can get an overview of the genome distribution of these transcripts by plotting a Manhattan plot. We will also highlight the transcripts of our target region:
targetRange <- GRanges("chr11:102600000-103300000")
plot(gexpRes, rid = "DiffMean", type = "manhattan", main = "Differences in Means", highlight = targetRangeNum)
There are several transcripts with highly significant differences in means in chromosomes X and Y. However, transcripts in our target region are not significantly changed.
We have included in MEAL two
ways to assess association between gene expression and DNA methylation.
The first option is to simultaneously examine the gene expression and
DNA methylation results of a target region. We can do this with the
function plotRegion
and passing two ResultSets.
We will examine the methylation and gene expression results in our target region. For the sake of clarity, we will only print DiffMean and bumphunter results.
Differences in methylation are represented in black and differences in gene expression are represented in blue. We can see that CpGs with high differences are near transcripts with highest differences expression.
We can get a quantitative estimate of the association between
methylation and gene expression using the function
correlationMethExprs
. This function applies a method
previously described in (Steenaard et al.
2015). CpGs and expression probes are paired by proximity.
Expression probes are paired to a CpG if they are completely inside a
range of ± 250Kb from the CpG. This
distance of 250Kb is set by default but it can be changed with the
parameter flank (whose units are bases). The correlation between
methylation and expression is done by a linear regression.
To account for technical (e.g. batch) or biological (e.g. sex, age) artifacts, a model including those variables (z) is fitted: $$ x_{ij} = \sum_{k = 1}^K{\beta_{ik} z_{kj}} + r_{ij}, i = 1, ..., P $$ where xij is the methylation or expression level of probe i (P is the total number of probes) for individual j, βik is the effect of variable k at probe i, zkj is the value of variable k (K is the total number of covariates) for indivudal j and rij is the residual value of probe i and individual j. The residuals of methylation and expression are used to assess the correlation.
Therefore, let’s create a new MultiDataSet
with
expression and methylation data. The function
createMultiDataSet
creates an empty
MultiDataSet
. Then, we can add gene expression and
methylation datasets with the function add_genexp
and
add_methy
:
multi <- createMultiDataSet()
multi <- add_genexp(multi, brge_gexp)
multi <- add_methy(multi, brge_methy)
In this analysis, we will only take into account the probes in our
target region. We can subset our MultiDataSet passing a
GenomicRanges
to the third position of [
:
Now, we will compute the correlation between the CpGs and the expression probes.
## cpg exprs Beta se P.Value adj.P.Val
## 1 cg08804111 TC11002235.hg.1 0.7496396 0.2143482 0.002572113 0.7462925
## 2 cg09615874 TC11002239.hg.1 -2.2805946 0.6419809 0.002275597 0.7462925
## 3 cg02344527 TC11003311.hg.1 -3.8018045 1.1008235 0.002834022 0.7462925
## 4 cg01766252 TC11002232.hg.1 7.9602956 2.4425111 0.004357524 0.8538027
## 5 cg16466334 TC11002239.hg.1 5.8514095 1.8510037 0.005403815 0.8538027
## 6 cg13759446 TC11000962.hg.1 -7.6725222 2.8959170 0.016308600 0.8747493
The results shows the CpG name, the expression probe, the linear regression coeffcient and its standard error, and the p-value and adjusted p-value (using B&H). Results are ordered by the adjusted p-value, so we can see than in our data there are no correlated CpGs-expression probes.
This case example shows the main functionalities of MEAL package. It has been shown how to evaluate the effect of a phenotype on the methylation and on the expression at genome wide and at region level. Finally, the integration between methylation and expression is tested by checking if there are expression probes correlated to the methylated probes.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] missMethyl_1.39.14
## [2] IlluminaHumanMethylationEPICv2anno.20a1.hg38_1.0.0
## [3] IlluminaHumanMethylationEPICanno.ilm10b4.hg19_0.6.0
## [4] brgedata_1.27.0
## [5] ggplot2_3.5.1
## [6] minfiData_0.51.0
## [7] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
## [8] IlluminaHumanMethylation450kmanifest_0.4.0
## [9] minfi_1.51.0
## [10] bumphunter_1.49.0
## [11] locfit_1.5-9.10
## [12] iterators_1.0.14
## [13] foreach_1.5.2
## [14] Biostrings_2.75.0
## [15] XVector_0.45.0
## [16] SummarizedExperiment_1.35.5
## [17] MatrixGenerics_1.17.1
## [18] matrixStats_1.4.1
## [19] GenomicRanges_1.57.2
## [20] GenomeInfoDb_1.41.2
## [21] IRanges_2.39.2
## [22] S4Vectors_0.43.2
## [23] MEAL_1.37.0
## [24] MultiDataSet_1.33.0
## [25] Biobase_2.67.0
## [26] BiocGenerics_0.53.0
## [27] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.37.1 bitops_1.0-9
## [3] httr_1.4.7 RColorBrewer_1.1-3
## [5] tools_4.4.1 doRNG_1.8.6
## [7] backports_1.5.0 utf8_1.2.4
## [9] R6_2.5.1 vegan_2.6-8
## [11] HDF5Array_1.33.8 lazyeval_0.2.2
## [13] mgcv_1.9-1 Gviz_1.49.0
## [15] rhdf5filters_1.17.0 permute_0.9-7
## [17] withr_3.0.2 prettyunits_1.2.0
## [19] gridExtra_2.3 base64_2.0.2
## [21] preprocessCore_1.67.1 cli_3.6.3
## [23] labeling_0.4.3 sass_0.4.9
## [25] readr_2.1.5 genefilter_1.87.0
## [27] askpass_1.2.1 Rsamtools_2.21.2
## [29] foreign_0.8-87 siggenes_1.79.0
## [31] illuminaio_0.47.0 rentrez_1.2.3
## [33] dichromat_2.0-0.1 scrime_1.3.5
## [35] BSgenome_1.75.0 limma_3.61.12
## [37] rstudioapi_0.17.1 RSQLite_2.3.7
## [39] generics_0.1.3 BiocIO_1.17.0
## [41] dplyr_1.1.4 Matrix_1.7-1
## [43] interp_1.1-6 qqman_0.1.9
## [45] fansi_1.0.6 abind_1.4-8
## [47] lifecycle_1.0.4 yaml_2.3.10
## [49] rhdf5_2.49.0 SparseArray_1.5.45
## [51] BiocFileCache_2.15.0 grid_4.4.1
## [53] blob_1.2.4 crayon_1.5.3
## [55] lattice_0.22-6 GenomicFeatures_1.57.1
## [57] annotate_1.85.0 KEGGREST_1.45.1
## [59] sys_3.4.3 maketools_1.3.1
## [61] pillar_1.9.0 knitr_1.48
## [63] beanplot_1.3.1 rjson_0.2.23
## [65] codetools_0.2-20 glue_1.8.0
## [67] data.table_1.16.2 vctrs_0.6.5
## [69] png_0.1-8 gtable_0.3.6
## [71] cachem_1.1.0 xfun_0.48
## [73] S4Arrays_1.5.11 survival_3.7-0
## [75] statmod_1.5.0 nlme_3.1-166
## [77] bit64_4.5.2 progress_1.2.3
## [79] filelock_1.0.3 bslib_0.8.0
## [81] nor1mix_1.3-3 rpart_4.1.23
## [83] colorspace_2.1-1 DBI_1.2.3
## [85] Hmisc_5.2-0 nnet_7.3-19
## [87] tidyselect_1.2.1 bit_4.5.0
## [89] compiler_4.4.1 curl_5.2.3
## [91] httr2_1.0.5 htmlTable_2.4.3
## [93] xml2_1.3.6 DelayedArray_0.33.1
## [95] rtracklayer_1.65.0 checkmate_2.3.2
## [97] scales_1.3.0 quadprog_1.5-8
## [99] rappdirs_0.3.3 stringr_1.5.1
## [101] digest_0.6.37 rmarkdown_2.28
## [103] GEOquery_2.73.5 htmltools_0.5.8.1
## [105] pkgconfig_2.0.3 jpeg_0.1-10
## [107] base64enc_0.1-3 sparseMatrixStats_1.17.2
## [109] highr_0.11 dbplyr_2.5.0
## [111] fastmap_1.2.0 ensembldb_2.29.1
## [113] rlang_1.1.4 htmlwidgets_1.6.4
## [115] UCSC.utils_1.1.0 DelayedMatrixStats_1.29.0
## [117] farver_2.1.2 jquerylib_0.1.4
## [119] jsonlite_1.8.9 BiocParallel_1.41.0
## [121] mclust_6.1.1 VariantAnnotation_1.51.2
## [123] RCurl_1.98-1.16 magrittr_2.0.3
## [125] Formula_1.2-5 GenomeInfoDbData_1.2.13
## [127] Rhdf5lib_1.27.0 munsell_0.5.1
## [129] Rcpp_1.0.13 stringi_1.8.4
## [131] zlibbioc_1.51.2 MASS_7.3-61
## [133] plyr_1.8.9 org.Hs.eg.db_3.20.0
## [135] ggrepel_0.9.6 deldir_2.0-4
## [137] splines_4.4.1 multtest_2.61.0
## [139] hms_1.1.3 rngtools_1.5.2
## [141] buildtools_1.0.0 biomaRt_2.63.0
## [143] XML_3.99-0.17 evaluate_1.0.1
## [145] calibrate_1.7.7 latticeExtra_0.6-30
## [147] biovizBase_1.55.0 BiocManager_1.30.25
## [149] tzdb_0.4.0 tidyr_1.3.1
## [151] openssl_2.2.2 purrr_1.0.2
## [153] reshape_0.8.9 xtable_1.8-4
## [155] restfulr_0.0.15 AnnotationFilter_1.31.0
## [157] tibble_3.2.1 memoise_2.0.1
## [159] AnnotationDbi_1.69.0 GenomicAlignments_1.41.0
## [161] cluster_2.1.6