Up to now, more than 10,000 methylation samples from the
state-of-the-art 450K microarray have been made available through The
Cancer Genome Atlas portal (The Cancer Genome
Atlas 2014) and the Gene Expression Omnibus (GEO) (Edgar, Domrachev, and Lash 2002). Large-scale
comparison studies, for instance between cancers or tissues, become
possible epigenome-widely. These large studies often require a
substantial amount of time spent on preprocessing the data and
performing quality control. For such studies, it is not rare to
encounter significant batch effects, and those can have a dramatic
impact on the validity of the biological results (Leek et al. 2010), (Harper, Peters, and Gamble 2013). With that in
mind, we developed shinyMethyl
to make the preprocessing of
large 450K datasets intuitive, enjoyable and reproducible.
shinyMethyl
is an interactive visualization tool for
Illumina 450K methylation array data based on the packages minfi and
shiny.
A few mouse clicks allow the user to appreciate insightful biological
inter-array differences on a large scale. The goal of
shinyMethyl
is two-fold: (1) summarize a high-dimensional
450K array experiment into an exportable small-sized R object and (2)
launch an interactive visualization tool for quality control assessment
as well as exploration of global methylation patterns associated with
different phenotypes.
To take a quick look at how the interactive interface of
shinyMethyl
works, we have included an example dataset in
the companion package shinyMethylData.
The dataset contains the extracted data of 369 Head and Neck cancer
samples downloaded from The Cancer Genome Atlas (TCGA) data portal (The Cancer Genome Atlas 2014): 310 tumor
samples, 50 matched normals and 9 replicates of a control cell line. The
first shinyMethylSet
object (see Section 3 for the
definition of a shinyMethylSet
object) was created from the
raw data (no normalization) and is stored under the name
summary.tcga.raw.rda
; the second
shinyMethylSet
object was created from a
GenomicRatioSet
containing the normalized data and the file
is stored under the name summary.tcga.norm.rda
. The samples
were normalized using functional normalization, a preprocessing
procedure that we recently developed for heterogeneous methylation data
(Fortin et al. 2014).
To launch shinyMethyl
with this TCGA dataset, simply
type the following commands in a fresh R session:
The interactive interface will take a few seconds to be launched in your default HTML browser.
In this section, we describe how to launch an interactive visualization for your methylation dataset.
An RGChannelSet
is an object defined in
minfi
containing the raw intensities of the green and red
channels of your 450K experiment. To create an
RGChannelSet
, you will need to have the raw files of the
experiment with extension .IDAT (we refer to those as .IDAT files). In
case you do not have these files, you might want to ask your
collaborators or your processing core if they have those. You absolutely
need them to both use the packages minfi
and
shinyMethyl
. The vignette in minfi
describes
carefully how to read the data in for different scenarios and how to
construct an RGChannelSet
. Here, we show a quick way to
create an RGChannelSet
from the .IDAT files contained in
the package minfiData
.
We need to tell R which directory contains the .IDAT files and the experiment sheet:
We also need to read in the experiment sheet:
Finally, we construct the RGChannelSet
using
read.450k.exp
:
The function pData
in minfi
allows to see
the phenotype data of the samples:
From the RGChannelSet
created in the previous section,
we create a shinyMethylSet
by using the command
shinySummarize
This is a small object containing all of the necessary information extracted from a RGChannelSet to launch shinyMethyl.
The different figures at the end of the vignette explain how to use
each of the shinyMethyl
panels.
shinyMethyl
also offers the possibility to visualize
normalized data that are stored in a GenomicRatioSet
object. For instance, suppose we normalize the data by using the
quantile normalization algorithm implemented in minfi
(this
function returns a GenomicRatioSet
object by default):
We can then create two separate shinyMethylSet
objects
corresponding to the raw and normalized data respectively:
To launch the shinyMethyl
interface, use
runShinyMethyl
with the first argument being the
shinyMethylSet
extracted from the raw data and the second
argument being the shinyMethylSet
extracted from the data
as follows:
A shinyMethylSet
object contains several summary data
from a 450K experiment: the names of the samples, a data frame for the
phenotype, a list of quantiles for the M and Beta values, a list of
quantiles for the methylated and unmethylated channels intensities and a
list of quantiles for the copy numbers, the green and red intensities of
different control probes, and the principal component analysis (PCA)
results performed on the Beta values. One can access the different
summaries by using the slot operator @
. The slot names can
be obtained with the function slotNames
as follows:
## [1] "sampleNames" "phenotype" "mQuantiles" "betaQuantiles"
## [5] "methQuantiles" "unmethQuantiles" "cnQuantiles" "greenControls"
## [9] "redControls" "pca" "originObject" "array"
For instance, one can retrieve the phenotype by
## gender caseControlStatus plate position
## 5775446049_R01C01 MALE Normal 577544 R01C01
## 5775446049_R01C02 FEMALE Normal 577544 R01C02
## 5775446049_R02C01 MALE Tumor 577544 R02C01
## 5775446049_R02C02 MALE Tumor 577544 R02C02
## 5775446049_R03C01 MALE Tumor 577544 R03C01
## 5775446049_R03C02 FEMALE Tumor 577544 R03C02
shinyMethyl
also contain different accessor functions to
access the slots. Please see the manual for more information.
This is an example of interactive visualization and quality control
assessment. The three plots react simultaneously to the user mouse
clicks and selected samples are represented in black. In this scenario,
colors represent batch, but colors can be chosen to reflect the
phenotype of the samples as well via the left-hand-side panel. The three
different plots are: A) Plot of the quality controls probes reacting to
the left-hand-side panel; the current plot shows the bisulfite
conversion probes intensities. B) Quality control plot as implemented in
minfi
: the median intensity of the M channel against the
median intensity of the U channel. Samples with bad quality would
cluster away from the cloud shown in the current plot. For this dataset,
all samples look good. C) Densities of the methylation intensities (can
be chosen to be Beta-values or M-values, and can be chosen by probe
type). The current plot shows the M-value densities for Infinium I
probes, for the raw data. The dashed and solid lines in black correspond
to the two samples selected by the user and match to the dots circled in
black in the left-hand plots. The left-hand-side panel allows users to
select different tuning parameters for the plots, as well as different
phenotypes for the colors. The user can click on the samples that seem
to have low quality, and can download the names of the samples in a csv
file for further analysis (not shown in the screenshot).}
The plot represents the physical slides (6 x 2 samples) on which the samples were assayed in the machine. The user can select the phenotype to color the samples. This plot allows to explore the quality of the randomization of the samples for a given phenotype.
The difference of the median copy number intensity for the Y chromosome and the median copy number intensity for the X chromosome can be used to separate males and females. In A), the user can select the vertical cutoff (dashed line) manually with the mouse to separate the two clusters (orange for females, blue for males). Corresponding Beta-value densities appear in B) for further validation. The predicted sex can be downloaded in a csv file in C), and samples for which the predicted sex differs from the sex provided in the phenotype will appear in D).
Users can select the principal components to visualize (PC1 and PC2 are shown in the current plot) and can choose the phenotype for the coloring. In the present plot, one can observe that the first two principal components distinguish tumor samples from normal samples for the TCGA example dataset (see example dataset section).
The distributional differences between the Type I and Type II probes can be observed for each sample (selected by the user).
As discussed in the Advanced option, normalized data can be added as well to the visualization interface. In the present plot, the top plot at the right shows the raw densities while the bottom plot shows the densities after functional normalization (Fortin et al. 2014).
In the first two plots are shown the densities of the M-values for Type I green probes before and after functional normalization (Fortin et al. 2014). Each curve represents one sample and different colors represent different batches. The last two plots show the average density for each plate before and after normalization. One can observe that functional normalization removed significantly global batch effects.
In the first two plots are shown the densities of the M-values for Type I green probes before (a) and after (b) functional normalization (Fortin et al. 2014). Green and blue densities represent tumor and normal samples respectively, and red densities represent 9 technical replicates of a control cell line. The last two plots show the average density for each sample group before and after normalization. Functional normalization preserves the expected marginal differences between normal and cancer, while reducing the variation between the technical controls (red lines).
Here is the output of sessionInfo
on the system on which
this document was compiled:
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] shinyMethylData_1.26.0
## [2] shinyMethyl_1.43.0
## [3] minfiData_0.52.0
## [4] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
## [5] IlluminaHumanMethylation450kmanifest_0.4.0
## [6] minfi_1.53.1
## [7] bumphunter_1.49.0
## [8] locfit_1.5-9.10
## [9] iterators_1.0.14
## [10] foreach_1.5.2
## [11] Biostrings_2.75.3
## [12] XVector_0.47.0
## [13] SummarizedExperiment_1.37.0
## [14] Biobase_2.67.0
## [15] MatrixGenerics_1.19.0
## [16] matrixStats_1.4.1
## [17] GenomicRanges_1.59.1
## [18] GenomeInfoDb_1.43.2
## [19] IRanges_2.41.2
## [20] S4Vectors_0.45.2
## [21] BiocGenerics_0.53.3
## [22] generics_0.1.3
## [23] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] jsonlite_1.8.9 magrittr_2.0.3
## [5] GenomicFeatures_1.59.1 rmarkdown_2.29
## [7] BiocIO_1.17.1 zlibbioc_1.52.0
## [9] vctrs_0.6.5 multtest_2.63.0
## [11] memoise_2.0.1 Rsamtools_2.23.1
## [13] DelayedMatrixStats_1.29.0 RCurl_1.98-1.16
## [15] askpass_1.2.1 htmltools_0.5.8.1
## [17] S4Arrays_1.7.1 curl_6.0.1
## [19] Rhdf5lib_1.29.0 SparseArray_1.7.2
## [21] rhdf5_2.51.1 sass_0.4.9
## [23] nor1mix_1.3-3 bslib_0.8.0
## [25] plyr_1.8.9 cachem_1.1.0
## [27] buildtools_1.0.0 GenomicAlignments_1.43.0
## [29] mime_0.12 lifecycle_1.0.4
## [31] pkgconfig_2.0.3 Matrix_1.7-1
## [33] R6_2.5.1 fastmap_1.2.0
## [35] GenomeInfoDbData_1.2.13 shiny_1.10.0
## [37] digest_0.6.37 siggenes_1.81.0
## [39] reshape_0.8.9 AnnotationDbi_1.69.0
## [41] RSQLite_2.3.9 base64_2.0.2
## [43] httr_1.4.7 abind_1.4-8
## [45] compiler_4.4.2 beanplot_1.3.1
## [47] rngtools_1.5.2 bit64_4.5.2
## [49] BiocParallel_1.41.0 DBI_1.2.3
## [51] HDF5Array_1.35.2 MASS_7.3-61
## [53] openssl_2.3.0 DelayedArray_0.33.3
## [55] rjson_0.2.23 tools_4.4.2
## [57] rentrez_1.2.3 httpuv_1.6.15
## [59] glue_1.8.0 quadprog_1.5-8
## [61] restfulr_0.0.15 promises_1.3.2
## [63] nlme_3.1-166 rhdf5filters_1.19.0
## [65] grid_4.4.2 tzdb_0.4.0
## [67] preprocessCore_1.69.0 tidyr_1.3.1
## [69] data.table_1.16.4 hms_1.1.3
## [71] xml2_1.3.6 pillar_1.10.0
## [73] limma_3.63.2 genefilter_1.89.0
## [75] later_1.4.1 splines_4.4.2
## [77] dplyr_1.1.4 lattice_0.22-6
## [79] survival_3.8-3 rtracklayer_1.67.0
## [81] bit_4.5.0.1 GEOquery_2.75.0
## [83] annotate_1.85.0 tidyselect_1.2.1
## [85] maketools_1.3.1 knitr_1.49
## [87] xfun_0.49 scrime_1.3.5
## [89] statmod_1.5.0 UCSC.utils_1.3.0
## [91] yaml_2.3.10 evaluate_1.0.1
## [93] codetools_0.2-20 tibble_3.2.1
## [95] BiocManager_1.30.25 cli_3.6.3
## [97] xtable_1.8-4 jquerylib_0.1.4
## [99] Rcpp_1.0.13-1 png_0.1-8
## [101] XML_3.99-0.17 readr_2.1.5
## [103] blob_1.2.4 mclust_6.1.1
## [105] doRNG_1.8.6 sparseMatrixStats_1.19.0
## [107] bitops_1.0-9 illuminaio_0.49.0
## [109] purrr_1.0.2 crayon_1.5.3
## [111] rlang_1.1.4 KEGGREST_1.47.0