Multi-sample normalization techniques such as quantile normalization Bolstad et al. (2003), Irizarry et al. (2003) have become a standard and essential part of analysis pipelines for high-throughput data. Although it was originally developed for gene expression microarrays, it is now used across many different high-throughput applications including genotyping arrays, DNA Methylation, RNA Sequencing (RNA-Seq) and Chromatin Immunoprecipitation Sequencing (ChIP-Seq). These techniques transform the original raw data to remove unwanted technical variation. However, quantile normalization and other global normalization methods rely on assumptions about the data generation process that are not appropriate in some context. Until now, it has been left to the researcher to check for the appropriateness of these assumptions.
Quantile normalization assumes that the statistical distribution of each sample is the same. Normalization is achieved by forcing the observed distributions to be the same and the average distribution, obtained by taking the average of each quantile across samples, is used as the reference. This method has worked very well in practice but note that when the assumptions are not met, global changes in distribution, that may be of biological interest, will be wiped out and features that are not different across samples can be artificially induced. These types of assumptions are justified in many biomedical applications, for example in gene expression studies in which only a minority of genes are expected to be differentially expressed. However, if, for example, a substantially higher percentage of genes are expected to be expressed in only one group of samples, it may not be appropriate to use global adjustment methods.
The quantro
R-package can be used to test for global
differences between groups of distributions which asses whether global
normalization methods such as quantile normalization should be applied.
Our method uses the raw unprocessed high-throughput data to test for
global differences in the distributions across a set of groups. The main
function quantro()
will perform two tests:
An ANOVA to test if the medians of the distributions are different across groups. Differences across groups could be attributed to unwanted technical variation (such as batch effects) or real global biological variation. This is a helpful step for the user to verify if there is any technical variation unaccounted for.
A test for global differences between the distributions across
groups which returns a test statistic called quantroStat
.
This test statistic is a ratio of two variances and is similar to the
idea of ANOVA. The main idea is to compare the variability of
distributions within groups relative to between groups. If the
variability between groups is sufficiently larger than the variability
within groups, then this suggests global adjustment methods may not be
appropriate. As a default, we perform this test on the median normalized
data, but the user may change this option.
The R-package quantro can be installed from the Bioconductor
flowSorted
Data ExampleTo explore how to use quantro()
, we use the
FlowSorted.DLPFC.450k
data package in Bioconductor Jaffe and Kaminsky (2021). The data in this
package originally came from Guintivano, Aryee,
and Kaminsky (2013). This data set in
FlowSorted.DLPFC.450k
contains raw data objects of 58
Illumina 450K DNA methylation microarrays, formatted as
RGset
objects. The samples represent two different cellular
populations of brain tissues on the same 29 individuals extracted using
flow sorting. For more information on this data set, please see the
FlowSorted.DLPFC.450k
User’s Guide.For the purposes of this
vignette, a MethylSet object from the minfi
Bioconductor
package Aryee et al. (2014) was created
which is a subset of the rows from the original
FlowSorted.DLPFC.450k
data set. This MethylSet
object is found in the /data folder and the script to create the object
is found in /inst.
Here we will explore the distributions of these two cellular
populations of brain tissue (NeuN_pos
and
NeuN_neg
) and then test if there are global differences in
the distributions across groups. First, load the MethylSet object
(flowSorted
) and compute the Beta values using the function
getBeta()
in the minfi
Bioconductor package.
We use an offset of 100 as this is the default used by Illumina.
library(minfi)
data(flowSorted)
p <- getBeta(flowSorted, offset = 100)
pd <- pData(flowSorted)
dim(p)
## [1] 10000 58
## DataFrame with 6 rows and 11 columns
## Sample_Name SampleID CellType Sentrix.Barcode Sample.Section
## <character> <character> <character> <numeric> <character>
## 813_N 813_N 813 NeuN_pos 7766130090 R02C01
## 1740_N 1740_N 1740 NeuN_pos 7766130090 R02C02
## 1740_G 1740_G 1740 NeuN_neg 7766130090 R04C01
## 1228_G 1228_G 1228 NeuN_neg 7766130090 R04C02
## 813_G 813_G 813 NeuN_neg 7766130090 R06C01
## 1228_N 1228_N 1228 NeuN_pos 7766130090 R06C02
## diag sex ethnicity age PMI
## <character> <character> <character> <integer> <integer>
## 813_N Control Female Caucasian 30 14
## 1740_N Control Female African 13 17
## 1740_G Control Female African 13 17
## 1228_G Control Male Caucasian 47 13
## 813_G Control Female Caucasian 30 14
## 1228_N Control Male Caucasian 47 13
## BasePath
## <character>
## 813_N /dcs01/lieber/ajaffe..
## 1740_N /dcs01/lieber/ajaffe..
## 1740_G /dcs01/lieber/ajaffe..
## 1228_G /dcs01/lieber/ajaffe..
## 813_G /dcs01/lieber/ajaffe..
## 1228_N /dcs01/lieber/ajaffe..
quantro
contains two functions to view the distributions
of the samples of interest: matdensity()
and
matboxplot()
. The function matdensity()
computes the density for each sample (columns) and uses the
matplot()
function to plot all the densities.
matboxplot()
orders and colors the samples by a group level
variable. These two functions use the RColorBrewer
package
and the brewer palettes can be changed using the arguments
brewer.n
and brewer.name
.
The distributions of the two groups of cellular populations are shown
here. The NeuN_neg
samples are colored in green and the
NeuN_pos
are colored in red.
quantro()
functionquantro()
The quantro()
function must have two objects as
input:
An object
which is a data frame or matrix with
observations (e.g. probes or genes) on the rows and samples as the
columns.
A groupFactor
which represents the group level
information about each sample. For example if the samples represent
tumor and normal samples, provide quantro()
with a factor
representing which columns in the object
are normal and
tumor samples.
quantro()
In this example, the groups we are interested in comparing are
contained in the CellType
column in the pd
dataset. To run the quantro()
function, input the data
object and the object containing the phenotypic data. Here we use the
flowSorted
data set as an example.
## quantro: Test for global differences in distributions
## nGroups: 2
## nTotSamples: 58
## nSamplesinGroups: 29 29
## anovaPval: 0.01206
## quantroStat: 8.80735
## quantroPvalPerm: Use B > 0 for permutation testing.
The details related to the experiment can be extracted using the
summary
accessor function:
## $nGroups
## [1] 2
##
## $nTotSamples
## [1] 58
##
## $nSamplesinGroups
## NeuN_neg NeuN_pos
## 29 29
To asssess if the medians of the distributions different across
groups, we perform an ANOVA on the medians from the samples. Those
results can be found using anova
:
## Analysis of Variance Table
##
## Response: objectMedians
## Df Sum Sq Mean Sq F value Pr(>F)
## groupFactor 1 0.006919 0.0069194 6.7327 0.01206 *
## Residuals 56 0.057553 0.0010277
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The full output can be seen The test statistic produced from
quantro()
testing for global differences between
distributions is given by quantroStat
:
## [1] 8.807348
quantro()
also can accept objects that inherit
eSets
such as an ExpressionSet
or
MethylSet
. The groupFactor
must still be
provided.
## [1] TRUE
## quantro: Test for global differences in distributions
## nGroups: 2
## nTotSamples: 58
## nSamplesinGroups: 29 29
## anovaPval: 0.01206
## quantroStat: 8.80735
## quantroPvalPerm: Use B > 0 for permutation testing.
quantro()
Elements in the S4 object from quantro()
include:
Element | Description |
---|---|
summary |
A list that contains (1) number of groups (nGroups ),
(2) total number of samples (nTotSamples ) (3) number of
samples in each group (nSamplesinGroups ) |
anova |
ANOVA to test if the average medians of the distributions are different across groups |
MSbetween |
mean squared error between groups |
MSwithin |
mean squared error within groups |
quantroStat |
test statistic which is a ratio of the mean squared error between groups of distributions to the mean squared error within groups of distributions |
quantroStatPerm |
If B is not equal to 0, then a permutation test was
performed to assess the statistical significance of
quantroStat . These are the test statistics resulting from
the permuted samples |
quantroPvalPerm |
If B is not equal to 0, then this is the p-value associated with the
proportion of times the test statistics (quantroStatPerm )
resulting from the permuted samples were larger than
quantroStat |
To assess statistical significance of the test statistic, we use
permutation testing. We use the foreach
package which
distribute the computations across multiple cross in a single machine or
across multiple machines in a cluster. The user must pick how many
permutations to perform where B
is the number of
permutations. At each permutation of the samples, a test statistic is
calculated. The proportion of test statistics
(quantroStatPerm
) that are larger than the
quantroStat
is reported as the
quantroPvalPerm
. To use the foreach
package,
we first register a backend, in this case a machine with 1 cores.
library(doParallel)
registerDoParallel(cores=1)
qtestPerm <- quantro(p, groupFactor = pd$CellType, B = 1000)
qtestPerm
## quantro: Test for global differences in distributions
## nGroups: 2
## nTotSamples: 58
## nSamplesinGroups: 29 29
## anovaPval: 0.01206
## quantroStat: 8.80735
## quantroPvalPerm: 0.001
If permutation testing was used (i.e. specifying B
> 0), then there is a second function in
the package called quantroPlot()
which will plot the test
statistics of the permuted samples. The plot is a histogram of the null
test statistics quantroStatPerm
from quantro()
and the red line is the observed test statistic quantroStat
from quantro()
.
Additional options in the quantroPlot()
function
include:
Element | Description |
---|---|
xLab | the x-axis label |
yLab | the y-axis label |
mainLab | title of the histogram |
binWidth | change the binwidth |
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] doParallel_1.0.17 minfi_1.53.0
## [3] bumphunter_1.49.0 locfit_1.5-9.10
## [5] iterators_1.0.14 foreach_1.5.2
## [7] Biostrings_2.75.0 XVector_0.46.0
## [9] SummarizedExperiment_1.36.0 Biobase_2.67.0
## [11] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [13] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
## [15] IRanges_2.41.0 S4Vectors_0.44.0
## [17] BiocGenerics_0.53.0 quantro_1.41.0
## [19] knitr_1.48 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] jsonlite_1.8.9 magrittr_2.0.3
## [5] GenomicFeatures_1.59.0 farver_2.1.2
## [7] rmarkdown_2.28 BiocIO_1.17.0
## [9] zlibbioc_1.52.0 vctrs_0.6.5
## [11] multtest_2.63.0 memoise_2.0.1
## [13] Rsamtools_2.22.0 DelayedMatrixStats_1.29.0
## [15] RCurl_1.98-1.16 askpass_1.2.1
## [17] htmltools_0.5.8.1 S4Arrays_1.6.0
## [19] curl_5.2.3 Rhdf5lib_1.28.0
## [21] SparseArray_1.6.0 rhdf5_2.50.0
## [23] sass_0.4.9 nor1mix_1.3-3
## [25] bslib_0.8.0 plyr_1.8.9
## [27] cachem_1.1.0 buildtools_1.0.0
## [29] GenomicAlignments_1.43.0 lifecycle_1.0.4
## [31] pkgconfig_2.0.3 Matrix_1.7-1
## [33] R6_2.5.1 fastmap_1.2.0
## [35] GenomeInfoDbData_1.2.13 digest_0.6.37
## [37] colorspace_2.1-1 siggenes_1.80.0
## [39] reshape_0.8.9 AnnotationDbi_1.69.0
## [41] RSQLite_2.3.7 base64_2.0.2
## [43] labeling_0.4.3 fansi_1.0.6
## [45] httr_1.4.7 abind_1.4-8
## [47] compiler_4.4.1 beanplot_1.3.1
## [49] rngtools_1.5.2 withr_3.0.2
## [51] bit64_4.5.2 BiocParallel_1.41.0
## [53] DBI_1.2.3 highr_0.11
## [55] HDF5Array_1.35.0 MASS_7.3-61
## [57] openssl_2.2.2 DelayedArray_0.33.1
## [59] rjson_0.2.23 tools_4.4.1
## [61] rentrez_1.2.3 glue_1.8.0
## [63] quadprog_1.5-8 restfulr_0.0.15
## [65] nlme_3.1-166 rhdf5filters_1.18.0
## [67] grid_4.4.1 generics_0.1.3
## [69] gtable_0.3.6 tzdb_0.4.0
## [71] preprocessCore_1.68.0 tidyr_1.3.1
## [73] data.table_1.16.2 hms_1.1.3
## [75] xml2_1.3.6 utf8_1.2.4
## [77] pillar_1.9.0 limma_3.63.0
## [79] genefilter_1.89.0 splines_4.4.1
## [81] dplyr_1.1.4 lattice_0.22-6
## [83] survival_3.7-0 rtracklayer_1.66.0
## [85] bit_4.5.0 GEOquery_2.75.0
## [87] annotate_1.85.0 tidyselect_1.2.1
## [89] maketools_1.3.1 xfun_0.48
## [91] scrime_1.3.5 statmod_1.5.0
## [93] UCSC.utils_1.2.0 yaml_2.3.10
## [95] evaluate_1.0.1 codetools_0.2-20
## [97] tibble_3.2.1 BiocManager_1.30.25
## [99] cli_3.6.3 xtable_1.8-4
## [101] munsell_0.5.1 jquerylib_0.1.4
## [103] Rcpp_1.0.13 png_0.1-8
## [105] XML_3.99-0.17 ggplot2_3.5.1
## [107] readr_2.1.5 blob_1.2.4
## [109] mclust_6.1.1 doRNG_1.8.6
## [111] sparseMatrixStats_1.18.0 bitops_1.0-9
## [113] scales_1.3.0 illuminaio_0.49.0
## [115] purrr_1.0.2 crayon_1.5.3
## [117] rlang_1.1.4 KEGGREST_1.47.0