OVESEG-test (One Versus Everyone Subtype Exclusively-expressed Genes test) is a statistically-principled multiple-group comparison method that can detect tissue/cell-specific marker genes (MGs) among many subtypes (e.g. tissue/cell types) (Chen et al. 2019). To assess the statistical significance of MGs, OVESEG-test uses a specifically designed test statistics that mathematically matches the definition of MGs, and employs a novel permutation scheme to estimate the corresponding distribution under null hypothesis where the expression patterns of non-MGs can be highly complex.
OVESEG
package provides functions to compute OVESEG-test
statistics, derive component weights in the mixture null distribution
model and estimate p-values from weightedly aggregated permutations.
While OVESEG-test p-values can be used to detect MGs, component weights
of hypotheses also help portrait all kinds of upregulation patterns
among tissue/cell types.
The function OVESEGtest()
includes all necessary steps
to obtain OVESEG-test p-values. You need to specify the expression
matrix, tissue/cell type labels, number of permutations, and parallel
option. For microarray data, the expressions need to be log2-transformed
prior to the test. For RNAseq count data, the counts need to be
transformed to logCPM (log2-counts per million) prior to the test and
use limma::voom()
to incorporate the mean-variance
relationship. Parallel option is set according to BiocParallel
package.
Theoretically, OVESEG-test can be applied to any molecular expression
data types as long as they can be modeled by normal distribution after a
transformation, such as log2-transformation for microarray and
proteomics data, logCPM for RNAseq counts, logit2-transformation for DNA
methylation beta values. If mean-variance relationship exists after
transformation, limma::vooma()/voom()
need to be performed
firstly and the resulted weight matrix is used as the input of
OVESEG
.
Requirements for the input expression data:
The input expression data should be stored in a matrix. Data frame, or SummarizedExperiment object is also accepted and will be internally coerced into a matrix format before analysis. Rows correspond to probes and columns to samples. Tissue/cell labels must match each column in expression matrix. Weight matrix, if provided, must match each spot in expression matrix.
We use a data set of purified B cells, CD4+ T cells and CD8+ T cells (downsampled from GSE28490) as an example to show OVESEG-test on microarray data.
library(OVESEG)
#Import data
data(RocheBT)
y <- RocheBT$y #5000*15 matrix
group <- RocheBT$group #15 labels
#filter low-expressed probes
groupMean <- sapply(levels(group), function(x) rowMeans(y[,group == x]))
groupMeanMax <- apply(groupMean, 1, max)
keep <- (groupMeanMax > log2(64))
y <- y[keep,]
#OVESEG-test
rtest1 <- OVESEGtest(y, group, NumPerm = 999,
BPPARAM=BiocParallel::SnowParam())
#> Calculating posterior probabilities of null hypotheses
#> Permuting top 2 groups
#> Permuting top 3 groups
#> Calculating p-values
#> Calculating p-values for each group specifically
Note there are many low-expressed probe filtering methods. Here we
filtered the probes whose mean expression is less than a threshold even
in the highest expressed group. If mean-variance relationship is
obvious, we can consider to apply limma::vooma()
firstly.
#OVESEG-test
yvooma <- limma::vooma(y, model.matrix(~0+factor(group)))
rtest2 <- OVESEGtest(yvooma$E, group, weights=yvooma$weights, NumPerm = 999,
BPPARAM=BiocParallel::SnowParam())
#> Calculating posterior probabilities of null hypotheses
#> Permuting top 2 groups
#> Permuting top 3 groups
#> Calculating p-values
#> Calculating p-values for each group specifically
We use a data set of purified B cells, CD4+ T cells and CD8+ T cells (downsampled from GSE60424) as an example to show OVESEG-test on RNAseq count data.
library(OVESEG)
#Import data
data(countBT)
count <- countBT$count #10000*60 matrix
group <- countBT$group #60 labels
#filter low-expressed probes
groupMean <- sapply(levels(group), function(x) rowMeans(count[,group == x]))
groupMeanMax <- apply(groupMean, 1, max)
keep <- (groupMeanMax > 2)
count <- count[keep,]
#OVESEG-test
lib.size <- max(colSums(count))
yvoom <- limma::voom(count, model.matrix(~0+factor(group)),
lib.size = lib.size)
rtest3 <- OVESEGtest(yvoom$E, group, weights=yvoom$weights, NumPerm = 999,
BPPARAM=BiocParallel::SnowParam())
#> Calculating posterior probabilities of null hypotheses
#> Permuting top 2 groups
#> Permuting top 3 groups
#> Calculating p-values
#> Calculating p-values for each group specifically
Note there are many low-expressed probe filtering methods. Here we
filtered the probes whose mean count is less than a threshold even in
the highest expressed group. lib.size
and other parameters
in limma::voom()
can be set manually according to limma
package.
OVESEGtest
returns three p-values:
pv.overall
, pv.oneside
and
pv.multiside
. pv.overall
is calculated by all
permutations regardless of upregulated subtypes. pv.oneside
is subtype-specific p-values calculated specifically for the upregulated
subtype of each probe. pv.multiside
is
pv.oneside
multiplied by K (K comparison correction) and
truncated at 1. More details can be found in the paper (Chen et al. 2019).
OVESEG-test statistics are defined as $$\mathbf{\max_{k=1,...,K}\left\{min_{l \neq
k}\left\{
\frac{\mu_k(j)-\mu_l(j)}{\sigma(j)\sqrt{\frac{1}{N_k}+\frac{1}{N_l}}}
\right\}\right\}}$$ where μk(j)
is the mean of logarithmic expressions of gene j in subtype k. While it
takes a long time to execute permutations for p-value estimation,
OVESEGtstat()
is useful if users only need test statistics
for ranking genes:
Null hypothesis of OVESEG-test is modeled as a mixture of multiple
components, where the weights of components are estimated from posterior
probabilities over all probes. Those posterior probabilities are also
returned by OVESEGtest()
. If users only want to observe
probewise posterior probabilities, postProbNull()
is
helpful.
ppnull1 <- postProbNull(y, RocheBT$group)
ppnull2 <- postProbNull(yvooma$E, RocheBT$group, weights=yvooma$weights)
ppnull3 <- postProbNull(yvoom$E, countBT$group, weights=yvoom$weights)
By accumulating and normalizing probewise posterior probability with
the function nullDistri()
, we can also obtain the
probability of one subtype being upregulated conditioned on null
hypotheses. The subtype with higher probability tends to get more False
Positive MGs.
nullDistri(ppnull1)
#> B CD4 CD8
#> 0.2544406 0.3699904 0.3755690
nullDistri(ppnull2)
#> B CD4 CD8
#> 0.2425531 0.3695834 0.3878635
nullDistri(ppnull3)
#> B CD4 CD8
#> 0.2555458 0.3664025 0.3780517
There are totally 2K − 1
types of expression patterns in a real dataset: probes exclusively
expressed in 1,2,…, or K of subtypes. Probe unexpressed in any of
subtypes should have been filtered during pre-processing. Accumulating
and normalizing probewise posterior probability of null hypotheses and
of alternative hypotheses using the function
patternDistri()
can present us the ratios of all possible
upregulation patterns among subtypes:
The ratios of patterns can be illustrated as following.
library(gridExtra)
library(grid)
library(reshape2)
library(ggplot2)
gridpatterns <- function (ppnull) {
K <- length(ppnull$label)
cellcomp <- patternDistri(ppnull)
cellcomp <- cellcomp[order(cellcomp[,K+1], decreasing = TRUE),]
#Bar Chart to show pattern percentage
DF1 <- data.frame(Rank = seq_len(2^K - 1), cellcomp)
p1 <- ggplot(DF1, aes(x = Rank, y = Wpattern)) +
geom_bar(stat = "identity") +
scale_y_continuous(labels=scales::percent) +
theme_bw(base_size = 16) +
theme(axis.text.x = element_blank(),
axis.title.x = element_blank(),
plot.margin=unit(c(1,1,-0.2,1), "cm"),
panel.grid.minor = element_line(size = 1),
panel.grid.major = element_line(size = 1),
panel.border = element_blank(),
axis.ticks.x = element_blank()) +
ylab("Percentage of up in certain subtype(s)")
#patterns as X-axis
DF2 <- data.frame(Rank = seq_len(2^K-1), cellcomp[,-(K+1)])
DF2 <- melt(DF2, id.var="Rank")
p2 <- ggplot(DF2, aes(x = Rank, y = value, fill = variable)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette="Set2") +
theme(line = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
title = element_blank(),
panel.background = element_rect(fill = "white"),
legend.position="bottom",
legend.key.size = unit(1.2,"line"),
plot.margin=unit(c(-0.2,1,1,1), "cm")) +
scale_y_reverse() +
guides(fill = guide_legend(nrow = 1, byrow = TRUE))
g1 <- ggplotGrob(p1)
g2 <- ggplotGrob(p2)
colnames(g1) <- paste0(seq_len(ncol(g1)))
colnames(g2) <- paste0(seq_len(ncol(g2)))
g <- gtable_combine(g1, g2, along=2)
panels <- g$layout$t[grepl("panel", g$layout$name)]
n1 <- length(g$heights[panels[1]])
n2 <- length(g$heights[panels[2]])
# assign new (relative) heights to the panels
# notice the *4 here to get different heights
g$heights[panels[1]] <- unit(n1*4,"null")
g$heights[panels[2]] <- unit(n2,"null")
return(g)
}
grid.newpage()
grid.draw(gridpatterns(ppnull1))
#> Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
#> ℹ Please use the `linewidth` argument instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#or grid.draw(gridpatterns(rtest1))
Default variance estimator in OVESEG
is empirical Bayes
moderated variance estimator used in limma
(argument alpha="moderated"
). Another option is adding a
constant α to pooled variance
estimator (argument alpha=
α). Setting argument
alpha=NULL
will treat all variances as the same
constant.
Before MG detection, OVESEG-test p-values still need multiple comparison correction or false discovery rate control.
##multiple comparison correction
pv.overall.adj <- stats::p.adjust(rtest1$pv.overall, method="bonferroni")
pv.multiside.adj <- stats::p.adjust(rtest1$pv.multiside, method="bonferroni")
##fdr control
qv.overall <- fdrtool::fdrtool(rtest1$pv.overall, statistic="pvalue",
plot=FALSE, verbose=FALSE)$qval
qv.multiside <- fdrtool::fdrtool(rtest1$pv.multiside, statistic="pvalue",
plot=FALSE, verbose=FALSE)$qval
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] grid stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 reshape2_1.4.4 gridExtra_2.3 OVESEG_1.23.0
#> [5] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 utf8_1.2.4 stringi_1.8.4
#> [4] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.1
#> [7] RColorBrewer_1.1-3 fastmap_1.2.0 plyr_1.8.9
#> [10] jsonlite_1.8.9 limma_3.63.2 BiocManager_1.30.25
#> [13] fansi_1.0.6 scales_1.3.0 codetools_0.2-20
#> [16] jquerylib_0.1.4 cli_3.6.3 rlang_1.1.4
#> [19] crayon_1.5.3 munsell_0.5.1 withr_3.0.2
#> [22] cachem_1.1.0 yaml_2.3.10 tools_4.4.2
#> [25] parallel_4.4.2 BiocParallel_1.41.0 colorspace_2.1-1
#> [28] buildtools_1.0.0 vctrs_0.6.5 fdrtool_1.2.18
#> [31] R6_2.5.1 lifecycle_1.0.4 stringr_1.5.1
#> [34] pkgconfig_2.0.3 bslib_0.8.0 pillar_1.9.0
#> [37] gtable_0.3.6 glue_1.8.0 Rcpp_1.0.13-1
#> [40] statmod_1.5.0 xfun_0.49 tibble_3.2.1
#> [43] sys_3.4.3 knitr_1.49 farver_2.1.2
#> [46] htmltools_0.5.8.1 snow_0.4-4 rmarkdown_2.29
#> [49] maketools_1.3.1 labeling_0.4.3 compiler_4.4.2