OVESEG User Manual

Introduction

OVESEG-test (One Versus Everyone Subtype Exclusively-expressed Genes test) is a statistically-principled multiple-group comparison method that can detect tissue/cell-specific marker genes (MGs) among many subtypes (e.g. tissue/cell types) (Chen et al. 2019). To assess the statistical significance of MGs, OVESEG-test uses a specifically designed test statistics that mathematically matches the definition of MGs, and employs a novel permutation scheme to estimate the corresponding distribution under null hypothesis where the expression patterns of non-MGs can be highly complex.

OVESEG package provides functions to compute OVESEG-test statistics, derive component weights in the mixture null distribution model and estimate p-values from weightedly aggregated permutations. While OVESEG-test p-values can be used to detect MGs, component weights of hypotheses also help portrait all kinds of upregulation patterns among tissue/cell types.

Quick Start

The function OVESEGtest() includes all necessary steps to obtain OVESEG-test p-values. You need to specify the expression matrix, tissue/cell type labels, number of permutations, and parallel option. For microarray data, the expressions need to be log2-transformed prior to the test. For RNAseq count data, the counts need to be transformed to logCPM (log2-counts per million) prior to the test and use limma::voom() to incorporate the mean-variance relationship. Parallel option is set according to BiocParallel package.

#Microarray
rtest <- OVESEGtest(y, group, NumPerm=999, 
                    BPPARAM=BiocParallel::SnowParam())

#RNAseq
yvoom <- limma::voom(count, model.matrix(~0+factor(group)))
rtest <- OVESEGtest(yvoom$E, group, weights=yvoom$weights, NumPerm = 999, 
                    BPPARAM=BiocParallel::SnowParam())

OVESEG-test

Datatypes and Input Format

Theoretically, OVESEG-test can be applied to any molecular expression data types as long as they can be modeled by normal distribution after a transformation, such as log2-transformation for microarray and proteomics data, logCPM for RNAseq counts, logit2-transformation for DNA methylation beta values. If mean-variance relationship exists after transformation, limma::vooma()/voom() need to be performed firstly and the resulted weight matrix is used as the input of OVESEG.

Requirements for the input expression data:

  • be transformed (log2/logCPM/logit2).
  • be already processed by normalization and batch effect removal.
  • no missing values; the molecules containing missing values should be removed prior to OVESEG-test.
  • no low-expressed molecules; otherwise, the results will be affected.

The input expression data should be stored in a matrix. Data frame, or SummarizedExperiment object is also accepted and will be internally coerced into a matrix format before analysis. Rows correspond to probes and columns to samples. Tissue/cell labels must match each column in expression matrix. Weight matrix, if provided, must match each spot in expression matrix.

Microarray data

We use a data set of purified B cells, CD4+ T cells and CD8+ T cells (downsampled from GSE28490) as an example to show OVESEG-test on microarray data.

library(OVESEG)
#Import data
data(RocheBT)
y <- RocheBT$y #5000*15 matrix
group <- RocheBT$group #15 labels

#filter low-expressed probes
groupMean <- sapply(levels(group), function(x) rowMeans(y[,group == x]))
groupMeanMax <- apply(groupMean, 1, max)
keep <- (groupMeanMax > log2(64))
y <- y[keep,]

#OVESEG-test
rtest1 <- OVESEGtest(y, group, NumPerm = 999, 
                     BPPARAM=BiocParallel::SnowParam())
#> Calculating posterior probabilities of null hypotheses
#> Permuting top 2 groups
#> Permuting top 3 groups
#> Calculating p-values
#> Calculating p-values for each group specifically

Note there are many low-expressed probe filtering methods. Here we filtered the probes whose mean expression is less than a threshold even in the highest expressed group. If mean-variance relationship is obvious, we can consider to apply limma::vooma() firstly.

#OVESEG-test
yvooma <- limma::vooma(y, model.matrix(~0+factor(group)))
rtest2 <- OVESEGtest(yvooma$E, group, weights=yvooma$weights, NumPerm = 999,
                     BPPARAM=BiocParallel::SnowParam())
#> Calculating posterior probabilities of null hypotheses
#> Permuting top 2 groups
#> Permuting top 3 groups
#> Calculating p-values
#> Calculating p-values for each group specifically

RNAseq count data

We use a data set of purified B cells, CD4+ T cells and CD8+ T cells (downsampled from GSE60424) as an example to show OVESEG-test on RNAseq count data.

library(OVESEG)
#Import data
data(countBT)
count <- countBT$count #10000*60 matrix
group <- countBT$group #60 labels

#filter low-expressed probes
groupMean <- sapply(levels(group), function(x) rowMeans(count[,group == x]))
groupMeanMax <- apply(groupMean, 1, max)
keep <- (groupMeanMax > 2)
count <- count[keep,]

#OVESEG-test
lib.size <- max(colSums(count))
yvoom <- limma::voom(count, model.matrix(~0+factor(group)), 
                     lib.size = lib.size)
rtest3 <- OVESEGtest(yvoom$E, group, weights=yvoom$weights, NumPerm = 999,
                     BPPARAM=BiocParallel::SnowParam())
#> Calculating posterior probabilities of null hypotheses
#> Permuting top 2 groups
#> Permuting top 3 groups
#> Calculating p-values
#> Calculating p-values for each group specifically

Note there are many low-expressed probe filtering methods. Here we filtered the probes whose mean count is less than a threshold even in the highest expressed group. lib.size and other parameters in limma::voom() can be set manually according to limma package.

p-values

OVESEGtest returns three p-values: pv.overall, pv.oneside and pv.multiside. pv.overall is calculated by all permutations regardless of upregulated subtypes. pv.oneside is subtype-specific p-values calculated specifically for the upregulated subtype of each probe. pv.multiside is pv.oneside multiplied by K (K comparison correction) and truncated at 1. More details can be found in the paper (Chen et al. 2019).

Useful intermediate results

Test statistics

OVESEG-test statistics are defined as $$\mathbf{\max_{k=1,...,K}\left\{min_{l \neq k}\left\{ \frac{\mu_k(j)-\mu_l(j)}{\sigma(j)\sqrt{\frac{1}{N_k}+\frac{1}{N_l}}} \right\}\right\}}$$ where μk(j) is the mean of logarithmic expressions of gene j in subtype k. While it takes a long time to execute permutations for p-value estimation, OVESEGtstat() is useful if users only need test statistics for ranking genes:

#OVESEG-test statistics
rtstat1 <- OVESEGtstat(y, RocheBT$group)
rtstat2 <- OVESEGtstat(yvooma$E, RocheBT$group, weights=yvooma$weights)
rtstat3 <- OVESEGtstat(yvoom$E, countBT$group, weights=yvoom$weights)

Posterior probabilities of null hypothesis components

Null hypothesis of OVESEG-test is modeled as a mixture of multiple components, where the weights of components are estimated from posterior probabilities over all probes. Those posterior probabilities are also returned by OVESEGtest(). If users only want to observe probewise posterior probabilities, postProbNull() is helpful.

ppnull1 <- postProbNull(y, RocheBT$group)
ppnull2 <- postProbNull(yvooma$E, RocheBT$group, weights=yvooma$weights)
ppnull3 <- postProbNull(yvoom$E, countBT$group, weights=yvoom$weights)

By accumulating and normalizing probewise posterior probability with the function nullDistri(), we can also obtain the probability of one subtype being upregulated conditioned on null hypotheses. The subtype with higher probability tends to get more False Positive MGs.

nullDistri(ppnull1)
#>         B       CD4       CD8 
#> 0.2544406 0.3699904 0.3755690
nullDistri(ppnull2)
#>         B       CD4       CD8 
#> 0.2425531 0.3695834 0.3878635
nullDistri(ppnull3)
#>         B       CD4       CD8 
#> 0.2555458 0.3664025 0.3780517

There are totally 2K1 types of expression patterns in a real dataset: probes exclusively expressed in 1,2,…, or K of subtypes. Probe unexpressed in any of subtypes should have been filtered during pre-processing. Accumulating and normalizing probewise posterior probability of null hypotheses and of alternative hypotheses using the function patternDistri() can present us the ratios of all possible upregulation patterns among subtypes:

patterns <- patternDistri(ppnull1)
#or
patterns <- patternDistri(rtest1)

The ratios of patterns can be illustrated as following.

library(gridExtra)
library(grid)
library(reshape2)
library(ggplot2)

gridpatterns <- function (ppnull) {
    K <- length(ppnull$label)
    cellcomp <- patternDistri(ppnull)
    cellcomp <- cellcomp[order(cellcomp[,K+1], decreasing = TRUE),]
    
    #Bar Chart to show pattern percentage
    DF1 <- data.frame(Rank = seq_len(2^K - 1), cellcomp)
    p1 <- ggplot(DF1, aes(x = Rank, y = Wpattern)) +
        geom_bar(stat = "identity") +
        scale_y_continuous(labels=scales::percent) +
        theme_bw(base_size = 16) +
        theme(axis.text.x = element_blank(),
              axis.title.x = element_blank(),
              plot.margin=unit(c(1,1,-0.2,1), "cm"),
              panel.grid.minor = element_line(size = 1),
              panel.grid.major = element_line(size = 1),
              panel.border = element_blank(),
              axis.ticks.x = element_blank()) +
        ylab("Percentage of up in certain subtype(s)")
    
    #patterns as X-axis
    DF2 <- data.frame(Rank = seq_len(2^K-1), cellcomp[,-(K+1)])
    DF2 <- melt(DF2, id.var="Rank")
    p2 <- ggplot(DF2, aes(x = Rank, y = value, fill = variable)) +
        geom_bar(stat = "identity") +
        scale_fill_brewer(palette="Set2") +
        theme(line = element_blank(),
              axis.text.x = element_blank(),
              axis.text.y = element_blank(),
              title = element_blank(),
              panel.background = element_rect(fill = "white"),
              legend.position="bottom",
              legend.key.size = unit(1.2,"line"),
              plot.margin=unit(c(-0.2,1,1,1), "cm")) +
        scale_y_reverse() +
        guides(fill = guide_legend(nrow = 1, byrow = TRUE))
    
    g1 <- ggplotGrob(p1)
    g2 <- ggplotGrob(p2)
    colnames(g1) <- paste0(seq_len(ncol(g1)))
    colnames(g2) <- paste0(seq_len(ncol(g2)))
    g <- gtable_combine(g1, g2, along=2)
    panels <- g$layout$t[grepl("panel", g$layout$name)]
    n1 <- length(g$heights[panels[1]])
    n2 <- length(g$heights[panels[2]])
    # assign new (relative) heights to the panels
    # notice the *4 here to get different heights
    g$heights[panels[1]] <- unit(n1*4,"null")
    g$heights[panels[2]] <- unit(n2,"null")
    return(g)
} 

grid.newpage()
grid.draw(gridpatterns(ppnull1))
#> Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
#> ℹ Please use the `linewidth` argument instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#or grid.draw(gridpatterns(rtest1))

Notes

Default variance estimator in OVESEG is empirical Bayes moderated variance estimator used in limma (argument alpha="moderated"). Another option is adding a constant α to pooled variance estimator (argument alpha=α). Setting argument alpha=NULL will treat all variances as the same constant.

Before MG detection, OVESEG-test p-values still need multiple comparison correction or false discovery rate control.

##multiple comparison correction 
pv.overall.adj <- stats::p.adjust(rtest1$pv.overall, method="bonferroni")
pv.multiside.adj <- stats::p.adjust(rtest1$pv.multiside, method="bonferroni")

##fdr control
qv.overall <- fdrtool::fdrtool(rtest1$pv.overall, statistic="pvalue",
                               plot=FALSE, verbose=FALSE)$qval
qv.multiside <- fdrtool::fdrtool(rtest1$pv.multiside, statistic="pvalue",
                                 plot=FALSE, verbose=FALSE)$qval

sessionInfo

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] grid      stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] ggplot2_3.5.1    reshape2_1.4.4   gridExtra_2.3    OVESEG_1.23.0   
#> [5] BiocStyle_2.35.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6        limma_3.63.2        jsonlite_1.8.9     
#>  [4] crayon_1.5.3        compiler_4.4.2      BiocManager_1.30.25
#>  [7] Rcpp_1.0.13-1       stringr_1.5.1       parallel_4.4.2     
#> [10] jquerylib_0.1.4     scales_1.3.0        BiocParallel_1.41.0
#> [13] yaml_2.3.10         fastmap_1.2.0       statmod_1.5.0      
#> [16] R6_2.5.1            plyr_1.8.9          labeling_0.4.3     
#> [19] fdrtool_1.2.18      knitr_1.49          tibble_3.2.1       
#> [22] snow_0.4-4          maketools_1.3.1     munsell_0.5.1      
#> [25] RColorBrewer_1.1-3  pillar_1.10.0       bslib_0.8.0        
#> [28] rlang_1.1.4         cachem_1.1.0        stringi_1.8.4      
#> [31] xfun_0.49           sass_0.4.9          sys_3.4.3          
#> [34] cli_3.6.3           withr_3.0.2         magrittr_2.0.3     
#> [37] digest_0.6.37       lifecycle_1.0.4     vctrs_0.6.5        
#> [40] evaluate_1.0.1      glue_1.8.0          farver_2.1.2       
#> [43] codetools_0.2-20    buildtools_1.0.0    colorspace_2.1-1   
#> [46] rmarkdown_2.29      pkgconfig_2.0.3     tools_4.4.2        
#> [49] htmltools_0.5.8.1

References

Chen, Lulu, David Herrington, Robert Clarke, Guoqiang Yu, and Yue Wang. 2019. “Data-Driven Robust Detection of Tissue/Cell-Specific Markers.” bioRxiv. https://doi.org/10.1101/517961.