Statistical testing of quantitative omics data

## 
## Attaching package: 'BiocStyle'
## The following objects are masked from 'package:rmarkdown':
## 
##     html_document, md_document, pdf_document

Introduction

Welcome to the PolySTest vignette.

PolySTest is a comprehensive R package designed for the statistical testing of quantitative omics data, including genomics, proteomics, and metabolomics datasets. By integrating advanced statistical methods, PolySTest addresses the challenges of missing features and high-dimensionality with multiple comparisons normal in omics data.

This package facilitates the identification of differentially abundant features across different conditions or treatments, making it an essential tool for biomarker discovery and biological insight. Additionally, its robust visualization functions enable users with multiple views for further biological interpretations and comparison of the different statistical tests.

Installation

To install PolySTest and its dependencies, you can use Bioconductor’s BiocManager.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("PolySTest")

Data Preparation

PolySTest requires quantitative omics data to be formatted as a SummarizedExperiment object, which integrates the experimental data with its corresponding metadata. Here’s how to prepare your data:

  1. Data Format: Your quantitative data should be in a matrix format, with features (e.g., genes, proteins) as rows and samples as columns. Missing values can be present, but they should be represented as NA. In case of high missingness, it can be worth to remove features with only minimal non-missing values.

  2. Sample Metadata: Accompany your data with metadata that describes each sample, including conditions, replicates, and possible other experimental factors. This metadata will be crucial for defining comparisons in your statistical tests.

  3. Normalization: To correct for technical variability and make your data comparable across samples, apply normalization methods suitable for your data type. For example, median normalization can adjust for differences in loading amounts or sequencing depth:

First, we load the necessary package and example data and reduce it to 200 lines to increase processing speed. The data set consists of protein abundances in liver samples from mice fed with four different diets. The data contains three replicates per diet.

library(PolySTest)
library(SummarizedExperiment)
# Load the data file "LiverAllProteins.csv" from the PolySTest package
filename <- system.file("extdata", "LiverAllProteins.csv", 
                        package = "PolySTest")

dat <- read.csv(filename, row.names=1, stringsAsFactors = FALSE)
dat <- dat[1:200,] # Use a subset of the data for this example

The following line of code normalizes the data by subtracting the median of each column from the data. This is a simple example of normalization, and you should use the appropriate method for your data type.

# Normalization (example)
dat <- t(t(dat) - colMedians(as.matrix(dat), na.rm = TRUE))

Now, create a SummarizedExperiment Object. Integrate your normalized data and metadata into a SummarizedExperiment object. This structured format ensures that PolySTest can efficiently process your data:

sampleMetadata <- data.frame(Condition = rep(paste("Condition", 1:4), each=3),
                             Replicate = rep(1:3, each=4))

fulldata <- SummarizedExperiment(assays = list(quant = dat), 
                                 colData = sampleMetadata)
rowData(fulldata) <- rownames(dat)
metadata(fulldata) <- list(NumReps = 3, NumCond = 4)

assay(fulldata, "quant") <- dat

fulldata <- update_conditions_with_lcs(fulldata)

Defining comparisons for statistical testing

Before running statistical tests, it’s essential to define the comparisons between conditions or treatments of interest.

When selecting comparisons, consider the scientific questions you aim to answer. For instance, if you’re exploring the effect of a specific treatment, compare treated samples against controls. The choice of reference condition can impact the interpretation of your results.

PolySTest facilitates the creation of pairwise comparisons through the create_pairwise_comparisons function. Here’s how to define comparisons between each condition and a reference condition:

conditions <- unique(colData(fulldata)$Condition)

allComps <- create_pairwise_comparisons(conditions, 1)
#> All pairwise comparison between conditions:
#> |Condition A |Condition B |
#> |:-----------|:-----------|
#> |HF.Rep.     |TTA.Rep.    |
#> |HF.Rep.     |FO.Rep.     |
#> |HF.Rep.     |TTA.FO.Rep. |

Running Statistical Tests with PolySTest

PolySTest offers to conduct both paired and unpaired statistical tests:

  • Paired Tests: Use PolySTest_paired when your data involves matched samples, such as before-and-after treatments on the same subjects. Paired tests account for the inherent correlations between matched samples, offering more sensitivity in detecting differences.

  • Unpaired Tests: Use PolySTest_unpaired for independent groups of samples, such as comparing different treatment groups without matching.

PolySTest automatically performs false discovery rate (FDR) correction on p-values from the statistical tests limma, rank products, Miss Test, permutation test and t-test, and all tests but the t-test combined to the PolySTest FDR

fulldata_unpaired <- PolySTest_unpaired(fulldata, allComps)
#> Running limma test
#> limma completed
#> Running Miss_Test test
#> Running Miss test ...
#> ================================================================================
#> Miss_Test completed
#> ================================================================================
#> tests completed
#> Unifying q-values across tests ...
#> Calculating PolySTest FDRs ...
#> ------- Summary of Results --------
#> Number of differentially regulated features with FDR < 0.01:
#> |                       | PolySTest| limma| Miss_Test| rank_products| permutation_test| t_test|
#> |:----------------------|---------:|-----:|---------:|-------------:|----------------:|------:|
#> |TTA.Rep._vs_HF.Rep.    |        34|    49|         7|             3|                0|     16|
#> |FO.Rep._vs_HF.Rep.     |         1|     3|         2|             0|                0|      0|
#> |TTA.FO.Rep._vs_HF.Rep. |        33|    50|         4|             5|                0|      3|

Visualization and interpretation of results

PolySTest includes several visualizations to help interpret the results of your statistical tests. These visualizations not only highlight significant features but also provide insights into the overall distribution of the data and the relationships between different tests:

  • P-value Distributions: Understanding the distribution of p-values can help assess the tests’ sensitivity and the presence of potential biological signals in your data.

  • Volcano Plots: Volcano plots are a powerful way to visualize the trade-off between magnitude of change (fold change) and statistical significance (FDR), highlighting features of potential interest.

  • UpSet Plots: These plots offer a comprehensive view of the overlap between significant features identified by different statistical tests.

  • Expression Profiles and Heatmaps: These visualizations allow for the detailed examination of the expression patterns of significant features, facilitating the identification of potential biological pathways affected by the conditions under study.

We now plot the pvalue distributions of limma and Miss Test for the first comparison.

# Define comparisons to visualize from available ones
compNames <- metadata(fulldata_unpaired)$compNames
print(compNames)
#> [1] "TTA.Rep._vs_HF.Rep."    "FO.Rep._vs_HF.Rep."     "TTA.FO.Rep._vs_HF.Rep."
comps <- compNames[1]

# Plotting the results
plotPvalueDistr(fulldata_unpaired,  comps, c("limma", "Miss Test"))
#> Plotting p-values

#> Plotting p-values finished

Here’s an example of generating a volcano plot, marking proteins that are have significant changes in the first comparison with an FDR smaller than 1%

# Select proteins with FDR < 0.01

sigProts <- which(rowData(fulldata_unpaired)[, paste0("FDR_PolySTest_", 
                                                      comps[1])] < 0.01)

# Volcano plot
plotVolcano(fulldata_unpaired, compNames = comps, sel_prots = sigProts, 
            testNames = c("PolySTest", "limma", "Miss Test"))
#> Plotting volcano plots

#> Plotting volcano plots finished

Now let’s look into the overlap between the different statistical tests. An upset plot provides this information

plotUpset(fulldata_unpaired, qlim=0.01)
#> Plotting upset plots
#> Plotting upset plots finished

PolySTest also provides a combined plot of expression changes and overlap between different tests. Reduce the number of features to maximally 20 to avoid overloaded plots.

plotExpression(fulldata_unpaired, sel_prots = sigProts)
#> plotting expression profiles
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
#> zero-length arrow is of indeterminate angle and so skipped
#> Making circos plot
#> Note: 1 point is out of plotting region in sector '1', track '1'.
#> Note: 1 point is out of plotting region in sector '1', track '1'.
#> Note: 1 point is out of plotting region in sector '2', track '1'.
#> Note: 1 point is out of plotting region in sector '2', track '1'.
#> Note: 1 point is out of plotting region in sector '3', track '1'.
#> Note: 1 point is out of plotting region in sector '3', track '1'.

#> plotting expression profiles finished

A further comparison between tests shows how many features they provide for different cutoffs for the false discovery rate.

plotRegNumber(fulldata_unpaired)
#> Plotting number of regulated features

#> Plotting number of regulated features finished

For the last visualization, we create a hierarchical map (heatmap) for the differentially regulated proteins from the sigProtsobject.

plotHeatmaply(fulldata_unpaired, sel_prots = sigProts, heatmap_scale = "row")
#> plotting heatmap ...
#> Plotting heatmap finished

References and Further Reading

For more information on PolySTest, please refer to the PolySTest paper:
Schwämmle V, Hagensen CE, Rogowska-Wrzesinska A, Jensen ON. PolySTest: Robust Statistical Testing of Proteomics Data with Missing Values Improves Detection of Biologically Relevant Features. Mol Cell Proteomics. 2020;19(8):1396-1408. doi:10.1074/mcp.RA119.001777

For more tools and information, please visit https://computproteomics.bmb.sdu.dk

Next Steps in Omics Data Analysis

With the significant features identified and visualized, the next steps usually involve further biological interpretation and validation of the findings. Here:

  1. Pathway Enrichment Analysis: Investigate whether the significant features are enriched in specific biological pathways or processes.
  2. Clustering: Group features into clusters based on their expression patterns, which can provide insights into potential biological functions. We recommend using the Bioconductor package VSClust or its Shiny web application from your lab.
  3. Validation: Confirm the findings from your statistical tests using independent experimental methods and datasets.
  4. Integration with Other Omics Layers: Combine your findings with data from other omics levels (e.g., transcriptomics, metabolomics) for a multi-omics analysis.
  5. Protein Complexes: Investigate whether the significant proteins are part of known protein complexes and whether there are protein complex showing strong alteration of their expression. We recommend using our web applications ComplexBrowser and CoExpresso.

Contributing and Feedback

If you have suggestions for improvements, encounter any issues, or would like to contribute code or documentation, please visit our GitHub repository:

GitHub: https://github.com/computproteomics/PolySTest

Your input helps make PolySTest a more robust and user-friendly tool for the omics community.”

Session info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SummarizedExperiment_1.37.0 Biobase_2.67.0             
#>  [3] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
#>  [5] IRanges_2.41.2              S4Vectors_0.45.2           
#>  [7] BiocGenerics_0.53.4         generics_0.1.3             
#>  [9] MatrixGenerics_1.19.1       matrixStats_1.5.0          
#> [11] PolySTest_1.1.0             BiocStyle_2.35.0           
#> [13] rmarkdown_2.29             
#> 
#> loaded via a namespace (and not attached):
#>  [1] bitops_1.0-9            gridExtra_2.3           fdrtool_1.2.18         
#>  [4] rlang_1.1.5             magrittr_2.0.3          compiler_4.4.2         
#>  [7] vctrs_0.6.5             reshape2_1.4.4          stringr_1.5.1          
#> [10] pkgconfig_2.0.3         shape_1.4.6.1           crayon_1.5.3           
#> [13] fastmap_1.2.0           XVector_0.47.2          labeling_0.4.3         
#> [16] caTools_1.18.3          ca_0.71.1               promises_1.3.2         
#> [19] UCSC.utils_1.3.1        UpSetR_1.4.0            purrr_1.0.2            
#> [22] xfun_0.50               cachem_1.1.0            jsonlite_1.8.9         
#> [25] later_1.4.1             DelayedArray_0.33.4     parallel_4.4.2         
#> [28] R6_2.5.1                bslib_0.8.0             stringi_1.8.4          
#> [31] RColorBrewer_1.1-3      limma_3.63.3            jquerylib_0.1.4        
#> [34] Rcpp_1.0.14             assertthat_0.2.1        iterators_1.0.14       
#> [37] knitr_1.49              httpuv_1.6.15           Matrix_1.7-1           
#> [40] splines_4.4.2           tidyselect_1.2.1        qvalue_2.39.0          
#> [43] abind_1.4-8             yaml_2.3.10             viridis_0.6.5          
#> [46] TSP_1.2-4               gplots_3.2.0            codetools_0.2-20       
#> [49] lattice_0.22-6          tibble_3.2.1            plyr_1.8.9             
#> [52] withr_3.0.2             shiny_1.10.0            evaluate_1.0.3         
#> [55] heatmaply_1.5.0         circlize_0.4.16         pillar_1.10.1          
#> [58] BiocManager_1.30.25     KernSmooth_2.23-26      foreach_1.5.2          
#> [61] plotly_4.10.4           ggplot2_3.5.1           munsell_0.5.1          
#> [64] scales_1.3.0            xtable_1.8-4            gtools_3.9.5           
#> [67] glue_1.8.0              lazyeval_0.2.2          maketools_1.3.1        
#> [70] tools_4.4.2             dendextend_1.19.0       sys_3.4.3              
#> [73] data.table_1.16.4       webshot_0.5.5           registry_0.5-1         
#> [76] buildtools_1.0.0        grid_4.4.2              tidyr_1.3.1            
#> [79] crosstalk_1.2.1         seriation_1.5.7         colorspace_2.1-1       
#> [82] GenomeInfoDbData_1.2.13 cli_3.6.3               S4Arrays_1.7.1         
#> [85] viridisLite_0.4.2       dplyr_1.1.4             gtable_0.3.6           
#> [88] sass_0.4.9              digest_0.6.37           SparseArray_1.7.4      
#> [91] farver_2.1.2            htmlwidgets_1.6.4       htmltools_0.5.8.1      
#> [94] lifecycle_1.0.4         httr_1.4.7              mime_0.12              
#> [97] GlobalOptions_0.1.2     statmod_1.5.0