--- title: "Statistical testing of quantitative omics data" author: "Veit Schwämmle" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{PolySTest} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction Welcome to the PolySTest vignette. PolySTest is a comprehensive R package designed for the statistical testing of quantitative omics data, including genomics, proteomics, and metabolomics datasets. By integrating advanced statistical methods, PolySTest addresses the challenges of missing features and high-dimensionality with multiple comparisons normal in omics data. This package facilitates the identification of differentially abundant features across different conditions or treatments, making it an essential tool for biomarker discovery and biological insight. Additionally, its robust visualization functions enable users with multiple views for further biological interpretations and comparison of the different statistical tests. ## Installation To install PolySTest and its dependencies, you can use Bioconductor's BiocManager. ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("PolySTest") ``` ## Data Preparation PolySTest requires quantitative omics data to be formatted as a `SummarizedExperiment` object, which integrates the experimental data with its corresponding metadata. Here’s how to prepare your data: 1. **Data Format**: Your quantitative data should be in a matrix format, with features (e.g., genes, proteins) as rows and samples as columns. Missing values can be present, but they should be represented as NA. In case of high missingness, it can be worth to remove features with only minimal non-missing values. 2. **Sample Metadata**: Accompany your data with metadata that describes each sample, including conditions, replicates, and possible other experimental factors. This metadata will be crucial for defining comparisons in your statistical tests. 3. **Normalization**: To correct for technical variability and make your data comparable across samples, apply normalization methods suitable for your data type. For example, median normalization can adjust for differences in loading amounts or sequencing depth: First, we load the necessary package and example data and reduce it to 200 lines to increase processing speed. The data set consists of protein abundances in liver samples from mice fed with four different diets. The data contains three replicates per diet. ```{r warning=F,message=F} library(PolySTest) library(SummarizedExperiment) # Load the data file "LiverAllProteins.csv" from the PolySTest package filename <- system.file("extdata", "LiverAllProteins.csv", package = "PolySTest") dat <- read.csv(filename, row.names=1, stringsAsFactors = FALSE) dat <- dat[1:200,] # Use a subset of the data for this example ``` The following line of code normalizes the data by subtracting the median of each column from the data. This is a simple example of normalization, and you should use the appropriate method for your data type. ```{r} # Normalization (example) dat <- t(t(dat) - colMedians(as.matrix(dat), na.rm = TRUE)) ``` Now, create a SummarizedExperiment Object. Integrate your normalized data and metadata into a SummarizedExperiment object. This structured format ensures that PolySTest can efficiently process your data: ```{r} sampleMetadata <- data.frame(Condition = rep(paste("Condition", 1:4), each=3), Replicate = rep(1:3, each=4)) fulldata <- SummarizedExperiment(assays = list(quant = dat), colData = sampleMetadata) rowData(fulldata) <- rownames(dat) metadata(fulldata) <- list(NumReps = 3, NumCond = 4) assay(fulldata, "quant") <- dat fulldata <- update_conditions_with_lcs(fulldata) ``` ## Defining comparisons for statistical testing Before running statistical tests, it's essential to define the comparisons between conditions or treatments of interest. When selecting comparisons, consider the scientific questions you aim to answer. For instance, if you're exploring the effect of a specific treatment, compare treated samples against controls. The choice of reference condition can impact the interpretation of your results. PolySTest facilitates the creation of pairwise comparisons through the `create_pairwise_comparisons` function. Here's how to define comparisons between each condition and a reference condition: ```{r} conditions <- unique(colData(fulldata)$Condition) allComps <- create_pairwise_comparisons(conditions, 1) ``` ## Running Statistical Tests with PolySTest PolySTest offers to conduct both paired and unpaired statistical tests: - **Paired Tests**: Use `PolySTest_paired` when your data involves matched samples, such as before-and-after treatments on the same subjects. Paired tests account for the inherent correlations between matched samples, offering more sensitivity in detecting differences. - **Unpaired Tests**: Use `PolySTest_unpaired` for independent groups of samples, such as comparing different treatment groups without matching. PolySTest automatically performs false discovery rate (FDR) correction on p-values from the statistical tests limma, rank products, Miss Test, permutation test and t-test, and all tests but the t-test combined to the PolySTest FDR ```{r} fulldata_unpaired <- PolySTest_unpaired(fulldata, allComps) ``` ## Visualization and interpretation of results PolySTest includes several visualizations to help interpret the results of your statistical tests. These visualizations not only highlight significant features but also provide insights into the overall distribution of the data and the relationships between different tests: - **P-value Distributions**: Understanding the distribution of p-values can help assess the tests' sensitivity and the presence of potential biological signals in your data. - **Volcano Plots**: Volcano plots are a powerful way to visualize the trade-off between magnitude of change (fold change) and statistical significance (FDR), highlighting features of potential interest. - **UpSet Plots**: These plots offer a comprehensive view of the overlap between significant features identified by different statistical tests. - **Expression Profiles and Heatmaps**: These visualizations allow for the detailed examination of the expression patterns of significant features, facilitating the identification of potential biological pathways affected by the conditions under study. We now plot the pvalue distributions of limma and Miss Test for the first comparison. ```{r fig.width=7, fig.height=4} # Define comparisons to visualize from available ones compNames <- metadata(fulldata_unpaired)$compNames print(compNames) comps <- compNames[1] # Plotting the results plotPvalueDistr(fulldata_unpaired, comps, c("limma", "Miss Test")) ``` Here's an example of generating a volcano plot, marking proteins that are have significant changes in the first comparison with an FDR smaller than 1\% ```{r fig.width=7, fig.height=4} # Select proteins with FDR < 0.01 sigProts <- which(rowData(fulldata_unpaired)[, paste0("FDR_PolySTest_", comps[1])] < 0.01) # Volcano plot plotVolcano(fulldata_unpaired, compNames = comps, sel_prots = sigProts, testNames = c("PolySTest", "limma", "Miss Test")) ``` Now let's look into the overlap between the different statistical tests. An upset plot provides this information ```{r fig.width=7, fig.height=4} plotUpset(fulldata_unpaired, qlim=0.01) ``` PolySTest also provides a combined plot of expression changes and overlap between different tests. Reduce the number of features to maximally 20 to avoid overloaded plots. ```{r fig.width=8, fig.height=6} plotExpression(fulldata_unpaired, sel_prots = sigProts) ``` A further comparison between tests shows how many features they provide for different cutoffs for the false discovery rate. ```{r fig.width=7, fig.height=4} plotRegNumber(fulldata_unpaired) ``` For the last visualization, we create a hierarchical map (heatmap) for the differentially regulated proteins from the `sigProts`object. ```{r fig.width=6, fig.height=6} plotHeatmaply(fulldata_unpaired, sel_prots = sigProts, heatmap_scale = "row") ``` ## References and Further Reading For more information on PolySTest, please refer to the PolySTest paper: Schwämmle V, Hagensen CE, Rogowska-Wrzesinska A, Jensen ON. PolySTest: Robust Statistical Testing of Proteomics Data with Missing Values Improves Detection of Biologically Relevant Features. Mol Cell Proteomics. 2020;19(8):1396-1408. doi:10.1074/mcp.RA119.001777 For more tools and information, please visit https://computproteomics.bmb.sdu.dk ## Next Steps in Omics Data Analysis With the significant features identified and visualized, the next steps usually involve further biological interpretation and validation of the findings. Here: 1. **Pathway Enrichment Analysis**: Investigate whether the significant features are enriched in specific biological pathways or processes. 2. **Clustering**: Group features into clusters based on their expression patterns, which can provide insights into potential biological functions. We recommend using the Bioconductor package [VSClust](https://bioconductor.org/packages/release/bioc/html/vsclust.html) or its [Shiny web application](https://computproteomics.bmb.sdu.dk/app_direct/VSClust/) from your lab. 3. **Validation**: Confirm the findings from your statistical tests using independent experimental methods and datasets. 4. **Integration with Other Omics Layers**: Combine your findings with data from other omics levels (e.g., transcriptomics, metabolomics) for a multi-omics analysis. 5. **Protein Complexes**: Investigate whether the significant proteins are part of known protein complexes and whether there are protein complex showing strong alteration of their expression. We recommend using our web applications [ComplexBrowser](https://computproteomics.bmb.sdu.dk/app_direct/ComplexBrowser) and [CoExpresso](https://computproteomics.bmb.sdu.dk/app_direct/CoExpresso). ## Contributing and Feedback If you have suggestions for improvements, encounter any issues, or would like to contribute code or documentation, please visit our GitHub repository: GitHub: https://github.com/computproteomics/PolySTest Your input helps make PolySTest a more robust and user-friendly tool for the omics community." ## Session info ```{r} sessionInfo() ```