---
title: "daVis"
subtitle: "Visualization of differential expression analysis output"
author: "Katarzyna Górczak and Laure Cougnaud"
date: "`r Sys.Date()`"
output: 
  BiocStyle::html_document:
    toc: true
    toc_float:
      collapsed: true
    number_sections: true
    toc_depth: 4
vignette: >
  %\VignetteIndexEntry{daVis package}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
  
```{r loadPackages, echo=FALSE, eval=TRUE}
  
suppressPackageStartupMessages({
  library(daVis)
  library(pander)
})
  
```
  
  
```{r optionsChunkDoNotModify, echo = FALSE, message = FALSE, warning = FALSE}
  
## Chunk with options for knitr. This chunk should not be modified.

knitr::opts_chunk$set(
  eval = TRUE,
  echo = TRUE,
  warning = FALSE,
  dpi = 90,
  fig.width = 7,
  fig.height = 4
)

```

# Introduction

This document explains the functionalities available in the `daVis` package.

`daVis` contains utility functions to visualize the output from 
differential expression analysis. The input data can be a model, a 
list of top tables, or a combination of these two. The model can be of class 
`MArrayLM` (limma), `DGELRT` (edgeR), or `DESeqResults` (DESeq2).

## Installation

Download the package from Bioconductor:

```{r echo = TRUE, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("daVis")
```

## Load the package into R session

```{r echo = TRUE, eval = FALSE}
library(daVis)
```

# Load data

Each visualization function takes as an `input`:

* **model** - an object of class `MArrayLM`, `DGELRT` or `DESeqResults`
* **topTables** - a list of top tables for multiple coefficients
* combination of the model and/or top table


```{r loadData, echo=TRUE, eval=TRUE}
tmpDir <- tempfile(); dir.create(tmpDir)
exampleData <- createExampleData(
	path = tmpDir, 
	output = c("limma", "edgeR", "deseq2", "topTable")
)
res.limma <- exampleData$limma
res.edger <- exampleData$edgeR
res.deseq <- exampleData$DESeq2
topTableList <- exampleData$topTable
```

## Top table

The user should provide a named list of top tables for different 
contrasts/coefficients. Here, the list contains 4 top tables from 4 comparisons.
Each top table must contain the following columns: \'logFC\', \'P.Value\', 
\'adj.P.Val\'. Additionally, each table can contain columns with gene 
identifiers and averaged across all samples expression (\'AveExpr\' or 
\'logCPM\').


```{r topTable1}
length(topTableList)

```


The names of the list are linked to the comparisons:

```{r topTable2}
names(topTableList)

```

Below a subset of an example top table: 

```{r topTable3, echo=FALSE}
pander(topTableList[[1]][seq_len(6), c(1, 2, 5, 6, 8, 9)], row.names = FALSE)

```

## Model

The object **res.limma** is of class:


```{r limmaModel}
class(res.limma)

```

The object **res.edger** is of class:

```{r edgerModel}
class(res.edger)

```

The object **res.deseq** is of class:

```{r deseqModel}
class(res.deseq)

```

# Visualizations

## Volcano plot

A **volcano plot** 
enables a quick visual identification of the size and significance of the 
feature expression effects (top left/right). The significance of the effect is 
represented by the raw p-value on the y axis, so highly significant features 
are at the top of the plot. The size of the effect is represented by the log 
of the fold change (negative/positive for down/up-regulation), so features with
high effects are at the right/left side of the plot. Below, the color
scale is used for adjusted p-values (corrected for multiple testing across
genes).

The documentation of the function containing description of each parameter
can be obtained by:


```{r volcanoPlotHelp, eval=FALSE}
help("daVolcanoPlot", "daVis")

```

### Highlight top genes or genes of interest

The function **`daVolcanoPlot()`** creates a volcano plot for the provided 
model or list of top tables, or their combination. 
The feature identifier can be specified by 
`featuresIdVar` parameter. If empty (by default), row names of the input
are used as feature identifiers. 
The `colorVar`, `shapeVar`, `alphaVar` and/or `sizeVar` 
parameters can be used to customize the plot. Here, the `colorVar` 
parameter is used to color the points by adjusted p-value.
The `topGenes` parameter represents the number of top genes with highest 
logFC or p-value to highlight in the plot for each considered coefficient 
(0 by default). The features are then labeled by `topGenesVar` parameter. 
If empty, row names of the input are used.

Additionally, a set of genes of interest can be highlighted in red. 
The features can be specified using the `genesToHighlight` parameter and 
`genesToHighlightVar` indicates the identifier used to label 
genes of interest. `genesToHighlight` should be the same as the input data 
row names.


```{r volcanoPlot-top-genes}
coefs <- c("B.LvsP", "L.LvsP")
genesOfInterest <- c("497097", "20671", "239273", "14862", "27395", "76408")

daVolcanoPlot(
  input = res.limma, 
  coef = coefs, 
  coefLabel = c("A", "B"),
  topGenes = 5,
  topGenesVar = "SYMBOL",
  genesToHighlight = genesOfInterest,
  genesToHighlightVar = "SYMBOL",
  colorVar = "adj.P.Val",
  facetNCol = 2
)

```

### Additional fdr threshold

```{r volcanoPlot-logFC-threshold, eval=FALSE}
coefs <- c("B.LvsP", "L.LvsP")

daVolcanoPlot(
  input = res.limma, 
  coef = coefs, 
  coefLabel = c("A", "B"),
  facetNCol = 2,
  additionalThresholdsAdjPValue = 0.1,
  colorVar = "adj.P.Val"
)

```

### Interactive plot

The interactive volcano plot will be created by changing the `typePlot` 
parameter.

```{r volcanoPlot-interactive, eval=FALSE}
coefs <- c("B.LvsP", "L.LvsP")

daVolcanoPlot(
  input = res.limma, 
  coef = coefs, 
  coefLabel = c("A", "B"),
  typePlot = "interactive"
)

```

## Log-ratio plot

A **log-ratio plot** 
represents the differential effect (e.g., treatment versus control) for several 
conditions (e.g., compounds or concentrations) of the experiment (logFC scale).
This enables to visualize a bigger subset of genes. The significance of genes 
can be represented via colored rows, e.g., red denotes significant genes, while
grey indicates non-significant genes.

The documentation of the function containing description of each parameter 
can be obtained by:

```{r logRatioHelp, eval=FALSE}
help("daLogRatioPlot", "daVis")

```

### Color feature labels

The function **`daLogRatioPlot()`** creates a log-ratio plot for the provided
model, top tables
(or list of those). The features should be specified using the `features` 
parameter, and the feature identifier can be specified using the 
`featuresIdVar` parameter. If the `features` parameter is not provided, the 
plot will display the top 20 features. The features labels can be colored by 
using `featuresColor` parameter. 
Here, the features are colored based on the significance for the 
coefficient. Additionally, the features can be labeled by providing the 
`featuresVar` parameter indicating the column names in the top table. 


```{r logRatio-gene-color}
coefs <- c("B.LvsP", "L.LvsP", "B.PvsV", "L.PvsV")

features <- c(
  "72243", "66704", "11781", "226101", "14620", "381290", "16012", 
  "100504225", "231830", "78896", "103889", "231991", "77034", "19687", 
  "100043805", "16669", "226162", "24117", "55987", "11419" 
)
dt <- topTableList[['B.LvsP']]
adjPVal <- dt[match(features, dt$ENTREZID), "adj.P.Val"]
signFeature <- ifelse(adjPVal <= 0.05, "red", "grey")

daLogRatioPlot(
  input = topTableList,
  features = features,
  featuresIdVar = "ENTREZID",
  featuresVar = c("SYMBOL", "GENENAME"),
  featuresMaxNChar = 35,
  coef = coefs,
  coefLabel = c("A", "B", "C", "D"),
  facetNCol = 4,
  featuresColor = signFeature,
  errorBars = FALSE
)

```

### Facet by variable(s)

The coefficients can be grouped by specifying multiple sets of labels to the
`coefLabel` parameter. 
The colors of the bars can be changed via `color`. 


```{r logRatio-nested-facet}
coefs <- c("B.LvsP", "L.LvsP", "B.PvsV", "L.PvsV")

coefsLabel <- list(
  sub("(.+)\\.(.+)", "\\2", coefs),
  sub("(.+)\\.(.+)", "\\1", coefs)
)

colorPalette <- c(
  `B.PvsV` = "darkgreen", `L.PvsV` = "lightgreen",
  `B.LvsP` = "darkblue", `L.LvsP` = "lightblue"
)

daLogRatioPlot(
  input = topTableList,
  featuresIdVar = "ENTREZID",
  coef = coefs,
  coefLabel = coefsLabel,
  facetNCol = 4,
  errorBars = FALSE,
  color = colorPalette
)

# Note: using coef labels as names of the color palette also works 
# (if coefLabel is NOT a list) - for back-compatibility
# colorPalette <- c(
#   C = "darkgreen", D = "lightgreen",
#   A = "darkblue", B = "lightblue"
# )
```

### Mixed input

The log ratio plot can be created for a (mixed) list of top table(s) and
model(s).

```{r logRatio-mixed-input}
coefs <- c("B.LvsP", "L.LvsP", "B.PvsV", "L.PvsV", "A")

daLogRatioPlot(
  input = list(res.limma, A = topTableList[["B.LvsP"]]),
  featuresIdVar = "ENTREZID",
  coef = coefs,
  facetNCol = 5,
  errorBars = TRUE
)
```

### Sort features

The features can be sorted with the `featuresOrder` parameter.


#### Based on significance

```{r logRatio-order-significance}
coefs <- c("B.LvsP", "L.LvsP", "B.PvsV", "L.PvsV")

daLogRatioPlot(
  input = res.limma,
  featuresIdVar = "ENTREZID",
  features = features, featuresOrder = "significance",
  coef = coefs,
  facetNCol = 4,
  errorBars = TRUE
)
```

### Display text

A text can be displayed in the log ratio plot via the `text` parameter. 
This can be a column of the top table or a function formatting such column(s).  
If the text doesn't fit within the axes limits, the x-axis can be expanded 
via the `xexpand` parameter.

#### Significance star

```{r logRatio-text-significanceStar}
coefs <- c("B.LvsP", "L.LvsP", "B.PvsV", "L.PvsV")
# Note: the format.pval function can be used to format the p-value as a text
getSignifStar <- function(topTable){
  with(topTable, as.character(
    stats::symnum(
      x = adj.P.Val, 
      cutpoints = c(0, .001, .01, .05, .1, 1),
      symbols = c("***","**","*","."," "),
      corr = FALSE
    )
  ))
}
daLogRatioPlot(
  input = res.limma,
  coef = coefs,
  coefLabel = c("A", "B", "C", "D"),
  color = colorPalette,
  facetNCol = 4,
  text = getSignifStar
)
```

#### Log fold change

```{r logRatio-text-logFC}
daLogRatioPlot(
  input = res.limma,
  coef = coefs,
  coefLabel = c("A", "B", "C", "D"),
  color = colorPalette,
  facetNCol = 4,
  text = function(topTable)
    with(topTable, round(logFC, digits = 1)),
  textCex = 3
)
```


## Heatmap

A **heatmap** 
represents the differential effect (e.g. treatment versus control) for several
conditions (e.g., compounds or concentrations) of the experiment. It is a 
graphical representation of the individual values contained in a matrix as 
colors. This enables to visualize a bigger subset of genes. The gene label can 
be colored indicating for example the significance of genes, e.g., black color 
denotes significant genes, while grey represents non-significant genes.

The documentation of the function containing description of each parameter can 
be obtained by:

```{r heatmapHelp, eval=FALSE}
help("daHeatmap", "daVis")

```

### Color feature labels

The function **`daHeatmapLogFC()`** creates a heatmap for the provided model or 
list of top tables. The features should be specified using the `features` 
parameter, and the feature identifier can be specified using `featuresIdVar`. 
The features are labeled by row names of input (or `featuresIdVar`) 
by default. Different labels are possible by using the `featuresVar` parameter. 

```{r heatmap-gene-color}
coefs <- c("B.LvsP", "L.LvsP", "B.PvsV", "L.PvsV")

daHeatmapLogFC(
  input = topTableList,
  features = features,
  featuresIdVar = "ENTREZID",
  featuresVar = c("SYMBOL", "GENENAME"),
  featuresMaxNChar = 35,
  coef = coefs,
  coefLabel = c("A", "B", "C", "D"),
  featuresColor = signFeature
)

```

### Nested coefficient labels

The coefficients can be grouped by specifying multiple sets of labels to the
`coefLabel` parameter.

```{r heatmap-nested-labels}
coefs <- c("B.LvsP", "L.LvsP", "B.PvsV", "L.PvsV")

coefsLabel <- list(
  sub("(.+)\\.(.+)", "\\2", coefs),
  sub("(.+)\\.(.+)", "\\1", coefs)
)

daHeatmapLogFC(
  input = res.limma,
  features = features,
  featuresIdVar = "ENTREZID",
  featuresVar = c("SYMBOL", "GENENAME"),
  featuresMaxNChar = 35,
  coef = coefs,
  coefLabel = coefsLabel
)

```


## Upset plot

An **upset plot** 
is used to represent the overlap (intersection) or difference of the sets of 
significant genes, down or up-regulated separately, between different 
differential effects. The different shades of blue are indicative for the 
number of differential effects (sharing these up- or down-regulated genes).

The documentation of the function containing description of each parameter 
can be obtained by:

```{r upsetHelp, eval=FALSE}
help("daUpset", "daVis")

```

### Down-regulated genes

The function **`daUpset()`** creates an upset plot for the provided model or 
list of top tables. The sets are created based on the row names of input or 
feature identifiers specified using the `featuresIdVar` parameter.


```{r upset-down}
daUpset(
  input = topTableList,
  coef = coefs,
  featuresIdVar = "SYMBOL",
  fdr = 0.05,
  dir = "down",
  ylab = paste("Number of (shared) significantly \n", 
               "down-regulated genes"),
  xlab = paste("Number of significantly \n", 
               "down-regulated genes")
)

```

### Return overlapping sets

If `returnAnalysis` is set to TRUE, the output of the **`daUpset()`** 
function is a list. The slot `sets` contains
all the overlapping sets between specified coefficients. The sets contain 
identifiers based on `featuresIdVar`. The slot `plot` contains the plot object.

```{r upset-return-sets, eval=FALSE}
out <- daUpset(
  input = topTableList,
  coef = coefs,
  featuresIdVar = "SYMBOL",
  fdr = 0.05,
  dir = "down",
  ylab = paste("Number of (shared) significantly \n", 
               "down-regulated genes"),
  xlab = paste("Number of significantly \n", 
               "down-regulated genes"),
  returnAnalysis = TRUE
)
out$sets
out$plot
```

## Scatter plot

A **scatter plot** 
visualizes the comparison of the logFC for different differential effects.

The documentation of the function containing description of each parameter 
can be obtained by:

```{r pcpHelp, eval=FALSE}
help("daScatterPlot", "daVis")

```

### Highlight genes of interest and top genes

The function **`daScatterPlot()`** creates a scatter plot for the provided 
model or list of top tables. 
The features of interest can be specified using `genesToHighlight` and 
`genesToHighlightVar` parameters. Genes with highest logFC and significance
can be colored as well by `topGenes`. These can be labeled 
by `topGenesVar`.

```{r pcp-genes-of-interest}
coefs <- c("B.LvsP", "L.LvsP", "B.PvsV", "L.PvsV")
genesOfInterest <- c("497097", "20671", "239273", "14862", "27395", "76408")

daScatterPlot(
  input = topTableList,
  coef = coefs,
  coefLabel = c("A", "B", "C", "D"),
  genesToHighlight = genesOfInterest,
  genesToHighlightVar = "SYMBOL",
  topGenes = 5,
  topGenesVar = "SYMBOL",
  facetNCol = 3
)

```

### Interactive plot with feature annotation

The interactive scatter plot will be created by changing the `typePlot` 
parameter. An additional feature annotation can be shown when hovering over 
the points. The `featuresVar` parameter allows to specify the column names 
that should be used to annotate the features. 

```{r pcp-interactive-annotation, eval=FALSE}
coefs <- c("B.LvsP", "L.LvsP", "B.PvsV", "L.PvsV")

daScatterPlot(
  input = topTableList,
  coef = coefs,
  coefLabel = c("A", "B", "C", "D"),
  featuresIdVar = "ENTREZID",
  featuresVar = c("SYMBOL", "GENENAME"),
  facetNCol = 3,
  typePlot = "interactive"
)

```

### Add correlation

The Pearson correlation value can be shown in the plot when setting 
`correlation` parameter to TRUE.

```{r pcp-correlation}
coefs <- c("B.LvsP", "L.LvsP", "B.PvsV", "L.PvsV")

daScatterPlot(
  input = topTableList,
  coef = coefs[c(1, 2)],
  coefLabel = c("A", "B"),
  featuresIdVar = "ENTREZID",
  correlation = TRUE
)

```

## Waterfall plot

A **waterfall plot** 
visualizes the logFC for different differential effects colored by adjusted 
p-value.

The documentation of the function containing description of each parameter 
can be obtained by:

```{r waterfallHelp, eval=FALSE}
help("daWaterfallPlot", "daVis")

```

### Multiple coefficients

The function **`daWaterfallPlot()`** creates a barplot plot for the provided 
model or list of top tables. 
When more than one coefficient specified, multiple plots side by side are 
created. 

```{r waterfall-more-coef}
coefs <- c("B.LvsP", "L.LvsP", "B.PvsV", "L.PvsV")
features <- c(
  "24117", "381290", "226101", "78896", "231830", "16012", "16669", "55987",
  "231991", "14620", "20317", "74747", "11636", "20482", "194126", "270150",
  "17131", "16878", "20564", "73847" 
)

daWaterfallPlot(
  input = res.limma,
  coef = coefs,
  coefLabel = c("A", "B", "C", "D"),
  featuresIdVar = "ENTREZID",
  featuresVar = "SYMBOL",
  features = features,
  facetNCol = 4,
  colorVar = "adj.P.Val",
  color = c("maroon", "orange"),
  fillVar = "adj.P.Val",
  fill = c("maroon", "orange"),
  typePlot = "static"
)

```

## MA plot

A **MA plot** 
visualizes the logFC versus log2 mean expression.

The documentation of the function containing description of each parameter 
can be obtained by:

```{r maplotHelp, eval=FALSE}
help("daMAplot", "daVis")

```

### Label top genes and genes of interest

The function **`daMAplot()`** creates a scatter plot (logFC versus mean 
expression) for the provided model or list of top tables. 
The top genes with highest absolute logFC and significance can be labeled. 
The number of top genes can be specified using the `topGenes` parameter. 
Top genes are labeled by feature names by default. It can be changed with 
`topGenesVar`.
The genes of interest are indicated in green. Additionally, if `coef` 
indicates only one coefficient, the legend shows the number of significantly 
up- or down-regulated genes, or the number of non-significant genes. The points
can be colored by direction (significant up- and down-regulated genes) if 
`direction` is set to TRUE. Customized colors can be used with `color`.

```{r maplot-top-genes-and-genes-of-interest}
coefs <- c("B.LvsP", "L.LvsP")

daMAplot(
  input = res.limma, 
  coef = coefs[1], 
  coefLabel = "A",
  featuresIdVar = "ENTREZID",
  topGenes = 5,
  topGenesVar = "SYMBOL",
  genesToHighlight = genesOfInterest,
  genesToHighlightVar = "SYMBOL",
  direction = TRUE,
  color = c("steelblue", "firebrick", "grey")
)

```

## Significant genes barplot

A **barplot** 
visualizes the number of significant genes per comparison.

The documentation of the function containing description of each parameter 
can be obtained by:

```{r barplotHelp, eval=FALSE}
help("daSignificantGenesBarplot", "daVis")

```

### Nested coefficient labels

The function **`daSignificantGenesBarplot()`** creates a barplot indicating the
number of significant (up- and down-regulated) genes for the provided model or 
list of top tables. The coefficients can be grouped by specifying multiple sets 
of labels to the `coefLabel` parameter.

```{r barplot-nested-labels}
coefs <- c("B.LvsP", "L.LvsP", "B.PvsV", "L.PvsV")

coefsLabel <- list(
  sub("(.+)\\.(.+)", "\\2", coefs),
  sub("(.+)\\.(.+)", "\\1", coefs)
)

daSignificantGenesBarplot(
  input = res.limma, 
  coef = coefs, 
  coefLabel = coefsLabel,
  addPercentage = TRUE
)

```

# Session information

```{r sessionInfo}
sessionInfo()
```