---
title: "Comparative transcriptomic analysis of hybrids and their progenitors"
author: 
  - name: Fabricio Almeida-Silva
    affiliation: VIB-UGent Center for Plant Systems Biology, Ghent, Belgium
  - name: Lucas Prost-Boxoen
    affiliation: VIB-UGent Center for Plant Systems Biology, Ghent, Belgium
  - name: Yves Van de Peer
    affiliation: VIB-UGent Center for Plant Systems Biology, Ghent, Belgium
output: 
  BiocStyle::html_document:
    toc: true
    toc_float: true
    number_sections: yes
bibliography: vignette_bibliography.bib
vignette: >
  %\VignetteIndexEntry{Comparative transcriptomic analysis of hybrids and their progenitors}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```

# Introduction

The formation of hybrids through the fusion of distinct genomes and 
subsequent genome duplication (in cases of allopolyploidy) represents a 
significant evolutionary event with complex effects on cellular biology, 
particularly gene expression. The impact of such genome mergings and 
duplications on transcription remain incompletely understood. To bridge this 
gap, we introduce __HybridExpress__, a comprehensive package designed to 
facilitate the comparative transcriptomic analysis of hybrids and their 
progenitor species. __HybridExpress__ is tailored for RNA-Seq data derived 
from a 'hybrid triplet': the hybrid organism and its two parental species. 
This package offers a suite of intuitive functions enabling researchers to 
perform differential expression analysis with ease, generate principal 
component analysis (PCA) plots to visualize sample grouping, 
categorize genes into 12 distinct expression pattern groups 
(as in @rapp2009genomic), and conduct functional analyses.
Acknowledging the potential variability in cell and transcriptome size 
across species and ploidy levels, __HybridExpress__ incorporates features for 
rigorous normalization of count data. Specifically, it allows for the 
integration of spike-in controls directly into the normalization process, 
ensuring accurate transcriptome size adjustments when these standards are 
present in the RNA-Seq count data (see full methodology in @coate2023beyond). 
By offering these capabilities, __HybridExpress__ provides a robust 
tool set for unraveling the intricate effects of genome doubling and 
merging on gene expression, paving the way for novel insights into the 
cellular biology of hybrid organisms.

# Installation

`r BiocStyle::Githubpkg("HybridExpress")` can be installed from Bioconductor
with the following code:

```{r installation, eval=FALSE}
if(!requireNamespace('BiocManager', quietly = TRUE))
  install.packages('BiocManager')

BiocManager::install("HybridExpress")
```

```{r load_package, message=FALSE}
# Load package after installation
library(HybridExpress)

set.seed(123) # for reproducibility
```

# Data description

For this vignette, we will use an example data set that comprises
(unpublished) gene expression data (in counts) for *Chlamydomonas reinhardtii*. 
In our lab, we crossed a diploid line of *C. reinhardtii* (hereafter "P1") 
with a haploid line (hereafter "P2"), thus generating a triploid line through
the merging of the two parental genomes. The count matrix and sample
metadata are stored in `SummarizedExperiment` objects, which is the standard
Bioconductor data structure required by 
`r BiocStyle::Githubpkg("HybridExpress")`. For instructions on how to create
a `SummarizedExperiment` object, check the FAQ section of this vignette.

Let's load the example data and take a quick look at it:

```{r message = FALSE}
library(SummarizedExperiment)

# Load data
data(se_chlamy)

# Inspect the `SummarizedExperiment` object
se_chlamy

## Take a look at the colData and count matrix
colData(se_chlamy)
assay(se_chlamy) |> head()

table(se_chlamy$Ploidy, se_chlamy$Generation)
```

As you can see, the count matrix contains `r nrow(se_chlamy)` genes
and `r ncol(se_chlamy)` samples, with 6 replicates for parent 1 
(P1, diploid), 6 replicates for parent 2 (P2, haploid), and 6 replicates 
for the progeny (F1, triploid).

# Adding midparent expression values

First of all, you'd want to add in your count matrix *in silico* samples 
that contain the expression values of the midparent. This can be done
with the function `add_midparent_expression()`, which takes a random
sample pair (one sample from each parent, sampling without replacement) 
and calculates the midparent expression value in one of three ways:


1. **Mean (default):** get the mean expression of the two samples.
2. **Sum:** get the sum of the two samples.
3. **Weighted mean:** get the weighted mean of the two samples by multiplying
the expression value of each parent by a weight. Typically, this can be used
if the two parents have different ploidy levels, and the weights would
correspond to the ploidy level of each parent.


For this function, besides specifying the method to obtain the midparent 
expression values (i.e., "mean", "sum", or "weightedmean"), users must
also specify the name of the column in `colData` that contains information 
about the generations (default: *"Generation"*), as well as the levels
corresponding to each parent (default: *"P1"* and *"P2"* for parents 1 and 2,
respectively).

```{r}
# Add midparent expression using the mean of the counts
se <- add_midparent_expression(se_chlamy)
head(assay(se))

# Alternative 1: using the sum of the counts
add_midparent_expression(se_chlamy, method = "sum") |>
    assay() |> 
    head()

# Alternative 2: using the weighted mean of the counts (weights = ploidy)
w <- c(2, 1) # P1 = diploid; P2 = haploid
add_midparent_expression(se_chlamy, method = "weightedmean", weights = w) |>
    assay() |> 
    head()
```

We will proceed our analyses with the midparent expression values obtained
from the mean of the counts, stored in the `se` object.

# Normalizing count data

To normalize count data, the function `add_size_factors()` calculates
size factors (used by `r BiocStyle::Biocpkg("DESeq2")` 
for normalization) and adds them as an extra column in the colData slot
of your `SummarizedExperiment` object. Such size factors are calculated using
one of two methods:

1. By library size (default) using the 'median of
ratios' method implemented in `r BiocStyle::Biocpkg("DESeq2")`.

2. By cell size/biomass using spike-in controls (if available). If spike-in 
controls are present in the count matrix, you can use them
for normalization by setting `spikein = TRUE` and specifying the
pattern used to indicate rows that contain spike-ins (usually they start
with *ERCC*)[^1]. Normalization with spike-in controls is
particularly useful if the amount of mRNA per cell is not equal
between generations (due to, for instance, different ploidy levels, which
in turn can lead to different cell sizes and/or biomass). 

[^1]: **Note:** if you use spike-in normalization, the 
function `add_size_factors()` automatically removes rows containing 
spike-in counts from the `SummarizedExperiment` object after normalization. 
However, if you have counts for spike-in controls in your count matrix,
but do not want to use spike-in normalization, you should remove them 
before creating the `SummarizedExperiment` object.

In our example data set, spike-in controls are available 
in the last rows of the count matrix. Let's take a look at them.

```{r}
# Show last rows of the count matrix
assay(se) |> 
    tail()
```

As you can see, rows with spike-in counts start with "ERCC". Let's add 
size factors to our `SummarizedExperiment` object using 
spike-in controls.

```{r}
# Take a look at the original colData
colData(se)

# Add size factors estimated from spike-in controls
se <- add_size_factors(se, spikein = TRUE, spikein_pattern = "ERCC")

# Take a look at the new colData
colData(se)
```

# Exploratory data analyses

Next, you can perform exploratory analyses of sample clustering to verify if
samples group as expected. With `r BiocStyle::Githubpkg("HybridExpress")`,
this can be performed using two functions:

1. `pca_plot()`: creates principal component analysis (PCA) plots, with colors 
and shapes (optional) mapped to levels of `colData` variables;
2. `plot_samplecor()`: plots a heatmap of hierarchically clustered pairwise 
sample correlations.

Let's start with the PCA plot:

```{r}
# For colData rows with missing values (midparent samples), add "midparent"
se$Ploidy[is.na(se$Ploidy)] <- "midparent"
se$Generation[is.na(se$Generation)] <- "midparent"

# Create PCA plot
pca_plot(se, color_by = "Generation", shape_by = "Ploidy", add_mean = TRUE)
```

In the plot above, each data point corresponds to a sample, and colors
and shapes are mapped to levels of the variables specified in the
arguments `color_by` and `shape_by`, respectively. Besides, 
by specifying `add_mean = TRUE`,
we added a diamond shape indicating the mean PC coordinates based on the variable
in `color_by` (here, *"Generation"*).

Now, let's plot the heatmap of sample correlations:

```{r}
# Plot a heatmap of sample correlations
plot_samplecor(se, coldata_cols = c("Generation", "Ploidy"))
```

We can see that samples group well together both in the PCA plot and in
the correlation heatmap. 

Of note, both `pca_plot()` and `plot_samplecor()`
use only the top 500 genes with the highest variances to create the plot.
This is because genes with low variances (i.e., genes that do not vary much
across samples) are uninformative and typically only add noise. You can change
this number (to use more or less genes) in the `ntop` argument of 
both functions.


# Identifying differentially expressed genes between hybrids and their parents

To compare gene expression levels of hybrids to their progenitor species, you
can use the function `get_deg_list()`. This function performs
differential expression analyses using `r BiocStyle::Biocpkg("DESeq2")`
and returns a list of data frames with gene-wise test statistics 
for the following contrasts:

1. `P2_vs_P1`: parent 2 (numerator) versus parent 1 (denominator).
2. `F1_vs_P1`: hybrid (numerator) versus parent 1 (denominator).
3. `F1_vs_P2`: hybrid (numerator) versus parent 2 (denominator).
4. `F1_vs_midparent`: hybrid (numerator) vs midparent (denominator).

The size factors estimated with `add_size_factors()` are used for normalization
before differential expression testing. Let's use `get_deg_list()` to get 
differentially expressed genes (DEGs) for each contrast.

```{r}
# Get a list of differentially expressed genes for each contrast
deg_list <- get_deg_list(se)

# Inspecting the output
## Getting contrast names
names(deg_list)

## Accessing gene-wise test statistics for contrast `F1_vs_P1`
head(deg_list$F1_vs_P1)

## Counting the number of DEGs per contrast
sapply(deg_list, nrow)
```

To summarize the frequencies of up- and down-regulated genes per contrast
in a single data frame, use the function `get_deg_counts()`.

```{r}
# Get a data frame with DEG frequencies for each contrast
deg_counts <- get_deg_counts(deg_list)

deg_counts
```

It is important to note that the columns `perc_up`, `perc_down`, 
and `perc_total` show the percentages of up-regulated, down-regulated, and
all differentially expressed genes relative to the total number of genes
in the count matrix. The total number of genes in the count matrix
is stored in the `ngenes` attribute of the list returned by `get_deg_list()`:

```{r}
attributes(deg_list)$ngenes
```

However, since the count matrix usually does not include all genes in the 
genome (e.g., lowly expressed genes and genes with low variance are usually
filtered out), the percentages in `perc_up`, `perc_down`, and `perc_total`
are not relative to the total number of genes in the genome. To use the
total number of genes in the genome as the reference, we need to update
the `ngenes` attribute of the DEG list with the appropriate number
as follows:

```{r}
# Total number of genes in the C. reinhardtii genome (v6.1): 16883
attributes(deg_list)$ngenes <- 16883
```

Then, we can run `get_deg_counts()` again to get the percentages relative
to the total number of genes in the genome.

```{r}
deg_counts <- get_deg_counts(deg_list)
deg_counts
```

Finally, we can summarize everything in a single publication-ready figure
using the plot `plot_expression_triangle()`, which shows the 
'experimental trio' (i.e., hybrid and its progenitors) as a triangle,
with the frequencies of DEGs indicated.

```{r}
# Plot expression triangle
plot_expression_triangle(deg_counts)
```

This figure is commonly used in publications, and it was inspired by 
@rapp2009genomic. For each edge (line), numbers in the middle (in bold) 
indicate the frequency of all DEGs, and numbers at the ends (close to boxes)
indicate the frequency of up-regulated genes in each generation. For instance, 
the figure above shows that, for the contrast between F1 and P1, there 
are `r deg_counts[2, "total"]` DEGs (`r deg_counts[2, "perc_total"]`% of the
genome), of which `r deg_counts[2, "up"]` are up-regulated in F1,
and `r deg_counts[2, "down"]` are up-regulated in P1.

For a custom figure, you can also specify your own color palette and
labels for the boxes. For example:

```{r fig.height=5}
# Create vectors (length 4) of colors and box labels
pal <- c("springgreen4", "darkorange3", "mediumpurple4", "mediumpurple3")
labels <- c("Parent 1\n(2n)", "Parent 2\n(n)", "Progeny\n(3n)", "Midparent")

plot_expression_triangle(deg_counts, palette = pal, box_labels = labels)
```

# Expression-based gene classification

After identifying DEGs for different contrasts, you'd typically want
to classify your genes into expression partitions based on their expression
patterns. This can be performed with the function `expression_partitioning()`,
which classifies genes into one of the 12 **categories** as in @rapp2009genomic,
and into 5 major **classes** that summarize the 12 categories. The five classes
are:

1. **Transgressive up-regulation (UP):** gene is up-regulated in the hybrid
as compared to both parents.
2. **Transgressive down-regulation (DOWN):** gene is down-regulated in the
hybrid as compared to both parents.
3. **Additivity (ADD):** gene expression in the hybrid is the mean of
both parents (additive effect).
4. **Expression-level dominance toward parent 1 (ELD_P1):** gene expression
in the hybrid is the same as in parent 1, but different from parent 2.
5. **Expression-level dominance toward parent 2 (ELD_P2):** gene expression
in the hybrid is the same as in parent 2, but different from parent 1.


```{r}
# Classify genes in expression partitions
exp_partitions <- expression_partitioning(deg_list)

# Inspect the output
head(exp_partitions)

# Count number of genes per category
table(exp_partitions$Category)

# Count number of genes per class
table(exp_partitions$Class)
```

To visualize the expression partitions as a scatter plot of expression 
divergences, you can use the function `plot_expression_partitions()`.

```{r, fig.height=6, fig.width=8}
# Plot partitions as a scatter plot of divergences
plot_expression_partitions(exp_partitions, group_by = "Category")
```

By default, genes are grouped by `Category`. However, you can also
group genes by `Class` as follows:

```{r, fig.height=7, fig.width=8}
# Group by `Class`
plot_expression_partitions(exp_partitions, group_by = "Class")
```

You can also visualize the frequencies of genes in each partition with
the function `plot_partition_frequencies()`.

```{r, fig.height=7}
# Visualize frequency of genes in each partition
## By `Category` (default)
plot_partition_frequencies(exp_partitions)

## By `Class`
plot_partition_frequencies(exp_partitions, group_by = "Class")
```

# Overrepresentation analysis of functional terms

Lastly, you'd want to explore whether gene sets of interest (e.g., up-regulated
genes in each contrast) are enriched in any particular GO term, pathway,
protein domain, etc. For that, you will use the function `ora()`,
which performs overrepresentation analysis for a gene set given a data frame 
of functional annotation for each gene.

Here, we will use an example data set with GO annotation for *C. reinhardtii*
genes. This data set illustrates how the annotation data frame must be
shaped: gene ID in the first column, and functional annotations in other
columns.

```{r}
# Load example functional annotation (GO terms)
data(go_chlamy)

head(go_chlamy)
```

To demonstrate the usage of `ora()`, let's check if we can find enrichment
for any GO term among genes assigned to class "ADD".

```{r}
# Get a vector of genes in class "ADD"
genes_add <- exp_partitions$Gene[exp_partitions$Class == "ADD"]
head(genes_add)

# Get background genes - genes in the count matrix
bg <- rownames(se)

# Perform overrepresentation analysis
ora_add <- ora(genes_add, go_chlamy, background = bg)

# Inspect results
head(ora_add)
```

## Example 1: overrepresentation analyses for all expression-based classes

In the example above, we performed an overrepresentation analysis in
genes associated with class "ADD". What if we wanted to do the same for *all*
classes?

In that case, you could run `ora()` multiple times by looping over each class.
In details, you would do the following:

1. For each class, create a vector of genes associated with it;
2. Run `ora()` to get a data frame with ORA results.

Below you can find an example on how to do it using `lapply()`. Results
for each class will be stored in elements of a list object.

```{r}
# Create a list of genes in each class
genes_by_class <- split(exp_partitions$Gene, exp_partitions$Class)

names(genes_by_class)
head(genes_by_class$UP)

# Iterate through each class and perform ORA
ora_classes <- lapply(
    genes_by_class, # list through which we will iterate
    ora, # function we will apply to each element of the list
    annotation = go_chlamy, background = bg # additional arguments to function
)

# Inspect output
ora_classes
```

To do the same for each expression-based category (not class), you'd need
to replace `Class` with `Category` in the `split()` function (see example 
above).

## Example 2: overrepresentation analyses for differentially expressed genes

The same you can use `lapply()` to loop through each expression class,
you can also loop through each contrast in the list returned 
by `get_deg_list()`, and perform ORA for up- and down-regulated genes.
Below you can find an example:

```{r}
# Get up-regulated genes for each contrast
up_genes <- lapply(deg_list, function(x) rownames(x[x$log2FoldChange > 0, ]))
names(up_genes)
head(up_genes$F1_vs_P1)

# Perform ORA
ora_up <- lapply(
    up_genes,
    ora,
    annotation = go_chlamy, background = bg
)

ora_up
```

Likewise, for down-regulated genes, you need to replace the `>` symbol
with a `<` symbol in the anonymous function to subset rows.

# FAQ {.unnumbered}

1. How do I create a `SummarizedExperiment` object?

A `SummarizedExperiment` is a data structure (an S4 class) that can
be used to store, in a single object, the following elements:

1. **assay:** A quantitative matrix with features in rows and samples in 
columns. In the context of `r BiocStyle::Githubpkg("HybridExpress")`, this 
would be a gene expression matrix (in counts) with genes in rows and samples
in columns.
2. **colData:** A data frame of sample metadata with samples in rows
and variables that describe samples (e.g., tissue, treatment, and other 
covariates) in columns.
3. **rowData:** A data frame of gene metadata with genes in rows and
variables that describe genes (e.g., chromosome name, alternative IDs,
functional information, etc) in columns.

For this package, you must have an *assay* containing the count matrix
and a *colData* slot with sample metadata. *rowData* can be present,
but it is not required. 

To demonstrate how to create a `SummarizedExperiment` object, we will
extract the *assay* and *colData* from the example object `se_chlamy`
that comes with this package.

```{r}
# Get count matrix
count_matrix <- assay(se_chlamy)
head(count_matrix)

# Get colData (data frame of sample metadata)
coldata <- colData(se_chlamy)
head(coldata)
```

Once you have these two objects, you can create a `SummarizedExperiment` object
with:

```{r}
# Create a SummarizedExperiment object
new_se <- SummarizedExperiment(
    assays = list(counts = count_matrix),
    colData = coldata
)

new_se
```

For more details on this data structure, read the vignette
of the `r BiocStyle::Biocpkg("SummarizedExperiment")` package.

# Session information {.unnumbered}

This document was created under the following conditions:

```{r, echo = FALSE}
sessioninfo::session_info()
```

# References {.unnumbered}