--- title: "Comparative transcriptomic analysis of hybrids and their progenitors" author: - name: Fabricio Almeida-Silva affiliation: VIB-UGent Center for Plant Systems Biology, Ghent, Belgium - name: Lucas Prost-Boxoen affiliation: VIB-UGent Center for Plant Systems Biology, Ghent, Belgium - name: Yves Van de Peer affiliation: VIB-UGent Center for Plant Systems Biology, Ghent, Belgium output: BiocStyle::html_document: toc: true toc_float: true number_sections: yes bibliography: vignette_bibliography.bib vignette: > %\VignetteIndexEntry{Comparative transcriptomic analysis of hybrids and their progenitors} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ) ``` # Introduction The formation of hybrids through the fusion of distinct genomes and subsequent genome duplication (in cases of allopolyploidy) represents a significant evolutionary event with complex effects on cellular biology, particularly gene expression. The impact of such genome mergings and duplications on transcription remain incompletely understood. To bridge this gap, we introduce __HybridExpress__, a comprehensive package designed to facilitate the comparative transcriptomic analysis of hybrids and their progenitor species. __HybridExpress__ is tailored for RNA-Seq data derived from a 'hybrid triplet': the hybrid organism and its two parental species. This package offers a suite of intuitive functions enabling researchers to perform differential expression analysis with ease, generate principal component analysis (PCA) plots to visualize sample grouping, categorize genes into 12 distinct expression pattern groups (as in @rapp2009genomic), and conduct functional analyses. Acknowledging the potential variability in cell and transcriptome size across species and ploidy levels, __HybridExpress__ incorporates features for rigorous normalization of count data. Specifically, it allows for the integration of spike-in controls directly into the normalization process, ensuring accurate transcriptome size adjustments when these standards are present in the RNA-Seq count data (see full methodology in @coate2023beyond). By offering these capabilities, __HybridExpress__ provides a robust tool set for unraveling the intricate effects of genome doubling and merging on gene expression, paving the way for novel insights into the cellular biology of hybrid organisms. # Installation `r BiocStyle::Githubpkg("HybridExpress")` can be installed from Bioconductor with the following code: ```{r installation, eval=FALSE} if(!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager') BiocManager::install("HybridExpress") ``` ```{r load_package, message=FALSE} # Load package after installation library(HybridExpress) set.seed(123) # for reproducibility ``` # Data description For this vignette, we will use an example data set that comprises (unpublished) gene expression data (in counts) for *Chlamydomonas reinhardtii*. In our lab, we crossed a diploid line of *C. reinhardtii* (hereafter "P1") with a haploid line (hereafter "P2"), thus generating a triploid line through the merging of the two parental genomes. The count matrix and sample metadata are stored in `SummarizedExperiment` objects, which is the standard Bioconductor data structure required by `r BiocStyle::Githubpkg("HybridExpress")`. For instructions on how to create a `SummarizedExperiment` object, check the FAQ section of this vignette. Let's load the example data and take a quick look at it: ```{r message = FALSE} library(SummarizedExperiment) # Load data data(se_chlamy) # Inspect the `SummarizedExperiment` object se_chlamy ## Take a look at the colData and count matrix colData(se_chlamy) assay(se_chlamy) |> head() table(se_chlamy$Ploidy, se_chlamy$Generation) ``` As you can see, the count matrix contains `r nrow(se_chlamy)` genes and `r ncol(se_chlamy)` samples, with 6 replicates for parent 1 (P1, diploid), 6 replicates for parent 2 (P2, haploid), and 6 replicates for the progeny (F1, triploid). # Adding midparent expression values First of all, you'd want to add in your count matrix *in silico* samples that contain the expression values of the midparent. This can be done with the function `add_midparent_expression()`, which takes a random sample pair (one sample from each parent, sampling without replacement) and calculates the midparent expression value in one of three ways: 1. **Mean (default):** get the mean expression of the two samples. 2. **Sum:** get the sum of the two samples. 3. **Weighted mean:** get the weighted mean of the two samples by multiplying the expression value of each parent by a weight. Typically, this can be used if the two parents have different ploidy levels, and the weights would correspond to the ploidy level of each parent. For this function, besides specifying the method to obtain the midparent expression values (i.e., "mean", "sum", or "weightedmean"), users must also specify the name of the column in `colData` that contains information about the generations (default: *"Generation"*), as well as the levels corresponding to each parent (default: *"P1"* and *"P2"* for parents 1 and 2, respectively). ```{r} # Add midparent expression using the mean of the counts se <- add_midparent_expression(se_chlamy) head(assay(se)) # Alternative 1: using the sum of the counts add_midparent_expression(se_chlamy, method = "sum") |> assay() |> head() # Alternative 2: using the weighted mean of the counts (weights = ploidy) w <- c(2, 1) # P1 = diploid; P2 = haploid add_midparent_expression(se_chlamy, method = "weightedmean", weights = w) |> assay() |> head() ``` We will proceed our analyses with the midparent expression values obtained from the mean of the counts, stored in the `se` object. # Normalizing count data To normalize count data, the function `add_size_factors()` calculates size factors (used by `r BiocStyle::Biocpkg("DESeq2")` for normalization) and adds them as an extra column in the colData slot of your `SummarizedExperiment` object. Such size factors are calculated using one of two methods: 1. By library size (default) using the 'median of ratios' method implemented in `r BiocStyle::Biocpkg("DESeq2")`. 2. By cell size/biomass using spike-in controls (if available). If spike-in controls are present in the count matrix, you can use them for normalization by setting `spikein = TRUE` and specifying the pattern used to indicate rows that contain spike-ins (usually they start with *ERCC*)[^1]. Normalization with spike-in controls is particularly useful if the amount of mRNA per cell is not equal between generations (due to, for instance, different ploidy levels, which in turn can lead to different cell sizes and/or biomass). [^1]: **Note:** if you use spike-in normalization, the function `add_size_factors()` automatically removes rows containing spike-in counts from the `SummarizedExperiment` object after normalization. However, if you have counts for spike-in controls in your count matrix, but do not want to use spike-in normalization, you should remove them before creating the `SummarizedExperiment` object. In our example data set, spike-in controls are available in the last rows of the count matrix. Let's take a look at them. ```{r} # Show last rows of the count matrix assay(se) |> tail() ``` As you can see, rows with spike-in counts start with "ERCC". Let's add size factors to our `SummarizedExperiment` object using spike-in controls. ```{r} # Take a look at the original colData colData(se) # Add size factors estimated from spike-in controls se <- add_size_factors(se, spikein = TRUE, spikein_pattern = "ERCC") # Take a look at the new colData colData(se) ``` # Exploratory data analyses Next, you can perform exploratory analyses of sample clustering to verify if samples group as expected. With `r BiocStyle::Githubpkg("HybridExpress")`, this can be performed using two functions: 1. `pca_plot()`: creates principal component analysis (PCA) plots, with colors and shapes (optional) mapped to levels of `colData` variables; 2. `plot_samplecor()`: plots a heatmap of hierarchically clustered pairwise sample correlations. Let's start with the PCA plot: ```{r} # For colData rows with missing values (midparent samples), add "midparent" se$Ploidy[is.na(se$Ploidy)] <- "midparent" se$Generation[is.na(se$Generation)] <- "midparent" # Create PCA plot pca_plot(se, color_by = "Generation", shape_by = "Ploidy", add_mean = TRUE) ``` In the plot above, each data point corresponds to a sample, and colors and shapes are mapped to levels of the variables specified in the arguments `color_by` and `shape_by`, respectively. Besides, by specifying `add_mean = TRUE`, we added a diamond shape indicating the mean PC coordinates based on the variable in `color_by` (here, *"Generation"*). Now, let's plot the heatmap of sample correlations: ```{r} # Plot a heatmap of sample correlations plot_samplecor(se, coldata_cols = c("Generation", "Ploidy")) ``` We can see that samples group well together both in the PCA plot and in the correlation heatmap. Of note, both `pca_plot()` and `plot_samplecor()` use only the top 500 genes with the highest variances to create the plot. This is because genes with low variances (i.e., genes that do not vary much across samples) are uninformative and typically only add noise. You can change this number (to use more or less genes) in the `ntop` argument of both functions. # Identifying differentially expressed genes between hybrids and their parents To compare gene expression levels of hybrids to their progenitor species, you can use the function `get_deg_list()`. This function performs differential expression analyses using `r BiocStyle::Biocpkg("DESeq2")` and returns a list of data frames with gene-wise test statistics for the following contrasts: 1. `P2_vs_P1`: parent 2 (numerator) versus parent 1 (denominator). 2. `F1_vs_P1`: hybrid (numerator) versus parent 1 (denominator). 3. `F1_vs_P2`: hybrid (numerator) versus parent 2 (denominator). 4. `F1_vs_midparent`: hybrid (numerator) vs midparent (denominator). The size factors estimated with `add_size_factors()` are used for normalization before differential expression testing. Let's use `get_deg_list()` to get differentially expressed genes (DEGs) for each contrast. ```{r} # Get a list of differentially expressed genes for each contrast deg_list <- get_deg_list(se) # Inspecting the output ## Getting contrast names names(deg_list) ## Accessing gene-wise test statistics for contrast `F1_vs_P1` head(deg_list$F1_vs_P1) ## Counting the number of DEGs per contrast sapply(deg_list, nrow) ``` To summarize the frequencies of up- and down-regulated genes per contrast in a single data frame, use the function `get_deg_counts()`. ```{r} # Get a data frame with DEG frequencies for each contrast deg_counts <- get_deg_counts(deg_list) deg_counts ``` It is important to note that the columns `perc_up`, `perc_down`, and `perc_total` show the percentages of up-regulated, down-regulated, and all differentially expressed genes relative to the total number of genes in the count matrix. The total number of genes in the count matrix is stored in the `ngenes` attribute of the list returned by `get_deg_list()`: ```{r} attributes(deg_list)$ngenes ``` However, since the count matrix usually does not include all genes in the genome (e.g., lowly expressed genes and genes with low variance are usually filtered out), the percentages in `perc_up`, `perc_down`, and `perc_total` are not relative to the total number of genes in the genome. To use the total number of genes in the genome as the reference, we need to update the `ngenes` attribute of the DEG list with the appropriate number as follows: ```{r} # Total number of genes in the C. reinhardtii genome (v6.1): 16883 attributes(deg_list)$ngenes <- 16883 ``` Then, we can run `get_deg_counts()` again to get the percentages relative to the total number of genes in the genome. ```{r} deg_counts <- get_deg_counts(deg_list) deg_counts ``` Finally, we can summarize everything in a single publication-ready figure using the plot `plot_expression_triangle()`, which shows the 'experimental trio' (i.e., hybrid and its progenitors) as a triangle, with the frequencies of DEGs indicated. ```{r} # Plot expression triangle plot_expression_triangle(deg_counts) ``` This figure is commonly used in publications, and it was inspired by @rapp2009genomic. For each edge (line), numbers in the middle (in bold) indicate the frequency of all DEGs, and numbers at the ends (close to boxes) indicate the frequency of up-regulated genes in each generation. For instance, the figure above shows that, for the contrast between F1 and P1, there are `r deg_counts[2, "total"]` DEGs (`r deg_counts[2, "perc_total"]`% of the genome), of which `r deg_counts[2, "up"]` are up-regulated in F1, and `r deg_counts[2, "down"]` are up-regulated in P1. For a custom figure, you can also specify your own color palette and labels for the boxes. For example: ```{r fig.height=5} # Create vectors (length 4) of colors and box labels pal <- c("springgreen4", "darkorange3", "mediumpurple4", "mediumpurple3") labels <- c("Parent 1\n(2n)", "Parent 2\n(n)", "Progeny\n(3n)", "Midparent") plot_expression_triangle(deg_counts, palette = pal, box_labels = labels) ``` # Expression-based gene classification After identifying DEGs for different contrasts, you'd typically want to classify your genes into expression partitions based on their expression patterns. This can be performed with the function `expression_partitioning()`, which classifies genes into one of the 12 **categories** as in @rapp2009genomic, and into 5 major **classes** that summarize the 12 categories. The five classes are: 1. **Transgressive up-regulation (UP):** gene is up-regulated in the hybrid as compared to both parents. 2. **Transgressive down-regulation (DOWN):** gene is down-regulated in the hybrid as compared to both parents. 3. **Additivity (ADD):** gene expression in the hybrid is the mean of both parents (additive effect). 4. **Expression-level dominance toward parent 1 (ELD_P1):** gene expression in the hybrid is the same as in parent 1, but different from parent 2. 5. **Expression-level dominance toward parent 2 (ELD_P2):** gene expression in the hybrid is the same as in parent 2, but different from parent 1. ```{r} # Classify genes in expression partitions exp_partitions <- expression_partitioning(deg_list) # Inspect the output head(exp_partitions) # Count number of genes per category table(exp_partitions$Category) # Count number of genes per class table(exp_partitions$Class) ``` To visualize the expression partitions as a scatter plot of expression divergences, you can use the function `plot_expression_partitions()`. ```{r, fig.height=6, fig.width=8} # Plot partitions as a scatter plot of divergences plot_expression_partitions(exp_partitions, group_by = "Category") ``` By default, genes are grouped by `Category`. However, you can also group genes by `Class` as follows: ```{r, fig.height=7, fig.width=8} # Group by `Class` plot_expression_partitions(exp_partitions, group_by = "Class") ``` You can also visualize the frequencies of genes in each partition with the function `plot_partition_frequencies()`. ```{r, fig.height=7} # Visualize frequency of genes in each partition ## By `Category` (default) plot_partition_frequencies(exp_partitions) ## By `Class` plot_partition_frequencies(exp_partitions, group_by = "Class") ``` # Overrepresentation analysis of functional terms Lastly, you'd want to explore whether gene sets of interest (e.g., up-regulated genes in each contrast) are enriched in any particular GO term, pathway, protein domain, etc. For that, you will use the function `ora()`, which performs overrepresentation analysis for a gene set given a data frame of functional annotation for each gene. Here, we will use an example data set with GO annotation for *C. reinhardtii* genes. This data set illustrates how the annotation data frame must be shaped: gene ID in the first column, and functional annotations in other columns. ```{r} # Load example functional annotation (GO terms) data(go_chlamy) head(go_chlamy) ``` To demonstrate the usage of `ora()`, let's check if we can find enrichment for any GO term among genes assigned to class "ADD". ```{r} # Get a vector of genes in class "ADD" genes_add <- exp_partitions$Gene[exp_partitions$Class == "ADD"] head(genes_add) # Get background genes - genes in the count matrix bg <- rownames(se) # Perform overrepresentation analysis ora_add <- ora(genes_add, go_chlamy, background = bg) # Inspect results head(ora_add) ``` ## Example 1: overrepresentation analyses for all expression-based classes In the example above, we performed an overrepresentation analysis in genes associated with class "ADD". What if we wanted to do the same for *all* classes? In that case, you could run `ora()` multiple times by looping over each class. In details, you would do the following: 1. For each class, create a vector of genes associated with it; 2. Run `ora()` to get a data frame with ORA results. Below you can find an example on how to do it using `lapply()`. Results for each class will be stored in elements of a list object. ```{r} # Create a list of genes in each class genes_by_class <- split(exp_partitions$Gene, exp_partitions$Class) names(genes_by_class) head(genes_by_class$UP) # Iterate through each class and perform ORA ora_classes <- lapply( genes_by_class, # list through which we will iterate ora, # function we will apply to each element of the list annotation = go_chlamy, background = bg # additional arguments to function ) # Inspect output ora_classes ``` To do the same for each expression-based category (not class), you'd need to replace `Class` with `Category` in the `split()` function (see example above). ## Example 2: overrepresentation analyses for differentially expressed genes The same you can use `lapply()` to loop through each expression class, you can also loop through each contrast in the list returned by `get_deg_list()`, and perform ORA for up- and down-regulated genes. Below you can find an example: ```{r} # Get up-regulated genes for each contrast up_genes <- lapply(deg_list, function(x) rownames(x[x$log2FoldChange > 0, ])) names(up_genes) head(up_genes$F1_vs_P1) # Perform ORA ora_up <- lapply( up_genes, ora, annotation = go_chlamy, background = bg ) ora_up ``` Likewise, for down-regulated genes, you need to replace the `>` symbol with a `<` symbol in the anonymous function to subset rows. # FAQ {.unnumbered} 1. How do I create a `SummarizedExperiment` object? A `SummarizedExperiment` is a data structure (an S4 class) that can be used to store, in a single object, the following elements: 1. **assay:** A quantitative matrix with features in rows and samples in columns. In the context of `r BiocStyle::Githubpkg("HybridExpress")`, this would be a gene expression matrix (in counts) with genes in rows and samples in columns. 2. **colData:** A data frame of sample metadata with samples in rows and variables that describe samples (e.g., tissue, treatment, and other covariates) in columns. 3. **rowData:** A data frame of gene metadata with genes in rows and variables that describe genes (e.g., chromosome name, alternative IDs, functional information, etc) in columns. For this package, you must have an *assay* containing the count matrix and a *colData* slot with sample metadata. *rowData* can be present, but it is not required. To demonstrate how to create a `SummarizedExperiment` object, we will extract the *assay* and *colData* from the example object `se_chlamy` that comes with this package. ```{r} # Get count matrix count_matrix <- assay(se_chlamy) head(count_matrix) # Get colData (data frame of sample metadata) coldata <- colData(se_chlamy) head(coldata) ``` Once you have these two objects, you can create a `SummarizedExperiment` object with: ```{r} # Create a SummarizedExperiment object new_se <- SummarizedExperiment( assays = list(counts = count_matrix), colData = coldata ) new_se ``` For more details on this data structure, read the vignette of the `r BiocStyle::Biocpkg("SummarizedExperiment")` package. # Session information {.unnumbered} This document was created under the following conditions: ```{r, echo = FALSE} sessioninfo::session_info() ``` # References {.unnumbered}