---
title: "ReducedExperiment"
author: "Jack Gisby"
date: "`r Sys.Date()`" 
output:
    BiocStyle::html_document:
        number_sections: yes
        toc: true
vignette: >
    %\VignetteIndexEntry{ReducedExperiment}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}  
---

```{r setup, include = FALSE}
# Bioc markdown style
BiocStyle::markdown()

# Knitr options
knitr::opts_chunk$set(
  fig.width = 7,
  fig.height = 5,
  out.width = "70%",
  fig.align = "center",
  dpi = 300
)
```

# Installation

The release version of the package may be installed from Bioconductor, 
as follows:

```{r install-release, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("ReducedExperiment")
```

Alternatively, the development version may be installed from Bioconductor
or [GitHub](https://github.com/jackgisby/ReducedExperiment):

```{r install-devel, eval=FALSE}
BiocManager::install("ReducedExperiment", version = "devel")

devtools::install_github("jackgisby/ReducedExperiment")
```

# Introduction

```{r load-packages, message=FALSE, warning=FALSE, include=TRUE}
library(SummarizedExperiment)
library(ReducedExperiment)
library(ggplot2)

theme_set(theme_bw())
```

Dimensionality reduction approaches are often applied to reduce the complexity
of high-dimensional datasets, such as gene expression data produced by 
microarrays and RNA sequencing. The `ReducedExperiment` containers are able to
simultaneously store high-dimensional experimental data, sample- and 
feature-level metadata, and the dimensionally-reduced data. 

If you are not familiar with the `SummarizedExperiment` containers, we suggest
first reading through the package vignette, either through the 
[Bioconductor website](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) 
or by installing the package and running the following.

```{r se-vignette, eval=FALSE}
vignette("SummarizedExperiment", package = "SummarizedExperiment")
```

The key benefit of `SummarizedExperiment` objects is that they allow for both
assay and metadata to be kept in sync while slicing and reordering the features
or samples of an object. The `ReducedExperiment` objects inherit methods from
`SummarizedExperiment`, so the methods available for a `SummarizedExperiment`
object will also work on a `ReducedExperiment`. 

When applying dimensionality reduction to high-dimensional datasets, we 
introduce additional complexity. In addition to assay, row and column data, we
usually generate additional data matrices. For instance, methods like PCA or 
factor analysis tend to estimate a new matrix, with samples as rows and 
factors or components as columns. These methods also usually generate a 
loadings or rotation matrix, with features as rows and factors/components as 
columns.

The `ReducedExperiment` containers extend `SummarizedExperiment` by allowing for
the storage and slicing of these additional matrices in tandem with our original
data. This package implements 
`ReducedExperiment`, a high-level class that can store an additional matrix 
with samples as rows and reduced dimensions as columns. We also implemented 
two additional child classes for working with common dimensionality reduction 
methods.

The results of factor analysis can be stored in the `FactorisedExperiment` 
class, which additionally allows for the storage of a loadings matrix. The
`ModularExperiment` class is built to store analyses based on gene modules,
such as "Weighted Gene Correlation Network Analysis" (WGCNA), a popular method 
used to analyse transcriptomic datasets.

## Relationship to existing Bioconductor packages

Several Bioconductor packages provide functionality for storing reduced 
dimensional representations alongside experimental data. These include 
packages designed for single-cell analysis, proteomics data, and other 
high-dimensional omics data types. These include the `SingleCellExperiment` 
Bioconductor package, which provides a `reducedDims` slot for 
storing multiple low-dimensional representations.

While there is some overlap in capabilities, `ReducedExperiment` serves 
different use cases than existing frameworks:

- **Purpose**: Most existing packages use dimensionality reduction primarily 
to enable visualisation and facilitate downstream analyses, like clustering. 
`SingleCellExperiment`, for example, is designed for single-cell workflows 
where reduced dimensions help manage technical noise and enable cell type 
identification. In contrast, `ReducedExperiment` focuses on methods where 
interpreting the components themselves is the primary goal, such as 
Independent Component Analysis or gene module analysis.

- **Structure**: While packages like `SingleCellExperiment` allow storage of 
multiple dimensionality reduction results with minimal assumptions about their 
generation method, `ReducedExperiment` implements method-specific containers 
(e.g., `FactorisedExperiment`, `ModularExperiment`) that enforce strict 
relationships between features, samples, and components to enable
specialised analyses.

- **Analysis Methods**: `ReducedExperiment` provides integrated workflows for 
specific analysis approaches. For example, when working with ICA, 
it offers methods for assessing component stability and interpreting 
feature loadings. Some methods, like enrichment analysis, are available for
multiple analysis types, with specific considerations for each.

For workflows focused primarily on visualisation, clustering, or other 
downstream analyses, other Bioconductor containers may be more 
appropriate depending on your data type and analysis goals. 
`ReducedExperiment` is best suited for analyses where the interpretation of 
the components themselves (such as ICA factors or gene modules) is 
central to the research question.

# Introducing `ReducedExperiment` containers with a case study

There are two ways to build a `ReducedExperiment` object. Firstly, we provide
convenience functions that both perform dimensionality reduction and package
the results into the appropriate container class. We implement a method,
`estimateFactors`, for identifying factors through Independent Component
Analysis. We also provide a wrapper for applying WGCNA, `identifyModules`.

In this section of the vignette, we will apply both of these methods to 
some sample RNA sequencing data. Alternatively, if you have already applied 
dimensionality reduction to your data, you may simply construct a 
`ReducedExperiment` object from scratch. For instructions on how to do this,
see the section entitled "Constructing a `ReducedExperiment` from scratch".

## The case study data

For the purposes of this vignette, we will be working with some RNA sequencing
data generated for patients with COVID-19. The data were published in the 
following paper: https://www.nature.com/articles/s41467-022-35454-4

The data loaded below are based on the data available from the Zenodo 
repository: https://zenodo.org/records/6497251

The data were normalised with edgeR's TMM method and transformed to log-counts
per million. Genes with low counts were removed before further filtering to
select high-variance genes. For the purposes of this example, only 500
genes were retained and we downsampled the patients.

The COVID-19 data are split up into two cohorts, one for the first wave of
COVID-19 (early-mid 2020) and another for the second (early 2021). In this
vignette, we will mainly use the first cohort, whereas the second cohort
will be used as a validation dataset.

```{r load-data}
wave1_se <- readRDS(system.file(
    "extdata", "wave1.rds",
    package = "ReducedExperiment"
))
wave2_se <- readRDS(system.file(
    "extdata", "wave2.rds",
    package = "ReducedExperiment"
))

wave1_se
```

## Applying factor analysis

The first step in performing independent component analysis is to estimate the
optimal number of components. We can do this with the `estimateStability` 
function. The results of this may be plotted with `plotStability`.

Note that this can take a long time to run. To make this faster, we could
change the `BPPARAM` term of `estimateStability` to run the analysis in 
parallel.

The component estimation procedure is based on the maximally stable 
transcriptome dimension approach, described here:
https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-4112-9

Using this method, we can plot the stability of the factors as a function of 
the number of 
components. The plot on the left shows a line for each run. We perform
ICA for a sequence of components from 10 to 60, increasing the number of
components by two each time. So, we run ICA for 10, 12, 14, etc. components
and each of these runs produces a line in the plot. The plot on the right 
displays a single line displaying stability averaged over the runs. 

We ideally want to pick the maximal number of components with
acceptable stability. Between 30-40 components, the component stability appears
to drop off, so we move forward with 35 components.

Note that the data subset we are using here has relatively few features for an
RNA-seq (500). The drop-off in stability is more pronounced for the full
dataset, making it easier to identify the elbow.

```{r estimate-stability, eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
stability_res <- estimateStability(
    wave1_se,
    n_runs = 100,
    min_components = 10, max_components = 60, by = 2,
    mean_stability_threshold = 0.85,
    verbose = FALSE, 
    BPPARAM = BiocParallel::SerialParam(RNGseed = 1)
)

stability_plot <- plotStability(stability_res)
print(stability_plot$combined_plot)
```
```{r stability-image, echo=FALSE}
#| fig.cap = "Stability of components as a function of the number of components."
img_path <- system.file("stability.png", package = "ReducedExperiment")
if (file.exists(img_path)) {
  knitr::include_graphics(img_path)
} else {
  knitr::include_graphics("../figures/stability.png")
}
```

Now we run the factor analysis with these components. We additionally set a 
stability threshold of 0.25 to remove components with low stability. This
results in the identification of 34 factors.

```{r run-factor-analysis}
wave1_fe <- estimateFactors(
    wave1_se,
    nc = 35,
    use_stability = TRUE,
    n_runs = 30,
    stability_threshold = 0.25, 
    BPPARAM = BiocParallel::SerialParam(RNGseed = 1)
)

wave1_fe
```

The output of factor analysis is a `FactorisedExperiment` object. This is not
dissimilar from a `SummarizedExperiment` object, with some additional slots.
The `FactorisedExperiment` contains an additional matrix for the 
dimensionally-reduced data (i.e., samples vs. factors) and one for the factor
loadings (i.e., genes vs. factors).

We can get the reduced data as so.

```{r get-factor-reduced}
# get reduced data for first 5 samples and factors
reduced(wave1_fe)[1:5, 1:5]
```

And we get the factor loadings below.

```{r get-factor-loadings}
# get loadings for first 5 genes and factors
loadings(wave1_fe)[1:5, 1:5]
```

Most of the normal operations that can be performed on `SummarizedExperiment` 
objects
can also be performed on `FactorisedExperiment`s. We can slice them by samples,
features and components, like so.

```{r slice-fe}
dim(wave1_fe[1:20, 1:10, 1:5])
```

## Module analysis

Similarly, we can run a module analysis using the weighted gene correlation
network analysis (WGCNA) package: 
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559

The `ReducedExperiment` package provides wrapper functions for running 
such an analysis, before packaging them into a `ModularExperiment`
class. 

First, we use the `assessSoftThreshold` function to determine the 
soft-thresholding power appropriate for our dataset. By default, the function
selects the minimal power satisfying a minimal scale free topology fitting
index of `0.85` and a maximal average connectivity of `100`. See
WGCNA's `WGCNA::pickSoftThreshold` function for more details.

```{r assess-threshold}
WGCNA::disableWGCNAThreads()

wave1_fit_indices <- assessSoftThreshold(wave1_se)

wave1_estimated_power <- 
    wave1_fit_indices$Power[wave1_fit_indices$estimated_power]
print(paste0("Estimated power: ", wave1_estimated_power))
```

Now we can pass our data with the selected power to the `identifyModules`
function. Since for this example we've filtered out most of the genes that were 
in the original dataset, we allow for the identification of smaller modules of 
genes (10). The dendrogram is also stored in this object, allowing us to plot 
the clusters of genes that are identified.

```{r plot-dendro}
#| fig.cap = "Dendrogram of the modules identified in the first wave COVID-19 data."

wave1_me <- identifyModules(
    wave1_se,
    verbose = 0,
    power = wave1_estimated_power,
    minModuleSize = 10
)

plotDendro(wave1_me)
```

In this case we only identify 8 modules (module_0 is not a module itself,
but rather represents the unclustered genes, and is coloured in gray in
the plot above). Far more modules were identified in the
original paper: https://www.nature.com/articles/s41467-022-35454-4#Fig4

This is likely because we are only using a subset of samples and genes,
and we've used a different soft thresholding power. While `identifyModules`
can suggest a soft thresholding power automatically, it is generally recommended
that the user carefully consider the parameters used
to generate modules (e.g., by considering the output of 
`WGCNA::pickSoftThreshold`). For now, however, we will move 
forward with our 8 modules.

```{r print-me}
wave1_me
```

The `ModularExperiment` functions much like a `FactorisedExperiment`, with a slot 
for the dimensionally-reduced data (samples vs. modules) and loadings (a vector 
containing a value for each gene). There is additionally a slot for the module 
assignments. 

The reduced data contain expression profiles for each module that are 
representative of the member genes, also referred to as an "eigengene". We can
get these data as in the following chunk.

```{r get-module-reduced}
# get reduced data for first 5 samples and factors
reduced(wave1_me)[1:5, 1:5]
```

We can also find out which module each feature is assigned to.

```{r get-module-assignments}
# get assignments for first 5 genes
assignments(wave1_me)[1:5]
```

The `ModularExperiment` object also stores the loadings, indicating the 
alignment of each feature with the corresponding eigengene. 

```{r get-module-loadings}
# get loadings for first 5 genes
loadings(wave1_me)[1:5]
```

## Identifying factor and module associations
One thing we might be interested in doing is identifying factors that are 
associated with disease outcomes. We can do this with the 
`associateComponents` function. First, we can look at the sample-level data 
available in the `SummarizedExperiment` container.

```{r show-coldata}
selected_cols <- c(
    "sample_id", 
    "individual_id", 
    "case_control", 
    "sex", 
    "calc_age"
)

colData(wave1_se)[, selected_cols]
```

In this case, we will focus on the `case_control` variable, indicating whether
patients are COVID-19 positive or negative. For each component, we apply linear 
models that are adjusted for for sex and age.  

```{r associate-components}
f <- "~ case_control + sex + calc_age"

associations_me <- associateComponents(wave1_me, method = "lm", formula = f)
associations_fe <- associateComponents(wave1_fe, method = "lm", formula = f)
```

Below is a table for the significant associations between the factors and the 
`case_control` variable, encoding COVID-19 infection status.

```{r associate-table}
associations_fe$summaries[
    associations_fe$summaries$term == "case_controlPOSITIVE" &
        associations_fe$summaries$adj_pvalue < 0.05,
    c("term", "component", "estimate", "stderr", "adj_pvalue")
]
```

We can plot out these results. In the plot below, factors that are "upregulated"
in COVID-19 patients are shown in red, whereas "downregulated" factors are 
in blue. The vertical line indicates significance (i.e., adjusted p-values
less than 0.05). There are a number of factors that are significantly up- or 
down-regulated in COVID-19.

```{r plot-factor-associations}
#| fig.cap = "Associations between factors and COVID-19 status."

ggplot(
    associations_fe$summaries[
        associations_fe$summaries$term == "case_controlPOSITIVE",
    ],
    aes(-log10(adj_pvalue),
        reorder(component, -log10(adj_pvalue)),
        fill = estimate
    )
) +
    geom_point(pch = 21, size = 3) +
    xlab("-log10(adjusted p-value)") +
    ylab("Factor name (ordered by adjusted p-value)") +
    geom_vline(xintercept = -log10(0.05)) +
    scale_fill_gradient2(low = "#3A3A98", high = "#832424")
```

A similar plot for our modules shows significant upregulation of modules 1,
2, 7 and 8, along with down-regulation of modules 3 and 5.

```{r plot-module-associations}
#| fig.cap = "Associations between modules and COVID-19 status."

ggplot(
    associations_me$summaries[
        associations_me$summaries$term == "case_controlPOSITIVE",
    ],
    aes(-log10(adj_pvalue),
        reorder(component, -log10(adj_pvalue)),
        fill = estimate
    )
) +
    geom_point(pch = 21, size = 3) +
    xlab("-log10(adjusted p-value)") +
    ylab("Module name (ordered by adjusted p-value)") +
    geom_vline(xintercept = -log10(0.05)) +
    scale_fill_gradient2(low = "#3A3A98", high = "#832424")
```

## Finding functional enrichments for factors and modules

Now that we know which of our modules are up- and downregulated in COVID-19,
we want to know more about the genes and pathways that are associated
with our factors.

As shown before, we know which genes belong to each module by accessing
the module assignments in the `ModularExperiment` object. We can also estimate
the centrality of the genes in the module, based on the `WGCNA::signedKME` 
function.

```{r module-centrality}
module_centrality <- getCentrality(wave1_me)
head(module_centrality)
```

This indicates the correlation of each feature with the corresponding 
module "eigengene". We can also 
identify the genes that are highly aligned with each factor, as follows:

```{r factor-alignments}
aligned_features <- getAlignedFeatures(wave1_fe, format = "data.frame")
head(aligned_features[, c("component", "feature", "value")])
```

This shows the value of the loadings for the genes most highly associated with 
each factor. Some factors display moderate associations with many genes, 
whereas others show strong associations with just a few genes.

We can also perform pathway enrichment analyses for both factors and modules.
We use the enrichment methods implemented in the `clusterProfiler` package. By
default, these functions run gene set enrichment analysis (GSEA) and
overrepresentation tests for the factors and modules, respectively. For
the purposes of this vignette, we'll just do an overrepresentation analysis
for factor 5 and module 1.

```{r run-enrich, message=FALSE, warning=FALSE}

TERM2GENE <- getMsigdbT2G()

factor_5_enrich <- runEnrich(
    wave1_fe[, , "factor_5"], 
    method = "overrepresentation", 
    as_dataframe = TRUE,
    TERM2GENE = TERM2GENE
)

module_1_enrich <- runEnrich(
    wave1_me[, , "module_1"], 
    method = "overrepresentation", 
    as_dataframe = TRUE,
    TERM2GENE = TERM2GENE
)
```

Factor 5 is enriched for the pathway
"WP_NETWORK_MAP_OF_SARSCOV2_SIGNALING_PATHWAY".

```{r example-factor-enrichment}
rownames(factor_5_enrich) <- NULL
factor_5_enrich[, c("ID", "pvalue", "p.adjust", "geneID")]
```

Module 1 was upregulated in COVID-19 patients, and we see that it is
primarily enriched for pathways related to complement.

```{r example-module-enrichment}
#| fig.cap = "Top 15 enriched pathways for module 1."

ggplot(
    module_1_enrich[1:15, ],
    aes(-log10(p.adjust), reorder(substr(ID, 1, 45), -log10(p.adjust)))
) +
    geom_point(pch = 21, size = 3) +
    xlab("-log10(adjusted p-value)") +
    ylab("Pathway name (ordered by adjusted p-value)") +
    expand_limits(x = 0) +
    geom_vline(xintercept = -log10(0.05))
```

## Applying factors and modules to new datasets

Sometimes, we might want to calculate sample-level values of our factors and
modules in a new dataset. In this instance, we have a second cohort that we
might want to validate our results in. The `ReducedExperiment` package has
methods for projecting new data into the factor-space defined
previously and for recalculating eigengenes in new data.

### Factor projection

The factor projection approach is similar to that applied by the `predict` 
method of `prcomp`. It uses the loadings calculated in the original dataset to 
calculate sample-level scores given new data. Essentially, it uses these
loadings as a weighted score to perform the projection.

Doing this is very simple, as you can see below. Note, however, that these
methods assume that the new data (in this case, the wave 2 data) has been 
processed in the same way as the original data (in this case, wave 1). By 
default, the method applies standardisation to the new data using the means
and standard deviations calculated in the original data. 

Additionally, note that, by default, the projected data are rescaled to have
a mean of 0 and standard deviation of 1. So, it cannot be assumed that the
projected data are on the same scale as the original data.

With all that in mind, we can apply projection as follows. These datasets were
processed and normalised in the same way.

```{r project-data}
wave2_fe <- projectData(wave1_fe, wave2_se)
wave2_fe
```

The new `FactorisedExperiment` created by projecting the wave 2 data (above) 
has the same genes (500) and factors (34) as the original 
`FactorisedExperiment` (below), but with different samples (83 vs. 65). 

```{r reprint-original}
wave1_fe
```

While we will avoid directly comparing the dimensionally-reduced data from
the two cohorts, we can calculate associations between the factors and modules
and COVID-19 status, as we did in the original cohort. In this dataset
there are multiple samples for each patient, so we use a linear mixed model
(`method = "lmer"`) rather than a standard linear model (`method = "lm"`). We
also add a random intercept for the patient ID in the model formula.

```{r replicating-associations}
replication_fe <- associateComponents(
    wave2_fe, 
    method = "lmer", 
    formula = paste0(f, " + (1|individual_id)")
)
```

While there are differences in the most significantly associated modules
between the wave 1 and 2 cohorts, the results are broadly similar. 

```{r plotting-replicated-associations}
#| fig.cap = "Associations between modules and COVID-19 status in the wave 2 dataset."

ggplot(
    replication_fe$summaries[
        replication_fe$summaries$term == "case_controlPOSITIVE",
    ],
    aes(
        -log10(adj_pvalue),
        reorder(component, -log10(adj_pvalue)),
        fill = estimate
    )
) +
    geom_point(pch = 21, size = 3) +
    xlab("-log10(adjusted p-value)") +
    ylab("Module name (ordered by adjusted p-value)") +
    geom_vline(xintercept = -log10(0.05)) +
    scale_fill_gradient2(low = "#3A3A98", high = "#832424") +
    expand_limits(x = 0)
```

And, as we see in the following chunk, the estimated associations between
the modules and COVID-19 status (positive vs. negative) have a strong positive
correlation. 

Differences in the estimated associations between modules and COVID-19 status
may be explained by differences in the cohorts. For instance, different 
SARS-CoV-2 variants were in prevalent at the time the cohorts were sampled, and 
there were more treatments available for the 2021 (wave 2) cohort.

```{r correlation-with-projection}
original_estimates <- associations_fe$summaries$estimate[
    associations_fe$summaries$term == "case_controlPOSITIVE"
]
replication_estimates <- replication_fe$summaries$estimate[
    replication_fe$summaries$term == "case_controlPOSITIVE"
]

cor.test(original_estimates, replication_estimates)
```

### Applying modules to new data

Similarly to factors, we can look at how our modules behave in new data. Below,
we use `calcEigengenes` to do so, and we look at whether the modules behave
similarly with respect to COVID-19 status. As we saw for factors, there is a 
reasonable concordance between the modules in the two datasets.

```{r calc-eigengenes-new-data}
wave2_me <- calcEigengenes(wave1_me, wave2_se)

replication_me <- associateComponents(
    wave2_me, 
    method = "lmer", 
    formula = paste0(f, " + (1|individual_id)")
)

original_estimates <- associations_me$summaries$estimate[
    associations_me$summaries$term == "case_controlPOSITIVE"
]
replication_estimates <- replication_me$summaries$estimate[
    replication_me$summaries$term == "case_controlPOSITIVE"
]

cor.test(original_estimates, replication_estimates)
```

### Module preservation

We can also consider the preservation of modules. Below, we take our modules
calculated in the Wave 1 dataset, and assess their reproducibility in the
Wave 2 dataset. A variety of statistics are calculated by WGCNA's
`modulePreservation`. By default, the `plotModulePreservation` function 
shows us the "medianRank" and "Zsummary" statistics. You can find out more
about these in the corresponding 
[publication](https://doi.org/10.1371/journal.pcbi.1001057), but in general
we are interested in modules with higher Zsummary scores. In short, we can
consider modules above the blue line (Z > 2) to be moderately preserved and
modules above the green line (Z > 10) to be highly preserved. We see
that modules 1 and 2 have relatively high preservation in the Wave 2 dataset,
whereas modules 5 and 0 (our unclustered genes) display relatively low 
preservation. 

```{r module-preservation}
#| fig.cap = "Module preservation between the Wave 1 and Wave 2 datasets."

# Using a low number of permutations for the vignette
mod_pres_res <- modulePreservation(
    wave1_me, 
    wave2_se, 
    verbose = 0, 
    nPermutations = 5
)

plotModulePreservation(mod_pres_res)
```

# Extending the `ReducedExperiment` class

The `SummarizedExperiment` package includes a detailed vignette explaining
how to derive new containers from the `SummarizedExperiment` class.

```{r extension-vignette, eval=FALSE}
vignette("Extensions", package = "SummarizedExperiment")
```

The `ReducedExperiment` containers may be extended in a similar fashion. 
In this package, we implemented `ReducedExperiment` and two derivative classes, 
`ModularExperiment` and `FactorisedExperiment`. While these are convenient for 
working with the output of module and factor analysis, respectively, other 
dimensionality reduction approaches may require the incorporation of additional
components and methods. 

For instance, we could define a `PCAExperiment` class for storing the output
of `stats::prcomp` (i.e., Principal Components Analysis, or PCA). 
The function produces the following:

* `x`: The reduced data, which we will store in the existing `ReducedExperiment`
`reduced` slot.
* `rotation`: The feature loadings, for which we will create a new slot,
* `sdev`: The standard deviations of the principal components, for which
we will create a new slot.
* `center` and `scale`: The centering and scaling used, which we will store
in their existing `ReducedExperiment` slots.

```{r extend-class}
PCAExperiment <- setClass(
    "PCAExperiment",
    contains = "ReducedExperiment",
    slots = representation(
        rotation = "matrix",
        sdev = "numeric"
    )
)
```

We may then run `stats::prcomp` on our data and store the output in a
`PCAExperiment` container. We give the original `SummarizedExperiment` object
as an argument, and we also supply the outputs of `stats::prcomp` to the 
slots defined for `ReducedExperiment` and the new slots defined above for
`PCAExperiment`.

```{r create-pca-experiment}
prcomp_res <- stats::prcomp(t(assay(wave1_se)), center = TRUE, scale. = TRUE)

my_pca_experiment <- PCAExperiment(
    wave1_se,
    reduced = prcomp_res$x,
    rotation = prcomp_res$rotation,
    sdev = prcomp_res$sdev,
    center = prcomp_res$center,
    scale = prcomp_res$scale
)

my_pca_experiment
```

We can retrieve the principal components.

```{r get-pca-reduced}
reduced(my_pca_experiment)[1:5, 1:5]
```

But to retrieve our new `rotation` matrix we need to define a "getter" method.

```{r get-rotation}
setGeneric("rotation", function(x, ...) standardGeneric("rotation"))

setMethod("rotation", "PCAExperiment", function(x, withDimnames = TRUE) {
    return(x@rotation)
})

rotation(my_pca_experiment)[1:5, 1:5]
```

And to change the values of the `rotation` matrix we must also define a setter:

```{r set-rotation}
setGeneric("rotation<-", function(x, ..., value) standardGeneric("rotation<-"))

setReplaceMethod("rotation", "PCAExperiment", function(x, value) {
    x@rotation <- value
    validObject(x)
    return(x)
})

# Set the value at position [1, 1] to 10 and get
rotation(my_pca_experiment)[1, 1] <- 10
rotation(my_pca_experiment)[1:5, 1:5]
```

To fully flesh out our class we would also need to consider:

* Defining getters and setters for the `sdev` slot
* Implementing a constructor method to make creating `PCAExperiment` simpler and
add sensible defaults
* Implementing a method to check the validity of our object
* Modifying existing methods, such as subsetting operations (i.e., `]`), to
handle the new slots we have introduced

For the `ModularExperiment` and `FactorisedExperiment` classes, we additionally
implemented a variety of methods specific to the type of data they store. Since 
the results of PCA are similar to that of factor analysis, in practice
it would likely be more straightforward to simply store these results in a
`FactorisedExperiment`.

# Constructing a `ReducedExperiment` from scratch

If you have already applied dimensionality reduction to your data, you may be
able to package the results into one of the `ReducedExperiment` containers.
Below, we construct a `FactorisedExperiment` container based on the results of
PCA applied to gene expression data. This container is appropriate, because we
can store the sample-level principal components in the `reduced` slot of the
object, and we may store the rotation matrix returned by `stats::prcomp` to
the `loadings` slot. 

To construct our `FactorisedExperiment`, we provide a `SummarizedExperiment` as
our first argument. In this case we reconstruct our `SummarizedExperiment`
for demonstration purposes, but if you have already constructed a 
`SummarizedExperiment` you may simply pass this as the first argument.

We then pass the results of `stats::prcomp` to the object's slots. This object
is now ready to work with as described in the section above 
("Introducing `ReducedExperiment` containers with a case study").

```{r make-fe-from-pca}
# Construct a SummarizedExperiment (if you don't already have one)
expr_mat <- assay(wave1_se)
pheno_data <- colData(wave1_se)
se <- SummarizedExperiment(
    assays = list("normal" = expr_mat),
    colData = pheno_data
)

# Calculate PCA with prcomp
prcomp_res <- stats::prcomp(t(assay(se)), center = TRUE, scale. = TRUE)

# Create a FactorisedExperiment container
fe <- FactorisedExperiment(
    se,
    reduced = prcomp_res$x,
    loadings = prcomp_res$rotation,
    stability = prcomp_res$sdev,
    center = prcomp_res$center,
    scale = prcomp_res$scale
)

fe
```

# Vignette references

This vignette used publicly available gene expression data (RNA sequencing)
from the following publication:

* "Multi-omics identify falling LRRC15 as a COVID-19 severity marker and 
    persistent pro-thrombotic signals in convalescence"
    (https://doi.org/10.1038/s41467-022-35454-4)
    - Gisby *et al*., 2022
* Zenodo data deposition (https://zenodo.org/doi/10.5281/zenodo.6497250)

The stability-based ICA algorithm implemented in this package for identifying
latent factors is based on the ICASSO algorithm, references for which
can be found below:

* "Icasso: software for investigating the reliability of ICA estimates by 
    clustering and visualization" 
    (https://doi.org/10.1109/NNSP.2003.1318025) 
    - Himberg *et al*., 2003
* "stabilized-ica" Python package 
    (https://github.com/ncaptier/stabilized-ica) 
    - Nicolas Captier

The `estimateStability` function is based on the related Most Stable
Transcriptome Dimension approach, which also has a Python implementation 
provided by the stabilized-ica Python package.

* "Determining the optimal number of independent components for reproducible 
    transcriptomic data analysis" 
    (https://doi.org/10.1186/s12864-017-4112-9) 
    - Kairov *et al*., 2017

Identification of coexpressed gene modules is carried out through the
Weighted Gene Correlation Network Analysis (WGCNA) framework

* "WGCNA: an R package for weighted correlation network analysis"
    (https://doi.org/10.1186/1471-2105-9-559)
    - Langfelder and Horvath, 2008
* "WGCNA" R package (https://cran.r-project.org/web/packages/WGCNA/index.html)
* "Is My Network Module Preserved and Reproducible?" 
    (https://doi.org/10.1371/journal.pcbi.1001057) 
    - Langfelder, Luo, Oldham and Horvath, 2011

# Session Information

Here is the output of `sessionInfo()` on the system on which this document was 
compiled:

```{r session-info, echo=FALSE}
sessionInfo()
```