---
title: "An introduction to the MoleculeExperiment Class"
date: "`r BiocStyle::doc_date()`"
author:
- name: Bárbara Zita Peters Couto
  affiliation:
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- name: Nick Robertson
  affiliation:
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- name: Mu-Wei Chung
  affiliation:
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- name: Ellis Patrick
  affiliation:
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW, 2006, Australia
  - Westmead Institute for Medical Research, University of Sydney, Australia
- name: Shila Ghazanfar
  affiliation:
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW, 2006, Australia
vignette: >
  %\VignetteIndexEntry{"Introduction to MoleculeExperiment"}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output: 
  BiocStyle::html_document
---

```{r global-options, include=FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  comment = "#>"
)
library(BiocStyle)
```

# MoleculeExperiment

The R package MoleculeExperiment contains functions to create and work with
objects from the new MoleculeExperiment class. We introduce this class for
analysing molecule-based spatial transcriptomics data (e.g., Xenium by 10X,
CosMx SMI by Nanostring, and Merscope by Vizgen, among others).

## Why the MoleculeExperiment class?

The goal of the MoleculeExperiment class is to:
1. Enable analysis of spatial transcriptomics (ST) data at the molecule level, 
independent of aggregating to the cell or tissue level.
2. Standardise molecule-based ST data across vendors, to hopefully facilitate
comparison of different data sources and common analytical and visualisation 
workflows.
3. Enable aggregation to a `SpatialExperiment` object given combinations of 
molecules and segmentation boundaries.

## Installation

The latest release of MoleculeExperiment can be installed using:

```{r, eval=FALSE}
if (!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("MoleculeExperiment")
```

# Minimal example

1. Load required libraries.

```{r}
library(MoleculeExperiment)
library(ggplot2)
library(EBImage)
```

2. Create MoleculeExperiment object with example Xenium data, taken over a 
small patch.

```{r}
repoDir <- system.file("extdata", package = "MoleculeExperiment")
repoDir <- paste0(repoDir, "/xenium_V1_FF_Mouse_Brain")

me <- readXenium(repoDir, keepCols = "essential")
me
```

3. Use standardised data in ME object for molecule-level analyses. For example,
plot a simple digital in-situ, with cell boundaries overlaid. 

```{r}
ggplot_me() +
  geom_polygon_me(me, assayName = "cell", fill = "grey") +
  geom_point_me(me) +
  # zoom in to selected patch area
  coord_cartesian(
    xlim = c(4900, 4919.98),
    ylim = c(6400.02, 6420)
  )
```

We can plot molecules only belonging to specific genes via the `selectFeatures` 
argument.

```{r}
ggplot_me() +
  geom_polygon_me(me, assayName = "cell", fill = "grey") +
  geom_point_me(me, byColour = "feature_id", selectFeatures = c("Nrn1", "Slc17a7")) +
  # zoom in to selected patch area
  coord_cartesian(
    xlim = c(4900, 4919.98),
    ylim = c(6400.02, 6420)
  )
```

4. Finally, it is also possible to go from a MoleculeExperiment object to a
SpatialExperiment object. This enables the transition from a molecule-level
analysis to a cell-level analysis with already existing tools.

```{r}
# transform ME to SPE object
spe <- countMolecules(me)
spe
```

# The ME object in detail

## Constructing an ME object

### Use case 1: from dataframes to ME object 

Here we demonstrate how to work with an ME object from toy data, representing a 
scenario where both the detected transcripts information and the boundary 
information have already been read into R. This requires the standardisation
of the data with the `dataframeToMEList()` function.

The flexibility of the arguments in `dataframeToMEList()` enable the creation of
a standard ME object across dataframes coming from different vendors of
molecule-based spatial transcriptomics technologies. 

1) Generate a toy transcripts data.frame.

```{r}
moleculesDf <- data.frame(
  sample_id = rep(c("sample1", "sample2"), times = c(30, 20)),
  features = rep(c("gene1", "gene2"), times = c(20, 30)),
  x_coords = runif(50),
  y_coords = runif(50)
)
head(moleculesDf)
```

2) Generate a toy boundaries data.frame.

```{r}
boundariesDf <- data.frame(
  sample_id = rep(c("sample1", "sample2"), times = c(16, 6)),
  cell_id = rep(
    c(
      "cell1", "cell2", "cell3", "cell4",
      "cell1", "cell2"
    ),
    times = c(4, 4, 4, 4, 3, 3)
  ),
  vertex_x = c(
    0, 0.5, 0.5, 0,
    0.5, 1, 1, 0.5,
    0, 0.5, 0.5, 0,
    0.5, 1, 1, 0.5,
    0, 1, 0, 0, 1, 1
  ),
  vertex_y = c(
    0, 0, 0.5, 0.5,
    0, 0, 0.5, 0.5,
    0.5, 0.5, 1, 1,
    0.5, 0.5, 1, 1,
    0, 1, 1, 0, 0, 1
  )
)
head(boundariesDf)
```

3) Standardise transcripts dataframe to ME list format.

```{r}
moleculesMEList <- dataframeToMEList(moleculesDf,
  dfType = "molecules",
  assayName = "detected",
  sampleCol = "sample_id",
  factorCol = "features",
  xCol = "x_coords",
  yCol = "y_coords"
)
str(moleculesMEList, max.level = 3)
```

4) Standardise boundaries dataframe to ME list format.

```{r}
boundariesMEList <- dataframeToMEList(boundariesDf,
  dfType = "boundaries",
  assayName = "cell",
  sampleCol = "sample_id",
  factorCol = "cell_id",
  xCol = "vertex_x",
  yCol = "vertex_y"
)
str(boundariesMEList, 3)
```

5) Create an ME object by using the MoleculeExperiment object constructor.

```{r}
toyME <- MoleculeExperiment(
  molecules = moleculesMEList,
  boundaries = boundariesMEList
)
toyME
```

6) Add boundaries from an external segmentation algorithm.

In this example, we use the extent of the molecules of generated for `toyME` to 
align the boundaries with the molecules. In general, the extent of the 
segmentation is required for this alignment.

```{r}
repoDir <- system.file("extdata", package = "MoleculeExperiment")
segMask <- paste0(repoDir, "/BIDcell_segmask.tif")
boundaries(toyME, "BIDcell_segmentation") <- readSegMask(
  # use the molecule extent to define the boundary extent
  extent(toyME, assayName = "detected"),
  path = segMask, assayName = "BIDcell_segmentation",
  sample_id = "sample1", background_value = 0
)

toyME
```

Displayed below is the BIDcell segmentation image added to `toyME` as another boundaries

```{r warning=FALSE}
BIDcell_segmask_img <- EBImage::readImage(segMask)
EBImage::display(BIDcell_segmask_img, method = "raster")
```

Finally, a digital in-situ, with cell boundaries and BIDcell segmentation boundaries (red polygons in sample1) can be plotted.

```{r}
ggplot_me() + 
  geom_polygon_me(toyME, assayName = "cell", byFill = "segment_id", colour = "black", alpha = 0.3) + 
  geom_polygon_me(toyME, assayName = "BIDcell_segmentation", fill = NA, colour = "red" ) + 
  geom_point_me(toyME, assayName = "detected", byColour = "feature_id", size = 1) +
  theme_classic()
```

### Use case 2: from machine's output directory to ME object

The MoleculeExperiment package also provides functions to directly work with 
the directories containing output files of commonly used technologies. 
This is especially useful to work with data from multiple samples. 
Here we provide convenience functions to read in data from Xenium (10X 
Genomics), CosMx (Nanostring) and Merscope (Vizgen).

```{r}
repoDir <- system.file("extdata", package = "MoleculeExperiment")
repoDir <- paste0(repoDir, "/xenium_V1_FF_Mouse_Brain")

me <- readXenium(repoDir, keepCols = "essential", addBoundaries = "cell")
me
```

readXenium() standardises the transcript and boundary information such that the
column names are consistent across technologies when handling ME objects.

In addition, the `keepCols` argument of `readXenium()` enables the user to 
decide if they want to keep all data that is vendor-specific (e.g., column with 
qv score), some columns of interest, or only the essential columns. By default, 
it is set to `essential`, which refers to feature names, x and y locations in 
the transcripts file, and segment ids, x and y locations for the vertices 
defining the boundaries in the boundaries file.

For CosMx and Merscope data we provide convenience functions that standardise
the raw transcripts data into a MoleculeExperiment object and additionally read 
the boundaries included in the standard data releases.

```{r}
repoDir <- system.file("extdata", package = "MoleculeExperiment")
repoDir <- paste0(repoDir, "/nanostring_Lung9_Rep1")

meCosmx <- readCosmx(repoDir, keepCols = "essential", addBoundaries = "cell")
meCosmx
```

```{r}
repoDir <- system.file("extdata", package = "MoleculeExperiment")
repoDir <- paste0(repoDir, "/vizgen_HumanOvarianCancerPatient2Slice2")
meMerscope <- readMerscope(repoDir,
  keepCols = "essential",
  addBoundaries = "cell"
)
meMerscope
```

## ME object structure
A MoleculeExperiment object contains a @molecules slot and an optional 
@boundaries slot.

Both slots have a hierarchical list structure that consists of a nested list,
ultimately ending in a data.frame/tibble. Traditional rectangular data 
structures, like dataframes, redundantly store gene names and sample IDs for 
the millions of transcripts. In contrast, data in a list enables us to avoid 
this redundancy and work with objects of smaller size.

### molecules slot
The @molecules slot contains molecule-level information. The essential data it 
contains is the feature name (e.g., gene names) and x and y locations of the
detected molecules (e.g., transcripts), in each sample. Nevertheless, the user
can also decide to keep all molecule metadata (e.g., subcellular location:
nucleus/cytoplasm).

The nested list in the molecules slot has the following hierarchical structure:
"assay name" > "sample ID" > "feature name" > dataframe/tibble with X and Y
locations (and other additional columns of interest).

```{r}
showMolecules(me)
```

### boundaries slot
The @boundaries slot contains information from segmentation analyses (e.g.,
cell boundaries, or nucleus boundaries).

The nested list in the boundaries slot has the following hierarchical structure:
"assay name" > "sample ID" > "segment ID" > dataframe/tibble with the vertex
coordinates defining the boundaries for each segment. For example, if the
boundary information is for cells, the assay name can be set to "cell";
or "nucleus" if one is using nucleus boundaries.

```{r}
showBoundaries(me)
```

# Methods
Here we introduce basic methods to access and manipulate data in an ME object,
i.e., getters and setters, respectively. We also describe subsetting samples.

## Getters
The main getters are molecules() and boundaries(). NOTE: the output of these
methods is the ME nested list, which can be very large on screen. Thus, these
getters should be used when wanting to work with the data. To quickly view the
slot contents, use showMolecules() and showBoundaries() instead.

```{r, echo=TRUE, results= 'hide'}
# NOTE: output not shown as it is too large
# access molecules slot (note: assay name needs to be specified)
molecules(me, assayName = "detected")
# access cell boundary information in boundaries slot
boundaries(me, assayName = "cell")
```

For ease of use, these getters have arguments that enable the transformation of
the data from a nested ME list format to a data.frame format.

```{r}
molecules(me, assayName = "detected", flatten = TRUE)
```

```{r}
boundaries(me, assayName = "cell", flatten = TRUE)
```

Other getters include: features() and segmentIDs().

```{r}
# get features in sample 1
features(me, assayName = "detected")[[1]]
```

```{r}
segmentIDs(me, "cell")
```

## Setters
Main setters include `molecules<-` and `boundaries<-`.
For example, with `boundaries<-` one can add new segmentation assay information
to the boundaries slot. Here we demonstrate this with the nucleus boundaries.

```{r}
repoDir <- system.file("extdata", package = "MoleculeExperiment")
repoDir <- paste0(repoDir, "/xenium_V1_FF_Mouse_Brain")
nucleiMEList <- readBoundaries(
  dataDir = repoDir,
  pattern = "nucleus_boundaries.csv",
  segmentIDCol = "cell_id",
  xCol = "vertex_x",
  yCol = "vertex_y",
  keepCols = "essential",
  boundariesAssay = "nucleus",
  scaleFactorVector = 1
)

boundaries(me, "nucleus") <- nucleiMEList
me # note the addition of the nucleus boundaries to the boundaries slot
```

The additional boundaries can be accessed, e.g. for visualisation.

```{r}
ggplot_me() +
  # add cell segments and colour by cell id
  geom_polygon_me(me, byFill = "segment_id", colour = "black", alpha = 0.1) +
  # add molecule points and colour by feature name
  geom_point_me(me, byColour = "feature_id", size = 0.1) +
  # add nuclei segments and colour the border with red
  geom_polygon_me(me, assayName = "nucleus", fill = NA, colour = "red") +
  # zoom in to selected patch area
  coord_cartesian(xlim = c(4900, 4919.98), ylim = c(6400.02, 6420))
```

## Subsetting

We can subset a MoleculeExperiment object via the `subset_by_extent` function.
In this example, we define the new extent which we want to subset, as an area
of 10um from the topleft corner.

```{r}
new_extent = extent(meMerscope, "detected") - c(0, 100, 0, 0)
# new_extent = c(xmin = 924, xmax = 950, ymin = 26290, ymax = 26330)
new_extent

me_sub = subset_by_extent(meMerscope, new_extent)
me_sub

g1 = ggplot_me() +
  geom_polygon_me(meMerscope, byFill = "segment_id", colour = "black", alpha = 0.1) +
  geom_point_me(meMerscope, byColour = "feature_id", size = 0.1) +
  geom_polygon_me(meMerscope, assayName = "cell", fill = NA, colour = "red") + 
  ggtitle("Before subsetting")
g2 = ggplot_me() +
  geom_polygon_me(me_sub, byFill = "segment_id", colour = "black", alpha = 0.1) +
  geom_point_me(me_sub, byColour = "feature_id", size = 0.1) +
  geom_polygon_me(me_sub, assayName = "cell", fill = NA, colour = "red") + 
  ggtitle("After subsetting")

g1
g2
```


# From MoleculeExperiment to SpatialExperiment
If one is interested in continuing downstream analysis at the cell-level,
the MoleculeExperiment package also provides a convenience function,
countMolecules(), that enables the transition from a MoleculeExperiment
object to a SpatialExperiment object. With this functionality, it is possible to
use already existing methods for cell-level data analysis.

```{r}
spe <- countMolecules(me, boundariesAssay = "nucleus")
spe
```

# Case Study: MoleculeExperiment and napari

Load the demonstration data, which includes molecules for 2 genes.

```{r}
data(small_me)
```

Read in virtual dissection CSV file, exported from napari (screenshot), of the morphology image.

![screenshot of napari for virtual dissection](napari_screenshot.png)

```{r}
bds_colours <- setNames(
  c("#F8766D", "#00BFC4"),
  c("Region 1", "Region 2")
)

fts_colours <- setNames(
  c("#FFD700", "#90EE90"),
  c("Pou3f1", "Sema3a")
)

file_path <- system.file("extdata/tiny_brain_shape2.csv", package = "MoleculeExperiment")

bds_shape_raw <- read.csv(file = file_path, header = TRUE)
bds_shape_raw$sample_id <- "xenium_tiny_subset"
bds_shape_raw$regionName <- names(bds_colours)[bds_shape_raw$index + 1]

bds_shape <- dataframeToMEList(bds_shape_raw,
  dfType = "boundaries",
  assayName = "virtualDissection",
  sampleCol = "sample_id",
  factorCol = "regionName",
  xCol = "axis.1",
  yCol = "axis.0",
  scaleFactor = 0.2125
)

boundaries(small_me, "virtualDissection") <- bds_shape
```

We can plot the resulting MoleculeExperiment using the following code.

```{r}
g <- ggplot() +
  geom_point_me(
    small_me,
    assayName = "detected", byColour = "feature_id", size = 0.2
  ) +
  geom_polygon_me(
    small_me,
    assayName = "cell", fill = NA, colour = "grey50", size = 0.1
  ) +
  geom_polygon_me(
    small_me,
    assayName = "nucleus", fill = NA, colour = "black", size = 0.1
  ) +
  geom_polygon_me(
    small_me,
    assayName = "virtualDissection", byFill = "segment_id", alpha = 0.3
  ) +
  scale_y_reverse() +
  theme_classic() +
  theme(axis.text = element_blank()) +
  theme(axis.ticks = element_blank()) +
  coord_fixed() +
  scale_colour_manual(values = fts_colours) +
  scale_fill_manual(values = bds_colours) +
  guides(color = guide_legend(override.aes = list(size = 2)))
g
```

Now that we have added the virtual dissection boundaries, we can use countMolecules to generate psuedobulk expressions over these regions.

```{r, eval=FALSE}
spe <- countMolecules(
  small_me, boundariesAssay = "virtualDissection")
spe
```

# SessionInfo

```{r}
sessionInfo()
```