---
title: "CatsCradle Quick Start"
author: "Anna Laddach and Michael Shapiro"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
    %\VignetteIndexEntry{CatsCradle Quick Start}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE, warning = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    fig.dim = c(6,6),
    comment = "#>"
)
```
`r BiocStyle::Biocpkg("BiocStyle")`

![](CatsCradleLogo.png){width=2in}


## Introduction

CatsCradle provides two types of functionality for analysing single
cell data.  It provides tools for the clustering and annotation of
genes and it provides extensive tools for analysing spatial
transcriptomics data.

## Clustering and annotation of genes

A typical analysis of scRNA-seq data using Seurat involves dimensionality reduction 
(PCA, UMAP, tSNE) and the clustering of cells using the Louvain algorithm. All of
this is done based on the similarities and differences in the genes
cells express.  Here we see a UMAP where the points are cells, coloured by
their seurat_clusters.  

```{r [QS1], message=FALSE}
library(Seurat,quietly=TRUE)
library(CatsCradle,quietly=TRUE)
data(exSeuratObj)
DimPlot(exSeuratObj,cols='polychrome')
```

By transposing the expression matrix, we can use the same technology
to produce a Seurat object where the samples are the genes and the features are 
cells.  Genes now have their own UMAP and their own clusters.  This is all done 
on the basis of the similarities and differences in the cells they are
expressed in.

```{r [QS2]}
getExample = make.getExample()
STranspose = getExample('STranspose')
print(STranspose)
DimPlot(STranspose,cols='polychrome')
```

### Gene clusters vs. cell clusters

There are a number of ways of investigating the relationship between
gene clusters and cell clusters.  One is by computing the average
expression of each gene cluster in each cell cluster; this can be
displayed as a heat map or a Sankey graph - the cat's cradle of our
CatsCradle. (See getAverageExpressionMatrix() and sankeyFromMatrix()
in the CatsCradle vignette CatsCradle.

### Spatial co-location of genes on gene UMAP

Gene sets such as Hallmark tend to localise on gene UMAP, though they
are not necessarily confined to specific gene clusters.  Here we show
HALLMARK_OXIDATIVE_PHOSPHORYLATION (subset to the genes in our Seurat
object) superimposed in green and black on the gene cluster UMAP.


```{r [QS3], message=FALSE}
library(ggplot2,quietly=TRUE)
hallmark = getExample('hallmark')
h = 'HALLMARK_OXIDATIVE_PHOSPHORYLATION'
umap = FetchData(STranspose,c('umap_1','umap_2'))
idx = colnames(STranspose) %in% hallmark[[h]]
g = DimPlot(STranspose,cols='polychrome') +
    geom_point(data=umap[idx,],aes(x=umap_1,y=umap_2),color='black',size=2.7) +
    geom_point(data=umap[idx,],aes(x=umap_1,y=umap_2),color='green') +
    ggtitle(paste(h,'\non gene clusters'))
print(g)
pValue = getObjectSubsetClusteringPValue(STranspose,idx)
pValue
```

The p-value here is limited by the default number of randomisation
trials in getSubsetClusteringPValue().  Of the 50 Hallmark gene sets,
31 have clustering p-values less than 0.05.

The computation of these p-values is ultimately carried out by
medianComplementPValue().  We wish to determine whether a subset _X_
of a set of points _S_ is randomly scattered across _S_ or is somehow
localised.  To do this we consider the complement _C_ of _X_ in _S_.
If _X_ is "spread out" randomly across _S_, every point of _C_ will be
close to some point in _X_.  Accordingly, the median distance from
points in _C_ to their nearest point in _X_ will be small.  If on the
contrary, _X_ is localised, many points in _C_ will be distant from
_X_, thereby producing a larger median complement distance.  We
produce a p-value by comparing the median complement distance for _X_
to those for randomised sets of the same size as _X_.

### Gene annotation

This allows one to predict functions of a given gene by looking at
annotations of its neighbouring genes.  If a gene has its own
annotation and also has neighbours which have annotations, we can
compare the actual annotation to the predicted annotation. These predictions 
perform well above chance as described in the section Predicting gene function 
of the CatsCradle vignette CatsCradle.

## Analysis of spatial transcriptomic data

CatsCradle provides extensive tools for the analysis of spatial
transcriptomics data.  Here we see cell type plotted on the cell
centroids of our example data.

```{r [QS4]}
smallXenium = getExample('smallXenium')
ImageDimPlot(smallXenium,cols='polychrome')
```

With Moran's I, we can see spatial autocorrelation of gene expression
and compute p-values for this.  See CatsCradle spatial vignette
CatsCradleSpatial.


### Neighbourhoods

The key concept in our analysis of this kind of data is the
_neighbourhood_. In general, a neighbourhood is a contiguous set of
cells.  In all of our use cases, each neighbourhood is set of cells
_around a given cell_.  Accordingly, a neighbourhood can be referred
to by the name of the cell at its centre.

The simplest type of neighbourhood arises from Delaunay triangulation
where each neighbourhood consists of a cell and its immediate
neighbours.  This is an undirected graph in which each cell has an
edge connecting it to each of these neighbours.

```{r [QS5]}
delaunayNeighbours = getExample('delaunayNeighbours')
head(delaunayNeighbours,10)
```

We can also make extended neighbourhoods, e.g., by finding the
neighbours of each cell, their neighbours, and their neighbours.
getExtendedNBHDs() produces a list characterising these neighbours by
their combinatorial radius and collapseExtendedNBHDs() collapses
these so that all the extended neighbours of each cell are now treated
as neighbours.

In addition, neighbourhoods can be found on the basis of Euclidean
distance.

For each cell type A, we can ask "What types of cells are found around
cells of type A?"  We can display the answer in a directed graph. Here
we are using an extended neighbourhood.  A fat arrow from type A to
type B indicates that neighbourhoods around cells of type A have lots
of cells of type B. Here we only display an arrow from A to B when
cells of type B compose at least 5 percent of the cells in
neighbourhoods around type A.

```{r [QS6]}
NBHDByCTMatrixExtended = getExample('NBHDByCTMatrixExtended')
clusters = getExample('clusters')
colours = getExample('colours')
cellTypesPerCellTypeMatrixExtended = computeCellTypesPerCellTypeMatrix(NBHDByCTMatrixExtended,clusters)

cellTypesPerCellTypeGraphFromCellMatrix(cellTypesPerCellTypeMatrixExtended, minWeight = 0.05, colours = colours)
```

We can also test for statistical significance for the enrichment of a
given cell type in the neighbourhoods around another cell type 
(see CatsCradleSpatial).

### Neighourhood Seurat objects.

Neighbourhoods naturally want to be encoded into Seurat objects.
There are two ways to do this.

#### Neighbourhoods and cell types

Just as cells express genes, with poetic license we can say that
neighbourhoods "express" cell types.  A cell types by neighbourhoods
matrix gives the counts of cells of each type in each neighbourhood.
This can be taken as the counts matrix for a Seurat object.  We can
then use the Louvain algorithm to cluster neighbourhoods into
neighbourhood types.  If we like, we can display these on a
neighbourhood UMAP.

Here we attach these neighbourhood_clusters from a neighbourhood
Seurat object (created by computeNBHDVsCTObject()) to our original
spatial object and visualize the neighbourhood clusters on the tissue
coordinates. (Again, we are using extended neighbourhoods.)

```{r [QS7]}
NBHDByCTSeuratExtended = getExample('NBHDByCTSeuratExtended')
smallXenium$NBHDClusterExtended= 
  NBHDByCTSeuratExtended@active.ident
ImageDimPlot(smallXenium, group.by = c("NBHDClusterExtended"), 
             size = 1, cols = "polychrome")
```

### Aggregate gene expression

As well as expressing cell types, neighbourhoods also express genes.
We can take the gene expression profile of a neighbourhood to be the
total aggregate expression of each gene across all cells in the
neighbourhood.  This produces a genes by neighbourhoods count
matrix. The function aggregateGeneExpression() takes as
arguments an underlying Seurat object and a set of neighbourhoods and
produces a Seurat object where the rows are the genes of the
underlying Seurat object and the columns are the neighbourhoods.
Since each neighbourhood corresponds to a cell in our usage, this is
formally indistinguishable from a normal Seurat object.  However, to
avoid confusion, we have labelled the clusters as aggregation_clusters
rather than the standard seurat_clusters.

This enables all the usual analyses: dimension reduction and Louvain
clustering (here giving another clustering of the neighbourhoods). In
addition, we can find the marker genes which characterise the transcriptomic
differences between the neighbourhood types.

```{r [QS8]}
extendedNeighbours = getExample('extendedNeighbours')
agg = aggregateGeneExpression(smallXenium,extendedNeighbours,
                                    verbose=FALSE)
smallXenium$aggregateNBHDClusters = agg@active.ident
ImageDimPlot(smallXenium,group.by='aggregateNBHDClusters',cols='polychrome',
             size=1)
```

Notice that since our neighbourhoods are all indexed by cells, we can
compare the different methods of clustering neighbourhoods.  Here we
give a heatmap comparing the clustering based on cell types in our
extended neighbourhoods to that based on aggregate gene expression.
We see, for example, that aggregation clusters 7, 13, 14, 16 and 17
are subsumed into cell type neighbourhood cluster 0.  More generally,
we see that clustering based on aggregate gene expression tends to
subdivide that based on the cell types in each neighbourhood.

```{r [QS9], message=FALSE}
library(pheatmap)
tab = table(smallXenium@meta.data[,c("aggregateNBHDClusters",
                                     "NBHDClusterExtended" )])
M = matrix(as.numeric(tab),nrow=nrow(tab))
rownames(M) = paste('aggregation_cluster',rownames(tab))
colnames(M) = paste('nbhd_cluster',colnames(tab))
for(i in 1:nrow(M)) M[i,] = M[i,] / sum(M[i,])
pheatmap(M)
```

Another way of comparing clusterings is via the adjusted Rand
index. This compares two classifications of the same set of objects.
It can vary between -1 and 1.  It takes the value 1 when these
classifications denote the same subsets and 0 when they agree at a
level no better than chance.

```{r [QS10], message=FALSE}
library(fossil,quietly=TRUE)
adjustedRandIndex = adj.rand.index(agg@active.ident,
                                   NBHDByCTSeuratExtended@active.ident)
adjustedRandIndex
```

## Ligand-receptor analysis

The Delaunay triangulation gives us an opportunity to look at the
ligand-receptor interactions between neighbouring cells.  We use the
Nichenetr ligand-receptor pairs to discover the pairs in our gene
panel.  Nichnetr lists over eleven thousand ligand-receptor pairs.  Of
these 28 appear in our example panel.

The main function for carrying out ligand-receptor analysis is
performLigandReceptorAnalysis().  This produces a list with
multiple entries.  These include a data frame interactionsOnEdges.
Here the first two columns are nodeA and nodeB which give the edge in
question and next two columns give their corresponding clusters. The remaining 
columns are logicals, each corresponding to a
particular ligand-receptor pair and indicating whether that pair is
present on that edge.  Notice that while the neighbour-neighbour
relationship in the Delaunay triangulation is undirected, the
ligand-receptor relationship is directed.  Accordingly, in
interactionsOnEdges, a given edge appears as both the pair cell A -
cell B and as the pair cell B - cell A.  

```{r [QS11]}
ligandReceptorResults = getExample('ligandReceptorResults')
ligandReceptorResults$interactionsOnEdges[1:10,1:10]
```


Other fields are downstream of this data frame.  For example, pValue
tells whether a given ligand-receptor pair is statistically significantly
enriched in the interactions between two specific cell types.  For
further details see the section Ligand receptor analysis in the Cats
Cradle Spatial vignette CatsCradleSpatial.

## CatsCradle, Seurat objects, SingleCellExperiment objects and SpatialExperiment objects 

CatsCradle provides rudimentary support for the Bioconductor classes
SingleCellExperiment and its derived class SpatialExperiment.

Every function which accepts a Seurat object as input can also accept
a SingleCellExperiment in its place.  Any function which returns a
Seurat object can also return a SingleCellExperiment, in one case, a
SpatialExperiment, through the use of the parameter returnType.  In
each case, the internals of the functions are manipulating Seurat
objects.  Conversion to and from these is performed upon entry to and
return from the function and these coversions are rather lightweight
in nature.  They have slightly more functionality than as.Seurat and
as.SingleCellExperiment.  In addition to invoking these functions,
they carry over the nearest neighbour graphs, and in one case carry
over the tissue coordinates from a spatial Seurat object to a
SpatialExperiment object.  Because of the shallow nature of these
conversions, other information may be lost.


```{r [QS12]}
sessionInfo()
```