--- title: "CatsCradle SingleCellExperiment Quick Start" author: "Anna Laddach and Michael Shapiro" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{CatsCradle SingleCellExperiment Quick Start} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r [SCE1] setup, include = FALSE, warning = FALSE} knitr::opts_chunk$set( collapse = TRUE, fig.dim = c(6,6), comment = "#>" ) ``` `r BiocStyle::Biocpkg("BiocStyle")` ![](CatsCradleLogo.png){width=2in} ## Introduction CatsCradle provides two types of functionality for analysing single cell data. It provides tools for the clustering and annotation of genes and it provides extensive tools for analysing spatial transcriptomics data. Here we exposit these capabilities using the class SingleCellExperiment. ## Clustering and annotation of genes A typical analysis of scRNA-seq data involves dimensionality reduction (PCA, UMAP, tSNE) and the clustering of cells using the Louvain algorithm. All of this is done based on the similarities and differences in the genes cells express. Here we see a UMAP where the points are cells, coloured by their clusters. ```{r [SCE2], message=FALSE} library(SingleCellExperiment,quietly=TRUE) library(CatsCradle,quietly=TRUE) library(ggplot2,quietly=TRUE) getExample = make.getExample() S_sce = getExample('S_sce') df = as.data.frame(reducedDim(S_sce,'UMAP')) df$seurat_clusters = S_sce$seurat_clusters g = ggplot(df,aes(x=UMAP_1,y=UMAP_2,color=seurat_clusters)) + geom_point() + ggtitle('SingleCellExperiment of cells colored by seurat_cluster') print(g) ``` By transposing the expression matrix, we can use the same technology to produce an object where the samples are the genes and the features are cells. Genes now have their own UMAP and their own clusters. This is all done on the basis of the similarities and differences in the cells they are expressed in. ```{r [SCE3]} STranspose_sce = getExample('STranspose_sce') print(STranspose_sce) df = as.data.frame(reducedDim(STranspose_sce,'UMAP')) df$seurat_clusters = STranspose_sce$seurat_clusters h = ggplot(df,aes(x=umap_1,y=umap_2,color=seurat_clusters)) + geom_point() + ggtitle('SingleCellExperiment of genes colored by cluster') print(h) ``` ### Gene clusters vs. cell clusters There are a number of ways of investigating the relationship between gene clusters and cell clusters. One is by computing the average expression of each gene cluster in each cell cluster; this can be displayed as a heat map or a Sankey graph - the cat's cradle of our CatsCradle. (See getAverageExpressionMatrix() and sankeyFromMatrix() in the CatsCradle vignette [CatsCradle](doc/CatsCradle.html).) ### Spatial co-location of genes on gene UMAP Gene sets such as Hallmark tend to localise on gene UMAP, though they are not necessarily confined to specific gene clusters. Here we show HALLMARK_OXIDATIVE_PHOSPHORYLATION (subset to the genes in our SingleCellExperiment object) superimposed in green and black on the gene cluster UMAP. ```{r [SCE4]} hallmark = getExample('hallmark') h = 'HALLMARK_OXIDATIVE_PHOSPHORYLATION' idx = colnames(STranspose_sce) %in% hallmark[[h]] k = ggplot(data=df,aes(x=umap_1,y=umap_2,color=seurat_clusters)) + geom_point() + geom_point(data=df[idx,],aes(x=umap_1,y=umap_2),color='black',size=2.7) + geom_point(data=df[idx,],aes(x=umap_1,y=umap_2),color='green') + ggtitle(paste(h,'\non gene clusters')) print(k) pValue = getObjectSubsetClusteringPValue(STranspose_sce,idx) pValue ``` The p-value here is limited by the default number of randomisation trials in getSubsetClusteringPValue(). Of the 50 Hallmark gene sets, 31 have clustering p-values less than 0.05. The computation of these p-values is ultimately carried out by medianComplementPValue(). We wish to determine whether a subset _X_ of a set of points _S_ is randomly scattered across _S_ or is somehow localised. To do this we consider the complement _C_ of _X_ in _S_. If _X_ is "spread out" randomly across _S_, every point of _C_ will be close to some point in _X_. Accordingly, the median distance from points in _C_ to their nearest point in _X_ will be small. If on the contrary, _X_ is localised, many points in _C_ will be distant from _X_, thereby producing a larger median complement distance. We produce a p-value by comparing the median complement distance for _X_ to those for randomised sets of the same size as _X_. ### Gene annotation This allows one to predict functions of a given gene by looking at annotations of its neighbouring genes. If a gene has its own annotation and also has neighbours which have annotations, we can compare the actual annotation to the predicted annotation. These predictions perform well above chance as described in the section Predicting gene function of the CatsCradle vignette[CatsCradle](doc/CatsCradle.html) . ## Analysis of spatial transcriptomic data CatsCradle provides extensive tools for the analysis of spatial transcriptomics data. Here we see cell type plotted on the cell centroids of our example data. ```{r [SCE5], message=FALSE} library(SpatialExperiment) X_sce = getExample('X_sce') cellCoords = as.data.frame(spatialCoords(X_sce)) cellCoords$cluster = X_sce$cluster ell = ggplot(cellCoords,aes(x=x,y=y,color=cluster)) + geom_point() + ggtitle('Spatial coordinates colored by cell type') print(ell) ``` With Moran's I, we can see spatial autocorrelation of gene expression and compute p-values for this. See CatsCradle spatial vignette [CatsCradleSpatial](doc/CatsCradleSpatial.html). ### Neighbourhoods The key concept in our analysis of this kind of data is the _neighbourhood_. In general, a neighbourhood is a contiguous set of cells. In all of our use cases, each neighbourhood is set of cells _around a given cell_. Accordingly, a neighbourhood can be referred to by the name of the cell at its centre. The simplest type of neighbourhood arises from Delaunay triangulation where each neighbourhood consists of a cell and its immediate neighbours. This is an undirected graph in which each cell has an edge connecting it to each of these neighbours. ```{r [SCE6]} delaunayNeighbours = getExample('delaunayNeighbours') head(delaunayNeighbours,10) ``` We can also make extended neighbourhoods, e.g., by finding the neighbours of each cell, their neighbours, and their neighbours. getExtendedNBHDs() produces a list characterising these neighbours by their combinatorial radius and collapseExtendedNBHDs() collapses these so that all the extended neighbours of each cell are now treated as neighbours. In addition, neighbourhoods can be found on the basis of Euclidean distance. For each cell type A, we can ask "What types of cells are found around cells of type A?" We can display the answer in a directed graph. Here we are using an extended neighbourhood. A fat arrow from type A to type B indicates that neighbourhoods around cells of type A have a high proportion of cells of type B. Here we only display an arrow from A to B when cells of type B compose at least 5 percent of the cells in neighbourhoods around type A. ```{r [SCE7]} NBHDByCTMatrixExtended = getExample('NBHDByCTMatrixExtended') clusters = getExample('clusters') colours = getExample('colours') cellTypesPerCellTypeMatrixExtended = computeCellTypesPerCellTypeMatrix(NBHDByCTMatrixExtended,clusters) cellTypesPerCellTypeGraphFromCellMatrix(cellTypesPerCellTypeMatrixExtended, minWeight = 0.05, colours = colours) ``` We can also test for statistical significance for the enrichment of a given cell type in the neighbourhoods around another cell type (see [CatsCradleSpatial](doc/CatsCradleSpatial.html)). ### Neighourhood objects. Each neighbourhood is the neighbourhood _of a cell_. Accordingly, neighbourhoods naturally want to be encoded into single cell objects. There are two ways to do this. #### Neighbourhoods and cell types Just as cells express genes, with poetic license we can say that neighbourhoods "express" cell types. A cell types by neighbourhoods matrix gives the counts of cells of each type in each neighbourhood. This can be taken as the counts matrix for a SingleCellExperiment object. We can then use the Louvain algorithm to cluster neighbourhoods into neighbourhood types. If we like, we can display these on a neighbourhood UMAP. Here we attach these neighbourhood_clusters from a neighbourhood object (created by computeNBHDVsCTObject()) to our original spatial object and visualize the neighbourhood clusters on the tissue coordinates. (Again, we are using extended neighbourhoods.) ```{r [SCE8]} NBHDByCTSingleCellExtended_sce = getExample('NBHDByCTSingleCellExtended_sce') cellCoords$neighbourhood_clusters = NBHDByCTSingleCellExtended_sce$neighbourhood_clusters nbhdPlot = ggplot(cellCoords,aes(x=x,y=y,color=neighbourhood_clusters)) + geom_point() + ggtitle('Cell coordinates colored by neighbourhood_clusters') print(nbhdPlot) ``` ### Aggregate gene expression As well as expressing cell types, neighbourhoods also express genes. We can take the gene expression profile of a neighbourhood to be the total aggregate expression of each gene across all cells in the neighbourhood. This produces a genes by neighbourhoods count matrix. The function aggregateGeneExpression() takes as arguments an underlying Seurat object and a set of neighbourhoods and produces a Seurat object where the rows are the genes of the underlying Seurat object and the columns are the neighbourhoods. Since each neighbourhood corresponds to a cell in our usage, this is formally indistinguishable from a normal Seurat object. However, to avoid confusion, we have labelled the clusters as aggregation_clusters rather than the standard seurat_clusters. This enables all the usual analyses: dimension reduction and Louvain clustering (here giving another clustering of the neighbourhoods). In addition, we can find the marker genes which characterise the transcriptomic differences between the neighbourhood types. ```{r [SCE9]} extendedNeighbours = getExample('extendedNeighbours') smallXenium = getExample('smallXenium') agg_sce = aggregateGeneExpression(smallXenium,extendedNeighbours, verbose=FALSE, returnType='sce') cellCoords$aggregation_clusters = agg_sce$aggregation_clusters aggPlot = ggplot(cellCoords,aes(x=x,y=y,color=aggregation_clusters)) + geom_point() + ggtitle('Cell coordinates colored by aggregation_clusters') print(aggPlot) ``` Notice that since our neighbourhoods are all indexed by cells, we can compare the different methods of clustering neighbourhoods. Here we give a heatmap comparing the clustering based on cell types in our extended neighbourhoods to that based on aggregate gene expression. We see, for example, that aggregation clusters 7, 13, 14, 16 and 17 are subsumed into cell type neighbourhood cluster 0. More generally, we see that clustering based on aggregate gene expression tends to subdivide that based on the cell types in each neighbourhood. ```{r [SCE10], message=FALSE} library(pheatmap) tab = table(cellCoords[,c('aggregation_clusters','neighbourhood_clusters')]) M = matrix(as.numeric(tab),nrow=nrow(tab)) rownames(M) = paste('aggregation_cluster',rownames(tab)) colnames(M) = paste('nbhd_cluster',colnames(tab)) for(i in 1:nrow(M)) M[i,] = M[i,] / sum(M[i,]) pheatmap(M) ``` Another way of comparing clusterings is via the adjusted Rand index. This compares two classifications of the same set of objects. It can vary between -1 and 1. It takes the value 1 when these classifications denote the same subsets and 0 when they agree at a level no better than chance. ```{r [SCE11], message=FALSE} library(fossil,quietly=TRUE) adjustedRandIndex = adj.rand.index(agg_sce$aggregation_clusters, NBHDByCTSingleCellExtended_sce$neighbourhood_clusters) adjustedRandIndex ``` ## Ligand-receptor analysis The Delaunay triangulation gives us an opportunity to look at the ligand-receptor interactions between neighbouring cells. We use the Nichenetr ligand-receptor pairs to discover the pairs in our gene panel. Nichnetr lists over eleven thousand ligand-receptor pairs. Of these 28 appear in our example panel. The main function for carrying out ligand-receptor analysis is performLigandReceptorAnalysis(). This produces a list with multiple entries. These include a data frame interactionsOnEdges. Here the first two columns are nodeA and nodeB which give the edge in question and next two columns give their corresponding clusters. The remaining columns are logicals, each corresponding to a particular ligand-receptor pair and indicating whether that pair is present on that edge. Notice that while the neighbour-neighbour relationship in the Delaunay triangulation is undirected, the ligand-receptor relationship is directed. Accordingly, in interactionsOnEdges, a given edge appears as both the pair cell A - cell B and as the pair cell B - cell A. ```{r [SCE12]} ligandReceptorResults = getExample('ligandReceptorResults') ligandReceptorResults$interactionsOnEdges[1:10,1:10] ``` Other fields are downstream of this data frame. For example, pValue tells whether a given ligand-receptor pair is statistically significantly enriched in the interactions between two specific cell types. For further details see the section Ligand receptor analysis in the Cats Cradle Spatial vignette [CatsCradleSpatial](doc/CatsCradleSpatial.html). ## CatsCradle, Seurat objects, SingleCellExperiment objects and SpatialExperiment objects CatsCradle provides rudimentary support for the Bioconductor classes SingleCellExperiment and its derived class SpatialExperiment. Every function which accepts a Seurat object as input can also accept a SingleCellExperiment in its place. Any function which returns a Seurat object can also return a SingleCellExperiment, in one case, a SpatialExperiment, through the use of the parameter returnType. In each case, the internals of the functions are manipulating Seurat objects. Conversion to and from these is performed upon entry to and return from the function and these coversions are rather lightweight in nature. They have slightly more functionality than as.Seurat and as.SingleCellExperiment. In addition to invoking these functions, they carry over the nearest neighbour graphs, and in one case carry over the tissue coordinates from a spatial Seurat object to a SpatialExperiment object. Because of the shallow nature of these conversions, other information may be lost. ```{r [SCE13]} sessionInfo() ```