--- title: "CatsCradle" author: "Anna Laddach and Michael Shapiro" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{CatsCradle} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE, warning = FALSE} knitr::opts_chunk$set( collapse = TRUE, fig.dim=c(6,6), comment = "#>" ) ``` `r BiocStyle::Biocpkg("BiocStyle")` ![](CatsCradleLogo.png){width=2in} ## Introduction Here we describe the functionality in this package concerned with analysing the clustering of genes in single cell RNASeq data. A typical Seurat single cell analysis starts with an expression matrix $M$ where the rows represent genes and the columns represent individual cells. After normalisation one uses dimension reduction - PCA, UMAP, tSNE - to produce lower dimensional representations of the data for the cells and the Louvain algorithm to cluster cells with similar expression patterns. CatsCradle operates based on a simple observation: by transposing the matrix $M$, we can use the same methods to produce lower-dimensional representations of the genes and cluster the genes into groups that show similar patterns of expression. ```{r [CC1], message=FALSE} library(CatsCradle,quietly=TRUE) getExample = make.getExample() exSeuratObj = getExample('exSeuratObj') STranspose = transposeObject(exSeuratObj) ``` This function transposes the expression matrix and carries out the basic functions, FindVariableFeatures(), ScaleData(), RunPCA(), RunUMAP(), FindNeighbors(), and FindClusters(). ## Exploring CatsCradle After transposing the usual Seurat object, the genes are now the columns (samples) and the individual cells are the rows (features). The Louvain clustering of the genes is now encoded in STranspose$seurat_clusters. As with the cells, we can observe these clusters on UMAP, tSNE or PCA. ```{r [CC2], message=FALSE} library(Seurat,quietly=TRUE) library(ggplot2,quietly=TRUE) DimPlot(exSeuratObj,cols='polychrome') + ggtitle('Cell clusters on UMAP') ``` ```{r [CC3]} DimPlot(STranspose,cols='polychrome') + ggtitle('Gene clusters on UMAP') ``` We have never seen a use case in which there was a reason to query the identities of the individual cells in a UMAP plot. However, this changes with a gene UMAP as each gene has a distinct (and interesting) identity. We recommend using plotly to produce a browseable version of the gene UMAP. This allows one to hover over the individual points and discover the genes in each cluster. Typical code might be ```{r [CC4], eval = FALSE} library(plotly,quietly=TRUE) umap = FetchData(STranspose,c('UMAP_1','UMAP_2','seurat_clusters')) umap$gene = colnames(STranspose) plot = ggplot(umap,aes(x=UMAP_1,y=UMAP_2,color=seurat_clusters,label=gene) + geom_point() browseable = ggplotly(plot) print(browseable) htmlwidgets::saveWidget(as_widget(browseable),'genesOnUMAP.html') ``` The question arises as to how to annotate the gene clusters. Assuming you have working annotations for the cell clusters it can be useful to examine which cells each of the gene clusters is expressed in. Here we give a heatmap of average expression of each gene cluster (columns) across each cell cluster (rows). ```{r [CC5], message=FALSE} library(pheatmap,quietly=TRUE) averageExpMatrix = getAverageExpressionMatrix(exSeuratObj,STranspose,layer='data') averageExpMatrix = tagRowAndColNames(averageExpMatrix, ccTag='cellClusters_', gcTag='geneClusters_') pheatmap(averageExpMatrix, treeheight_row=0, treeheight_col=0, fontsize_row=8, fontsize_col=8, cellheight=10, cellwidth=10, main='Cell clusters vs. Gene clusters') ``` Another way of seeing the relationship between cell clusters and gene clusters is in a Sankey graph. This is a bi-partite graph whose vertices are the cell clusters and the gene clusters. The edges of the graph display mean expression as the width of the edge. One can either display all edges with edge weight (width) displaying absolute value and colour distinguishing up- and down-regulation of expression or display separate Sankey graphs for the up- and down-regulated gene sets. It is these bi-partite graphs that contribute the name CatsCradle. Here the up-regulated gene sets are shown in cyan, the down-regulated sets in pink. The image was produced with the following code. ```{r [CC6], eval = FALSE} catsCradle = sankeyFromMatrix(averageExpMatrix, disambiguation=c('cells_','genes_'), plus='cyan',minus='pink', height=800) print(catsCradle) ``` The print command opens this in a browser. This allows one to query the individual vertices and edges. This can be saved with the saveWidget command as above. ## Biologically relevant gene sets on UMAP Biologically relevant gene sets often cluster on CatsCradle gene UMAPs. Here we see a UMAP plot showing the gene clusters (by color), over-printed with the HALLMARK_G2M_CHECKPOINT that appear in STranspose in black. We see that these are strongly associated with gene cluster 8, but also show "satellite clusters" including an association with cluster 4 and with the border between clusters 0 and 3. In our experience, proliferation associated gene sets are among the most strongly clustered. ```{r [CC7]} hallmark = getExample('hallmark') h = 'HALLMARK_G2M_CHECKPOINT' umap = FetchData(STranspose,c('umap_1','umap_2')) idx = colnames(STranspose) %in% hallmark[[h]] g = DimPlot(STranspose,cols='polychrome') + geom_point(data=umap[idx,],aes(x=umap_1,y=umap_2),color='black',size=2.7) + geom_point(data=umap[idx,],aes(x=umap_1,y=umap_2),color='green') + ggtitle(paste(h,'\non gene clusters')) print(g) ``` ## Determining statistical significance of clustering Given a set of points, $S$ and a non-empty proper subset $X \subset S$ we would like to determine the statistical significance of the degree to which $X$ is clustered. To compute this we ask the opposite question: what would we see if $X$ were randomly chosen? In this case we expect to see $X$ broadly spread out across $S$, i.e., most of $S$ should be close to some point of $X$. In particular we expect the median distance from the complement $S \setminus X$ to $X$ to be low. Of course, how low, depends on the size of $X$. Conversely, if the points of $X$ cluster together, we expect much of $S \setminus X$ to be further from $X$, at least compared to other sets of the same size. We use a distance function inspired by Hausdorf distance. Give a set $X$, for each $s_k \in S \setminus X$, we take $d_k$ to be the distance from $s_k$ to the nearest point of $X$. We then take the __median complement distance__ to be the median of the values $d_k$. Comparing this median complement distance for $X \subset S$ with those for randomly chosen sets $X_i \subset S$ allows us to assess the clustering of $X$. (These $X_i$ are chosen to be the same size as $X$.) ```{r [CC8]} g2mGenes = intersect(colnames(STranspose), hallmark[['HALLMARK_G2M_CHECKPOINT']]) stats = getObjectSubsetClusteringStatistics(STranspose, g2mGenes, numTrials=1000) ``` This uses UMAP as the default reduction and returns the median complement distance to the Hallmark G2M genes in STranspose, and the distances for 1000 randomly chosen sets of 56 genes. It computes a p-value based on the rank of the actual distance among the random distances and its z-score. Here we report a p-value of 0.001. However, as can be seen from the figure, the actual p-value is lower by many orders of magnitude. ```{r [CC9]} statsPlot = ggplot(data.frame(medianComplementDistance=stats$randomSubsetDistance), aes(x=medianComplementDistance)) + geom_histogram(bins=50) + geom_vline(xintercept=stats$subsetDistance,color='red') + ggtitle('Hallmark G2M real and random median complement distance') print(statsPlot) ``` Here we have shown the statistics for one of the gene sets that is most tightly clustered on UMAP. However, of the 50 Hallmark gene sets 31 cluster with a p-value better than 0.05. ```{r [CC10]} df = read.table('hallmarkPValues.txt',header=TRUE,sep='\t') g = ggplot(df,aes(x=logPValue)) + geom_histogram() + geom_vline(xintercept=-log10(0.05)) + ggtitle('Hallmark gene set p-values') print(g) ``` ## Gene z-scores on gene UMAP Here we show that gene UMAP can reveal co-location of the genes that are up-regulated in each cell cluster. To do this, we start by finding the z-score for expression of each gene computed across all cells. For each gene in the gene Seurat object, and each cell cluster in the cell Seurat object, we then compute the mean z-score for that gene in the cells of that cluster. Plot this on the gene umap reveals localised patterns of gene expression. Creating a browseable figure allows for easy querying of the up-regulated genes. ```{r [CC11]} meanZDF = meanZPerClusterOnUMAP(exSeuratObj, STranspose, 'shortName') h = ggplot(meanZDF,aes(x=umap_1,y=umap_2,color=EntericGliaCells)) + geom_point() + scale_color_gradient(low='green',high='red') + ggtitle('Mean z-score, Enteric glia on gene UMAP') print(h) ``` P-values for the spatial autocorrelation of these values can be found using runMoransI(). In the case where there are multiple clusters of similar cells, plotting the difference in mean z-score can illuminate the differences in the sub-clusters. Here we plot the difference in z-score between TCells3 and TCells1. ```{r [CC12]} TDiff = meanZDF[,1:3] TDiff$TDiff = meanZDF$TCells3 - meanZDF$TCells1 k = ggplot(TDiff,aes(x=umap_1,y=umap_2,color=TDiff)) + geom_point() + scale_color_gradient(low='green',high='red') + ggtitle('Mean z-score, TCells3 - TCells1') print(k) ``` ## Nearby genes We have seen that genes with similar annotation have a tendency to cluster. This suggests that nearby genes may have similar functions. To this end, we have supplied a function which finds nearby genes. This can be done geometrically using either PCA, UMAP or tSNE as the embedding or combinatorially using the nearest neighbor graph. The function returns a named vector whose values are the distances from the gene set and whose names are the genes. Here we find those genes which are within radius 0.2 in UMAP coordinates of genes in the HALLMARK_INTERFERON_ALPHA_RESPONSE gene set. As you can see, the combinatorial radius of a gene set can grow quite quickly and need not have a close relation to UMAP distance. This function will also return weighted combinatorial distance from a single gene where distance is the reciprocal of edge weight. ```{r [CC13]} geneSet = intersect(colnames(STranspose), hallmark[['HALLMARK_INTERFERON_ALPHA_RESPONSE']]) geometricallyNearbyGenes = getNearbyGenes(STranspose,geneSet,radius=0.2,metric='umap') theGeometricGenesThemselves = names(geometricallyNearbyGenes) combinatoriallyNearbyGenes = getNearbyGenes(STranspose,geneSet,radius=1,metric='NN') theCombinatoricGenesThemselves = names(combinatoriallyNearbyGenes) df = FetchData(STranspose,c('umap_1','umap_2')) df$gene = colnames(STranspose) geneSetIdx = df$gene %in% geneSet nearbyIdx = df$gene %in% theGeometricGenesThemselves g = ggplot() + geom_point(data=df,aes(x=umap_1,y=umap_2),color='gray') + geom_point(data=df[geneSetIdx,],aes(x=umap_1,y=umap_2),color='blue') + geom_point(data=df[nearbyIdx,],aes(x=umap_1,y=umap_2),color='red') + ggtitle(paste0('Genes within geometric radius 0.2 (red) of \n', 'HALLMARK_INTERFERON_ALPHA_RESPONSE (blue)')) print(g) ``` ## Predicting gene function Given a particular gene, it is interesting to look at the annotations of nearby genes in the gene Seurat object. In this context, "annotations" might mean GO or Hallmark, "nearby" might mean in terms of UMAP or PCA coordinates or in the nearest neighbour graph in the gene Seurat object. Here we will look at Hallmark annotations and UMAP coordinates. Gene annotation lists give lists of genes. For each gene, we can collect the annotations is belongs to. The function annotateGenesByGeneSet() inverts the gene sets to give a list of the sets each gene belongs to. ```{r [CC14]} annotatedGenes = annotateGenesByGeneSet(hallmark) names(annotatedGenes[['Myc']]) ``` We see that Myc belongs to ten hallmark sets. We can also give the annotations of a gene as a vector. ```{r [CC15]} Myc = annotateGeneAsVector('Myc',hallmark) MycNormalised = annotateGeneAsVector('Myc',hallmark,TRUE) ``` These are named vectors whose names are the Hallmark sets. Myc is a 0-1 vector indicating membership. MycNormalised gives these values normalised by the size of the sets in question. This is appropriate when we wish to weight the contributions of nearby genes as they are more likely to belong to larger gene sets. ```{r [CC16]} predicted = predictAnnotation('Myc',hallmark,STranspose,radius=.5) predicted$Myc[1:10] ``` predictAnnotation() accepts a vector of genes and returns a list of prediction vectors. Here we have given it a single gene, Myc, and we see that it has predicted one of the gene sets listed for Myc. Predictions made in this manner perform well above chance. Of the 2000 genes in STranspose, 922 have the property that they appear in at least one of the Hallmark gene sets and at least one of their nearby genes appears in a Hallmark gene set. This means that we are able to compare their actual annotation vectors with their predicted annotation vectors. After normalising both to unit vectors, we can take their dot prodcts as a measure of their closeness. Comparing the actual dot products for these 922 genes to those produced by 1000 randomised predictions produces the following comparison. ![comparing real and randomised predictions](actualVsRandomisedPredictions.jpg){width=6in} ```{r [CC17]} sessionInfo() ```