Annotation for unannotated single-cell RNA-Seq data by scTGIF

Introduction

About scTGIF

Here, we explain the concept of scTGIF. The analysis of single-cell RNA-Seq (scRNA-Seq) data has a fundamental difficulty: which cells correspond to which cell type is not known a priori.

Therefore, at the starting point of an scRNA-Seq data analysis, each cell is “not colored” (unannotated) (Figure 1). There are several approaches that help users infer cell types, such as (1) known marker gene expression, (2) BLAST-like gene expression comparison with a reference DB, and (3) differentially expressed genes (DEGs) followed by over-representation analysis (ORA) (scRNA-tools).

The first approach might be the most popular, but it relies on expert knowledge about the cell types and is not always general-purpose. The second approach is easy and scalable, but it is limited when the cell type is unknown or has not yet been measured by other research organizations. The third approach can perhaps be used in any situation, but it is ambiguous and time-consuming: it depends on the cluster labels, while the true cluster structure is not known, and a DEG method has to be run for each cluster even though recent scRNA-Seq datasets contain tens to hundreds of cell types. Besides, an scRNA-Seq dataset can contain low-quality cells and artifacts (e.g. doublets), which are hard to distinguish from real cells. Therefore, in practice, a laborious trial-and-error cycle that repeats whenever the cell labels change is unavoidable (Figure 1).

scTGIF was developed to reduce this trial-and-error cycle. The tool directly connects unannotated cells with related gene functions. Since it does not use a reference DB, a marker gene list, or cluster labels, it can be used in any situation without expert knowledge and is not affected by changes in cell labels.

Figure 1: Concept of scTGIF

scTGIF requires three inputs: the gene expression matrix, the 2D coordinates of the cells (e.g. t-SNE, UMAP), and gene sets from MSigDB. First, the 2D coordinate space is segmented into a 50-by-50 grid, and gene expression is summarized at the grid level (X1). Next, the correspondence between genes and their related gene functions is summarized as a gene-by-function matrix (X2). Only the genes common to X1 and X2 are used. The joint non-negative matrix factorization (jNMF) algorithm, implemented in nnTensor, is then applied to estimate the latent variables (W) shared by the two matrices.
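The grid-level summarization that produces X1 can be illustrated with a small base R sketch. Everything here is an assumption made for illustration: the toy data, the helper bin_axis, and the choice to sum expression within each grid are hypothetical, and the aggregation scTGIF performs internally may differ.

```r
# Toy inputs (hypothetical): 6 genes x 200 cells and 2D coordinates per cell
set.seed(1)
n_genes <- 6
n_cells <- 200
expr <- matrix(rpois(n_genes * n_cells, lambda = 5), nrow = n_genes,
    dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
coords <- cbind(rnorm(n_cells), rnorm(n_cells))

# Map one coordinate axis onto n equal-width bins numbered 1..n
bin_axis <- function(v, n) {
    scaled <- (v - min(v)) / (max(v) - min(v))   # rescale to [0, 1]
    pmin(floor(scaled * n) + 1, n)               # the maximum goes to bin n
}

n_grid <- 50
gx <- bin_axis(coords[, 1], n_grid)
gy <- bin_axis(coords[, 2], n_grid)
grid_id <- (gy - 1) * n_grid + gx                # value in 1..n_grid^2

# Sum expression over the cells in each occupied grid: genes x grids (X1)
occupied <- sort(unique(grid_id))
X1 <- vapply(occupied, function(g) {
    rowSums(expr[, grid_id == g, drop = FALSE])
}, numeric(n_genes))
colnames(X1) <- paste0("grid", occupied)
dim(X1)
```

Only occupied grids are kept as columns in this sketch, so X1 has at most 2500 columns for a 50-by-50 grid.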

Figure 2: Joint NMF

This algorithm pairs a set of grids with the corresponding gene functions. The lower-dimension (D)-by-grid matrix H1 works as a set of attention maps that show users which grids to pay attention to, and the D-by-function matrix H2 shows the gene functions enriched in those grids.
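To make the roles of W, H1, and H2 concrete, the following is a minimal hand-written joint NMF using multiplicative updates under a Frobenius loss. It is a sketch under stated assumptions, not the implementation in nnTensor: the function joint_nmf, its arguments, and the toy matrices are hypothetical, and nnTensor offers other loss functions and options that are not reproduced here.

```r
# Minimal joint NMF: X1 ~ W %*% H1 and X2 ~ W %*% H2 with a shared W.
# X1: genes x grids, X2: genes x functions (rows must be the same genes).
joint_nmf <- function(X1, X2, D = 3, n_iter = 200, eps = 1e-10) {
    stopifnot(nrow(X1) == nrow(X2), all(X1 >= 0), all(X2 >= 0))
    W  <- matrix(runif(nrow(X1) * D), nrow = nrow(X1))
    H1 <- matrix(runif(D * ncol(X1)), nrow = D)
    H2 <- matrix(runif(D * ncol(X2)), nrow = D)
    for (i in seq_len(n_iter)) {
        # W is updated from both matrices, which is what makes it shared
        W  <- W * (X1 %*% t(H1) + X2 %*% t(H2)) /
            (W %*% (H1 %*% t(H1) + H2 %*% t(H2)) + eps)
        H1 <- H1 * (t(W) %*% X1) / (t(W) %*% W %*% H1 + eps)
        H2 <- H2 * (t(W) %*% X2) / (t(W) %*% W %*% H2 + eps)
    }
    list(W = W, H1 = H1, H2 = H2)
}

set.seed(1)
X1 <- matrix(runif(20 * 30), nrow = 20)           # toy genes x grids
X2 <- matrix(rbinom(20 * 5, 1, 0.3), nrow = 20)   # toy genes x functions (0/1)
fit <- joint_nmf(X1, X2, D = 3)

# For each of the D patterns, the grid and the function with the largest weight
top_grid     <- apply(fit$H1, 1, which.max)
top_function <- apply(fit$H2, 1, which.max)
```

Each row of fit$H1 can be read as one attention map over the grids, and the same row of fit$H2 gives the function weights paired with it.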

Figure 3: H1 and H2 matrices

scTGIF also supports some QC metrics to distinguish low-quality cells and artifacts from real cellular data.

Usage

Test data

To demonstrate the usage of scTGIF, we prepared a test dataset of distal lung epithelium.

library("scTGIF")
library("SingleCellExperiment")
library("GSEABase")
library("msigdbr")

data("DistalLungEpithelium")
data("pca.DistalLungEpithelium")
data("label.DistalLungEpithelium")

Although this dataset is already annotated and the cell type labels are provided, scTGIF does not rely on this information.

par(ask=FALSE)
plot(pca.DistalLungEpithelium, col=label.DistalLungEpithelium, pch=16,
    main="Distal lung epithelium dataset", xlab="PCA1", ylab="PCA2", bty="n")
text(0.1, 0.05, "AT1", col="#FF7F00", cex=2)
text(0.07, -0.15, "AT2", col="#E41A1C", cex=2)
text(0.13, -0.04, "BP", col="#A65628", cex=2)
text(0.125, -0.15, "Clara", col="#377EB8", cex=2)
text(0.09, -0.2, "Ciliated", col="#4DAF4A", cex=2)

To combine the gene expression with the related gene functions, we assume that the gene function data is provided as a GeneSetCollection object of GSEABase. This data can be downloaded directly from MSigDB and imported with, for example, gmt <- GSEABase::getGmt("/YOURPATH/h.all.v6.0.entrez.gmt").

Note that scTGIF only supports NCBI Gene IDs (Entrez IDs) for now. When the scRNA-Seq data is not from human, the situation is more complicated. Here, we use the msigdbr package to retrieve the mouse MSigDB gene sets as shown below.

m_df = msigdbr(species = "Mus musculus",
    category = "H")[, c("gs_name", "entrez_gene")]

hallmark = unique(m_df$gs_name)
gsc <- lapply(hallmark, function(h){
    target = which(m_df$gs_name == h)
    geneIds = unique(as.character(m_df$entrez_gene[target]))
    GeneSet(setName=h, geneIds)
})
gmt = GeneSetCollection(gsc)
gmt = gmt[1:10] # Reduced for this demo
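A gene set collection like the one built above is the source of the gene-by-function matrix X2 described in the introduction. The base R sketch below shows one way such a matrix could be derived from a named list of gene IDs; the gene sets, the IDs, and the binary 0/1 encoding are all hypothetical, and scTGIF's internal construction may differ.

```r
# Hypothetical gene sets keyed by name; IDs are made-up Entrez-like strings
gene_sets <- list(
    SET_A = c("101", "102", "103"),
    SET_B = c("102", "104"),
    SET_C = c("105")
)
# Hypothetical row names of the expression matrix
expr_genes <- c("101", "102", "104", "999")

# Keep only genes present in both the expression matrix and some gene set
common <- intersect(expr_genes, unique(unlist(gene_sets)))

# Binary membership matrix: rows = common genes, columns = gene sets
X2 <- vapply(gene_sets, function(ids) as.numeric(common %in% ids),
    numeric(length(common)))
rownames(X2) <- common
X2
```

In this toy case SET_C shares no gene with the expression matrix, so its column is all zeros; such a column carries no information for the factorization.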

Parameter setting : settingTGIF

Next, the data matrix is converted to a SingleCellExperiment object, and the 2D coordinates are registered in the reducedDims slot.

sce <- SingleCellExperiment(assays = list(counts = DistalLungEpithelium))
reducedDims(sce) <- SimpleList(PCA=pca.DistalLungEpithelium)

Although scTGIF uses the counts slot as the input matrix by default, normalized gene expression data can also be specified. In such a case, we recommend registering the data in the normcounts slot.

CPMED <- function(input){
    libsize <- colSums(input)
    median(libsize) * t(t(input) / libsize)
}
normcounts(sce) <- log10(CPMED(counts(sce)) + 1)
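The effect of CPMED can be checked on a small matrix: it rescales every cell so that its library size equals the median library size across cells. The toy count matrix below is hypothetical and only serves to show the arithmetic.

```r
CPMED <- function(input){
    libsize <- colSums(input)
    median(libsize) * t(t(input) / libsize)
}

# Toy count matrix (hypothetical): 3 genes x 3 cells, filled column by column,
# so the cells have library sizes 10, 20 and 40
toy <- matrix(c(1, 4, 5,
                2, 8, 10,
                4, 16, 20), nrow = 3)
libsize <- colSums(toy)

normalized <- CPMED(toy)
colSums(normalized)      # each cell now sums to median(libsize)
log10(normalized + 1)    # the log transform applied before registering normcounts
```

Because the three toy cells differ only in sequencing depth, their normalized columns become identical, which is the behavior expected from library-size normalization.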

After the data is registered in sce, settingTGIF is called as shown below.

settingTGIF(sce, gmt, reducedDimNames="PCA", assayNames="normcounts")

HTML Report : reportTGIF

Finally, reportTGIF generates an HTML report that summarizes the result of jNMF.

reportTGIF(sce,
    html.open=FALSE,
    title="scTGIF Report for DistalLungEpithelium dataset",
    author="Koki Tsuyuzaki")
## index.Rmd is created...
## index.Rmd is compiled to index.html...
## ################################################
## Data files are saved in
## /tmp/Rtmpk8awJu
## ################################################

Since this function takes some time, please run example("reportTGIF") in your own environment.

Session information

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] plotly_4.10.4               ggplot2_3.5.1              
##  [3] msigdbr_7.5.1               GSEABase_1.69.0            
##  [5] graph_1.85.0                annotate_1.85.0            
##  [7] XML_3.99-0.17               AnnotationDbi_1.69.0       
##  [9] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [11] Biobase_2.67.0              GenomicRanges_1.59.1       
## [13] GenomeInfoDb_1.43.2         IRanges_2.41.2             
## [15] S4Vectors_0.45.2            BiocGenerics_0.53.3        
## [17] generics_0.1.3              MatrixGenerics_1.19.0      
## [19] matrixStats_1.4.1           scTGIF_1.21.0              
## [21] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.2.3               tcltk_4.4.2             rlang_1.1.4            
##  [4] magrittr_2.0.3          compiler_4.4.2          RSQLite_2.3.9          
##  [7] png_0.1-8               vctrs_0.6.5             maps_3.4.2.1           
## [10] nnTensor_1.3.0          pkgconfig_2.0.3         crayon_1.5.3           
## [13] fastmap_1.2.0           XVector_0.47.1          labeling_0.4.3         
## [16] tagcloud_0.6            rmarkdown_2.29          UCSC.utils_1.3.0       
## [19] purrr_1.0.2             bit_4.5.0.1             xfun_0.49              
## [22] cachem_1.1.0            jsonlite_1.8.9          blob_1.2.4             
## [25] DelayedArray_0.33.3     tweenr_2.0.3            cluster_2.1.8          
## [28] R6_2.5.1                bslib_0.8.0             RColorBrewer_1.1-3     
## [31] schex_1.21.0            jquerylib_0.1.4         bookdown_0.41          
## [34] Rcpp_1.0.13-1           knitr_1.49              fields_16.3            
## [37] igraph_2.1.2            Matrix_1.7-1            tidyselect_1.2.1       
## [40] abind_1.4-8             yaml_2.3.10             misc3d_0.9-1           
## [43] lattice_0.22-6          tibble_3.2.1            withr_3.0.2            
## [46] KEGGREST_1.47.0         evaluate_1.0.1          polyclip_1.10-7        
## [49] Biostrings_2.75.3       pillar_1.10.0           BiocManager_1.30.25    
## [52] munsell_0.5.1           scales_1.3.0            xtable_1.8-4           
## [55] rTensor_1.4.8           glue_1.8.0              lazyeval_0.2.2         
## [58] maketools_1.3.1         tools_4.4.2             hexbin_1.28.5          
## [61] sys_3.4.3               data.table_1.16.4       babelgene_22.9         
## [64] buildtools_1.0.0        dotCall64_1.2           grid_4.4.2             
## [67] tidyr_1.3.1             crosstalk_1.2.1         colorspace_2.1-1       
## [70] GenomeInfoDbData_1.2.13 ggforce_0.4.2           cli_3.6.3              
## [73] spam_2.11-0             S4Arrays_1.7.1          plot3D_1.4.1           
## [76] viridisLite_0.4.2       concaveman_1.1.0        dplyr_1.1.4            
## [79] gtable_0.3.6            sass_0.4.9              digest_0.6.37          
## [82] SparseArray_1.7.2       farver_2.1.2            htmlwidgets_1.6.4      
## [85] entropy_1.3.1           memoise_2.0.1           htmltools_0.5.8.1      
## [88] lifecycle_1.0.4         httr_1.4.7              bit64_4.5.2            
## [91] MASS_7.3-61