EasyCellType: an example workflow

1. Introduction

The EasyCellType package examines an input marker list against public marker gene databases and presents annotation recommendations graphically. The package draws on three publicly available marker gene databases and provides two approaches to the annotation analysis: gene set enrichment analysis (GSEA) and a modified Fisher’s exact test. The package has been submitted to Bioconductor so that researchers can access it easily.

This vignette shows a simple workflow illustrating how the EasyCellType package works. The data set used throughout the example is freely available from 10X Genomics.

Installation

The package can be installed using BiocManager with the following commands:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("EasyCellType")

Alternatively, the package can be installed from GitHub using devtools:

library(devtools)
install_github("rx-li/EasyCellType")

After the installation, the package can be loaded with

library(EasyCellType)

2. Example workflow

We use the Peripheral Blood Mononuclear Cells (PBMC) data freely available from 10X Genomics. The data can be downloaded from https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz. After downloading, the data can be read with the Read10X function from the Seurat package.
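
As a minimal sketch of that reading step, the code below assumes the archive has already been downloaded and extracted into the working directory. The folder name filtered_gene_bc_matrices/hg19 is an assumption about how the archive unpacks; adjust data_dir to match your own location.

```r
library(Seurat)

# Assumed location of the extracted 10X files (barcodes, genes, matrix);
# change this path if the archive was unpacked somewhere else.
data_dir <- file.path("filtered_gene_bc_matrices", "hg19")

if (dir.exists(data_dir)) {
  # Read10X returns a sparse count matrix with genes in rows and cells in columns
  pbmc_data <- Read10X(data.dir = data_dir)
  print(dim(pbmc_data))
} else {
  message("Directory not found: ", data_dir)
}
```

The guard around Read10X only avoids an error when the folder is missing; it is not required by Seurat.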

We have included the data in our package, which can be loaded with

data(pbmc_data)

We followed the standard workflow provided by the Seurat package (Hao et al. 2021) to process the PBMC data set. Detailed technical explanations can be found at https://satijalab.org/seurat/articles/pbmc3k_tutorial.html.

library(Seurat)
# Initialize the Seurat object
pbmc <- CreateSeuratObject(counts = pbmc_data, project = "pbmc3k", min.cells = 3, min.features = 200)
# QC and select samples
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
# Normalize the data
pbmc <- NormalizeData(pbmc)
# Identify highly variable features
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Scale the data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
# Perform linear dimensional reduction
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# Cluster the cells
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
# Find differentially expressed features
markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

We now have the differentially expressed markers for each cluster. Next, we convert the gene symbols to Entrez IDs.

library(org.Hs.eg.db)
library(AnnotationDbi)
markers$entrezid <- mapIds(org.Hs.eg.db,
                           keys=markers$gene, # Column containing gene symbols
                           column="ENTREZID",
                           keytype="SYMBOL",
                           multiVals="first")
markers <- na.omit(markers)

If the data come from mouse, replace the package org.Hs.eg.db with org.Mm.eg.db and repeat the steps above.
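
A minimal sketch of the mouse variant is shown here. The gene symbols are illustrative examples chosen for this sketch and do not come from the PBMC data; the only change from the human code is the annotation package passed to mapIds.

```r
library(org.Mm.eg.db)
library(AnnotationDbi)

# Illustrative mouse gene symbols (an assumption for this sketch)
mouse_symbols <- c("Cd3e", "Ms4a1", "Lyz2")

# Same call as for human data, with the mouse annotation database
mouse_entrez <- mapIds(org.Mm.eg.db,
                       keys = mouse_symbols,
                       column = "ENTREZID",
                       keytype = "SYMBOL",
                       multiVals = "first")

# Named character vector: names are symbols, values are Entrez IDs
mouse_entrez
```

Symbols that cannot be mapped come back as NA, which is why the human workflow calls na.omit afterwards.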

The input for the EasyCellType package should be a data frame with three columns in this order: Entrez IDs, clusters, and expression scores. Within each cluster, the genes should be sorted by expression score; the code below sorts them in decreasing order.

library(dplyr)
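markers_sort <- data.frame(gene=markers$entrezid, cluster=markers$cluster,
                      score=markers$avg_log2FC) %>%
  group_by(cluster) %>%
  # Rank genes within each cluster by score, breaking ties at random
  mutate(rank = rank(score, ties.method = "random")) %>%
  # Highest-ranked genes first
  arrange(desc(rank))
# Keep the three input columns: gene, cluster, score
input.d <- as.data.frame(markers_sort[, 1:3])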
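
To make the expected layout concrete, here is a small sketch on a toy data frame in base R. The gene IDs and scores are made-up placeholders, and check_input is a hypothetical helper written for this illustration; it is not part of EasyCellType.

```r
# Toy marker table; IDs and scores are placeholders, not real results
toy <- data.frame(
  gene    = c("101", "102", "103", "201", "202"),
  cluster = factor(c(0, 0, 0, 1, 1)),
  score   = c(0.4, 2.1, 1.3, 0.9, 3.0)
)

# Sort by cluster, then by decreasing score within each cluster
toy_sorted <- toy[order(toy$cluster, -toy$score), ]
rownames(toy_sorted) <- NULL

# Hypothetical helper: three columns, numeric score in column 3,
# and scores in decreasing order within every cluster
check_input <- function(d) {
  stopifnot(ncol(d) == 3, is.numeric(d[[3]]))
  sorted <- tapply(d[[3]], d[[2]], function(s) !is.unsorted(rev(s)))
  all(sorted)
}

check_input(toy_sorted)
```

Running the same check on the unsorted toy table returns FALSE, because the scores in cluster 0 are not in decreasing order.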

We have included the processed data in the package. It can be loaded with

data("gene_pbmc")
input.d <- gene_pbmc

Now we can call the easyct function to run the annotation analysis.

annot.GSEA <- easyct(input.d, db="cellmarker", species="Human", 
                    tissue=c("Blood", "Peripheral blood", "Blood vessel",
                      "Umbilical cord blood", "Venous blood"), p_cut=0.3,
                    test="GSEA")

Here we used the GSEA approach for the annotation. The package uses the GSEA function from the clusterProfiler package (Wu et al. 2021) to conduct the enrichment analysis. You can replace ‘GSEA’ with ‘fisher’ if you would like to use Fisher’s exact test for the annotation instead. The candidate tissues can be listed using data(cellmarker_tissue), data(clustermole_tissue) and data(panglao_tissue).
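
For intuition about the Fisher alternative, the sketch below runs a standard one-sided Fisher’s exact test on a toy 2x2 table using base R. It is not the modified test implemented in the package, and all counts are invented for illustration; it only shows the kind of question being asked: do a cluster's markers overlap a cell type's known markers more than expected by chance?

```r
# Toy counts, chosen only for illustration
universe_size   <- 2000  # all genes considered
cluster_markers <- 50    # markers found for one cluster
celltype_genes  <- 40    # database markers for one candidate cell type
overlap         <- 12    # genes present in both sets

# 2x2 contingency table (filled column by column)
contingency <- matrix(
  c(overlap,
    cluster_markers - overlap,
    celltype_genes - overlap,
    universe_size - cluster_markers - celltype_genes + overlap),
  nrow = 2,
  dimnames = list(in_celltype = c("yes", "no"),
                  in_cluster  = c("yes", "no"))
)

# One-sided test: is the overlap larger than expected by chance?
res <- fisher.test(contingency, alternative = "greater")
res$p.value
```

With these numbers the expected overlap under independence is about one gene, so an overlap of 12 yields a very small p-value.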

The dot plot showing the overall annotation results can be created by

plot_dot(test="GSEA", annot.GSEA)

A bar plot can be created by

plot_bar(test="GSEA", annot.GSEA)

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] dplyr_1.1.4          org.Hs.eg.db_3.20.0  AnnotationDbi_1.69.0
#>  [4] IRanges_2.41.1       S4Vectors_0.45.2     Biobase_2.67.0      
#>  [7] BiocGenerics_0.53.3  generics_0.1.3       Seurat_5.1.0        
#> [10] SeuratObject_5.0.2   sp_2.1-4             EasyCellType_1.5.4  
#> [13] devtools_2.4.5       usethis_3.0.0        BiocStyle_2.35.0    
#> 
#> loaded via a namespace (and not attached):
#>   [1] RcppAnnoy_0.0.22        splines_4.4.2           later_1.3.2            
#>   [4] ggplotify_0.1.2         tibble_3.2.1            R.oo_1.27.0            
#>   [7] polyclip_1.10-7         fastDummies_1.7.4       lifecycle_1.0.4        
#>  [10] globals_0.16.3          processx_3.8.4          lattice_0.22-6         
#>  [13] MASS_7.3-61             magrittr_2.0.3          plotly_4.10.4          
#>  [16] sass_0.4.9              rmarkdown_2.29          jquerylib_0.1.4        
#>  [19] yaml_2.3.10             remotes_2.5.0           httpuv_1.6.15          
#>  [22] ggtangle_0.0.4          sctransform_0.4.1       spam_2.11-0            
#>  [25] spatstat.sparse_3.1-0   sessioninfo_1.2.2       pkgbuild_1.4.5         
#>  [28] reticulate_1.40.0       pbapply_1.7-2           cowplot_1.1.3          
#>  [31] DBI_1.2.3               buildtools_1.0.0        RColorBrewer_1.1-3     
#>  [34] abind_1.4-8             pkgload_1.4.0           zlibbioc_1.52.0        
#>  [37] Rtsne_0.17              purrr_1.0.2             R.utils_2.12.3         
#>  [40] yulab.utils_0.1.8       GenomeInfoDbData_1.2.13 enrichplot_1.27.1      
#>  [43] ggrepel_0.9.6           irlba_2.3.5.1           spatstat.utils_3.1-1   
#>  [46] listenv_0.9.1           tidytree_0.4.6          maketools_1.3.1        
#>  [49] goftest_1.2-3           RSpectra_0.16-2         spatstat.random_3.3-2  
#>  [52] fitdistrplus_1.2-1      parallelly_1.39.0       leiden_0.4.3.1         
#>  [55] codetools_0.2-20        DOSE_4.1.0              tidyselect_1.2.1       
#>  [58] aplot_0.2.3             UCSC.utils_1.3.0        farver_2.1.2           
#>  [61] spatstat.explore_3.3-3  matrixStats_1.4.1       jsonlite_1.8.9         
#>  [64] ellipsis_0.3.2          progressr_0.15.0        ggridges_0.5.6         
#>  [67] survival_3.7-0          tools_4.4.2             treeio_1.31.0          
#>  [70] ica_1.0-3               Rcpp_1.0.13-1           glue_1.8.0             
#>  [73] gridExtra_2.3           xfun_0.49               qvalue_2.39.0          
#>  [76] GenomeInfoDb_1.43.1     withr_3.0.2             BiocManager_1.30.25    
#>  [79] fastmap_1.2.0           fansi_1.0.6             callr_3.7.6            
#>  [82] digest_0.6.37           R6_2.5.1                mime_0.12              
#>  [85] gridGraphics_0.5-1      colorspace_2.1-1        scattermore_1.2        
#>  [88] GO.db_3.20.0            tensor_1.5              spatstat.data_3.1-4    
#>  [91] RSQLite_2.3.8           R.methodsS3_1.8.2       utf8_1.2.4             
#>  [94] tidyr_1.3.1             data.table_1.16.2       httr_1.4.7             
#>  [97] htmlwidgets_1.6.4       org.Mm.eg.db_3.20.0     uwot_0.2.2             
#> [100] pkgconfig_2.0.3         gtable_0.3.6            blob_1.2.4             
#> [103] lmtest_0.9-40           XVector_0.47.0          sys_3.4.3              
#> [106] clusterProfiler_4.15.0  htmltools_0.5.8.1       profvis_0.4.0          
#> [109] dotCall64_1.2           fgsea_1.33.0            scales_1.3.0           
#> [112] png_0.1-8               spatstat.univar_3.1-1   ggfun_0.1.7            
#> [115] knitr_1.49              reshape2_1.4.4          nlme_3.1-166           
#> [118] curl_6.0.1              zoo_1.8-12              cachem_1.1.0           
#> [121] stringr_1.5.1           KernSmooth_2.23-24      parallel_4.4.2         
#> [124] miniUI_0.1.1.1          desc_1.4.3              pillar_1.9.0           
#> [127] grid_4.4.2              vctrs_0.6.5             RANN_2.6.2             
#> [130] urlchecker_1.0.1        promises_1.3.0          xtable_1.8-4           
#> [133] cluster_2.1.6           evaluate_1.0.1          cli_3.6.3              
#> [136] compiler_4.4.2          rlang_1.1.4             crayon_1.5.3           
#> [139] future.apply_1.11.3     ps_1.8.1                plyr_1.8.9             
#> [142] forcats_1.0.0           fs_1.6.5                stringi_1.8.4          
#> [145] deldir_2.0-4            viridisLite_0.4.2       BiocParallel_1.41.0    
#> [148] munsell_0.5.1           Biostrings_2.75.1       lazyeval_0.2.2         
#> [151] spatstat.geom_3.3-4     GOSemSim_2.33.0         Matrix_1.7-1           
#> [154] RcppHNSW_0.6.0          patchwork_1.3.0         bit64_4.5.2            
#> [157] future_1.34.0           ggplot2_3.5.1           KEGGREST_1.47.0        
#> [160] shiny_1.9.1             ROCR_1.0-11             igraph_2.1.1           
#> [163] memoise_2.0.1           bslib_0.8.0             ggtree_3.15.0          
#> [166] fastmatch_1.1-4         bit_4.5.0               ape_5.8                
#> [169] gson_0.1.0

3. References

Hao, Yuhan, Stephanie Hao, Erica Andersen-Nissen, William M. Mauck III, Shiwei Zheng, Andrew Butler, Maddie J. Lee, et al. 2021. “Integrated Analysis of Multimodal Single-Cell Data.” Cell. https://doi.org/10.1016/j.cell.2021.04.048.
Wu, Tianzhi, Erqiang Hu, Shuangbin Xu, Meijun Chen, Pingfan Guo, Zehan Dai, Tingze Feng, et al. 2021. “clusterProfiler 4.0: A Universal Enrichment Tool for Interpreting Omics Data.” The Innovation. https://doi.org/10.1016/j.xinn.2021.100141.