Select target genes for TAP-seq

One application of TAP-seq is to measure the expression of genes that enable assigning cells to different cell types. The TAPseq package offers a simple approach to identify a desired number of cell type markers based on differentially expressed genes between cell types.

This optional functionality requires additional R-packages such as Seurat. Please install the TAPseq package with all suggested dependencies.

Data

The TAPseq package contains a small subset of the mouse bone marrow data from Baccin et al., 2019 (https://www.nature.com/articles/s41556-019-0439-6). The full dataset can be downloaded here. This dataset is stored in a Seurat object which contains both gene expression and cell type data for every cell.

library(TAPseq)
library(Seurat)

# example of mouse bone marrow 10x gene expression data
data("bone_marrow_genex")

# transcript counts
GetAssayData(bone_marrow_genex)[1:10, 1:10]

# number of cells per cell type
table(Idents(bone_marrow_genex))

Select target genes

A linear model with lasso regularization is used to select target genes that best identify cell types. By default 10-fold cross-validation is used to select the number of target genes based on the largest value of lambda within 1 standard error of the minimum. See ?glmnet::cv.glmnet for more details.

# automatically select a number of target genes using 10-fold cross-validation
target_genes_cv <- selectTargetGenes(bone_marrow_genex)
#> Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
#> multinomial or binomial class has fewer than 8 observations; dangerous ground
#> Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
#> multinomial or binomial class has fewer than 8 observations; dangerous ground
head(target_genes_cv)
length(target_genes_cv)

This example results in a reasonable target gene panel, but in cases with many different cell types the resulting panels might be very large.

To avoid this, we can specify a desired number of targets. This selects a value for lamda in the lasso model that results in approximately this number of non-zero coefficients, i.e. marker genes.

# identify approximately 100 target genes that can be used to identify cell populations
target_genes_100 <- selectTargetGenes(bone_marrow_genex, targets = 100)
length(target_genes_100)

Assess target gene panels

To intuitively assess how well a chosen set of target genes distinguishes cell types, we can use UMAP plots based on the full gene expression data and on target genes only.

plotTargetGenes(bone_marrow_genex, target_genes = target_genes_100)

We can see that the expression of the 100 selected target genes groups cells of different populations together.

A good follow up would be to cluster the cells based on only the target genes following the same workflow used to define the cell identities in the original object. This could then be used to verify that the selected target genes reliably identify the correct cell types.

Session information

All of the output in this vignette was produced under the following conditions:

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] Seurat_5.1.0                      SeuratObject_5.0.2               
#>  [3] sp_2.1-4                          ggplot2_3.5.1                    
#>  [5] dplyr_1.1.4                       BSgenome.Hsapiens.UCSC.hg38_1.4.5
#>  [7] BSgenome_1.75.0                   rtracklayer_1.67.0               
#>  [9] BiocIO_1.17.1                     Biostrings_2.75.3                
#> [11] XVector_0.47.0                    BiocParallel_1.41.0              
#> [13] GenomicRanges_1.59.1              GenomeInfoDb_1.43.2              
#> [15] IRanges_2.41.2                    S4Vectors_0.45.2                 
#> [17] BiocGenerics_0.53.3               generics_0.1.3                   
#> [19] TAPseq_1.19.0                     BiocStyle_2.35.0                 
#> 
#> loaded via a namespace (and not attached):
#>   [1] RcppAnnoy_0.0.22            splines_4.4.2              
#>   [3] later_1.4.1                 bitops_1.0-9               
#>   [5] tibble_3.2.1                polyclip_1.10-7            
#>   [7] XML_3.99-0.17               fastDummies_1.7.4          
#>   [9] lifecycle_1.0.4             globals_0.16.3             
#>  [11] lattice_0.22-6              MASS_7.3-61                
#>  [13] magrittr_2.0.3              plotly_4.10.4              
#>  [15] sass_0.4.9                  rmarkdown_2.29             
#>  [17] jquerylib_0.1.4             yaml_2.3.10                
#>  [19] httpuv_1.6.15               sctransform_0.4.1          
#>  [21] spam_2.11-0                 spatstat.sparse_3.1-0      
#>  [23] reticulate_1.40.0           cowplot_1.1.3              
#>  [25] pbapply_1.7-2               DBI_1.2.3                  
#>  [27] buildtools_1.0.0            RColorBrewer_1.1-3         
#>  [29] abind_1.4-8                 zlibbioc_1.52.0            
#>  [31] Rtsne_0.17                  purrr_1.0.2                
#>  [33] RCurl_1.98-1.16             GenomeInfoDbData_1.2.13    
#>  [35] ggrepel_0.9.6               irlba_2.3.5.1              
#>  [37] listenv_0.9.1               spatstat.utils_3.1-1       
#>  [39] maketools_1.3.1             goftest_1.2-3              
#>  [41] RSpectra_0.16-2             spatstat.random_3.3-2      
#>  [43] fitdistrplus_1.2-1          parallelly_1.40.1          
#>  [45] leiden_0.4.3.1              codetools_0.2-20           
#>  [47] DelayedArray_0.33.3         shape_1.4.6.1              
#>  [49] tidyselect_1.2.1            UCSC.utils_1.3.0           
#>  [51] farver_2.1.2                matrixStats_1.4.1          
#>  [53] spatstat.explore_3.3-3      GenomicAlignments_1.43.0   
#>  [55] jsonlite_1.8.9              progressr_0.15.1           
#>  [57] iterators_1.0.14            ggridges_0.5.6             
#>  [59] survival_3.8-3              foreach_1.5.2              
#>  [61] tools_4.4.2                 ica_1.0-3                  
#>  [63] Rcpp_1.0.13-1               glue_1.8.0                 
#>  [65] gridExtra_2.3               SparseArray_1.7.2          
#>  [67] xfun_0.49                   MatrixGenerics_1.19.0      
#>  [69] withr_3.0.2                 BiocManager_1.30.25        
#>  [71] fastmap_1.2.0               digest_0.6.37              
#>  [73] R6_2.5.1                    mime_0.12                  
#>  [75] colorspace_2.1-1            scattermore_1.2            
#>  [77] tensor_1.5                  spatstat.data_3.1-4        
#>  [79] RSQLite_2.3.9               tidyr_1.3.1                
#>  [81] data.table_1.16.4           httr_1.4.7                 
#>  [83] htmlwidgets_1.6.4           S4Arrays_1.7.1             
#>  [85] uwot_0.2.2                  pkgconfig_2.0.3            
#>  [87] gtable_0.3.6                blob_1.2.4                 
#>  [89] lmtest_0.9-40               sys_3.4.3                  
#>  [91] htmltools_0.5.8.1           dotCall64_1.2              
#>  [93] scales_1.3.0                Biobase_2.67.0             
#>  [95] png_0.1-8                   spatstat.univar_3.1-1      
#>  [97] knitr_1.49                  reshape2_1.4.4             
#>  [99] rjson_0.2.23                nlme_3.1-166               
#> [101] curl_6.0.1                  cachem_1.1.0               
#> [103] zoo_1.8-12                  stringr_1.5.1              
#> [105] KernSmooth_2.23-24          parallel_4.4.2             
#> [107] miniUI_0.1.1.1              AnnotationDbi_1.69.0       
#> [109] restfulr_0.0.15             pillar_1.10.0              
#> [111] grid_4.4.2                  vctrs_0.6.5                
#> [113] RANN_2.6.2                  promises_1.3.2             
#> [115] xtable_1.8-4                cluster_2.1.8              
#> [117] evaluate_1.0.1              GenomicFeatures_1.59.1     
#> [119] cli_3.6.3                   compiler_4.4.2             
#> [121] Rsamtools_2.23.1            rlang_1.1.4                
#> [123] crayon_1.5.3                future.apply_1.11.3        
#> [125] labeling_0.4.3              plyr_1.8.9                 
#> [127] stringi_1.8.4               deldir_2.0-4               
#> [129] viridisLite_0.4.2           munsell_0.5.1              
#> [131] lazyeval_0.2.2              spatstat.geom_3.3-4        
#> [133] glmnet_4.1-8                Matrix_1.7-1               
#> [135] RcppHNSW_0.6.0              patchwork_1.3.0            
#> [137] bit64_4.5.2                 future_1.34.0              
#> [139] KEGGREST_1.47.0             shiny_1.10.0               
#> [141] SummarizedExperiment_1.37.0 ROCR_1.0-11                
#> [143] igraph_2.1.2                memoise_2.0.1              
#> [145] bslib_0.8.0                 bit_4.5.0.1