One application of TAP-seq is to measure the expression of genes that enable assigning cells to different cell types. The TAPseq package offers a simple approach to identify a desired number of cell type markers based on differentially expressed genes between cell types.
This optional functionality requires additional R-packages such as Seurat. Please install the TAPseq package with all suggested dependencies.
The TAPseq package contains a small subset of the mouse bone marrow data from Baccin et al., 2019 (https://www.nature.com/articles/s41556-019-0439-6). The full dataset can be downloaded here. This dataset is stored in a Seurat object which contains both gene expression and cell type data for every cell.
A linear model with lasso regularization is used to select target
genes that best identify cell types. By default 10-fold cross-validation
is used to select the number of target genes based on the largest value
of lambda within 1 standard error of the minimum. See
?glmnet::cv.glmnet
for more details.
# automatically select a number of target genes using 10-fold cross-validation
target_genes_cv <- selectTargetGenes(bone_marrow_genex)
#> Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
#> multinomial or binomial class has fewer than 8 observations; dangerous ground
#> Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
#> multinomial or binomial class has fewer than 8 observations; dangerous ground
head(target_genes_cv)
length(target_genes_cv)
This example results in a reasonable target gene panel, but in cases with many different cell types the resulting panels might be very large.
To avoid this, we can specify a desired number of targets. This selects a value for lamda in the lasso model that results in approximately this number of non-zero coefficients, i.e. marker genes.
To intuitively assess how well a chosen set of target genes distinguishes cell types, we can use UMAP plots based on the full gene expression data and on target genes only.
We can see that the expression of the 100 selected target genes groups cells of different populations together.
A good follow up would be to cluster the cells based on only the target genes following the same workflow used to define the cell identities in the original object. This could then be used to verify that the selected target genes reliably identify the correct cell types.
All of the output in this vignette was produced under the following conditions:
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] Seurat_5.1.0 SeuratObject_5.0.2
#> [3] sp_2.1-4 ggplot2_3.5.1
#> [5] dplyr_1.1.4 BSgenome.Hsapiens.UCSC.hg38_1.4.5
#> [7] BSgenome_1.75.0 rtracklayer_1.67.0
#> [9] BiocIO_1.17.0 Biostrings_2.75.1
#> [11] XVector_0.47.0 BiocParallel_1.41.0
#> [13] GenomicRanges_1.59.0 GenomeInfoDb_1.43.1
#> [15] IRanges_2.41.1 S4Vectors_0.45.2
#> [17] BiocGenerics_0.53.3 generics_0.1.3
#> [19] TAPseq_1.19.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] RcppAnnoy_0.0.22 splines_4.4.2
#> [3] later_1.3.2 bitops_1.0-9
#> [5] tibble_3.2.1 polyclip_1.10-7
#> [7] XML_3.99-0.17 fastDummies_1.7.4
#> [9] lifecycle_1.0.4 globals_0.16.3
#> [11] lattice_0.22-6 MASS_7.3-61
#> [13] magrittr_2.0.3 plotly_4.10.4
#> [15] sass_0.4.9 rmarkdown_2.29
#> [17] jquerylib_0.1.4 yaml_2.3.10
#> [19] httpuv_1.6.15 sctransform_0.4.1
#> [21] spam_2.11-0 spatstat.sparse_3.1-0
#> [23] reticulate_1.40.0 cowplot_1.1.3
#> [25] pbapply_1.7-2 DBI_1.2.3
#> [27] buildtools_1.0.0 RColorBrewer_1.1-3
#> [29] abind_1.4-8 zlibbioc_1.52.0
#> [31] Rtsne_0.17 purrr_1.0.2
#> [33] RCurl_1.98-1.16 GenomeInfoDbData_1.2.13
#> [35] ggrepel_0.9.6 irlba_2.3.5.1
#> [37] listenv_0.9.1 spatstat.utils_3.1-1
#> [39] maketools_1.3.1 goftest_1.2-3
#> [41] RSpectra_0.16-2 spatstat.random_3.3-2
#> [43] fitdistrplus_1.2-1 parallelly_1.39.0
#> [45] leiden_0.4.3.1 codetools_0.2-20
#> [47] DelayedArray_0.33.2 shape_1.4.6.1
#> [49] tidyselect_1.2.1 UCSC.utils_1.3.0
#> [51] farver_2.1.2 matrixStats_1.4.1
#> [53] spatstat.explore_3.3-3 GenomicAlignments_1.43.0
#> [55] jsonlite_1.8.9 progressr_0.15.0
#> [57] iterators_1.0.14 ggridges_0.5.6
#> [59] survival_3.7-0 foreach_1.5.2
#> [61] tools_4.4.2 ica_1.0-3
#> [63] Rcpp_1.0.13-1 glue_1.8.0
#> [65] gridExtra_2.3 SparseArray_1.7.2
#> [67] xfun_0.49 MatrixGenerics_1.19.0
#> [69] withr_3.0.2 BiocManager_1.30.25
#> [71] fastmap_1.2.0 fansi_1.0.6
#> [73] digest_0.6.37 R6_2.5.1
#> [75] mime_0.12 colorspace_2.1-1
#> [77] scattermore_1.2 tensor_1.5
#> [79] spatstat.data_3.1-4 RSQLite_2.3.8
#> [81] utf8_1.2.4 tidyr_1.3.1
#> [83] data.table_1.16.2 httr_1.4.7
#> [85] htmlwidgets_1.6.4 S4Arrays_1.7.1
#> [87] uwot_0.2.2 pkgconfig_2.0.3
#> [89] gtable_0.3.6 blob_1.2.4
#> [91] lmtest_0.9-40 sys_3.4.3
#> [93] htmltools_0.5.8.1 dotCall64_1.2
#> [95] scales_1.3.0 Biobase_2.67.0
#> [97] png_0.1-8 spatstat.univar_3.1-1
#> [99] knitr_1.49 reshape2_1.4.4
#> [101] rjson_0.2.23 nlme_3.1-166
#> [103] curl_6.0.1 cachem_1.1.0
#> [105] zoo_1.8-12 stringr_1.5.1
#> [107] KernSmooth_2.23-24 parallel_4.4.2
#> [109] miniUI_0.1.1.1 AnnotationDbi_1.69.0
#> [111] restfulr_0.0.15 pillar_1.9.0
#> [113] grid_4.4.2 vctrs_0.6.5
#> [115] RANN_2.6.2 promises_1.3.0
#> [117] xtable_1.8-4 cluster_2.1.6
#> [119] evaluate_1.0.1 GenomicFeatures_1.59.1
#> [121] cli_3.6.3 compiler_4.4.2
#> [123] Rsamtools_2.23.0 rlang_1.1.4
#> [125] crayon_1.5.3 future.apply_1.11.3
#> [127] labeling_0.4.3 plyr_1.8.9
#> [129] stringi_1.8.4 deldir_2.0-4
#> [131] viridisLite_0.4.2 munsell_0.5.1
#> [133] lazyeval_0.2.2 spatstat.geom_3.3-3
#> [135] glmnet_4.1-8 Matrix_1.7-1
#> [137] RcppHNSW_0.6.0 patchwork_1.3.0
#> [139] bit64_4.5.2 future_1.34.0
#> [141] KEGGREST_1.47.0 shiny_1.9.1
#> [143] SummarizedExperiment_1.37.0 ROCR_1.0-11
#> [145] igraph_2.1.1 memoise_2.0.1
#> [147] bslib_0.8.0 bit_4.5.0