This page is a standalone metadata guide for cellNexus
and documents the key fields used in downstream analysis.
library(cellNexus)
metadata <- get_metadata(cloud_metadata = SAMPLE_DATABASE_URL["cellnexus"])
metadata
#> # Source: SQL [?? x 58]
#> # Database: DuckDB 1.4.3 [unknown@Linux 5.14.0-570.112.1.el9_6.x86_64:R 4.5.3/:memory:]
#> cell_id observation_joinid dataset_id sample_id sample_ experiment___ run_from_cell_id sample_heuristic age_days tissue_groups
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 17 QRMCN*8*|# 842c6f5d-4a94-4eef-8510-8c792d1… 1119f482… 1119f4… "" <NA> 182a61cc-b041-4… 14600 breast
#> 2 16 j}0<Y>a#X~ 842c6f5d-4a94-4eef-8510-8c792d1… 1119f482… 1119f4… "" <NA> 182a61cc-b041-4… 14600 breast
#> 3 20 6Eu5c&aEH; 842c6f5d-4a94-4eef-8510-8c792d1… 1119f482… 1119f4… "" <NA> 182a61cc-b041-4… 14600 breast
#> 4 19 lNmuO5xs~3 842c6f5d-4a94-4eef-8510-8c792d1… 1119f482… 1119f4… "" <NA> 182a61cc-b041-4… 14600 breast
#> 5 15 TjgA2vJ1;{ 842c6f5d-4a94-4eef-8510-8c792d1… 1119f482… 1119f4… "" <NA> 182a61cc-b041-4… 14600 breast
#> 6 18 h22!#$}SJ* 842c6f5d-4a94-4eef-8510-8c792d1… 1119f482… 1119f4… "" <NA> 182a61cc-b041-4… 14600 breast
#> 7 14 qxl7HJjL$L 842c6f5d-4a94-4eef-8510-8c792d1… 1119f482… 1119f4… "" <NA> 182a61cc-b041-4… 14600 breast
#> 8 3 5jp7Uu{#@# 842c6f5d-4a94-4eef-8510-8c792d1… 1f755b9b… 1f755b… "" <NA> 9ca47fe5-873e-4… 14600 breast
#> 9 2 $jvBt8wHSK 842c6f5d-4a94-4eef-8510-8c792d1… 1f755b9b… 1f755b… "" <NA> 9ca47fe5-873e-4… 14600 breast
#> 10 5 N>_|{;6_6N 842c6f5d-4a94-4eef-8510-8c792d1… 1f755b9b… 1f755b… "" <NA> 9ca47fe5-873e-4… 14600 breast
#> # ℹ more rows
#> # ℹ 48 more variables: nFeature_expressed_in_sample <int>, nCount_RNA <dbl>, empty_droplet <lgl>, cell_type_unified_ensemble <chr>, is_immune <lgl>,
#> # subsets_Mito_percent <int>, subsets_Ribo_percent <int>, high_mitochondrion <lgl>, high_ribosome <lgl>, scDblFinder.class <chr>,
#> # sample_chunk <int>, cell_chunk <int>, sample_pseudobulk_chunk <int>, file_id_cellNexus_single_cell <chr>, file_id_cellNexus_pseudobulk <chr>,
#> # count_upper_bound <dbl>, nfeature_expressed_thresh <dbl>, inverse_transform <chr>, alive <lgl>, cell_annotation_blueprint_singler <chr>,
#> # cell_annotation_monaco_singler <chr>, cell_annotation_azimuth_l2 <chr>, ethnicity_flagging_score <dbl>, low_confidence_ethnicity <chr>,
#> # .aggregated_cells <int>, imputed_ethnicity <chr>, atlas_id <chr>, citation <chr>, collection_id <chr>, dataset_version_id <chr>, …

cellNexus metadata are harmonised to support cross-dataset analysis:
| Column | Description |
|---|---|
| `cell_id` | Cell identifier. |
| `observation_joinid` | Cell ID join key linking metadata. |
| `dataset_id` | Primary dataset identifier in the atlas. |
| `sample_id` | Harmonised sample identifier. |
| `sample_` | Internal sample subdivision helper. |
| `experiment___` | Upstream experiment grouping variable. |
| `sample_heuristic` | Internal sample subdivision helper. |
| `age_days` | Donor age in days. |
| `tissue_groups` | Coarse tissue grouping for analysis. |
| `nFeature_expressed_in_sample` | Number of expressed features per cell. |
| `nCount_RNA` | Total RNA counts per cell (sample-aware). |
| `empty_droplet` | Quality-control flag for empty droplets. |
| `cell_type_unified_ensemble` | Consensus immune identity from Azimuth and SingleR (Blueprint, Monaco). |
| `is_immune` | Curated flag for immune-cell context. |
| `subsets_Mito_percent` | Percent of each cell’s total counts coming from mitochondrial genes in a sample. |
| `subsets_Ribo_percent` | Percent of each cell’s total counts coming from ribosomal genes in a sample. |
| `high_mitochondrion` | TRUE if the cell’s mitochondrial percent exceeds the QC cutoff. |
| `high_ribosome` | TRUE if the cell’s ribosomal percent exceeds the QC cutoff. |
| `scDblFinder.class` | Quality-control flag for doublet classification from scDblFinder. |
| `sample_chunk` | Internal sample subdivision chunks. |
| `cell_chunk` | Internal cell subdivision chunks. |
| `sample_pseudobulk_chunk` | Internal pseudobulk subdivision chunks. |
| `file_id_cellNexus_single_cell` | Internal file id for single-cell layers. |
| `file_id_cellNexus_pseudobulk` | Internal file id for pseudobulk layers. |
| `count_upper_bound` | Count capping threshold used in transformation. |
| `nfeature_expressed_thresh` | Threshold of the number of expressed features per cell. |
| `inverse_transform` | Transformation method used in pre-processing pipeline. |
| `alive` | Quality-control flag for viable cells (e.g. mitochondrial signal). |
| `cell_annotation_blueprint_singler` | SingleR annotation (Blueprint). |
| `cell_annotation_monaco_singler` | SingleR annotation (Monaco). |
| `cell_annotation_azimuth_l2` | Azimuth cell annotation. |
| `ethnicity_flagging_score` | Supporting score for ethnicity imputation. |
| `low_confidence_ethnicity` | Supporting flag for low-confidence ethnicity calls. |
| `.aggregated_cells` | Post-QC cells combined into each pseudobulk sample. |
| `imputed_ethnicity` | Imputed ethnicity label. |
| `atlas_id` | cellNexus atlas release identifier (internal use). |
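
As a minimal sketch of how the quality-control columns in the table might be inspected before any filtering, the block below tallies each flag on a small hand-made tibble and also expresses `age_days` in years. The object `toy_metadata` and all of its values are invented for illustration; only the column names come from the table. The "singlet"/"doublet" coding of `scDblFinder.class` follows the scDblFinder convention, and the 365.25 days-per-year divisor is an assumption made here, not something cellNexus prescribes.

```r
library(dplyr)

# Toy stand-in for the cellNexus metadata table (all values are invented).
toy_metadata <- tibble::tibble(
  cell_id           = 1:6,
  age_days          = c(14600L, 14600L, 14600L, 7300L, 7300L, 7300L),
  empty_droplet     = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  alive             = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE),
  scDblFinder.class = c("singlet", "doublet", "singlet",
                        "singlet", "singlet", "singlet")
)

# Count how many cells each QC flag marks, and how many pass all three checks.
qc_summary <- toy_metadata |>
  summarise(
    n_cells         = n(),
    n_empty_droplet = sum(empty_droplet),
    n_not_alive     = sum(!alive),
    n_doublet       = sum(scDblFinder.class == "doublet"),
    n_pass_all      = sum(!empty_droplet & alive & scDblFinder.class != "doublet")
  )

# Express donor age in years (365.25 days per year is an assumed convention).
toy_metadata <- toy_metadata |>
  mutate(age_years = age_days / 365.25)

qc_summary
```

This runs on an in-memory tibble. Whether each expression translates unchanged to SQL on the lazy DuckDB-backed table has not been verified here, so it may be worth trying it on a small subset first.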
# Which columns are available?
colnames(metadata)
#> [1] "cell_id" "observation_joinid" "dataset_id" "sample_id"
#> [5] "sample_" "experiment___" "run_from_cell_id" "sample_heuristic"
#> [9] "age_days" "tissue_groups" "nFeature_expressed_in_sample" "nCount_RNA"
#> [13] "empty_droplet" "cell_type_unified_ensemble" "is_immune" "subsets_Mito_percent"
#> [17] "subsets_Ribo_percent" "high_mitochondrion" "high_ribosome" "scDblFinder.class"
#> [21] "sample_chunk" "cell_chunk" "sample_pseudobulk_chunk" "file_id_cellNexus_single_cell"
#> [25] "file_id_cellNexus_pseudobulk" "count_upper_bound" "nfeature_expressed_thresh" "inverse_transform"
#> [29] "alive" "cell_annotation_blueprint_singler" "cell_annotation_monaco_singler" "cell_annotation_azimuth_l2"
#> [33] "ethnicity_flagging_score" "low_confidence_ethnicity" ".aggregated_cells" "imputed_ethnicity"
#> [37] "atlas_id" "citation" "collection_id" "dataset_version_id"
#> [41] "default_embedding" "published_at" "raw_data_location" "revised_at"
#> [45] "primary_cell_count" "schema_version" "tissue_type" "title"
#> [49] "tombstone" "x_approximate_distribution" "explorer_url" "cell_count"
#> [53] "feature_count" "filesize" "filetype" "mean_genes_per_cell"
#> [57] "suspension_type" "url"
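
Because downstream code depends on specific column names, it can help to check for them up front. The following is a sketch only: `required_columns` is a hypothetical list chosen for illustration, and `available_columns` hard-codes a subset of the names printed above so that the example runs without a database connection.

```r
# Columns a hypothetical downstream analysis expects (chosen for illustration).
required_columns <- c(
  "cell_id", "sample_id", "dataset_id", "tissue_groups",
  "empty_droplet", "alive", "scDblFinder.class"
)

# A subset of the column names printed above, hard-coded so the sketch runs offline.
available_columns <- c(
  "cell_id", "observation_joinid", "dataset_id", "sample_id",
  "age_days", "tissue_groups", "empty_droplet", "alive", "scDblFinder.class"
)

# Names that are required but absent from the metadata.
missing_columns <- setdiff(required_columns, available_columns)

if (length(missing_columns) > 0) {
  stop("Missing metadata columns: ", paste(missing_columns, collapse = ", "))
}
```

In a live session, `available_columns` would be replaced by `colnames(metadata)`.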
# How many datasets per tissue group?
metadata |>
  dplyr::distinct(dataset_id, tissue_groups) |>
  dplyr::count(tissue_groups, sort = TRUE)
#> # Source: SQL [?? x 2]
#> # Database: DuckDB 1.4.3 [unknown@Linux 5.14.0-570.112.1.el9_6.x86_64:R 4.5.3/:memory:]
#> # Ordered by: desc(n)
#> tissue_groups n
#> <chr> <dbl>
#> 1 blood 9
#> 2 respiratory system 6
#> 3 bone marrow 5
#> 4 renal system 3
#> 5 thymus 3
#> 6 breast 3
#> 7 cerebral lobes and cortical areas 2
#> 8 nasal, oral, and pharyngeal regions 2
#> 9 female reproductive system 2
#> 10 lymphatic system 2
#> 11 spleen 2
#> 12 integumentary system (skin) 1
#> 13 sensory-related structures 1
#> 14 oesophagus 1
#> 15 brainstem and cerebellar structures 1
#> 16 small intestine 1
#> 17 vasculature 1
#> 18 epithelium and mucosal tissues 1
#> 19 gastrointestinal accessory organs 1
#> 20 stomach 1
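
The `# Source: SQL [?? x 2]` header above indicates that the result is still a lazy query rather than an in-memory table. As a sketch of the same counting pattern applied to samples instead of datasets, the block below uses an invented tibble (`toy_metadata`, with made-up identifiers) and finishes with `dplyr::collect()`, which on a lazy table runs the query and returns a local tibble.

```r
library(dplyr)

# Toy stand-in for the metadata table (identifiers are invented).
toy_metadata <- tibble::tibble(
  dataset_id    = c("d1", "d1", "d1", "d2", "d2", "d3"),
  sample_id     = c("s1", "s1", "s2", "s3", "s3", "s4"),
  tissue_groups = c("blood", "blood", "blood", "blood", "blood", "breast")
)

# One row per sample/tissue pair, then count samples within each tissue group.
samples_per_tissue <- toy_metadata |>
  distinct(sample_id, tissue_groups) |>
  count(tissue_groups, sort = TRUE, name = "n_samples") |>
  collect()  # no-op on a tibble; on a lazy table this executes the query

samples_per_tissue
```

Collecting only after `distinct()` and `count()` keeps the transferred result small, since the aggregation happens before any rows are pulled into R.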
# Typical quality-control filtering
metadata_qc <- metadata |>
  dplyr::filter(
    empty_droplet == FALSE,
    alive == TRUE,
    scDblFinder.class != "doublet"
  )

sessionInfo()
#> R version 4.5.3 (2026-03-11)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Red Hat Enterprise Linux 9.6 (Plow)
#>
#> Matrix products: default
#> BLAS: /stornext/System/data/software/rhel/9/base/tools/R/4.5.3/lib64/R/lib/libRblas.so
#> LAPACK: /stornext/System/data/software/rhel/9/base/tools/R/4.5.3/lib64/R/lib/libRlapack.so; LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
#> [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Australia/Melbourne
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] BiocStyle_2.38.0 ggplot2_4.0.2 dplyr_1.2.1 cellNexus_0.99.22
#>
#> loaded via a namespace (and not attached):
#> [1] RcppAnnoy_0.0.23 splines_4.5.3 later_1.4.8 filelock_1.0.3
#> [5] tibble_3.3.1 polyclip_1.10-7 fastDummies_1.7.5 lifecycle_1.0.5
#> [9] rprojroot_2.1.1 globals_0.19.1 lattice_0.22-9 MASS_7.3-65
#> [13] backports_1.5.1 magrittr_2.0.5 sass_0.4.10 plotly_4.12.0
#> [17] rmarkdown_2.31 jquerylib_0.1.4 yaml_2.3.12 httpuv_1.6.17
#> [21] otel_0.2.0 Seurat_5.5.0.9002 sctransform_0.4.3 spam_2.11-3
#> [25] sp_2.2-1 sessioninfo_1.2.3 pkgbuild_1.4.8 spatstat.sparse_3.1-0
#> [29] reticulate_1.46.0 cowplot_1.2.0 pbapply_1.7-4 DBI_1.3.0
#> [33] RColorBrewer_1.1-3 abind_1.4-8 pkgload_1.5.1 Rtsne_0.17
#> [37] GenomicRanges_1.62.1 purrr_1.2.2 BiocGenerics_0.56.0 tidySingleCellExperiment_1.20.1
#> [41] IRanges_2.44.0 S4Vectors_0.49.1-1 ggrepel_0.9.8 irlba_2.3.7
#> [45] listenv_0.10.1 spatstat.utils_3.2-2 goftest_1.2-3 RSpectra_0.16-2
#> [49] spatstat.random_3.4-5 fitdistrplus_1.2-6 parallelly_1.46.1 commonmark_2.0.0
#> [53] codetools_0.2-20 DelayedArray_0.36.1 xml2_1.5.2 tidyselect_1.2.1
#> [57] rclipboard_0.2.1 UCSC.utils_1.6.1 farver_2.1.2 shinyWidgets_0.9.1
#> [61] matrixStats_1.5.0 stats4_4.5.3 spatstat.explore_3.8-0 duckdb_1.4.3
#> [65] Seqinfo_1.0.0 roxygen2_7.3.3 jsonlite_2.0.0 ellipsis_0.3.3
#> [69] progressr_0.19.0 ggridges_0.5.7 survival_3.8-6 tools_4.5.3
#> [73] ica_1.0-3 Rcpp_1.1.1-1 glue_1.8.0 gridExtra_2.3
#> [77] SparseArray_1.10.10 xfun_0.57 MatrixGenerics_1.22.0 usethis_3.2.1
#> [81] GenomeInfoDb_1.46.2 HDF5Array_1.38.0 withr_3.0.2 BiocManager_1.30.27
#> [85] fastmap_1.2.0 basilisk_1.22.0 fansi_1.0.7 rhdf5filters_1.22.0
#> [89] ttservice_0.5.3 digest_0.6.39 R6_2.6.1 mime_0.13
#> [93] scattermore_1.2 tensor_1.5.1 spatstat.data_3.1-9 h5mread_1.2.1
#> [97] utf8_1.2.6 tidyr_1.3.2 generics_0.1.4 data.table_1.18.2.1
#> [101] httr_1.4.8 htmlwidgets_1.6.4 S4Arrays_1.10.1 uwot_0.2.4
#> [105] pkgconfig_2.0.3 gtable_0.3.6 rsconnect_1.8.0 blob_1.3.0
#> [109] lmtest_0.9-40 S7_0.2.1-1 SingleCellExperiment_1.32.0 XVector_0.50.0
#> [113] htmltools_0.5.9 bookdown_0.46 dotCall64_1.2 SeuratObject_5.4.0
#> [117] scales_1.4.0 Biobase_2.70.0 png_0.1-9 spatstat.univar_3.1-7
#> [121] knitr_1.51 rstudioapi_0.18.0 reshape2_1.4.5 checkmate_2.3.4
#> [125] nlme_3.1-168 curl_7.0.0 anndataR_1.0.2 rhdf5_2.54.1
#> [129] cachem_1.1.0 zoo_1.8-15 stringr_1.6.0 KernSmooth_2.23-26
#> [133] parallel_4.5.3 miniUI_0.1.2 arrow_23.0.1.2 zellkonverter_1.20.1
#> [137] desc_1.4.3 pillar_1.11.1 grid_4.5.3 vctrs_0.7.3
#> [141] RANN_2.6.2 promises_1.5.0 dbplyr_2.5.2 xtable_1.8-8
#> [145] cluster_2.1.8.2 evaluate_1.0.5 cli_3.6.6 compiler_4.5.3
#> [149] rlang_1.2.0 future.apply_1.20.2 forcats_1.0.1 plyr_1.8.9
#> [153] fs_2.0.1 stringi_1.8.7 viridisLite_0.4.3 deldir_2.0-4
#> [157] assertthat_0.2.1 lazyeval_0.2.3 devtools_2.5.0 spatstat.geom_3.7-3
#> [161] Matrix_1.7-4 dir.expiry_1.18.0 RcppHNSW_0.6.0 patchwork_1.3.2
#> [165] bit64_4.6.0-1 future_1.70.0 Rhdf5lib_1.32.0 shiny_1.13.0
#> [169] SummarizedExperiment_1.40.0 ROCR_1.0-12 igraph_2.2.3 memoise_2.0.1
#> [173] bslib_0.10.0 bit_4.6.0