Public Data Integration using Sfaira

library(SimBu)

Sfaira Integration

This vignette covers the integration of the public database sfaira.

Setup

SimBu uses sfaira (Fischer et al. 2020) as its public database. sfaira is a dataset and model repository for single-cell RNA-sequencing data that gives access to multiple datasets from human and mouse, with more than 3 million cells in total. You can browse them interactively here: https://theislab.github.io/sfaira-portal/Datasets. Note that only annotated datasets will be downloaded. Some datasets have private URLs and cannot be downloaded automatically; SimBu will skip these.
To use this database, we first need to install it. This is done by running the setup_sfaira() function for the first time. In the background, SimBu uses the basilisk package to establish a conda environment that has all sfaira dependencies installed. The installation is performed only once, even if you close your R session and call setup_sfaira() again. The given directory serves as the storage for all datasets downloaded from sfaira in the future:

setup_list <- SimBu::setup_sfaira(basedir = tempdir())
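Because the given directory stores the downloaded datasets, and tempdir() is removed when the R session ends, a persistent directory may be preferable outside of a vignette. The following is a minimal sketch under that assumption; make_sfaira_basedir() is a hypothetical helper (not part of SimBu), and the cache location chosen via tools::R_user_dir() is only one possible choice.

```r
# Hypothetical helper (not part of SimBu): create, if needed, and return
# a persistent directory that can be passed to setup_sfaira() as basedir.
make_sfaira_basedir <- function(root = tools::R_user_dir("SimBu", which = "cache")) {
  basedir <- file.path(root, "sfaira")
  # recursive = TRUE also creates missing parent directories
  dir.create(basedir, recursive = TRUE, showWarnings = FALSE)
  normalizePath(basedir, mustWork = TRUE)
}

# Usage (not run here, because it triggers the sfaira installation):
# setup_list <- SimBu::setup_sfaira(basedir = make_sfaira_basedir())
```

With such a directory, datasets downloaded in one session should still be on disk in the next one, whereas files under tempdir() are lost.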

Creating a dataset

We will now create a dataset of samples from human pancreas using the organisms and tissues parameters. You can provide a single value (as we do here) or a vector of tissues you want to download, for example c("pancreas","lung"). An additional parameter, assays, subsets the database further so that only datasets from certain sequencing assays (for example Smart-seq2) are downloaded.
The name parameter is used later on to give each sample (cell) a unique name.

ds_pancreas <- SimBu::dataset_sfaira_multiple(
  sfaira_setup = setup_list,
  organisms = "Homo sapiens",
  tissues = "pancreas",
  name = "human_pancreas"
)

Currently, sfaira contains three human pancreas datasets with cell-type annotation. The package downloads them automatically and merges them into a single expression matrix and a streamlined annotation table, which we can use for our simulation.
Some datasets from sfaira may not (yet) be ready for automatic download. In that case, an error message appears in R, telling you which file to download and where to put it.
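Since a download can stop with an error, a script that requests several tissues may want to catch that error and continue. The sketch below shows one way to do this with base R's tryCatch(); safe_download() is a hypothetical helper, not a SimBu function, and the assays value in the commented usage is taken from the example named above and assumed to match the spelling sfaira expects.

```r
# Hypothetical helper (not part of SimBu): call a download function and
# return NULL with a message instead of stopping when it raises an error.
safe_download <- function(download_fun, ...) {
  tryCatch(
    download_fun(...),
    error = function(e) {
      # The error message from SimBu names the file to download manually
      message("Download failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Usage (not run here, requires the sfaira setup and network access):
# ds_multi <- safe_download(
#   SimBu::dataset_sfaira_multiple,
#   sfaira_setup = setup_list,
#   organisms = "Homo sapiens",
#   tissues = c("pancreas", "lung"),
#   assays = "Smart-seq2",
#   name = "human_pancreas_lung"
# )
```

A NULL result then signals that the printed message should be read and the named file placed manually before retrying.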

If you wish to see all datasets that are included in sfaira, you can use the following command:

all_datasets <- SimBu::sfaira_overview(setup_list = setup_list)
head(all_datasets)

This allows you to find the IDs of specific datasets, which you can then download directly:

SimBu::dataset_sfaira(
  sfaira_id = "homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x",
  sfaira_setup = setup_list,
  name = "dataset_by_id"
)
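To narrow the overview table down to candidate IDs before calling dataset_sfaira(), a simple keyword search over all columns can help. This is a sketch that assumes only that sfaira_overview() returns something coercible to a data frame; it makes no assumption about column names, and filter_overview() is a hypothetical helper.

```r
# Hypothetical helper (not part of SimBu): keep the rows of a table in which
# at least one column matches the pattern (case-insensitive).
filter_overview <- function(overview, pattern) {
  overview <- as.data.frame(overview)
  hits <- vapply(
    overview,
    function(col) grepl(pattern, as.character(col), ignore.case = TRUE),
    logical(nrow(overview))
  )
  # Force a rows-by-columns matrix, also for tables with a single row
  hits <- matrix(hits, nrow = nrow(overview))
  overview[rowSums(hits) > 0, , drop = FALSE]
}

# Toy table standing in for the real overview (IDs are made up)
toy <- data.frame(
  id = c("homosapiens_lung_x", "musmusculus_pancreas_y"),
  year = c(2019, 2020)
)
filter_overview(toy, "lung")
```

With the real table, a call such as filter_overview(all_datasets, "lung") would return the matching rows for inspection.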
utils::sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] SimBu_1.9.0    rmarkdown_2.28
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.36.0 gtable_0.3.6               
#>  [3] xfun_0.48                   bslib_0.8.0                
#>  [5] ggplot2_3.5.1               Biobase_2.67.0             
#>  [7] lattice_0.22-6              vctrs_0.6.5                
#>  [9] tools_4.4.1                 generics_0.1.3             
#> [11] stats4_4.4.1                parallel_4.4.1             
#> [13] tibble_3.2.1                fansi_1.0.6                
#> [15] highr_0.11                  pkgconfig_2.0.3            
#> [17] Matrix_1.7-1                data.table_1.16.2          
#> [19] RColorBrewer_1.1-3          S4Vectors_0.44.0           
#> [21] sparseMatrixStats_1.18.0    lifecycle_1.0.4            
#> [23] GenomeInfoDbData_1.2.13     farver_2.1.2               
#> [25] compiler_4.4.1              munsell_0.5.1              
#> [27] codetools_0.2-20            GenomeInfoDb_1.43.0        
#> [29] htmltools_0.5.8.1           sys_3.4.3                  
#> [31] buildtools_1.0.0            sass_0.4.9                 
#> [33] yaml_2.3.10                 pillar_1.9.0               
#> [35] crayon_1.5.3                jquerylib_0.1.4            
#> [37] tidyr_1.3.1                 BiocParallel_1.41.0        
#> [39] DelayedArray_0.33.1         cachem_1.1.0               
#> [41] abind_1.4-8                 tidyselect_1.2.1           
#> [43] digest_0.6.37               dplyr_1.1.4                
#> [45] purrr_1.0.2                 labeling_0.4.3             
#> [47] maketools_1.3.1             fastmap_1.2.0              
#> [49] grid_4.4.1                  colorspace_2.1-1           
#> [51] cli_3.6.3                   SparseArray_1.6.0          
#> [53] magrittr_2.0.3              S4Arrays_1.6.0             
#> [55] utf8_1.2.4                  withr_3.0.2                
#> [57] UCSC.utils_1.2.0            scales_1.3.0               
#> [59] XVector_0.46.0              httr_1.4.7                 
#> [61] matrixStats_1.4.1           proxyC_0.4.1               
#> [63] evaluate_1.0.1              knitr_1.48                 
#> [65] GenomicRanges_1.59.0        IRanges_2.41.0             
#> [67] rlang_1.1.4                 Rcpp_1.0.13                
#> [69] glue_1.8.0                  BiocGenerics_0.53.1        
#> [71] jsonlite_1.8.9              R6_2.5.1                   
#> [73] MatrixGenerics_1.19.0       zlibbioc_1.52.0
Fischer, David S., Leander Dony, Martin König, Abdul Moeed, Luke Zappia, Sophie Tritschler, Olle Holmberg, Hananeh Aliee, and Fabian J. Theis. 2020. “Sfaira Accelerates Data and Model Reuse in Single Cell Genomics.” bioRxiv. https://doi.org/10.1101/2020.12.16.419036.