The Barcode, UMI, Set (BUS) format is a new way to represent pseudoalignments of reads from RNA-seq. Files of this format can be efficiently generated by the command line tool kallisto bus. With kallisto bus and this package, we go from the fastq files to the sparse matrix used for downstream analysis, such as with Seurat, within half an hour, while Cell Ranger would take hours.
In this vignette, we convert a 10x 1:1 mouse and human cell mixture dataset from the BUS format to a sparse matrix. To see how the BUS format can be generated from fastq files, as well as more in-depth vignettes, see the website of this package.
Note that this vignette is deprecated and is kept for historical reasons, as it was implemented when kallisto | bustools was experimental. The functionality of make_sparse_matrix has been implemented more efficiently in the command line tool bustools. Please use the updated version of bustools and, if you wish, the wrapper kb instead.
# The dataset package
library(TENxBUSData)
library(BUSpaRse)
library(Matrix)
library(zeallot)
library(ggplot2)
TENxBUSData(".", dataset = "hgmm100")
#> see ?TENxBUSData and browseVignettes('TENxBUSData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> The downloaded files are in /tmp/Rtmpti4Wql/Rbuild1de21855c6c5/BUSpaRse/vignettes/out_hgmm100
#> [1] "/tmp/Rtmpti4Wql/Rbuild1de21855c6c5/BUSpaRse/vignettes/out_hgmm100"
First, we map transcripts, as in the kallisto index, to the corresponding genes.
tr2g <- transcript2gene(species = c("Homo sapiens", "Mus musculus"), type = "vertebrate",
                        kallisto_out_path = "./out_hgmm100", ensembl_version = 99,
                        write_tr2g = FALSE)
#> Querying biomart for transcript and gene IDs of Homo sapiens
#> Querying biomart for transcript and gene IDs of Mus musculus
head(tr2g)
#> transcript gene gene_name chromosome_name
#> 1 ENST00000434970.2 ENSG00000237235.2 TRDD2 14
#> 2 ENST00000415118.1 ENSG00000223997.1 TRDD1 14
#> 3 ENST00000448914.1 ENSG00000228985.1 TRDD3 14
#> 4 ENST00000632684.1 ENSG00000282431.1 TRBD1 7
#> 5 ENST00000390583.1 ENSG00000211923.1 IGHD3-10 14
#> 6 ENST00000431440.2 ENSG00000232543.2 IGHD4-11 14
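To make the role of this table concrete, here is a minimal sketch of how a transcript-to-gene lookup can resolve the set of transcripts a read is compatible with. The IDs, the names tr2g_toy, tx_to_gene, ec_same and ec_mixed, and the rule that a read counts toward a gene only when all of its transcripts map to that one gene are assumptions for illustration, not a description of what this package does internally.

```r
# Toy transcript-to-gene table; the IDs are made up for illustration
tr2g_toy <- data.frame(
  transcript = c("tx1", "tx2", "tx3", "tx4"),
  gene = c("geneA", "geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)

# Named vector for lookup: names are transcripts, values are genes
tx_to_gene <- setNames(tr2g_toy$gene, tr2g_toy$transcript)

# Two hypothetical sets of transcripts that a read is compatible with
ec_same <- c("tx1", "tx2")   # both transcripts belong to geneA
ec_mixed <- c("tx2", "tx3")  # transcripts from two different genes

# Distinct genes behind each set
genes_same <- unique(unname(tx_to_gene[ec_same]))
genes_mixed <- unique(unname(tx_to_gene[ec_mixed]))

# Assumed convention: a read is unambiguous only if it maps to exactly one gene
length(genes_same) == 1
length(genes_mixed) == 1
```

Under this assumed convention, the first read would be counted for geneA, while the second would be treated as ambiguous.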
Here we make both the gene count matrix and the TCC (transcript compatibility count) matrix with make_sparse_matrix; they are stored as gene_count and tcc.
Here we have a sparse matrix with genes in rows and cells in columns.
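As a small illustration of this layout, the sketch below builds a toy sparse matrix with the Matrix package, with genes in rows and barcodes in columns. The object name toy and all counts are made up; this is not the matrix produced from the hgmm100 data.

```r
library(Matrix)

# Toy counts in triplet form: (gene index, barcode index, UMI count)
toy <- sparseMatrix(
  i = c(1, 3, 2, 1, 3),
  j = c(1, 1, 2, 3, 3),
  x = c(5, 2, 1, 7, 4),
  dims = c(3, 4),
  dimnames = list(paste0("gene", 1:3), paste0("barcode", 1:4))
)

dim(toy)     # rows are genes, columns are barcodes
nnzero(toy)  # only the non-zero entries are stored
class(toy)   # column-compressed sparse matrix class from Matrix
```

Because most gene and barcode combinations have no counts, storing only the non-zero entries keeps memory use manageable even with more than 100,000 barcodes.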
This dataset should only have about 100 cells, but here we get over 100,000 barcodes. In fact, most of the barcodes correspond to empty droplets; they can be removed by filtering out barcodes with too few UMIs.
tot_counts <- Matrix::colSums(gene_count)
summary(tot_counts)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.00 1.00 2.00 24.77 5.00 64041.00
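The filtering step described above can be sketched on a toy matrix as follows. The threshold min_umi and the toy counts are hypothetical; for real data, the cutoff should be chosen from the distribution of total counts rather than copied from here.

```r
library(Matrix)

# Toy matrix: 3 genes x 5 barcodes; barcodes 2 and 4 mimic real cells,
# the others mimic empty droplets with only a few UMIs
toy <- sparseMatrix(
  i = c(1, 1, 2, 3, 2, 1, 2, 3, 3),
  j = c(1, 2, 2, 2, 3, 4, 4, 4, 5),
  x = c(1, 40, 25, 60, 2, 80, 30, 15, 1),
  dims = c(3, 5),
  dimnames = list(paste0("gene", 1:3), paste0("barcode", 1:5))
)

# Total UMI count per barcode (column)
tot <- Matrix::colSums(toy)

# Hypothetical threshold: keep barcodes with at least this many UMIs
min_umi <- 10
toy_filtered <- toy[, tot >= min_umi, drop = FALSE]

colnames(toy_filtered)
```

Using drop = FALSE keeps the result a matrix even if only one barcode passes the threshold.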
After filtering, this sparse matrix can be used in Seurat for downstream analysis.
Likewise, we can remove empty droplets from the TCC matrix.
tot_counts <- Matrix::colSums(tcc)
summary(tot_counts)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.00 1.00 2.00 25.84 5.00 69235.00
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 zeallot_0.1.0 Matrix_1.7-1 BUSpaRse_1.21.0
#> [5] TENxBUSData_1.19.0 BiocStyle_2.33.1
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 bitops_1.0-9
#> [3] httr2_1.0.5 biomaRt_2.61.3
#> [5] rlang_1.1.4 magrittr_2.0.3
#> [7] matrixStats_1.4.1 compiler_4.4.1
#> [9] RSQLite_2.3.7 GenomicFeatures_1.57.1
#> [11] png_0.1-8 vctrs_0.6.5
#> [13] stringr_1.5.1 ProtGenerics_1.37.1
#> [15] pkgconfig_2.0.3 crayon_1.5.3
#> [17] fastmap_1.2.0 dbplyr_2.5.0
#> [19] XVector_0.45.0 utf8_1.2.4
#> [21] Rsamtools_2.21.2 rmarkdown_2.28
#> [23] UCSC.utils_1.1.0 purrr_1.0.2
#> [25] bit_4.5.0 xfun_0.48
#> [27] zlibbioc_1.51.2 cachem_1.1.0
#> [29] GenomeInfoDb_1.41.2 jsonlite_1.8.9
#> [31] progress_1.2.3 blob_1.2.4
#> [33] highr_0.11 DelayedArray_0.31.14
#> [35] BiocParallel_1.39.0 parallel_4.4.1
#> [37] prettyunits_1.2.0 plyranges_1.25.0
#> [39] R6_2.5.1 bslib_0.8.0
#> [41] stringi_1.8.4 rtracklayer_1.65.0
#> [43] GenomicRanges_1.57.2 jquerylib_0.1.4
#> [45] Rcpp_1.0.13 SummarizedExperiment_1.35.5
#> [47] knitr_1.48 IRanges_2.39.2
#> [49] tidyselect_1.2.1 abind_1.4-8
#> [51] yaml_2.3.10 codetools_0.2-20
#> [53] curl_5.2.3 lattice_0.22-6
#> [55] tibble_3.2.1 withr_3.0.2
#> [57] Biobase_2.65.1 KEGGREST_1.45.1
#> [59] evaluate_1.0.1 BiocFileCache_2.13.2
#> [61] xml2_1.3.6 ExperimentHub_2.13.1
#> [63] Biostrings_2.73.2 pillar_1.9.0
#> [65] BiocManager_1.30.25 filelock_1.0.3
#> [67] MatrixGenerics_1.17.1 stats4_4.4.1
#> [69] generics_0.1.3 RCurl_1.98-1.16
#> [71] BiocVersion_3.20.0 ensembldb_2.29.1
#> [73] S4Vectors_0.43.2 hms_1.1.3
#> [75] munsell_0.5.1 scales_1.3.0
#> [77] glue_1.8.0 lazyeval_0.2.2
#> [79] maketools_1.3.1 tools_4.4.1
#> [81] AnnotationHub_3.15.0 BiocIO_1.15.2
#> [83] sys_3.4.3 BSgenome_1.73.1
#> [85] GenomicAlignments_1.41.0 buildtools_1.0.0
#> [87] XML_3.99-0.17 grid_4.4.1
#> [89] tidyr_1.3.1 colorspace_2.1-1
#> [91] AnnotationDbi_1.69.0 GenomeInfoDbData_1.2.13
#> [93] restfulr_0.0.15 cli_3.6.3
#> [95] rappdirs_0.3.3 fansi_1.0.6
#> [97] S4Arrays_1.5.11 dplyr_1.1.4
#> [99] AnnotationFilter_1.31.0 gtable_0.3.6
#> [101] sass_0.4.9 digest_0.6.37
#> [103] BiocGenerics_0.51.3 SparseArray_1.5.45
#> [105] farver_2.1.2 rjson_0.2.23
#> [107] memoise_2.0.1 htmltools_0.5.8.1
#> [109] lifecycle_1.0.4 httr_1.4.7
#> [111] mime_0.12 bit64_4.5.2