Package: BUSpaRse 1.21.0
BUSpaRse: kallisto | bustools R utilities
The kallisto | bustools pipeline is a fast and modular set of tools to convert single cell RNA-seq reads in fastq files into gene count or transcript compatibility counts (TCC) matrices for downstream analysis. Central to this pipeline is the barcode, UMI, and set (BUS) file format. This package serves the following purposes: First, this package allows users to manipulate BUS format files as data frames in R and then convert them into gene count or TCC matrices. Furthermore, since R and Rcpp code is easier to handle than pure C++ code, users are encouraged to tweak the source code of this package to experiment with new uses of the BUS format and different ways to convert a BUS file into a gene count matrix. Second, this package can conveniently generate the files required to produce gene count matrices for spliced and unspliced transcripts for RNA velocity. Here biotypes can be filtered, scaffolds and haplotypes can be removed, and the filtered transcriptome can be extracted and written to disk. Third, this package implements utility functions to get transcripts and associated genes required to convert BUS files to gene count matrices, to write the transcript to gene information in the format required by bustools, and to read the output of bustools into R as sparse matrices.
Authors: Lambda Moses, Lior Pachter
BUSpaRse_1.21.0.tar.gz
BUSpaRse_1.21.0.zip (r-4.5), BUSpaRse_1.21.0.zip (r-4.4), BUSpaRse_1.21.0.zip (r-4.3)
BUSpaRse_1.21.0.tgz (r-4.5-x86_64), BUSpaRse_1.21.0.tgz (r-4.5-arm64), BUSpaRse_1.21.0.tgz (r-4.4-x86_64), BUSpaRse_1.21.0.tgz (r-4.4-arm64), BUSpaRse_1.21.0.tgz (r-4.3-x86_64), BUSpaRse_1.21.0.tgz (r-4.3-arm64)
BUSpaRse_1.21.0.tar.gz (r-4.5-noble), BUSpaRse_1.21.0.tar.gz (r-4.4-noble)
BUSpaRse_1.21.0.tgz (r-4.4-emscripten), BUSpaRse_1.21.0.tgz (r-4.3-emscripten)
# Install 'BUSpaRse' in R:
install.packages('BUSpaRse', repos = c('https://bioc.r-universe.dev', 'https://cloud.r-project.org'))
Bug tracker: https://github.com/bustools/busparse/issues
- cellranger_biotypes - Cell Ranger gene biotypes
- ensembl_gene_biotypes - Gene biotypes from Ensembl
- ensembl_gff_mcols - These are the column names of the 'mcols' when the Ensembl GTF file is read into R as a 'GRanges', including 'gene_id', 'transcript_id', 'biotype', 'description', and so on, and the mandatory tags like 'ID', 'Name', and 'Parent'.
- ensembl_gtf_mcols - Tags in the attributes field of Ensembl GTF files
- ensembl_tx_biotypes - Transcript biotypes from Ensembl
- refseq_gff_mcols - Tags in the attributes field of RefSeq GFF files
On Bioconductor: BUSpaRse-1.21.0 (bioc 3.21), BUSpaRse-1.20.0 (bioc 3.20)
singlecell, rnaseq, workflowstep, cpp
Last updated 5 months ago from: 0e32057b9f. Checks: 1 OK, 10 NOTE. Indexed: yes.
Target | Result | Latest binary |
---|---|---|
Doc / Vignettes | OK | Feb 27 2025 |
R-4.5-win-x86_64 | NOTE | Feb 27 2025 |
R-4.5-mac-x86_64 | NOTE | Feb 27 2025 |
R-4.5-mac-aarch64 | NOTE | Feb 27 2025 |
R-4.5-linux-x86_64 | NOTE | Feb 27 2025 |
R-4.4-win-x86_64 | NOTE | Feb 27 2025 |
R-4.4-mac-x86_64 | NOTE | Feb 27 2025 |
R-4.4-mac-aarch64 | NOTE | Feb 27 2025 |
R-4.3-win-x86_64 | NOTE | Feb 27 2025 |
R-4.3-mac-x86_64 | NOTE | Feb 27 2025 |
R-4.3-mac-aarch64 | NOTE | Feb 27 2025 |
Exports: annots_from_fa_df, annots_from_fa_GRanges, dl_transcriptome, EC2gene, get_inflection, get_knee_df, get_velocity_files, knee_plot, make_sparse_matrix, read_count_output, read_velocity_output, save_tr2g_bustools, sort_tr2g, species2dataset, subset_annot, tr2g_EnsDb, tr2g_ensembl, tr2g_fasta, tr2g_gff3, tr2g_gtf, tr2g_TxDb, transcript2gene
Dependencies: abind, AnnotationDbi, AnnotationFilter, askpass, BH, Biobase, BiocFileCache, BiocGenerics, BiocIO, BiocParallel, biomaRt, Biostrings, bit, bit64, bitops, blob, BSgenome, cachem, cli, codetools, colorspace, cpp11, crayon, curl, DBI, dbplyr, DelayedArray, digest, dplyr, ensembldb, fansi, farver, fastmap, filelock, formatR, futile.logger, futile.options, generics, GenomeInfoDb, GenomeInfoDbData, GenomicAlignments, GenomicFeatures, GenomicRanges, ggplot2, glue, gtable, hms, httr, httr2, IRanges, isoband, jsonlite, KEGGREST, labeling, lambda.r, lattice, lazyeval, lifecycle, magrittr, MASS, Matrix, MatrixGenerics, matrixStats, memoise, mgcv, mime, munsell, nlme, openssl, pillar, pkgconfig, plogr, plyranges, png, prettyunits, progress, ProtGenerics, purrr, R6, rappdirs, RColorBrewer, Rcpp, RcppArmadillo, RcppProgress, RCurl, restfulr, Rhtslib, rjson, rlang, Rsamtools, RSQLite, rtracklayer, S4Arrays, S4Vectors, scales, snow, SparseArray, stringi, stringr, SummarizedExperiment, sys, tibble, tidyr, tidyselect, UCSC.utils, utf8, vctrs, viridisLite, withr, XML, xml2, XVector, yaml, zeallot
Citation
To cite package ‘BUSpaRse’ in publications use:
Moses L, Pachter L (2024). BUSpaRse: kallisto | bustools R utilities. R package version 1.21.0, https://bioconductor.org/packages/BUSpaRse.
Corresponding BibTeX entry:
@Manual{,
  title = {BUSpaRse: kallisto | bustools R utilities},
  author = {Lambda Moses and Lior Pachter},
  year = {2024},
  note = {R package version 1.21.0},
  url = {https://bioconductor.org/packages/BUSpaRse},
}
Readme and manuals
BUSpaRse
This package processes bus files generated from single-cell RNA-seq FASTQ files, e.g. using kallisto. The bus format is a table with 4 columns: barcode, UMI, set, and counts, which represent key information in single-cell RNA-seq datasets. See this paper for more information about the bus format. A gene count matrix for a single-cell RNA-seq experiment can be generated with the kallisto bus command and the bustools suite of programs many times faster than with other programs.
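As a rough illustration of the four-column layout described above, the following base R sketch builds a toy table and merges duplicate records. The column names and values are made up for illustration; they are not taken from real kallisto output or from this package's API.

```r
# Toy BUS records; the column names are illustrative assumptions.
bus <- data.frame(
  barcode = c("AAAC", "AAAC", "AAAC", "TTTG", "TTTG"),
  umi     = c("GGA", "GGA", "CCT", "ACG", "TGA"),
  ec      = c(0L, 0L, 1L, 0L, 2L),   # equivalence class ("set") index
  count   = c(2L, 1L, 1L, 3L, 1L),   # reads supporting this record
  stringsAsFactors = FALSE
)

# Merge records that share barcode, UMI, and set by summing their counts.
bus_merged <- aggregate(count ~ barcode + umi + ec, data = bus, FUN = sum)

# Number of distinct UMIs observed per barcode.
umis_per_barcode <- tapply(bus_merged$umi, bus_merged$barcode,
                           function(x) length(unique(x)))
```

Once the records are in a data frame like this, ordinary R tools can be used to filter or summarise them.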
The most recent version of bustools can convert bus files to gene count and transcript compatibility count (TCC) matrices very efficiently. This package has an alternative implementation of the algorithm that converts bus files to gene count and TCC matrices. This implementation is much less efficient (though still many times faster than, e.g., Cell Ranger). The purpose of this implementation is to facilitate experimentation with new algorithms or to adapt the methods for other applications. It is written in Rcpp, which is easier to work with than pure C++ code and requires less C++ expertise.
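To make the idea of converting BUS records into a gene count matrix concrete, here is a deliberately simplified base R sketch. It is not the algorithm implemented in this package or in bustools: the equivalence-class-to-gene map (ec2gene) and the rule of discarding any record whose set maps to more than one gene are assumptions chosen to keep the example short.

```r
# Toy BUS records after sorting and merging (hypothetical column names).
bus <- data.frame(
  barcode = c("AAAC", "AAAC", "AAAC", "TTTG", "TTTG"),
  umi     = c("GGA", "CCT", "TAT", "ACG", "TGA"),
  ec      = c(1L, 2L, 3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical map from equivalence class to the genes of its transcripts.
ec2gene <- list(`1` = "geneA", `2` = c("geneA", "geneB"), `3` = "geneB")

# Keep only records whose equivalence class maps to exactly one gene.
genes <- ec2gene[as.character(bus$ec)]
unique_hit <- lengths(genes) == 1L
assigned <- data.frame(
  barcode = bus$barcode[unique_hit],
  umi     = bus$umi[unique_hit],
  gene    = unlist(genes[unique_hit], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Count each barcode/UMI/gene combination once, then tabulate genes by barcodes.
assigned <- unique(assigned)
gene_counts <- table(gene = assigned$gene, barcode = assigned$barcode)
```

Because every step is plain R, swapping in a different rule for ambiguous sets is a small change, which is the kind of experimentation the paragraph above has in mind.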
A file mapping transcripts to genes is required to convert the bus file to a gene count matrix, either with bustools or with this package. This package contains functions that produce this file or data frame by directly querying Ensembl, by parsing GTF or GFF3 files, by extracting information from TxDb or EnsDb gene annotation resources from Bioconductor, or by parsing sequence names of FASTA files of transcriptomes downloaded from Ensembl. This package can query Ensembl not only for vertebrates (i.e. www.ensembl.org), but also for plants, fungi, invertebrates, and protists. The functions that map transcripts to genes can also filter by biotype, keep only standard chromosomes, and extract filtered transcriptomes.
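The following base R sketch shows, under simplifying assumptions, what a transcript-to-gene table derived from GTF attributes might look like. It does not call the package's tr2g_* functions; get_tag is a hypothetical helper, the attribute strings are toy values, and the headerless two-column tab-separated output is assumed here to be the layout bustools reads.

```r
# Attribute strings as they might appear in column 9 of an Ensembl-style GTF
# (toy values; real files carry many more tags).
attrs <- c(
  'gene_id "G1"; transcript_id "T1"; gene_name "Abc";',
  'gene_id "G1"; transcript_id "T2"; gene_name "Abc";',
  'gene_id "G2"; transcript_id "T3"; gene_name "Xyz";'
)

# Extract the quoted value that follows a given tag (hypothetical helper).
get_tag <- function(x, tag) {
  pattern <- paste0('.*', tag, ' "([^"]+)".*')
  sub(pattern, "\\1", x)
}

tr2g <- unique(data.frame(
  transcript = get_tag(attrs, "transcript_id"),
  gene       = get_tag(attrs, "gene_id"),
  stringsAsFactors = FALSE
))

# Two tab-separated columns without a header; assumed here to be the layout
# bustools reads.
out_file <- tempfile(fileext = ".tsv")
write.table(tr2g, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

In practice the exported tr2g_gtf and save_tr2g_bustools functions cover this use case; see their help pages for the actual arguments.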
This package can also generate the files required for running RNA velocity with kallisto and bustools, including a FASTA file with not only the transcriptome but also appropriately flanked intronic sequences, lists of transcripts and introns to be captured, and a file mapping transcripts and introns to genes. For spliced transcripts, you may use either the cDNA sequences or exon-exon junctions for pseudoalignment. Using exon-exon junctions should distinguish spliced from unspliced transcripts less ambiguously, since unspliced transcripts also contain exonic sequences.
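The notion of flanked intronic sequences can be sketched with coordinates alone. In the base R example below, the flank length of read length minus one and the clamping to the transcript boundaries are assumptions made for illustration; the exon coordinates are toy values, and this is not the code used by get_velocity_files.

```r
# Exons of one toy transcript on the plus strand (1-based, inclusive).
exons <- data.frame(start = c(100L, 300L, 600L), end = c(200L, 400L, 700L))

# Assumption: flank each intron by read length minus one, so a read that
# overlaps an intron by at least one base can still map to it.
read_length <- 91L
flank <- read_length - 1L

# Introns are the gaps between consecutive exons.
n <- nrow(exons)
introns <- data.frame(start = exons$end[-n] + 1L, end = exons$start[-1] - 1L)

# Extend both sides by the flank, without running past the transcript ends.
flanked <- data.frame(
  start = pmax(introns$start - flank, min(exons$start)),
  end   = pmin(introns$end + flank, max(exons$end))
)
```

Each flanked range therefore includes part of the neighbouring exons, which is why reads spanning an exon-intron boundary can be attributed to the unspliced form.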
Example
See the vignettes for examples of using kallisto bus, bustools, and BUSpaRse on real data. The vignettes contain a complete walk-through, starting with downloading the FASTQ files for an experiment and ending with an analysis. Google Colab versions of those vignettes can be found here. Also see browseVignettes("BUSpaRse") for vignettes on using BUSpaRse to get a gene count matrix and on extracting filtered transcriptomes with the tr2g_* functions.
Installation
You can install the development version of BUSpaRse from GitHub with:
# Install devtools only if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("BUStools/BUSpaRse")
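The release version can be installed from Bioconductor; to get the Bioconductor development version instead, pass the version = "devel" argument:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("BUSpaRse")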