Package: BUSpaRse 1.21.0

Lambda Moses

BUSpaRse: kallisto | bustools R utilities

The kallisto | bustools pipeline is a fast and modular set of tools to convert single-cell RNA-seq reads in FASTQ files into gene count or transcript compatibility count (TCC) matrices for downstream analysis. Central to this pipeline is the barcode, UMI, and set (BUS) file format. This package serves three purposes. First, it allows users to manipulate BUS format files as data frames in R and then convert them into gene count or TCC matrices. Since R and Rcpp code is easier to handle than pure C++ code, users are encouraged to tweak the source code of this package to experiment with new uses of the BUS format and different ways to convert a BUS file into a gene count matrix. Second, it can conveniently generate the files required to produce gene count matrices for spliced and unspliced transcripts for RNA velocity. Here biotypes can be filtered, scaffolds and haplotypes can be removed, and the filtered transcriptome can be extracted and written to disk. Third, it implements utility functions to get transcripts and associated genes required to convert BUS files to gene count matrices, to write the transcript-to-gene information in the format required by bustools, and to read the output of bustools into R as sparse matrices.

Authors: Lambda Moses [aut, cre], Lior Pachter [aut, ths]

# Install 'BUSpaRse' in R:
install.packages('BUSpaRse', repos = c('https://bioc.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/bustools/busparse/issues

Uses libs:
  • c++: GNU Standard C++ Library v3

On Bioconductor: BUSpaRse-1.21.0 (bioc 3.21), BUSpaRse-1.20.0 (bioc 3.20)

Topics: singlecell, rnaseq, workflowstep, cpp

7.35 score 9 stars 165 scripts 383 downloads 1 mentions 22 exports 114 dependencies

Last updated 5 months ago from: 0e32057b9f. Checks: 1 OK, 10 NOTE. Indexed: yes.

Target | Result | Latest binary
Doc / Vignettes | OK | Feb 27 2025
R-4.5-win-x86_64 | NOTE | Feb 27 2025
R-4.5-mac-x86_64 | NOTE | Feb 27 2025
R-4.5-mac-aarch64 | NOTE | Feb 27 2025
R-4.5-linux-x86_64 | NOTE | Feb 27 2025
R-4.4-win-x86_64 | NOTE | Feb 27 2025
R-4.4-mac-x86_64 | NOTE | Feb 27 2025
R-4.4-mac-aarch64 | NOTE | Feb 27 2025
R-4.3-win-x86_64 | NOTE | Feb 27 2025
R-4.3-mac-x86_64 | NOTE | Feb 27 2025
R-4.3-mac-aarch64 | NOTE | Feb 27 2025

Exports: annots_from_fa_df, annots_from_fa_GRanges, dl_transcriptome, EC2gene, get_inflection, get_knee_df, get_velocity_files, knee_plot, make_sparse_matrix, read_count_output, read_velocity_output, save_tr2g_bustools, sort_tr2g, species2dataset, subset_annot, tr2g_EnsDb, tr2g_ensembl, tr2g_fasta, tr2g_gff3, tr2g_gtf, tr2g_TxDb, transcript2gene

Dependencies: abind, AnnotationDbi, AnnotationFilter, askpass, BH, Biobase, BiocFileCache, BiocGenerics, BiocIO, BiocParallel, biomaRt, Biostrings, bit, bit64, bitops, blob, BSgenome, cachem, cli, codetools, colorspace, cpp11, crayon, curl, DBI, dbplyr, DelayedArray, digest, dplyr, ensembldb, fansi, farver, fastmap, filelock, formatR, futile.logger, futile.options, generics, GenomeInfoDb, GenomeInfoDbData, GenomicAlignments, GenomicFeatures, GenomicRanges, ggplot2, glue, gtable, hms, httr, httr2, IRanges, isoband, jsonlite, KEGGREST, labeling, lambda.r, lattice, lazyeval, lifecycle, magrittr, MASS, Matrix, MatrixGenerics, matrixStats, memoise, mgcv, mime, munsell, nlme, openssl, pillar, pkgconfig, plogr, plyranges, png, prettyunits, progress, ProtGenerics, purrr, R6, rappdirs, RColorBrewer, Rcpp, RcppArmadillo, RcppProgress, RCurl, restfulr, Rhtslib, rjson, rlang, Rsamtools, RSQLite, rtracklayer, S4Arrays, S4Vectors, scales, snow, SparseArray, stringi, stringr, SummarizedExperiment, sys, tibble, tidyr, tidyselect, UCSC.utils, utf8, vctrs, viridisLite, withr, XML, xml2, XVector, yaml, zeallot

Converting BUS format into sparse matrix

Rendered from sparse-matrix.Rmd using knitr::rmarkdown on Feb 27 2025.

Last update: 2021-03-01
Started: 2019-06-18

Generate transcript to gene file for bustools

Rendered from tr2g.Rmd using knitr::rmarkdown on Feb 27 2025.

Last update: 2024-07-31
Started: 2019-06-18

Citation

To cite package ‘BUSpaRse’ in publications use:

Moses L, Pachter L (2024). BUSpaRse: kallisto | bustools R utilities. R package version 1.21.0, https://bioconductor.org/packages/BUSpaRse.

Corresponding BibTeX entry:

  @Manual{,
    title = {BUSpaRse: kallisto | bustools R utilities},
    author = {Lambda Moses and Lior Pachter},
    year = {2024},
    note = {R package version 1.21.0},
    url = {https://bioconductor.org/packages/BUSpaRse},
  }

Readme and manuals

BUSpaRse

This package processes bus files generated from single-cell RNA-seq FASTQ files, e.g. using kallisto. The bus format is a table with 4 columns: Barcode, UMI, Set, and counts, that represent key information in single-cell RNA-seq datasets. See this paper for more information about the bus format. A gene count matrix for a single-cell RNA-seq experiment can be generated with the kallisto bus command and the bustools suite of programs many times faster than with other programs.
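As a rough illustration of the four-column table described above, the following base R sketch builds a toy BUS-like data frame and collapses it into a gene by barcode count table. Every value, the `ec2gene` mapping, and the collapsing rule (intersect the gene sets of all equivalence classes seen for a barcode/UMI pair, keep the pair only if exactly one gene remains) are assumptions made for illustration; this is not the code used by bustools or by this package.

```r
# Toy BUS records: one row per (barcode, UMI, set) with a read count.
# All values are invented for illustration.
bus <- data.frame(
  barcode = c("AAAC", "AAAC", "AAAC", "GGTT", "GGTT"),
  umi     = c("TTGA", "TTGA", "CCAT", "ACGT", "GTCA"),
  ec      = c(0L, 1L, 0L, 2L, 1L),   # "set" column: equivalence class index
  count   = c(3L, 1L, 2L, 5L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical mapping from equivalence class index to compatible genes.
ec2gene <- list(`0` = "geneA", `1` = c("geneA", "geneB"), `2` = "geneB")

# Group the equivalence classes observed for each barcode/UMI pair.
umi_key   <- paste(bus$barcode, bus$umi)
ec_by_umi <- split(as.character(bus$ec), umi_key)

# Intersect the gene sets of all equivalence classes of each pair.
gene_by_umi <- lapply(ec_by_umi, function(ecs) Reduce(intersect, ec2gene[ecs]))

# Keep only pairs that resolve to exactly one gene (assumed rule).
unique_gene <- Filter(function(g) length(g) == 1L, gene_by_umi)

# Count distinct UMIs per gene and barcode.
counts <- table(
  gene    = unname(unlist(unique_gene)),
  barcode = sub(" .*$", "", names(unique_gene))
)
print(counts)
```

Here the read counts in the `count` column are ignored on purpose, because each barcode/UMI pair is treated as one molecule regardless of how many reads support it.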

The most recent version of bustools can convert bus files to gene count and transcript compatibility count (TCC) matrices very efficiently. This package has an alternative implementation of the algorithm that converts bus files to gene count and TCC matrices. This implementation is much less efficient (though still many times faster than, e.g., Cell Ranger). Its purpose is to facilitate experimentation with new algorithms or to adapt the methods for other applications. The implementation in this package is written in Rcpp, which is easier to work with than pure C++ code and requires less C++ expertise.

A file mapping transcripts to genes is required to convert the bus file to a gene count matrix, either with bustools or with this package. This package contains functions that produce this file or data frame by directly querying Ensembl, by parsing GTF or GFF3 files, by extracting information from TxDb or EnsDb gene annotation resources from Bioconductor, or by parsing sequence names of FASTA files of transcriptomes downloaded from Ensembl. This package can query Ensembl not only for vertebrates (i.e. www.ensembl.org), but also for plants, fungi, invertebrates, and protists. The functions used to map transcripts to genes can now also filter by biotype, keep only standard chromosomes, and extract filtered transcriptomes.
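To make the transcript-to-gene file concrete, here is a minimal base R sketch that writes and re-reads such a table. The IDs are invented, and the layout (two tab-separated columns, transcript first and gene second, no header) is an assumption about what bustools expects; in practice the package's `tr2g_*` functions and `save_tr2g_bustools` are meant to produce this file, and their arguments are not shown here.

```r
# Toy transcript-to-gene table; the IDs are made up for illustration.
tr2g <- data.frame(
  transcript = c("ENST0001.1", "ENST0002.1", "ENST0003.2"),
  gene       = c("ENSG0001.1", "ENSG0001.1", "ENSG0002.3"),
  stringsAsFactors = FALSE
)

# Assumed layout: tab-separated, no header, no quoting, no row names.
out <- file.path(tempdir(), "tr2g.tsv")
write.table(tr2g, out, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read the file back to check that it round-trips.
back <- read.delim(out, header = FALSE,
                   col.names = c("transcript", "gene"),
                   stringsAsFactors = FALSE)
print(back)
```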

This package can also generate the files required for running RNA velocity with kallisto and bustools, including a fasta file with not only the transcriptome but also appropriately flanked intronic sequences, lists of transcripts and introns to be captured, and a file mapping transcripts and introns to genes. For spliced transcripts, you may either use the cDNA sequences, or exon-exon junctions, for pseudoalignment. Using exon-exon junctions should more unambiguously distinguish between spliced and unspliced transcripts, since unspliced transcripts also have exonic sequences.
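The idea of flanked intronic sequences can be sketched with plain coordinates. The example below derives introns as the gaps between consecutive exons of one hypothetical transcript and extends each intron on both sides. The exon coordinates, the read length `L`, and the choice of `L - 1` as flank length are all assumptions for illustration; the package's own functions operate on genome annotation objects and may differ in detail.

```r
# Exons of one hypothetical plus-strand transcript (1-based, inclusive).
exons <- data.frame(start = c(100L, 300L, 600L),
                    end   = c(200L, 450L, 700L))

L     <- 91L      # assumed read length
flank <- L - 1L   # assumed flank length on each side of an intron

# Introns are the gaps between consecutive exons.
n <- nrow(exons)
introns <- data.frame(start = exons$end[-n] + 1L,
                      end   = exons$start[-1] - 1L)

# Extend each intron by the flank; clamp the start at position 1.
flanked <- data.frame(start = pmax(introns$start - flank, 1L),
                      end   = introns$end + flank)
print(flanked)
```

With flanks shorter than a read, a read that lies entirely within an exon cannot fit inside a flanked intron, while a read spanning an exon-intron boundary can.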

Example

See the vignettes for examples of using kallisto bus, bustools, and BUSpaRse on real data. The vignettes contain a complete walk-through, starting with downloading the FASTQ files for an experiment and ending with an analysis. Google Colab versions of those vignettes can be found here. Also see browseVignettes("BUSpaRse") for vignettes on using BUSpaRse to get a gene count matrix and on extracting filtered transcriptomes with the tr2g_* functions.

Installation

You can install the development version of BUSpaRse from GitHub with:

if (!require(devtools)) install.packages("devtools")
devtools::install_github("BUStools/BUSpaRse")

The release version can be installed from Bioconductor, or the development version with the version = "devel" argument:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("BUSpaRse")

Help Manual

Help page | Topics
Generate RNA velocity files for GRanges | .get_velocity_files
Transfer information about circular chromosomes between genome and annotation | annot_circular
Get genome annotation from Ensembl FASTA file | annots_from_fa_df annots_from_fa_GRanges
Cell Ranger gene biotypes | cellranger_biotypes
Check that an object is a character vector of length 1 | check_char1
Check for chromosomes in genome but not annotation | check_genome
Check inputs to tr2g_gtf and tr2g_gff3 | check_gff
Check that a tag is present in attribute field of GTF/GFF | check_tag_present
Check if transcript ID in transcriptome and annotation match | check_tx
Download transcriptome from Ensembl | dl_transcriptome
Map EC Index to Genes Compatible with the EC | EC2gene
Gene biotypes from Ensembl | ensembl_gene_biotypes
These are the column names of the 'mcols' when the Ensembl GTF file is read into R as a 'GRanges', including 'gene_id', 'transcript_id', 'biotype', 'description', and so on, and the mandatory tags like 'ID', 'Name', and 'Parent'. | ensembl_gff_mcols
Tags in the attributes field of Ensembl GTF files | ensembl_gtf_mcols
Transcript biotypes from Ensembl | ensembl_tx_biotypes
Get flanked intronic ranges | get_intron_flanks
Plot the transposed knee plot and inflection point | get_inflection get_knee_df knee_plot
Get files required for RNA velocity with bustools | get_velocity_files get_velocity_files,character-method get_velocity_files,EnsDb-method get_velocity_files,GRanges-method get_velocity_files,TxDb-method
Convert the Output of 'kallisto bus' into Gene by Cell Matrix | make_sparse_matrix
Match chromosome naming styles of annotation and genome | match_style
Read matrix along with barcode and gene names | read_count_output
Read intronic and exonic matrices into R | read_velocity_output
Tags in the attributes field of RefSeq GFF files | refseq_gff_mcols
Save transcript to gene file for use in 'bustools' | save_tr2g_bustools
Sort transcripts to the same order as in kallisto index | sort_tr2g
Convert Latin species name to dataset name | species2dataset
Standardize GRanges field names | standardize_tags
Remove chromosomes in annotation absent from genome | sub_annot
Subset genome annotation | subset_annot subset_annot,BSgenome-method subset_annot,DNAStringSet-method
Get transcript and gene info from EnsDb objects | tr2g_EnsDb
Get transcript and gene info from Ensembl | tr2g_ensembl
Get transcript and gene info from names in FASTA files | tr2g_fasta
Get transcript and gene info from GFF3 file | tr2g_gff3
Get transcript and gene info from GRanges | tr2g_GRanges
Get transcript and gene info from GTF file | tr2g_gtf
tr2g for exon-exon junctions | tr2g_junction
Get transcript and gene info from TxDb objects | tr2g_TxDb
Map Ensembl transcript ID to gene ID | transcript2gene
Validate input to get_velocity_files | validate_velocity_input
Write the files for RNA velocity to disk | write_velocity_output