Originally, this package was written when the
kallisto | bustools
concept was still experimental, to test
a new and fast way to generate the gene count matrix from fastq files
for scRNA-seq. In the past year, kallisto | bustools
has
matured. Now there’s a wrapper kb-python that can
download a prebuilt kallisto
index for human and mice and
call kallisto bus
and bustools
to get the gene
count matrix. So largely, the old way of calling
kallisto bus
and bustools
, and some
functionalities of BUSpaRse
, such as getting transcript to
gene mapping, are obsolete.
So now the focus of BUSpaRse
has shifted to finer
control of the transcripts that go into the transcriptome and more
options. Now all tr2g_*
functions (except
tr2g_ensembl
) can filter transcripts for gene and
transcript biotypes, only keep standard chromosomes (so no scaffolds and
haplotypes), and extract the filtered transcripts from the
transcriptome. GTF files from Ensembl, Ensembl fasta files, GFF3 files
from Ensembl and RefSeq, TxDb, and EnsDb can all be used here.
library(BUSpaRse)
library(BSgenome.Hsapiens.UCSC.hg38)
#> Loading required package: GenomeInfoDb
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#>
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:Matrix':
#>
#> expand, unname
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: BSgenome
#> Loading required package: GenomicRanges
#> Loading required package: Biostrings
#> Loading required package: XVector
#>
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#>
#> strsplit
#> Loading required package: BiocIO
#> Loading required package: rtracklayer
#>
#> Attaching package: 'rtracklayer'
#> The following object is masked from 'package:BiocIO':
#>
#> FileForFormat
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
#> Loading required package: GenomicFeatures
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
library(EnsDb.Hsapiens.v86)
#> Loading required package: ensembldb
#> Loading required package: AnnotationFilter
#>
#> Attaching package: 'ensembldb'
#> The following object is masked from 'package:stats':
#>
#> filter
The transcriptome can be downloaded from a specified version of
Ensembl and filtered for biotypes and standard chromosomes, not only for
the vertebrate database (www.ensembl.org and its mirrors), but also
other Ensembl sites for plants, fungi, protists, and metazoa. The
gene_biotype_use = "cellranger"
means that the same gene
biotypes Cell
Ranger uses for its reference package are used here. By default,
only standard chromosomes are kept. The dl_transcriptome
function not only downloads the transcriptome and filters it, it also
output the tr2g.tsv
file of all transcripts in the filtered
transcriptome, without column names, so can be directly used for
bustools
.
Wonder which biotypes are available? The lists of all gene and
transcript biotypes from Ensembl are now provided in this package, and
can be queried by data("ensembl_gene_biotypes")
and
data("ensembl_tx_biotypes")
.
Resources for common invertebrate model organisms such as Drosophila melanogaster and C. elegans are actually available on the vertebrate site (www.ensembl.org).
# For Drosophila
dl_transcriptome("Drosophila melanogaster", out_path = "fly",
gene_biotype_use = "cellranger", verbose = FALSE)
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("fly")
#> [1] "Drosophila_melanogaster.BDGP6.46.cdna.all.fa.gz"
#> [2] "tr2g.tsv"
#> [3] "tx_filtered.fa"
The first file is the original fasta file. The second is the
tr2g
file without column names. The third is the filtered
fasta file.
For C. elegans, from an archived version of Ensembl. Note that archives older than version 98 might not work.
dl_transcriptome("Caenorhabditis elegans", out_path = "worm", verbose = FALSE,
gene_biotype_use = "cellranger", ensembl_version = 98)
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("worm")
#> [1] "Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz"
#> [2] "tr2g.tsv"
#> [3] "tx_filtered.fa"
For Saccharomyces cerevisiae. Note that the versioning of Ensembl for the plants, fungi, and etc. sites, that are actually www.ensemblgenomes.org, is different from that of the vertebrate site.
dl_transcriptome("Saccharomyces cerevisiae", out_path = "yeast",
type = "fungus", gene_biotype_use = "cellranger",
verbose = FALSE)
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("yeast")
#> [1] "Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz"
#> [2] "tr2g.tsv"
#> [3] "tx_filtered.fa"
The transcript to gene data frame can be generated by directly querying Ensembl with biomart. This can query not only the vertebrate database (www.ensembl.org), but also the Ensembl databases for other organisms, such as plants (plants.ensembl.org) and fungi (fungi.ensembl.org). By default, this will use the most recent version of Ensembl, but older versions can also be used. By default, Ensembl transcript ID (with version number), gene ID (with version number), and gene symbol are downloaded, but other attributes available on Ensembl can be downloaded as well. Make sure that the Ensembl version matches the Ensembl version of transcriptome used for kallisto index.
# Specify other attributes
tr2g_mm <- tr2g_ensembl("Mus musculus", ensembl_version = 99,
other_attrs = "description",
gene_biotype_use = "cellranger")
#> Querying biomart for transcript and gene IDs of Mus musculus
head(tr2g_mm)
#> transcript gene
#> 3 ENSMUST00000082421.1 ENSMUSG00000064370.1
#> 5 ENSMUST00000082419.1 ENSMUSG00000064368.1
#> 6 ENSMUST00000082418.1 ENSMUSG00000064367.1
#> 10 ENSMUST00000082414.1 ENSMUSG00000064363.1
#> 11 ENSMUST00000084013.1 ENSMUSG00000065947.3
#> 13 ENSMUST00000082411.1 ENSMUSG00000064360.1
#> description
#> 3 mitochondrially encoded cytochrome b [Source:MGI Symbol;Acc:MGI:102501]
#> 5 mitochondrially encoded NADH dehydrogenase 6 [Source:MGI Symbol;Acc:MGI:102495]
#> 6 mitochondrially encoded NADH dehydrogenase 5 [Source:MGI Symbol;Acc:MGI:102496]
#> 10 mitochondrially encoded NADH dehydrogenase 4 [Source:MGI Symbol;Acc:MGI:102498]
#> 11 mitochondrially encoded NADH dehydrogenase 4L [Source:MGI Symbol;Acc:MGI:102497]
#> 13 mitochondrially encoded NADH dehydrogenase 3 [Source:MGI Symbol;Acc:MGI:102499]
#> gene_name gene_biotype chromosome_name
#> 3 mt-Cytb protein_coding MT
#> 5 mt-Nd6 protein_coding MT
#> 6 mt-Nd5 protein_coding MT
#> 10 mt-Nd4 protein_coding MT
#> 11 mt-Nd4l protein_coding MT
#> 13 mt-Nd3 protein_coding MT
We need a FASTA file for the transcriptome used to build kallisto index. Transcriptome FASTA files from Ensembl contains gene annotation in the sequence name of each transcript. Transcript and gene information can be extracted from the sequence name. At present, only Ensembl FASTA files or FASTA files with sequence names formatted like in Ensembl are accepted.
By default, the tr2g.tsv
file and filtered fasta file
(if filtering for biotypes and chromosomes) are written to disk, but
these can be turned off so only the tr2g
data frame is
returned into the R session.
# Subset of a real Ensembl FASTA file
toy_fasta <- system.file("testdata/fasta_test.fasta", package = "BUSpaRse")
tr2g_fa <- tr2g_fasta(file = toy_fasta, write_tr2g = FALSE, save_filtered = FALSE)
head(tr2g_fa)
#> # A tibble: 5 × 3
#> transcript gene gene_name
#> <chr> <chr> <chr>
#> 1 ENST00000362684.1 ENSG00000277234.1 RNU1-5P
#> 2 ENST00000459215.1 ENSG00000239102.1 RNU7-185P
#> 3 ENST00000410940.1 ENSG00000222872.1 RNU4-78P
#> 4 ENST00000383971.1 ENSG00000206698.1 RNU1-73P
#> 5 ENST00000516317.2 ENSG00000252126.2 RNU6-313P
If you have GTF or GFF3 files for other purposes, these can also be
used to generate the transcript to gene file. Now tr2g_gtf
and tr2g_gff3
can extract transcriptome from a genome that
is either a BSgenome
or a DNAStringSet
.
# Subset of a reral GTF file from Ensembl
toy_gtf <- system.file("testdata/gtf_test.gtf", package = "BUSpaRse")
tr2g_tg <- tr2g_gtf(toy_gtf, Genome = BSgenome.Hsapiens.UCSC.hg38,
gene_biotype_use = "cellranger",
out_path = "gtf")
#> 706 sequences in the genome are absent from the annotation.
head(tr2g_tg)
#> # A tibble: 6 × 3
#> transcript gene gene_name
#> <chr> <chr> <chr>
#> 1 ENST00000287218.9 ENSG00000156639.12 ZFAND3
#> 2 ENST00000373391.6 ENSG00000156639.12 ZFAND3
#> 3 ENST00000474522.5 ENSG00000156639.12 ZFAND3
#> 4 ENST00000373389.5 ENSG00000156639.12 ZFAND3
#> 5 ENST00000469208.1 ENSG00000156639.12 ZFAND3
#> 6 ENST00000440482.2 ENSG00000156639.12 ZFAND3
A new GTF or GFF3 file after filtering biotypes and chromosomes is
also written, and this can be turned off by setting
save_filtered_gtf = FALSE
or
save_filtered_gff = FALSE
. The transcriptome, with biotypes
filtered and only standard chromosomes, is in
transcriptome.fa
. Use compress_fa = TRUE
to
gzip it.
TxDb
is a class for storing transcript annotations from
the Bioconductor package GenomicFeatures
. Unfortunately,
TxDb.Hsapiens.UCSC.hg38.knownGene
does not have biotype
information or gene symbols.
tr2g_hs <- tr2g_TxDb(TxDb.Hsapiens.UCSC.hg38.knownGene, get_transcriptome = FALSE,
write_tr2g = FALSE)
#> 'select()' returned 1:1 mapping between keys and columns
head(tr2g_hs)
#> transcript gene tx_id seqnames
#> 1 ENST00000456328.2 100287102 1 chr1
#> 2 ENST00000450305.2 100287102 2 chr1
#> 3 ENST00000473358.1 107985730 3 chr1
#> 4 ENST00000469289.1 107985730 4 chr1
#> 5 ENST00000607096.1 100302278 5 chr1
#> 9 ENST00000641515.2 79501 9 chr1
EnsDb
is a class for Ensembl gene annotations, from the
Bioconductor package ensembldb
. Ensembl annotations as
EnsDb
are available on AnnotationHub
(since
version 87), and some older versions are stand alone packages
(e.g. EnsDb.Hsapiens.v86
).
tr2g_hs86 <- tr2g_EnsDb(EnsDb.Hsapiens.v86, get_transcriptome = FALSE,
write_tr2g = FALSE, gene_biotype_use = "cellranger",
use_gene_version = FALSE, use_transcript_version = FALSE)
head(tr2g_hs86)
#> transcript gene gene_name gene_biotype seqnames
#> 1 ENST00000000233 ENSG00000004059 ARF5 protein_coding 7
#> 2 ENST00000000412 ENSG00000003056 M6PR protein_coding 12
#> 3 ENST00000000442 ENSG00000173153 ESRRA protein_coding 11
#> 4 ENST00000001008 ENSG00000004478 FKBP4 protein_coding 12
#> 5 ENST00000001146 ENSG00000003137 CYP26B1 protein_coding 2
#> 6 ENST00000002125 ENSG00000003509 NDUFAF7 protein_coding 2
There used to be sections about sort_tr2g
and
save_tr2g_bustools
, but these functions have been
superseded by the new version of tr2g
functions and
dl_transcriptome
, which sort the transcriptome after
extracting it so the tr2g
and the transcriptome are in the
same order. The new version of tr2g
functions and
dl_transcriptome
also by default writes the
tr2g.tsv
without column names with the first column as
transcript and the second as gene to disk.
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] EnsDb.Hsapiens.v86_2.99.0
#> [2] ensembldb_2.29.1
#> [3] AnnotationFilter_1.31.0
#> [4] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0
#> [5] GenomicFeatures_1.57.1
#> [6] AnnotationDbi_1.69.0
#> [7] Biobase_2.65.1
#> [8] BSgenome.Hsapiens.UCSC.hg38_1.4.5
#> [9] BSgenome_1.73.1
#> [10] rtracklayer_1.65.0
#> [11] BiocIO_1.15.2
#> [12] Biostrings_2.73.2
#> [13] XVector_0.45.0
#> [14] GenomicRanges_1.57.2
#> [15] GenomeInfoDb_1.41.2
#> [16] IRanges_2.39.2
#> [17] S4Vectors_0.43.2
#> [18] BiocGenerics_0.51.3
#> [19] ggplot2_3.5.1
#> [20] zeallot_0.1.0
#> [21] Matrix_1.7-1
#> [22] BUSpaRse_1.21.0
#> [23] TENxBUSData_1.19.0
#> [24] BiocStyle_2.33.1
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 bitops_1.0-9
#> [3] httr2_1.0.5 biomaRt_2.61.3
#> [5] rlang_1.1.4 magrittr_2.0.3
#> [7] matrixStats_1.4.1 compiler_4.4.1
#> [9] RSQLite_2.3.7 png_0.1-8
#> [11] vctrs_0.6.5 stringr_1.5.1
#> [13] ProtGenerics_1.37.1 pkgconfig_2.0.3
#> [15] crayon_1.5.3 fastmap_1.2.0
#> [17] dbplyr_2.5.0 utf8_1.2.4
#> [19] Rsamtools_2.21.2 rmarkdown_2.28
#> [21] UCSC.utils_1.1.0 purrr_1.0.2
#> [23] bit_4.5.0 xfun_0.48
#> [25] zlibbioc_1.51.2 cachem_1.1.0
#> [27] jsonlite_1.8.9 progress_1.2.3
#> [29] blob_1.2.4 highr_0.11
#> [31] DelayedArray_0.31.14 BiocParallel_1.39.0
#> [33] parallel_4.4.1 prettyunits_1.2.0
#> [35] plyranges_1.25.0 R6_2.5.1
#> [37] bslib_0.8.0 stringi_1.8.4
#> [39] jquerylib_0.1.4 Rcpp_1.0.13
#> [41] SummarizedExperiment_1.35.5 knitr_1.48
#> [43] tidyselect_1.2.1 abind_1.4-8
#> [45] yaml_2.3.10 codetools_0.2-20
#> [47] curl_5.2.3 lattice_0.22-6
#> [49] tibble_3.2.1 withr_3.0.2
#> [51] KEGGREST_1.45.1 evaluate_1.0.1
#> [53] BiocFileCache_2.13.2 xml2_1.3.6
#> [55] ExperimentHub_2.13.1 pillar_1.9.0
#> [57] BiocManager_1.30.25 filelock_1.0.3
#> [59] MatrixGenerics_1.17.1 generics_0.1.3
#> [61] RCurl_1.98-1.16 BiocVersion_3.20.0
#> [63] hms_1.1.3 munsell_0.5.1
#> [65] scales_1.3.0 glue_1.8.0
#> [67] lazyeval_0.2.2 maketools_1.3.1
#> [69] tools_4.4.1 AnnotationHub_3.15.0
#> [71] sys_3.4.3 GenomicAlignments_1.41.0
#> [73] buildtools_1.0.0 XML_3.99-0.17
#> [75] grid_4.4.1 tidyr_1.3.1
#> [77] colorspace_2.1-1 GenomeInfoDbData_1.2.13
#> [79] restfulr_0.0.15 cli_3.6.3
#> [81] rappdirs_0.3.3 fansi_1.0.6
#> [83] S4Arrays_1.5.11 dplyr_1.1.4
#> [85] gtable_0.3.6 sass_0.4.9
#> [87] digest_0.6.37 SparseArray_1.5.45
#> [89] farver_2.1.2 rjson_0.2.23
#> [91] memoise_2.0.1 htmltools_0.5.8.1
#> [93] lifecycle_1.0.4 httr_1.4.7
#> [95] mime_0.12 bit64_4.5.2