Generate transcript to gene file for bustools

Introduction

Originally, this package was written when the kallisto | bustools concept was still experimental, to test a new and fast way to generate the gene count matrix from fastq files for scRNA-seq. In the past year, kallisto | bustools has matured. Now there’s a wrapper kb-python that can download a prebuilt kallisto index for human and mice and call kallisto bus and bustools to get the gene count matrix. So largely, the old way of calling kallisto bus and bustools, and some functionalities of BUSpaRse, such as getting transcript to gene mapping, are obsolete.

So now the focus of BUSpaRse has shifted to finer control of the transcripts that go into the transcriptome and more options. Now all tr2g_* functions (except tr2g_ensembl) can filter transcripts for gene and transcript biotypes, only keep standard chromosomes (so no scaffolds and haplotypes), and extract the filtered transcripts from the transcriptome. GTF files from Ensembl, Ensembl fasta files, GFF3 files from Ensembl and RefSeq, TxDb, and EnsDb can all be used here.

library(BUSpaRse)
library(BSgenome.Hsapiens.UCSC.hg38)
#> Loading required package: GenomeInfoDb
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:Matrix':
#> 
#>     expand, unname
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: BSgenome
#> Loading required package: GenomicRanges
#> Loading required package: Biostrings
#> Loading required package: XVector
#> 
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#> 
#>     strsplit
#> Loading required package: BiocIO
#> Loading required package: rtracklayer
#> 
#> Attaching package: 'rtracklayer'
#> The following object is masked from 'package:BiocIO':
#> 
#>     FileForFormat
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
#> Loading required package: GenomicFeatures
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
library(EnsDb.Hsapiens.v86)
#> Loading required package: ensembldb
#> Loading required package: AnnotationFilter
#> 
#> Attaching package: 'ensembldb'
#> The following object is masked from 'package:stats':
#> 
#>     filter

Downloading a transcriptome

The transcriptome can be downloaded from a specified version of Ensembl and filtered for biotypes and standard chromosomes, not only for the vertebrate database (www.ensembl.org and its mirrors), but also other Ensembl sites for plants, fungi, protists, and metazoa. The gene_biotype_use = "cellranger" means that the same gene biotypes Cell Ranger uses for its reference package are used here. By default, only standard chromosomes are kept. The dl_transcriptome function not only downloads the transcriptome and filters it, it also output the tr2g.tsv file of all transcripts in the filtered transcriptome, without column names, so can be directly used for bustools.

Wonder which biotypes are available? The lists of all gene and transcript biotypes from Ensembl are now provided in this package, and can be queried by data("ensembl_gene_biotypes") and data("ensembl_tx_biotypes").

Resources for common invertebrate model organisms such as Drosophila melanogaster and C. elegans are actually available on the vertebrate site (www.ensembl.org).

# For Drosophila
dl_transcriptome("Drosophila melanogaster", out_path = "fly", 
                 gene_biotype_use = "cellranger", verbose = FALSE)
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("fly")
#> [1] "Drosophila_melanogaster.BDGP6.46.cdna.all.fa.gz"
#> [2] "tr2g.tsv"                                       
#> [3] "tx_filtered.fa"

The first file is the original fasta file. The second is the tr2g file without column names. The third is the filtered fasta file.

For C. elegans, from an archived version of Ensembl. Note that archives older than version 98 might not work.

dl_transcriptome("Caenorhabditis elegans", out_path = "worm", verbose = FALSE,
                 gene_biotype_use = "cellranger", ensembl_version = 98)
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("worm")
#> [1] "Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz"
#> [2] "tr2g.tsv"                                      
#> [3] "tx_filtered.fa"

For Saccharomyces cerevisiae. Note that the versioning of Ensembl for the plants, fungi, and etc. sites, that are actually www.ensemblgenomes.org, is different from that of the vertebrate site.

dl_transcriptome("Saccharomyces cerevisiae", out_path = "yeast", 
                 type = "fungus", gene_biotype_use = "cellranger", 
                 verbose = FALSE)
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("yeast")
#> [1] "Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz"
#> [2] "tr2g.tsv"                                       
#> [3] "tx_filtered.fa"

Obtaining transcript to gene information

From Ensembl

The transcript to gene data frame can be generated by directly querying Ensembl with biomart. This can query not only the vertebrate database (www.ensembl.org), but also the Ensembl databases for other organisms, such as plants (plants.ensembl.org) and fungi (fungi.ensembl.org). By default, this will use the most recent version of Ensembl, but older versions can also be used. By default, Ensembl transcript ID (with version number), gene ID (with version number), and gene symbol are downloaded, but other attributes available on Ensembl can be downloaded as well. Make sure that the Ensembl version matches the Ensembl version of transcriptome used for kallisto index.

# Specify other attributes
tr2g_mm <- tr2g_ensembl("Mus musculus", ensembl_version = 99, 
                        other_attrs = "description", 
                        gene_biotype_use = "cellranger")
#> Querying biomart for transcript and gene IDs of Mus musculus
head(tr2g_mm)
#>              transcript                 gene
#> 3  ENSMUST00000082421.1 ENSMUSG00000064370.1
#> 5  ENSMUST00000082419.1 ENSMUSG00000064368.1
#> 6  ENSMUST00000082418.1 ENSMUSG00000064367.1
#> 10 ENSMUST00000082414.1 ENSMUSG00000064363.1
#> 11 ENSMUST00000084013.1 ENSMUSG00000065947.3
#> 13 ENSMUST00000082411.1 ENSMUSG00000064360.1
#>                                                                         description
#> 3           mitochondrially encoded cytochrome b [Source:MGI Symbol;Acc:MGI:102501]
#> 5   mitochondrially encoded NADH dehydrogenase 6 [Source:MGI Symbol;Acc:MGI:102495]
#> 6   mitochondrially encoded NADH dehydrogenase 5 [Source:MGI Symbol;Acc:MGI:102496]
#> 10  mitochondrially encoded NADH dehydrogenase 4 [Source:MGI Symbol;Acc:MGI:102498]
#> 11 mitochondrially encoded NADH dehydrogenase 4L [Source:MGI Symbol;Acc:MGI:102497]
#> 13  mitochondrially encoded NADH dehydrogenase 3 [Source:MGI Symbol;Acc:MGI:102499]
#>    gene_name   gene_biotype chromosome_name
#> 3    mt-Cytb protein_coding              MT
#> 5     mt-Nd6 protein_coding              MT
#> 6     mt-Nd5 protein_coding              MT
#> 10    mt-Nd4 protein_coding              MT
#> 11   mt-Nd4l protein_coding              MT
#> 13    mt-Nd3 protein_coding              MT
# Plants
tr2g_at <- tr2g_ensembl("Arabidopsis thaliana", type = "plant")
#> Version is only available to vertebrates.
#> Querying biomart for transcript and gene IDs of Arabidopsis thaliana
#> File /tmp/Rtmpti4Wql/Rbuild1de21855c6c5/BUSpaRse/vignettes/tr2g.tsv already exists.
head(tr2g_at)
#>    transcript      gene gene_name chromosome_name
#> 1 AT3G11415.1 AT3G11415 AT3G11415               3
#> 2 AT1G31258.1 AT1G31258 AT1G31258               1
#> 3 AT5G24735.1 AT5G24735 AT5G24735               5
#> 4 AT2G45780.1 AT2G45780 AT2G45780               2
#> 5 AT2G42425.1 AT2G42425 AT2G42425               2
#> 6 AT4G01533.1 AT4G01533 AT4G01533               4

From FASTA file

We need a FASTA file for the transcriptome used to build kallisto index. Transcriptome FASTA files from Ensembl contains gene annotation in the sequence name of each transcript. Transcript and gene information can be extracted from the sequence name. At present, only Ensembl FASTA files or FASTA files with sequence names formatted like in Ensembl are accepted.

By default, the tr2g.tsv file and filtered fasta file (if filtering for biotypes and chromosomes) are written to disk, but these can be turned off so only the tr2g data frame is returned into the R session.

# Subset of a real Ensembl FASTA file
toy_fasta <- system.file("testdata/fasta_test.fasta", package = "BUSpaRse")
tr2g_fa <- tr2g_fasta(file = toy_fasta, write_tr2g = FALSE, save_filtered = FALSE)
head(tr2g_fa)
#> # A tibble: 5 × 3
#>   transcript        gene              gene_name
#>   <chr>             <chr>             <chr>    
#> 1 ENST00000362684.1 ENSG00000277234.1 RNU1-5P  
#> 2 ENST00000459215.1 ENSG00000239102.1 RNU7-185P
#> 3 ENST00000410940.1 ENSG00000222872.1 RNU4-78P 
#> 4 ENST00000383971.1 ENSG00000206698.1 RNU1-73P 
#> 5 ENST00000516317.2 ENSG00000252126.2 RNU6-313P

From GTF and GFF3 files

If you have GTF or GFF3 files for other purposes, these can also be used to generate the transcript to gene file. Now tr2g_gtf and tr2g_gff3 can extract transcriptome from a genome that is either a BSgenome or a DNAStringSet.

# Subset of a reral GTF file from Ensembl
toy_gtf <- system.file("testdata/gtf_test.gtf", package = "BUSpaRse")
tr2g_tg <- tr2g_gtf(toy_gtf, Genome = BSgenome.Hsapiens.UCSC.hg38,
                    gene_biotype_use = "cellranger",
                    out_path = "gtf")
#> 706 sequences in the genome are absent from the annotation.
head(tr2g_tg)
#> # A tibble: 6 × 3
#>   transcript        gene               gene_name
#>   <chr>             <chr>              <chr>    
#> 1 ENST00000287218.9 ENSG00000156639.12 ZFAND3   
#> 2 ENST00000373391.6 ENSG00000156639.12 ZFAND3   
#> 3 ENST00000474522.5 ENSG00000156639.12 ZFAND3   
#> 4 ENST00000373389.5 ENSG00000156639.12 ZFAND3   
#> 5 ENST00000469208.1 ENSG00000156639.12 ZFAND3   
#> 6 ENST00000440482.2 ENSG00000156639.12 ZFAND3

A new GTF or GFF3 file after filtering biotypes and chromosomes is also written, and this can be turned off by setting save_filtered_gtf = FALSE or save_filtered_gff = FALSE. The transcriptome, with biotypes filtered and only standard chromosomes, is in transcriptome.fa. Use compress_fa = TRUE to gzip it.

list.files("gtf")
#> [1] "gtf_filtered.gtf" "tr2g.tsv"         "transcriptome.fa"

From TxDb

TxDb is a class for storing transcript annotations from the Bioconductor package GenomicFeatures. Unfortunately, TxDb.Hsapiens.UCSC.hg38.knownGene does not have biotype information or gene symbols.

tr2g_hs <- tr2g_TxDb(TxDb.Hsapiens.UCSC.hg38.knownGene, get_transcriptome = FALSE,
                     write_tr2g = FALSE)
#> 'select()' returned 1:1 mapping between keys and columns
head(tr2g_hs)
#>          transcript      gene tx_id seqnames
#> 1 ENST00000456328.2 100287102     1     chr1
#> 2 ENST00000450305.2 100287102     2     chr1
#> 3 ENST00000473358.1 107985730     3     chr1
#> 4 ENST00000469289.1 107985730     4     chr1
#> 5 ENST00000607096.1 100302278     5     chr1
#> 9 ENST00000641515.2     79501     9     chr1

From EnsDb

EnsDb is a class for Ensembl gene annotations, from the Bioconductor package ensembldb. Ensembl annotations as EnsDb are available on AnnotationHub (since version 87), and some older versions are stand alone packages (e.g. EnsDb.Hsapiens.v86).

tr2g_hs86 <- tr2g_EnsDb(EnsDb.Hsapiens.v86, get_transcriptome = FALSE, 
                        write_tr2g = FALSE, gene_biotype_use = "cellranger",
                        use_gene_version = FALSE, use_transcript_version = FALSE)
head(tr2g_hs86)
#>        transcript            gene gene_name   gene_biotype seqnames
#> 1 ENST00000000233 ENSG00000004059      ARF5 protein_coding        7
#> 2 ENST00000000412 ENSG00000003056      M6PR protein_coding       12
#> 3 ENST00000000442 ENSG00000173153     ESRRA protein_coding       11
#> 4 ENST00000001008 ENSG00000004478     FKBP4 protein_coding       12
#> 5 ENST00000001146 ENSG00000003137   CYP26B1 protein_coding        2
#> 6 ENST00000002125 ENSG00000003509   NDUFAF7 protein_coding        2

Deprecation

There used to be sections about sort_tr2g and save_tr2g_bustools, but these functions have been superseded by the new version of tr2g functions and dl_transcriptome, which sort the transcriptome after extracting it so the tr2g and the transcriptome are in the same order. The new version of tr2g functions and dl_transcriptome also by default writes the tr2g.tsv without column names with the first column as transcript and the second as gene to disk.

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] EnsDb.Hsapiens.v86_2.99.0               
#>  [2] ensembldb_2.29.1                        
#>  [3] AnnotationFilter_1.31.0                 
#>  [4] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0
#>  [5] GenomicFeatures_1.57.1                  
#>  [6] AnnotationDbi_1.69.0                    
#>  [7] Biobase_2.65.1                          
#>  [8] BSgenome.Hsapiens.UCSC.hg38_1.4.5       
#>  [9] BSgenome_1.73.1                         
#> [10] rtracklayer_1.65.0                      
#> [11] BiocIO_1.15.2                           
#> [12] Biostrings_2.73.2                       
#> [13] XVector_0.45.0                          
#> [14] GenomicRanges_1.57.2                    
#> [15] GenomeInfoDb_1.41.2                     
#> [16] IRanges_2.39.2                          
#> [17] S4Vectors_0.43.2                        
#> [18] BiocGenerics_0.51.3                     
#> [19] ggplot2_3.5.1                           
#> [20] zeallot_0.1.0                           
#> [21] Matrix_1.7-1                            
#> [22] BUSpaRse_1.21.0                         
#> [23] TENxBUSData_1.19.0                      
#> [24] BiocStyle_2.33.1                        
#> 
#> loaded via a namespace (and not attached):
#>  [1] DBI_1.2.3                   bitops_1.0-9               
#>  [3] httr2_1.0.5                 biomaRt_2.61.3             
#>  [5] rlang_1.1.4                 magrittr_2.0.3             
#>  [7] matrixStats_1.4.1           compiler_4.4.1             
#>  [9] RSQLite_2.3.7               png_0.1-8                  
#> [11] vctrs_0.6.5                 stringr_1.5.1              
#> [13] ProtGenerics_1.37.1         pkgconfig_2.0.3            
#> [15] crayon_1.5.3                fastmap_1.2.0              
#> [17] dbplyr_2.5.0                utf8_1.2.4                 
#> [19] Rsamtools_2.21.2            rmarkdown_2.28             
#> [21] UCSC.utils_1.1.0            purrr_1.0.2                
#> [23] bit_4.5.0                   xfun_0.48                  
#> [25] zlibbioc_1.51.2             cachem_1.1.0               
#> [27] jsonlite_1.8.9              progress_1.2.3             
#> [29] blob_1.2.4                  highr_0.11                 
#> [31] DelayedArray_0.31.14        BiocParallel_1.39.0        
#> [33] parallel_4.4.1              prettyunits_1.2.0          
#> [35] plyranges_1.25.0            R6_2.5.1                   
#> [37] bslib_0.8.0                 stringi_1.8.4              
#> [39] jquerylib_0.1.4             Rcpp_1.0.13                
#> [41] SummarizedExperiment_1.35.5 knitr_1.48                 
#> [43] tidyselect_1.2.1            abind_1.4-8                
#> [45] yaml_2.3.10                 codetools_0.2-20           
#> [47] curl_5.2.3                  lattice_0.22-6             
#> [49] tibble_3.2.1                withr_3.0.2                
#> [51] KEGGREST_1.45.1             evaluate_1.0.1             
#> [53] BiocFileCache_2.13.2        xml2_1.3.6                 
#> [55] ExperimentHub_2.13.1        pillar_1.9.0               
#> [57] BiocManager_1.30.25         filelock_1.0.3             
#> [59] MatrixGenerics_1.17.1       generics_0.1.3             
#> [61] RCurl_1.98-1.16             BiocVersion_3.20.0         
#> [63] hms_1.1.3                   munsell_0.5.1              
#> [65] scales_1.3.0                glue_1.8.0                 
#> [67] lazyeval_0.2.2              maketools_1.3.1            
#> [69] tools_4.4.1                 AnnotationHub_3.15.0       
#> [71] sys_3.4.3                   GenomicAlignments_1.41.0   
#> [73] buildtools_1.0.0            XML_3.99-0.17              
#> [75] grid_4.4.1                  tidyr_1.3.1                
#> [77] colorspace_2.1-1            GenomeInfoDbData_1.2.13    
#> [79] restfulr_0.0.15             cli_3.6.3                  
#> [81] rappdirs_0.3.3              fansi_1.0.6                
#> [83] S4Arrays_1.5.11             dplyr_1.1.4                
#> [85] gtable_0.3.6                sass_0.4.9                 
#> [87] digest_0.6.37               SparseArray_1.5.45         
#> [89] farver_2.1.2                rjson_0.2.23               
#> [91] memoise_2.0.1               htmltools_0.5.8.1          
#> [93] lifecycle_1.0.4             httr_1.4.7                 
#> [95] mime_0.12                   bit64_4.5.2