Package 'BUSpaRse' reference manual

Title:	kallisto \| bustools R utilities
Description:	The kallisto \| bustools pipeline is a fast and modular set of tools to convert single cell RNA-seq reads in fastq files into gene count or transcript compatibility counts (TCC) matrices for downstream analysis. Central to this pipeline is the barcode, UMI, and set (BUS) file format. This package serves the following purposes: First, this package allows users to manipulate BUS format files as data frames in R and then convert them into gene count or TCC matrices. Furthermore, since R and Rcpp code is easier to handle than pure C++ code, users are encouraged to tweak the source code of this package to experiment with new uses of BUS format and different ways to convert the BUS file into gene count matrix. Second, this package can conveniently generate files required to generate gene count matrices for spliced and unspliced transcripts for RNA velocity. Here biotypes can be filtered and scaffolds and haplotypes can be removed, and the filtered transcriptome can be extracted and written to disk. Third, this package implements utility functions to get transcripts and associated genes required to convert BUS files to gene count matrices, to write the transcript to gene information in the format required by bustools, and to read output of bustools into R as sparses matrices.
Authors:	Lambda Moses [aut, cre] , Lior Pachter [aut, ths]
Maintainer:	Lambda Moses <dl3764@columbia.edu>
License:	BSD_2_clause + file LICENSE
Version:	1.21.0
Built:	2025-02-27 05:39:32 UTC
Source:	https://github.com/bioc/BUSpaRse

Generate RNA velocity files for GRanges

Description

Generate RNA velocity files for GRanges

Usage

.get_velocity_files(
  gr,
  L,
  Genome,
  Transcriptome = NULL,
  out_path = ".",
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  transcript_version = "transcript_version",
  gene_version = "gene_version",
  version_sep = ".",
  transcript_biotype_col = "transcript_biotype",
  gene_biotype_col = "gene_biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  save_filtered_gtf = FALSE,
  compress_fa = FALSE,
  width = 80L
)
.get_velocity_files(
  gr,
  L,
  Genome,
  Transcriptome = NULL,
  out_path = ".",
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  transcript_version = "transcript_version",
  gene_version = "gene_version",
  version_sep = ".",
  transcript_biotype_col = "transcript_biotype",
  gene_biotype_col = "gene_biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  save_filtered_gtf = FALSE,
  compress_fa = FALSE,
  width = 80L
)

Arguments

`gr`	A `GRanges` object for gene annotation.
`L`	Length of the biological read. For instance, 10xv1: 98 nt, 10xv2: 98 nt, 10xv3: 91 nt, Drop-seq: 50 nt. If in doubt check read length in a fastq file for biological reads with the `bash` commands: If the fastq file is gzipped, then do `⁠zcat your_file.fastq.gz \| head⁠` on Linux. If on Mac, then `zcat < your_file.fastq.gz \| head`. Then you will see lines with nucleotide bases. Copy one of those lines and determine its length with `str_length` in R or `⁠echo -n <the sequence> \| wc -c⁠` in `bash`. Which file corresponds to biological reads depends on the particular technology.
`Genome`	Either a `BSgenome` or a `XStringSet` object of genomic sequences, where the intronic sequences will be extracted from. Use `genomeStyles` to check which styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then the style of chromosome names in the genome and annotation should be manually set to be consistent.
`Transcriptome`	A `XStringSet`, a path to a fasta file (can be gzipped) of the transcriptome which contains sequences of spliced transcripts, or `NULL`. The transcriptome here will be concatenated with the intronic sequences to give one fasta file. When `NULL`, the transriptome sequences will be extracted from the genome given the gene annotation, so it will be guaranteed that transcript IDs in the transcriptome and in the annotation match. Otherwise, the type of transcript ID in the transcriptome must match that in the gene annotation supplied via argument `X`.
`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.
`style`	Formatting of chromosome names. Use `genomeStyles` to check which styles are supported for your organism of interest and what those styles look like. This can also be a style supported for your organism different from the style used by the annotation and the genome. Then this style will be used for both the annotation and the genome. Can take the following values: genome If style of the annnotation is different from that of the genome, then the style of the genome will be used. other Custom style, need to manually ensure that the style in annotation matches that of the genome. Ensembl Or `UCSC` or `NCBI`, whichever is supported by your species of interest.
`isoform_action`	Character, indicating action to take with different transcripts of the same gene. Must be one of the following: collapse First, the union of all exons of different transcripts of a gene will be taken. Then the introns will be inferred from this union. Only the flanked intronic sequences are affected; isoforms will always be taken into account for spliced sequences or exon-exon junctions. separate Introns from different transcripts will be kept separate.
`exon_option`	Character, indicating how exonic sequences should be included in the kallisto index. Must be one of the following: full The full cDNA sequences, which include the full exonic sequences, will be used. This is the default. junction Only the exon-exon junctions, with L-1 bases on each side of the junctions, will be used.
`transcript_id`	Character vector of length 1. Tag in `attribute` field corresponding to transcript IDs. This argument must be supplied and cannot be `NA` or `NULL`. Will throw error if tag indicated in this argument does not exist.
`gene_id`	Character vector of length 1. Tag in `attribute` field corresponding to gene IDs. This argument must be supplied and cannot be `NA` or `NULL`. Note that this is different from gene symbols, which do not have to be unique. This can be Ensembl or Entrez IDs. However, if the gene symbols are in fact unique for each gene, you may supply the tag for human readable gene symbols to this argument. Will throw error if tag indicated in this argument does not exist. This is typically "gene_id" for annotations from Ensembl and "gene" for refseq.
`transcript_version`	Character vector of length 1. Tag in `attribute` field corresponding to transcript version number. If your GTF file does not include transcript version numbers, or if you do not wish to include the version number, then use `NULL` for this argument. To decide whether to include transcript version number, check whether version numbers are included in the `transcripts.txt` in the `kallisto` output directory. If that file includes version numbers, then trannscript version numbers must be included here as well. If that file does not include version numbers, then transcript version numbers must not be included here.
`gene_version`	Character vector of length 1. Tag in `attribute` field corresponding to gene version number. If your GTF file does not include gene version numbers, or if you do not wish to include the version number, then use `NULL` for this argument. Unlike transcript version number, it's up to you whether to include gene version number.
`version_sep`	Character to separate bewteen the main ID and the version number. Defaults to ".", as in Ensembl.
`transcript_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to transcript biotype.
`gene_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to gene biotype.
`transcript_biotype_use`	Character, can be "all" or a vector of transcript biotypes to be used. Transcript biotypes aren't entirely the same as gene biotypes. For instance, in Ensembl annotation, `retained_intron` is a transcript biotype, but not a gene biotype. If "cellranger", then a warning will be given. See `data("ensembl_tx_biotypes")` for all available transcript biotypes from Ensembl.
`gene_biotype_use`	Character, can be "all", "cellranger", or a vector of gene biotypes to be used. If "cellranger", then the biotypes used by Cell Ranger's reference are used. See `data("cellranger_biotypes")` for gene biotypes the Cell Ranger reference uses. See `data("ensembl_gene_biotypes")` for all available gene biotypes from Ensembl. Note that gene biotypes and transcript biotypes are not always the same.
`chrs_only`	Logical, whether to include chromosomes only, for GTF and GFF files can contain annotations for scaffolds, which are not incorporated into chromosomes. This will also exclude haplotypes. Defaults to `TRUE`. Only applicable to species found in `genomeStyles()`.
`save_filtered_gtf`	Logical. If filtering type, biotypes, and/or chromosomes, whether to save the filtered `GRanges` as a GTF file.
`compress_fa`	Logical, whether to compress the output fasta file. If `TRUE`, then the fasta file will be gzipped.
`width`	Maximum number of letters per line of sequence in the output fasta file. Must be an integer.

Value

See get_velocity_files

Transfer information about circular chromosomes between genome and annotation

Description

Internal use, called after calling subset_annot.

Usage

annot_circular(Genome, annot)
annot_circular(Genome, annot)

Arguments

Genome

Either a BSgenome or a XStringSet object of genomic sequences, where the intronic sequences will be extracted from. Use genomeStyles to check which styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then the style of chromosome names in the genome and annotation should be manually set to be consistent.

annot

Genome annotation, an object of a class with a seqlevels method, such as GRanges, TxDb, and EnsDb.

Value

If neither genome nor annotation indicates which chromosome is circular, then the input will be returned unchanged. If only one of genome and annotation has such information, then it will be transferred to the one that does not. If both do have such information, the information from the genome will be transferred to the annotation if they're different.

Get genome annotation from Ensembl FASTA file

Description

Ensembl FASTA files for RNA contain much of the information contained in GTF files, such as chromosome, genome assembly version, coordinates, strand, gene ID, gene symbol, and gene description. Given such a FASTA file, this function can extract all the genome annotation information and return a data frame or a GRanges object.

Usage

annots_from_fa_df(file)

annots_from_fa_GRanges(file)
annots_from_fa_df(file)

annots_from_fa_GRanges(file)

Arguments

file

Path to the FASTA file to be read. The file can remain gzipped.

Value

annots_from_fa_df returns a data frame and annots_from_fa_GRanges returns GRanges.

A data frame with genome annotations.

Examples

fn <- system.file("testdata/fasta_test.fasta", package = "BUSpaRse")
annots_from_fa_df(fn)
annots_from_fa_GRanges(fn)

fn <- system.file("testdata/fasta_test.fasta", package = "BUSpaRse")
annots_from_fa_df(fn)
annots_from_fa_GRanges(fn)

Cell Ranger gene biotypes

Description

In the GRCh38 Cell Ranger reference package, an Ensembl human GTF file is filtered by gene biotypes. This vector includes all gene biotypes included by this Cell Ranger reference package.

Usage

cellranger_biotypes
cellranger_biotypes

Format

A character vector with all Cell Ranger reference package gene biotypes.

Source

https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references

Check that an object is a character vector of length 1

Description

Just in case the user passes something with length more than 1 and messes up everything thanks to vectorization.

Usage

check_char1(x)
check_char1(x)

Arguments

`x`	Named vector of arguments to be checked.

Value

Error if x is not a character vector with length 1.

Check for chromosomes in genome but not annotation

Description

Check for chromosomes in genome but not annotation

Usage

check_genome(chrs_use, Genome)
check_genome(chrs_use, Genome)

Arguments

chrs_use

Character vector of names of chromosomes present in both the annotation and the genome.

Genome

Value

Nothing. Will emit message if the genome contains chromosomes absent from the annotation.

Check inputs to tr2g_gtf and tr2g_gff3

Description

This function validates inputs to tr2g_gtf and tr2g_gff3 and throws error early if some inputs are wrong.

Usage

check_gff(format, file, transcript_id, gene_id)
check_gff(format, file, transcript_id, gene_id)

Arguments

`format`	Whether it's gtf or gff3.
`file`	Path to a GTF file to be read. The file can remain gzipped. Use `getGTF` from the `biomartr` package to download GTF files from Ensembl, and use `getGFF` from `biomartr` to download GFF3 files from Ensembl and RefSeq.
`transcript_id`	Character vector of length 1. Tag in `attribute` field corresponding to transcript IDs. This argument must be supplied and cannot be `NA` or `NULL`. Will throw error if tag indicated in this argument does not exist.
`gene_id`	Character vector of length 1. Tag in `attribute` field corresponding to gene IDs. This argument must be supplied and cannot be `NA` or `NULL`. Note that this is different from gene symbols, which do not have to be unique. This can be Ensembl or Entrez IDs. However, if the gene symbols are in fact unique for each gene, you may supply the tag for human readable gene symbols to this argument. Will throw error if tag indicated in this argument does not exist. This is typically "gene_id" for annotations from Ensembl and "gene" for refseq.

Value

Nothing, will throw error if there's a problem.

Check that a tag is present in attribute field of GTF/GFF

Description

The attribute field of GTF/GFF files are very complicated and is very inconsistent between sources. This function is to make sure that transcript and gene IDs can be extracted properly.

Usage

check_tag_present(tags_use, tags, error = TRUE)
check_tag_present(tags_use, tags, error = TRUE)

Arguments

`tags_use`	The tags to be checked.
`tags`	The tags present in attribute field.
`error`	Whether to throw an error for absent tags. If `FALSE`, then a warning will be given.

Value

Error or warning if tag is absent.

Check if transcript ID in transcriptome and annotation match

Description

This function throws an error if transcript IDs in transcriptome and annotation do not overlap. If they do overlap, this function will give a message about transcript IDs that do not agree in the transcriptome and the annotation

Usage

check_tx(tx_annot, tx)
check_tx(tx_annot, tx)

Arguments

`tx_annot`	Character vector of transcript IDs from the annotation.
`tx`	Character vector of transcript IDs from the transcriptome.

Value

Character vector of the overlapping transcript IDs.

Download transcriptome from Ensembl

Description

This function downloads the cDNA fasta file from specific version of Ensembl. It can also filter the fasta file by gene and transcript biotype and remove scaffolds and haplotypes.

Usage

dl_transcriptome(
  species,
  out_path = ".",
  type = c("vertebrate", "metazoa", "plant", "fungus", "protist"),
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  ensembl_version = NULL,
  verbose = TRUE,
  ...
)
dl_transcriptome(
  species,
  out_path = ".",
  type = c("vertebrate", "metazoa", "plant", "fungus", "protist"),
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  ensembl_version = NULL,
  verbose = TRUE,
  ...
)

Arguments

`species`	Character vector of length 1, Latin name of the species of interest.
`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.
`type`	Character, must be one of "vertebrate", "metazoa", "plant", "fungus" and "protist". Passing "vertebrate" will use the default www.ensembl.org host. Gene annotation of some common invertebrate model organisms, such as Drosophila melanogaster, are available on www.ensembl.org so for these invertebrate model organisms, "vertebrate" can be used for this argument. Passing values other than "vertebrate" will use other Ensembl hosts. For animals absent from www.ensembl.org, try "metazoa".
`transcript_biotype_use`	Character, can be "all" or a vector of transcript biotypes to be used. Transcript biotypes aren't entirely the same as gene biotypes. For instance, in Ensembl annotation, `retained_intron` is a transcript biotype, but not a gene biotype. If "cellranger", then a warning will be given. See `data("ensembl_tx_biotypes")` for all available transcript biotypes from Ensembl.
`gene_biotype_use`	Character, can be "all", "cellranger", or a vector of gene biotypes to be used. If "cellranger", then the biotypes used by Cell Ranger's reference are used. See `data("cellranger_biotypes")` for gene biotypes the Cell Ranger reference uses. See `data("ensembl_gene_biotypes")` for all available gene biotypes from Ensembl. Note that gene biotypes and transcript biotypes are not always the same.
`chrs_only`	Logical, whether to include chromosomes only, for GTF and GFF files can contain annotations for scaffolds, which are not incorporated into chromosomes. This will also exclude haplotypes. Defaults to `TRUE`. Only applicable to species found in `genomeStyles()`.
`ensembl_version`	Integer version number of Ensembl (e.g. 94 for the October 2018 release). This argument defaults to `NULL`, which will use the current release of Ensembl. Use `listEnsemblArchives` to see the version number corresponding to the Ensembl release of a particular date. The version specified here must match the version of Ensembl where the transcriptome used to build the kallisto index was downloaded. This only works for vertebrates and the most common invertebrate model organisms like Drosophila melanogaster and C. elegans (i.e. www.ensembl.org and its mirrors), not the other Ensembl sites for plants, protists, fungi, and metazoa.
`verbose`	Whether to display progress.
`...`	Other arguments passed to `tr2g_fasta`.

Value

Invisibly returns the path to the fasta file. The following files are written to disk, in the out_path directory:

species.genome.cdna.all.fa.gz: The cDNA fasta file from Ensembl, from the specified version.
cdna_filtered.fa: The filtered cDNA fasta file, only keeping the desired biotypes and without scaffolds and haplotypes (if chrs_only = TRUE). This file will not be written if all gene and transcript biotypes are used and scaffolds and haplotypes are not removed.
tr2g.tsv: The transcript to gene file, without headers so can be directly used for bustools.

Examples

dl_transcriptome("Drosophila melanogaster", gene_biotype_use = "cellranger",
                 chrs_only = FALSE)
# Clean up
file.remove("Drosophila_melanogaster.BDGP6.32.cdna.all.fa.gz")
dl_transcriptome("Drosophila melanogaster", gene_biotype_use = "cellranger",
                 chrs_only = FALSE)
# Clean up
file.remove("Drosophila_melanogaster.BDGP6.32.cdna.all.fa.gz")

Map EC Index to Genes Compatible with the EC

Description

In the output file output.bus, equivalence classes (EC) are denoted by an index, which is related to the set of transcripts the EC is compatible to in the output file matrix.ec. This function further relates the set of transcripts to the set of genes the EC is compatible to. This function first reads in matrix.ec, and then translates the transcripts into genes.

Usage

EC2gene(tr2g, kallisto_out_path, verbose = TRUE)
EC2gene(tr2g, kallisto_out_path, verbose = TRUE)

Arguments

`tr2g`	A Data frame with columns `gene` and `transcript`, in the same order as in the transcriptome index for `kallisto`.
`kallisto_out_path`	Path to the `kallisto bus` output directory.
`verbose`	Logical, whether to display progress.

Details

The data frame passed to tr2g can be generated from function transcript2gene in this package for any organism that has gene and transcript ID on Ensembl, or from the tr2g_* family of function. You no longer need to use this function before running make_sparse_matrix; the purpose of this function is to query which genes equivalence classes map to.

Calling this function is unnessary when working with gene count matrices. However, this function is useful for finding genes the ECs map to in TCC matrices, such as when finding species-specific ECs in mixed species datasets and identifying ECs mapped to known marker genes of cell types.

Value

A data frame with 3 columns:

EC_ind: Index of the EC as appearing in the matrix.ec file.
EC: A list column each element of which is a numeric vector of the transcripts in the EC corresponding to the EC index. To learn more about list columns, see the relevant section in the R for Data Science book.
gene: A list column each element of which is a character vector of genes the EC maps to.

Examples

# Load toy example for testing
toy_path <- system.file("testdata", package = "BUSpaRse")
load(paste(toy_path, "toy_example.RData", sep = "/"))
EC2gene(tr2g_toy, toy_path, verbose = FALSE)
# Load toy example for testing
toy_path <- system.file("testdata", package = "BUSpaRse")
load(paste(toy_path, "toy_example.RData", sep = "/"))
EC2gene(tr2g_toy, toy_path, verbose = FALSE)

Gene biotypes from Ensembl

Description

These vectors are here to make it easy to look up which biotypes are available for Ensembl without having to parse GTF and fasta files every time. See this page from Ensembl for what the biotypes mean.

Usage

ensembl_gene_biotypes
ensembl_gene_biotypes

Format

A character vector with all Ensembl gene biotypes.

Source

This vector is all the unique gene biotypes in the Ensembl version 99 human GTF file.

These are the column names of the `mcols` when the Ensembl GTF file is read into R as a `GRanges`, including `gene_id`, `transcript_id`, `biotype`, `description`, and so on, and the mandatory tags like `ID`, `Name`, and `Parent`.

Description

These are the column names of the mcols when the Ensembl GTF file is read into R as a GRanges, including gene_id, transcript_id, biotype, description, and so on, and the mandatory tags like ID, Name, and Parent.

Usage

ensembl_gff_mcols
ensembl_gff_mcols

Format

A character vector

Source

Ensembl version 99 human GFF3 file

Tags in the attributes field of Ensembl GTF files

Description

These are the column names of the mcols when the Ensembl GTF file is read into R as a GRanges, including gene_id, transcript_id, gene_biotype, transcript_biotype, description, and so on.

Usage

ensembl_gtf_mcols
ensembl_gtf_mcols

Format

A character vector

Source

Ensembl version 99 human GTF file

Transcript biotypes from Ensembl

Description

Usage

ensembl_tx_biotypes
ensembl_tx_biotypes

Format

A character vector with all Ensembl transcript biotypes.

Source

This vector is all the unique transcript biotypes in the Ensembl version 99 human GTF file.

Get flanked intronic ranges

Description

Get flanked intronic ranges

Usage

get_intron_flanks(grl, L, get_junctions)
get_intron_flanks(grl, L, get_junctions)

Arguments

`grl`	A `CompressedGRangesList` for exonic ranges, each element for one transcript.
`L`	Read length.
`get_junctions`	Logical, whether to also return exon-exon junctions.

Value

If get_junctions is FALSE, then a GRanges object with ranges for flanked intronic regions. If get_junctions is TRUE, then in addition to the flanked intronic ranges, a CompressedGRangesList with exon-exon junction ranges and ranges for transcripts without introns.

Plot the transposed knee plot and inflection point

Description

Plot a transposed knee plot, showing the inflection point and the number of remaining cells after inflection point filtering. It's transposed since it's more generalizable to multi-modal data.

Usage

get_knee_df(mat)

get_inflection(df, lower = 100)

knee_plot(df, inflection)
get_knee_df(mat)

get_inflection(df, lower = 100)

knee_plot(df, inflection)

Arguments

`mat`	Gene count matrix, a dgCMatrix.
`df`	The data frame from `get_knee_df`.
`lower`	Minimum total UMI counts for barcode for it to be considered when calculating the inflection point; this helps to avoid the noisy part of the curve for barcodes with very few counts.
`inflection`	Output of `get_inflection`.

Value

get_knee_df returns a tibble with two columns: total for total UMI counts for each barcode, and rank for rank of the total counts, with number 1 for the barcode with the most counts.

get_inflection returns a numeric(1) for the total UMI count at the inflection point.

knee_plot returns a ggplot2 object.

Note

Code in part adapted from barcodeRanks from DropetUtils.

Examples

# Download dataset already in BUS format
library(TENxBUSData)
TENxBUSData(".", dataset = "hgmm100")
tr2g <- transcript2gene(c("Homo sapiens", "Mus musculus"), 
  type = "vertebrate",
  ensembl_version = 99, kallisto_out_path = "./out_hgmm100")
m <- make_sparse_matrix("./out_hgmm100/output.sorted.txt",
  tr2g = tr2g, est_ncells = 1e5,
  est_ngenes = nrow(tr2g), TCC = FALSE)
df <- get_knee_df(m)
infl <- get_inflection(df)
knee_plot(df, infl)
# Clean up files from the example
unlink("out_hgmm100")
# Download dataset already in BUS format
library(TENxBUSData)
TENxBUSData(".", dataset = "hgmm100")
tr2g <- transcript2gene(c("Homo sapiens", "Mus musculus"), 
  type = "vertebrate",
  ensembl_version = 99, kallisto_out_path = "./out_hgmm100")
m <- make_sparse_matrix("./out_hgmm100/output.sorted.txt",
  tr2g = tr2g, est_ncells = 1e5,
  est_ngenes = nrow(tr2g), TCC = FALSE)
df <- get_knee_df(m)
infl <- get_inflection(df)
knee_plot(df, infl)
# Clean up files from the example
unlink("out_hgmm100")

Get files required for RNA velocity with bustools

Description

Computation of RNA velocity requires the number of unspliced transcripts, which can be quantified with reads containing intronic sequences. This function extracts intronic sequences flanked by L-1 bases of exonic sequences where L is the biological read length of the single cell technology of interest. The flanking exonic sequences are included for reads partially mapping to an intron and an exon.

Usage

get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome = NULL,
  out_path = ".",
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  ...
)

## S4 method for signature 'GRanges'
get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome = NULL,
  out_path = ".",
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  transcript_version = "transcript_version",
  gene_version = "gene_version",
  version_sep = ".",
  transcript_biotype_col = "transcript_biotype",
  gene_biotype_col = "gene_biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  save_filtered_gtf = FALSE
)

## S4 method for signature 'character'
get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome = NULL,
  out_path = ".",
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  is_circular = NULL,
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  transcript_version = "transcript_version",
  gene_version = "gene_version",
  version_sep = ".",
  transcript_biotype_col = "transcript_biotype",
  gene_biotype_col = "gene_biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  save_filtered_gtf = FALSE
)

## S4 method for signature 'TxDb'
get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome,
  out_path,
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  chrs_only = TRUE
)

## S4 method for signature 'EnsDb'
get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome,
  out_path,
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  use_transcript_version = TRUE,
  use_gene_version = TRUE,
  transcript_biotype_col = "TXBIOTYPE",
  gene_biotype_col = "GENEBIOTYPE",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE
)
get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome = NULL,
  out_path = ".",
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  ...
)

## S4 method for signature 'GRanges'
get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome = NULL,
  out_path = ".",
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  transcript_version = "transcript_version",
  gene_version = "gene_version",
  version_sep = ".",
  transcript_biotype_col = "transcript_biotype",
  gene_biotype_col = "gene_biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  save_filtered_gtf = FALSE
)

## S4 method for signature 'character'
get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome = NULL,
  out_path = ".",
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  is_circular = NULL,
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  transcript_version = "transcript_version",
  gene_version = "gene_version",
  version_sep = ".",
  transcript_biotype_col = "transcript_biotype",
  gene_biotype_col = "gene_biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  save_filtered_gtf = FALSE
)

## S4 method for signature 'TxDb'
get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome,
  out_path,
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  chrs_only = TRUE
)

## S4 method for signature 'EnsDb'
get_velocity_files(
  X,
  L,
  Genome,
  Transcriptome,
  out_path,
  style = c("genome", "Ensembl", "UCSC", "NCBI", "other"),
  isoform_action = c("separate", "collapse"),
  exon_option = c("full", "junction"),
  compress_fa = FALSE,
  width = 80L,
  use_transcript_version = TRUE,
  use_gene_version = TRUE,
  transcript_biotype_col = "TXBIOTYPE",
  gene_biotype_col = "GENEBIOTYPE",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE
)

Arguments

`X`	Gene annotation with transcript and exon information. It can be a path to a GTF file with annotation of exon coordinates of transcripts, preferably from Ensembl. In the metadata, the following fields are required: type (e.g. whether the range of interest is a gene or transcript or exon or CDS), gene ID, and transcript ID. These fields need not to have standard names, as long as their names are specified in arguments of this function. It can also be a `TxDb` object, such as from the Bioconductor package `TxDb.Hsapiens.UCSC.hg38.knownGene`. It can also be a `EnsDb` object.
`L`	Length of the biological read. For instance, 10xv1: 98 nt, 10xv2: 98 nt, 10xv3: 91 nt, Drop-seq: 50 nt. If in doubt check read length in a fastq file for biological reads with the `bash` commands: If the fastq file is gzipped, then do `⁠zcat your_file.fastq.gz \| head⁠` on Linux. If on Mac, then `zcat < your_file.fastq.gz \| head`. Then you will see lines with nucleotide bases. Copy one of those lines and determine its length with `str_length` in R or `⁠echo -n <the sequence> \| wc -c⁠` in `bash`. Which file corresponds to biological reads depends on the particular technology.
`Genome`	Either a `BSgenome` or a `XStringSet` object of genomic sequences, where the intronic sequences will be extracted from. Use `genomeStyles` to check which styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then the style of chromosome names in the genome and annotation should be manually set to be consistent.
`Transcriptome`	A `XStringSet`, a path to a fasta file (can be gzipped) of the transcriptome which contains sequences of spliced transcripts, or `NULL`. The transcriptome here will be concatenated with the intronic sequences to give one fasta file. When `NULL`, the transriptome sequences will be extracted from the genome given the gene annotation, so it will be guaranteed that transcript IDs in the transcriptome and in the annotation match. Otherwise, the type of transcript ID in the transcriptome must match that in the gene annotation supplied via argument `X`.
`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.
`style`	Formatting of chromosome names. Use `genomeStyles` to check which styles are supported for your organism of interest and what those styles look like. This can also be a style supported for your organism different from the style used by the annotation and the genome. Then this style will be used for both the annotation and the genome. Can take the following values: genome If style of the annnotation is different from that of the genome, then the style of the genome will be used. other Custom style, need to manually ensure that the style in annotation matches that of the genome. Ensembl Or `UCSC` or `NCBI`, whichever is supported by your species of interest.
`isoform_action`	Character, indicating action to take with different transcripts of the same gene. Must be one of the following: collapse First, the union of all exons of different transcripts of a gene will be taken. Then the introns will be inferred from this union. Only the flanked intronic sequences are affected; isoforms will always be taken into account for spliced sequences or exon-exon junctions. separate Introns from different transcripts will be kept separate.
`exon_option`	Character, indicating how exonic sequences should be included in the kallisto index. Must be one of the following: full The full cDNA sequences, which include the full exonic sequences, will be used. This is the default. junction Only the exon-exon junctions, with L-1 bases on each side of the junctions, will be used.
`compress_fa`	Logical, whether to compress the output fasta file. If `TRUE`, then the fasta file will be gzipped.
`width`	Maximum number of letters per line of sequence in the output fasta file. Must be an integer.
`...`	Extra arguments for methods.
`transcript_id`	Character vector of length 1. Tag in `attribute` field corresponding to transcript IDs. This argument must be supplied and cannot be `NA` or `NULL`. Will throw error if tag indicated in this argument does not exist.
`gene_id`	Character vector of length 1. Tag in `attribute` field corresponding to gene IDs. This argument must be supplied and cannot be `NA` or `NULL`. Note that this is different from gene symbols, which do not have to be unique. This can be Ensembl or Entrez IDs. However, if the gene symbols are in fact unique for each gene, you may supply the tag for human readable gene symbols to this argument. Will throw error if tag indicated in this argument does not exist. This is typically "gene_id" for annotations from Ensembl and "gene" for refseq.
`transcript_version`	Character vector of length 1. Tag in `attribute` field corresponding to transcript version number. If your GTF file does not include transcript version numbers, or if you do not wish to include the version number, then use `NULL` for this argument. To decide whether to include transcript version number, check whether version numbers are included in the `transcripts.txt` in the `kallisto` output directory. If that file includes version numbers, then trannscript version numbers must be included here as well. If that file does not include version numbers, then transcript version numbers must not be included here.
`gene_version`	Character vector of length 1. Tag in `attribute` field corresponding to gene version number. If your GTF file does not include gene version numbers, or if you do not wish to include the version number, then use `NULL` for this argument. Unlike transcript version number, it's up to you whether to include gene version number.
`version_sep`	Character to separate bewteen the main ID and the version number. Defaults to ".", as in Ensembl.
`transcript_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to transcript biotype.
`gene_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to gene biotype.
`transcript_biotype_use`	Character, can be "all" or a vector of transcript biotypes to be used. Transcript biotypes aren't entirely the same as gene biotypes. For instance, in Ensembl annotation, `retained_intron` is a transcript biotype, but not a gene biotype. If "cellranger", then a warning will be given. See `data("ensembl_tx_biotypes")` for all available transcript biotypes from Ensembl.
`gene_biotype_use`	Character, can be "all", "cellranger", or a vector of gene biotypes to be used. If "cellranger", then the biotypes used by Cell Ranger's reference are used. See `data("cellranger_biotypes")` for gene biotypes the Cell Ranger reference uses. See `data("ensembl_gene_biotypes")` for all available gene biotypes from Ensembl. Note that gene biotypes and transcript biotypes are not always the same.
`chrs_only`	Logical, whether to include chromosomes only, for GTF and GFF files can contain annotations for scaffolds, which are not incorporated into chromosomes. This will also exclude haplotypes. Defaults to `TRUE`. Only applicable to species found in `genomeStyles()`.
`save_filtered_gtf`	Logical. If filtering type, biotypes, and/or chromosomes, whether to save the filtered `GRanges` as a GTF file.
`is_circular`	Logical vector of the same length as the number of sequences in the annotation and with the same names as the sequences, indicating whether the sequence is circular. If `NULL`, then all sequences will be assumed to be linear.
`use_transcript_version`	Logical, whether to include version number in the Ensembl transcript ID.
`use_gene_version`	Logical, whether to include version number in the Ensembl gene ID. Unlike transcript version number, it's up to you whether to include gene version number.

Value

The following files will be written to disk in the directory out_path:

cDNA_introns.fa: A fasta file containing both the spliced transcripts and the flanked intronic sequences. The intronic sequences are flanked by L-1 nt of exonic sequences to capture reads from nascent transcript partially mapping to exons. If the exon is shorter than 2*(L-1) nt, then the entire exon will be included in the intronic sequence. This will be used to build the kallisto index.
cDNA_tx_to_capture.txt: A text file of transcript IDs of spliced transcripts. If exon_option == "junction", then IDs of the exon-exon junctions. These IDs will have the pattern "transcript ID"-Jx, where x is a number differentiating between different junctions of the same transcript. Here x will always be ordered from 5' to 3' as on the plus strand.
introns_tx_to_capture.txt: A text file of IDs of introns. The names will have the pattern "transcript ID"-Ix, where x is a number differentiating between introns of the same transcript. If all transcripts of the same gene are collapsed before inferring intronic sequences, gene ID will be used in place of transcript ID. Here x will always be ordered from 5' to 3' as on the plus strand.
tr2g.txt: A text file with two columns matching transcripts and introns to genes. The first column is transcript or intron ID, and the second column is the corresponding gene ID. The part for transcripts are generated from the gene annotation supplied.

Nothing is returned into the R session.

Examples

# Use toy example
toy_path <- system.file("testdata", package = "BUSpaRse")
file <- paste0(toy_path, "/velocity_annot.gtf")
genome <- Biostrings::readDNAStringSet(paste0(toy_path, "/velocity_genome.fa"))
transcriptome <- paste0(toy_path, "/velocity_tx.fa")
get_velocity_files(file, 11, genome, transcriptome, ".",
  gene_version = NULL, transcript_version = NULL)
# Clean up output of the example
file.remove("cDNA_introns.fa", "cDNA_tx_to_capture.txt",
            "introns_tx_to_capture.txt", "tr2g.txt")
# Use toy example
toy_path <- system.file("testdata", package = "BUSpaRse")
file <- paste0(toy_path, "/velocity_annot.gtf")
genome <- Biostrings::readDNAStringSet(paste0(toy_path, "/velocity_genome.fa"))
transcriptome <- paste0(toy_path, "/velocity_tx.fa")
get_velocity_files(file, 11, genome, transcriptome, ".",
  gene_version = NULL, transcript_version = NULL)
# Clean up output of the example
file.remove("cDNA_introns.fa", "cDNA_tx_to_capture.txt",
            "introns_tx_to_capture.txt", "tr2g.txt")

Convert the Output of `kallisto bus` into Gene by Gell Matrix

Description

This function takes the output file of kallisto bus, after being sorted and converted into text with bustools. See vignettes on the website of this package for a tutorial. The bustools output has 4 columns: barcode, UMI, equivalence class, and counts. This function converts that file into a sparse matrix that can be used in downstream analyses.

Usage

make_sparse_matrix(
  bus_path,
  tr2g,
  est_ncells,
  est_ngenes,
  whitelist = NULL,
  gene_count = TRUE,
  TCC = TRUE,
  single_gene = TRUE,
  verbose = TRUE,
  progress_unit = 5e+06
)
make_sparse_matrix(
  bus_path,
  tr2g,
  est_ncells,
  est_ngenes,
  whitelist = NULL,
  gene_count = TRUE,
  TCC = TRUE,
  single_gene = TRUE,
  verbose = TRUE,
  progress_unit = 5e+06
)

Arguments

`bus_path`	Path to the sorted text `bus` output file.
`tr2g`	A Data frame with columns `gene` and `transcript`, in the same order as in the transcriptome index for `kallisto`. This argument can be missing or is ignored if only the TCC matrix, not the gene count matrix, is made.
`est_ncells`	Estimated number of cells; providing this argument will speed up computation as it minimizes memory reallocation as vectors grow.
`est_ngenes`	Estimated number of genes or equivalence classes.
`whitelist`	A character vector with valid cell barcodes. This is an optional argument, that defaults to `NULL`. When it is `NULL`, all cell barcodes present that have some UMI assignable to genes or ECs will be included in the sparse matrix whether they are known to be valid or not. Barcodes with only UMIs that are not assignable to genes or ECs will still be excluded.
`gene_count`	Logical, whether the gene count matrix should be returned.
`TCC`	Logical, whether the TCC matrix should be returned.
`single_gene`	Logical, whether to use single gene mode. In single gene mode, only UMIs that can be uniquely mapped to one gene are kept. Without single gene mode, UMIs mapped to multiple genes will be evenly distributed to those genes.
`verbose`	Whether to display progress.
`progress_unit`	How many iteration to print one progress update when reading in the `kallisto bus` file.

Details

This function can generate both the gene count matrix and the transcript compatibility count (TCC) matrix. The TCC matrix has barcodes in the columns and equivalence classes in the rows. See Ntranos et al. 2016 for more information about the RCC matrix.

For 10x data sets, you can find a barcode whitelist file that comes with CellRanger installation. You don't need to run CellRanger to get that. An example path to get the whitelist file is cellranger-2.1.0/cellranger-cs/2.1.0/lib/python/cellranger/barcodes/737K-august-2016.txt for v2 chemistry.

Value

If both gene count and TCC matrices are returned, then this function returns a list with two matrices, each with genes/equivalence classes in the rows and barcodes in the columns. If only one of gene count and TCC matrices is returned, then a dgCMatrix with genes/equivalence classes in the rows and barcodes in the columns. These matrices are unfiltered. Please filter the empty droplets before downstream analysis.

Examples

# Load toy example for testing
toy_path <- system.file("testdata", package = "BUSpaRse")
load(paste(toy_path, "toy_example.RData", sep = "/"))
out_fn <- paste0(toy_path, "/output.sorted.txt")
# With whitelist
m <- make_sparse_matrix(out_fn, tr2g_toy, 10, 3, whitelist = whitelist,
  gene_count = TRUE, TCC = FALSE, single_gene = TRUE,
  verbose = FALSE)
# Load toy example for testing
toy_path <- system.file("testdata", package = "BUSpaRse")
load(paste(toy_path, "toy_example.RData", sep = "/"))
out_fn <- paste0(toy_path, "/output.sorted.txt")
# With whitelist
m <- make_sparse_matrix(out_fn, tr2g_toy, 10, 3, whitelist = whitelist,
  gene_count = TRUE, TCC = FALSE, single_gene = TRUE,
  verbose = FALSE)

Match chromosome naming styles of annotation and genome

Description

Internal use. This function matches chromosome naming styles. It will also give the genome and the annotation the same genome slot. This function assumes that the annotation and the genome refer to the same version of genome. If more than one style, then the first element will be used.

Usage

match_style(Genome, annot, style)
match_style(Genome, annot, style)

Arguments

Genome

annot

Genome annotation, an object of a class with a seqlevels method, such as GRanges, TxDb, and EnsDb.

style

Formatting of chromosome names. Use genomeStyles to check which styles are supported for your organism of interest and what those styles look like. This can also be a style supported for your organism different from the style used by the annotation and the genome. Then this style will be used for both the annotation and the genome. Can take the following values:

genome: If style of the annnotation is different from that of the genome, then the style of the genome will be used.
other: Custom style, need to manually ensure that the style in annotation matches that of the genome.
Ensembl: Or UCSC or NCBI, whichever is supported by your species of interest.

Value

A list of two. The first element is the genome with the proper style, and the second element is the annotation with the proper style.

Read matrix along with barcode and gene names

Description

This function takes in a directory and name and reads the mtx file, genes, and barcodes from the output of bustools to return a sparse matrix with column names and row names.

Usage

read_count_output(dir, name, tcc = FALSE)
read_count_output(dir, name, tcc = FALSE)

Arguments

`dir`	Directory with the bustools count outputs.
`name`	The files in the output directory should be "name".mtx, "name".genes.txt, and "name".barcodes.txt.
`tcc`	Logical, whether the matrix of interest is a TCC matrix. Defaults to `FALSE`.

Value

A dgCMatrix with barcodes as column names and genes as row names.

Examples

# Internal toy data used for unit testing
toy_path <- system.file("testdata", package = "BUSpaRse")
m <- read_count_output(toy_path, name = "genes", tcc = FALSE)
# Internal toy data used for unit testing
toy_path <- system.file("testdata", package = "BUSpaRse")
m <- read_count_output(toy_path, name = "genes", tcc = FALSE)

Read intronic and exonic matrices into R

Description

Read intronic and exonic matrices into R

Usage

read_velocity_output(spliced_dir, unspliced_dir, spliced_name, unspliced_name)
read_velocity_output(spliced_dir, unspliced_dir, spliced_name, unspliced_name)

Arguments

`spliced_dir`	Directory with `mtx` file for UMI counts of spliced transcripts.
`unspliced_dir`	Directory with `mtx` file for UMI counts of unspliced transcripts.
`spliced_name`	The files in the splicedd directory should be <spliced_name>.mtx, <spliced_name>.genes.txt, and <spliced_name>.barcodes.txt.
`unspliced_name`	The files in the unsplicedd directory should be <unspliced_name>.mtx, <unspliced_name>.genes.txt, and <unspliced_name>.barcodes.txt.

Value

A list of two dgCMatrix with barcodes as column names and genes as row names. The elements of the list will be spliced and unspliced.

Examples

# Internal toy data used for unit testing
toy_path <- system.file("testdata", package = "BUSpaRse")
m <- read_velocity_output(toy_path, toy_path,
  spliced_name = "genes",
  unspliced_name = "genes")
# Internal toy data used for unit testing
toy_path <- system.file("testdata", package = "BUSpaRse")
m <- read_velocity_output(toy_path, toy_path,
  spliced_name = "genes",
  unspliced_name = "genes")

Tags in the attributes field of RefSeq GFF files

Description

These are the column names of the mcols when the Ensembl GTF file is read into R as a GRanges, including gene, transcript_id, gene_biotype, description, and so on, and the mandatory tags like ID, Name, and Parent.

Usage

refseq_gff_mcols
refseq_gff_mcols

Format

A character vector

Source

Ensembl version 99 human GTF file

Save transcript to gene file for use in `bustools`

Description

This function saves the transcript to gene data frame generated by this package in whatever means in a format required by bustools. In order to use bustools to generate the gene count or TCC matrix, a file that maps transcripts to genes is required. This should be a tsv file with 2 columns: the first column for transcript ID and the second for gene ID. The order of transcripts in this file must be the same as the order in the kallisto index, and this ordering can be ensured by the function sort_tr2g. There must also be no headers. All columns other than transcript and gene will be discarded. To save a file with those columns, directly save the transcript to gene data frame with function like write.table, readr::write_delim.

Usage

save_tr2g_bustools(tr2g, file_save = "./tr2g.tsv")
save_tr2g_bustools(tr2g, file_save = "./tr2g.tsv")

Arguments

`tr2g`	The data frame output from the `tr2g_*` family of functions.
`file_save`	File name of the file to be saved. The directory in which the file is to be saved must exist.

Value

Nothing is returned into the R session. A tsv file of the format required by bustools with the name and directory specified will be written to disk.

Note

This function has been superseded by the new version of tr2g_* functions that can extract transcriptome for only the biotypes specified and with only the standard chromosomes. The new version of tr2g_* functions also sorts the transcriptome so the tr2g and the transcriptome have transcripts in the same order, and write the tr2g.tsv file in the bustools format.

Examples

toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- paste(toy_path, "gtf_test.gtf", sep = "/")
tr2g <- tr2g_gtf(file = file_use, get_transcriptome = FALSE,
  write_tr2g = FALSE, save_filtered_gtf = FALSE)
save_tr2g_bustools(tr2g, file_save = "./tr2g.tsv")
# Clean up files from the example
file.remove("tr2g.tsv")
toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- paste(toy_path, "gtf_test.gtf", sep = "/")
tr2g <- tr2g_gtf(file = file_use, get_transcriptome = FALSE,
  write_tr2g = FALSE, save_filtered_gtf = FALSE)
save_tr2g_bustools(tr2g, file_save = "./tr2g.tsv")
# Clean up files from the example
file.remove("tr2g.tsv")

Sort transcripts to the same order as in kallisto index

Description

This function takes the data frame output from the tr2g_* family of functions in this package as the input, and sorts it so the transcripts are in the same order as in the kallisto index used to generate the bus file. Sorting is vital to obtain the correct sparse matrix from the bus file as equivalence class notations are based on the index of transcripts in the kallisto index.

Usage

sort_tr2g(tr2g, file, kallisto_out_path)
sort_tr2g(tr2g, file, kallisto_out_path)

Arguments

`tr2g`	The data frame output from the `tr2g_*` family of functions.
`file`	Character vector of length 1, path to a tsv file with transcript IDs and the corresponding gene IDs, in the format required for `bustools`, or written by `save_tr2g_bustools`.
`kallisto_out_path`	Character vector of length 1, path to the directory for the outputs of kallisto bus.

Details

Since the attribute field of GTF and GFF3 files varies across sources, output from tr2g_gtf and tr2g_gff3 may need further clean up. You may also supply gene and transcript IDs from other sources. This function should be used after the clean up, when the transcript IDs in the cleaned up data frame have the same format as those in transcript

Value

A data frame with columns transcript and gene and the other columns present in tr2g or the data frame in file, with the transcript IDs sorted to be in the same order as in the kallisto index.

Note

Examples

toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- paste(toy_path, "gtf_test.gtf", sep = "/")
tr2g <- tr2g_gtf(file = file_use, get_transcriptome = FALSE,
  write_tr2g = FALSE, save_filtered_gtf = FALSE, transcript_version = NULL)
tr2g <- sort_tr2g(tr2g, kallisto_out_path = toy_path)
toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- paste(toy_path, "gtf_test.gtf", sep = "/")
tr2g <- tr2g_gtf(file = file_use, get_transcriptome = FALSE,
  write_tr2g = FALSE, save_filtered_gtf = FALSE, transcript_version = NULL)
tr2g <- sort_tr2g(tr2g, kallisto_out_path = toy_path)

Convert Latin species name to dataset name

Description

This function converts Latin species name to a dataset name in biomart to query gene and transcript ID.

Usage

species2dataset(
  species,
  type = c("vertebrate", "metazoa", "plant", "fungus", "protist")
)
species2dataset(
  species,
  type = c("vertebrate", "metazoa", "plant", "fungus", "protist")
)

Arguments

`species`	A character vector of Latin names of species present in this scRNA-seq dataset. This is used to retrieve Ensembl information from biomart.
`type`	A character vector indicating the type of each species. Each element must be one of "vertebrate", "metazoa", "plant", "fungus", and "protist". If length is 1, then this type will be used for all species specified here. Can be missing if `fasta_file` is specified.

Value

The appropriate dataset name for biomart.

Examples

species2dataset(species = "Homo sapiens")
species2dataset(species = "Homo sapiens")

Standardize GRanges field names

Description

To avoid introducing rlang as another dependency for tidyeval. This function will also convert exon numbers to integer.

Usage

standardize_tags(gr, gene_id, transcript_id)
standardize_tags(gr, gene_id, transcript_id)

Arguments

`gr`	A `GRanges` object.
`gene_id`	Name of the metadata field for gene ID.
`transcript_id`	Name of the metadata field for transcript ID.

Value

A GRanges object with standardized names: gene ID as gene_id, and transcript ID as transcript_id.

Remove chromosomes in anotation absent from genome

Description

Remove chromosomes in anotation absent from genome

Usage

sub_annot(chrs_use, annot)
sub_annot(chrs_use, annot)

Arguments

`chrs_use`	Character vector of names of chromosomes present in both the annotation and the genome.
`annot`	Either a `GRanges` object or a `TxDb` object for gene annotation.

Value

A subsetted genome annotation of the same type ofo the input genome annotation.

Subset genome annotation

Description

Exclude chromosomes present in the annotation but absent from the genome and add information about circular chromosomes.

Usage

subset_annot(Genome, annot)

## S4 method for signature 'DNAStringSet'
subset_annot(Genome, annot)

## S4 method for signature 'BSgenome'
subset_annot(Genome, annot)
subset_annot(Genome, annot)

## S4 method for signature 'DNAStringSet'
subset_annot(Genome, annot)

## S4 method for signature 'BSgenome'
subset_annot(Genome, annot)

Arguments

Genome

annot

Either a GRanges object or a TxDb object for gene annotation.

Value

A subsetted genome annotation of the same type of the input genome annotation.

Examples

library(BSgenome.Hsapiens.UCSC.hg38)
library(EnsDb.Hsapiens.v86) 
library(GenomeInfoDb)
gn <- BSgenome.Hsapiens.UCSC.hg38
seqlevelsStyle(gn) <- "Ensembl"
subset_annot(gn, EnsDb.Hsapiens.v86)

library(BSgenome.Hsapiens.UCSC.hg38)
library(EnsDb.Hsapiens.v86) 
library(GenomeInfoDb)
gn <- BSgenome.Hsapiens.UCSC.hg38
seqlevelsStyle(gn) <- "Ensembl"
subset_annot(gn, EnsDb.Hsapiens.v86)

Get transcript and gene info from EnsDb objects

Description

Bioconductor provides Ensembl genome annotation in AnnotationHub; older versions of Ensembl annotation can be obtained from packages like EnsDb.Hsapiens.v86. This is an alternative to querying Ensembl with biomart; Ensembl's server seems to be less stable than that of Bioconductor. However, more information and species are available on Ensembl biomart than on AnnotationHub.

Usage

tr2g_EnsDb(
  ensdb,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  other_attrs = NULL,
  use_gene_name = TRUE,
  use_transcript_version = TRUE,
  use_gene_version = TRUE,
  transcript_biotype_col = "TXBIOTYPE",
  gene_biotype_col = "GENEBIOTYPE",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  compress_fa = FALSE,
  overwrite = FALSE
)
tr2g_EnsDb(
  ensdb,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  other_attrs = NULL,
  use_gene_name = TRUE,
  use_transcript_version = TRUE,
  use_gene_version = TRUE,
  transcript_biotype_col = "TXBIOTYPE",
  gene_biotype_col = "GENEBIOTYPE",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  compress_fa = FALSE,
  overwrite = FALSE
)

Arguments

`ensdb`	Ann `EnsDb` object, such as from `AnnotationHub` or `EnsDb.Hsapiens.v86`.
`Genome`	Either a `BSgenome` or a `XStringSet` object of genomic sequences, where the intronic sequences will be extracted from. Use `genomeStyles` to check which styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then the style of chromosome names in the genome and annotation should be manually set to be consistent.
`get_transcriptome`	Logical, whether to extract transcriptome from genome with the GTF file. If filtering biotypes or chromosomes, the filtered `GRanges` will be used to extract transcriptome.
`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.
`write_tr2g`	Logical, whether to write tr2g to disk. If `TRUE`, then a file `tr2g.tsv` will be written into `out_path`.
`other_attrs`	Character vector. Other attributes to get from the `EnsDb` object, such as gene symbol and position on the genome. Use `columns` to see which attributes are available.
`use_gene_name`	Logical, whether to get gene names.
`use_transcript_version`	Logical, whether to include version number in the Ensembl transcript ID. To decide whether to include transcript version number, check whether version numbers are included in the `transcripts.txt` in the `kallisto` output directory. If that file includes version numbers, then trannscript version numbers must be included here as well. If that file does not include version numbers, then transcript version numbers must not be included here.
`use_gene_version`	Logical, whether to include version number in the Ensembl gene ID. Unlike transcript version number, it's up to you whether to include gene version number.
`transcript_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to transcript biotype.
`gene_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to gene biotype.
`transcript_biotype_use`	Character, can be "all" or a vector of transcript biotypes to be used. Transcript biotypes aren't entirely the same as gene biotypes. For instance, in Ensembl annotation, `retained_intron` is a transcript biotype, but not a gene biotype. If "cellranger", then a warning will be given. See `data("ensembl_tx_biotypes")` for all available transcript biotypes from Ensembl.
`gene_biotype_use`	Character, can be "all", "cellranger", or a vector of gene biotypes to be used. If "cellranger", then the biotypes used by Cell Ranger's reference are used. See `data("cellranger_biotypes")` for gene biotypes the Cell Ranger reference uses. See `data("ensembl_gene_biotypes")` for all available gene biotypes from Ensembl. Note that gene biotypes and transcript biotypes are not always the same.
`chrs_only`	Logical, whether to include chromosomes only, for GTF and GFF files can contain annotations for scaffolds, which are not incorporated into chromosomes. This will also exclude haplotypes. Defaults to `TRUE`. Only applicable to species found in `genomeStyles()`.
`compress_fa`	Logical, whether to compress the output fasta file. If `TRUE`, then the fasta file will be gzipped.
`overwrite`	Logical, whether to overwrite if files with names of outputs written to disk already exist.

Value

A data frame with at least 2 columns: gene for gene ID, transcript for transcript ID, and optionally gene_name for gene names. If other_attrs has been specified, then those will also be columns in the data frame returned.

Examples

library(EnsDb.Hsapiens.v86)
tr2g_EnsDb(EnsDb.Hsapiens.v86, get_transcriptome = FALSE, write_tr2g = FALSE,
 use_transcript_version = FALSE,
 use_gene_version = FALSE)
library(EnsDb.Hsapiens.v86)
tr2g_EnsDb(EnsDb.Hsapiens.v86, get_transcriptome = FALSE, write_tr2g = FALSE,
 use_transcript_version = FALSE,
 use_gene_version = FALSE)

Get transcript and gene info from Ensembl

Description

This function queries Ensembl biomart to convert transcript IDs to gene IDs.

Usage

tr2g_ensembl(
  species,
  type = c("vertebrate", "metazoa", "plant", "fungus", "protist"),
  out_path = ".",
  write_tr2g = TRUE,
  other_attrs = NULL,
  use_gene_name = TRUE,
  use_transcript_version = TRUE,
  use_gene_version = TRUE,
  transcript_biotype_col = "transcript_biotype",
  gene_biotype_col = "gene_biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  ensembl_version = NULL,
  overwrite = FALSE,
  verbose = TRUE,
  ...
)
tr2g_ensembl(
  species,
  type = c("vertebrate", "metazoa", "plant", "fungus", "protist"),
  out_path = ".",
  write_tr2g = TRUE,
  other_attrs = NULL,
  use_gene_name = TRUE,
  use_transcript_version = TRUE,
  use_gene_version = TRUE,
  transcript_biotype_col = "transcript_biotype",
  gene_biotype_col = "gene_biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  ensembl_version = NULL,
  overwrite = FALSE,
  verbose = TRUE,
  ...
)

Arguments

`species`	Character vector of length 1, Latin name of the species of interest.
`type`	Character, must be one of "vertebrate", "metazoa", "plant", "fungus" and "protist". Passing "vertebrate" will use the default www.ensembl.org host. Gene annotation of some common invertebrate model organisms, such as Drosophila melanogaster, are available on www.ensembl.org so for these invertebrate model organisms, "vertebrate" can be used for this argument. Passing values other than "vertebrate" will use other Ensembl hosts. For animals absent from www.ensembl.org, try "metazoa".
`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.
`write_tr2g`	Logical, whether to write tr2g to disk. If `TRUE`, then a file `tr2g.tsv` will be written into `out_path`.
`other_attrs`	Character vector. Other attributes to get from Ensembl, such as gene symbol and position on the genome. Use `listAttributes` to see which attributes are available.
`use_gene_name`	Logical, whether to get gene names.
`use_transcript_version`	Logical, whether to include version number in the Ensembl transcript ID. To decide whether to include transcript version number, check whether version numbers are included in the `transcripts.txt` in the `kallisto` output directory. If that file includes version numbers, then trannscript version numbers must be included here as well. If that file does not include version numbers, then transcript version numbers must not be included here.
`use_gene_version`	Logical, whether to include version number in the Ensembl gene ID. Unlike transcript version number, it's up to you whether to include gene version number.
`transcript_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to transcript biotype.
`gene_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to gene biotype.
`transcript_biotype_use`	Character, can be "all" or a vector of transcript biotypes to be used. Transcript biotypes aren't entirely the same as gene biotypes. For instance, in Ensembl annotation, `retained_intron` is a transcript biotype, but not a gene biotype. If "cellranger", then a warning will be given. See `data("ensembl_tx_biotypes")` for all available transcript biotypes from Ensembl.
`gene_biotype_use`	Character, can be "all", "cellranger", or a vector of gene biotypes to be used. If "cellranger", then the biotypes used by Cell Ranger's reference are used. See `data("cellranger_biotypes")` for gene biotypes the Cell Ranger reference uses. See `data("ensembl_gene_biotypes")` for all available gene biotypes from Ensembl. Note that gene biotypes and transcript biotypes are not always the same.
`chrs_only`	Logical, whether to include chromosomes only, for GTF and GFF files can contain annotations for scaffolds, which are not incorporated into chromosomes. This will also exclude haplotypes. Defaults to `TRUE`. Only applicable to species found in `genomeStyles()`.
`ensembl_version`	Integer version number of Ensembl (e.g. 94 for the October 2018 release). This argument defaults to `NULL`, which will use the current release of Ensembl. Use `listEnsemblArchives` to see the version number corresponding to the Ensembl release of a particular date. The version specified here must match the version of Ensembl where the transcriptome used to build the kallisto index was downloaded. This only works for vertebrates and the most common invertebrate model organisms like Drosophila melanogaster and C. elegans (i.e. www.ensembl.org and its mirrors), not the other Ensembl sites for plants, protists, fungi, and metazoa.
`overwrite`	Logical, whether to overwrite if files with names of outputs written to disk already exist.
`verbose`	Whether to display progress.
`...`	Othe arguments to be passed to `useMart`, such as mirror. Note that setting mirrors other than the default, e.g. uswest, does not work for archived versions.

Value

Examples

tr2g <- tr2g_ensembl(species = "Danio rerio",
other_attrs = "description", write_tr2g = FALSE)
# This will use plants.ensembl.org as host instead of www.ensembl.org
tr2g <- tr2g_ensembl(species = "Arabidopsis thaliana", type = "plant",
  write_tr2g = FALSE)
tr2g <- tr2g_ensembl(species = "Danio rerio",
other_attrs = "description", write_tr2g = FALSE)
# This will use plants.ensembl.org as host instead of www.ensembl.org
tr2g <- tr2g_ensembl(species = "Arabidopsis thaliana", type = "plant",
  write_tr2g = FALSE)

Get transcript and gene info from names in FASTA files

Description

FASTA files, such as those for cDNA and ncRNA from Ensembl, might have genome annotation information in the name of each sequence entry. This function extracts the transcript and gene IDs from such FASTA files.

Usage

tr2g_fasta(
  file,
  out_path = ".",
  write_tr2g = TRUE,
  use_gene_name = TRUE,
  use_transcript_version = TRUE,
  use_gene_version = TRUE,
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  save_filtered = TRUE,
  compress_fa = FALSE,
  overwrite = FALSE
)
tr2g_fasta(
  file,
  out_path = ".",
  write_tr2g = TRUE,
  use_gene_name = TRUE,
  use_transcript_version = TRUE,
  use_gene_version = TRUE,
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  save_filtered = TRUE,
  compress_fa = FALSE,
  overwrite = FALSE
)

Arguments

`file`	Path to the FASTA file to be read. The file can remain gzipped.
`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.
`write_tr2g`	Logical, whether to write tr2g to disk. If `TRUE`, then a file `tr2g.tsv` will be written into `out_path`.
`use_gene_name`	Logical, whether to get gene names.
`use_transcript_version`	Logical, whether to include version number in the Ensembl transcript ID. To decide whether to include transcript version number, check whether version numbers are included in the `transcripts.txt` in the `kallisto` output directory. If that file includes version numbers, then trannscript version numbers must be included here as well. If that file does not include version numbers, then transcript version numbers must not be included here.
`use_gene_version`	Logical, whether to include version number in the Ensembl gene ID. Unlike transcript version number, it's up to you whether to include gene version number.
`transcript_biotype_use`	Character, can be "all" or a vector of transcript biotypes to be used. Transcript biotypes aren't entirely the same as gene biotypes. For instance, in Ensembl annotation, `retained_intron` is a transcript biotype, but not a gene biotype. If "cellranger", then a warning will be given. See `data("ensembl_tx_biotypes")` for all available transcript biotypes from Ensembl.
`gene_biotype_use`	Character, can be "all", "cellranger", or a vector of gene biotypes to be used. If "cellranger", then the biotypes used by Cell Ranger's reference are used. See `data("cellranger_biotypes")` for gene biotypes the Cell Ranger reference uses. See `data("ensembl_gene_biotypes")` for all available gene biotypes from Ensembl. Note that gene biotypes and transcript biotypes are not always the same.
`chrs_only`	Logical, whether to include chromosomes only, for GTF and GFF files can contain annotations for scaffolds, which are not incorporated into chromosomes. This will also exclude haplotypes. Defaults to `TRUE`. Only applicable to species found in `genomeStyles()`.
`save_filtered`	If filtering for biotype and chromosomes, whether to save the filtered fasta file. If `TRUE`, the file will be `tx_filtered.fa` in `out_path`.
`compress_fa`	Logical, whether to compress the output fasta file. If `TRUE`, then the fasta file will be gzipped.
`overwrite`	Logical, whether to overwrite if files with names of outputs written to disk already exist.

Details

At present, this function only works with FASTA files from Ensembl, and uses regex to extract vertebrate Ensembl IDs. Sequence names should be formatted as follows:

ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1
gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene
gene_symbol:TRBD1 description:T cell receptor beta diversity 1
[Source:HGNC Symbol;Acc:HGNC:12158]

If your FASTA file sequence names are formatted differently, then you must extract the transcript and gene IDs by some other means. The Bioconductor package Biostrings is recommended; after reading the FASTA file into R, the sequence names can be accessed by the names function.

While normally, you should call sort_tr2g to sort the transcript IDs from the output of the tr2g_* family of functions, If the FASTA file supplied here is the same as the one used to build the kallisto index, then the transcript IDs in the output of this function are in the same order as in the kallisto index, so you can skip sort_tr2g and proceed directly to EC2gene with the output of this function.

Value

A data frame with at least 2 columns: gene for gene ID, transcript for transcript ID, and optionally gene_name for gene names.

Examples

toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- paste(toy_path, "fasta_test.fasta", sep = "/")
tr2g <- tr2g_fasta(file = file_use, save_filtered = FALSE, write_tr2g = FALSE)
toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- paste(toy_path, "fasta_test.fasta", sep = "/")
tr2g <- tr2g_fasta(file = file_use, save_filtered = FALSE, write_tr2g = FALSE)

Get transcript and gene info from GFF3 file

Description

This function reads a GFF3 file and extracts the transcript ID and corresponding gene ID. This function assumes that the GFF3 file is properly formatted. See http://gmod.org/wiki/GFF3 for a detailed description of proper GFF3 format. Note that GTF files have a somewhat different and simpler format in the attribute field, which this function does not support. See http://mblab.wustl.edu/GTF2.html for a detailed description of proper GTF format. To extract transcript and gene information from GTF files, see the function tr2g_gtf in this package. Some files bearing the .gff3 are in fact more like the GTF format. If this is so, then change the extension to .gtf and use the function tr2g_gtf in this package instead.

Usage

tr2g_gff3(
  file,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  gene_name = "Name",
  transcript_version = "version",
  gene_version = "version",
  version_sep = ".",
  transcript_biotype_col = "biotype",
  gene_biotype_col = "biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  compress_fa = FALSE,
  save_filtered_gff = TRUE,
  overwrite = FALSE,
  source = c("ensembl", "refseq")
)
tr2g_gff3(
  file,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  gene_name = "Name",
  transcript_version = "version",
  gene_version = "version",
  version_sep = ".",
  transcript_biotype_col = "biotype",
  gene_biotype_col = "biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  compress_fa = FALSE,
  save_filtered_gff = TRUE,
  overwrite = FALSE,
  source = c("ensembl", "refseq")
)

Arguments

`file`	Path to a GTF file to be read. The file can remain gzipped. Use `getGTF` from the `biomartr` package to download GTF files from Ensembl, and use `getGFF` from `biomartr` to download GFF3 files from Ensembl and RefSeq.
`Genome`	Either a `BSgenome` or a `XStringSet` object of genomic sequences, where the intronic sequences will be extracted from. Use `genomeStyles` to check which styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then the style of chromosome names in the genome and annotation should be manually set to be consistent.
`get_transcriptome`	Logical, whether to extract transcriptome from genome with the GTF file. If filtering biotypes or chromosomes, the filtered `GRanges` will be used to extract transcriptome.
`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.
`write_tr2g`	Logical, whether to write tr2g to disk. If `TRUE`, then a file `tr2g.tsv` will be written into `out_path`.
`transcript_id`	Character vector of length 1. Tag in `attribute` field corresponding to transcript IDs. This argument must be supplied and cannot be `NA` or `NULL`. Will throw error if tag indicated in this argument does not exist.
`gene_id`	Character vector of length 1. Tag in `attribute` field corresponding to gene IDs. This argument must be supplied and cannot be `NA` or `NULL`. Note that this is different from gene symbols, which do not have to be unique. This can be Ensembl or Entrez IDs. However, if the gene symbols are in fact unique for each gene, you may supply the tag for human readable gene symbols to this argument. Will throw error if tag indicated in this argument does not exist. This is typically "gene_id" for annotations from Ensembl and "gene" for refseq.
`gene_name`	Character vector of length 1. Tag in `attribute` field corresponding to gene symbols. This argument can be `NA` or `NULL` if you are fine with non-human readable gene IDs and do not wish to extract human readable gene symbols.
`transcript_version`	Character vector of length 1. Tag in `attribute` field corresponding to transcript version number. If your GTF file does not include transcript version numbers, or if you do not wish to include the version number, then use `NULL` for this argument. To decide whether to include transcript version number, check whether version numbers are included in the `transcripts.txt` in the `kallisto` output directory. If that file includes version numbers, then trannscript version numbers must be included here as well. If that file does not include version numbers, then transcript version numbers must not be included here.
`gene_version`	Character vector of length 1. Tag in `attribute` field corresponding to gene version number. If your GTF file does not include gene version numbers, or if you do not wish to include the version number, then use `NULL` for this argument. Unlike transcript version number, it's up to you whether to include gene version number.
`version_sep`	Character to separate bewteen the main ID and the version number. Defaults to ".", as in Ensembl.
`transcript_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to transcript biotype.
`gene_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to gene biotype.
`transcript_biotype_use`	Character, can be "all" or a vector of transcript biotypes to be used. Transcript biotypes aren't entirely the same as gene biotypes. For instance, in Ensembl annotation, `retained_intron` is a transcript biotype, but not a gene biotype. If "cellranger", then a warning will be given. See `data("ensembl_tx_biotypes")` for all available transcript biotypes from Ensembl.
`gene_biotype_use`	Character, can be "all", "cellranger", or a vector of gene biotypes to be used. If "cellranger", then the biotypes used by Cell Ranger's reference are used. See `data("cellranger_biotypes")` for gene biotypes the Cell Ranger reference uses. See `data("ensembl_gene_biotypes")` for all available gene biotypes from Ensembl. Note that gene biotypes and transcript biotypes are not always the same.
`chrs_only`	Logical, whether to include chromosomes only, for GTF and GFF files can contain annotations for scaffolds, which are not incorporated into chromosomes. This will also exclude haplotypes. Defaults to `TRUE`. Only applicable to species found in `genomeStyles()`.
`compress_fa`	Logical, whether to compress the output fasta file. If `TRUE`, then the fasta file will be gzipped.
`save_filtered_gff`	Logical. If filtering type, biotypes, and/or chromosomes, whether to save the filtered `GRanges` as a GFF3 file.
`overwrite`	Logical, whether to overwrite if files with names of outputs written to disk already exist.
`source`	Name of the database where this GFF3 file was downloaded. Must be either "ensembl" or "refseq".

Details

Transcript and gene versions may not be present in all GTF files, so these arguments are optional. This function has arguments for transcript and gene version numbers because Ensembl IDs have version numbers. For Ensembl IDs, we recommend including the version number, since a change in version number signals a change in the entity referred to by the ID after reannotation. If a version is used, then it will be appended to the ID, separated by version_sep.

The transcript and gene IDs are The attribute field (the last field) of GTF files can be complicated and inconsistent across different sources. Please check the attribute tags in your GTF file and consider the arguments of this function carefully. The defaults are set according to Ensembl GTF files; defaults may not work for files from other sources. Due to the general lack of standards for the attribute field, you may need to further clean up the output of this function.

Value

A data frame at least 2 columns: gene for gene ID, transcript for transcript ID, and optionally, gene_name for gene names.

Note

The defaults here are for Ensembl GFF3 files. To see all attribute tags for Ensembl and RefSeq GFF3 files, see data("ensembl_gff_mcols") and data("refseq_gff_mcols").

Examples

toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- paste(toy_path, "gff3_test.gff3", sep = "/")
# Default
tr2g <- tr2g_gff3(file = file_use, write_tr2g = FALSE,
get_transcriptome = FALSE, save_filtered_gff = FALSE)
# Excluding version numbers
tr2g <- tr2g_gff3(file = file_use, transcript_version = NULL,
  gene_version = NULL, write_tr2g = FALSE, get_transcriptome = FALSE,
  save_filtered_gff = FALSE)
toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- paste(toy_path, "gff3_test.gff3", sep = "/")
# Default
tr2g <- tr2g_gff3(file = file_use, write_tr2g = FALSE,
get_transcriptome = FALSE, save_filtered_gff = FALSE)
# Excluding version numbers
tr2g <- tr2g_gff3(file = file_use, transcript_version = NULL,
  gene_version = NULL, write_tr2g = FALSE, get_transcriptome = FALSE,
  save_filtered_gff = FALSE)

Get transcript and gene info from GRanges

Description

Internal use, for GRanges from GTF files

Usage

tr2g_GRanges(
  gr,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  gene_name = "gene_name",
  transcript_version = "transcript_version",
  gene_version = "gene_version",
  version_sep = ".",
  transcript_biotype_col = "transcript_biotype",
  gene_biotype_col = "gene_biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  compress_fa = FALSE,
  save_filtered_gtf = TRUE,
  overwrite = FALSE
)
tr2g_GRanges(
  gr,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  gene_name = "gene_name",
  transcript_version = "transcript_version",
  gene_version = "gene_version",
  version_sep = ".",
  transcript_biotype_col = "transcript_biotype",
  gene_biotype_col = "gene_biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  compress_fa = FALSE,
  save_filtered_gtf = TRUE,
  overwrite = FALSE
)

Arguments

`gr`	A `GRanges` object. The metadata columns should be atomic vectors, not lists.
`Genome`	Either a `BSgenome` or a `XStringSet` object of genomic sequences, where the intronic sequences will be extracted from. Use `genomeStyles` to check which styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then the style of chromosome names in the genome and annotation should be manually set to be consistent.
`get_transcriptome`	Logical, whether to extract transcriptome from genome with the GTF file. If filtering biotypes or chromosomes, the filtered `GRanges` will be used to extract transcriptome.
`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.
`write_tr2g`	Logical, whether to write tr2g to disk. If `TRUE`, then a file `tr2g.tsv` will be written into `out_path`.
`transcript_id`	Character vector of length 1. Tag in `attribute` field corresponding to transcript IDs. This argument must be supplied and cannot be `NA` or `NULL`. Will throw error if tag indicated in this argument does not exist.
`gene_id`	Character vector of length 1. Tag in `attribute` field corresponding to gene IDs. This argument must be supplied and cannot be `NA` or `NULL`. Note that this is different from gene symbols, which do not have to be unique. This can be Ensembl or Entrez IDs. However, if the gene symbols are in fact unique for each gene, you may supply the tag for human readable gene symbols to this argument. Will throw error if tag indicated in this argument does not exist. This is typically "gene_id" for annotations from Ensembl and "gene" for refseq.
`gene_name`	Character vector of length 1. Tag in `attribute` field corresponding to gene symbols. This argument can be `NA` or `NULL` if you are fine with non-human readable gene IDs and do not wish to extract human readable gene symbols.
`transcript_version`	Character vector of length 1. Tag in `attribute` field corresponding to transcript version number. If your GTF file does not include transcript version numbers, or if you do not wish to include the version number, then use `NULL` for this argument. To decide whether to include transcript version number, check whether version numbers are included in the `transcripts.txt` in the `kallisto` output directory. If that file includes version numbers, then trannscript version numbers must be included here as well. If that file does not include version numbers, then transcript version numbers must not be included here.
`gene_version`	Character vector of length 1. Tag in `attribute` field corresponding to gene version number. If your GTF file does not include gene version numbers, or if you do not wish to include the version number, then use `NULL` for this argument. Unlike transcript version number, it's up to you whether to include gene version number.
`version_sep`	Character to separate bewteen the main ID and the version number. Defaults to ".", as in Ensembl.
`transcript_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to transcript biotype.
`gene_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to gene biotype.
`transcript_biotype_use`	Character, can be "all" or a vector of transcript biotypes to be used. Transcript biotypes aren't entirely the same as gene biotypes. For instance, in Ensembl annotation, `retained_intron` is a transcript biotype, but not a gene biotype. If "cellranger", then a warning will be given. See `data("ensembl_tx_biotypes")` for all available transcript biotypes from Ensembl.
`gene_biotype_use`	Character, can be "all", "cellranger", or a vector of gene biotypes to be used. If "cellranger", then the biotypes used by Cell Ranger's reference are used. See `data("cellranger_biotypes")` for gene biotypes the Cell Ranger reference uses. See `data("ensembl_gene_biotypes")` for all available gene biotypes from Ensembl. Note that gene biotypes and transcript biotypes are not always the same.
`chrs_only`	Logical, whether to include chromosomes only, for GTF and GFF files can contain annotations for scaffolds, which are not incorporated into chromosomes. This will also exclude haplotypes. Defaults to `TRUE`. Only applicable to species found in `genomeStyles()`.
`compress_fa`	Logical, whether to compress the output fasta file. If `TRUE`, then the fasta file will be gzipped.
`save_filtered_gtf`	Logical. If filtering type, biotypes, and/or chromosomes, whether to save the filtered `GRanges` as a GTF file.
`overwrite`	Logical, whether to overwrite if files with names of outputs written to disk already exist.

Value

A data frame at least 2 columns: gene for gene ID, transcript for transcript ID, and optionally, gene_name for gene names.

Get transcript and gene info from GTF file

Description

This function reads a GTF file and extracts the transcript ID and corresponding gene ID. This function assumes that the GTF file is properly formatted. See http://mblab.wustl.edu/GTF2.html for a detailed description of proper GTF format. Note that GFF3 files have a somewhat different and more complicated format in the attribute field, which this function does not support. See http://gmod.org/wiki/GFF3 for a detailed description of proper GFF3 format. To extract transcript and gene information from GFF3 files, see the function tr2g_gff3 in this package.

Usage

tr2g_gtf(
  file,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  gene_name = "gene_name",
  transcript_version = "transcript_version",
  gene_version = "gene_version",
  version_sep = ".",
  transcript_biotype_col = "transcript_biotype",
  gene_biotype_col = "gene_biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  compress_fa = FALSE,
  save_filtered_gtf = TRUE,
  overwrite = FALSE
)
tr2g_gtf(
  file,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  gene_name = "gene_name",
  transcript_version = "transcript_version",
  gene_version = "gene_version",
  version_sep = ".",
  transcript_biotype_col = "transcript_biotype",
  gene_biotype_col = "gene_biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  compress_fa = FALSE,
  save_filtered_gtf = TRUE,
  overwrite = FALSE
)

Arguments

`file`	Path to a GTF file to be read. The file can remain gzipped. Use `getGTF` from the `biomartr` package to download GTF files from Ensembl, and use `getGFF` from `biomartr` to download GFF3 files from Ensembl and RefSeq.
`Genome`	Either a `BSgenome` or a `XStringSet` object of genomic sequences, where the intronic sequences will be extracted from. Use `genomeStyles` to check which styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then the style of chromosome names in the genome and annotation should be manually set to be consistent.
`get_transcriptome`	Logical, whether to extract transcriptome from genome with the GTF file. If filtering biotypes or chromosomes, the filtered `GRanges` will be used to extract transcriptome.
`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.
`write_tr2g`	Logical, whether to write tr2g to disk. If `TRUE`, then a file `tr2g.tsv` will be written into `out_path`.
`transcript_id`	Character vector of length 1. Tag in `attribute` field corresponding to transcript IDs. This argument must be supplied and cannot be `NA` or `NULL`. Will throw error if tag indicated in this argument does not exist.
`gene_id`	Character vector of length 1. Tag in `attribute` field corresponding to gene IDs. This argument must be supplied and cannot be `NA` or `NULL`. Note that this is different from gene symbols, which do not have to be unique. This can be Ensembl or Entrez IDs. However, if the gene symbols are in fact unique for each gene, you may supply the tag for human readable gene symbols to this argument. Will throw error if tag indicated in this argument does not exist. This is typically "gene_id" for annotations from Ensembl and "gene" for refseq.
`gene_name`	Character vector of length 1. Tag in `attribute` field corresponding to gene symbols. This argument can be `NA` or `NULL` if you are fine with non-human readable gene IDs and do not wish to extract human readable gene symbols.
`transcript_version`	Character vector of length 1. Tag in `attribute` field corresponding to transcript version number. If your GTF file does not include transcript version numbers, or if you do not wish to include the version number, then use `NULL` for this argument. To decide whether to include transcript version number, check whether version numbers are included in the `transcripts.txt` in the `kallisto` output directory. If that file includes version numbers, then trannscript version numbers must be included here as well. If that file does not include version numbers, then transcript version numbers must not be included here.
`gene_version`	Character vector of length 1. Tag in `attribute` field corresponding to gene version number. If your GTF file does not include gene version numbers, or if you do not wish to include the version number, then use `NULL` for this argument. Unlike transcript version number, it's up to you whether to include gene version number.
`version_sep`	Character to separate bewteen the main ID and the version number. Defaults to ".", as in Ensembl.
`transcript_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to transcript biotype.
`gene_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to gene biotype.
`transcript_biotype_use`	Character, can be "all" or a vector of transcript biotypes to be used. Transcript biotypes aren't entirely the same as gene biotypes. For instance, in Ensembl annotation, `retained_intron` is a transcript biotype, but not a gene biotype. If "cellranger", then a warning will be given. See `data("ensembl_tx_biotypes")` for all available transcript biotypes from Ensembl.
`gene_biotype_use`	Character, can be "all", "cellranger", or a vector of gene biotypes to be used. If "cellranger", then the biotypes used by Cell Ranger's reference are used. See `data("cellranger_biotypes")` for gene biotypes the Cell Ranger reference uses. See `data("ensembl_gene_biotypes")` for all available gene biotypes from Ensembl. Note that gene biotypes and transcript biotypes are not always the same.
`chrs_only`	Logical, whether to include chromosomes only, for GTF and GFF files can contain annotations for scaffolds, which are not incorporated into chromosomes. This will also exclude haplotypes. Defaults to `TRUE`. Only applicable to species found in `genomeStyles()`.
`compress_fa`	Logical, whether to compress the output fasta file. If `TRUE`, then the fasta file will be gzipped.
`save_filtered_gtf`	Logical. If filtering type, biotypes, and/or chromosomes, whether to save the filtered `GRanges` as a GTF file.
`overwrite`	Logical, whether to overwrite if files with names of outputs written to disk already exist.

Details

Value

A data frame at least 2 columns: gene for gene ID, transcript for transcript ID, and optionally, gene_name for gene names.

Examples

toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- paste(toy_path, "gtf_test.gtf", sep = "/")
# Default
tr2g <- tr2g_gtf(file = file_use, get_transcriptome = FALSE,
  write_tr2g = FALSE, save_filtered_gtf = FALSE)
# Excluding version numbers
tr2g <- tr2g_gtf(file = file_use, transcript_version = NULL,
  gene_version = NULL, get_transcriptome = FALSE,
  write_tr2g = FALSE, save_filtered_gtf = FALSE)
toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- paste(toy_path, "gtf_test.gtf", sep = "/")
# Default
tr2g <- tr2g_gtf(file = file_use, get_transcriptome = FALSE,
  write_tr2g = FALSE, save_filtered_gtf = FALSE)
# Excluding version numbers
tr2g <- tr2g_gtf(file = file_use, transcript_version = NULL,
  gene_version = NULL, get_transcriptome = FALSE,
  write_tr2g = FALSE, save_filtered_gtf = FALSE)

tr2g for exon-exon junctions

Description

tr2g for exon-exon junctions

Usage

tr2g_junction(tr2g_cdna, junction_names)
tr2g_junction(tr2g_cdna, junction_names)

Arguments

`tr2g_cdna`	The original `tr2g_cdna`.
`junction_names`	Names of junctions internally generated.

Value

A tr2g data frame where "transcripts" are the exon-exon junctions and genes are the corresponding genes.

Get transcript and gene info from TxDb objects

Description

The genome and gene annotations of some species can be conveniently obtained from Bioconductor packages. This is more convenient than downloading GTF files from Ensembl and reading it into R. In these packages, the gene annotation is stored in a TxDb object, which has standardized names for gene IDs, transcript IDs, exon IDs, and so on, which are stored in the metadata fields in GTF and GFF3 files, which are not standardized. This function extracts transcript and corresponding gene information from gene annotation stored in a TxDb object.

Usage

tr2g_TxDb(
  txdb,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  chrs_only = TRUE,
  compress_fa = FALSE,
  overwrite = FALSE
)
tr2g_TxDb(
  txdb,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  chrs_only = TRUE,
  compress_fa = FALSE,
  overwrite = FALSE
)

Arguments

`txdb`	A `TxDb` object with gene annotation.
`Genome`	Either a `BSgenome` or a `XStringSet` object of genomic sequences, where the intronic sequences will be extracted from. Use `genomeStyles` to check which styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then the style of chromosome names in the genome and annotation should be manually set to be consistent.
`get_transcriptome`	Logical, whether to extract transcriptome from genome with the GTF file. If filtering biotypes or chromosomes, the filtered `GRanges` will be used to extract transcriptome.
`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.
`write_tr2g`	Logical, whether to write tr2g to disk. If `TRUE`, then a file `tr2g.tsv` will be written into `out_path`.
`chrs_only`	Logical, whether to include chromosomes only, for GTF and GFF files can contain annotations for scaffolds, which are not incorporated into chromosomes. This will also exclude haplotypes. Defaults to `TRUE`. Only applicable to species found in `genomeStyles()`.
`compress_fa`	Logical, whether to compress the output fasta file. If `TRUE`, then the fasta file will be gzipped.
`overwrite`	Logical, whether to overwrite if files with names of outputs written to disk already exist.

Value

A data frame with 3 columns: gene for gene ID, transcript for transcript ID, and tx_id for internal transcript IDs used to avoid duplicate transcript names. For TxDb packages from Bioconductor, gene ID is Entrez ID, while transcript IDs are Ensembl IDs with version numbers for TxDb.Hsapiens.UCSC.hg38.knownGene. In some cases, the transcript ID have duplicates, and this is resolved by adding numbers to make the IDs unique.

A data frame with 3 columns: gene for gene ID, transcript for transcript ID, and gene_name for gene names. If other_attrs has been specified, then those will also be columns in the data frame returned.

Examples

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(BSgenome.Hsapiens.UCSC.hg38)
tr2g_TxDb(TxDb.Hsapiens.UCSC.hg38.knownGene, BSgenome.Hsapiens.UCSC.hg38)
# Clean up
file.remove("transcriptome.fa", "tr2g.tsv")
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(BSgenome.Hsapiens.UCSC.hg38)
tr2g_TxDb(TxDb.Hsapiens.UCSC.hg38.knownGene, BSgenome.Hsapiens.UCSC.hg38)
# Clean up
file.remove("transcriptome.fa", "tr2g.tsv")

Map Ensembl transcript ID to gene ID

Description

This function is a shortcut to get the correctly sorted data frame with transcript IDs and the corresponding gene IDs from Ensembl biomart or Ensembl transcriptome FASTA files. For biomart query, it calls tr2g_ensembl and then sort_tr2g. For FASTA files, it calls tr2g_fasta and then sort_tr2g. Unlike in tr2g_ensembl and tr2g_fasta, multiple species can be supplied if cells from different species were sequenced together. This function should only be used if the kallisto inidex was built with transcriptomes from Ensembl. Also, if querying biomart, please make sure to set ensembl_version to match the version where the transcriptomes were downloaded.

Usage

transcript2gene(
  species,
  fasta_file,
  kallisto_out_path,
  type = "vertebrate",
  ...
)
transcript2gene(
  species,
  fasta_file,
  kallisto_out_path,
  type = "vertebrate",
  ...
)

Arguments

`species`	A character vector of Latin names of species present in this scRNA-seq dataset. This is used to retrieve Ensembl information from biomart.
`fasta_file`	Character vector of paths to the transcriptome FASTA files used to build the kallisto index. Exactly one of `species` and `fasta_file` can be missing.
`kallisto_out_path`	Path to the `kallisto bus` output directory.
`type`	A character vector indicating the type of each species. Each element must be one of "vertebrate", "metazoa", "plant", "fungus", and "protist". If length is 1, then this type will be used for all species specified here. Can be missing if `fasta_file` is specified.
`...`	Other arguments passed to `tr2g_ensembl` such as `other_attrs`, `ensembl_version`, and arguments passed to `useMart`. If `fasta_files` is supplied instead of `species`, then this will be extra argumennts to `tr2g_fasta`, such as `use_transcript_version` and `use_gene_version`.

Value

A data frame with two columns: gene and transcript, with Ensembl gene and transcript IDs (with version number), in the same order as in the transcriptome index used in kallisto.

Note

Examples

# Download dataset already in BUS format
library(TENxBUSData)
TENxBUSData(".", dataset = "hgmm100")
tr2g <- transcript2gene(c("Homo sapiens", "Mus musculus"),
  type = "vertebrate", save_filtered = FALSE,
  ensembl_version = 99, kallisto_out_path = "./out_hgmm100")
# Clean up files from the example
unlink("out_hgmm100")
# Download dataset already in BUS format
library(TENxBUSData)
TENxBUSData(".", dataset = "hgmm100")
tr2g <- transcript2gene(c("Homo sapiens", "Mus musculus"),
  type = "vertebrate", save_filtered = FALSE,
  ensembl_version = 99, kallisto_out_path = "./out_hgmm100")
# Clean up files from the example
unlink("out_hgmm100")

Validate input to get_velocity_files

Description

Validate input to get_velocity_files

Usage

validate_velocity_input(
  L,
  Genome,
  Transcriptome,
  out_path,
  compress_fa,
  width,
  exon_option
)
validate_velocity_input(
  L,
  Genome,
  Transcriptome,
  out_path,
  compress_fa,
  width,
  exon_option
)

Arguments

`L`	Length of the biological read. For instance, 10xv1: 98 nt, 10xv2: 98 nt, 10xv3: 91 nt, Drop-seq: 50 nt. If in doubt check read length in a fastq file for biological reads with the `bash` commands: If the fastq file is gzipped, then do `⁠zcat your_file.fastq.gz \| head⁠` on Linux. If on Mac, then `zcat < your_file.fastq.gz \| head`. Then you will see lines with nucleotide bases. Copy one of those lines and determine its length with `str_length` in R or `⁠echo -n <the sequence> \| wc -c⁠` in `bash`. Which file corresponds to biological reads depends on the particular technology.
`Genome`	Either a `BSgenome` or a `XStringSet` object of genomic sequences, where the intronic sequences will be extracted from. Use `genomeStyles` to check which styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then the style of chromosome names in the genome and annotation should be manually set to be consistent.
`Transcriptome`	A `XStringSet`, a path to a fasta file (can be gzipped) of the transcriptome which contains sequences of spliced transcripts, or `NULL`. The transcriptome here will be concatenated with the intronic sequences to give one fasta file. When `NULL`, the transriptome sequences will be extracted from the genome given the gene annotation, so it will be guaranteed that transcript IDs in the transcriptome and in the annotation match. Otherwise, the type of transcript ID in the transcriptome must match that in the gene annotation supplied via argument `X`.
`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.
`compress_fa`	Logical, whether to compress the output fasta file. If `TRUE`, then the fasta file will be gzipped.
`width`	Maximum number of letters per line of sequence in the output fasta file. Must be an integer.
`exon_option`	Character, indicating how exonic sequences should be included in the kallisto index. Must be one of the following: full The full cDNA sequences, which include the full exonic sequences, will be used. This is the default. junction Only the exon-exon junctions, with L-1 bases on each side of the junctions, will be used.

Value

Will throw error if validation fails. Returns a named list whose first element is the normalized path to output directory, and whose second element is the normalized path to the transcriptome file if specified.

Write the files for RNA velocity to disk

Description

Write the files for RNA velocity to disk, in the specified output directory.

Usage

write_velocity_output(
  out_path,
  introns,
  Genome,
  Transcriptome,
  isoform_action,
  exon_option,
  tr2g_cdna,
  compress_fa,
  width
)
write_velocity_output(
  out_path,
  introns,
  Genome,
  Transcriptome,
  isoform_action,
  exon_option,
  tr2g_cdna,
  compress_fa,
  width
)

Arguments

`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created.
`introns`	Intronic ranges plus flanking region, returned by `get_intron_flanks`.
`Genome`	Either a `BSgenome` or a `XStringSet` object of genomic sequences, where the intronic sequences will be extracted from. Use `genomeStyles` to check which styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then the style of chromosome names in the genome and annotation should be manually set to be consistent.
`Transcriptome`	A `XStringSet`, a path to a fasta file (can be gzipped) of the transcriptome which contains sequences of spliced transcripts, or `NULL`. The transcriptome here will be concatenated with the intronic sequences to give one fasta file. When `NULL`, the transriptome sequences will be extracted from the genome given the gene annotation, so it will be guaranteed that transcript IDs in the transcriptome and in the annotation match. Otherwise, the type of transcript ID in the transcriptome must match that in the gene annotation supplied via argument `X`.
`isoform_action`	Character, indicating action to take with different transcripts of the same gene. Must be one of the following: collapse First, the union of all exons of different transcripts of a gene will be taken. Then the introns will be inferred from this union. Only the flanked intronic sequences are affected; isoforms will always be taken into account for spliced sequences or exon-exon junctions. separate Introns from different transcripts will be kept separate.
`exon_option`	Character, indicating how exonic sequences should be included in the kallisto index. Must be one of the following: full The full cDNA sequences, which include the full exonic sequences, will be used. This is the default. junction Only the exon-exon junctions, with L-1 bases on each side of the junctions, will be used.
`tr2g_cdna`	A data frame with columns `transcript` and `gene` that maps transcripts to genes for spliced transcripts.
`compress_fa`	Logical, whether to compress the output fasta file. If `TRUE`, then the fasta file will be gzipped.
`width`	Maximum number of letters per line of sequence in the output fasta file. Must be an integer.

Value

Nothing into the R session. The files are written to disk.

Package 'BUSpaRse'

Help Index

Generate RNA velocity files for GRanges

Description

Usage

Arguments

Value

Transfer information about circular chromosomes between genome and annotation

Description

Usage

Arguments

Value

Get genome annotation from Ensembl FASTA file

Description

Usage

Arguments

Value

Examples

Cell Ranger gene biotypes

Description

Usage

Format

Source

See Also

Check that an object is a character vector of length 1

Description

Usage

Arguments

Value

Check for chromosomes in genome but not annotation

Description

Usage

Arguments

Value

Check inputs to tr2g_gtf and tr2g_gff3

Description

Usage

Arguments

Value

Check that a tag is present in attribute field of GTF/GFF

Description

Usage

Arguments

Value

Check if transcript ID in transcriptome and annotation match

Description

Usage

Arguments

Value

Download transcriptome from Ensembl

Description

Usage

Arguments

Value

Examples

Map EC Index to Genes Compatible with the EC

Description

Usage

Arguments

Details

Value

See Also

Examples

Gene biotypes from Ensembl

Description

Usage

Format

Source

See Also

These are the column names of the mcols when the Ensembl GTF file is read into R as a GRanges, including gene_id, transcript_id, biotype, description, and so on, and the mandatory tags like ID, Name, and Parent.

Description

Usage

Format

Source

Tags in the attributes field of Ensembl GTF files

Description

Usage

Format

Source

Transcript biotypes from Ensembl

These are the column names of the `mcols` when the Ensembl GTF file is read into R as a `GRanges`, including `gene_id`, `transcript_id`, `biotype`, `description`, and so on, and the mandatory tags like `ID`, `Name`, and `Parent`.

Convert the Output of `kallisto bus` into Gene by Gell Matrix

Save transcript to gene file for use in `bustools`