Title: | kallisto | bustools R utilities |
---|---|
Description: | The kallisto | bustools pipeline is a fast and modular set of tools to convert single cell RNA-seq reads in fastq files into gene count or transcript compatibility counts (TCC) matrices for downstream analysis. Central to this pipeline is the barcode, UMI, and set (BUS) file format. This package serves the following purposes: First, this package allows users to manipulate BUS format files as data frames in R and then convert them into gene count or TCC matrices. Furthermore, since R and Rcpp code is easier to handle than pure C++ code, users are encouraged to tweak the source code of this package to experiment with new uses of BUS format and different ways to convert the BUS file into gene count matrix. Second, this package can conveniently generate files required to generate gene count matrices for spliced and unspliced transcripts for RNA velocity. Here biotypes can be filtered and scaffolds and haplotypes can be removed, and the filtered transcriptome can be extracted and written to disk. Third, this package implements utility functions to get transcripts and associated genes required to convert BUS files to gene count matrices, to write the transcript to gene information in the format required by bustools, and to read output of bustools into R as sparses matrices. |
Authors: | Lambda Moses [aut, cre] , Lior Pachter [aut, ths] |
Maintainer: | Lambda Moses <[email protected]> |
License: | BSD_2_clause + file LICENSE |
Version: | 1.21.0 |
Built: | 2024-11-29 04:44:20 UTC |
Source: | https://github.com/bioc/BUSpaRse |
Generate RNA velocity files for GRanges
.get_velocity_files( gr, L, Genome, Transcriptome = NULL, out_path = ".", style = c("genome", "Ensembl", "UCSC", "NCBI", "other"), isoform_action = c("separate", "collapse"), exon_option = c("full", "junction"), transcript_id = "transcript_id", gene_id = "gene_id", transcript_version = "transcript_version", gene_version = "gene_version", version_sep = ".", transcript_biotype_col = "transcript_biotype", gene_biotype_col = "gene_biotype", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, save_filtered_gtf = FALSE, compress_fa = FALSE, width = 80L )
.get_velocity_files( gr, L, Genome, Transcriptome = NULL, out_path = ".", style = c("genome", "Ensembl", "UCSC", "NCBI", "other"), isoform_action = c("separate", "collapse"), exon_option = c("full", "junction"), transcript_id = "transcript_id", gene_id = "gene_id", transcript_version = "transcript_version", gene_version = "gene_version", version_sep = ".", transcript_biotype_col = "transcript_biotype", gene_biotype_col = "gene_biotype", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, save_filtered_gtf = FALSE, compress_fa = FALSE, width = 80L )
gr |
A |
L |
Length of the biological read. For instance, 10xv1: 98 nt,
10xv2: 98 nt, 10xv3: 91 nt, Drop-seq: 50 nt. If in doubt check read length
in a fastq file for biological reads with the |
Genome |
Either a |
Transcriptome |
A |
out_path |
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory. |
style |
Formatting of chromosome names. Use
|
isoform_action |
Character, indicating action to take with different transcripts of the same gene. Must be one of the following:
|
exon_option |
Character, indicating how exonic sequences should be included in the kallisto index. Must be one of the following:
|
transcript_id |
Character vector of length 1. Tag in |
gene_id |
Character vector of length 1. Tag in |
transcript_version |
Character vector of length 1. Tag in |
gene_version |
Character vector of length 1. Tag in |
version_sep |
Character to separate bewteen the main ID and the version number. Defaults to ".", as in Ensembl. |
transcript_biotype_col |
Character vector of length 1. Tag in
|
gene_biotype_col |
Character vector of length 1. Tag in |
transcript_biotype_use |
Character, can be "all" or
a vector of transcript biotypes to be used. Transcript biotypes aren't
entirely the same as gene biotypes. For instance, in Ensembl annotation,
|
gene_biotype_use |
Character, can be "all", "cellranger", or
a vector of gene biotypes to be used. If "cellranger", then the biotypes
used by Cell Ranger's reference are used. See |
chrs_only |
Logical, whether to include chromosomes only, for GTF and
GFF files can contain annotations for scaffolds, which are not incorporated
into chromosomes. This will also exclude haplotypes. Defaults to |
save_filtered_gtf |
Logical. If filtering type, biotypes, and/or
chromosomes, whether to save the filtered |
compress_fa |
Logical, whether to compress the output fasta file. If
|
width |
Maximum number of letters per line of sequence in the output fasta file. Must be an integer. |
Internal use, called after calling subset_annot
.
annot_circular(Genome, annot)
annot_circular(Genome, annot)
Genome |
Either a |
annot |
Genome annotation, an object of a class with a
|
If neither genome nor annotation indicates which chromosome is circular, then the input will be returned unchanged. If only one of genome and annotation has such information, then it will be transferred to the one that does not. If both do have such information, the information from the genome will be transferred to the annotation if they're different.
Ensembl FASTA files for RNA contain much of the information contained in GTF
files, such as chromosome, genome assembly version, coordinates, strand,
gene ID, gene symbol, and gene description. Given such a FASTA file, this
function can extract all the genome annotation information and return a
data frame or a GRanges
object.
annots_from_fa_df(file) annots_from_fa_GRanges(file)
annots_from_fa_df(file) annots_from_fa_GRanges(file)
file |
Path to the FASTA file to be read. The file can remain gzipped. |
annots_from_fa_df
returns a data frame and
annots_from_fa_GRanges
returns GRanges
.
A data frame with genome annotations.
fn <- system.file("testdata/fasta_test.fasta", package = "BUSpaRse") annots_from_fa_df(fn) annots_from_fa_GRanges(fn)
fn <- system.file("testdata/fasta_test.fasta", package = "BUSpaRse") annots_from_fa_df(fn) annots_from_fa_GRanges(fn)
In the GRCh38 Cell Ranger reference package, an Ensembl human GTF file is filtered by gene biotypes. This vector includes all gene biotypes included by this Cell Ranger reference package.
cellranger_biotypes
cellranger_biotypes
A character vector with all Cell Ranger reference package gene biotypes.
ensembl_gene_biotypes ensembl_tx_biotypes
Just in case the user passes something with length more than 1 and messes up everything thanks to vectorization.
check_char1(x)
check_char1(x)
x |
Named vector of arguments to be checked. |
Error if x
is not a character vector with length 1.
Check for chromosomes in genome but not annotation
check_genome(chrs_use, Genome)
check_genome(chrs_use, Genome)
chrs_use |
Character vector of names of chromosomes present in both the annotation and the genome. |
Genome |
Either a |
Nothing. Will emit message if the genome contains chromosomes absent from the annotation.
This function validates inputs to tr2g_gtf and tr2g_gff3 and throws error early if some inputs are wrong.
check_gff(format, file, transcript_id, gene_id)
check_gff(format, file, transcript_id, gene_id)
format |
Whether it's gtf or gff3. |
file |
Path to a GTF file to be read. The file can remain gzipped. Use
|
transcript_id |
Character vector of length 1. Tag in |
gene_id |
Character vector of length 1. Tag in |
Nothing, will throw error if there's a problem.
The attribute field of GTF/GFF files are very complicated and is very inconsistent between sources. This function is to make sure that transcript and gene IDs can be extracted properly.
check_tag_present(tags_use, tags, error = TRUE)
check_tag_present(tags_use, tags, error = TRUE)
tags_use |
The tags to be checked. |
tags |
The tags present in attribute field. |
error |
Whether to throw an error for absent tags. If |
Error or warning if tag is absent.
This function throws an error if transcript IDs in transcriptome and annotation do not overlap. If they do overlap, this function will give a message about transcript IDs that do not agree in the transcriptome and the annotation
check_tx(tx_annot, tx)
check_tx(tx_annot, tx)
tx_annot |
Character vector of transcript IDs from the annotation. |
tx |
Character vector of transcript IDs from the transcriptome. |
Character vector of the overlapping transcript IDs.
This function downloads the cDNA fasta file from specific version of Ensembl. It can also filter the fasta file by gene and transcript biotype and remove scaffolds and haplotypes.
dl_transcriptome( species, out_path = ".", type = c("vertebrate", "metazoa", "plant", "fungus", "protist"), transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, ensembl_version = NULL, verbose = TRUE, ... )
dl_transcriptome( species, out_path = ".", type = c("vertebrate", "metazoa", "plant", "fungus", "protist"), transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, ensembl_version = NULL, verbose = TRUE, ... )
species |
Character vector of length 1, Latin name of the species of interest. |
out_path |
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory. |
type |
Character, must be one of "vertebrate", "metazoa", "plant", "fungus" and "protist". Passing "vertebrate" will use the default www.ensembl.org host. Gene annotation of some common invertebrate model organisms, such as Drosophila melanogaster, are available on www.ensembl.org so for these invertebrate model organisms, "vertebrate" can be used for this argument. Passing values other than "vertebrate" will use other Ensembl hosts. For animals absent from www.ensembl.org, try "metazoa". |
transcript_biotype_use |
Character, can be "all" or
a vector of transcript biotypes to be used. Transcript biotypes aren't
entirely the same as gene biotypes. For instance, in Ensembl annotation,
|
gene_biotype_use |
Character, can be "all", "cellranger", or
a vector of gene biotypes to be used. If "cellranger", then the biotypes
used by Cell Ranger's reference are used. See |
chrs_only |
Logical, whether to include chromosomes only, for GTF and
GFF files can contain annotations for scaffolds, which are not incorporated
into chromosomes. This will also exclude haplotypes. Defaults to |
ensembl_version |
Integer version number of Ensembl (e.g. 94 for the
October 2018 release). This argument defaults to |
verbose |
Whether to display progress. |
... |
Other arguments passed to |
Invisibly returns the path to the fasta file. The following files are
written to disk, in the out_path
directory:
The cDNA fasta file from Ensembl, from the specified version.
The filtered cDNA fasta file, only keeping the
desired biotypes and without scaffolds and haplotypes (if
chrs_only = TRUE
). This file will not be written if all gene and transcript
biotypes are used and scaffolds and haplotypes are not removed.
The transcript to gene file, without headers so can be
directly used for bustools
.
dl_transcriptome("Drosophila melanogaster", gene_biotype_use = "cellranger", chrs_only = FALSE) # Clean up file.remove("Drosophila_melanogaster.BDGP6.32.cdna.all.fa.gz")
dl_transcriptome("Drosophila melanogaster", gene_biotype_use = "cellranger", chrs_only = FALSE) # Clean up file.remove("Drosophila_melanogaster.BDGP6.32.cdna.all.fa.gz")
In the output file output.bus
, equivalence classes (EC) are denoted by
an index, which is related to the set of transcripts the EC is compatible to
in the output file matrix.ec
. This function further relates the set of
transcripts to the set of genes the EC is compatible to. This function first
reads in matrix.ec
, and then translates the transcripts into genes.
EC2gene(tr2g, kallisto_out_path, verbose = TRUE)
EC2gene(tr2g, kallisto_out_path, verbose = TRUE)
tr2g |
A Data frame with columns |
kallisto_out_path |
Path to the |
verbose |
Logical, whether to display progress. |
The data frame passed to tr2g
can be generated from function
transcript2gene
in this package for any organism that has gene and
transcript ID on Ensembl, or from the tr2g_*
family of function.
You no longer need to use this function before running make_sparse_matrix
;
the purpose of this function is to query which genes equivalence classes map
to.
Calling this function is unnessary when working with gene count matrices. However, this function is useful for finding genes the ECs map to in TCC matrices, such as when finding species-specific ECs in mixed species datasets and identifying ECs mapped to known marker genes of cell types.
A data frame with 3 columns:
Index of the EC as appearing in the matrix.ec
file.
A list column each element of which is a numeric vector of the transcripts in the EC corresponding to the EC index. To learn more about list columns, see the relevant section in the R for Data Science book.
A list column each element of which is a character vector of genes the EC maps to.
# Load toy example for testing toy_path <- system.file("testdata", package = "BUSpaRse") load(paste(toy_path, "toy_example.RData", sep = "/")) EC2gene(tr2g_toy, toy_path, verbose = FALSE)
# Load toy example for testing toy_path <- system.file("testdata", package = "BUSpaRse") load(paste(toy_path, "toy_example.RData", sep = "/")) EC2gene(tr2g_toy, toy_path, verbose = FALSE)
These vectors are here to make it easy to look up which biotypes are available for Ensembl without having to parse GTF and fasta files every time. See this page from Ensembl for what the biotypes mean.
ensembl_gene_biotypes
ensembl_gene_biotypes
A character vector with all Ensembl gene biotypes.
This vector is all the unique gene biotypes in the Ensembl version 99 human GTF file.
ensembl_tx_biotypes cellranger_biotypes
mcols
when the Ensembl GTF file is read
into R as a GRanges
, including gene_id
, transcript_id
, biotype
,
description
, and so on, and the mandatory tags like ID
, Name
, and
Parent
.These are the column names of the mcols
when the Ensembl GTF file is read
into R as a GRanges
, including gene_id
, transcript_id
, biotype
,
description
, and so on, and the mandatory tags like ID
, Name
, and
Parent
.
ensembl_gff_mcols
ensembl_gff_mcols
A character vector
Ensembl version 99 human GFF3 file
These are the column names of the mcols
when the Ensembl GTF file is read
into R as a GRanges
, including gene_id
, transcript_id
, gene_biotype
,
transcript_biotype
, description
, and so on.
ensembl_gtf_mcols
ensembl_gtf_mcols
A character vector
Ensembl version 99 human GTF file
These vectors are here to make it easy to look up which biotypes are available for Ensembl without having to parse GTF and fasta files every time. See this page from Ensembl for what the biotypes mean.
ensembl_tx_biotypes
ensembl_tx_biotypes
A character vector with all Ensembl transcript biotypes.
This vector is all the unique transcript biotypes in the Ensembl version 99 human GTF file.
ensembl_gene_biotypes cellranger_biotypes
Get flanked intronic ranges
get_intron_flanks(grl, L, get_junctions)
get_intron_flanks(grl, L, get_junctions)
grl |
A |
L |
Read length. |
get_junctions |
Logical, whether to also return exon-exon junctions. |
If get_junctions
is FALSE
, then a GRanges
object
with ranges for flanked intronic regions. If get_junctions
is TRUE
, then
in addition to the flanked intronic ranges, a CompressedGRangesList
with
exon-exon junction ranges and ranges for transcripts without introns.
Plot a transposed knee plot, showing the inflection point and the number of remaining cells after inflection point filtering. It's transposed since it's more generalizable to multi-modal data.
get_knee_df(mat) get_inflection(df, lower = 100) knee_plot(df, inflection)
get_knee_df(mat) get_inflection(df, lower = 100) knee_plot(df, inflection)
mat |
Gene count matrix, a dgCMatrix. |
df |
The data frame from |
lower |
Minimum total UMI counts for barcode for it to be considered when calculating the inflection point; this helps to avoid the noisy part of the curve for barcodes with very few counts. |
inflection |
Output of |
get_knee_df
returns a tibble with two columns: total
for
total UMI counts for each barcode, and rank
for rank of the total
counts, with number 1 for the barcode with the most counts.
get_inflection
returns a numeric(1)
for the total UMI count
at the inflection point.
knee_plot
returns a ggplot2
object.
Code in part adapted from barcodeRanks
from DropetUtils
.
# Download dataset already in BUS format library(TENxBUSData) TENxBUSData(".", dataset = "hgmm100") tr2g <- transcript2gene(c("Homo sapiens", "Mus musculus"), type = "vertebrate", ensembl_version = 99, kallisto_out_path = "./out_hgmm100") m <- make_sparse_matrix("./out_hgmm100/output.sorted.txt", tr2g = tr2g, est_ncells = 1e5, est_ngenes = nrow(tr2g), TCC = FALSE) df <- get_knee_df(m) infl <- get_inflection(df) knee_plot(df, infl) # Clean up files from the example unlink("out_hgmm100")
# Download dataset already in BUS format library(TENxBUSData) TENxBUSData(".", dataset = "hgmm100") tr2g <- transcript2gene(c("Homo sapiens", "Mus musculus"), type = "vertebrate", ensembl_version = 99, kallisto_out_path = "./out_hgmm100") m <- make_sparse_matrix("./out_hgmm100/output.sorted.txt", tr2g = tr2g, est_ncells = 1e5, est_ngenes = nrow(tr2g), TCC = FALSE) df <- get_knee_df(m) infl <- get_inflection(df) knee_plot(df, infl) # Clean up files from the example unlink("out_hgmm100")
Computation of RNA velocity requires the number of unspliced transcripts, which can be quantified with reads containing intronic sequences. This function extracts intronic sequences flanked by L-1 bases of exonic sequences where L is the biological read length of the single cell technology of interest. The flanking exonic sequences are included for reads partially mapping to an intron and an exon.
get_velocity_files( X, L, Genome, Transcriptome = NULL, out_path = ".", style = c("genome", "Ensembl", "UCSC", "NCBI", "other"), isoform_action = c("separate", "collapse"), exon_option = c("full", "junction"), compress_fa = FALSE, width = 80L, ... ) ## S4 method for signature 'GRanges' get_velocity_files( X, L, Genome, Transcriptome = NULL, out_path = ".", style = c("genome", "Ensembl", "UCSC", "NCBI", "other"), isoform_action = c("separate", "collapse"), exon_option = c("full", "junction"), compress_fa = FALSE, width = 80L, transcript_id = "transcript_id", gene_id = "gene_id", transcript_version = "transcript_version", gene_version = "gene_version", version_sep = ".", transcript_biotype_col = "transcript_biotype", gene_biotype_col = "gene_biotype", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, save_filtered_gtf = FALSE ) ## S4 method for signature 'character' get_velocity_files( X, L, Genome, Transcriptome = NULL, out_path = ".", style = c("genome", "Ensembl", "UCSC", "NCBI", "other"), isoform_action = c("separate", "collapse"), exon_option = c("full", "junction"), compress_fa = FALSE, width = 80L, is_circular = NULL, transcript_id = "transcript_id", gene_id = "gene_id", transcript_version = "transcript_version", gene_version = "gene_version", version_sep = ".", transcript_biotype_col = "transcript_biotype", gene_biotype_col = "gene_biotype", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, save_filtered_gtf = FALSE ) ## S4 method for signature 'TxDb' get_velocity_files( X, L, Genome, Transcriptome, out_path, style = c("genome", "Ensembl", "UCSC", "NCBI", "other"), isoform_action = c("separate", "collapse"), exon_option = c("full", "junction"), compress_fa = FALSE, width = 80L, chrs_only = TRUE ) ## S4 method for signature 'EnsDb' get_velocity_files( X, L, Genome, Transcriptome, out_path, style = c("genome", "Ensembl", "UCSC", "NCBI", "other"), isoform_action = c("separate", "collapse"), exon_option = c("full", "junction"), compress_fa = FALSE, width = 80L, use_transcript_version = TRUE, use_gene_version = TRUE, transcript_biotype_col = "TXBIOTYPE", gene_biotype_col = "GENEBIOTYPE", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE )
get_velocity_files( X, L, Genome, Transcriptome = NULL, out_path = ".", style = c("genome", "Ensembl", "UCSC", "NCBI", "other"), isoform_action = c("separate", "collapse"), exon_option = c("full", "junction"), compress_fa = FALSE, width = 80L, ... ) ## S4 method for signature 'GRanges' get_velocity_files( X, L, Genome, Transcriptome = NULL, out_path = ".", style = c("genome", "Ensembl", "UCSC", "NCBI", "other"), isoform_action = c("separate", "collapse"), exon_option = c("full", "junction"), compress_fa = FALSE, width = 80L, transcript_id = "transcript_id", gene_id = "gene_id", transcript_version = "transcript_version", gene_version = "gene_version", version_sep = ".", transcript_biotype_col = "transcript_biotype", gene_biotype_col = "gene_biotype", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, save_filtered_gtf = FALSE ) ## S4 method for signature 'character' get_velocity_files( X, L, Genome, Transcriptome = NULL, out_path = ".", style = c("genome", "Ensembl", "UCSC", "NCBI", "other"), isoform_action = c("separate", "collapse"), exon_option = c("full", "junction"), compress_fa = FALSE, width = 80L, is_circular = NULL, transcript_id = "transcript_id", gene_id = "gene_id", transcript_version = "transcript_version", gene_version = "gene_version", version_sep = ".", transcript_biotype_col = "transcript_biotype", gene_biotype_col = "gene_biotype", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, save_filtered_gtf = FALSE ) ## S4 method for signature 'TxDb' get_velocity_files( X, L, Genome, Transcriptome, out_path, style = c("genome", "Ensembl", "UCSC", "NCBI", "other"), isoform_action = c("separate", "collapse"), exon_option = c("full", "junction"), compress_fa = FALSE, width = 80L, chrs_only = TRUE ) ## S4 method for signature 'EnsDb' get_velocity_files( X, L, Genome, Transcriptome, out_path, style = c("genome", "Ensembl", "UCSC", "NCBI", "other"), isoform_action = c("separate", "collapse"), exon_option = c("full", "junction"), compress_fa = FALSE, width = 80L, use_transcript_version = TRUE, use_gene_version = TRUE, transcript_biotype_col = "TXBIOTYPE", gene_biotype_col = "GENEBIOTYPE", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE )
X |
Gene annotation with transcript and exon information. It can be a
path to a GTF file with annotation of exon coordinates of
transcripts, preferably from Ensembl. In the metadata, the following fields
are required: type (e.g. whether the range of interest is a gene or
transcript or exon or CDS), gene ID, and transcript ID. These
fields need not to have standard names, as long as their names are specified
in arguments of this function. It can also be a |
L |
Length of the biological read. For instance, 10xv1: 98 nt,
10xv2: 98 nt, 10xv3: 91 nt, Drop-seq: 50 nt. If in doubt check read length
in a fastq file for biological reads with the |
Genome |
Either a |
Transcriptome |
A |
out_path |
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory. |
style |
Formatting of chromosome names. Use
|
isoform_action |
Character, indicating action to take with different transcripts of the same gene. Must be one of the following:
|
exon_option |
Character, indicating how exonic sequences should be included in the kallisto index. Must be one of the following:
|
compress_fa |
Logical, whether to compress the output fasta file. If
|
width |
Maximum number of letters per line of sequence in the output fasta file. Must be an integer. |
... |
Extra arguments for methods. |
transcript_id |
Character vector of length 1. Tag in |
gene_id |
Character vector of length 1. Tag in |
transcript_version |
Character vector of length 1. Tag in |
gene_version |
Character vector of length 1. Tag in |
version_sep |
Character to separate bewteen the main ID and the version number. Defaults to ".", as in Ensembl. |
transcript_biotype_col |
Character vector of length 1. Tag in
|
gene_biotype_col |
Character vector of length 1. Tag in |
transcript_biotype_use |
Character, can be "all" or
a vector of transcript biotypes to be used. Transcript biotypes aren't
entirely the same as gene biotypes. For instance, in Ensembl annotation,
|
gene_biotype_use |
Character, can be "all", "cellranger", or
a vector of gene biotypes to be used. If "cellranger", then the biotypes
used by Cell Ranger's reference are used. See |
chrs_only |
Logical, whether to include chromosomes only, for GTF and
GFF files can contain annotations for scaffolds, which are not incorporated
into chromosomes. This will also exclude haplotypes. Defaults to |
save_filtered_gtf |
Logical. If filtering type, biotypes, and/or
chromosomes, whether to save the filtered |
is_circular |
Logical vector of the same length as the number of
sequences in the annotation and with the same names as the sequences,
indicating whether the sequence is circular. If |
use_transcript_version |
Logical, whether to include version number in the Ensembl transcript ID. |
use_gene_version |
Logical, whether to include version number in the Ensembl gene ID. Unlike transcript version number, it's up to you whether to include gene version number. |
The following files will be written to disk in the directory
out_path
:
A fasta file containing both the spliced transcripts
and the flanked intronic sequences. The intronic sequences are flanked by L-1
nt of exonic sequences to capture reads from nascent transcript partially
mapping to exons. If the exon is shorter than 2*(L-1) nt, then the entire
exon will be included in the intronic sequence. This will be used to build
the kallisto
index.
A text file of transcript IDs of spliced
transcripts. If exon_option == "junction"
, then IDs of the exon-exon
junctions. These IDs will have the pattern "transcript ID"-Jx, where x is a
number differentiating between different junctions of the same transcript.
Here x will always be ordered from 5' to 3' as on the plus strand.
A text file of IDs of introns. The names will have the pattern "transcript ID"-Ix, where x is a number differentiating between introns of the same transcript. If all transcripts of the same gene are collapsed before inferring intronic sequences, gene ID will be used in place of transcript ID. Here x will always be ordered from 5' to 3' as on the plus strand.
A text file with two columns matching transcripts and introns to genes. The first column is transcript or intron ID, and the second column is the corresponding gene ID. The part for transcripts are generated from the gene annotation supplied.
Nothing is returned into the R session.
# Use toy example toy_path <- system.file("testdata", package = "BUSpaRse") file <- paste0(toy_path, "/velocity_annot.gtf") genome <- Biostrings::readDNAStringSet(paste0(toy_path, "/velocity_genome.fa")) transcriptome <- paste0(toy_path, "/velocity_tx.fa") get_velocity_files(file, 11, genome, transcriptome, ".", gene_version = NULL, transcript_version = NULL) # Clean up output of the example file.remove("cDNA_introns.fa", "cDNA_tx_to_capture.txt", "introns_tx_to_capture.txt", "tr2g.txt")
# Use toy example toy_path <- system.file("testdata", package = "BUSpaRse") file <- paste0(toy_path, "/velocity_annot.gtf") genome <- Biostrings::readDNAStringSet(paste0(toy_path, "/velocity_genome.fa")) transcriptome <- paste0(toy_path, "/velocity_tx.fa") get_velocity_files(file, 11, genome, transcriptome, ".", gene_version = NULL, transcript_version = NULL) # Clean up output of the example file.remove("cDNA_introns.fa", "cDNA_tx_to_capture.txt", "introns_tx_to_capture.txt", "tr2g.txt")
kallisto bus
into Gene by Gell MatrixThis function takes the output file of kallisto bus
, after being
sorted and converted into text with bustools
. See vignettes on the
website of this package for a
tutorial. The bustools
output has 4 columns: barcode, UMI, equivalence
class, and counts. This function converts that file into a sparse matrix that
can be used in downstream analyses.
make_sparse_matrix( bus_path, tr2g, est_ncells, est_ngenes, whitelist = NULL, gene_count = TRUE, TCC = TRUE, single_gene = TRUE, verbose = TRUE, progress_unit = 5e+06 )
make_sparse_matrix( bus_path, tr2g, est_ncells, est_ngenes, whitelist = NULL, gene_count = TRUE, TCC = TRUE, single_gene = TRUE, verbose = TRUE, progress_unit = 5e+06 )
bus_path |
Path to the sorted text |
tr2g |
A Data frame with columns |
est_ncells |
Estimated number of cells; providing this argument will speed up computation as it minimizes memory reallocation as vectors grow. |
est_ngenes |
Estimated number of genes or equivalence classes. |
whitelist |
A character vector with valid cell barcodes. This is an
optional argument, that defaults to |
gene_count |
Logical, whether the gene count matrix should be returned. |
TCC |
Logical, whether the TCC matrix should be returned. |
single_gene |
Logical, whether to use single gene mode. In single gene mode, only UMIs that can be uniquely mapped to one gene are kept. Without single gene mode, UMIs mapped to multiple genes will be evenly distributed to those genes. |
verbose |
Whether to display progress. |
progress_unit |
How many iteration to print one progress update when
reading in the |
This function can generate both the gene count matrix and the transcript compatibility count (TCC) matrix. The TCC matrix has barcodes in the columns and equivalence classes in the rows. See Ntranos et al. 2016 for more information about the RCC matrix.
For 10x data sets, you can find a barcode whitelist file that comes with
CellRanger installation. You don't need to run CellRanger to get that. An
example path to get the whitelist file is
cellranger-2.1.0/cellranger-cs/2.1.0/lib/python/cellranger/barcodes/737K-august-2016.txt
for v2 chemistry.
If both gene count and TCC matrices are returned, then this function
returns a list with two matrices, each with genes/equivalence classes in the
rows and barcodes in the columns. If only one of gene count and TCC matrices
is returned, then a dgCMatrix
with genes/equivalence classes in the
rows and barcodes in the columns. These matrices are unfiltered. Please filter
the empty droplets before downstream analysis.
# Load toy example for testing toy_path <- system.file("testdata", package = "BUSpaRse") load(paste(toy_path, "toy_example.RData", sep = "/")) out_fn <- paste0(toy_path, "/output.sorted.txt") # With whitelist m <- make_sparse_matrix(out_fn, tr2g_toy, 10, 3, whitelist = whitelist, gene_count = TRUE, TCC = FALSE, single_gene = TRUE, verbose = FALSE)
# Load toy example for testing toy_path <- system.file("testdata", package = "BUSpaRse") load(paste(toy_path, "toy_example.RData", sep = "/")) out_fn <- paste0(toy_path, "/output.sorted.txt") # With whitelist m <- make_sparse_matrix(out_fn, tr2g_toy, 10, 3, whitelist = whitelist, gene_count = TRUE, TCC = FALSE, single_gene = TRUE, verbose = FALSE)
Internal use. This function matches chromosome naming styles. It will also
give the genome and the annotation the same genome
slot. This function
assumes that the annotation and the genome refer to the same version of
genome. If more than one style, then the first element will be used.
match_style(Genome, annot, style)
match_style(Genome, annot, style)
Genome |
Either a |
annot |
Genome annotation, an object of a class with a
|
style |
Formatting of chromosome names. Use
|
A list of two. The first element is the genome with the proper style, and the second element is the annotation with the proper style.
This function takes in a directory and name and reads the mtx file, genes,
and barcodes from the output of bustools
to return a sparse matrix with
column names and row names.
read_count_output(dir, name, tcc = FALSE)
read_count_output(dir, name, tcc = FALSE)
dir |
Directory with the bustools count outputs. |
name |
The files in the output directory should be "name".mtx, "name".genes.txt, and "name".barcodes.txt. |
tcc |
Logical, whether the matrix of interest is a TCC matrix. Defaults
to |
A dgCMatrix with barcodes as column names and genes as row names.
# Internal toy data used for unit testing toy_path <- system.file("testdata", package = "BUSpaRse") m <- read_count_output(toy_path, name = "genes", tcc = FALSE)
# Internal toy data used for unit testing toy_path <- system.file("testdata", package = "BUSpaRse") m <- read_count_output(toy_path, name = "genes", tcc = FALSE)
Read intronic and exonic matrices into R
read_velocity_output(spliced_dir, unspliced_dir, spliced_name, unspliced_name)
read_velocity_output(spliced_dir, unspliced_dir, spliced_name, unspliced_name)
spliced_dir |
Directory with |
unspliced_dir |
Directory with |
spliced_name |
The files in the splicedd directory should be <spliced_name>.mtx, <spliced_name>.genes.txt, and <spliced_name>.barcodes.txt. |
unspliced_name |
The files in the unsplicedd directory should be <unspliced_name>.mtx, <unspliced_name>.genes.txt, and <unspliced_name>.barcodes.txt. |
A list of two dgCMatrix with barcodes as column names and genes as
row names. The elements of the list will be spliced
and unspliced
.
# Internal toy data used for unit testing toy_path <- system.file("testdata", package = "BUSpaRse") m <- read_velocity_output(toy_path, toy_path, spliced_name = "genes", unspliced_name = "genes")
# Internal toy data used for unit testing toy_path <- system.file("testdata", package = "BUSpaRse") m <- read_velocity_output(toy_path, toy_path, spliced_name = "genes", unspliced_name = "genes")
These are the column names of the mcols
when the Ensembl GTF file is read
into R as a GRanges
, including gene
, transcript_id
, gene_biotype
,
description
, and so on, and the mandatory tags like ID
, Name
, and
Parent
.
refseq_gff_mcols
refseq_gff_mcols
A character vector
Ensembl version 99 human GTF file
bustools
This function saves the transcript to gene data frame generated by this package
in whatever means in a format required by bustools
. In order to use
bustools
to generate the gene count or TCC matrix, a file
that maps transcripts to genes is required. This should be a tsv file with 2
columns: the first column for transcript ID and the second for gene ID. The
order of transcripts in this file must be the same as the order in the
kallisto index, and this ordering can be ensured by the function
sort_tr2g
. There must also be no headers. All columns other than
transcript
and gene
will be discarded. To save a file with those columns,
directly save the transcript to gene data frame with function like
write.table
, readr::write_delim
.
save_tr2g_bustools(tr2g, file_save = "./tr2g.tsv")
save_tr2g_bustools(tr2g, file_save = "./tr2g.tsv")
tr2g |
The data frame output from the |
file_save |
File name of the file to be saved. The directory in which the file is to be saved must exist. |
Nothing is returned into the R session. A tsv file of the format
required by bustools
with the name and directory specified will be written
to disk.
This function has been superseded by the new version of tr2g_* functions that can extract transcriptome for only the biotypes specified and with only the standard chromosomes. The new version of tr2g_* functions also sorts the transcriptome so the tr2g and the transcriptome have transcripts in the same order, and write the tr2g.tsv file in the bustools format.
toy_path <- system.file("testdata", package = "BUSpaRse") file_use <- paste(toy_path, "gtf_test.gtf", sep = "/") tr2g <- tr2g_gtf(file = file_use, get_transcriptome = FALSE, write_tr2g = FALSE, save_filtered_gtf = FALSE) save_tr2g_bustools(tr2g, file_save = "./tr2g.tsv") # Clean up files from the example file.remove("tr2g.tsv")
toy_path <- system.file("testdata", package = "BUSpaRse") file_use <- paste(toy_path, "gtf_test.gtf", sep = "/") tr2g <- tr2g_gtf(file = file_use, get_transcriptome = FALSE, write_tr2g = FALSE, save_filtered_gtf = FALSE) save_tr2g_bustools(tr2g, file_save = "./tr2g.tsv") # Clean up files from the example file.remove("tr2g.tsv")
This function takes the data frame output from the tr2g_*
family of
functions in this package as the input, and sorts it so the transcripts are
in the same order as in the kallisto index used to generate the bus
file. Sorting is vital to obtain the correct sparse matrix from the bus
file as equivalence class notations are based on the index of transcripts
in the kallisto index.
sort_tr2g(tr2g, file, kallisto_out_path)
sort_tr2g(tr2g, file, kallisto_out_path)
tr2g |
The data frame output from the |
file |
Character vector of length 1, path to a tsv file with
transcript IDs and the corresponding gene IDs, in the format required for
|
kallisto_out_path |
Character vector of length 1, path to the directory for the outputs of kallisto bus. |
Since the attribute field of GTF and GFF3 files varies across sources, output
from tr2g_gtf
and tr2g_gff3
may need further
clean up. You may also supply gene and transcript IDs from other sources.
This function should be used after the clean up, when the transcript IDs in
the cleaned up data frame have the same format as those in transcript
A data frame with columns transcript
and gene
and the
other columns present in tr2g
or the data frame in file
, with
the transcript IDs sorted to be in the same order as in the kallisto index.
This function has been superseded by the new version of tr2g_* functions that can extract transcriptome for only the biotypes specified and with only the standard chromosomes. The new version of tr2g_* functions also sorts the transcriptome so the tr2g and the transcriptome have transcripts in the same order.
Other functions to retrieve transcript and gene info:
tr2g_EnsDb()
,
tr2g_TxDb()
,
tr2g_ensembl()
,
tr2g_fasta()
,
tr2g_gff3()
,
tr2g_gtf()
,
transcript2gene()
toy_path <- system.file("testdata", package = "BUSpaRse") file_use <- paste(toy_path, "gtf_test.gtf", sep = "/") tr2g <- tr2g_gtf(file = file_use, get_transcriptome = FALSE, write_tr2g = FALSE, save_filtered_gtf = FALSE, transcript_version = NULL) tr2g <- sort_tr2g(tr2g, kallisto_out_path = toy_path)
toy_path <- system.file("testdata", package = "BUSpaRse") file_use <- paste(toy_path, "gtf_test.gtf", sep = "/") tr2g <- tr2g_gtf(file = file_use, get_transcriptome = FALSE, write_tr2g = FALSE, save_filtered_gtf = FALSE, transcript_version = NULL) tr2g <- sort_tr2g(tr2g, kallisto_out_path = toy_path)
This function converts Latin species name to a dataset name in biomart to query gene and transcript ID.
species2dataset( species, type = c("vertebrate", "metazoa", "plant", "fungus", "protist") )
species2dataset( species, type = c("vertebrate", "metazoa", "plant", "fungus", "protist") )
species |
A character vector of Latin names of species present in this scRNA-seq dataset. This is used to retrieve Ensembl information from biomart. |
type |
A character vector indicating the type of each species. Each
element must be one of "vertebrate", "metazoa", "plant", "fungus", and
"protist". If length is 1, then this type will be used for all species specified
here. Can be missing if |
The appropriate dataset name for biomart.
species2dataset(species = "Homo sapiens")
species2dataset(species = "Homo sapiens")
To avoid introducing rlang as another dependency for tidyeval. This function will also convert exon numbers to integer.
standardize_tags(gr, gene_id, transcript_id)
standardize_tags(gr, gene_id, transcript_id)
gr |
A |
gene_id |
Name of the metadata field for gene ID. |
transcript_id |
Name of the metadata field for transcript ID. |
A GRanges
object with standardized names: gene ID as gene_id
,
and transcript ID as transcript_id
.
Remove chromosomes in anotation absent from genome
sub_annot(chrs_use, annot)
sub_annot(chrs_use, annot)
chrs_use |
Character vector of names of chromosomes present in both the annotation and the genome. |
annot |
Either a |
A subsetted genome annotation of the same type ofo the input genome annotation.
Exclude chromosomes present in the annotation but absent from the genome and add information about circular chromosomes.
subset_annot(Genome, annot) ## S4 method for signature 'DNAStringSet' subset_annot(Genome, annot) ## S4 method for signature 'BSgenome' subset_annot(Genome, annot)
subset_annot(Genome, annot) ## S4 method for signature 'DNAStringSet' subset_annot(Genome, annot) ## S4 method for signature 'BSgenome' subset_annot(Genome, annot)
Genome |
Either a |
annot |
Either a |
A subsetted genome annotation of the same type of the input genome annotation.
library(BSgenome.Hsapiens.UCSC.hg38) library(EnsDb.Hsapiens.v86) library(GenomeInfoDb) gn <- BSgenome.Hsapiens.UCSC.hg38 seqlevelsStyle(gn) <- "Ensembl" subset_annot(gn, EnsDb.Hsapiens.v86)
library(BSgenome.Hsapiens.UCSC.hg38) library(EnsDb.Hsapiens.v86) library(GenomeInfoDb) gn <- BSgenome.Hsapiens.UCSC.hg38 seqlevelsStyle(gn) <- "Ensembl" subset_annot(gn, EnsDb.Hsapiens.v86)
Bioconductor provides Ensembl genome annotation in AnnotationHub
; older
versions of Ensembl annotation can be obtained from packages like
EnsDb.Hsapiens.v86
. This is an alternative to querying Ensembl with
biomart; Ensembl's server seems to be less stable than that of Bioconductor.
However, more information and species are available on Ensembl biomart than
on AnnotationHub
.
tr2g_EnsDb( ensdb, Genome = NULL, get_transcriptome = TRUE, out_path = ".", write_tr2g = TRUE, other_attrs = NULL, use_gene_name = TRUE, use_transcript_version = TRUE, use_gene_version = TRUE, transcript_biotype_col = "TXBIOTYPE", gene_biotype_col = "GENEBIOTYPE", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, compress_fa = FALSE, overwrite = FALSE )
tr2g_EnsDb( ensdb, Genome = NULL, get_transcriptome = TRUE, out_path = ".", write_tr2g = TRUE, other_attrs = NULL, use_gene_name = TRUE, use_transcript_version = TRUE, use_gene_version = TRUE, transcript_biotype_col = "TXBIOTYPE", gene_biotype_col = "GENEBIOTYPE", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, compress_fa = FALSE, overwrite = FALSE )
ensdb |
Ann |
Genome |
Either a |
get_transcriptome |
Logical, whether to extract transcriptome from
genome with the GTF file. If filtering biotypes or chromosomes, the filtered
|
out_path |
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory. |
write_tr2g |
Logical, whether to write tr2g to disk. If |
other_attrs |
Character vector. Other attributes to get from the |
use_gene_name |
Logical, whether to get gene names. |
use_transcript_version |
Logical, whether to include version number in
the Ensembl transcript ID. To decide whether to
include transcript version number, check whether version numbers are included
in the |
use_gene_version |
Logical, whether to include version number in the Ensembl gene ID. Unlike transcript version number, it's up to you whether to include gene version number. |
transcript_biotype_col |
Character vector of length 1. Tag in
|
gene_biotype_col |
Character vector of length 1. Tag in |
transcript_biotype_use |
Character, can be "all" or
a vector of transcript biotypes to be used. Transcript biotypes aren't
entirely the same as gene biotypes. For instance, in Ensembl annotation,
|
gene_biotype_use |
Character, can be "all", "cellranger", or
a vector of gene biotypes to be used. If "cellranger", then the biotypes
used by Cell Ranger's reference are used. See |
chrs_only |
Logical, whether to include chromosomes only, for GTF and
GFF files can contain annotations for scaffolds, which are not incorporated
into chromosomes. This will also exclude haplotypes. Defaults to |
compress_fa |
Logical, whether to compress the output fasta file. If
|
overwrite |
Logical, whether to overwrite if files with names of outputs written to disk already exist. |
A data frame with at least 2 columns: gene
for gene ID,
transcript
for transcript ID, and optionally gene_name
for gene names. If other_attrs
has been specified, then those will
also be columns in the data frame returned.
ensembl_gene_biotypes ensembl_tx_biotypes cellranger_biotypes
Other functions to retrieve transcript and gene info:
sort_tr2g()
,
tr2g_TxDb()
,
tr2g_ensembl()
,
tr2g_fasta()
,
tr2g_gff3()
,
tr2g_gtf()
,
transcript2gene()
library(EnsDb.Hsapiens.v86) tr2g_EnsDb(EnsDb.Hsapiens.v86, get_transcriptome = FALSE, write_tr2g = FALSE, use_transcript_version = FALSE, use_gene_version = FALSE)
library(EnsDb.Hsapiens.v86) tr2g_EnsDb(EnsDb.Hsapiens.v86, get_transcriptome = FALSE, write_tr2g = FALSE, use_transcript_version = FALSE, use_gene_version = FALSE)
This function queries Ensembl biomart to convert transcript IDs to gene IDs.
tr2g_ensembl( species, type = c("vertebrate", "metazoa", "plant", "fungus", "protist"), out_path = ".", write_tr2g = TRUE, other_attrs = NULL, use_gene_name = TRUE, use_transcript_version = TRUE, use_gene_version = TRUE, transcript_biotype_col = "transcript_biotype", gene_biotype_col = "gene_biotype", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, ensembl_version = NULL, overwrite = FALSE, verbose = TRUE, ... )
tr2g_ensembl( species, type = c("vertebrate", "metazoa", "plant", "fungus", "protist"), out_path = ".", write_tr2g = TRUE, other_attrs = NULL, use_gene_name = TRUE, use_transcript_version = TRUE, use_gene_version = TRUE, transcript_biotype_col = "transcript_biotype", gene_biotype_col = "gene_biotype", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, ensembl_version = NULL, overwrite = FALSE, verbose = TRUE, ... )
species |
Character vector of length 1, Latin name of the species of interest. |
type |
Character, must be one of "vertebrate", "metazoa", "plant", "fungus" and "protist". Passing "vertebrate" will use the default www.ensembl.org host. Gene annotation of some common invertebrate model organisms, such as Drosophila melanogaster, are available on www.ensembl.org so for these invertebrate model organisms, "vertebrate" can be used for this argument. Passing values other than "vertebrate" will use other Ensembl hosts. For animals absent from www.ensembl.org, try "metazoa". |
out_path |
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory. |
write_tr2g |
Logical, whether to write tr2g to disk. If |
other_attrs |
Character vector. Other attributes to get from Ensembl,
such as gene symbol and position on the genome.
Use |
use_gene_name |
Logical, whether to get gene names. |
use_transcript_version |
Logical, whether to include version number in
the Ensembl transcript ID. To decide whether to
include transcript version number, check whether version numbers are included
in the |
use_gene_version |
Logical, whether to include version number in the Ensembl gene ID. Unlike transcript version number, it's up to you whether to include gene version number. |
transcript_biotype_col |
Character vector of length 1. Tag in
|
gene_biotype_col |
Character vector of length 1. Tag in |
transcript_biotype_use |
Character, can be "all" or
a vector of transcript biotypes to be used. Transcript biotypes aren't
entirely the same as gene biotypes. For instance, in Ensembl annotation,
|
gene_biotype_use |
Character, can be "all", "cellranger", or
a vector of gene biotypes to be used. If "cellranger", then the biotypes
used by Cell Ranger's reference are used. See |
chrs_only |
Logical, whether to include chromosomes only, for GTF and
GFF files can contain annotations for scaffolds, which are not incorporated
into chromosomes. This will also exclude haplotypes. Defaults to |
ensembl_version |
Integer version number of Ensembl (e.g. 94 for the
October 2018 release). This argument defaults to |
overwrite |
Logical, whether to overwrite if files with names of outputs written to disk already exist. |
verbose |
Whether to display progress. |
... |
Othe arguments to be passed to |
A data frame with at least 2 columns: gene
for gene ID,
transcript
for transcript ID, and optionally gene_name
for gene names. If other_attrs
has been specified, then those will
also be columns in the data frame returned.
dl_transcriptome
Other functions to retrieve transcript and gene info:
sort_tr2g()
,
tr2g_EnsDb()
,
tr2g_TxDb()
,
tr2g_fasta()
,
tr2g_gff3()
,
tr2g_gtf()
,
transcript2gene()
tr2g <- tr2g_ensembl(species = "Danio rerio", other_attrs = "description", write_tr2g = FALSE) # This will use plants.ensembl.org as host instead of www.ensembl.org tr2g <- tr2g_ensembl(species = "Arabidopsis thaliana", type = "plant", write_tr2g = FALSE)
tr2g <- tr2g_ensembl(species = "Danio rerio", other_attrs = "description", write_tr2g = FALSE) # This will use plants.ensembl.org as host instead of www.ensembl.org tr2g <- tr2g_ensembl(species = "Arabidopsis thaliana", type = "plant", write_tr2g = FALSE)
FASTA files, such as those for cDNA and ncRNA from Ensembl, might have genome annotation information in the name of each sequence entry. This function extracts the transcript and gene IDs from such FASTA files.
tr2g_fasta( file, out_path = ".", write_tr2g = TRUE, use_gene_name = TRUE, use_transcript_version = TRUE, use_gene_version = TRUE, transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, save_filtered = TRUE, compress_fa = FALSE, overwrite = FALSE )
tr2g_fasta( file, out_path = ".", write_tr2g = TRUE, use_gene_name = TRUE, use_transcript_version = TRUE, use_gene_version = TRUE, transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, save_filtered = TRUE, compress_fa = FALSE, overwrite = FALSE )
file |
Path to the FASTA file to be read. The file can remain gzipped. |
out_path |
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory. |
write_tr2g |
Logical, whether to write tr2g to disk. If |
use_gene_name |
Logical, whether to get gene names. |
use_transcript_version |
Logical, whether to include version number in
the Ensembl transcript ID. To decide whether to
include transcript version number, check whether version numbers are included
in the |
use_gene_version |
Logical, whether to include version number in the Ensembl gene ID. Unlike transcript version number, it's up to you whether to include gene version number. |
transcript_biotype_use |
Character, can be "all" or
a vector of transcript biotypes to be used. Transcript biotypes aren't
entirely the same as gene biotypes. For instance, in Ensembl annotation,
|
gene_biotype_use |
Character, can be "all", "cellranger", or
a vector of gene biotypes to be used. If "cellranger", then the biotypes
used by Cell Ranger's reference are used. See |
chrs_only |
Logical, whether to include chromosomes only, for GTF and
GFF files can contain annotations for scaffolds, which are not incorporated
into chromosomes. This will also exclude haplotypes. Defaults to |
save_filtered |
If filtering for biotype and chromosomes, whether to
save the filtered fasta file. If |
compress_fa |
Logical, whether to compress the output fasta file. If
|
overwrite |
Logical, whether to overwrite if files with names of outputs written to disk already exist. |
At present, this function only works with FASTA files from Ensembl, and uses regex to extract vertebrate Ensembl IDs. Sequence names should be formatted as follows:
ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
If your FASTA file sequence names are formatted differently, then you must
extract the transcript and gene IDs by some other means. The Bioconductor
package Biostrings
is recommended; after reading the FASTA file into
R, the sequence names can be accessed by the names
function.
While normally, you should call sort_tr2g
to sort the
transcript IDs from the output of the tr2g_*
family of functions, If
the FASTA file supplied here is the same as the one used to build the
kallisto index, then the transcript IDs in the output of this function are in
the same order as in the kallisto index, so you can skip sort_tr2g
and proceed directly to EC2gene
with the output of this
function.
A data frame with at least 2 columns: gene
for gene ID,
transcript
for transcript ID, and optionally gene_name
for gene
names.
ensembl_gene_biotypes ensembl_tx_biotypes cellranger_biotypes
Other functions to retrieve transcript and gene info:
sort_tr2g()
,
tr2g_EnsDb()
,
tr2g_TxDb()
,
tr2g_ensembl()
,
tr2g_gff3()
,
tr2g_gtf()
,
transcript2gene()
toy_path <- system.file("testdata", package = "BUSpaRse") file_use <- paste(toy_path, "fasta_test.fasta", sep = "/") tr2g <- tr2g_fasta(file = file_use, save_filtered = FALSE, write_tr2g = FALSE)
toy_path <- system.file("testdata", package = "BUSpaRse") file_use <- paste(toy_path, "fasta_test.fasta", sep = "/") tr2g <- tr2g_fasta(file = file_use, save_filtered = FALSE, write_tr2g = FALSE)
This function reads a GFF3 file and extracts the transcript ID and
corresponding gene ID. This function assumes that the GFF3 file is properly
formatted. See http://gmod.org/wiki/GFF3 for a detailed
description of proper GFF3 format. Note that GTF files have a somewhat
different and simpler format in the attribute field, which this function does
not support. See http://mblab.wustl.edu/GTF2.html for a detailed
description of proper GTF format. To extract transcript and gene information
from GTF files, see the function tr2g_gtf
in this package.
Some files bearing the .gff3
are in fact more like the GTF format. If
this is so, then change the extension to .gtf
and use the function
tr2g_gtf
in this package instead.
tr2g_gff3( file, Genome = NULL, get_transcriptome = TRUE, out_path = ".", write_tr2g = TRUE, transcript_id = "transcript_id", gene_id = "gene_id", gene_name = "Name", transcript_version = "version", gene_version = "version", version_sep = ".", transcript_biotype_col = "biotype", gene_biotype_col = "biotype", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, compress_fa = FALSE, save_filtered_gff = TRUE, overwrite = FALSE, source = c("ensembl", "refseq") )
tr2g_gff3( file, Genome = NULL, get_transcriptome = TRUE, out_path = ".", write_tr2g = TRUE, transcript_id = "transcript_id", gene_id = "gene_id", gene_name = "Name", transcript_version = "version", gene_version = "version", version_sep = ".", transcript_biotype_col = "biotype", gene_biotype_col = "biotype", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, compress_fa = FALSE, save_filtered_gff = TRUE, overwrite = FALSE, source = c("ensembl", "refseq") )
file |
Path to a GTF file to be read. The file can remain gzipped. Use
|
Genome |
Either a |
get_transcriptome |
Logical, whether to extract transcriptome from
genome with the GTF file. If filtering biotypes or chromosomes, the filtered
|
out_path |
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory. |
write_tr2g |
Logical, whether to write tr2g to disk. If |
transcript_id |
Character vector of length 1. Tag in |
gene_id |
Character vector of length 1. Tag in |
gene_name |
Character vector of length 1. Tag in |
transcript_version |
Character vector of length 1. Tag in |
gene_version |
Character vector of length 1. Tag in |
version_sep |
Character to separate bewteen the main ID and the version number. Defaults to ".", as in Ensembl. |
transcript_biotype_col |
Character vector of length 1. Tag in
|
gene_biotype_col |
Character vector of length 1. Tag in |
transcript_biotype_use |
Character, can be "all" or
a vector of transcript biotypes to be used. Transcript biotypes aren't
entirely the same as gene biotypes. For instance, in Ensembl annotation,
|
gene_biotype_use |
Character, can be "all", "cellranger", or
a vector of gene biotypes to be used. If "cellranger", then the biotypes
used by Cell Ranger's reference are used. See |
chrs_only |
Logical, whether to include chromosomes only, for GTF and
GFF files can contain annotations for scaffolds, which are not incorporated
into chromosomes. This will also exclude haplotypes. Defaults to |
compress_fa |
Logical, whether to compress the output fasta file. If
|
save_filtered_gff |
Logical. If filtering type, biotypes, and/or
chromosomes, whether to save the filtered |
overwrite |
Logical, whether to overwrite if files with names of outputs written to disk already exist. |
source |
Name of the database where this GFF3 file was downloaded. Must be either "ensembl" or "refseq". |
Transcript and gene versions may not be present in all GTF files, so these
arguments are optional. This function has arguments for transcript and gene
version numbers because Ensembl IDs have version numbers. For Ensembl IDs, we
recommend including the version number, since a change in version number
signals a change in the entity referred to by the ID after reannotation. If a
version is used, then it will be appended to the ID, separated by
version_sep
.
The transcript and gene IDs are The attribute
field (the last
field) of GTF files can be complicated and inconsistent across different
sources. Please check the attribute
tags in your GTF file and consider
the arguments of this function carefully. The defaults are set according to
Ensembl GTF files; defaults may not work for files from other sources. Due to
the general lack of standards for the attribute
field, you may need to
further clean up the output of this function.
A data frame at least 2 columns: gene
for gene ID,
transcript
for transcript ID, and optionally, gene_name
for
gene names.
The defaults here are for Ensembl GFF3 files. To see all attribute
tags for Ensembl and RefSeq GFF3 files, see data("ensembl_gff_mcols")
and
data("refseq_gff_mcols")
.
ensembl_gene_biotypes ensembl_tx_biotypes cellranger_biotypes ensembl_gtf_mcols ensembl_gff_mcols refseq_gff_mcols
Other functions to retrieve transcript and gene info:
sort_tr2g()
,
tr2g_EnsDb()
,
tr2g_TxDb()
,
tr2g_ensembl()
,
tr2g_fasta()
,
tr2g_gtf()
,
transcript2gene()
toy_path <- system.file("testdata", package = "BUSpaRse") file_use <- paste(toy_path, "gff3_test.gff3", sep = "/") # Default tr2g <- tr2g_gff3(file = file_use, write_tr2g = FALSE, get_transcriptome = FALSE, save_filtered_gff = FALSE) # Excluding version numbers tr2g <- tr2g_gff3(file = file_use, transcript_version = NULL, gene_version = NULL, write_tr2g = FALSE, get_transcriptome = FALSE, save_filtered_gff = FALSE)
toy_path <- system.file("testdata", package = "BUSpaRse") file_use <- paste(toy_path, "gff3_test.gff3", sep = "/") # Default tr2g <- tr2g_gff3(file = file_use, write_tr2g = FALSE, get_transcriptome = FALSE, save_filtered_gff = FALSE) # Excluding version numbers tr2g <- tr2g_gff3(file = file_use, transcript_version = NULL, gene_version = NULL, write_tr2g = FALSE, get_transcriptome = FALSE, save_filtered_gff = FALSE)
Internal use, for GRanges from GTF files
tr2g_GRanges( gr, Genome = NULL, get_transcriptome = TRUE, out_path = ".", write_tr2g = TRUE, transcript_id = "transcript_id", gene_id = "gene_id", gene_name = "gene_name", transcript_version = "transcript_version", gene_version = "gene_version", version_sep = ".", transcript_biotype_col = "transcript_biotype", gene_biotype_col = "gene_biotype", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, compress_fa = FALSE, save_filtered_gtf = TRUE, overwrite = FALSE )
tr2g_GRanges( gr, Genome = NULL, get_transcriptome = TRUE, out_path = ".", write_tr2g = TRUE, transcript_id = "transcript_id", gene_id = "gene_id", gene_name = "gene_name", transcript_version = "transcript_version", gene_version = "gene_version", version_sep = ".", transcript_biotype_col = "transcript_biotype", gene_biotype_col = "gene_biotype", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, compress_fa = FALSE, save_filtered_gtf = TRUE, overwrite = FALSE )
gr |
A |
Genome |
Either a |
get_transcriptome |
Logical, whether to extract transcriptome from
genome with the GTF file. If filtering biotypes or chromosomes, the filtered
|
out_path |
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory. |
write_tr2g |
Logical, whether to write tr2g to disk. If |
transcript_id |
Character vector of length 1. Tag in |
gene_id |
Character vector of length 1. Tag in |
gene_name |
Character vector of length 1. Tag in |
transcript_version |
Character vector of length 1. Tag in |
gene_version |
Character vector of length 1. Tag in |
version_sep |
Character to separate bewteen the main ID and the version number. Defaults to ".", as in Ensembl. |
transcript_biotype_col |
Character vector of length 1. Tag in
|
gene_biotype_col |
Character vector of length 1. Tag in |
transcript_biotype_use |
Character, can be "all" or
a vector of transcript biotypes to be used. Transcript biotypes aren't
entirely the same as gene biotypes. For instance, in Ensembl annotation,
|
gene_biotype_use |
Character, can be "all", "cellranger", or
a vector of gene biotypes to be used. If "cellranger", then the biotypes
used by Cell Ranger's reference are used. See |
chrs_only |
Logical, whether to include chromosomes only, for GTF and
GFF files can contain annotations for scaffolds, which are not incorporated
into chromosomes. This will also exclude haplotypes. Defaults to |
compress_fa |
Logical, whether to compress the output fasta file. If
|
save_filtered_gtf |
Logical. If filtering type, biotypes, and/or
chromosomes, whether to save the filtered |
overwrite |
Logical, whether to overwrite if files with names of outputs written to disk already exist. |
A data frame at least 2 columns: gene
for gene ID,
transcript
for transcript ID, and optionally, gene_name
for
gene names.
This function reads a GTF file and extracts the transcript ID and
corresponding gene ID. This function assumes that the GTF file is properly
formatted. See http://mblab.wustl.edu/GTF2.html for a detailed
description of proper GTF format. Note that GFF3 files have a somewhat
different and more complicated format in the attribute field, which this
function does not support. See http://gmod.org/wiki/GFF3 for a detailed
description of proper GFF3 format. To extract transcript and gene information
from GFF3 files, see the function tr2g_gff3
in this package.
tr2g_gtf( file, Genome = NULL, get_transcriptome = TRUE, out_path = ".", write_tr2g = TRUE, transcript_id = "transcript_id", gene_id = "gene_id", gene_name = "gene_name", transcript_version = "transcript_version", gene_version = "gene_version", version_sep = ".", transcript_biotype_col = "transcript_biotype", gene_biotype_col = "gene_biotype", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, compress_fa = FALSE, save_filtered_gtf = TRUE, overwrite = FALSE )
tr2g_gtf( file, Genome = NULL, get_transcriptome = TRUE, out_path = ".", write_tr2g = TRUE, transcript_id = "transcript_id", gene_id = "gene_id", gene_name = "gene_name", transcript_version = "transcript_version", gene_version = "gene_version", version_sep = ".", transcript_biotype_col = "transcript_biotype", gene_biotype_col = "gene_biotype", transcript_biotype_use = "all", gene_biotype_use = "all", chrs_only = TRUE, compress_fa = FALSE, save_filtered_gtf = TRUE, overwrite = FALSE )
file |
Path to a GTF file to be read. The file can remain gzipped. Use
|
Genome |
Either a |
get_transcriptome |
Logical, whether to extract transcriptome from
genome with the GTF file. If filtering biotypes or chromosomes, the filtered
|
out_path |
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory. |
write_tr2g |
Logical, whether to write tr2g to disk. If |
transcript_id |
Character vector of length 1. Tag in |
gene_id |
Character vector of length 1. Tag in |
gene_name |
Character vector of length 1. Tag in |
transcript_version |
Character vector of length 1. Tag in |
gene_version |
Character vector of length 1. Tag in |
version_sep |
Character to separate bewteen the main ID and the version number. Defaults to ".", as in Ensembl. |
transcript_biotype_col |
Character vector of length 1. Tag in
|
gene_biotype_col |
Character vector of length 1. Tag in |
transcript_biotype_use |
Character, can be "all" or
a vector of transcript biotypes to be used. Transcript biotypes aren't
entirely the same as gene biotypes. For instance, in Ensembl annotation,
|
gene_biotype_use |
Character, can be "all", "cellranger", or
a vector of gene biotypes to be used. If "cellranger", then the biotypes
used by Cell Ranger's reference are used. See |
chrs_only |
Logical, whether to include chromosomes only, for GTF and
GFF files can contain annotations for scaffolds, which are not incorporated
into chromosomes. This will also exclude haplotypes. Defaults to |
compress_fa |
Logical, whether to compress the output fasta file. If
|
save_filtered_gtf |
Logical. If filtering type, biotypes, and/or
chromosomes, whether to save the filtered |
overwrite |
Logical, whether to overwrite if files with names of outputs written to disk already exist. |
Transcript and gene versions may not be present in all GTF files, so these
arguments are optional. This function has arguments for transcript and gene
version numbers because Ensembl IDs have version numbers. For Ensembl IDs, we
recommend including the version number, since a change in version number
signals a change in the entity referred to by the ID after reannotation. If a
version is used, then it will be appended to the ID, separated by
version_sep
.
The transcript and gene IDs are The attribute
field (the last
field) of GTF files can be complicated and inconsistent across different
sources. Please check the attribute
tags in your GTF file and consider
the arguments of this function carefully. The defaults are set according to
Ensembl GTF files; defaults may not work for files from other sources. Due to
the general lack of standards for the attribute
field, you may need to
further clean up the output of this function.
A data frame at least 2 columns: gene
for gene ID,
transcript
for transcript ID, and optionally, gene_name
for
gene names.
ensembl_gene_biotypes ensembl_tx_biotypes cellranger_biotypes
Other functions to retrieve transcript and gene info:
sort_tr2g()
,
tr2g_EnsDb()
,
tr2g_TxDb()
,
tr2g_ensembl()
,
tr2g_fasta()
,
tr2g_gff3()
,
transcript2gene()
toy_path <- system.file("testdata", package = "BUSpaRse") file_use <- paste(toy_path, "gtf_test.gtf", sep = "/") # Default tr2g <- tr2g_gtf(file = file_use, get_transcriptome = FALSE, write_tr2g = FALSE, save_filtered_gtf = FALSE) # Excluding version numbers tr2g <- tr2g_gtf(file = file_use, transcript_version = NULL, gene_version = NULL, get_transcriptome = FALSE, write_tr2g = FALSE, save_filtered_gtf = FALSE)
toy_path <- system.file("testdata", package = "BUSpaRse") file_use <- paste(toy_path, "gtf_test.gtf", sep = "/") # Default tr2g <- tr2g_gtf(file = file_use, get_transcriptome = FALSE, write_tr2g = FALSE, save_filtered_gtf = FALSE) # Excluding version numbers tr2g <- tr2g_gtf(file = file_use, transcript_version = NULL, gene_version = NULL, get_transcriptome = FALSE, write_tr2g = FALSE, save_filtered_gtf = FALSE)
tr2g for exon-exon junctions
tr2g_junction(tr2g_cdna, junction_names)
tr2g_junction(tr2g_cdna, junction_names)
tr2g_cdna |
The original |
junction_names |
Names of junctions internally generated. |
A tr2g
data frame where "transcripts" are the exon-exon junctions
and genes are the corresponding genes.
The genome and gene annotations of some species can be conveniently obtained
from Bioconductor packages. This is more convenient than downloading GTF
files from Ensembl and reading it into R. In these packages, the gene
annotation is stored in a TxDb
object, which has standardized
names for gene IDs, transcript IDs, exon IDs, and so on, which are stored in
the metadata fields in GTF and GFF3 files, which are not standardized.
This function extracts transcript and corresponding gene information from
gene annotation stored in a TxDb
object.
tr2g_TxDb( txdb, Genome = NULL, get_transcriptome = TRUE, out_path = ".", write_tr2g = TRUE, chrs_only = TRUE, compress_fa = FALSE, overwrite = FALSE )
tr2g_TxDb( txdb, Genome = NULL, get_transcriptome = TRUE, out_path = ".", write_tr2g = TRUE, chrs_only = TRUE, compress_fa = FALSE, overwrite = FALSE )
txdb |
A |
Genome |
Either a |
get_transcriptome |
Logical, whether to extract transcriptome from
genome with the GTF file. If filtering biotypes or chromosomes, the filtered
|
out_path |
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory. |
write_tr2g |
Logical, whether to write tr2g to disk. If |
chrs_only |
Logical, whether to include chromosomes only, for GTF and
GFF files can contain annotations for scaffolds, which are not incorporated
into chromosomes. This will also exclude haplotypes. Defaults to |
compress_fa |
Logical, whether to compress the output fasta file. If
|
overwrite |
Logical, whether to overwrite if files with names of outputs written to disk already exist. |
A data frame with 3 columns: gene
for gene ID, transcript
for transcript ID, and tx_id
for internal transcript IDs used to avoid
duplicate transcript names. For TxDb packages from Bioconductor, gene ID is
Entrez ID, while transcript IDs are Ensembl IDs with version numbers for
TxDb.Hsapiens.UCSC.hg38.knownGene
. In some cases, the transcript ID
have duplicates, and this is resolved by adding numbers to make the IDs
unique.
A data frame with 3 columns: gene
for gene ID, transcript
for transcript ID, and gene_name
for gene names. If other_attrs
has been specified, then those will also be columns in the data frame returned.
Other functions to retrieve transcript and gene info:
sort_tr2g()
,
tr2g_EnsDb()
,
tr2g_ensembl()
,
tr2g_fasta()
,
tr2g_gff3()
,
tr2g_gtf()
,
transcript2gene()
Other functions to retrieve transcript and gene info:
sort_tr2g()
,
tr2g_EnsDb()
,
tr2g_ensembl()
,
tr2g_fasta()
,
tr2g_gff3()
,
tr2g_gtf()
,
transcript2gene()
library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(BSgenome.Hsapiens.UCSC.hg38) tr2g_TxDb(TxDb.Hsapiens.UCSC.hg38.knownGene, BSgenome.Hsapiens.UCSC.hg38) # Clean up file.remove("transcriptome.fa", "tr2g.tsv")
library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(BSgenome.Hsapiens.UCSC.hg38) tr2g_TxDb(TxDb.Hsapiens.UCSC.hg38.knownGene, BSgenome.Hsapiens.UCSC.hg38) # Clean up file.remove("transcriptome.fa", "tr2g.tsv")
This function is a shortcut to get the correctly sorted data frame with
transcript IDs and the corresponding gene IDs from Ensembl biomart or Ensembl
transcriptome FASTA files. For biomart query, it calls
tr2g_ensembl
and then sort_tr2g
. For FASTA files,
it calls tr2g_fasta
and then sort_tr2g
. Unlike in
tr2g_ensembl
and tr2g_fasta
, multiple species can
be supplied if cells from different species were sequenced together. This
function should only be used if the kallisto inidex was built with
transcriptomes from Ensembl. Also, if querying biomart, please make sure to set
ensembl_version
to match the version where the transcriptomes were
downloaded.
transcript2gene( species, fasta_file, kallisto_out_path, type = "vertebrate", ... )
transcript2gene( species, fasta_file, kallisto_out_path, type = "vertebrate", ... )
species |
A character vector of Latin names of species present in this scRNA-seq dataset. This is used to retrieve Ensembl information from biomart. |
fasta_file |
Character vector of paths to the transcriptome FASTA files
used to build the kallisto index. Exactly one of |
kallisto_out_path |
Path to the |
type |
A character vector indicating the type of each species. Each
element must be one of "vertebrate", "metazoa", "plant", "fungus", and
"protist". If length is 1, then this type will be used for all species specified
here. Can be missing if |
... |
Other arguments passed to |
A data frame with two columns: gene
and transcript
,
with Ensembl gene and transcript IDs (with version number), in the same order
as in the transcriptome index used in kallisto
.
This function has been superseded by the new version of tr2g_* functions that can extract transcriptome for only the biotypes specified and with only the standard chromosomes. The new version of tr2g_* functions also sorts the transcriptome so the tr2g and the transcriptome have transcripts in the same order.
Other functions to retrieve transcript and gene info:
sort_tr2g()
,
tr2g_EnsDb()
,
tr2g_TxDb()
,
tr2g_ensembl()
,
tr2g_fasta()
,
tr2g_gff3()
,
tr2g_gtf()
# Download dataset already in BUS format library(TENxBUSData) TENxBUSData(".", dataset = "hgmm100") tr2g <- transcript2gene(c("Homo sapiens", "Mus musculus"), type = "vertebrate", save_filtered = FALSE, ensembl_version = 99, kallisto_out_path = "./out_hgmm100") # Clean up files from the example unlink("out_hgmm100")
# Download dataset already in BUS format library(TENxBUSData) TENxBUSData(".", dataset = "hgmm100") tr2g <- transcript2gene(c("Homo sapiens", "Mus musculus"), type = "vertebrate", save_filtered = FALSE, ensembl_version = 99, kallisto_out_path = "./out_hgmm100") # Clean up files from the example unlink("out_hgmm100")
Validate input to get_velocity_files
validate_velocity_input( L, Genome, Transcriptome, out_path, compress_fa, width, exon_option )
validate_velocity_input( L, Genome, Transcriptome, out_path, compress_fa, width, exon_option )
L |
Length of the biological read. For instance, 10xv1: 98 nt,
10xv2: 98 nt, 10xv3: 91 nt, Drop-seq: 50 nt. If in doubt check read length
in a fastq file for biological reads with the |
Genome |
Either a |
Transcriptome |
A |
out_path |
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory. |
compress_fa |
Logical, whether to compress the output fasta file. If
|
width |
Maximum number of letters per line of sequence in the output fasta file. Must be an integer. |
exon_option |
Character, indicating how exonic sequences should be included in the kallisto index. Must be one of the following:
|
Will throw error if validation fails. Returns a named list whose first element is the normalized path to output directory, and whose second element is the normalized path to the transcriptome file if specified.
Write the files for RNA velocity to disk, in the specified output directory.
write_velocity_output( out_path, introns, Genome, Transcriptome, isoform_action, exon_option, tr2g_cdna, compress_fa, width )
write_velocity_output( out_path, introns, Genome, Transcriptome, isoform_action, exon_option, tr2g_cdna, compress_fa, width )
out_path |
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. |
introns |
Intronic ranges plus flanking region, returned by
|
Genome |
Either a |
Transcriptome |
A |
isoform_action |
Character, indicating action to take with different transcripts of the same gene. Must be one of the following:
|
exon_option |
Character, indicating how exonic sequences should be included in the kallisto index. Must be one of the following:
|
tr2g_cdna |
A data frame with columns |
compress_fa |
Logical, whether to compress the output fasta file. If
|
width |
Maximum number of letters per line of sequence in the output fasta file. Must be an integer. |
Nothing into the R session. The files are written to disk.