Title: | Context-Aware Transcript Quantification from Long Read RNA-Seq data |
---|---|
Description: | bambu is a R package for multi-sample transcript discovery and quantification using long read RNA-Seq data. You can use bambu after read alignment to obtain expression estimates for known and novel transcripts and genes. The output from bambu can directly be used for visualisation and downstream analysis such as differential gene expression or transcript usage. |
Authors: | Ying Chen [cre, aut], Andre Sim [aut], Yuk Kei Wan [aut], Jonathan Goeke [aut] |
Maintainer: | Ying Chen <[email protected]> |
License: | GPL-3 + file LICENSE |
Version: | 3.9.0 |
Built: | 2024-11-18 03:22:55 UTC |
Source: | https://github.com/bioc/bambu |
This function takes bam file of genomic alignments and performs isoform recontruction and gene and transcript expression quantification. It also allows saving of read class files of alignments, extending provided annotations, and quantification based on extended annotations. When multiple samples are provided, extended annotations will be combined across samples to allow comparison.
bambu( reads, annotations = NULL, genome = NULL, NDR = NULL, opt.discovery = NULL, opt.em = NULL, rcOutDir = NULL, discovery = TRUE, quant = TRUE, stranded = FALSE, ncore = 1, yieldSize = NULL, trackReads = FALSE, returnDistTable = FALSE, lowMemory = FALSE, fusionMode = FALSE, verbose = FALSE )
bambu( reads, annotations = NULL, genome = NULL, NDR = NULL, opt.discovery = NULL, opt.em = NULL, rcOutDir = NULL, discovery = TRUE, quant = TRUE, stranded = FALSE, ncore = 1, yieldSize = NULL, trackReads = FALSE, returnDistTable = FALSE, lowMemory = FALSE, fusionMode = FALSE, verbose = FALSE )
reads |
A string or a vector of strings specifying the paths of bam
files for genomic alignments, or a |
annotations |
A path to a .gtf file or a |
genome |
A path to a fasta file or a BSGenome object. |
NDR |
specifying the maximum NDR rate to novel transcript output from detected read classes, defaults to an automatic recommendation |
opt.discovery |
A list of controlling parameters for isoform reconstruction process:
|
opt.em |
A list of controlling parameters for quantification algorithm estimation process:
|
rcOutDir |
A string variable specifying the path to where read class files will be saved. |
discovery |
A logical variable indicating whether annotations are to be extended. Defaults to TRUE |
quant |
A logical variable indicating whether quantification will be performed. If false the output type will change. Defaults to TRUE |
stranded |
A boolean for strandedness, defaults to FALSE. |
ncore |
specifying number of cores used when parallel processing is used, defaults to 1. |
yieldSize |
see |
trackReads |
When TRUE read names will be tracked and output as metadata in the final output as readToTranscriptMaps detailing. the assignment of reads to transcripts. The output is a list with an entry for each sample. |
returnDistTable |
When TRUE the calculated distance table between read classes and annotations will be output as metadata as distTables. The output is a list with an entry for each sample. |
lowMemory |
Read classes will be processed by chromosomes when lowMemory is specified. This option provides an efficient way to process big samples. |
fusionMode |
A logical variable indicating whether run in fusion mode |
verbose |
A logical variable indicating whether processing messages will be printed. |
Main function
bambu
will output different results depending on whether
quant mode is on. By default, quant is set to TRUE, so
bambu
will generate a SummarizedExperiment object that contains
the transcript expression estimates. Transcript expression estimates can be
accessed by counts(), including the following variables
expression estimates
sequencing depth normalized estimates
estimates of read counts mapped as full length reads for each transcript
counts of reads that are uniquely mapped to each transcript
Output annotations that are usually the annotations with/without novel transcripts/genes added, depending on whether discovery mode is on can be accessed by rowRanges() Transcript to gene map can be accessed by rowData(), with eqClass that defining equivalent class for each transcript
In the case when quant is set to FALSE, i.e., only transcript
discovery is performed, bambu
will report the grangeslist of
the extended annotations.
## ===================== test.bam <- system.file("extdata", "SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.bam", package = "bambu") fa.file <- system.file("extdata", "Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fa", package = "bambu") gr <- readRDS(system.file("extdata", "annotationGranges_txdbGrch38_91_chr9_1_1000000.rds", package = "bambu")) se <- bambu(reads = test.bam, annotations = gr, genome = fa.file, discovery = TRUE, quant = TRUE)
## ===================== test.bam <- system.file("extdata", "SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.bam", package = "bambu") fa.file <- system.file("extdata", "Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fa", package = "bambu") gr <- readRDS(system.file("extdata", "annotationGranges_txdbGrch38_91_chr9_1_1000000.rds", package = "bambu")) se <- bambu(reads = test.bam, annotations = gr, genome = fa.file, discovery = TRUE, quant = TRUE)
compare alternatively-spliced transcripts
compareTranscripts(query, subject)
compareTranscripts(query, subject)
query |
a GRangesList of transcripts |
subject |
a GRangesList of transcripts |
This function compares two alternatively-spliced transcripts and returns
a tibble
object that determines the type of the alternative splicing between
the query and the subject transcript. Alternative splicing includes exons skipping,
intron retention, alternative 5' exon start site, alternative 3' exon end site, alternative
first exon, alternative last exon, alternative transcription start site (TSS),
alternative transcription end site (TES), internal first exon, internal last exon, or a
mixed combination of them.
a tibble
object with the following columns:
Remark: two exons, one each from query and subject transcript are defined to be equivalent if one of the exons fully covers the another exon.
alternativeFirstExon: FALSE if query transcript and subject transcript have equivalent first exon. TRUE otherwise.
alternativeTSS: +k if the query initiates the transcription k sites earlier than the subject transcript, -k if the query initiates the transcription k sites later than the subject transcript.
internalFirstExon.query: TRUE if the first exon of the query transcript is equivalent to one of the exon of the subject transcript (except the first exon). FALSE otherwise.
internalFirstExon.subject: TRUE if the first exon of the subject transcript is equivalent to one of the exon of the query transcript (except the first exon). FALSE otherwise.
alternativeLastExon: FALSE if query transcript and subject transcript have equivalent last exon. TRUE otherwise.
alternativeTES: +k if the query transcript ends the transcription k sites later than the subject transcript, -k if the query transcript ends the transcription k sites earlier than the subject transcript.
internalLastExon.query: TRUE if the last exon of the query transcript is equivalent to one of the exon in the subject transcript (except the last exon). FALSE otherwise.
internalLastExon.subject: TRUE if the last exon of the subject transcript is equivalent to one of the exon in the query transcript (except the last exon). FALSE otherwise.
intronRetention.subject: k if there is/are k intron(s) in the subject transcript relative to the query transcript.
intronRetention.query: k if there is/are k intron(s) in the query transcript relative to the subject transcript.
exonSkipping.query: k if there is/are k exon(s) in the subject transcript (except the first and last) not in the query transcript.
exonSkipping.subject: k if there is/are k exon(s) in the query transcript (except the first and last) not in the subject transcript.
exon5prime (splicing): k if there is/are k equivalent exon(s) having different 5' start site (except the 5' start site of first exon).
exon3prime (splicing): k if there is/are k equivalent exon(s) having different 3' end site (except the 3' end site of the last exon).
Value
for more details about each of the alternative splicing events.
query <- readRDS(system.file("extdata", "annotateSpliceOverlapByDist_testQuery.rds", package = "bambu")) subject <- readRDS(system.file("extdata", "annotateSpliceOverlapByDist_testSubject.rds", package = "bambu")) compareTranscriptsTable <- compareTranscripts(query, subject) compareTranscriptsTable
query <- readRDS(system.file("extdata", "annotateSpliceOverlapByDist_testQuery.rds", package = "bambu")) subject <- readRDS(system.file("extdata", "annotateSpliceOverlapByDist_testSubject.rds", package = "bambu")) compareTranscriptsTable <- compareTranscripts(query, subject) compareTranscriptsTable
plotSEOuptut
plotBambu( se, group.variable = NULL, type = c("annotation", "pca", "heatmap"), gene_id = NULL, transcript_id = NULL )
plotBambu( se, group.variable = NULL, type = c("annotation", "pca", "heatmap"), gene_id = NULL, transcript_id = NULL )
se |
An summarized experiment object obtained from |
group.variable |
Variable for grouping in plot, has be to provided if choosing to plot PCA. |
type |
plot type variable, a values of annotation for a single gene with heatmap for isoform expressions, pca, or heatmap, see details. |
gene_id |
specifying the gene_id for plotting gene annotation, either gene_id or transcript_id has to be provided when type = "annotation". |
transcript_id |
specifying the transcript_id for plotting transcript annotation, either gene_id or transcript_id has to be provided when type = "annotation" |
type
indicates the type of plots to be plotted. There
are two types of plots can be chosen, PCA or heatmap.
A heatmap plot for all samples
se <- readRDS(system.file("extdata", "seOutputCombined_SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.rds", package = "bambu")) plotBambu(se, type = "PCA")
se <- readRDS(system.file("extdata", "seOutputCombined_SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.rds", package = "bambu")) plotBambu(se, type = "PCA")
prepare reference annotations for long read RNA-Seq analysis with Bambu
prepareAnnotations(x)
prepareAnnotations(x)
x |
A path to gtf file or a |
This function creates a reference annotation object which is used for transcript discovery and quantification in Bambu. prepareAnnotations can use a path to a gtf file or a TxDB object as input, and returns a annotation object that stores additional information about transcripts which is used in Bambu. For each transcript, exons are ranked from first to last exon in direction of transcription.
A GRangesList
object with additional details for each exon and transcript
that are required by Bambu. Exons are ranked by the exon_rank
column, corresponding
to the rank in direction of transcription (from first to last exon). In addition to exon rank,
gene id, transcript id, and the minimum transcript equivalent class is stored as well
(a transcript equivalence class of a transcript x is the collection of transcripts
where their exon junctions contain, in a continuous way, the exon junctions of the transcript x).
The object is designed to be used by Bambu, and the direct access of the metadata is not recommended.
[bambu()] for transcript discovery and quantification from long read RNA-Seq.
gtf.file <- system.file("extdata", "Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf", package = "bambu" ) annotations <- prepareAnnotations(x = gtf.file) # run bambu test.bam <- system.file("extdata", "SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.bam", package = "bambu") fa.file <- system.file("extdata", "Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fa", package = "bambu") se <- bambu(reads = test.bam, annotations = annotations, genome = fa.file, discovery = TRUE, quant = TRUE)
gtf.file <- system.file("extdata", "Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf", package = "bambu" ) annotations <- prepareAnnotations(x = gtf.file) # run bambu test.bam <- system.file("extdata", "SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.bam", package = "bambu") fa.file <- system.file("extdata", "Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fa", package = "bambu") se <- bambu(reads = test.bam, annotations = annotations, genome = fa.file, discovery = TRUE, quant = TRUE)
Outputs GRangesList object from reading a GTF file
readFromGTF(file, keep.extra.columns = NULL)
readFromGTF(file, keep.extra.columns = NULL)
file |
a .gtf file |
keep.extra.columns |
a vector with names of columns to keep from the the attributes in the gtf file. For ensembl, this could be keep.extra.columns=c('gene_name','gene_biotype', 'transcript_biotype', 'transcript_name') |
grlist a GRangesList
object, with two columns
specifying prefix for new gene Ids (genePrefix.number), defaults to empty
indicating whether filter to remove read classes which are a subset of known transcripts(), defaults to TRUE
gtf.file <- system.file("extdata", "Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf", package = "bambu" ) readFromGTF(gtf.file)
gtf.file <- system.file("extdata", "Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf", package = "bambu" ) readFromGTF(gtf.file)
This function train a model for use on other data
trainBambu( rcFile = NULL, min.readCount = 2, nrounds = 50, NDR.threshold = 0.1, verbose = TRUE )
trainBambu( rcFile = NULL, min.readCount = 2, nrounds = 50, NDR.threshold = 0.1, verbose = TRUE )
rcFile |
summerized experiment object with read classes/ranges produced from bambu(quant = FALSE, discovery = FALSE) or rcOutdir |
min.readCount |
the minimum number of reads a read class is required to be have to be used for training |
nrounds |
xgboost hyperparameter - the number of decision trees in the final mode |
NDR.threshold |
the effective NDR threshold that bambu will try and match on other samples when using this model |
verbose |
if additional messages should be output Output - A list containing 6 objects which is passed directly into bambu(opt.discovery=list(defaultModels=trainBambu())) transcriptModelME - the model for multi-exon transcripts transcriptModelSE - the model for single-exon transcripts txScoreBaseline - the txScore used for NDR calibration for multi-exon transcripts txScoreBaselineSE - [DEPRECATED] the txScore used for NDR calibration for single-exon transcripts lmNDR = lmNDR - the linear model of the reletionship between txScore and NDR used to calculate the baseline for multi-exon transcripts lmNDR.SE = lmNDR.SE - the linear model of the reletionship between txScore and NDR used to calculate the baseline for single-exon transcripts NDR.threshold - the NDR threshold usd to calculate the txScoreBaseline on the lmNDR (baselineFDR) |
Function to train a model for use on other data
It returns a model object to use in bambu
Reduce transcript expression to gene expression
transcriptToGeneExpression(se)
transcriptToGeneExpression(se)
se |
a summarizedExperiment object from |
A SummarizedExperiment object
se <- readRDS(system.file("extdata", "seOutput_SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.rds", package = "bambu" )) transcriptToGeneExpression(se)
se <- readRDS(system.file("extdata", "seOutput_SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.rds", package = "bambu" )) transcriptToGeneExpression(se)
Write Bambu results to GTF and transcript/gene-count files
writeBambuOutput(se, path, prefix = "")
writeBambuOutput(se, path, prefix = "")
se |
a |
path |
the destination of the output files (gtf, transcript counts, and gene counts) |
prefix |
the prefix of the output files |
The function will write the output from Bambu to files. The annotations will be written to a .gtf file, transcript counts (total counts, CPM, full-length counts, partial-length counts, and unique counts) and gene counts will be written to .txt files.
se <- readRDS(system.file("extdata", "seOutput_SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.rds", package = "bambu" )) path <- tempdir() writeBambuOutput(se, path)
se <- readRDS(system.file("extdata", "seOutput_SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.rds", package = "bambu" )) path <- tempdir() writeBambuOutput(se, path)
Write annotation GRangesList into a GTF file
writeToGTF(annotation, file, geneIDs = NULL)
writeToGTF(annotation, file, geneIDs = NULL)
annotation |
a |
file |
the output gtf file name |
geneIDs |
an optional dataframe of geneIDs (column 2) with the corresponding transcriptIDs (column 1) |
gtf a GTF dataframe
outputGtfFile <- tempfile() gr <- readRDS(system.file("extdata", "annotationGranges_txdbGrch38_91_chr9_1_1000000.rds", package = "bambu" )) writeToGTF(gr, outputGtfFile)
outputGtfFile <- tempfile() gr <- readRDS(system.file("extdata", "annotationGranges_txdbGrch38_91_chr9_1_1000000.rds", package = "bambu" )) writeToGTF(gr, outputGtfFile)